Binary header values are printed in their base64 encoded form.
The GrpcHttpOutboundHeaders, as mentioned in the issue, don't seem to be affected by this regression. The toString() method seems fine.
Highlights
==========
StatsTraceContext
-----------------
The bridge between gRPC library and Census. It keeps track of the total
payload sizes and the elapsed time of a Call. The rest of the gRPC code
doesn't invoke Census directly.
Context propagation
-------------------
StatsTraceContext carries CensusContext (and the upcoming TraceContext)
and is attached to the gRPC Context.
1. StatsTraceContext is created by ManagedChannelImpl, by calling
createClientContext(), which inherits the current CensusContext if available.
2. ManagedChannelImpl passes StatsTraceContext to ClientCallImpl, then
to the stream, then to the framer and deframer explicitly.
3. ClientCallImpl propagates the CensusContext to the headers.
1. ServerImpl creates a StatsTraceContext by implementing a new callback
method StatsTraceContext methodDetermined(MethodDescriptor, Metadata) on
ServerTransportListener.
2. NettyServerHandler calls methodDetermined() before creating the
stream, and passes the StatsTraceContext to the stream.
3. When ServerImpl creates the gRPC Context for the new ServerCall, it
calls the new method statsTraceContext() on ServerStream and puts the
StatsTraceContext in the Context.
Metrics recording
-----------------
1. Client-side start time: when ClientCallImpl is created
2. Server-side start time: when methodDetermined() is called
3. Server-side end time: in ServerStreamListener.closed(), but before
calling onComplete() or onCancel() on ServerCall.Listener.
4. Client-side end time: in ClientStreamListener.closed(), but before
calling onClonse() on ClientCall.Listener
Message sizes are recorded in MessageFramer and MessageDeframer. Both
the uncompressed and wire (possibly compressed) payload sizes are
counted.
TODOs
=====
The CensusContext created from headers on the server side should be
attached to the gRPC Context for the call. It's not done at this moment
because Census lacks the proper API to do it. It only affects tracing
and resource accounting, but doesn't affect stats functionality
Our API allows pings to be send even after the transport has been shutdown. We currently
don't handle the case, where the Netty channel has been closed but the NettyClientHandler
has not yet been removed from the pipeline, correctly. That is, we need to query the shutdown
status whenever we receive a ClosedChannelException.
Also, some cleanup.
The DefaultHttp2Headers class is a general-purpose Http2Headers implementation
and provides much more functionality than we need in gRPC. In gRPC, when reading
headers off the wire, we only inspect a handful of them, before converting to
Metadata.
This commit introduces a Http2Headers implementation that aims for insertion
efficiency, a low memory footprint and fast conversion to Metadata.
- Header names and values are stored in plain byte[].
- Insertion is O(1), while lookup is now O(n).
- Binary header values are base64 decoded as they are inserted.
- The byte[][] returned by namesAndValues() can directly be used to construct
a new Metadata object.
- For HTTP/2 request headers, the pseudo headers are no longer carried over to
Metadata.
A microbenchmark aiming to replicate the usage of Http2Headers in NettyClientHandler
and NettyServerHandler shows decent throughput gains when compared to DefaultHttp2Headers.
Benchmark Mode Cnt Score Error Units
InboundHeadersBenchmark.defaultHeaders_clientHandler avgt 10 283.830 ± 4.063 ns/op
InboundHeadersBenchmark.defaultHeaders_serverHandler avgt 10 1179.975 ± 21.810 ns/op
InboundHeadersBenchmark.grpcHeaders_clientHandler avgt 10 190.108 ± 3.510 ns/op
InboundHeadersBenchmark.grpcHeaders_serverHandler avgt 10 561.426 ± 9.079 ns/op
Additionally, the memory footprint is reduced by more than 50%!
gRPC Request Headers: 864 bytes
Netty Request Headers: 1728 bytes
gRPC Response Headers: 216 bytes
Netty Response Headers: 528 bytes
Furthermore, this change does most of the gRPC groundwork necessary to be able
to cache higher ordered objects in HPACK's dynamic table, as discussed in [1].
[1] https://github.com/grpc/grpc-java/issues/2217
Metadata.removeAll creates an iterator for looking through removed
values even if the call doens't use it. This change adds a similar
method which doesn't create garbage.
This change makes it easier in the future to alter the internals
of Metadata where it may be expensive to return removed values.
Called whenever a ServerTransport is ready and terminated. Has the
ability to modify transport attributes, which ServerCall.attributes()
are based on.
Related changes:
- Attribute keys for remote address and SSL session are now moved from
ServerCall to a neutral place io.grpc.Grpc, because they can also be
used from ServerTransportFilter, and probably will be used on the
client-side too. The old keys on ServerCall is marked deprecated and
are equivalent to the new keys.
- Added transportReady() to ServerTransportListener.
Resolves#2132
After debugging #2153, it would have been nice to know what the exact
parameter was that was null. This change adds a name for each
checkNotNull (and tries to normalized on static imports in order to
shorten lines)
Implementations of ManagedClientTransport.start() are restricted from
calling the passed listener until start() returns, in order to avoid
reentrency problems with locks. For most transports this isn't a
problem, because they need additional threads anyway. InProcess uses no
additional threads naturally so ends up needing a thread just to
notifyReady. Now transports can just return a Runnable that can be run
after locks are dropped.
This was originally intended to be a performance optimization, but the
thread also causes nondeterminism because RPCs are delayed until
notifyReady is called. So avoiding the thread reduces needless fakes
during tests.
WriteQueue uses LinkedBlockingQueue, which has stronger synchronization
semantics than we need. It also requires that we batch reads from it
in order to get reasonable performance. After profiling the delay
between writing to LBQ and reading from it, there was a ~10us delay.
This change switches to using ConcurrentLinkedQueue as the underlying
queue, and removes the batching (reads). Using CLQ with batching is
slightly slower.
Benchmarks show favorable numbers for both latency and throughput.
Each of the following results were run serveral times:
Before:
Benchmark (direct) (transport) Mode Cnt Score Error Units
TransportBenchmark.unaryCall1024 true NETTY sample 321575 124185.027 ± 406.112 ns/op
TransportBenchmark.unaryCall1024 false NETTY sample 237400 168232.991 ± 548.043 ns/op
After:
Benchmark (direct) (transport) Mode Cnt Score Error Units
TransportBenchmark.unaryCall1024 true NETTY sample 354773 112552.339 ± 362.471 ns/op
TransportBenchmark.unaryCall1024 false NETTY sample 263297 151660.490 ± 507.463 ns/op
Qps with 10 outstanding RPCs per channel:
Before:
Channels: 4
Outstanding RPCs per Channel: 10
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 396
90%ile Latency (in micros): 680
95%ile Latency (in micros): 838
99%ile Latency (in micros): 1476
99.9%ile Latency (in micros): 5231
Maximum Latency (in micros): 43327
QPS: 85761
After:
Channels: 4
Outstanding RPCs per Channel: 10
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 384
90%ile Latency (in micros): 612
95%ile Latency (in micros): 725
99%ile Latency (in micros): 1080
99.9%ile Latency (in micros): 3107
Maximum Latency (in micros): 30447
QPS: 93353
The results are even better when under heavy load. Qps with 100
outstanding RPCs per channel:
Before:
Channels: 4
Outstanding RPCs per Channel: 100
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 2735
90%ile Latency (in micros): 5051
95%ile Latency (in micros): 6219
99%ile Latency (in micros): 9271
99.9%ile Latency (in micros): 13759
Maximum Latency (in micros): 44831
QPS: 125775
After:
Channels: 4
Outstanding RPCs per Channel: 100
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 2697
90%ile Latency (in micros): 4639
95%ile Latency (in micros): 5539
99%ile Latency (in micros): 7931
99.9%ile Latency (in micros): 12335
Maximum Latency (in micros): 61823
QPS: 131904
An AsciiString object may only use a subsection of its backing byte array. We need to test for this and return a copy of the subsection if necessary.
Big thanks to @normanmaurer for uncovering this issue: https://github.com/netty/netty/issues/5472
Resolves#1756
The thread-unsafe method `io.grpc.testing.TestUtils.pickUnusedPort` causes flakes (#1756) in windows. Need to avoid use of this method in test as in windows the tests are running in different jvms and concurrent calls of this method in multiple processes tend to return the same port number.
There are some usages of this method in benchmarks, so moved the method to `io.grpc.benchmarks.Utils` and the method will only be used in benchmarks and not in test.
A transport is "in use" iff number of streams > 0. In following changes
the channel will use this information when deciding whether it should
transit to the IDLE mode (#1276).
Introduce CallCredentials as a first-class option to allow applications
to set per-call credentials into headers for outgoing RPCs. This will
supersede ClientAuthInterceptor. It has access to more
information (e.g., transport attributes, MethodDescriptor) and allow
results to be returned asynchronously, e.g., from a blocking I/O, which
was problemantic with ClientAuthInterceptor.
adding
ClientStream newStream(MethodDescriptor<?, ?> method, Metadata headers, CallOptions callOptions);
to ClientTransport interface
Created this PR first because both fail fast implementation and another change will be using this interface change