Found this bug in some unit test in which client-streaming call is hanging because `ServerCallImpl#messageRead()` did not handle RuntimeException properly
Highlights
==========
StatsTraceContext
-----------------
The bridge between gRPC library and Census. It keeps track of the total
payload sizes and the elapsed time of a Call. The rest of the gRPC code
doesn't invoke Census directly.
Context propagation
-------------------
StatsTraceContext carries CensusContext (and the upcoming TraceContext)
and is attached to the gRPC Context.
1. StatsTraceContext is created by ManagedChannelImpl, by calling
createClientContext(), which inherits the current CensusContext if available.
2. ManagedChannelImpl passes StatsTraceContext to ClientCallImpl, then
to the stream, then to the framer and deframer explicitly.
3. ClientCallImpl propagates the CensusContext to the headers.
1. ServerImpl creates a StatsTraceContext by implementing a new callback
method StatsTraceContext methodDetermined(MethodDescriptor, Metadata) on
ServerTransportListener.
2. NettyServerHandler calls methodDetermined() before creating the
stream, and passes the StatsTraceContext to the stream.
3. When ServerImpl creates the gRPC Context for the new ServerCall, it
calls the new method statsTraceContext() on ServerStream and puts the
StatsTraceContext in the Context.
Metrics recording
-----------------
1. Client-side start time: when ClientCallImpl is created
2. Server-side start time: when methodDetermined() is called
3. Server-side end time: in ServerStreamListener.closed(), but before
calling onComplete() or onCancel() on ServerCall.Listener.
4. Client-side end time: in ClientStreamListener.closed(), but before
calling onClonse() on ClientCall.Listener
Message sizes are recorded in MessageFramer and MessageDeframer. Both
the uncompressed and wire (possibly compressed) payload sizes are
counted.
TODOs
=====
The CensusContext created from headers on the server side should be
attached to the gRPC Context for the call. It's not done at this moment
because Census lacks the proper API to do it. It only affects tracing
and resource accounting, but doesn't affect stats functionality
Channel state API doesn't allow a TRANSIENT_FAILURE->IDLE edge.
Change TransportSet to always transition to CONNECTING after
TRANSIENT_FAILURE.
This behavior, combined with that it never uses IDLE_TIMEOUT to
transition from READY to IDLE, effectivly makes TransportSet
Channel-state API-compliant under an infinite IDLE_TIMEOUT.
Also set the default IDLE_TIMEOUT to 30min.
Trying to fix issue #2188
- Try to keep avoiding the lock issue #2152 and also to avoid race condition #2188.
- Add `checkState` for `endBackoff()`. Could help hit and identify any potential issue related to #2188.
- Make sure `startBackoff()` and `endBackoff()` invoked in the right order.
- Not to schedule endBackoff if transportSet has been shutdown.
US_ASCII may have not been initialized when the Code enums are created,
causing an NPE and making Status class fail to load:
java.lang.ExceptionInInitializerError
at io.grpc.Status$Code.<init>(Status.java:222)
at io.grpc.Status$Code.<clinit>(Status.java:79)
at io.grpc.internal.GrpcUtilTest.http2ErrorStatus(GrpcUtilTest.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:49)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:230)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:273)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:240)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:65)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:238)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:55)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:231)
at org.junit.runners.ParentRunner.run(ParentRunner.java:316)
at com.google.testing.junit.runner.junit4.CancellableRequestFactory$CancellableRunner.run(CancellableRequestFactory.java:90)
at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
at org.junit.runner.JUnitCore.run(JUnitCore.java:138)
at com.google.testing.junit.runner.junit4.JUnit4Runner.run(JUnit4Runner.java:112)
at com.google.testing.junit.runner.GoogleTestRunner.runTestsInSuite(GoogleTestRunner.java:197)
at com.google.testing.junit.runner.GoogleTestRunner.runTestsInSuite(GoogleTestRunner.java:174)
at com.google.testing.junit.runner.GoogleTestRunner.main(GoogleTestRunner.java:133)
Caused by: java.lang.NullPointerException
at io.grpc.Status$Code.values(Status.java:75)
at io.grpc.Status.buildStatusList(Status.java:249)
at io.grpc.Status.<clinit>(Status.java:245)
Strangely this only fails GrpcUtilTest.http2ErrorStatus inside google.
Now switch to use Guava Charsets.
Commit 656e8ce (#2208) removed most usages, but missed one and one was
re-added. Since all usages are removed, it should be much easier to
notice regressions.
This change exposed a pre-existing bug where shutdownNow wasn't called
for decommissionedTransports. The bug is fixed and a test added in this
commit.
Fixes#2120
Our API allows pings to be send even after the transport has been shutdown. We currently
don't handle the case, where the Netty channel has been closed but the NettyClientHandler
has not yet been removed from the pipeline, correctly. That is, we need to query the shutdown
status whenever we receive a ClosedChannelException.
Also, some cleanup.
The DefaultHttp2Headers class is a general-purpose Http2Headers implementation
and provides much more functionality than we need in gRPC. In gRPC, when reading
headers off the wire, we only inspect a handful of them, before converting to
Metadata.
This commit introduces a Http2Headers implementation that aims for insertion
efficiency, a low memory footprint and fast conversion to Metadata.
- Header names and values are stored in plain byte[].
- Insertion is O(1), while lookup is now O(n).
- Binary header values are base64 decoded as they are inserted.
- The byte[][] returned by namesAndValues() can directly be used to construct
a new Metadata object.
- For HTTP/2 request headers, the pseudo headers are no longer carried over to
Metadata.
A microbenchmark aiming to replicate the usage of Http2Headers in NettyClientHandler
and NettyServerHandler shows decent throughput gains when compared to DefaultHttp2Headers.
Benchmark Mode Cnt Score Error Units
InboundHeadersBenchmark.defaultHeaders_clientHandler avgt 10 283.830 ± 4.063 ns/op
InboundHeadersBenchmark.defaultHeaders_serverHandler avgt 10 1179.975 ± 21.810 ns/op
InboundHeadersBenchmark.grpcHeaders_clientHandler avgt 10 190.108 ± 3.510 ns/op
InboundHeadersBenchmark.grpcHeaders_serverHandler avgt 10 561.426 ± 9.079 ns/op
Additionally, the memory footprint is reduced by more than 50%!
gRPC Request Headers: 864 bytes
Netty Request Headers: 1728 bytes
gRPC Response Headers: 216 bytes
Netty Response Headers: 528 bytes
Furthermore, this change does most of the gRPC groundwork necessary to be able
to cache higher ordered objects in HPACK's dynamic table, as discussed in [1].
[1] https://github.com/grpc/grpc-java/issues/2217
The cast required in protobuf makes me question how much I like
ReflectableMarshaller, but it seems to be pretty sound and the cast is
more an artifact of generics than the API.
Nano and Thrift were purposefully not updated, since getting just the
class requires making a new message instance. That seems a bit lame. It
probably is no burden to create an instance to get the class, and it may
not be too hard to improve the factory to provide class information, but
didn't want to bother at this point. Especially since nano users are
unlikely to need the introspection functionality.
Metadata.removeAll creates an iterator for looking through removed
values even if the call doens't use it. This change adds a similar
method which doesn't create garbage.
This change makes it easier in the future to alter the internals
of Metadata where it may be expensive to return removed values.
The Context API is not particularly gRPC-specific, and will be used by
Census as its context propagation mechanism.
Removed all dependencies to make it easy for other libraries to depend
on.
Called whenever a ServerTransport is ready and terminated. Has the
ability to modify transport attributes, which ServerCall.attributes()
are based on.
Related changes:
- Attribute keys for remote address and SSL session are now moved from
ServerCall to a neutral place io.grpc.Grpc, because they can also be
used from ServerTransportFilter, and probably will be used on the
client-side too. The old keys on ServerCall is marked deprecated and
are equivalent to the new keys.
- Added transportReady() to ServerTransportListener.
Resolves#2132
This reduces the number of methods gRPC brings in by ~450, which is
substantial. Each application will see different numbers though,
depending on their usage and their other dependencies.
A very rough (under) counting for number of methods included because of
gRPC in android-interop-test is 2746, and that is reduced to 2313 (-433)
by this change. That count includes grpc, guava, okhttp, okio, and nano.
The actual reduction of methods is 447, with the discrepency due to
reduction of methods in java.util and java.lang. Of the 433 removed
methods, 377 are from com.google.common.collect and 61 from
com.google.common.base. The removal costed an increase of 5 methods
(total 1671) within io.grpc itself.
Instead of `List<List<ResolvedServerInfo>>`, `onUpdate` now takes
`List<ResolvedServerInfoGroup>` as an argument and every `ResolvedServerInfoGroup`
object can have `Attributes` attached to it which means that we can provide
attributes on each level:
* root level via `onUpdate` argument (applies to all servers)
* group level via property of `ResolvedServerInfoGroup` (applies to all servers
in the group)
* host level via property of `ResolvedServerInfo` (applies to a single server)
CodedInputStream is risk averse in ways that hurt performance when
parsing large messages. gRPC knows how large the input size is as it
is being read from the wire, and only tries to parse it once the entire
message has been read in. The message is represented as chunks of
memory strung together in a CompositeReadableBuffer, and then wrapped
in a custom BufferInputStream.
When passed to Protobuf, CodedInputStream attempts to read data out
of this InputStream into CIS's internal 4K buffer. For messages that
are much larger, CIS copies from the input in chunks of 4K and saved in
an ArrayList. Once the entire message size is read in, it is re-copied
into one large byte array and passed back up. This only happens for
ByteStrings and ByteBuffers that are read out of CIS. (See
CIS.readRawBytesSlowPath for implementation).
gRPC doesn't need this overhead, since we already have the entire
message in memory, albeit in chunks. This change copies the composite
buffer into a single heap byte buffer, and passes this (via
UnsafeByteOperations) into CodedInputStream. This pays one copy to
build the heap buffer, but avoids the two copes in CIS. This also
ensures that the buffer is considered "immutable" from CIS's point of
view.
Because CIS does not have ByteString aliasing turned on, this large
buffer will not accidentally be kept in memory even if only tiny fields
from the proto are still referenced. Instead, reading ByteStrings out
of CIS will always copy. (This copy, and the problems it avoids, can
be turned off by calling CIS.enableAliasing.)
Benchmark results will come shortly, but initial testing shows
significant speedup in throughput tests. Profiling has shown that
copying memory was a large time consumer for messages of size 1MB.
After debugging #2153, it would have been nice to know what the exact
parameter was that was null. This change adds a name for each
checkNotNull (and tries to normalized on static imports in order to
shorten lines)
io.grpc should not be depending on anything from internal. Also, the
convenience method of Deadline is part of our public API and shouldn't
use LogExceptionRunnable because it would surprise our users.
Swapped to lower-case 'log' since the logger is not immutable.
Implementations of ManagedClientTransport.start() are restricted from
calling the passed listener until start() returns, in order to avoid
reentrency problems with locks. For most transports this isn't a
problem, because they need additional threads anyway. InProcess uses no
additional threads naturally so ends up needing a thread just to
notifyReady. Now transports can just return a Runnable that can be run
after locks are dropped.
This was originally intended to be a performance optimization, but the
thread also causes nondeterminism because RPCs are delayed until
notifyReady is called. So avoiding the thread reduces needless fakes
during tests.