Implementations of ManagedClientTransport.start() are restricted from
calling the passed listener until start() returns, in order to avoid
reentrancy problems with locks. For most transports this isn't a
problem, because they need additional threads anyway. InProcess
naturally uses no additional threads, so it ended up needing a thread
just to call notifyReady. Now transports can simply return a Runnable
from start() that the caller runs after its locks are dropped.
This was originally intended to be a performance optimization, but the
thread also causes nondeterminism because RPCs are delayed until
notifyReady is called. So avoiding the thread reduces needless fakes
during tests.
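A minimal sketch of the resulting pattern, using simplified stand-in interfaces rather than the actual grpc-java types:
interface TransportListener {
  void transportReady();
}

interface ClientTransportSketch {
  /** Starts the transport; returns a Runnable the caller invokes after releasing its locks. */
  Runnable start(TransportListener listener);
}

final class InProcessTransportSketch implements ClientTransportSketch {
  @Override
  public Runnable start(final TransportListener listener) {
    // No extra thread is needed: the caller runs the Runnable once its locks are
    // dropped, so the listener is never re-entered while start() is on the stack.
    return () -> listener.transportReady();
  }
}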
WriteQueue uses LinkedBlockingQueue, which has stronger synchronization
semantics than we need, and it requires that we batch reads in order to
get reasonable performance. Profiling showed a ~10us delay between
writing to the LBQ and reading from it. This change switches the
underlying queue to ConcurrentLinkedQueue and removes the read
batching. Using CLQ with batching is slightly slower.
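A simplified sketch of the scheduling idea, assuming the queued items are plain Runnables rather than the real write commands:
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

final class WriteQueueSketch {
  private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
  private final AtomicBoolean scheduled = new AtomicBoolean();
  private final Executor eventLoop;

  WriteQueueSketch(Executor eventLoop) {
    this.eventLoop = eventLoop;
  }

  /** Enqueues a write command and schedules a flush if one is not already pending. */
  void enqueue(Runnable command) {
    queue.add(command);
    if (scheduled.compareAndSet(false, true)) {
      eventLoop.execute(this::flush);
    }
  }

  private void flush() {
    // Clear the flag before draining so a concurrent enqueue reschedules us.
    scheduled.set(false);
    Runnable command;
    // Drain items one at a time; no intermediate batch buffer is needed with CLQ.
    while ((command = queue.poll()) != null) {
      command.run();
    }
  }
}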
Benchmarks show favorable numbers for both latency and throughput.
Each of the following results was run several times:
Before:
Benchmark                         (direct)  (transport)    Mode     Cnt       Score      Error  Units
TransportBenchmark.unaryCall1024      true        NETTY  sample  321575  124185.027  ± 406.112  ns/op
TransportBenchmark.unaryCall1024     false        NETTY  sample  237400  168232.991  ± 548.043  ns/op
After:
Benchmark                         (direct)  (transport)    Mode     Cnt       Score      Error  Units
TransportBenchmark.unaryCall1024      true        NETTY  sample  354773  112552.339  ± 362.471  ns/op
TransportBenchmark.unaryCall1024     false        NETTY  sample  263297  151660.490  ± 507.463  ns/op
QPS with 10 outstanding RPCs per channel:
Before:
Channels: 4
Outstanding RPCs per Channel: 10
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 396
90%ile Latency (in micros): 680
95%ile Latency (in micros): 838
99%ile Latency (in micros): 1476
99.9%ile Latency (in micros): 5231
Maximum Latency (in micros): 43327
QPS: 85761
After:
Channels: 4
Outstanding RPCs per Channel: 10
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 384
90%ile Latency (in micros): 612
95%ile Latency (in micros): 725
99%ile Latency (in micros): 1080
99.9%ile Latency (in micros): 3107
Maximum Latency (in micros): 30447
QPS: 93353
The results are even better under heavy load. QPS with 100
outstanding RPCs per channel:
Before:
Channels: 4
Outstanding RPCs per Channel: 100
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 2735
90%ile Latency (in micros): 5051
95%ile Latency (in micros): 6219
99%ile Latency (in micros): 9271
99.9%ile Latency (in micros): 13759
Maximum Latency (in micros): 44831
QPS: 125775
After:
Channels: 4
Outstanding RPCs per Channel: 100
Server Payload Size: 0
Client Payload Size: 0
50%ile Latency (in micros): 2697
90%ile Latency (in micros): 4639
95%ile Latency (in micros): 5539
99%ile Latency (in micros): 7931
99.9%ile Latency (in micros): 12335
Maximum Latency (in micros): 61823
QPS: 131904
An AsciiString object may only use a subsection of its backing byte array. We need to test for this and return a copy of the subsection if necessary.
Big thanks to @normanmaurer for uncovering this issue: https://github.com/netty/netty/issues/5472
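A hedged sketch of the check, using AsciiString's public array()/arrayOffset()/length()/toByteArray() accessors; this is not necessarily the exact code in the fix:
import io.netty.util.AsciiString;

final class AsciiStringBytes {
  /** Returns the backing array only when the string spans it entirely; otherwise copies the subsection. */
  static byte[] bytes(AsciiString str) {
    byte[] backing = str.array();
    if (str.arrayOffset() == 0 && str.length() == backing.length) {
      return backing;  // safe to reuse: the string covers the whole backing array
    }
    return str.toByteArray();  // copy just the [offset, offset + length) range
  }
}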
Resolves #1756
The thread-unsafe method `io.grpc.testing.TestUtils.pickUnusedPort` causes flakes (#1756) on Windows. We need to avoid using this method in tests, because on Windows the tests run in different JVMs and concurrent calls to this method from multiple processes tend to return the same port number.
There are some usages of this method in benchmarks, so the method has been moved to `io.grpc.benchmarks.Utils`; it will only be used in benchmarks, not in tests.
A transport is "in use" iff its number of streams is > 0. In subsequent changes
the channel will use this information when deciding whether it should
transition to IDLE mode (#1276).
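An illustrative sketch of that rule; the listener callback here is a stand-in, not the actual transport API:
import java.util.concurrent.atomic.AtomicInteger;

final class InUseTracker {
  interface Listener {
    void transportInUse(boolean inUse);
  }

  private final AtomicInteger activeStreams = new AtomicInteger();
  private final Listener listener;

  InUseTracker(Listener listener) {
    this.listener = listener;
  }

  void streamStarted() {
    if (activeStreams.incrementAndGet() == 1) {
      listener.transportInUse(true);   // 0 -> 1: the transport becomes "in use"
    }
  }

  void streamClosed() {
    if (activeStreams.decrementAndGet() == 0) {
      listener.transportInUse(false);  // back to 0: the transport is idle again
    }
  }
}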
Introduce CallCredentials as a first-class option that allows applications
to set per-call credentials into headers for outgoing RPCs. This will
supersede ClientAuthInterceptor. CallCredentials has access to more
information (e.g., transport attributes, the MethodDescriptor) and allows
results to be returned asynchronously, e.g., from blocking I/O, which
was problematic with ClientAuthInterceptor.
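A hedged usage sketch based on the present-day grpc-java and google-auth APIs (MoreCallCredentials.from, AbstractStub.withCallCredentials); the exact methods may have differed at the time of this change, and GreeterGrpc is a hypothetical generated stub:
import com.google.auth.oauth2.GoogleCredentials;
import io.grpc.CallCredentials;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.auth.MoreCallCredentials;

final class CallCredentialsExample {
  static void attachPerCallCredentials() throws Exception {
    ManagedChannel channel =
        ManagedChannelBuilder.forAddress("example.com", 443).build();
    // Wrap application credentials as CallCredentials instead of installing a
    // ClientAuthInterceptor on the channel.
    CallCredentials creds =
        MoreCallCredentials.from(GoogleCredentials.getApplicationDefault());
    // Attach them per call via the stub (hypothetical generated stub):
    // GreeterGrpc.GreeterBlockingStub stub =
    //     GreeterGrpc.newBlockingStub(channel).withCallCredentials(creds);
  }
}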
Adds
ClientStream newStream(MethodDescriptor<?, ?> method, Metadata headers, CallOptions callOptions);
to the ClientTransport interface.
This PR was created first because both the fail-fast implementation and another change will build on this interface change.
This introduces an AbstractStream2 that is intended to replace the
current AbstractStream. Only the server side is implemented in this
commit, which is why AbstractStream remains. This is mostly a
reorganization of AbstractStream and its children, but minor internal
behavioral changes were required, which make it look more like a
reimplementation.
A strong focus was on splitting the state that is maintained on the
application's thread (with Stream) from the state that is maintained by
the transport (and used for StreamListener). Splitting the state makes
it much easier to verify thread-safety and to reason about interactions.
I consider this a stepping stone for making further changes to simplify
the Stream implementations; some of the changes are not yet at their
logical conclusion, and some may be immediately replaced with something
better. The focus was on improving readability and comprehensibility so
that more interesting changes become easier to make.
The only thing really removed is some state checking during sending
that already occurs in ServerCallImpl.
See #933
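A rough structural sketch of the state split described above (member names are illustrative, not the actual AbstractStream2 code):
abstract class StreamSketch {
  // State touched only by the application's thread through the Stream API.
  private boolean outboundClosed;

  final void halfClose() {
    outboundClosed = true;
    // ... hand a command off to the transport thread ...
  }

  // State touched only by the transport's thread and used to drive the StreamListener.
  abstract static class TransportState {
    private boolean listenerClosed;

    final void closeListener() {
      if (!listenerClosed) {
        listenerClosed = true;
        // ... deliver the closed() callback to the StreamListener ...
      }
    }
  }
}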
- Create InternalHandlerRegistry, an immutable look-up table. Handlers
passed to ServerBuilder.addService() go to this registry. This covers
the most common use cases. By keeping the registry internal we could
freely change the registry's interface to accommodate optimizations,
e.g., for hpack.
- The internal registry uses a flat fullMethodName -> handler look-up
table instead of the hierarchical one used before. It is faster because
it saves one look-up and a substring operation; see the sketch after
this list.
- Introduces the fallback registry, settable by
ServerBuilder.fallbackHandlerRegistry(), for advanced users who want a
dynamic registry. Moved the current MutableHandlerRegistryImpl to
io.grpc.util.MutableHandlerRegistry as a stock implementation of the
fallback registry. The io.grpc.MutableHandlerRegistry interface is now
removed.
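A simplified sketch of the flat look-up (InternalHandlerRegistry itself is internal to grpc-java; the public ServerMethodDefinition type is used here just as the handler value type):
import io.grpc.ServerMethodDefinition;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class FlatRegistrySketch {
  private final Map<String, ServerMethodDefinition<?, ?>> methods;

  FlatRegistrySketch(Map<String, ServerMethodDefinition<?, ?>> methods) {
    // Built once from ServerBuilder.addService() and never mutated afterwards.
    this.methods = Collections.unmodifiableMap(new HashMap<>(methods));
  }

  /** One map look-up keyed by the full method name; no substring into service/method needed. */
  ServerMethodDefinition<?, ?> lookup(String fullMethodName) {
    return methods.get(fullMethodName);
  }
}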
So far, we have passed a custom Executor to the NioEventLoopGroup constructor
in order to get custom thread names and stay compatible with both Netty 4 and
Netty 5. However, Netty 5 is no more (RIP), and Netty's DefaultThreadFactory
includes some optimizations around thread-local storage that Guava's executor
does not have.
The thread names will be a bit different, as DefaultThreadFactory additionally
puts the thread pool id after the name prefix.
For example:
Before:
grpc-default-boss-ELG-0
grpc-default-worker-ELG-0
grpc-default-worker-ELG-1
After:
grpc-default-boss-ELG-0-0
grpc-default-worker-ELG-1-0
grpc-default-worker-ELG-1-1
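A sketch of the switch, assuming Netty 4.1's NioEventLoopGroup(int, ThreadFactory) constructor and DefaultThreadFactory(String, boolean):
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.util.concurrent.DefaultThreadFactory;

final class EventLoopGroups {
  static NioEventLoopGroup newWorkerGroup(int numThreads) {
    // DefaultThreadFactory inserts the pool id after the prefix, which is why
    // the resulting thread names gain the extra number shown above.
    return new NioEventLoopGroup(
        numThreads, new DefaultThreadFactory("grpc-default-worker-ELG", /* daemon= */ true));
  }
}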