We previously limited it to 1 minute because there was no server-side
enforcement. Now that there is server-side enforcement, it is safer to
allow clients to choose a lower value, since the service owner still has
to permit it.
Per the spec, this metric should be calculated on the server and sent
back to the client, but the mechanism for that is not currently defined.
As it's not a required metric, we remove the incorrect implementation for
now.
Internal ref: b/37208451
Background
==========
LoadBalancer needs to track RPC measurements and status for
load-reporting. We need to introduce a "Tracer" API for that.
Since such an API is very close to the current
Census (instrumentation)-based stats reporting mechanism in terms of what
is recorded, we will migrate the Census-based stats reporting under the
new Tracer API.
Alternatives
============
We considered plumbing the LB-related information from the LoadBalancer
to the core, and recording that information along with the currently
recorded stats to Census. The LB-related information, such as the LB_ID,
the reason for dropping requests, etc., would be added to the Census
StatsContext as tags.
Since tags are held by StatsContext before eventually being recorded by
providing the measurements, and StatsContext is immutable, this would
require a way for LoadBalancer to override the StatsContext, which means
the LoadBalancer API would have a direct reference to the Census
StatsContext. This is undesirable because the Census API is not stable yet.
Part of the LB-related information is whether the client has received
the initial headers from the server. While such information can be
obtained by implementing a ClientInterceptor, it must be recorded along
with other information, such as the LB_ID, to be useful, and the LB_ID is
only available in GrpclbLoadBalancer.
Bottom line: trying to use solely the Census StatsContext API to record
LB load information would require extra data-plumbing channels between
ClientInterceptor, LoadBalancer, and the gRPC core, as well as exposing
the Census API on the gRPC API. Even with those extensive changes, we
have yet to find a working solution. Therefore, we abandoned this idea
and propose this PR.
Summary of changes
==================
API summary
-----------
Introduce "StreamTracer" API, a callback interface for receiving stats
and tracing related updates concerning **a single stream**.
"ClientStreamTracer" and "ServerStreamTracer" add side-specific
events. A stream can have zero or more tracers and report to all of
them.
On the client side, CallOptions now takes a list of
ClientStreamTracer.Factory. Upon creating a ClientStream, each of the
factories creates a ClientStreamTracer for the stream. This allows
ClientInterceptors to install their own tracer factories by overriding
the CallOptions.
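As an illustration, a ClientInterceptor could attach its own factory roughly
like this (a minimal sketch; the exact `ClientStreamTracer.Factory` callback
signature shown here is assumed and has varied across gRPC versions):
```
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.ClientStreamTracer;
import io.grpc.Metadata;
import io.grpc.MethodDescriptor;

final class TracingInterceptor implements ClientInterceptor {
  @Override
  public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
      MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
    ClientStreamTracer.Factory factory = new ClientStreamTracer.Factory() {
      @Override
      public ClientStreamTracer newClientStreamTracer(Metadata headers) {
        // One tracer is created per physical stream; override the callbacks
        // of interest (e.g., outboundMessage(), streamClosed()) as needed.
        return new ClientStreamTracer() {};
      }
    };
    // Overriding CallOptions installs the factory for all streams of this call.
    return next.newCall(method, callOptions.withStreamTracerFactory(factory));
  }
}
```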
Since StreamTracer only tracks the span of a stream, tracking of a
ClientCall needs to be done in a ClientInterceptor. By installing its
own tracer factory when a ClientCall is created, a ClientInterceptor can
associate the updates for a Call with the updates for the Streams
created for that Call. This is how we keep the existing Census
reporting mechanism in CensusStreamTracerModule.
On the server side, ServerStreamTracer.Factory is added through the
ServerBuilder, and is used to create a ServerStreamTracer for every
ServerStream.
The Tracer API supports propagation of stats/tracing information through
Context and metadata. Both client-side and server-side tracer factories
have access to the headers object. The client-side tracer relies on an
interceptor to read the Context, while the server-side tracer has a
filterContext() method that can override the Context.
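For example, a server-side tracer factory might be registered and override
the Context roughly as follows (a sketch only; the `newServerStreamTracer`
signature is assumed, and the Context key is hypothetical):
```
import io.grpc.Context;
import io.grpc.Metadata;
import io.grpc.Server;
import io.grpc.ServerBuilder;
import io.grpc.ServerStreamTracer;

final class TracedServers {
  // Hypothetical key, for illustration only.
  private static final Context.Key<String> METHOD_KEY = Context.key("traced-method");

  static Server build(int port) {
    return ServerBuilder.forPort(port)
        .addStreamTracerFactory(new ServerStreamTracer.Factory() {
          @Override
          public ServerStreamTracer newServerStreamTracer(
              final String fullMethodName, Metadata headers) {
            return new ServerStreamTracer() {
              @Override
              public Context filterContext(Context context) {
                // Override the Context seen by the application for this stream.
                return context.withValue(METHOD_KEY, fullMethodName);
              }
            };
          }
        })
        .build();
  }
}
```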
Implementation details
----------------------
Only real streams report stats. Pseudo streams, such as the delayed
stream and the failing stream, don't report. InProcess transport streams
currently don't report stats.
"StatsTraceContext" which used to receive updates from core and report
directly to Census (StatsContext), now delegates to the StreamTracers of
a stream. On the client-side, the scope of a StatsTraceContext reduces
from ClientCall to a ClientStream to match the scope of StreamTracer.
The Census-specific logic that was in StatsTraceContext is moved into
CensusStreamTracerModule, which produces factories for StreamTracers
that report to Census.
Reporting with StatsTraceContext is moved out of the Channel/Call layer
into the Transport/Stream layer, to match the scope change of
StatsTraceContext.
Bug fixed
---------
The end of a server-side call was reported in ServerCallImpl's
ServerStreamListenerImpl.closed(), which was wrong, because closed()
receiving OK doesn't necessarily mean the RPC ended with OK. Instead it
means the server has successfully sent the final status, which may be
non-OK, to the client.
Now the end of the call is reported both in ServerStream.close(any Status)
and before calling ServerStreamListener.closed(non-OK). Whichever happens
first determines the reported status.
TODOs
=====
A follow-up change to the LoadBalancer API will add a
ClientStreamTracer.Factory to the PickResult to complete the API needed
by load-reporting.
To be in line with `NettyServerBuilder` APIs:
- Deprecated `enableKeepAlive(boolean enable)` and
  `enableKeepAlive(boolean enable, long keepAliveDelay, TimeUnit delayUnit, long keepAliveTimeout, TimeUnit timeoutUnit)`,
  which never worked in v1.2
- Added `keepAliveTime(long keepAliveTime, TimeUnit timeUnit)` and
  `keepAliveTimeout(long keepAliveTimeout, TimeUnit timeUnit)`; see the sketch below
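A usage sketch of the new setters (the builder class and the values shown are
illustrative assumptions, not taken from the change itself):
```
ManagedChannel channel = NettyChannelBuilder.forAddress("example.com", 443)
    .keepAliveTime(5, TimeUnit.MINUTES)        // replaces enableKeepAlive(...)
    .keepAliveTimeout(20, TimeUnit.SECONDS)
    .build();
```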
executor.schedule() will "eat" any exceptions thrown by the Runnables,
because the Future is expected to be used to see them. However, we never
call get() on the Future, so we need to log the exceptions ourselves,
like we do elsewhere in this case.
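A minimal sketch of the kind of wrapper this implies (hypothetical class name;
the real helper in the codebase may differ):
```
import java.util.logging.Level;
import java.util.logging.Logger;

final class LoggingRunnable implements Runnable {
  private static final Logger log = Logger.getLogger(LoggingRunnable.class.getName());
  private final Runnable task;

  LoggingRunnable(Runnable task) {
    this.task = task;
  }

  @Override
  public void run() {
    try {
      task.run();
    } catch (Throwable t) {
      // The ScheduledFuture would normally swallow this; log it before rethrowing.
      log.log(Level.SEVERE, "Exception while executing runnable " + task, t);
      throw t;
    }
  }
}
```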
We got a NullPointerException from ClientCallImpl#startDeadlineTimer
when a new Call is created after a Netty channel is terminated. Here
is a stacktrace:
INTERNAL: java.lang.NullPointerException
at io.grpc.internal.ClientCallImpl.startDeadlineTimer(ClientCallImpl.java:320)
at io.grpc.internal.ClientCallImpl.start(ClientCallImpl.java:253)
The following code snippet reproduces the bug:
```
ManagedChannel channel = NettyChannelBuilder.forAddress(host, port)
    .usePlaintext(true)
    .build();
channel.shutdown();
Thread.sleep(1000);
GreeterGrpc.GreeterBlockingStub stub =
    GreeterGrpc.newBlockingStub(channel)
        .withDeadlineAfter(10, TimeUnit.SECONDS);
stub.sayHello(HelloRequest.newBuilder().setName("world").build());
```
The issue was that ClientCallImpl is created from RealChannel#newCall
*after* ManagedChannelImpl#maybeTerminateChannel is called and
scheduledExecutor is set to null. In such a scenario,
deadlineCancellationExecutor is set to null.
I think there are several ways to fix this, but one way would be to
just avoid calling startDeadlineTimer() when
deadlineCancellationExecutor is null. DelayedClientTransport will
create a FailingClientStream with Status.UNAVAILABLE and we will get
```
Exception in thread "main" io.grpc.StatusRuntimeException:
UNAVAILABLE: Channel has shutdown (reported by delayed transport)
```
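A minimal sketch of that guard, using the field and method names from the
stack trace above (the deadline argument is a placeholder):
```
// Inside ClientCallImpl.start(), conceptually:
if (deadlineCancellationExecutor != null) {
  startDeadlineTimer(effectiveDeadline);  // skip scheduling once the channel is terminated
}
```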
This removes a needless warning, and isn't much slower. Also this
includes a benchmark for StatsTraceContext to measure the overhead
for creation. It adds about 40ns per RPC. Optimization will come
after structural changes are made to break the dependency on
Census.
Resolves #2716
- Add attributes to EquivalentAddressGroup (see the sketch below)
- Deprecate ResolvedServerInfoGroup in favor of EquivalentAddressGroup
- Deprecate ResolvedServerInfo, because attributes for a single address
  within an address group have not been found to be useful.
- The changes to the NameResolver and LoadBalancer interfaces are
  backward-compatible for the next release, so implementors can switch to
  the new API smoothly.
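A rough sketch of the new shape (the attribute key, its creation method, and
the example addresses are illustrative assumptions):
```
Attributes.Key<String> REGION_KEY = Attributes.Key.of("example-region");  // hypothetical key
Attributes attrs = Attributes.newBuilder()
    .set(REGION_KEY, "us-east1")
    .build();
// One group per server, with attributes shared by all of that server's addresses:
EquivalentAddressGroup group =
    new EquivalentAddressGroup(Arrays.asList(ipv4Addr, ipv6Addr), attrs);
```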
As a related change, redefine the semantics of DnsNameResolver and
RoundRobinLoadBalancer:
- Before: DnsNameResolver returns all addresses in one address group.
  RoundRobinLoadBalancer ignores the grouping of addresses and
  round-robins over every single address. This doesn't work well with the
  one-server-multiple-addresses setup, e.g., when both IPv4 and IPv6
  addresses are returned for a single server, even if they are put in the
  same address group by the NameResolver.
- After: DnsNameResolver returns every address in its own
  EAG. RoundRobinLoadBalancer takes an EAG as a whole, and only
  round-robins over the list of EAGs (see the sketch below). The new
  behavior is a better interpretation of EAGs, and properly allows the
  case where one server has more than one address (e.g., IPv4 and IPv6).
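A conceptual sketch of the new round-robin granularity (not the actual
RoundRobinLoadBalancer code): the rotation index walks over whole groups,
never over the addresses inside a group.
```
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class GroupRoundRobin<T> {
  private final List<T> groups;                  // one entry per EquivalentAddressGroup
  private final AtomicInteger index = new AtomicInteger();

  GroupRoundRobin(List<T> groups) {
    this.groups = groups;
  }

  T nextGroup() {
    // Rotate over whole groups; all addresses of a group are treated as one unit.
    int i = Math.floorMod(index.getAndIncrement(), groups.size());
    return groups.get(i);
  }
}
```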
This change will affect users that use custom LoadBalancer with the
stock DnsNameResolver, and those who use custom NameResolver with the
stock RoundRobinLoadBalancer.
Users who use both the stock DnsNameResolver and RoundRobinLoadBalancer
or PickFirstBalancer will see no behavioral change, because they will
still round-robin on individual addresses from DNS, or do pick-first on
all addresses from DNS (PickFirstBalancer flattens all addresses).
The result is a simpler API and a reduction of boilerplate.
`keepAliveManager#onTransportShutdown` should not be called in `transport.shutdown()`, because it is possible that there are still open RPC streams, possibly inactive, so keepalive is still needed.
Preparing to support server-side keepalive.
For convenience on the server side, no longer use the Ping `onSuccess()` callback to cancel shutdownFuture; instead, treat `onDataReceived()` as the ping ack and cancel shutdownFuture there.
The balancer service attaches a token string to each Server entry it
sends to the client. The client has to set the token in the "lb-token"
header when assigning that Server entry to an RPC.
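For illustration, the header could be attached with a Metadata key like this
(the key constant and the token variable are assumptions about how a client
might define them):
```
Metadata.Key<String> LB_TOKEN_KEY =
    Metadata.Key.of("lb-token", Metadata.ASCII_STRING_MARSHALLER);
Metadata headers = new Metadata();
headers.put(LB_TOKEN_KEY, tokenFromBalancer);  // token received with the Server entry
```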
For convenience of testing, also implemented hashCode() and equals() for
PickResult.
The build (JDK version jdk1.8.0_91) produces some noise:
```
/usr/local/google/home/zdapeng/git/grpc-java/core/src/main/java/io/grpc/LoadBalancer.java:196: warning - Tag @link: can't find pickSubchannel(PickSubchannelArgs) in io.grpc.LoadBalancer.SubchannelPicker
/usr/local/google/home/zdapeng/git/grpc-java/core/src/main/java/io/grpc/LoadBalancer.java:388: warning - Tag @link: can't find pickSubchannel(
PickSubchannelArgs) in io.grpc.LoadBalancer.SubchannelPicker
/usr/local/google/home/zdapeng/git/grpc-java/core/src/main/java/io/grpc/LoadBalancer.java:178: warning - Tag @link: can't find pickSubchannel(PickSubchannelArgs) in io.grpc.LoadBalancer.SubchannelPicker
3 warnings
:grpc-core:javadocJar
```
It seems there is nothing wrong with the javadoc, and it could be a javac bug, but here's a workaround.
Fix JavaStyle and ErrorProne warnings found in the internal weekly import:
- Calls to ExpectedException#expect should always be followed by exactly one statement.
- Do not mock 'java.util.concurrent.Future'
handleResolvedAddresses constructs subchannels for each address in
an EquivalentAddressGroup, sharing a reference to an immutable
Attributes. However, as noted, this Attributes instance contains
a mutable AtomicReference, eventually causing subchannel state
changes to be improperly reflected in *all* subchannels of an
EquivalentAddressGroup.
Guava 20 introduced some overloading optimizations for Preconditions
that require using Guava 20+ at runtime. Unfortunately, Guava 20 removed
some things that caused incompatibilities with other libraries, like
Cassandra. While the incompatibility did trigger some of those libraries
to improve compatibility with newer Guavas, we'd like to give the
community more time to work through it. See #2688
At this commit, we appear to be compatible with Guava 18+. It's not
clear if we want to actually "support" 18, but it did compile. Guava 17
lacks at least MoreObjects, directExecutor, and firstNonNull.
Guava 21 compiles without warnings, so it should be compatible with
Guava 22 when it is released.
One test method will fail with the upcoming Guava 22, but this won't
impact applications. I made MoreThrowables to avoid using any
known-deprecated Guava methods in our JARs, to reduce pain for those
stuck with old versions of gRPC in the future (July 2018).
In the stand-alone Android apps I removed unnecessary explicit deps
instead of syncing the version used.
ErrorProne provides static analysis for common issues, including
misuse of variables protected by @GuardedBy locks.
This increases build time by 60% for parallel builds and 30% for
non-parallel, so I've provided a way to disable the check. It is on by
default though and will be run in our CI environments.
Futures almost universally should be handled in some way when being
returned, either to receive the value or to cancel scheduled tasks to
prevent leaks.
Netty is a bit of a special case though, since it constantly returns
futures that you ignore (even adding a listener returns the "this"
future). So we want to suppress the warning for code using Netty instead
of trying to fix it. When we enable ErrorProne in the build, we should
start passing -Xep:FutureReturnValueIgnored:OFF in the compilerArgs.
The LBv2 setter was added in #2658 as an abstract method, which breaks
anyone who implements ManagedChannelBuilder. Here we add a default
implementation that throws.
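The shape of the fix, as a sketch with a hypothetical setter name and factory
type (the actual method is the LBv2 setter referenced above):
```
// In ManagedChannelBuilder: a concrete default instead of an abstract method,
// so existing subclasses keep compiling; builders that support LBv2 override it.
public T loadBalancerFactory(LoadBalancer2.Factory factory) {
  throw new UnsupportedOperationException("LBv2 is not supported by this builder");
}
```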
This is needed because in interceptor tests, often the types cannot
be changed. The void methods stay for users who are writing tests
where they actually don't care about types. The noop methods
require types to be specified. This is for users who don't care
about the implementation. These represent different levels of
commitment.
This eases the transition of code Mocking MethodDescriptor, which
breaks in this release.
In some environments DNS is not available and is performed by the
CONNECT proxy. Nothing "special" should need to be done for these
environments, but the previous support took shortcuts which knowingly
would not support such environments.
This change should fix both OkHttp and Netty. Netty's
Bootstrap.connect() resolved the name immediately whereas using
ChannelPipeline.connect() waits until the address reaches the end of the
pipeline. Netty uses NetUtil.toSocketAddressString() to get the name of
the address, which uses InetSocketAddress.getHostString() when
available.
OkHttp is still using InetSocketAddress.getHostName() which may issue
reverse DNS lookups. However, if the reverse DNS lookup fails, it should
convert the IP to a textual string like getHostString(). So as long as
the reverse DNS maps to the same machine as the IP, there should only be
performance concerns, not correctness issues. Since the DnsNameResolver
is creating unresolved addresses, the reverse DNS lookups shouldn't
occur in the common case.
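For reference, the relevant java.net behavior (standard library only, nothing
gRPC-specific):
```
// An unresolved address defers DNS to whatever finally connects (e.g., the CONNECT proxy).
InetSocketAddress addr = InetSocketAddress.createUnresolved("service.example.com", 443);
String host = addr.getHostString();  // returns the literal host, no reverse DNS lookup
```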
Because "core"'s test source already depends on "testing", e.g.,
`core/src/test/java/io/grpc/internal/ServerCallImplTest.java` uses
`testing/src/main/java/io/grpc/internal/testing/StatsTestUtils.java`,
which forms a circular dependency.
This change moves the StatsContext setter accessors from "core"'s test
source to "testing".
Resolves#2651
This removes an abuse of the scheduled executor in ManagedChannelImpl. The
executor was used to avoid deadlocking. Now we run the work on the same
thread, but delay it until locks have been released.
There is no need to fix ManagedChannelImpl2. Due to its different
threading model, it didn't need to abuse the scheduledExecutor.
Fixes #2444
These are only used in internal tests. In production,
StatsContextFactory is loaded by the "Instrumentation" library and must
be one per process, thus we won't allow setting it on a per-channel or
per-server basis.
* Remove else, make it easier to read
* Ready subchannel list will be at most the same size as the input list
* Make status private, add pkg-private getter for tests
* Remove empty field, cache size, simplify logic
* Make list unmodifiable. Document index with GuardedBy (see #nextSubchannel())
Add `getAttributes()` to `ClientStream` and `ClientCall` to be able to share client transport
information, such as socket TOS, with higher-level APIs once the RPC picks up an active transport that is ready to use.
Previously it did this at shutdown, which was wrong because the executor
may still be used before the server is terminated.
Resolves #2034
Uses ObjectPool to make this change testable. Cleans up test and makes
it mostly single-threaded, except for two deadlock tests that have to be
multi-threaded.
If a LoadBalancer2 is passed in, the builder will create ManagedChannelImpl2 instead of ManagedChannelImpl. This allows us to test the LBv2 classes on a large scale.
1. Adapt to LoadBalancer2 interface. Channel holds on to a single
DelayedClientTransport2.
2. Lock-free: every channel state mutation, including Subchannel
mutations, calling into LoadBalancer, idleness and shutdown, is made
from channelExecutor.
3. Idleness grace period is no longer needed.
In upstream, Census is renamed to "Instrumentation". `com.google.census` is renamed to `com.google.instrumentation.stats`.
In gRPC, "census" in every name is replaced by "stats".
Besides API changes, this implementation is also up-to-date with the
latest design:
1. Delegate to round-robin and pick-first policies if requested by
the naming system.
2. OOB channels to LoadBalancer always use the LB authority provided by
the naming system.
3. Never send application RPCs to balancer addresses, even if the
address returns an UNIMPLEMENTED error.
"Because of spurious wakeups, wait() must always be called in a loop".
Got rid of wait().
"If you return or throw from a finally, then values returned or thrown from the try-catch block will be ignored. Consider using try-with-resources instead."
Ignored the exception thrown from finally.
1. Use ChannelExecutor to run callbacks. Now callbacks are no longer
run under the delayed transport lock.
2. Use explicit picker version instead of relying on identity equality
to detect new pickers. Rationale: if reprocess() is called again with
the same picker, all pending streams will be reprocessed by this picker
again, thus there is no reason to leave out the racing new stream.
* Fork DelayedClientTransport into DelayedClientTransport2, and fix a race
on it. Consider the following sequence:
1. Channel is created. Picker is initially null
2. Channel has a new RPC. Because picker is null, the delayed transport
is picked, but newStream() is not called yet.
3. LoadBalancer updates a new picker to Channel. Channel runs
reprocess() but does nothing because no pending stream is there.
4. newStream() called on the delayed transport.
In the previous implementation, newStream() in step 4 will not see the
picker; it will only use the next picker.
After this change, delayed transport would save the latest picker and
use it on newStream(), which is the desired behavior.
Also deleted all the code that will not be used after the LB refactor.
* Also fixed a bug: newStream() should always return a failing stream if the transport is shut down. Previously it was not doing so if the picker was not null.
OutputStreamAdapter is a private class, and is only ever called from two
places: ByteStreams.copy, which never calls the single-byte method, and
DrainTo, which potentially can. There are two classes that implement
DrainTo, primarily ProtoInputStream, which calls
MessageLite.writeTo(OutputStream) or delegates back to ByteStreams.copy.
MessageLite.writeTo in turn wraps the OutputStream in a
CodedOutputStream.OutputStreamEncoder, which never calls the single-byte
version. Thus, all known implementations never call the single-byte
override.
Additionally, it is well known that the single byte write is slow, and
is expected to be wrapped in a BufferedOutputStream if there are many
small writes.
- Remove unused variable terminated from TransportListener#transportTerminated
- Do not mention getLock in the javadoc of InUseStateAggregator2#handleNotInUse
FakeClock used a PriorityQueue for storing tasks, which is not
thread-safe, and caused a flake:
> io.grpc.internal.ClientCallImplTest > deadlineExceededBeforeCallStarted FAILED
> java.lang.ArrayIndexOutOfBoundsException: -1
> at java.util.PriorityQueue.removeAt(PriorityQueue.java:619)
> at java.util.PriorityQueue.remove(PriorityQueue.java:378)
> at io.grpc.internal.FakeClock$ScheduledTask.cancel(FakeClock.java:89)
> at io.grpc.internal.ClientCallImpl.removeContextListenerAndCancelDeadlineFuture(ClientCallImpl.java:296)
> at io.grpc.internal.ClientCallImpl.start(ClientCallImpl.java:250)
> at io.grpc.internal.ClientCallImplTest.deadlineExceededBeforeCallStarted(ClientCallImplTest.java:615)
Document the threading requirements.
Turn all metrics into volatile fields, because callEnded() may not be
synchronized with other metric-updating methods in the case where the RPC
is closed because of a transport error or cancellation from the other
side.
Remove precondition checks, thus allowing metric-updating methods to be
called after callEnded() has been called, which may happen since the
application may not be aware when the RPC is closed by the transport.
This is the first step of a major refactor of the LoadBalancer-related part of the Channel impl (#1600). It forks TransportSet into InternalSubchannel and makes changes to it.
What's changed:
- InternalSubchannel no longer has a delayed transport, thus will not buffer
  streams when a READY real transport is absent.
- When there isn't a READY real transport, obtainActiveTransport() will
  return null.
- InternalSubchannel is no longer a ManagedChannel.
- Overhauled the Callback interface, with onStateChange() replacing the
  ad-hoc transport event methods.
- Call out channelExecutor, which is a serializing executor that runs
  the Callback (see the sketch below).
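A conceptual sketch of what such a serializing executor does (not the actual
ChannelExecutor code): tasks can be queued from any thread, and whichever
thread starts the drain runs them one at a time.
```
import java.util.ArrayDeque;
import java.util.Queue;

final class SerializingExecutor {
  private final Queue<Runnable> queue = new ArrayDeque<>();
  private boolean draining;  // guarded by queue

  void execute(Runnable task) {
    boolean drainHere;
    synchronized (queue) {
      queue.add(task);
      drainHere = !draining;
      if (drainHere) {
        draining = true;
      }
    }
    if (drainHere) {
      drain();
    }
  }

  private void drain() {
    while (true) {
      Runnable task;
      synchronized (queue) {
        task = queue.poll();
        if (task == null) {
          draining = false;
          return;
        }
      }
      try {
        task.run();
      } catch (RuntimeException e) {
        // Keep draining so one failing callback doesn't wedge the queue.
      }
    }
  }
}
```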
The first commit is an unmodified copy of the files that are being forked. Please review the second commit, which makes the changes on the forked files.
NoopCensusContextFactory used to use a shared zero-sized ByteBuffer,
which has mutable state, such as its position, and thus is not
thread-safe. Sharing it means it will be used by all calls, possibly from
different threads, which makes TSAN unhappy in tests.
Resolves #2377
Resolves grpc/grpc#8715
Now that setListener is called prior to
`JumpToApplicationThreadServerStreamListener` being completely ready to
use, we should not call `AbstractStream2#onStreamAllocated()` inside
`setListener()` anymore, but call it after `ServerImpl#streamCreated()`
has completed.
Resolves #1936
Two bugs fixed:
- NPE in `ServerImpl#streamCreated()` when the stream listener is not set
  before the stream is closed
- It is possible that `internalCancel()` is called during
  `InProcessClientStream#start()` due to an early server `onComplete()` or server `onError()`;
  in this case there is no need to enlist the stream in `streams`, otherwise the channel cannot be shut down by `shutdown()`.
Proxies can trigger errors, and we'd like to categorize common
proxy-generated errors into gRPC status codes for sane application
handling.
Fixes #2264
We only want to use the HTTP code for errors when the response is not
gRPC. gRPC status codes may be mapped to HTTP codes in the future, and
we don't want to break when that happens. We also don't want to ever
accidentally use Status.OK without receiving it from the server, even
for HTTP 200.
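A sketch of that rule with an illustrative mapping (the specific
code-to-status choices here are assumptions, not the library's exact table):
```
// Applied only when the response is not a gRPC response; never returns OK.
static Status statusForNonGrpcResponse(int httpCode) {
  switch (httpCode) {
    case 401: return Status.UNAUTHENTICATED.withDescription("HTTP status " + httpCode);
    case 403: return Status.PERMISSION_DENIED.withDescription("HTTP status " + httpCode);
    case 404: return Status.UNIMPLEMENTED.withDescription("HTTP status " + httpCode);
    case 429:
    case 502:
    case 503:
    case 504: return Status.UNAVAILABLE.withDescription("HTTP status " + httpCode);
    default:  return Status.UNKNOWN.withDescription("HTTP status " + httpCode);
  }
}
```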
core: adds @Nullable Object getAttachedObject() to ServiceDescriptor
compiler: Plumbing necessary to access proto file descriptors via
the reflection service
Found this bug in a unit test in which a client-streaming call hangs because `ServerCallImpl#messageRead()` did not handle a RuntimeException properly.
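A sketch of the defensive handling implied here (names are from the
description; the close call and its arguments are assumptions):
```
// Inside the server-side message handling, conceptually:
try {
  listener.onMessage(request);
} catch (RuntimeException e) {
  // Fail the call instead of leaving the client-streaming RPC hanging.
  call.close(Status.fromThrowable(e), new Metadata());
  throw e;
}
```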