This change is needed after trying to use the new-style protocol negotiators internally. The problem is that some handlers fire the event in handlerAdded, which is too early: the follow-up ProtocolNegotiationEvent (PNE) is fired after handlerAdded, which breaks the composability of the negotiators.
To fix this, this change modifies the negotiation flow. Specifically:
* Negotiators should NEVER fire a negotiation event from handlerAdded; instead they should wait until userEventTriggered (see the sketch below)
* Negotiators now do state checking on the PNE. If it is set twice, it fails. If it has not been received when doing the next stage of negotiation, it fails.
* WBAEH now fires the initial, default event. It is the only handler that may fire it from handlerAdded.
The updated tests are ones not using WBAEH (though they probably should). This change also ensures attributes aren't lost during negotiation.
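A minimal sketch of the handlerAdded/userEventTriggered rule, using Netty directly; the NegotiationEvent marker class here is hypothetical, standing in for the real ProtocolNegotiationEvent:
```
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

final class DeferredNegotiator extends ChannelInboundHandlerAdapter {
  // Hypothetical stand-in for the real negotiation event type.
  static final class NegotiationEvent {}

  private NegotiationEvent receivedEvent;  // must be set at most once

  @Override
  public void handlerAdded(ChannelHandlerContext ctx) {
    // Intentionally empty: firing here would race ahead of the event coming
    // from the negotiator added before this handler.
  }

  @Override
  public void userEventTriggered(ChannelHandlerContext ctx, Object evt) {
    if (evt instanceof NegotiationEvent) {
      if (receivedEvent != null) {
        ctx.fireExceptionCaught(new IllegalStateException("negotiation event received twice"));
        return;
      }
      receivedEvent = (NegotiationEvent) evt;
      // Forward the same event (and its attributes) rather than a fresh default one.
      ctx.fireUserEventTriggered(evt);
    } else {
      ctx.fireUserEventTriggered(evt);
    }
  }
}
```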
This change removes the WriteQueue linking and splits it out into each
of the commands, so that the trace is more precise, and the tag
information is correct.
It is still unclear what the initial Tag should be for ClientCallImpl,
since it should not access the TransportState to get the HTTP/2 stream id.
The issue: HealthCheckingLoadBalancer.shutdown() calls
hcState.onSubchannelState(SHUTDOWN) which removes that hcState from
helper.hcStates. Therefore, if more than one Subchannel is present, a
ConcurrentModificationException will be thrown.
This is an alternative approach from #5848 that was reverted in #5875. Thanks to #5883, HealthCheckingLoadBalancer.shutdown() no longer has to fake SHUTDOWN notifications, and can completely rely on Subchannels' real SHUTDOWN notifications for triggering the clean-up.
This reverts commit e795f14bed. It breaks some of
our internal tests. At the very least the cast fails because we are reusing the
service for something without going through the normal stub. There may also
have been a test hang, but it'd take more effort to figure out.
No more methods on the `LoadBalancer` will be called after
`LoadBalancer#shutdown()` is called. This includes
`LoadBalancer#handleSubchannelState()` too. `SubchannelStateListener`
inherited this restriction. However, this special case makes
`onSubchannelState(SHUTDOWN)` an unreliable way of being notified
about `Subchannel` SHUTDOWN, and may confuse/complicate a
wrapping `LoadBalancer` that expects the full notification (e.g., #5875).
The javadoc isn't clear whether this restriction applies. I think
it's more useful to make it not apply.
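As a rough illustration of the concern (the callback is modeled on SubchannelStateListener, but the bookkeeping map is hypothetical), a wrapping LoadBalancer that relies on the real SHUTDOWN notification for cleanup would look roughly like this:
```
import io.grpc.ConnectivityState;
import io.grpc.ConnectivityStateInfo;
import java.util.Map;

final class PerSubchannelCleanup<K> {
  private final Map<K, Object> trackedState;  // per-Subchannel bookkeeping (assumed)

  PerSubchannelCleanup(Map<K, Object> trackedState) {
    this.trackedState = trackedState;
  }

  void onSubchannelState(K subchannelKey, ConnectivityStateInfo newState) {
    if (newState.getState() == ConnectivityState.SHUTDOWN) {
      // If this notification may be skipped after LoadBalancer#shutdown(),
      // the wrapper has to fake it (as #5848 did) or leak this entry.
      trackedState.remove(subchannelKey);
    }
  }
}
```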
We were logging when withDeadline() was used, not when the Context was used. As
discovered while looking at https://stackoverflow.com/q/56593692/4690866 .
In e19e8f7d updateTimeoutHeaders was removed and logIfContextNarrowedTimeout
was called directly. However, the two methods had reversed ordering of
callDeadline/outerCallDeadline, and the caller's arguments were not
swapped to match.
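A generic illustration (not the gRPC code itself) of how such a swap slips through when both parameters have the same type:
```
final class DeadlineLogDemo {
  // The fixed method declares (callDeadline, outerCallDeadline)...
  static void logIfNarrowed(long callDeadlineNanos, long outerCallDeadlineNanos) {
    if (callDeadlineNanos < outerCallDeadlineNanos) {
      System.out.println("Call timeout narrowed by an outer deadline.");
    }
  }

  public static void main(String[] args) {
    long callDeadlineNanos = 1_000;
    long outerCallDeadlineNanos = 5_000;
    // A caller still written against the old (outer, call) ordering compiles
    // unchanged, but now checks the opposite condition and stays silent:
    logIfNarrowed(outerCallDeadlineNanos, callDeadlineNanos);
  }
}
```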
* fixed a bug when intercepting a PickResult with no Subchannel: the interceptor should just return the original PickResult (see the sketch after this list). Also fixed a test that was not correct.
* changed ClientLoadCounter to a mock in XdsLoadStatsStoreTest; it's not necessary to instantiate a real instance.
* added a TODO comment suggesting a warning when the desired locality counter is missing while intercepting a PickResult
* use isSameInstanceAs instead of isEqualTo when verifying that an invalid PickResult is passed through unchanged.
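A minimal sketch of the pass-through behavior described above; the loadRecordingFactory parameter is an illustrative name for whatever per-locality tracer factory the interceptor installs:
```
import io.grpc.ClientStreamTracer;
import io.grpc.LoadBalancer.PickResult;

final class LoadRecordingPicks {
  static PickResult intercept(PickResult result, ClientStreamTracer.Factory loadRecordingFactory) {
    if (result.getSubchannel() == null) {
      // Error, drop, or no-result picks carry no Subchannel: return the
      // original instance unchanged (hence isSameInstanceAs in the test).
      return result;
    }
    return PickResult.withSubchannel(result.getSubchannel(), loadRecordingFactory);
  }
}
```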
panic mode was temporarily disabled by #4152 and re-enabled by #4245,
but the tests were not re-enabled. This left some test code that was
broken but never executed.
This reverts commit c6f15162ff. It broke
an internal health checking test because the server wouldn't shut down.
We assume the health checking RPC isn't getting closed.
* integrate recordDropRequest in LocalityStore
* integrated StatsStore#addLocality and StatsStore#removeLocality in LocalityStore in handling EDS response.
* integrated picker interception in LocalityStore
* integrate XdsLoadReportClient in XdsLoadBalancer
* put removing locality counters after updating subchannel pickers to narrow down the race window
* fixed modifier for XdsLoadReportClientFactory
* refactor handleNewConfig method in XdsLoadBalancer for better readability
* edited message for closing lb rpc when balancer name changes
* weakened the specification of XdsLoadReportClient to allow start/stop to be called multiple times.
* removed the lrsWorking flag as we relaxed the precondition of calling start/stop on XdsLoadReportClient
* refactor initLbChannel to be a factory method for better readability
* added comment for the case when child policy changes, lrs should not be affected
* changed comments for eliminating potential load loss upon locality update.
* make lb RPC cancellation message more informative
Transport level exceptions (e.g. "Connection reset by peer") are not
useful and clutter the logs. `NettyServerTransport` contains logic to
log such exceptions at level `FINE`.
When running with epoll, transport level exceptions are prefixed with
additional contextual information (e.g. "syscall:read(..) failed:") that
causes the exceptions to be logged at level `INFO`.
Update the filtering logic to match on error messages _containing_ the
blacklisted messages, rather than using string equality.
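A minimal sketch of that matching change (the list contents and class name here are illustrative, not the exact gRPC code):
```
import java.util.Arrays;
import java.util.List;

final class QuietExceptionFilter {
  private static final List<String> QUIET_MESSAGES = Arrays.asList(
      "Connection reset by peer",
      "An existing connection was forcibly closed by the remote host");

  static boolean isQuiet(Throwable t) {
    String message = t.getMessage();
    if (message == null) {
      return false;
    }
    for (String quiet : QUIET_MESSAGES) {
      // contains() tolerates prefixes like "syscall:read(..) failed: ",
      // where the previous equals() check did not.
      if (message.contains(quiet)) {
        return true;
      }
    }
    return false;
  }
}
```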
Closes #5872.
Signed-off-by: Nick Travers <n.e.travers@gmail.com>
gRPC issues flushes after close in the WriteQueue, which can show up as an NPE in the framer. This was thought to have been handled by checking whether there were any pending writes, but if the close() call gets far enough, the writes will be null. This causes an NPE when the flush comes through.
The issue is difficult to reproduce, but I think my test case emulates the failure. EmbeddedChannel is different from the normal Channels we use, making the precise ordering tough to replicate. The test case isn't exactly what the production code would do, but it does have the same ordering.
cc @jiangtaoli2016
Sample Stack trace:
```
Jun 10, 2019 2:09:03 PM io.grpc.ChannelLogger log
FINEST: [OobChannel<10>] Entering SHUTDOWN state
Jun 10, 2019 2:09:03 PM io.grpc.ChannelLogger log
FINEST: [Subchannel-OOB<11>: (fake-authority-that-is-always-the-same)] NettyClientTransport<14>: (/0:0:0:0:0:0:0:1:20008) SHUTDOWN with UNAVAILABLE(OobChannel is shutdown)
Jun 10, 2019 2:09:03 PM io.grpc.netty.NettyClientHandler close
FINE: Network channel being closed by the application.
Jun 10, 2019 2:09:03 PM io.grpc.internal.ClientCallImpl logIfContextNarrowedTimeout
FINE: Call timeout set to '4999299080' ns, due to context deadline. Explicit call timeout was not set.
Jun 10, 2019 2:09:03 PM io.netty.handler.codec.http2.Http2FrameLogger logGoAway
FINE: [id: 0x4bcebba6, L:/0:0:0:0:0:0:0:1:33296 - R:/0:0:0:0:0:0:0:1:20008] OUTBOUND GO_AWAY: lastStreamId=0 errorCode=0 length=0 bytes=
Jun 10, 2019 2:09:03 PM io.grpc.netty.NettyClientHandler onConnectionError
FINE: Caught a connection error
java.lang.NullPointerException
at io.grpc.alts.internal.TsiFrameHandler.flush(TsiFrameHandler.java:126)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:754)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:746)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:732)
at io.netty.handler.codec.http2.Http2ConnectionHandler.flush(Http2ConnectionHandler.java:201)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:754)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:746)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:732)
at io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:978)
at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:253)
at io.grpc.netty.WriteQueue.flush(WriteQueue.java:124)
at io.grpc.netty.WriteQueue.access$000(WriteQueue.java:32)
at io.grpc.netty.WriteQueue$1.run(WriteQueue.java:44)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:405)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Jun 10, 2019 2:09:03 PM io.netty.channel.AbstractChannelHandlerContext notifyHandlerException
WARNING: An exception was thrown by a user handler while handling an exceptionCaught event
java.lang.NullPointerException
at io.grpc.alts.internal.TsiFrameHandler.flush(TsiFrameHandler.java:126)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:754)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:746)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:732)
at io.netty.handler.codec.http2.Http2ConnectionHandler.onError(Http2ConnectionHandler.java:629)
at io.grpc.netty.AbstractNettyHandler.exceptionCaught(AbstractNettyHandler.java:81)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:297)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:276)
at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:268)
at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:143)
at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:297)
at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:836)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:756)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:746)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:732)
at io.netty.handler.codec.http2.Http2ConnectionHandler.flush(Http2ConnectionHandler.java:201)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:754)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:746)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:732)
at io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:978)
at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:253)
at io.grpc.netty.WriteQueue.flush(WriteQueue.java:124)
at io.grpc.netty.WriteQueue.access$000(WriteQueue.java:32)
at io.grpc.netty.WriteQueue$1.run(WriteQueue.java:44)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:405)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Jun 10, 2019 2:09:03 PM io.grpc.netty.NettyClientHandler channelInactive
FINE: Network channel is closed
Jun 10, 2019 2:09:03 PM io.grpc.ChannelLogger log
FINEST: [Subchannel-OOB<11>: (fake-authority-that-is-always-the-same)] NettyClientTransport<14>: (/0:0:0:0:0:0:0:1:20008) Terminated
Jun 10, 2019 2:09:03 PM io.grpc.ChannelLogger log
FINEST: [Subchannel-OOB<11>: (fake-authority-that-is-always-the-same)] Terminated
```
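A minimal sketch of the guard, assuming a handler whose close() releases its pending-write queue (setting it to null) before a trailing flush from the WriteQueue arrives; the handler and field names are illustrative, not the exact TsiFrameHandler code:
```
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelPromise;
import io.netty.channel.PendingWriteQueue;
import java.nio.channels.ClosedChannelException;

final class GuardedFrameHandler extends ChannelDuplexHandler {
  private PendingWriteQueue pendingWrites;

  @Override
  public void handlerAdded(ChannelHandlerContext ctx) {
    pendingWrites = new PendingWriteQueue(ctx);
  }

  @Override
  public void close(ChannelHandlerContext ctx, ChannelPromise promise) throws Exception {
    if (pendingWrites != null) {
      pendingWrites.removeAndFailAll(new ClosedChannelException());
      pendingWrites = null;  // a later flush must tolerate this
    }
    super.close(ctx, promise);
  }

  @Override
  public void flush(ChannelHandlerContext ctx) {
    if (pendingWrites == null) {
      // close() already ran; the late flush from the WriteQueue becomes a
      // no-op instead of the NPE shown in the trace above.
      return;
    }
    // ... frame/encrypt the queued writes as needed, then flush downstream ...
    ctx.flush();
  }
}
```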
* extract self-defined Locality into XdsLocality class
* separated out the functionality for recording client load from lrsClient; the xds load balancer will directly interact with XdsLoadStatsStore to set up locality counters
* added GRPC to constant TRAFFICDIRECTOR_HOSTNAME_FIELD name to better match that in XdsComms
* fixed a bug of using the wrong cluster name in the load report's ClusterStats; it should be the GSLB service name, which is returned in the load report response (same as that in the EDS response).
* added a new line to the end of files.
* Revert "fixed bug of using the wrong cluster name in load report's ClusterStats, it should be GSLB service name, which is responsed by load report response (same as that in EDS response)."
This reverts commit 6097dd4066.
* rephrase interface comment for StatsStore
* added equality and hashCode test for XdsLocality
The test case `XdsLbStateTest.handleSubchannelState()` was introduced before `LocalityStore` was refactored out of `XdsLbState`. After that refactoring, the test case should not be in `XdsLbStateTest` anymore; it is already covered in `LocalityStoreTest`.
During migration, the name resolver may not know when the client has been upgraded to xds, so it may still send grpclb v1 addresses with a list of policies including both grpclb v1 and xds.
HealthCheckingLoadBalancer.shutdown() calls
hcState.onSubchannelState(SHUTDOWN) which removes that hcState from
helper.hcStates. Therefore, if more than one Subchannel is present, a
ConcurrentModificationException will be thrown.
Since HealthCheckingLoadBalancer.shutdown() will clear the hcStates
set after the loop, it's unnecessary to do the deletion within the
loop. However, when a Subchannel is shut down by the LoadBalancer, its
HcState still needs to be removed. To do that, this change moves the
deletion to Subchannel.shutdown().
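A tiny self-contained illustration of the failure mode (not the gRPC code): removing entries from a collection while iterating over it throws once there is more than one entry.
```
import java.util.HashMap;
import java.util.Map;

public final class CmeDemo {
  public static void main(String[] args) {
    Map<String, String> hcStates = new HashMap<>();
    hcStates.put("subchannel-1", "state-1");
    hcStates.put("subchannel-2", "state-2");

    // Mimics shutdown() notifying each entry while the callback removes it
    // from the same map: with more than one entry, the next iteration step
    // throws ConcurrentModificationException.
    for (String key : hcStates.keySet()) {
      hcStates.remove(key);
    }

    // The fix described above: let each Subchannel's own shutdown path remove
    // its entry, and clear the whole collection once after the loop.
  }
}
```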
This adds PerfMark annotations in some key places, notably at transport/application boundaries and thread-hop locations. PerfMark records the events that happen in each thread to a thread-local buffer. PerfMark is disabled by default and compiles down to a no-op unless PerfMark.setEnabled is invoked. This should make it free when disabled, and pretty fast when enabled.
It is important that started tasks are ended, so several places in our code were changed to use either try-finally blocks or a private method. I realize this is ugly, but I think it is manageable. In the future, we can look at making an agent or compiler plugin that simplifies the recording.
Linking between threads is done with a Link object, which is created on the "outbound" task and used on the "inbound" task. This is slightly more verbose and does have a small amount of runtime overhead even when disabled (null checks, slightly higher memory usage, etc.). I think this is okay though, because it makes other optimizations much easier.
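A minimal sketch of the try-finally and Link pattern using the PerfMark API; the task names and the executor are illustrative, not the actual annotation sites:
```
import io.perfmark.Link;
import io.perfmark.PerfMark;
import java.util.concurrent.Executor;

final class TracedHandoff {
  static void enqueue(Executor eventLoop, Runnable work) {
    PerfMark.startTask("WriteQueue.enqueue");
    try {
      final Link link = PerfMark.linkOut();  // created on the "outbound" task
      eventLoop.execute(() -> {
        PerfMark.startTask("WriteQueue.periodicFlush");
        try {
          PerfMark.linkIn(link);  // used on the "inbound" task to tie the two traces together
          work.run();
        } finally {
          PerfMark.stopTask("WriteQueue.periodicFlush");  // started tasks must always be ended
        }
      });
    } finally {
      PerfMark.stopTask("WriteQueue.enqueue");
    }
  }
}
```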
Depending on jdk:toolchain causes java_grpc_library to always use the
_default_ toolchain, even if the user tried to override it. Changing to
:current_java_toolchain allows the rule to use the user-selected
toolchain when overridden.
Tested by adding to BUILD:
load("@bazel_tools//tools/jdk:default_java_toolchain.bzl", "default_java_toolchain")
default_java_toolchain(
    name = "mychain",
    misc = ["-Amy=flag"],
    visibility = ["//visibility:public"],
)
And then verifying -Amy=flag is in the output of:
bazel aquery --java_toolchain=:mychain services:_reflection_java_grpc
Fixes #5841
com.github.dcendents:android-maven-gradle-plugin is incompatible with
Gradle 5 and the project hasn't seen any activity in over a year, so it
seems unlikely to get fixed.
We want to use maven-publish anyway, since that's what we use elsewhere.