Just as in a8de9f0, lack of equals() causes cluster_resolver to consider every update a different configuration and restart itself.
Having to handle NaN should really be prevented with validation, but it
looks like that would lead to yak shaving at the moment.
b/435208946
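For illustration, the kind of value equality involved, as a minimal sketch; the class and fields below are hypothetical, not the actual gRPC config type:
```java
import java.util.Objects;

// Hypothetical config class, sketching the equals()/hashCode() that keeps
// cluster_resolver from treating identical updates as new configurations.
final class ExampleLbConfig {
  private final long ringSize;
  private final double utilizationPenalty;  // a double field is where NaN can show up

  ExampleLbConfig(long ringSize, double utilizationPenalty) {
    this.ringSize = ringSize;
    this.utilizationPenalty = utilizationPenalty;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ExampleLbConfig)) {
      return false;
    }
    ExampleLbConfig that = (ExampleLbConfig) o;
    // Double.compare() treats NaN as equal to itself, unlike ==, so a config
    // carrying NaN still compares equal to an identical update.
    return ringSize == that.ringSize
        && Double.compare(utilizationPenalty, that.utilizationPenalty) == 0;
  }

  @Override
  public int hashCode() {
    return Objects.hash(ringSize, utilizationPenalty);
  }
}
```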
Since c4256add4 we no longer fabricate a TRANSIENT_FAILURE update from
children. However, previously that would have set
seenReadyOrIdleSinceTransientFailure = false and prevented future timer
creation. If an LB policy gives extraneous updates with state CONNECTING,
then it was possible to re-create failOverTimer which would then wait
the 10 seconds for the child to finish CONNECTING. We only want to give
the child one opportunity after transitioning out of READY/IDLE.
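Roughly, the one-shot behavior looks like the following sketch; it is not the actual PriorityLb code, and scheduleFailOverTimer() is a stand-in:
```java
import io.grpc.ConnectivityState;
import java.util.concurrent.ScheduledFuture;

// Sketch only, not the actual PriorityLb code; scheduleFailOverTimer() is a
// stand-in for arming the 10 second failover timer.
final class FailOverGuardSketch {
  private boolean seenReadyOrIdleSinceTransientFailure;
  private ScheduledFuture<?> failOverTimer;

  void handleChildState(ConnectivityState newState) {
    if (newState == ConnectivityState.READY || newState == ConnectivityState.IDLE) {
      seenReadyOrIdleSinceTransientFailure = true;
    } else if (newState == ConnectivityState.CONNECTING
        && seenReadyOrIdleSinceTransientFailure
        && failOverTimer == null) {
      // Consume the flag: the child gets exactly one failover window after
      // leaving READY/IDLE, and repeated CONNECTING updates cannot re-arm it.
      seenReadyOrIdleSinceTransientFailure = false;
      failOverTimer = scheduleFailOverTimer();
    }
  }

  private ScheduledFuture<?> scheduleFailOverTimer() {
    throw new UnsupportedOperationException("stand-in");
  }
}
```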
https://github.com/grpc/proposal/pull/509
Http2RstCounterEncoder has to be constructed before
NettyServerHandler/Http2ConnectionHandler so it must be static. Thus the
code/counters were moved into RstStreamCounter, which can then be
constructed earlier and shared.
This depends on Netty 4.1.124 for a bug fix to actually call the
encoder:
be53dc3c9a
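A minimal sketch of the shared-counter shape; only the names RstStreamCounter and Http2RstCounterEncoder come from this change, the rest is illustrative:
```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative: a plain counter object that can exist before the handlers do.
final class RstStreamCounter {
  private final AtomicLong rstFramesReceived = new AtomicLong();

  void incrementRstReceived() {
    rstFramesReceived.incrementAndGet();
  }

  long rstReceived() {
    return rstFramesReceived.get();
  }
}
```
Because the counter is independent state rather than a field of the handler, it can be created first, handed to Http2RstCounterEncoder, and later handed to NettyServerHandler when that is constructed.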
This implicitly disables NettyAdaptiveCumulator (#11284), which can have a
performance impact. We delayed upgrading Netty to give time to rework
the optimization, but we've gone too long already without upgrading
which causes problems for vulnerability tracking.
Notably, protobuf to 3.25.8, opentelemetry to 1.52.0. Protobuf in Bazel
has 25.5 in the BCR and it seems better to align the WORKSPACE
with that version. But we can't actually use 25.5 in BCR because it is
incompatible with Bazel 7.
This allows a server with access to PeerUid to check additional application-layer security policy *after* the call itself is authorized by the transport layer. Cross-cutting application-layer checks could be done from a ServerInterceptor (RPC method level policy, say). Checks based on the substance of a request message could be done by the individual RPC method implementations themselves.
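As a rough sketch of such a cross-cutting check (how the PeerUid is obtained is application-specific; lookupPeerUid() and allowedUids below are hypothetical, not grpc-binder API):
```java
import java.util.Set;
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import io.grpc.Status;

// Hypothetical interceptor: the transport layer has already authorized the
// call; this adds an RPC-method-level, application-layer policy on top.
final class UidPolicyInterceptor implements ServerInterceptor {
  private final Set<Integer> allowedUids;

  UidPolicyInterceptor(Set<Integer> allowedUids) {
    this.allowedUids = allowedUids;
  }

  @Override
  public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
      ServerCall<ReqT, RespT> call, Metadata headers, ServerCallHandler<ReqT, RespT> next) {
    int uid = lookupPeerUid(call);  // hypothetical: however the app exposes PeerUid
    if (!allowedUids.contains(uid)) {
      call.close(
          Status.PERMISSION_DENIED.withDescription("uid not allowed: " + uid), new Metadata());
      return new ServerCall.Listener<ReqT>() {};
    }
    return next.startCall(call, headers);
  }

  private static int lookupPeerUid(ServerCall<?, ?> call) {
    throw new UnsupportedOperationException("application-specific");
  }
}
```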
Instead of representing an aggregate cluster as a single cluster whose
priorities come from different underlying clusters, represent an aggregate cluster as an instance of a priority LB policy where each child is a cds LB policy for the underlying
cluster.
Avoiding so many deps will allow us to upgrade the protos without being
forced to upgrade to protobuf-java 4.x. It also removes the remaining
non-bzlmod dependencies.
It'd be really easy to get this wrong, so we do two things: 1) mirror the
gradle configuration as much as possible, as that sees a lot of testing,
and 2) run the fake control plane with the _results_ of jarjar. There are
lots of classes that we could mess up, but that at least kicks the tires.
XdsTestUtils.buildRouteConfiguration() was moved to ControlPlaneRule to
stop the unnecessary circular dependency between the classes and to
avoid the many dependencies of XdsTestUtils.
I'm totally hacking java_grpc_library to improve the dependency
situation. Long-term, I think we will stop building Java libraries with
Bazel and require users to rely entirely on Maven Central. That seems to
be the direction Bazel is going and it will greatly simplify the
problems we've seen with protobuf having a single repository for many
languages. So while the hack isn't too bad, I hope we won't have to live
with it long-term.
The resource subscription to the fallback target was done only at the time of falling back, which can cause RPCs to fail. This change makes the fallback target be subscribed and cached earlier, similar to the C++ and Go gRPC implementations.
The PriorityLB predates A56. tryNextPriority() now matches
ChoosePriority() from the gRFC.
The biggest change is waiting on CONNECTING children instead of failing
after the failOverTimer fires. The failOverTimer should be used to start
lower priorities more eagerly, but shouldn't cause the overall
connectivity state to become TRANSIENT_FAILURE on its own. The prior
behavior of creating the "Connection timeout for priority" failing
picker was particularly strange, because it didn't update the child's
connectivity state. This previous behavior produced errors from the
failOverTimer with no way to diagnose what was going wrong.
b/428517222
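For orientation, a compressed sketch of the A56-style selection loop; it is illustrative only, and the helpers and child bookkeeping are stand-ins rather than the real tryNextPriority():
```java
import io.grpc.ConnectivityState;

// Illustrative selection loop: pick the first priority that is READY/IDLE or
// still within its failover window; only fall through once that window ends.
void tryNextPriority() {
  for (String priority : priorityNames) {
    ChildLbState child = children.get(priority);
    if (child == null) {
      children.put(priority, child = createChild(priority));
      updateOverallState(ConnectivityState.CONNECTING, bufferPicker());
      return;
    }
    if (child.state == ConnectivityState.READY || child.state == ConnectivityState.IDLE) {
      updateOverallState(child.state, child.picker);
      return;
    }
    if (child.failOverTimerPending()) {
      // Wait on the CONNECTING child instead of reporting TRANSIENT_FAILURE;
      // the timer only controls when lower priorities are started eagerly.
      updateOverallState(ConnectivityState.CONNECTING, bufferPicker());
      return;
    }
  }
  // Every priority has had its chance: surface the lowest priority's state.
  reportLowestPriorityState();
}
```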
The main reason I made a change here was to fix the tense from the
deadline "will be exceeded in" to "was exceeded after". But we really
don't want to be doing the string formatting unless the deadline is
actually exceeded. There were a few more changes to make some variables
effectively final.
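A minimal sketch of the ordering (not the actual grpc-java code): check expiration first, and only then pay for the formatting.
```java
import java.util.concurrent.TimeUnit;
import io.grpc.Deadline;
import io.grpc.Status;

// Sketch: build the "was exceeded after" description only on the exceeded
// path, so non-expired deadlines never trigger string formatting.
static Status checkDeadline(Deadline deadline) {
  if (deadline == null || !deadline.isExpired()) {
    return Status.OK;
  }
  long overrunNanos = -deadline.timeRemaining(TimeUnit.NANOSECONDS);
  return Status.DEADLINE_EXCEEDED.withDescription(
      "deadline was exceeded after " + overrunNanos + "ns");
}
```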
Fix HashSet / HashMap initializations to allocate sufficient capacity for the number of keys to be inserted; without that, inserting the keys would always trigger a rehash / resize.
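A sketch of the sizing rule with the default 0.75 load factor (not the exact call sites changed):
```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Pre-size so that inserting keys.size() entries stays under the default
// 0.75 load factor and never forces a rehash/resize.
static <K, V> Map<K, V> newMapSizedFor(List<K> keys) {
  int initialCapacity = (int) Math.ceil(keys.size() / 0.75);
  Map<K, V> map = new HashMap<>(initialCapacity);
  // ... insert one entry per key ...
  return map;
}
```
With Guava on the classpath, Maps.newHashMapWithExpectedSize(n) and Sets.newHashSetWithExpectedSize(n) express the same intent.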
In #12185, RPCs were randomly hanging. In #12207 this was tracked down
to the headers promise completing successfully, but the netty stream
was null. This was because the headers write hadn't completed but
stream.close() had been called by goingAway().
In observed cases, whether RST_STREAM or another failure from netty or
the server, listeners can fail to be notified when a connection yields a
null stream for the selected streamId. This causes hangs in clients,
despite deadlines, with no obvious resolution.
Tests which relied upon this promise succeeding must now change.
LoadBalancers shouldn't be called after shutdown(), but RingHashLb could
have enqueued work to the SynchronizationContext that executed after
shutdown(). This commit fixes problems discovered when auditing all LBs
usage of the syncContext for that type of problem.
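The recurring defensive pattern is the same across the audited LBs; a rough sketch, not the actual RingHashLb code:
```java
import io.grpc.LoadBalancer.Subchannel;
import io.grpc.SynchronizationContext;

// Sketch of the pattern: shutdown() can run between execute() and the task
// actually draining, so enqueued work re-checks the flag before acting.
final class SyncContextGuardSketch {
  private final SynchronizationContext syncContext;
  private boolean shutdown;

  SyncContextGuardSketch(SynchronizationContext syncContext) {
    this.syncContext = syncContext;
  }

  void requestConnectionLater(final Subchannel subchannel) {
    syncContext.execute(new Runnable() {
      @Override
      public void run() {
        if (shutdown) {
          return;  // the LB was shut down before this task ran; drop the work
        }
        subchannel.requestConnection();
      }
    });
  }

  void shutdown() {
    shutdown = true;  // runs in the syncContext, so the check above is safe
  }
}
```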
Similarly, PickFirstLb could have requested a new connection after
shutdown(). We want to avoid that sort of thing too.
RingHashLb's test changed from CONNECTING to TRANSIENT_FAILURE to get
the latest picker. Because two subchannels have failed it will be in
TRANSIENT_FAILURE. Previously the test was using an older picker with
out-of-date subchannelView, and the verifyConnection() was too imprecise
to notice it was creating the wrong subchannel.
As discovered in b/430347751, where ClusterImplLb saw a new subchannel
being created after the child LB was shut down (the shutdown itself had
been caused by RingHashConfig not implementing equals(), which caused
ClusterResolverLb to replace its state; that was fixed by a8de9f07ab):
```
java.lang.NullPointerException
at io.grpc.xds.ClusterImplLoadBalancer$ClusterImplLbHelper.createClusterLocalityFromAttributes(ClusterImplLoadBalancer.java:322)
at io.grpc.xds.ClusterImplLoadBalancer$ClusterImplLbHelper.createSubchannel(ClusterImplLoadBalancer.java:236)
at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:47)
at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:47)
at io.grpc.internal.PickFirstLeafLoadBalancer.createNewSubchannel(PickFirstLeafLoadBalancer.java:527)
at io.grpc.internal.PickFirstLeafLoadBalancer.requestConnection(PickFirstLeafLoadBalancer.java:459)
at io.grpc.internal.PickFirstLeafLoadBalancer.acceptResolvedAddresses(PickFirstLeafLoadBalancer.java:174)
at io.grpc.xds.LazyLoadBalancer$LazyDelegate.activate(LazyLoadBalancer.java:64)
at io.grpc.xds.LazyLoadBalancer$LazyDelegate.requestConnection(LazyLoadBalancer.java:97)
at io.grpc.util.ForwardingLoadBalancer.requestConnection(ForwardingLoadBalancer.java:61)
at io.grpc.xds.RingHashLoadBalancer$RingHashPicker.lambda$pickSubchannel$0(RingHashLoadBalancer.java:440)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:96)
at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:128)
at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.onData(XdsClientImpl.java:817)
```
grpc-binder clients authorize servers by checking the UID of the sender of the SETUP_TRANSPORT Binder transaction against some SecurityPolicy. But merely binding to an unauthorized server to learn its UID can enable "keep-alive" and "background activity launch" abuse, even if security policy ultimately decides the connection is unauthorized. Pre-authorization mitigates this kind of abuse by looking up and authorizing a candidate server Application's UID before binding to it. Pre-auth is especially important when the server's address is not fixed in advance but discovered by PackageManager lookup.
PROTOCOL-HTTP2.md specifies "TimeoutValue → {positive integer as ASCII
string of at most 8 digits}". Zero is not positive, so it should be
avoided. So make sure timeouts are at least 1 nanosecond instead of 0
nanoseconds.
grpc-go recently began disallowing zero timeouts in
https://github.com/grpc/grpc-go/pull/8290 which caused a regression as
grpc-java can generate such timeouts. Apparently no gRPC implementation
had previously been checking for zero timeouts.
Instead of changing the max(0) to max(1) everywhere, just move the max
handling into TimeoutMarshaller, since every caller of TIMEOUT_KEY was
doing the same max() handling.
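A condensed sketch of what the centralized clamp looks like; unit selection is simplified relative to the real n/u/m/S/M/H marshaller:
```java
import java.util.concurrent.TimeUnit;

// Illustrative only: clamp once, here, so "0n" can never be serialized and
// callers no longer need their own max() handling.
static String toAsciiTimeout(long timeoutNanos) {
  long nanos = Math.max(timeoutNanos, 1);
  if (nanos < 100_000_000L) {
    return nanos + "n";
  } else if (nanos < 100_000_000_000L) {
    return TimeUnit.NANOSECONDS.toMicros(nanos) + "u";
  } else {
    return TimeUnit.NANOSECONDS.toMillis(nanos) + "m";
  }
}
```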
Before fd8fd517d (in 2016!), grpc-java actually behaved correctly, as it
failed RPCs with timeouts "<= 0". The commit changed the handling to the
max(0) handling we see now.
b/427338711
297ab05ef converted CDS to XdsDependencyManager. This caused three
regressions:
* CdsLB2 as an RLS child would always fail with "Unable to find
non-dynamic root cluster" because is_dynamic=true was missing in
its service config
* XdsNameResolver only propagated resolution updates when the clusters
changed, so a CdsUpdate change would be ignored. This caused a hang
for RLS even with is_dynamic=true. For non-RLS, the lack of a config
update broke the circuit breaking psm interop test. This would have
been more severe if ClusterResolverLb had been converted to
XdsDependencyManager, as it would have ignored EDS updates
* RLS did not propagate resolution updates, so even with is_dynamic=true
the CdsUpdate for the new cluster would never arrive at CdsLB2,
causing a hang
b/428120265
b/427912384
The @SystemApi runtime visibility requirement isn't really new. It has always been implicit in the required INTERACT_ACROSS_USERS permission, which (in production) can only be held by system apps.
The SDK_INT >= 30 requirement was also always present, via @RequiresApi() on BinderChannelBuilder#bindAsUser. This change just updates its replacement APIs (AndroidComponentAddress and TARGET_ANDROID_USER) to require it too.
The previous code did a ping-pong to make sure the transport had enough
time to process, but then proceeded to sleep 5 seconds. That sleep would
have been needed without the ping-pong, but with the ping-pong we are
confident all events have been drained from the transport. Deleting the
unnecessary sleeps saves 10 seconds, for each of the 9 instances of this
test.
ClusterResolverLb is still doing DNS itself, so disable it in XdsDepMan
until that migration has finished. EDS is fine in XdsDepMan, because
XdsClient will share the result with ClusterResolverLb.