Just as in a8de9f0, the lack of equals() causes cluster_resolver to consider every update a different configuration and restart itself.
Handling NaN should really be prevented with validation, but it looks like that
would lead to yak shaving at the moment.
b/435208946
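The failure mode above can be sketched roughly as follows. This is an illustration only, not the actual grpc-java code: `HashRingConfig`, its fields, and `RestartingParent` are hypothetical names. Without equals(), two configs with identical contents compare by identity, so a parent that restarts on any config change restarts on every update. The sketch uses Double.compare() for the floating-point field so that NaN compares equal to itself, which `==` would not do.

```java
import java.util.Objects;

// Hypothetical LB config; the class and field names are illustrative only.
final class HashRingConfig {
  final long minRingSize;
  final double weightFactor;

  HashRingConfig(long minRingSize, double weightFactor) {
    this.minRingSize = minRingSize;
    this.weightFactor = weightFactor;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof HashRingConfig)) {
      return false;
    }
    HashRingConfig that = (HashRingConfig) o;
    // Double.compare() treats NaN as equal to NaN, unlike the == operator.
    return minRingSize == that.minRingSize
        && Double.compare(weightFactor, that.weightFactor) == 0;
  }

  @Override
  public int hashCode() {
    return Objects.hash(minRingSize, weightFactor);
  }
}

// Hypothetical parent policy that restarts its child only on a real config change.
final class RestartingParent {
  private Object lastConfig;
  int restarts;

  void onConfig(Object config) {
    if (!Objects.equals(lastConfig, config)) {
      restarts++;
      lastConfig = config;
    }
  }
}
```

With equals() in place, feeding the same contents twice counts as one configuration; removing the override makes every update look new.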
Since c4256add4 we no longer fabricate a TRANSIENT_FAILURE update from
children. However, previously that would have set
seenReadyOrIdleSinceTransientFailure = false and prevented future timer
creation. If a LB policy gives extraneous updates with state CONNECTING,
then it was possible to re-create failOverTimer which would then wait
the 10 seconds for the child to finish CONNECTING. We only want to give
the child one opportunity after transitioning out of READY/IDLE.
https://github.com/grpc/proposal/pull/509
Http2RstCounterEncoder has to be constructed before
NettyServerHandler/Http2ConnectionHandler so it must be static. Thus the
code/counters were moved into RstStreamCounter which then can be
constructed earlier and shared.
This depends on Netty 4.1.124 for a bug fix to actually call the
encoder:
be53dc3c9a
This implicitly disables NettyAdaptiveCumulator (#11284), which can have a
performance impact. We delayed upgrading Netty to give time to rework
the optimization, but we've gone too long already without upgrading
which causes problems for vulnerability tracking.
Notably, protobuf to 3.25.8, opentelemetry to 1.52.0. Protobuf in Bazel
has 25.5 in the BCR and it seems better to align the WORKSPACE
with that version. But we can't actually use 25.5 in BCR because it is
incompatible with Bazel 7.
This allows a server with access to PeerUid to check additional application-layer security policy *after* the call itself is authorized by the transport layer. Cross-cutting application-layer checks could be done from a ServerInterceptor (RPC method level policy, say). Checks based on the substance of a request message could be done by the individual RPC method implementations themselves.
Instead of representing an aggregate cluster as a single cluster whose
priorities come from different underlying clusters, represent an aggregate cluster as an instance of a priority LB policy where each child is a cds LB policy for the underlying
cluster.
Avoiding so many deps will allow us to upgrade the protos without being
forced to upgrade to protobuf-java 4.x. It also removes the remaining
non-bzlmod dependencies.
It'd be really easy to get this wrong, so we do two things: 1) mirror the
gradle configuration as much as possible, as that sees a lot of testing,
and 2) run the fake control plane with the _results_ of jarjar. There's
lots of classes that we could mess up, but that at least kicks the tires.
XdsTestUtils.buildRouteConfiguration() was moved to ControlPlaneRule to
stop the unnecessary circular dependency between the classes and to
avoid the many dependencies of XdsTestUtils.
I'm totally hacking java_grpc_library to improve the dependency
situation. Long-term, I think we will stop building Java libraries with
Bazel and require users to rely entirely on Maven Central. That seems to
be the direction Bazel is going and it will greatly simplify the
problems we've seen with protobuf having a single repository for many
languages. So while the hack isn't too bad, I hope we won't have to live
with it long-term.
The resource subscription to the fallback target was done only at the time of falling back, which can cause RPCs to fail. This change subscribes to and caches the fallback target earlier, similar to the C++ and Go gRPC implementations.
The PriorityLB predates A56. tryNextPriority() now matches
ChoosePriority() from the gRFC.
The biggest change is waiting on CONNECTING children instead of failing
after the failOverTimer fires. The failOverTimer should be used to start
lower priorities more eagerly, but shouldn't cause the overall
connectivity state to become TRANSIENT_FAILURE on its own. The prior
behavior of creating the "Connection timeout for priority" failing
picker was particularly strange, because it didn't update child's
connectivity state. This previous behavior was creating errors because
of the failOverTimer with no way to diagnose what was going wrong.
b/428517222
The main reason I made a change here was to fix the tense from the
deadline "will be exceeded in" to "was exceeded after". But we really
don't want to be doing the string formatting unless the deadline is
actually exceeded. There were a few more changes to make some variables
effectively final.
Fix HashSet / HashMap initializations to have sufficient capacity allocated based on the number of keys to be inserted, without which it would always lead to a rehash / resize operation.
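As a rough illustration of the sizing rule (a sketch, not the code from this change; `PresizedMaps` is a hypothetical helper): `new HashMap<>(n)` treats `n` as the table capacity, not the expected number of entries, so with the default load factor of 0.75 the map resizes before `n` keys are inserted. Dividing by the load factor avoids that. On JDK 19 and later, `HashMap.newHashMap(int)` does the same calculation, as does Guava's `Maps.newHashMapWithExpectedSize(int)`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, shown only to illustrate the capacity arithmetic.
final class PresizedMaps {
  private PresizedMaps() {}

  // Smallest initial capacity that holds expectedSize entries under
  // HashMap's default load factor of 0.75 without triggering a resize.
  static int capacityFor(int expectedSize) {
    return (int) Math.ceil(expectedSize / 0.75);
  }

  static <K, V> Map<K, V> newMapWithExpectedSize(int expectedSize) {
    return new HashMap<>(capacityFor(expectedSize));
  }
}
```

For 12 keys this requests a capacity of 16, whose resize threshold is exactly 12 entries.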
In #12185, RPCs were randomly hanging. In #12207 this was tracked down
to the headers promise completing successfully, but the netty stream
was null. This was because the headers write hadn't completed but
stream.close() had been called by goingAway().
In observed cases, whether RST_STREAM or another failure from netty or
the server, listeners can fail to be notified when a connection yields a
null stream for the selected streamId. This causes hangs in clients,
despite deadlines, with no obvious resolution.
Tests which relied upon this promise succeeding must now change.
LoadBalancers shouldn't be called after shutdown(), but RingHashLb could
have enqueued work to the SynchronizationContext that executed after
shutdown(). This commit fixes problems discovered when auditing all LBs'
usage of the syncContext for that type of problem.
Similarly, PickFirstLb could have requested a new connection after
shutdown(). We want to avoid that sort of thing too.
RingHashLb's test changed from CONNECTING to TRANSIENT_FAILURE to get
the latest picker. Because two subchannels have failed it will be in
TRANSIENT_FAILURE. Previously the test was using an older picker with
out-of-date subchannelView, and the verifyConnection() was too imprecise
to notice it was creating the wrong subchannel.
This was discovered in b/430347751, where ClusterImplLb was seeing a new
subchannel being created after the child LB was shut down (the shutdown
itself had been caused by RingHashConfig not implementing equals(),
which caused ClusterResolverLb to replace its state, and was fixed by
a8de9f07ab):
```
java.lang.NullPointerException
at io.grpc.xds.ClusterImplLoadBalancer$ClusterImplLbHelper.createClusterLocalityFromAttributes(ClusterImplLoadBalancer.java:322)
at io.grpc.xds.ClusterImplLoadBalancer$ClusterImplLbHelper.createSubchannel(ClusterImplLoadBalancer.java:236)
at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:47)
at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:47)
at io.grpc.internal.PickFirstLeafLoadBalancer.createNewSubchannel(PickFirstLeafLoadBalancer.java:527)
at io.grpc.internal.PickFirstLeafLoadBalancer.requestConnection(PickFirstLeafLoadBalancer.java:459)
at io.grpc.internal.PickFirstLeafLoadBalancer.acceptResolvedAddresses(PickFirstLeafLoadBalancer.java:174)
at io.grpc.xds.LazyLoadBalancer$LazyDelegate.activate(LazyLoadBalancer.java:64)
at io.grpc.xds.LazyLoadBalancer$LazyDelegate.requestConnection(LazyLoadBalancer.java:97)
at io.grpc.util.ForwardingLoadBalancer.requestConnection(ForwardingLoadBalancer.java:61)
at io.grpc.xds.RingHashLoadBalancer$RingHashPicker.lambda$pickSubchannel$0(RingHashLoadBalancer.java:440)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:96)
at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:128)
at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.onData(XdsClientImpl.java:817)
```
grpc-binder clients authorize servers by checking the UID of the sender of the SETUP_TRANSPORT Binder transaction against some SecurityPolicy. But merely binding to an unauthorized server to learn its UID can enable "keep-alive" and "background activity launch" abuse, even if security policy ultimately decides the connection is unauthorized. Pre-authorization mitigates this kind of abuse by looking up and authorizing a candidate server Application's UID before binding to it. Pre-auth is especially important when the server's address is not fixed in advance but discovered by PackageManager lookup.
PROTOCOL-HTTP2.md specifies "TimeoutValue → {positive integer as ASCII
string of at most 8 digits}". Zero is not positive, so it should be
avoided. So make sure timeouts are at least 1 nanosecond instead of 0
nanoseconds.
grpc-go recently began disallowing zero timeouts in
https://github.com/grpc/grpc-go/pull/8290 which caused a regression as
grpc-java can generate such timeouts. Apparently no gRPC implementation
had previously been checking for zero timeouts.
Instead of changing the max(0) to max(1) everywhere, just move the max
handling into TimeoutMarshaller, since every caller of TIMEOUT_KEY was
doing the same max() handling.
Before fd8fd517d (in 2016!), grpc-java actually behaved correctly, as it
failed RPCs with timeouts "<= 0". The commit changed the handling to the
max(0) handling we see now.
b/427338711
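The clamping described above could look roughly like the sketch below. It is not the actual TimeoutMarshaller source; `TimeoutEncoder` and `encode` are hypothetical names. It clamps in one place so callers no longer need their own max() handling, then picks the finest unit that fits in 8 digits, using the unit suffixes from PROTOCOL-HTTP2.md (n, u, m, S, M, H).

```java
import java.util.concurrent.TimeUnit;

// Sketch of a grpc-timeout encoder; names are illustrative only.
final class TimeoutEncoder {
  private TimeoutEncoder() {}

  // Smallest value that needs more than 8 ASCII digits.
  private static final long CUTOFF = 100_000_000L;

  static String encode(long timeoutNanos) {
    // Zero and negative timeouts become 1 nanosecond, the smallest positive value.
    long nanos = Math.max(1, timeoutNanos);
    if (nanos < CUTOFF) {
      return nanos + "n";
    } else if (nanos < CUTOFF * 1_000L) {
      return TimeUnit.NANOSECONDS.toMicros(nanos) + "u";
    } else if (nanos < CUTOFF * 1_000L * 1_000L) {
      return TimeUnit.NANOSECONDS.toMillis(nanos) + "m";
    } else if (nanos < CUTOFF * 1_000L * 1_000L * 1_000L) {
      return TimeUnit.NANOSECONDS.toSeconds(nanos) + "S";
    } else if (nanos < CUTOFF * 1_000L * 1_000L * 1_000L * 60L) {
      return TimeUnit.NANOSECONDS.toMinutes(nanos) + "M";
    } else {
      // Long.MAX_VALUE nanoseconds is about 2.5 million hours, so this always fits.
      return TimeUnit.NANOSECONDS.toHours(nanos) + "H";
    }
  }
}
```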
297ab05ef converted CDS to XdsDependencyManager. This caused three
regressions:
* CdsLB2 as a RLS child would always fail with "Unable to find
non-dynamic root cluster" because is_dynamic=true was missing in
its service config
* XdsNameResolver only propagated resolution updates when the clusters
changed, so a CdsUpdate change would be ignored. This caused a hang
for RLS even with is_dynamic=true. For non-RLS the lack of a config update
broke the circuit breaking psm interop test. This would have been
more severe if ClusterResolverLb had been converted to
XdsDependencyManager, as it would have ignored EDS updates
* RLS did not propagate resolution updates, so for CdsLB2, even with
is_dynamic=true, the CdsUpdate for the new cluster would never arrive,
causing a hang
b/428120265
b/427912384
The @SystemApi runtime visibility requirement isn't really new. It has always been implicit in the required INTERACT_ACROSS_USERS permission, which (in production) can only be held by system apps.
The SDK_INT >= 30 requirement was also always present, via @RequiresApi() on BinderChannelBuilder#bindAsUser. This change just updates its replacement APIs (AndroidComponentAddress and TARGET_ANDROID_USER) to require it too.
The previous code did a ping-pong to make sure the transport had enough
time to process, but then proceeded to sleep 5 seconds. That sleep would
have been needed without the ping-pong, but with the ping-pong we are
confident all events have been drained from the transport. Deleting the
unnecessary sleeps saves 10 seconds, for each of the 9 instances of this
test.
ClusterResolverLb is still doing DNS itself, so disable it in XdsDepMan
until that migration has finished. EDS is fine in XdsDepman, because
XdsClient will share the result with ClusterResolverLb.
ClusterResolverLb gets the NameResolverRegistry from
LoadBalancer.Helper, so a new API was added in NameResolver.Args to
propagate the same object to the name resolver tree.
RetryingNameResolver was exposed to xds. This is expected to be
temporary, as the retrying is being removed from ManagedChannelImpl and
moved into the resolvers. At that point, DnsNameResolverProvider would
wrap DnsNameResolver with a similar API to RetryingNameResolver and xds
would no longer be responsible.
This should often not matter much, but in b/412468630 it was clearly
visible that child creation order can skew load for the first batch of
RPCs. This doesn't solve all the cases, as further-away backends will
still be less likely chosen initially and it is ignorant of the LB
policy. But this doesn't impact correctness, is easy, and is one fewer
case to worry about.
This is missing behavior defined in gRFC A74:
> As per gRFC A31, the ConfigSelector gives each RPC a ref to the
> cluster that was selected for it to ensure that the cluster is not
> removed from the xds_cluster_manager LB policy config before the RPC
> is done with its LB picks. These cluster refs will also hold a
> subscription for the cluster from the XdsDependencyManager, so that
> the XdsDependencyManager will not stop watching the cluster resource
> until the cluster is removed from the xds_cluster_manager LB policy
> config.
Without the logic, RPCs can race and see the error:
> INTERNAL: CdsLb for cluster0: Unable to find non-dynamic root cluster
Fixes #12152. This fixes the regression introduced in 297ab05e.
TimeProvider provides wall time. That can move forward and backward as time is adjusted. OutlierDetection is measuring durations, so it should use a monotonic clock.
Fixes #11622
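A minimal sketch of measuring a duration with a monotonic source, under the assumption that the nanosecond reading is injectable (the `IntervalTimer` class is hypothetical and is not the OutlierDetection code). System.nanoTime() is unaffected by wall-clock adjustments, while System.currentTimeMillis() can jump in either direction.

```java
import java.util.function.LongSupplier;

// Hypothetical interval timer; the nanosecond source is injectable so tests can fake it.
final class IntervalTimer {
  private final LongSupplier nanoTicker;
  private final long startNanos;

  IntervalTimer(LongSupplier nanoTicker) {
    this.nanoTicker = nanoTicker;
    this.startNanos = nanoTicker.getAsLong();
  }

  // Production use: monotonic System.nanoTime(), not wall-clock currentTimeMillis().
  static IntervalTimer monotonic() {
    return new IntervalTimer(System::nanoTime);
  }

  // Subtracting two readings stays correct even if the raw nanoTime value wraps.
  long elapsedNanos() {
    return nanoTicker.getAsLong() - startNanos;
  }

  boolean intervalElapsed(long intervalNanos) {
    return elapsedNanos() - intervalNanos >= 0;
  }
}
```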
This will be used for logical dns clusters as part of gRFC A74. Swapping
to EnumMap wasn't really necessary, but was easy given the new type
system.
I can't say I'm particularly happy with the name of the new
TrackedWatcher type, but XdsConfigWatcher prevented using "Watcher"
because it won't implement the new interface, and ResourceWatcher
already exists in XdsClient. So we have TrackedWatcher, WatcherTracer,
TypeWatchers, and TrackedWatcherType.
It was introduced in fcb5c54e4 because at the time we didn't change the
API to communicate the status. When onResult2() was introduced in
90d0fabb1 this hack stopped being necessary.
The watchers can be completely regular, so the base class can do the
cache management while the subclasses are only concerned with
subscribing to children.
We here address the following obstacles in grpc-java to using Bazel's
--incompatible_disable_target_default_provider_fields flag:
```
ERROR: /private/var/tmp/_bazel_dws/7fd3cd5077fbf76d9e2ae421c39ef7ed/external/googleapis+/google/devtools/build/v1/BUILD.bazel:81:18: in _java_grpc_library rule @@googleapis+//google/devtools/build/v1:build_java_grpc:
Traceback (most recent call last):
File "/private/var/tmp/_bazel_dws/7fd3cd5077fbf76d9e2ae421c39ef7ed/external/grpc-java+/java_grpc_library.bzl", line 94, column 30, in _java_rpc_library_impl
args.add(toolchain.plugin.files_to_run.executable, format = "--plugin=protoc-gen-rpc-plugin=%s")
Error: Accessing the default provider in this manner is deprecated and will be removed soon. It may be temporarily re-enabled by setting --incompatible_disable_target_default_provider_fields=false. See https://github.com/bazelbuild/bazel/issues/20183 for details.
ERROR: /private/var/tmp/_bazel_dws/7fd3cd5077fbf76d9e2ae421c39ef7ed/external/googleapis+/google/devtools/build/v1/BUILD.bazel:81:18: Analysis of target '@@googleapis+//google/devtools/build/v1:build_java_grpc' failed
ERROR: Analysis of target '//src:bazel' failed; build aborted: Analysis failed
```
Just use a regular method instead of reusing the EvictionListener API.
Fix a few comments as well. Both of these changes were based on review
comments to pre-existing code in #11203.
Contributes to #11243
I noticed we deviated from gRFC A37 in some ways. It turned out those
were added to the gRFC later in https://github.com/grpc/proposal/pull/344:
- NACKing empty aggregate clusters
- Failing aggregate cluster when children could not be loaded
- Recursion limit of 16. We had this behavior already, but it was
ascribed to matching C++
There's disagreement on whether we should actually fail the aggregate
cluster for bad children, so I'm preserving the pre-existing behavior
for now.
The code is now doing a depth-first leaf traversal, not breadth-first.
This was odd to see, but the code was also pretty old, so the reasoning
seems lost to history. Since we haven't seen more than a single level of
aggregate clusters in practice, this wouldn't have been noticed by
users.
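To make the traversal order concrete, here is a small sketch of a depth-first leaf walk with a recursion limit. It is an illustration under assumptions, not the XdsDependencyManager code: `AggregateClusterWalker` is a hypothetical class, how depth is counted against the limit of 16 is a guess, and duplicate leaves are deliberately left in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical walker over an aggregate-cluster tree; names are illustrative only.
final class AggregateClusterWalker {
  static final int MAX_DEPTH = 16;

  // Aggregate cluster name -> ordered children. A name absent from the map is a leaf.
  private final Map<String, List<String>> children = new HashMap<>();

  void addAggregate(String name, String... orderedChildren) {
    children.put(name, Arrays.asList(orderedChildren));
  }

  // Depth-first: all leaves under the first child come before any leaf of the second.
  List<String> leavesInPriorityOrder(String root) {
    List<String> leaves = new ArrayList<>();
    visit(root, 1, leaves);
    return leaves;
  }

  private void visit(String name, int depth, List<String> leaves) {
    if (depth > MAX_DEPTH) {
      throw new IllegalStateException("recursion limit exceeded at " + name);
    }
    List<String> kids = children.get(name);
    if (kids == null) {
      leaves.add(name);
      return;
    }
    for (String kid : kids) {
      visit(kid, depth + 1, leaves);
    }
  }
}
```

For a root with children `a` (itself an aggregate of `a1`, `a2`) and `b`, this yields `a1, a2, b`, whereas a breadth-first walk would yield `b, a1, a2`. With a single level of aggregation both orders agree.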
XdsDependencyManager.start() was created to guarantee that the callback
could not be called before returning from the constructor. Otherwise
XDS_CLUSTER_SUBSCRIPT_REGISTRY could potentially be null.
We can easily compute the rdsName and avoiding the state means we don't
need to override onResourceDoesNotExist() to keep the cache in-sync with
the config.
1fd29bc80 replaced cancelWatcher() with watcher.close(). But setting
cancelled was missing. Because the config update checks for shutdown,
the cancelled flag no longer avoids exceptions. But it seems best to
continue avoiding any processing after close to avoid surprises.
Reference counting doesn't release cycles, so swap to a tracing garbage
collector. This greatly simplifies the code as well, as diffing is no
longer necessary. (If vanilla reference counting was used, diffing
wouldn't have been necessary either as you just increment all the new
objects and decrement the old ones. But that doesn't work when you use a set
instead of an integer.)
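The idea can be sketched as a mark-and-sweep pass over the watcher graph. This is a simplified illustration, not the real implementation: `WatcherGraph` is a hypothetical class and watchers are reduced to names. Everything reachable from the roots is marked and the rest is dropped, so a cycle that is no longer reachable is collected even though its members still reference each other.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical watcher graph; names are illustrative only.
final class WatcherGraph {
  // Watcher name -> names of the watchers it depends on.
  private final Map<String, Set<String>> edges = new HashMap<>();

  void addWatcher(String name, String... deps) {
    edges.put(name, new HashSet<>(Arrays.asList(deps)));
  }

  // Marks everything reachable from the roots, then removes the rest.
  // Returns the names that were removed.
  Set<String> collect(Set<String> roots) {
    Set<String> marked = new HashSet<>();
    Deque<String> pending = new ArrayDeque<>(roots);
    while (!pending.isEmpty()) {
      String name = pending.pop();
      if (edges.containsKey(name) && marked.add(name)) {
        pending.addAll(edges.get(name));
      }
    }
    Set<String> removed = new HashSet<>(edges.keySet());
    removed.removeAll(marked);
    edges.keySet().removeAll(removed);
    return removed;
  }

  Set<String> watchers() {
    return new HashSet<>(edges.keySet());
  }
}
```

No diffing against the previous state is needed: each pass starts from the current roots only.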
- Use @BinderThread to document restrictions on methods and certain fields.
- Make TransactionHandler non-public since only Android should call it.
- Replace an unnecessary AtomicLong with a plain old long.
The children of aggregate clusters have a priority order, so we can't
ever throw them in an ordinary set for later iteration.
This now detects recursion limits only after subscribing, but that
matches our existing behavior in CdsLoadBalancer2. We don't get much
value detecting the limit before subscribing and doing so makes watcher
types more different.
Loops are still a bit broken as they won't be unwatched when orphaned,
as they will form a reference loop. In CdsLoadBalancer2, duplicate
clusters had duplicate watchers so there was single-ownership and
reference cycles couldn't form. Fixing that is a bigger change.
Intermediate aggregate clusters are now included in XdsConfig, just for
simplicity. It doesn't hurt anything whether they are present or
missing, but it required updates to some tests.
* xds: Don't allow hostnames in address field
gRFC A27 specifies they must be IPv4 or IPv6 addresses. Certainly doing
a DNS lookup hidden inside the config object is asking for trouble.
The tests were accidentally doing a lot of failing DNS requests, greatly
slowing them down. On my desktop, which made the problem most obvious
with five search paths in /etc/resolv.conf, :grpc-xds:test decreased
from 66s to 29s. The majority of that is XdsDependencyManagerTest which
went from 33s to .1s, as it generated a UUID for the in-process
transport each test and then used it as a hostname, which defeated
Java's DNS (negative) cache. The slowness was noticed because
XdsDependencyManagerTest should have run quickly as a single thread
without I/O, but was particularly slow on my desktop.
The cleanup caused me to audit serverName and the weird places it went.
I think some of them were tricks for XdsClientFallbackTest to squirrel
away something distinguishing, although reusing the serverName is asking
for confusion as is including the tricks in "shared" utilities.
XdsClientFallbackTest does have some non-trivial changes, but this seems
to fix some pre-existing bugs in the tests.
* Add failing hostname unit test
SOTW is unique in that it can become absent after being found. But if we
NACK when initially loading the resource, we don't want to delay, depend
on the resource timeout, and then give a poor error.
This was noticed while adding the EDS restriction that address is not a
hostname and some tests started hanging instead of failing quickly.
The optimization makes the code more complicated. Yes, we know that
maybePublishConfig() will do no work because of an outstanding watch,
but we don't do this for other new watchers created and doing so would
just make the code more bug-prone. This removes a difference in how
different watcher types are handled.
This provides better type and missing-map handling. Note that
getWatchers() now implicitly creates the map if it doesn't exist,
instead of just returning an empty map. That makes it a bit easier to
use and more importantly avoids accidents where a bug tries to modify
the immutable map.
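A sketch of the difference, assuming a registry keyed by resource type (`WatcherRegistry` and its enum are hypothetical, and the watcher value type is reduced to `Object`): computeIfAbsent() always hands back the stored, mutable map, whereas a lookup that falls back to `Collections.emptyMap()` returns a map that throws on modification.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry; the enum and value type are illustrative only.
final class WatcherRegistry {
  enum ResourceType { LISTENER, ROUTE, CLUSTER, ENDPOINT }

  private final Map<ResourceType, Map<String, Object>> watchersByType =
      new EnumMap<>(ResourceType.class);

  // Creates the per-type map on first use, so callers can always modify the result.
  Map<String, Object> getWatchers(ResourceType type) {
    return watchersByType.computeIfAbsent(type, t -> new HashMap<>());
  }
}
```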
The most important change here is to handle subscribeToCluster() calls
after shutdown(), preventing the internal state from being heavily
confused, as the assumption is that there are no watchers after shutdown().
ClusterSubscription.closed isn't strictly necessary, but I don't want
the code to depend on double-deregistration being safe.
maybePublishConfig() isn't being called after shutdown(), but adding the
protection avoids a class of bugs that would cause channel panic.
gRPC doesn't create the CronetEngine, so even though streaming is
observing the CronetEngine's User-Agent, we don't have control of that.
In addition, CronetEngines are commonly shared between gRPC and normal
HTTP traffic, so we don't actually expect users to set gRPC in engine's
user agent. The existing behavior seems to be working as well as
feasible.
Fixes #11582
android-interop has been failing to build since 46485c8 because it
didn't have cmake installed and defined LDFLAGS/CXXFLAGS with pkg-config
before make_dependencies.sh had been run.
Android-interop didn't verify the codegen is up-to-date. Building the
codegen was just a relic from when android was its own separate gradle
build. Avoiding codegen means we don't have to compile absl/protobuf and
have a C++ toolchain.
After many years of issue 9179 being open, there's been nothing to show
that we need the javax.annotations.Generated annotation. Most tools use
file paths and a few check for annotations with "Generated" in the name.
ErrorProne has a few that check for javax.annotations.Generated, but
only UnnecessarilyFullyQualified looks like it'd be a problem and it is
disabled by default. We're not getting any more information, no users
have reported issues with `@generated=omit`, and the existing dependency
is annoying users, so just drop it.
Given we will still retain the GrpcGenerated annotation, it seems highly
likely things are already okay. Even if there are problems they would
probably be addressed by adding a io.grpc.stub.annotations.Generated
annotation or small tweaks. In the short-term, (non-Bazel) users can use
`@generated=javax`, but long-term we could consider removing the option
assuming we've resolved any outstanding issues.
We will want to update the examples and the README to remove the
org.apache.tomcat:annotations-api dependency after the next release.
Fixes #9179
This version runs way faster than BinderTransportTest and doesn't require an actual Android device/emulator. It'll allow future tests to simulate things that are difficult/impossible on real Android, at the price of some realism.
The plugin now outputs to "generated/sources". The IDE configuration
explicitly adding the folders to the source sets hasn't been needed for
some years.
This prevents a NPE and subsequent channel panic when trying to build a
config (because there are no watchers, so waitingOnResource==false)
without any listener and route.
```
java.lang.NullPointerException: Cannot invoke "io.grpc.xds.XdsDependencyManager$RdsUpdateSupplier.getRdsUpdate()" because "routeSource" is null
at io.grpc.xds.XdsDependencyManager.buildUpdate(XdsDependencyManager.java:295)
at io.grpc.xds.XdsDependencyManager.maybePublishConfig(XdsDependencyManager.java:266)
at io.grpc.xds.XdsDependencyManager$EdsWatcher.onChanged(XdsDependencyManager.java:899)
at io.grpc.xds.XdsDependencyManager$EdsWatcher.onChanged(XdsDependencyManager.java:888)
at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.notifyWatcher(XdsClientImpl.java:929)
at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.lambda$onData$0(XdsClientImpl.java:837)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:96)
```
I think this fully fixes the problem today, but not tomorrow.
subscribeToCluster() is racy as well, but not yet used.
This was noticed when idleTimeout was firing, with some other code
calling getState(true) to wake the channel back up. That may have made
this panic more visible than it would be otherwise, but that has not
been investigated.
b/412474567
The security code referenced fields removed from gRFC A29 before it was
finalized.
Note that this fixes a bug in CommonTlsContextUtil where
CombinedValidationContext was not checked. I believe this was the only
location with such a bug as I audited all non-test usages of
has/getValidationContext() and confirmed they have a corresponding
has/getCombinedValidationContext().
Generating a KeyPair is very expensive when running with TSAN, because
TSAN keeps the JVM in interpreted mode. This speeds up the test running
on my desktop from .368s to .151s; faster, but nobody cares. With TSAN,
the speedup is from 150-500s to 4-6s. Within Google the test was timing
out because it was taking so long. While we can increase the timeout,
it seems better to speed up the test in this easy way.
"EXP" stood for experimental and all documentation that referenced it made it clear it was experimental. It's been some years since we started logging a message when it was used to say it will be deleted. There's no time like the present to delete it.
I think at some point there were more usages in the tests. But now it
is pretty easy.
PriorityLb.ChildLbState.picker is initialized to
FixedResultPicker(NoResult). So now that GracefulSwitchLb is using the
same picker, equals() is able to de-dup an update.
Previously it would wait for the new LB to enter READY. However, that
prevents there being an upper-bound on how long the old policy will
continue to be used. The point of graceful switch is to avoid RPCs
seeing increased latency when we swap config. We don't want it to
prevent the system from becoming eventually consistent.
Clients watching health status can cancel their watch. When `HealthService` then tried to notify those watchers, it got a CANCELLED exception because no cancellation handler was set on the `StreamObserver`. This change sets a cancellation handler that removes the watcher from the set of watchers to be notified of health status changes.
panic() calls a good amount of code, so it could get another exception.
The SynchronizationContext is running on an arbitrary thread and we
don't want to propagate this secondary exception up its stack (to be
handled by its UncaughtExceptionHandler); if we wanted that we'd
propagate the original exception.
This second exception will only be seen in the logs; the first exception
was logged and will be used to fail RPCs.
Also related to http://yaqs/8493785598685872128 and b692b9d26
This is to support gRFC A83 xDS GCP Authentication Filter:
> Otherwise, the filter will look in the CDS resource's metadata for a
> key corresponding to the filter's instance name.
It is much harder to debug refcounting problems when we ignore
impossible situations. So make such impossible cases complain loudly so
the bug is obvious.
This allows Filters to access the xds configuration for their own
processing. From gRFC A83:
> This data is available via the XdsConfig attribute introduced in A74.
> If the xDS ConfigSelector is not already passing that attribute to the
> filters, it will need to be changed to do so.
Contributes to the gRFC A74 effort.
https://github.com/grpc/proposal/blob/master/A74-xds-config-tears.md
The alternative to using Mockito's ArgumentMatcher is to use Hamcrest.
However, Hamcrest did not impress me. ArgumentMatcher is trivial if you
don't care about the error message.
This fixes a pre-existing issue where ConfigSelector.releaseCluster
could revert the LB config back to using cluster manager after
all RPCs using a cluster have committed.
Co-authored-by: Larry Safran <lsafran@google.com>
It is confusing/harder to read the logs when the
activations/deactivations caused by the config happen before the log
entry describing the new config.
Listener2.onResult() doesn't require running in the sync context, so
when called from the sync context it is guaranteed not to do its
processing immediately (instead, it schedules work into the sync
context).
The code was doing an update dance: 1) update service config to add new
cluster, 2) update config selector to use new cluster, 3) update service
config to remove old clusters. But the onResult() wasn't being processed
immediately, so the actual execution order was 2, 1, 3 which has a small
window where RPCs will fail. But onResult2() does run immediately. And
since ca4819ac6, updateBalancingState() updates the picker immediately.
cleanUpRoutes() was also racy because it updated the routingConfig
before swapping to the new config selector, so RPCs could fail saying
there was no route instead of the useful error message. Even with the
opposite order, some RPCs may be executing the while loop of
selectConfig(), trying to acquire a cluster. The code unreffed the
clusters before updating the routingConfig, so those RPCs could go into
a tight loop until the routingConfig was updated. Also, once the
routingConfig was updated to EMPTY those RPCs would similarly
see the wrong error message. To give the correct error message,
selectConfig() must fail such RPCs directly, and once it can do that
there's no need to stop using the config selector in error cases. This
has the benefit of fewer moving parts and more consistent threading
among cases.
The added test was able to detect the race 2% of the time. The slower
the code/machine, the more reliably the test failed. ca4819ac6 along
with this commit reduced it to 0 failures in 1000 runs.
Discovered when investigating b/394850611
ffcc360ba adjusted updateBalancingState() to require being run within
the sync context. However, it still queued the work into the sync
context, which was unnecessary. Re-entering the sync context
needlessly delays the new state from being used.
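The delay can be seen in a single-threaded sketch of a run-to-completion queue. This is a simplification for illustration: `MiniSyncContext` is hypothetical, and the real io.grpc.SynchronizationContext is thread-safe and more involved. Work submitted from inside a running task is only executed after that task returns, while a direct call takes effect immediately.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Single-threaded sketch of a run-to-completion queue; names are illustrative only.
final class MiniSyncContext {
  private final Queue<Runnable> queue = new ArrayDeque<>();
  private boolean draining;

  void execute(Runnable task) {
    queue.add(task);
    if (draining) {
      return; // Re-entrant call: the task runs only after the current one returns.
    }
    draining = true;
    try {
      Runnable next;
      while ((next = queue.poll()) != null) {
        next.run();
      }
    } finally {
      draining = false;
    }
  }
}

final class MiniSyncContextDemo {
  private MiniSyncContextDemo() {}

  // Records the order in which a queued and a direct update become visible.
  static List<String> run() {
    List<String> events = new ArrayList<>();
    MiniSyncContext ctx = new MiniSyncContext();
    ctx.execute(() -> {
      events.add("update start");
      ctx.execute(() -> events.add("queued picker")); // deferred until this task ends
      events.add("direct picker");                    // visible immediately
      events.add("update end");
    });
    return events;
  }
}
```

`MiniSyncContextDemo.run()` records the queued update last, after the enclosing task has finished.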
This PR adds support for filter state retention in Java. The mechanism
will be similar to the one described in
[A83](https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md#filter-call-credentials-cache)
for C-core, and will serve the same purpose. However, the
implementation details are very different due to the different nature
of xDS HTTP filter support in C-core and Java.
### Filter instance lifecycle
#### xDS gRPC clients
New filter instances are created per combination of:
1. `XdsNameResolver` instance,
2. Filter name+typeUrl as configured in
HttpConnectionManager (HCM) http_filters.
Existing client-side filter instances are shut down:
- A single filter instance is shut down when an LDS update contains
HCM that is missing filter configuration for name+typeUrl
combination of this instance.
- All filter instances when watched LDS resource is missing from an
LDS update.
- All filter instances on name resolver shutdown.
#### xDS-enabled gRPC servers
New filter instances are created per combination of:
1. Server instance,
2. FilterChain name,
3. Filter name+typeUrl as configured in FilterChain's HCM.http_filters
Filter instances of Default Filter Chain are tracked separately per:
1. Server instance,
2. Filter name+typeUrl in default_filter_chain's HCM.http_filters.
Existing server-side filter instances are shut down:
- A single filter instance is shut down when an LDS update contains
FilterChain with HCM.http_filters that is missing configuration for
filter name+typeUrl.
- All filter instances associated with the FilterChain when an LDS
update no longer contains FilterChain's name.
- All filter instances when watched LDS resource is missing from an
LDS update.
- All filter instances on server shutdown.
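The lifecycle rules above can be sketched as a cache keyed by the filter name+typeUrl pair. This is an illustration under assumptions, not the code in this PR: `FilterInstanceCache`, `StatefulFilter`, and the factory signature are hypothetical. On each update, instances whose key is still configured are retained (keeping their state), missing ones are closed, and new ones are created.

```java
import java.util.AbstractMap;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

// Hypothetical cache of filter instances; names are illustrative only.
final class FilterInstanceCache {
  // Stand-in for a filter that holds state and must be shut down.
  interface StatefulFilter {
    void close();
  }

  private final Map<Map.Entry<String, String>, StatefulFilter> instances = new HashMap<>();
  private final BiFunction<String, String, StatefulFilter> factory;

  FilterInstanceCache(BiFunction<String, String, StatefulFilter> factory) {
    this.factory = factory;
  }

  static Map.Entry<String, String> key(String name, String typeUrl) {
    return new AbstractMap.SimpleImmutableEntry<>(name, typeUrl);
  }

  // activeKeys: the (name, typeUrl) pairs present in the latest http_filters list.
  void onUpdate(Set<Map.Entry<String, String>> activeKeys) {
    // Shut down instances whose configuration disappeared from the update.
    instances.entrySet().removeIf(e -> {
      if (activeKeys.contains(e.getKey())) {
        return false;
      }
      e.getValue().close();
      return true;
    });
    // Keep existing instances and create only the missing ones.
    for (Map.Entry<String, String> k : activeKeys) {
      instances.computeIfAbsent(k, x -> factory.apply(x.getKey(), x.getValue()));
    }
  }

  // Resource missing or owner shutting down: close everything.
  void shutdown() {
    onUpdate(Collections.emptySet());
  }

  StatefulFilter get(String name, String typeUrl) {
    return instances.get(key(name, typeUrl));
  }
}
```

In this sketch, one cache would exist per name resolver on the client side and per FilterChain (plus one for the default filter chain) on the server side.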
### Related
- Part 1: #11883
`XdsServerWrapper#generatePerRouteInterceptors` was always intended
to be executed within a sync context. This PR ensures that by calling
`syncContext.throwIfNotInThisSynchronizationContext()`.
This change is needed for upcoming xDS filter state retention because
the new tests in XdsServerWrapperTest flake with this NPE:
> `Cannot invoke "io.grpc.xds.client.XdsClient$ResourceWatcher.onChanged(io.grpc.xds.client.XdsClient$ResourceUpdate)" because "this.ldsWatcher" is null`
We want to move away from handleResolvedAddresses(). These are "easy" in
that they need no logic. LBs extending ForwardingLoadBalancer had the
method duplicated from handleResolvedAddresses() and swapped away from
`super` because ForwardingLoadBalancer only forwards
handleResolvedAddresses() reliably today. Duplicating small methods was
less bug-prone than dealing with ForwardingLoadBalancer.
Implements [gRFC A76: explicitly setting the request hash key for the
ring hash LB policy][A76]
* Explicitly setting the request hash key is guarded by the
`GRPC_EXPERIMENTAL_RING_HASH_SET_REQUEST_HASH_KEY` environment
variable until the API is stabilized.
Tested:
* Verified end-to-end by spinning up multiple gRPC servers and a gRPC
client that injects a custom service (load balancing) config with
`ring_hash_experimental` and a custom `request_hash_header` (with
NO associated value in the metadata headers) which generates a random
hash for each request to the ring hash LB. Verified picks/RPCs are
split evenly/uniformly across all backends.
* Ran affected unit tests with thread sanitizer and 1000 iterations to
check for data races.
[A76]: https://github.com/grpc/proposal/blob/master/A76-ring-hash-improvements.md#explicitly-setting-the-request-hash-key
This is the first step towards supporting filter state retention in
Java. The mechanism will be similar to the one described in
[A83](https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md#filter-call-credentials-cache)
for C-core, and will serve the same purpose. However, the
implementation details are very different due to the different nature
of xDS HTTP filter support in C-core and Java.
In Java, xDS HTTP filters are backed by classes implementing
`io.grpc.xds.Filter`, from here just called "Filters". To support
Filter state retention (next PR), Java's xDS implementation must be
able to create unique Filter instances:
- Per HCM
`envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager`
- Per filter name as specified in
`envoy.extensions.filters.network.http_connection_manager.v3.HttpFilter.name`
This PR **does not** implement Filter state retention, but lays the
groundwork for it by changing how filters are registered and
instantiated. To achieve this, all existing Filter classes had to be
updated to the new instantiation mechanism described below.
Prior to this PR, Filters had no lifecycle. FilterRegistry
provided singleton instances for a given typeUrl. This PR introduces
a new interface `Filter.Provider`, which instantiates Filter classes.
All functionality that doesn't need an instance of a Filter is moved
to the Filter.Provider. This includes parsing filter config proto
into FilterConfig and determining the filter kind
(client-side, server-side, or both).
This PR is limited to refactoring, and there are no changes to the
existing behavior. Note that all Filter Providers still return
singleton Filter instances. However, with this PR, it is now possible
to create Providers that return a new Filter instance each time
`newInstance` is called.
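The split between provider and instance could look roughly like the sketch below. The types here (`SketchFilter`, `SketchFilterProvider`, `SketchFilterRegistry`) are simplified stand-ins invented for illustration; the real `io.grpc.xds.Filter` and `Filter.Provider` have more methods and different signatures.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for io.grpc.xds.Filter. */
interface SketchFilter {
  String describe();
}

/** Stand-in for Filter.Provider: everything that needs no Filter instance. */
interface SketchFilterProvider {
  String typeUrl();

  boolean isClientFilter();

  SketchFilter newInstance();
}

final class NamedFilter implements SketchFilter {
  private final String name;

  NamedFilter(String name) {
    this.name = name;
  }

  @Override
  public String describe() {
    return name;
  }
}

/** Keeps the pre-existing behavior: every call returns the same singleton. */
final class SingletonProvider implements SketchFilterProvider {
  private final SketchFilter instance = new NamedFilter("singleton");

  @Override public String typeUrl() { return "type.example/Singleton"; }
  @Override public boolean isClientFilter() { return true; }
  @Override public SketchFilter newInstance() { return instance; }
}

/** What the new interface makes possible: a fresh Filter per call. */
final class PerInstanceProvider implements SketchFilterProvider {
  private int created;

  @Override public String typeUrl() { return "type.example/PerInstance"; }
  @Override public boolean isClientFilter() { return true; }
  @Override public SketchFilter newInstance() {
    created++;
    return new NamedFilter("instance-" + created);
  }
}

/** The registry now maps a typeUrl to a provider rather than to a Filter. */
final class SketchFilterRegistry {
  private final Map<String, SketchFilterProvider> providers = new HashMap<>();

  void register(SketchFilterProvider provider) {
    providers.put(provider.typeUrl(), provider);
  }

  SketchFilterProvider get(String typeUrl) {
    return providers.get(typeUrl);
  }
}
```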
A failing Status from acceptResolvedAddresses means something is wrong
with the config, but parts of the config may still have been applied.
Thus there are now two possible flows: errors that should prevent
updateOverallBalancingState() and errors that should have no effect
other than the return code. To manage that, MultiChildLb must always be
responsible for calling updateOverallBalancingState().
acceptResolvedAddressesInternal() was inlined to make that error
processing easier. No existing usages actually needed to have logic
between updating the children and regenerating the picker.
RingHashLb already was verifying that the address list was not empty, so
the short-circuiting when acceptResolvedAddressesInternal() returned an
error was impossible to trigger. WrrLb's updateWeightTask() calls the
last picker, so it can run before acceptResolvedAddressesInternal(); the
only part that matters is re-creating the weightUpdateTimer.
S2AStub is an internal API and shouldn't be used outside of s2a. It is
still available for tests.
IntegrationTest was moved to io.grpc.s2a. It uses an io.grpc.s2a class,
so it shouldn't be in internal.handler.
Switched to using 8192 which is the current value of Segment.SIZE and just have a test check that they are equal.
The reason for doing this is that Segment.SIZE is Kotlin internal so shouldn't be used outside of its module.
To try to address this failure when building android-interop-testing:
```
The Daemon will expire after the build after running out of JVM heap space.
The project memory settings are likely not configured or are configured to an insufficient value.
The daemon will restart for the next build, which may increase subsequent build times.
These settings can be adjusted by setting 'org.gradle.jvmargs' in 'gradle.properties'.
The currently configured max heap space is '512 MiB' and the configured max metaspace is '384 MiB'.
...
Exception in thread "Daemon client event forwarder" java.lang.OutOfMemoryError: Java heap space
...
> Task :grpc-android-interop-testing:mergeDexDebug FAILED
ERROR:D8: java.lang.OutOfMemoryError: Java heap space
com.android.builder.dexing.DexArchiveMergerException: Error while merging dex archives:
```
* don't process resourceDoesNotExist for watchers that have been cancelled.
* Change test to use an ArgumentMatcher instead of expecting that only the final result will be sent since depending on timing there may be configs sent for clusters being removed with their entries as errors.
Currently this improves two flows:
1. Known-length messages whose length is greater than 1 MB. Previously the
first buffer was 1 MB, followed by many buffers of 4096 bytes (from
CodedOutputStream); now subsequent buffers are also up to 1 MB.
2. With compression, the first write is always a 10-byte buffer
(the gzip header), but it is worth allocating more space.
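To make the first flow concrete, here is a small sketch that only counts buffers; it does not reproduce the real framing code. `BufferPlan` is a hypothetical name, and treating "1 MB" as 1024 * 1024 bytes is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of buffer sizes for one known-length message. */
final class BufferPlan {
  static final int MAX_BUFFER = 1024 * 1024; // assumed meaning of "1 MB"
  static final int SMALL_BUFFER = 4096; // chunk size attributed to CodedOutputStream

  /** First buffer is capped at MAX_BUFFER, later ones at laterBufferCap. */
  static List<Integer> plan(int messageLength, int laterBufferCap) {
    List<Integer> sizes = new ArrayList<>();
    int remaining = messageLength;
    int cap = MAX_BUFFER;
    while (remaining > 0) {
      int size = Math.min(remaining, cap);
      sizes.add(size);
      remaining -= size;
      cap = laterBufferCap;
    }
    return sizes;
  }

  /** Behavior as described before the change. */
  static List<Integer> oldPlan(int messageLength) {
    return plan(messageLength, SMALL_BUFFER);
  }

  /** Behavior as described after the change. */
  static List<Integer> newPlan(int messageLength) {
    return plan(messageLength, MAX_BUFFER);
  }
}
```

Under these assumptions a 3 MB message goes from 513 buffers (one large plus 512 small ones) down to 3.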
* Change XdsConfig to match spec with a `children` object holding either `a list of leaf cluster names` or `an EdsUpdate`. Removed intermediate aggregate nodes from `XdsConfig.clusters`.
* core: updates the backoff range being used from [0, 1] to [0.8, 1.2] as per the A6 redefinition
* adds a flag for experimental jitter
* xds: Allow FaultFilter's interceptor to be reused
This is the only usage of PickSubchannelArgs when creating a filter's
ClientInterceptor, and a follow-up commit will remove the argument and
actually reuse the interceptors. Other filter's interceptors can
already be reused.
There doesn't seem to be any significant loss of legibility by making
FaultFilter a more ordinary interceptor, but the change does cause the
ForwardingClientCall to be present when faultDelay is configured,
independent of whether the fault delay ends up being triggered.
Reusing interceptors will move more state management out of the RPC path
which will be more relevant with RLQS.
* netty: Removed 4096 min buffer size (#11856)
* netty: Removed 4096 min buffer size
* turns the flag into a variable for better efficiency
---------
Co-authored-by: Eric Anderson <ejona@google.com>
The variables from the do-while are no longer initialized to let the
compiler verify that the loop sets each. Unnecessary comparisons to null
are also removed, which is more obvious now that the variables are never
set to null. Added a minor optimization of computing the RPC's path once
instead of once for each route. The variable declarations were also sorted
to match their initialization order.
This does fix an unlikely bug: if the old code successfully
matched a route but failed to retain the cluster, then when trying a
second time, if the route was _not_ matched, it would re-use the prior route
and thus infinite-loop failing to retain that same cluster.
It also adds a missing cast to unsigned long for a uint32 weight. The old
code would detect if the _sum_ was negative, but a weight using 32 bits
would have been negative and never selected.
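The uint32 issue can be reproduced in isolation. The sketch below is not the code from this change; `WeightedPick` is a hypothetical helper that shows why a weight above `Integer.MAX_VALUE` needs `Integer.toUnsignedLong` before being accumulated.

```java
/** Hypothetical weighted selection over uint32 weights stored in Java ints. */
final class WeightedPick {
  /** Returns the index whose cumulative weight range contains selection, or -1. */
  static int pick(int[] uint32Weights, long selection) {
    long accumulated = 0;
    for (int i = 0; i < uint32Weights.length; i++) {
      // Interpret the 32 bits as unsigned before widening to long.
      accumulated += Integer.toUnsignedLong(uint32Weights[i]);
      if (selection < accumulated) {
        return i;
      }
    }
    return -1;
  }

  /** Same loop with a plain widening conversion, which sign-extends. */
  static int pickWithoutCast(int[] uint32Weights, long selection) {
    long accumulated = 0;
    for (int i = 0; i < uint32Weights.length; i++) {
      accumulated += uint32Weights[i];
      if (selection < accumulated) {
        return i;
      }
    }
    return -1;
  }
}
```

With weights `{1, 3000000000}` the second entry is stored as a negative `int`, so the variant without the conversion never selects it.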
The Kokoro aarch64 build runs on x86 with an emulator, and has always
been flaky due to the slow execution speed. At present it is continually
failing with deadline-exceeded errors. GitHub Actions is running on aarch64
hardware, so is much faster (4 minutes vs 30 minutes, without including
the speedup from GitHub Actions' caching).
This adds a createFrom(Attributes) to mirror the check(Attributes) added
in ba8ab79. It also adds conveniences for ClientCall for both
createFrom() and check(). This allows getting peer information from
ClientCall and CallCredentials.RequestInfo, as was already available
from ServerCall.
The tests were reworked to test the Attribute-based methods and then
only basic tests for client/server.
Fixes #11042
The field was made final in 4b52639aa but was soon reverted in 3ebb3e192
because of what I assume was a bad merge conflict resolution. The field
has contained an immutable object since its introduction in d25f5acf1,
so it is pretty likely to remain a constant in the future.
This moves the interceptor creation from the ConfigSelector to the
resource update handling.
The code structure changes will make adding support for filter
lifecycles (for RLQS) a bit easier. The filter lifecycles will allow
filters to share state across interceptors, and constructing all the
interceptors on a single thread will mean filters wouldn't need to be
thread-safe (but their interceptors would be thread-safe).
Setting the authority is only useful when creating a real stream, as
there will be a following pick otherwise. In addition, DelayedStream
will buffer each call to setAuthority() in a list and we don't want that
memory usage. Note that no LBs are using this feature yet, so users
would not have been exposed to the memory use.
We also needed to setAuthority() when the LB selected a subchannel on
the first pick attempt.
* Have acceptResolvedAddresses() do a seek when in CONNECTING state and clean up removed subchannels when a seek was successful.
Move cleanup of removed subchannels into a method so it can be called from 2 places in acceptResolvedAddresses.
Since the seek could mean we never looked at the first address, if we go off the end of the index and haven't looked at all of the addresses, then instead of scheduleBackoff() we reset the index and request a connection.
d65d3942e increased the test speed of
connect_then_mainServerDown_fallbackServerUp by using FakeClock.
However, it introduced a data race because FakeClock is not thread-safe.
This change injects a single thread for gRPC callbacks such that
syncContext is run on a thread under the test's control.
A simpler approach would be to expose syncContext from XdsClientImpl for
testing. However, this test is in a different package and I wanted to
avoid adding a public method.
```
Read of size 8 at 0x00008dec9d50 by thread T25:
#0 io.grpc.internal.FakeClock$ScheduledExecutorImpl.schedule(Lio/grpc/internal/FakeClock$ScheduledTask;JLjava/util/concurrent/TimeUnit;)V FakeClock.java:140
#1 io.grpc.internal.FakeClock$ScheduledExecutorImpl.schedule(Ljava/lang/Runnable;JLjava/util/concurrent/TimeUnit;)Ljava/util/concurrent/ScheduledFuture; FakeClock.java:150
#2 io.grpc.SynchronizationContext.schedule(Ljava/lang/Runnable;JLjava/util/concurrent/TimeUnit;Ljava/util/concurrent/ScheduledExecutorService;)Lio/grpc/SynchronizationContext$ScheduledHandle; SynchronizationContext.java:153
#3 io.grpc.xds.client.ControlPlaneClient$AdsStream.handleRpcStreamClosed(Lio/grpc/Status;)V ControlPlaneClient.java:491
#4 io.grpc.xds.client.ControlPlaneClient$AdsStream.lambda$onStatusReceived$0(Lio/grpc/Status;)V ControlPlaneClient.java:429
#5 io.grpc.xds.client.ControlPlaneClient$AdsStream$$Lambda+0x00000001004a95d0.run()V ??
#6 io.grpc.SynchronizationContext.drain()V SynchronizationContext.java:96
#7 io.grpc.SynchronizationContext.execute(Ljava/lang/Runnable;)V SynchronizationContext.java:128
#8 io.grpc.xds.client.ControlPlaneClient$AdsStream.onStatusReceived(Lio/grpc/Status;)V ControlPlaneClient.java:428
#9 io.grpc.xds.GrpcXdsTransportFactory$EventHandlerToCallListenerAdapter.onClose(Lio/grpc/Status;Lio/grpc/Metadata;)V GrpcXdsTransportFactory.java:149
#10 io.grpc.PartialForwardingClientCallListener.onClose(Lio/grpc/Status;Lio/grpc/Metadata;)V PartialForwardingClientCallListener.java:39
...
Previous write of size 8 at 0x00008dec9d50 by thread T4 (mutexes: write M0, write M1, write M2, write M3):
#0 io.grpc.internal.FakeClock.forwardTime(JLjava/util/concurrent/TimeUnit;)I FakeClock.java:368
#1 io.grpc.xds.XdsClientFallbackTest.connect_then_mainServerDown_fallbackServerUp()V XdsClientFallbackTest.java:358
...
```
Protobuf is interested in using absl::string_view instead of const
std::string&. Just copy to std::string as the C++17 build isn't yet
operational and that level of performance doesn't matter.
cl/711732759 b/353571051
The soak code grew considerably in 6a92a2a22e. Since it isn't a JUnit
test and doesn't resemble the other tests, it doesn't belong in
AbstractInteropTest. AbstractInteropTest has lots of users, including it
being re-compiled for use on Android, so moving it out makes the
remaining code more clear for the more common cases.
This PR resolves an issue with peer address extraction in the soak
test.
In current `TestServiceClient` implementation, the same
`clientCallCapture` atomic is shared across threads, leading to
incorrect peer extraction. This fix ensures that each thread uses a
local variable for capturing the client call.
This is in service to gRFC A89. Since the gRFC isn't finalized this
purposefully doesn't really do anything yet. The grpc-opentelemetry
change to use this optional label will be done after the gRFC is merged.
grpc-opentelemetry currently has a hard-coded list (one entry) of labels
that it looks for, and this label will need to be added.
b/356167676
These changes reduce connect_then_mainServerDown_fallbackServerUp test
time from 20 seconds to 5 seconds by faking time for the does-not-exist
timer.
XdsClientImpl only uses the TimeProvider for CSDS cache details, so any
implementation should be fine. FakeXdsClient provides an implementation,
so might as well use it as it is one less clock to think about.
I noticed an old JDK 8u275 failed on the test because the modification
time's resolution was one second. A newer JDK 8u432 worked fine, so it's
not really a problem for me, but increasing the time difference is
cheap. I used two seconds as that's the resolution available on FAT
(which is unlikely to be TMPDIR, even on Windows).
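A test that relies on a changed modification time can set it explicitly rather than sleeping. The sketch below is an illustration with a hypothetical `MtimeBump` helper, not the test touched by this change; it moves the timestamp forward by two seconds so the difference survives coarse filesystem resolutions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

/** Hypothetical helper for tests that watch file modification times. */
final class MtimeBump {
  /** Moves the modification time forward and returns what the filesystem stored. */
  static FileTime bump(Path file, long seconds) throws IOException {
    FileTime old = Files.getLastModifiedTime(file);
    FileTime updated = FileTime.from(old.toInstant().plusSeconds(seconds));
    Files.setLastModifiedTime(file, updated);
    return Files.getLastModifiedTime(file);
  }

  /** Creates a temp file, bumps it by two seconds, returns the observed delta. */
  static long demoDeltaMillis() {
    try {
      Path file = Files.createTempFile("mtime-sketch", ".tmp");
      try {
        FileTime before = Files.getLastModifiedTime(file);
        FileTime after = bump(file, 2);
        return after.toMillis() - before.toMillis();
      } finally {
        Files.delete(file);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```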
Since approximately the LBv2 API (the current API) was introduced, gRPC
won't use a transport until it is ready. Long ago, transports could be
used before they were ready and these old tests were not waiting for the
negotiator to complete before starting. We need them to wait for the
handshake to complete to avoid a test-only data race in getAttributes()
noticed by TSAN.
Throwing away data frames in the Noop handshaker is necessary to act
like a normal handshaker; they don't allow data frames to pass until the
handshake is complete. Without the handling, it goes through invalid
code paths in NettyClientHandler where a terminated transport becomes
ready, and a similar data race.
```
Write of size 4 at 0x00008db31e2c by thread T37:
#0 io.grpc.netty.NettyClientHandler.handleProtocolNegotiationCompleted(Lio/grpc/Attributes;Lio/grpc/InternalChannelz$Security;)V NettyClientHandler.java:517
#1 io.grpc.netty.ProtocolNegotiators$GrpcNegotiationHandler.userEventTriggered(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V ProtocolNegotiators.java:937
#2 io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(Ljava/lang/Object;)V AbstractChannelHandlerContext.java:398
#3 io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(Lio/netty/channel/AbstractChannelHandlerContext;Ljava/lang/Object;)V AbstractChannelHandlerContext.java:376
#4 io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(Ljava/lang/Object;)Lio/netty/channel/ChannelHandlerContext; AbstractChannelHandlerContext.java:368
#5 io.grpc.netty.ProtocolNegotiators$ProtocolNegotiationHandler.fireProtocolNegotiationEvent(Lio/netty/channel/ChannelHandlerContext;)V ProtocolNegotiators.java:1107
#6 io.grpc.netty.ProtocolNegotiators$WaitUntilActiveHandler.channelActive(Lio/netty/channel/ChannelHandlerContext;)V ProtocolNegotiators.java:1011
...
Previous read of size 4 at 0x00008db31e2c by thread T4 (mutexes: write M0, write M1, write M2, write M3):
#0 io.grpc.netty.NettyClientHandler.getAttributes()Lio/grpc/Attributes; NettyClientHandler.java:345
#1 io.grpc.netty.NettyClientTransport.getAttributes()Lio/grpc/Attributes; NettyClientTransport.java:387
#2 io.grpc.netty.NettyClientTransport.newStream(Lio/grpc/MethodDescriptor;Lio/grpc/Metadata;Lio/grpc/CallOptions;[Lio/grpc/ClientStreamTracer;)Lio/grpc/internal/ClientStream; NettyClientTransport.java:198
#3 io.grpc.netty.NettyClientTransportTest$Rpc.<init>(Lio/grpc/netty/NettyClientTransport;Lio/grpc/Metadata;)V NettyClientTransportTest.java:953
#4 io.grpc.netty.NettyClientTransportTest.huffmanCodingShouldNotBePerformed()V NettyClientTransportTest.java:631
...
```
```
Read of size 4 at 0x00008f983a3c by thread T4 (mutexes: write M0, write M1):
#0 io.grpc.netty.NettyClientHandler.getAttributes()Lio/grpc/Attributes; NettyClientHandler.java:345
#1 io.grpc.netty.NettyClientTransport.getAttributes()Lio/grpc/Attributes; NettyClientTransport.java:387
#2 io.grpc.netty.NettyClientTransport.newStream(Lio/grpc/MethodDescriptor;Lio/grpc/Metadata;Lio/grpc/CallOptions;[Lio/grpc/ClientStreamTracer;)Lio/grpc/internal/ClientStream; NettyClientTransport.java:198
#3 io.grpc.netty.NettyClientTransportTest$Rpc.<init>(Lio/grpc/netty/NettyClientTransport;Lio/grpc/Metadata;)V NettyClientTransportTest.java:973
#4 io.grpc.netty.NettyClientTransportTest$Rpc.<init>(Lio/grpc/netty/NettyClientTransport;)V NettyClientTransportTest.java:969
#5 io.grpc.netty.NettyClientTransportTest.handlerExceptionDuringNegotiatonPropagatesToStatus()V NettyClientTransportTest.java:425
...
Previous write of size 4 at 0x00008f983a3c by thread T56:
#0 io.grpc.netty.NettyClientHandler$FrameListener.onSettingsRead(Lio/netty/channel/ChannelHandlerContext;Lio/netty/handler/codec/http2/Http2Settings;)V NettyClientHandler.java:960
...
```
* Move creating the retry timer in handleRpcStreamClosed to as late as possible and call `close` so that the `call` is cancelled.
Also add some debug logging.
If the control plane sends a resource type the client doesn't understand
at the moment, the control plane will still expect the client to include
the nonce if the client subscribes to the type in the future.
This most easily happens when unsubscribing the last resource of a type,
which meant 1cf1927d1 was insufficient.
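The bookkeeping this implies can be sketched as below. `NonceTracker` is a hypothetical class, not the ControlPlaneClient implementation; the point it illustrates is that the nonce is recorded for every response type on the stream, whether or not the client is subscribed to that type right now.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-stream record of the latest nonce for each resource type. */
final class NonceTracker {
  private final Map<String, String> lastNonceByTypeUrl = new HashMap<>();

  /** Records the nonce of every response, even for types not currently subscribed. */
  void onResponse(String typeUrl, String nonce) {
    lastNonceByTypeUrl.put(typeUrl, nonce);
  }

  /** Nonce to echo in the next request for the type; empty if none was seen. */
  String nonceForRequest(String typeUrl) {
    return lastNonceByTypeUrl.getOrDefault(typeUrl, "");
  }

  /** Nonces are scoped to a stream, so a new stream starts with none. */
  void onStreamRestart() {
    lastNonceByTypeUrl.clear();
  }
}
```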
Internal* classes should generally be accessors that are used outside of
the package/project. Only one attribute was used outside of xds, so
leave only that one attribute in InternalXdsAttributes. One attribute
was used by the internal.security package, so move the definition to the
same package to reduce the circular dependencies.
There's no reason to use the interface outside of
XdsClientImpl/ControlPlaneClient. Since XdsClientImpl implements the
interface directly, its methods are still public. That can be a future
cleanup.
The module metadata in Guava causes the -jre version to be selected even
when you choose the -android version. Gradle did not give any clues that
this was happening, and while
`println(configurations.compileClasspath.resolve())` shows the different
jar in use, most other diagnostics don't. dependencyInsight can show you
this is happening, but only if you know which dependency has a problem
and read Guava's module metadata first to understand the significance of
the results.
You could argue this is a Guava-specific problem. I was able to get
parts of our build working with attributes and resolutionStrategy
configurations mentioned at
https://github.com/google/guava/releases/tag/v32.1.0 , so that only
Guava would be changed. But it was fickle, giving poor error messages or
silently swapping back to the -jre version.
Given the weak debuggability, the added complexity, and the lack of
value module metadata is providing us, disabling module metadata for our
entire build seems prudent.
See https://github.com/google/guava/issues/7575
These repositories are already included from the main build.gradle, so
they don't do anything. Much less do they need to be defined twice in
the same file.
In e08b9db20 we added `@DoNotCall` annotations to some call sites, but
Bazel used an older version of ErrorProne that complained at times it
shouldn't. The minimum version of Bazel we test/support is now Bazel 6,
well past Bazel 3.4+.
This avoids the dependency on animalsniffer-annotations. grpc-api, and
particularly grpc-context, are used in many low-level places and it is
beneficial for them to be very low dependency. This brings grpc-context
back to zero-dependency.
grpc-binder's upcoming AndroidIntentNameResolver needs to know the target Android user so it can resolve target URIs in the correct place. Unfortunately, Android's built in intent:// URI scheme has no way to specify a user and in fact the android.os.UserHandle object can't reasonably be encoded as a String at all.
We solve this problem by extending NameResolver.Args with the same type-safe and domain-specific Key<T> pattern used by CallOptions, Context and CreateSubchannelArgs. New "custom" arguments could apply to all NameResolvers of a certain URI scheme, to all NameResolvers producing a particular type of java.net.SocketAddress, or even to a specific NameResolver subclass.
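The type-safe key pattern referred to here can be shown with a small self-contained sketch. `ArgKey` and `SketchArgs` are hypothetical stand-ins, not the `NameResolver.Args` API; in particular the real class is built through a builder, while this one is mutable for brevity.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical typed key; identity, not the name, distinguishes keys. */
final class ArgKey<T> {
  private final String debugName;

  private ArgKey(String debugName) {
    this.debugName = debugName;
  }

  static <T> ArgKey<T> create(String debugName) {
    return new ArgKey<>(debugName);
  }

  @Override
  public String toString() {
    return debugName;
  }
}

/** Hypothetical argument holder with extensible, typed custom arguments. */
final class SketchArgs {
  private final Map<ArgKey<?>, Object> custom = new IdentityHashMap<>();

  <T> SketchArgs set(ArgKey<T> key, T value) {
    custom.put(key, value);
    return this;
  }

  /** The cast is safe because set() only stores a T under an ArgKey<T>. */
  @SuppressWarnings("unchecked")
  <T> T get(ArgKey<T> key) {
    return (T) custom.get(key);
  }
}
```

Because the key carries the value type, a domain-specific object that has no reasonable string encoding can be passed through without any parsing on the receiving side.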
In 61f19d707a I swapped the signatures to use the version catalog. But I
failed to preserve the `@signature` extension and it all seemed to
work... But in fact all the animalsniffer tasks were completing as
SKIPPED as they lacked signatures. The build.gradle changes in this
commit are to fix that while still using version catalog.
But while it was broken violations crept in. Most violations weren't
too important and we're not surprised they went unnoticed. For example, Netty
with TLS has long required the Java 8 API
`setEndpointIdentificationAlgorithm()`, so using `Optional` in the same
code path didn't harm anything in particular. I still swapped it to
Guava's `Optional` to avoid overuse of `@IgnoreJRERequirement`.
One important violation has not been fixed and instead I've disabled the
android signature in api/build.gradle for the moment. The violation is
in StatusException using the `fillInStackTrace` overload of Exception.
This problem [had been noticed][PR11066], but we couldn't figure out
what was going on. AnimalSniffer is now noticing this and agreeing with
the internal linter. There is still a question of why our interop tests
failed to notice this, but given they are no longer running on pre-API
level 24, that may forever be a mystery.
[PR11066]: https://github.com/grpc/grpc-java/pull/11066
StructOrError is a more generic API, but we have StatusOr now so we
don't want new usages of StructOrError. Moving StructOrError out of
io.grpc.xds.client will make it easier to delete StructOrError once
we've migrated to StatusOr in the future.
TRANSPORT_SOCKET_NAME_TLS should also move, but it wasn't immediately
clear to me where it should go.
ObjectPool is our standard solution for dealing with the
sometimes-shutdown resources. This was implemented by a contributor not
familiar with regular tools.
There are wider changes that can be made here, but I chose to just do a
smaller change because this class is used by GrpclbNameResolver.
The channel log is shared by many components and is poorly suited to
the noise of per-RPC events. This commit restricts RLS usage of the
logger to no more frequent than cache entry events. This may still be
too frequent, but should substantially improve the signal-to-noise and
we can do further rework as needed.
Many of the log entries were poor because they lacked enough context.
They weren't even clear they were from RLS. The cache entry events now
regularly include the request key in the logs, allowing you to follow
events for specific keys. I would have preferred using the hash code,
but NumberFormat is annoying and toString() may be acceptable given its
convenience.
This commit reverts much of eba699ad. Those logs have not proven to be
helpful as they produce more output than can be reasonably stored.
This was noticed because of a CallOptionsTest flake that had a
surprising error:
```
expected : 59.983387319
but was : 59.983387319
outside tolerance in seconds: 0.01
```
The target UserHandle is best modeled as part of the SocketAddress not the Channel since it's part of the server's location.
This change allows a NameResolver to select different target users over time within a single Channel.
The goal of this PR is to increase the test coverage of the C2P E2E load test by improving the rpc_soak and channel_soak tests to support concurrency.
**rpc_soak:**
The client performs many large_unary RPCs in sequence over the same channel. The test can run in either a concurrent or non-concurrent mode, depending on the number of threads specified (soak_num_threads):
- Non-Concurrent Mode: When soak_num_threads = 1, all RPCs are performed sequentially on a single thread.
- Concurrent Mode: When soak_num_threads > 1, the client uses multiple threads to distribute the workload. Each thread performs a portion of the total soak_iterations, executing its own set of RPCs concurrently.
**channel_soak:**
Similar to rpc_soak, but this time each RPC is performed on a new channel. The channel is created just before each RPC and is destroyed just after. Note on Concurrent Execution and Channel Creation: In a concurrent execution setting (i.e., when soak_num_threads > 1), each thread performs a portion of the total soak_iterations and creates and destroys its own channel for each RPC iteration.
- createNewChannel Function: In channel_soak, the createNewChannel function is used by each thread to create a new channel before every RPC. This function ensures that each RPC has a separate channel, preventing race conditions by isolating channels between threads. It shuts down the previous channel (if any) and creates a new one for each iteration, ensuring accurate latency measurement per RPC.
- Thread-specific logs will include the thread_id, helping to track performance across threads, especially when each thread is managing its own channel lifecycle.
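The way iterations are divided across threads can be sketched as follows. `SoakRunner` is a hypothetical helper rather than the interop client's code; it assumes the iteration count divides evenly by the thread count and silently drops any remainder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical driver that splits soak iterations across worker threads. */
final class SoakRunner {
  /** Runs the rpc action on each thread and returns how many calls completed. */
  static int run(int iterations, int threads, Runnable rpc) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    AtomicInteger completed = new AtomicInteger();
    int perThread = iterations / threads; // assumption: remainder is ignored
    List<Future<?>> futures = new ArrayList<>();
    try {
      for (int t = 0; t < threads; t++) {
        futures.add(pool.submit(() -> {
          for (int i = 0; i < perThread; i++) {
            rpc.run();
            completed.incrementAndGet();
          }
        }));
      }
      for (Future<?> future : futures) {
        future.get(); // wait for every worker and surface its failure
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    } finally {
      pool.shutdownNow();
    }
    return completed.get();
  }
}
```

With `threads = 1` this degenerates to the sequential mode; for channel_soak the `rpc` action would additionally create and shut down its own channel.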
Generated code for v1alpha was ignored, but not v1. Ignoring v1 reduces
lines being checked from 16,145 to 6,303, significantly improving the
overall code coverage and removing noise. This was noticed because there
was a very clear drop at 0aa976c4 visible in the coveralls.io coverage
graph, the point when v1 was introduced.
When spiffe support was added it caused
tlsClientServer_useSystemRootCerts_validationContext to become flaky.
This is because test execution order was important for whether the race
would occur.
Fixes #11678
This is a step toward removing ResolvedAddresses from ChildLbState,
which isn't actually used by MultiChildLb. Most of the EAG usages
can be served more directly without peering into MultiChildLb's
internals or even accessing ChildLbStates, which makes the tests less
sensitive to implementation changes. Some changes do leverage the new
behavior of MultiChildLb where it preserves the order of the entries.
This does fix an important bug in shutdown tests. The tests looped over
the ChildLbStates after shutdown, but shutdown deleted all the children
so it looped over an empty collection. Fixing that exposed that
deliverSubchannelState() didn't function after shutdown, as the listener
was removed from the map when the subchannel was shut down. Moving the
listener onto the TestSubchannel allowed having access to the listener
even after shutdown.
In a few places in LeastRequestLb, lines were just deleted, but that's
because an existing assertion already provided the same check but
without digging into MultiChildLb.
When forwarding from Listener onAddresses to Listener2, continue to use onResult and not onResult2, because the latter must be called from within the synchronization context, which breaks existing code that didn't need to do so when using the old Listener interface.
feab4e54 removed xds v2 for the Gradle build. Testing with a deploy.jar,
I see the same 4 MB size reduction (31 -> 27 MB) here.
While an orca dependency is deleted in this commit, it is only a direct
dependency. It remains in the :orca target, so doesn't contribute a size
reduction.
PAUSED Looper mode has been the default for many years, maybe around
robolectric 4.5 (9ae9f0b6a6). Explicitly specifying PAUSED Looper mode
is not necessary.
cl/690684542
When java.time.Instant is available, use the timestamp from this class in nanosecond precision rather than using System.currentTimeMillis and converting it to nanos.
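The two ways of obtaining epoch nanoseconds differ only in how many digits survive. The sketch below uses a hypothetical `EpochNanos` helper and leaves out the runtime check for whether `java.time.Instant` is available.

```java
import java.time.Instant;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper contrasting the two sources of epoch nanoseconds. */
final class EpochNanos {
  /** Keeps the sub-millisecond digits that the Instant carries. */
  static long fromInstant(Instant instant) {
    return TimeUnit.SECONDS.toNanos(instant.getEpochSecond()) + instant.getNano();
  }

  /** Fallback: millisecond precision widened to nanoseconds. */
  static long fromMillis(long epochMillis) {
    return TimeUnit.MILLISECONDS.toNanos(epochMillis);
  }
}
```

A caller would pass `Instant.now()` to the first method and `System.currentTimeMillis()` to the second.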
Fixes #5494.
Allow using system root certs for server cert validation rather than CA root certs provided by the control plane when the validation context provided by the control plane specifies so.
This reverts commit 99f86835ed.
The change doesn't handle `null` messages, which don't happen with
protobuf, but can happen with other marshallers, especially in tests.
See cl/689445172
This will reopen #5969.
Callers are frequently confused by this message and waste time looking for problems in the client when the root cause is simply a server crash. See b/371447460 for more context.
It is the `Executor appExecutor` that should be given an asynchronous
task, not `CallCredentials.MetadataApplier applier`.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
This had been used for a time with a combined inprocess+binder server.
However, just having multiple servers worked fine and this is no longer
used/needed.
If a panic is followed by a panic, we'd ignore the second. But if an
exception happens while entering panic mode we may fail to update the
picker with the first error. This is "fine" from a correctness
standpoint; all bets are off when panicking and we've already logged the
first error. But failing RPCs can often be more easily seen than just
the log.
Noticed because of http://yaqs/8493785598685872128
* Add S2AStub cleanup handler.
* Give TLS and Cleanup handlers name + update comment.
* Don't add TLS handler twice.
* Don't remove explicitly, since done by fireProtocolNegotiationEvent.
* plumb S2AStub close to handshake end + add integration test.
* close stub when TLS negotiation fails.
When an ADS stream is closed with a non-OK status after receiving a response, the status is updated to OK. This makes the failure behavior consistent with gRFC A57.
* throw IllegalArgumentException in ProtoUtil.
* throw exception in TrustManager in more standard way.
* handle IllegalArgumentException in SslContextFactory.
* Don't throw error on unknown TLS version.
Combined success / error status passed via ResolutionResult to the NameResolver.Listener2 interface's onResult2 method - Addresses in the success case or address resolution error in the failure case now get set in ResolutionResult::addressesOrError by the internal name resolvers.
* Change PickFirstLeafLoadBalancer to only have 1 subchannel at a time if environment variable GRPC_SERIALIZE_RETRIES == true.
Cache serializingRetries value so that it doesn't have to look up the flag every time.
Clear the correct task when READY in processSubchannelState and move the logic to cancelScheduledTasks
Cleanup based on PR review
remove unneeded checks for shutdown.
* Fix previously broken tests
* Shutdown previous subchannel when run off end of index.
* Provide option to disable subchannel retries to let PFLeafLB take control of retries.
* InternalSubchannel internally goes to IDLE when it sees TF while reconnect is disabled.
Remove an extra index.increment in LeafLB
When running on the JDK, it is quite normal for Conscrypt not to be
present. We'll end up using the JDK 9 ALPN API and everything will be
fine. On Android, it would be extremely rare for someone to completely
remove the default Android security providers, so the warning was almost
never going to trigger on that platform anyway.
A map of children is still needed, but is created temporarily on update.
The order of children is currently preserved, but we could use regular
HashMaps if that is not useful.
* Combine MtlsToS2ChannelCredentials and S2AChannelCredentials.
* Check if file exists.
* S2AChannelCredentials API requires credentials used for client-s2a channel.
* remove MtlsToS2A library in BUILD.
* Don't check state twice.
* Don't check for file existence in tests.
Instead of doing a dance of supplementing config so the later
createChildAddressesMap() won't delete children, just look at the
existing children and don't delete any that shouldn't be deleted.
* Use StandardCharsets in FakeS2AServerTest.
* Use add instead of offer in S2AStub.
* remove dead code in ProtoUtil.java.
* Mark convertTlsProtocolVersion as VisibleForTesting.
* S2AStub doesn't return responses at front of queue.
* Remove global SHARED_RESOURCE_CHANNELS.
* Don't suppress RethrowReflectiveOperationExceptionAsLinkageError.
* Update javadoc.
* Make clear which certs are used in tests + add how to regenerate.
1. Removing $ when looking for the commit 'Start of development cycle...' because it produces empty result with the $. It seems how the squash was done may influence whether $ will work or not.
2. Added an explicit git push instruction at step 5 of tagging and noted which base branch to use, since the default base branch of master would cause a conflict.
The main goal was to make sure subchannels went CONNECTING only after a
connection was requested (since the test doesn't transition to
CONNECTING from TF). That helps guarantee that the test is using the
expected subchannel.
The missing ClusterImplLB.requestConnection() doesn't actually matter
much, as cluster manager doesn't propagate connection requests.
* Added null check for xdsClient in onSubChannelState. This avoids NPE
for xdsClient when LB is shutdown and onSubChannelState is called later
as part of listener callback. As shutdown is racy and eventually consistent,
this check would avoid calculating locality after LB is shutdown.
* Mark S2A public APIs as experimental.
* Rename S2AChannelCredentials createBuilder API to newBuilder.
* Remove usage of AdvancedTls.
* Use InsecureChannelCredentials.create instead of Optional.
* Invoke Thread.currentThread().interrupt() in an InterruptedException block.
* S2AHandshakerServiceChannel doesn't use custom event loop.
* use executorPool.
* log when channel not shutdown.
* use a cached threadpool.
* update non-executor version.
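The Thread.currentThread().interrupt() item above follows a common Java idiom: catching InterruptedException clears the thread's interrupt flag, so the catch block sets it again for callers. A minimal sketch, where `sleepQuietly` is a hypothetical helper for illustration and not code from the S2A module:

```java
final class InterruptRestoreExample {
  // Hypothetical helper for illustration; not taken from the S2A code.
  static boolean sleepQuietly(long millis) {
    try {
      Thread.sleep(millis);
      return true;
    } catch (InterruptedException e) {
      // Thread.sleep() cleared the interrupt flag when it threw, so restore it
      // to let callers further up the stack observe the interruption.
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```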
Move unused and unimportant fields to local variables. pickUnusedPort()
is inherently racy, so avoid using it when unnecessary. The channel's
default executor is fine to use, but if you don't like it
directExecutor() would be an option too. But blocking stub doesn't even
use the executor for unary RPCs. Thread.join() does not propagate
exceptions from the Thread; it just waits for the thread to exit.
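Because Thread.join() only waits, a test that wants to see a failure from another thread has to carry it across explicitly. A minimal sketch of one way to do that; the class and method names are hypothetical and this is not the code from the test being changed:

```java
import java.util.concurrent.atomic.AtomicReference;

final class JoinDoesNotPropagateExample {
  // Runs body on a new thread and returns whatever it threw, or null if it
  // completed normally. join() itself never rethrows the thread's exception.
  static Throwable runAndCapture(Runnable body) {
    AtomicReference<Throwable> failure = new AtomicReference<>();
    Thread thread = new Thread(() -> {
      try {
        body.run();
      } catch (Throwable t) {
        failure.set(t);
      }
    });
    thread.start();
    try {
      thread.join();
    } catch (InterruptedException e) {
      // The waiting thread was interrupted; restore the flag and report it.
      Thread.currentThread().interrupt();
      return e;
    }
    return failure.get();
  }
}
```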
Add opentelemetry tracing API, guarded by an environment variable (disabled by default).
Use server interceptor to explicitly propagate span to the application thread.
unix.sh is shared by multiple OSes and environments. Clear JAVA_HOME,
since we never want to use that as PATH is more reliable, better
supported, and more typical.
* use an attribute from resolved addresses IS_PETIOLE_POLICY to control whether or not health checking is supported so that top level versions can't do any health checking, while those under petiole policies can.
Fixes #11413
Detachable lets a buffer outlive its original lifetime. The new lifetime
is application-controlled. If the application fails to read/close the
stream, then the leak detector wouldn't make clear what code was
responsible for the buffer's lifetime. With this touch, we'll be able to
see detach() was called and thus know the application needs debugging.
Realized when looking at b/364531464, although I think the issue is
unrelated.
This makes ClusterManagerLB more straightforward, focusing on just the
things that are relevant to it, and it avoids specialized map key
handling in updateChildrenWithResolvedAddresses().
The child policy config should be refreshed every address update, so it
shouldn't be stored in the ChildLbState. In addition, none of the
current usages actually used what was stored in the ChildLbState in a
meaningful way (it was always null).
ResolvedAddresses was also removed from createChildLbState(), as nothing
in it should be needed for creation; it varies over time and the values
passed at creation are immutable.
While child LB policies are unlikely to change for each cluster name (RLS
returns regular cluster names, so should be unique), and the
configuration for CDS policies won't change, RLS configuration can
definitely change.
It doesn't do anything.
Call scheduleNextConnection() unconditionally since it is responsible
for checking if `enableHappyEyeballs == true`. It's also surprising to
check in the CONNECTING case but not the IDLE case.
It is trivial to avoid the exception from
addressIndex.getCurrentAddress(). The log message was inaccurate, as the
subchannel might have been TRANSIENT_FAILURE. The only important part of
the condition was whether the subchannel was the current subchannel.
It will never throw, because it would only throw if helper is null, but
helper is checkNotNull()ed in the constructor. It could have checked for
a null return value instead; since it hasn't been, it is clear we don't
need this check.
Bazel had the dependency added because of #5046, where Guava was
depending on it as compile-only and Bazel builds had "unknown enum
constant" warnings. Guava now has a compile dependency on j2objc, so
this workaround is no longer needed. There are currently no version skew
issues in Gradle, which was the only usage.
Since 04474970 RingHashLB has not used
acceptResolvedAddressesInternal(). At the time that was needed because
deactivated children were part of MultiChildLB. But in 9de8e443, the
logic of RingHashLB and MultiChildLB.acceptResolvedAddressesInternal()
converged, so it can now swap back to using the base class for more
logic.
One LB no longer needs to extend ChildLbState and one has to start, so
it is a bit of a wash. There are more LBs that need the auto-request
logic, but if we have an API where subclasses override it without
calling super then we can't change the implementation in the future.
Adding behavior on top of a base class allows subclasses to call super,
which lets the base class change over time.
There was no point to using subchannels as keys to
subchannelToReportListenerMap, as the listener is per-child. That meant
the keys would be guaranteed to be known ahead-of-time and the
unsynchronized getOrCreateOrcaListener() during picking was unnecessary.
The picker still stores ChildLbStates to make sure that updating weights
uses the correct children, but the picker itself no longer references
ChildLbStates except in the constructor. That means weight calculation
is moved into the LB policy, as child.getWeight() is unsynchronized, and
the picker no longer needs a reference to helper.
A package-private class isn't visible and `@Internal` is stronger than
experimental. The only way users should use WRR is via the
weight_round_robin string, and that's already not suffixed with
_experimental.
Closes #9885
* otel tracing: add binary format, grpcTraceBinContextPropagator
* exception handling, use api base64 encoder omit padding
remove binary format abstract class in favor of binary marshaller
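For reference, unpadded base64 looks like the following. This sketch uses the JDK's java.util.Base64, which is an assumption made to keep the example self-contained; it may not be the encoder the change itself uses.

```java
import java.util.Base64;

final class UnpaddedBase64Example {
  // Encodes without trailing '=' padding characters.
  static String encode(byte[] value) {
    return Base64.getEncoder().withoutPadding().encodeToString(value);
  }

  // The JDK's basic decoder accepts input with or without padding.
  static byte[] decode(String encoded) {
    return Base64.getDecoder().decode(encoded);
  }
}
```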
Some addresses are equal even though their toString is different
(InetSocketAddress ignores the hostname when it has an address). And
some addresses are not equal even though their toString might be the
same (AnonymousInProcessSocketAddress doesn't override toString()).
InetSocketAddress/InetAddress do not cache the toString() result. Thus,
even in the worst case that uses a HashSet, this should use less memory
than the earlier approach, as no strings are formatted. It probably also
significantly improves performance in the reasonably common case when an
Endpoint is created just for looking up a key, because the string
creation in the constructor isn't then amortized.
updateChildrenWithResolvedAddresses(), for example, creates n^2 Endpoint
objects for lookups.
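The mismatch between equals() and toString() for InetSocketAddress can be reproduced with two addresses that share an IP and port but carry different hostnames. A small sketch; the hostnames are made up and the helper name is hypothetical:

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

final class AddressEqualityExample {
  // Builds a resolved 127.0.0.1 address labeled with the given hostname.
  // InetAddress.getByAddress() does not perform a DNS lookup.
  static InetSocketAddress loopback(String hostname, int port) {
    try {
      byte[] ip = {127, 0, 0, 1};
      return new InetSocketAddress(InetAddress.getByAddress(hostname, ip), port);
    } catch (UnknownHostException e) {
      // Only thrown when the byte array has an illegal length.
      throw new IllegalStateException(e);
    }
  }
}
```

`loopback("foo.example", 80)` and `loopback("bar.example", 80)` are equal and hash the same, yet their toString() results differ, so keying on the string would treat them as distinct.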
It came up in #11073, and I saw it could use a little updating. Notably,
I'm linking to a guide to what Git commit messages should look like. I
also tried to make the language less heavy-handed and demanding.
This reverts commit 9ba2f9dec5.
It causes a channel panic due to unimplemented onResult2().
```
java.lang.UnsupportedOperationException: Not implemented.
at io.grpc.NameResolver$Listener2.onResult2(NameResolver.java:257)
at io.grpc.internal.DnsNameResolver$Resolve.lambda$run$0(DnsNameResolver.java:334)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:94)
at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:126)
at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:333)
```
b/356669977
They share very little code, and we really don't want RoundRobinLb to be
public and non-final. Originally, WRR was expected to share much more
code with RR, and even delegated to RR at times. The delegation was
removed in 111ff60e. After dca89b25, most of the sharing has been moved
out into general-purpose tools that can be used by any LB policy.
FixedResultPicker now has equals, which makes it usable as an EmptyPicker
replacement. RoundRobinLb still uses EmptyPicker because fixing its
tests is a larger change. OutlierDetectionLbTest was changed because
FixedResultPicker is used by PickFirstLeafLb, and now RoundRobinLb can
squelch some of its updates for ready pickers.
`cncf/xds`: Sync protos to the latest imported version
cncf/xds@024c85f (commit 2024-07-23, cl/655545156).
Should be a noop, just a routine xDS proto update to make upcoming
RLQS-related imports simpler, see related #11401.
Note that CEL is only added as a bazel dependency as now it's required
to build cncf/xds. Actual third-party source import will be done in
the follow up PR, where RLQS dependencies are added to the import
scripts.
Otherwise, the server will continue sending updates and if we
re-subscribe to the last resource, the server won't re-send it. Also
completely remove the per-type state, as it could only add confusion.
`envoyproxy/envoy`: Sync protos to the latest imported version
ab911ac2ff
(commit 2024-07-06, cl/651956889).
Should be a noop, just a routine xDS proto update to make upcoming
RLQS-related imports simpler.
Introducing NameResolver listener method "Status Listener2::onResult2(ResolutionResult)" that returns Status of the acceptance of the name resolution by the load balancer, and the Name Resolver will call this method for both success and error cases.
From gRFC A58:
> When less than two subchannels have load info, all subchannels will
> get the same weight and the policy will behave the same as round_robin
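A rough sketch of the fallback quoted above. The representation is an assumption: a weight of 0 stands for "no load info", and the helper name is hypothetical. Filling a missing weight with the mean of the known weights is also an assumption about the general case, not something stated in the quote.

```java
import java.util.Arrays;

final class WrrWeightFallbackExample {
  // Hypothetical helper: a reported weight of 0 means "no load info".
  static double[] effectiveWeights(double[] reported) {
    int withLoadInfo = 0;
    double sum = 0;
    for (double w : reported) {
      if (w > 0) {
        withLoadInfo++;
        sum += w;
      }
    }
    double[] result = new double[reported.length];
    if (withLoadInfo < 2) {
      // Fewer than two subchannels have load info: behave like round_robin.
      Arrays.fill(result, 1.0);
      return result;
    }
    double mean = sum / withLoadInfo;
    for (int i = 0; i < reported.length; i++) {
      // Assumption: subchannels without load info get the mean weight.
      result[i] = reported[i] > 0 ? reported[i] : mean;
    }
    return result;
  }
}
```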
We don't include protobuf in IO_GRPC_GRPC_JAVA_ARTIFACTS, so there might
not actually be an alias available for it to @com_google_protobuf. While
we could add it, it is easier to use the @com_google_protobuf references
directly.
This was preventing `bazel query 'deps(//...)'` from succeeding, because
it couldn't find javalite.
Since Bazel 6 [1], Bazel has used com_google_protobuf for javalite. We
only used the other repo because Bazel expected it, which was because
Protobuf split out javalite to a separate branch for a while. Since
everything is now reunified, we can use a singular protobuf repo.
1. abdb1d6bfe
V1 version of the proto reflection service, as the v1.alpha service has been deprecated.
* Create V1 alpha service wrapping underlying V1 service, by modifying the ServerServiceDefinition.
* Create ProtoReflectionService for the v1alpha proto by producing a ServerServiceDefinition constructed from that of the v1 service but with the service and method names and proto descriptors modified.
Issue #6724.
Java 8 isn't installed, and was needed by the old Android SDK. With the
current SDK, it can work on Java 11 but it needs some dependencies
installed.
Python 2.7 isn't available any more, but instead of porting to Python 3,
it was just replaced with a curl command.
The GSON upgrade slightly changed an error string, so the test was
updated to be less of a change detector.
Some OpenTelemetry dependencies are alpha versions, so needed an
adjustment in build.gradle to accept the versions. Similarly, Undertow
includes Final in its version numbers which needs to be accepted.
CentOS 7 became end-of-life on July 1st and is no longer working. We now
dynamically link against libstdc++, as RHEL 8 doesn't support static
linking: https://access.redhat.com/articles/rhel8-abi-compatibility
We now use objdump in check-artifact for all linux architectures. This
avoids using a mix of objdump and ldd. ldd shows transitive
dependencies, which is less convenient.
This is to replace switchTo(), to allow composing GracefulSwitchLb with
other helpers like MultiChildLb. It also prevents users of
GracefulSwitchLb from needing to use ServiceConfigUtil.
opencensus-proto is old generated code, which is not compatible with
protobuf-java 4.27.2 and may not be fixed since the project is dead.
Since it is unused, I think this doesn't cause any trouble for
downstream users trying to use protobuf-java 4.x. Related to #11015.
Add gRPC OpenTelemetry example. The example uses Prometheus exporter to export metrics and can be verified locally.
It also provides an example using LoggingMetricExporter to export and log the metrics using java.util.logging.
* Eliminate NPE after recovering from a temporary name resolution failure.
* Add test case for 2 failing subchannels to make sure it causes channel to go into TF.
Allocating this executor before BinderServer even exists is convoluted and actually leaks if the built server is never actually start()ed. Instead, have BinderServer own this executor directly, with a lifetime from start() until termination. Pass it to the ServerAuthInterceptor via TransportAuthorizationState Attribute instead of at construction time.
Using --runs_per_test=1000, this changes the flake rate of TlsTest from
2% to 0%.
While I believe it is possible to write a reliable test for this
(including noticing the SSLSocket behavior), it was becoming too
invasive so I gave up.
Fixes #11012
These are overrides of BinderTransport itself, so not used elsewhere.
They are essentially private. It was scary seeing `@GuardedBy` for a
public method. I copied the annotation to the base class to make sure
ErrorProne could verify the calls.
The test was added in e4e7f3a06 when InProcess stopped returning a
Runnable from start(). In c5a63a1 we realized (indirectly) that there's
no point in using the Runnable any more.
This test failed with Binder (which seems to have been using the
Runnable unnecessarily), and InProcess, Netty, and OkHttp don't use the
Runnable. Instead of fixing it, we'll just move toward stopping using
Runnable.
I'm not removing the Runnable usage from Binder in this commit because
this test is currently causing CI failures and I don't want to do a
behavior change when fixing it.
There are no longer any devices (virtual or otherwise) that support API
level 21, 22, or 23. Google Play services is still supporting API level
21 (although there is a pattern of notifying of dropped levels in July,
and dropping them in August).
This hasn't been needed since f8f569e07, when InternalSubchannel stopped
calling start() with a lock held. Note that also means no transport
needs to return a Runnable (but some still are).
I had noticed in e4e7f3a06 that it was safe for InProcess to call the
listener directly within start(), but I didn't notice this Javadoc that
said it wasn't allowed.
CachingRlsLbClient already calls it with a lock held. The only reason
the cache needs to manage the lock itself is for the periodic cleanup.
Let the consumer of the cache handle the timer.
Returning the runnable did nothing, as both the start method and the
runnable are run within the synchronization context. I believe the
Runnable used to be required in the previous implementation of
ManagedChannelImpl (the lock-based implementation before we created
SynchronizationContext).
This fixes a NPE seen in ServerImpl because the server expects proper
ordering of transport lifecycle events.
```
Uncaught exception in the SynchronizationContext. Panic!
java.lang.NullPointerException: Cannot invoke "java.util.concurrent.Future.cancel(boolean)" because "this.handshakeTimeoutFuture" is null
at io.grpc.internal.ServerImpl$ServerTransportListenerImpl.transportReady(ServerImpl.java:440)
at io.grpc.inprocess.InProcessTransport$4.run(InProcessTransport.java:215)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:94)
```
b/338445186
fea577c80 disabled an optimization that some tests notice, as it can
change execution order. This restores the old behavior, at a slight
expense to seeing the relationship between in-use tracking and idle mode.
This will be used by the metadata exchange of CSM. When recording
per-attempt metrics, we really need per-attempt data and can't leverage
ClientInterceptors.
8844cf7b8 triggered a regression where a new RPC wouldn't cause the
channel to exit idle mode, if an RPC was still progressing on an old
transport. This was already possible previously, but was racy.
8844cf7b8 made it less racy and more obvious.
The two added `exitIdleMode()` calls in this commit are companions to
those in `enterIdleMode()`, which detect whether the channel should
immediately exit idle mode.
Noticed in cl/635819804.
Verifies that latest versions of Tomcat/Undertow/Jetty pass
integration tests - I manually verified that all ignored tests still
fail.
Two tests failed in Jetty; it appears that the integration test
anticipates that the server implementation is willing to send larger
trailers than the client SETTINGS frame allows for. Since the server
refuses to send too large of headers/trailers, the client does not
receive the too-large payloads, and doesn't fail with the expected
message. This change was introduced in Jetty 10.0.15/11.0.11. Those
tests are ignored.
* Fix 3rd-party dependency use_repo
* remove protobuf as it is already added as module dep
* fix
* fix
* fix
* return com_google_protobuf_javalite archive and use it in MODULE.bazel
DelayedClientTransport already had to handle all the cases, so
ManagedChannelImpl picking was acting only as an optimization.
Optimizing DelayedClientTransport to avoid the lock when not queuing
makes ManagedChannelImpl picking entirely redundant, and allows us to
remove the duplicate race-handling logic.
This avoids double-picking when queuing, where ManagedChannelImpl does a
pick, decides to queue, and then DelayedClientTransport re-performs the
pick because it doesn't know which pick version was used. This was
noticed with RLS, which mutates state within the picker.
Previously, picker was likely null if entering backoff soon after
start-up. This prevented the picker from being updated and directing
queued RPCs to the fallback. It would work for new RPCs if RLS returned
extremely rapidly; both ManagedChannelImpl and DelayedClientTransport do
a pick before enqueuing so the ManagedChannelImpl pick could request
from RLS and DelayedClientTransport could use the response. So the test
uses a delay to purposefully avoid that unlikely-in-real-life case.
Creating a resolving OOB channel for InProcess doesn't actually change
the destination from the parent, because InProcess uses directaddress.
Thus the fakeRlsServiceImpl is now being added to the fake backend
server, because the same server is used for RLS within the test.
b/333185213
Some APIs were marked experimental but had internal APIs in their
surface. These were all changed to internal. And then the internal APIs
were mostly hidden from generated documentation.
All these APIs will eventually become public and maybe even stable. But
they need some iteration before we're ready for others to start using
them.
* Change HappyEyeballs flag default value to false since some G3 users are seeing problems.
Put the flag logic in a common place for PickFirstLeafLoadBalancer & WRR's test.
* Set expected requestConnection count based on whether happy eyeballs is enabled or not
* Disable new PickFirstLB
* Fix test expectations to handle both new and old PF LB paths.
OpenTelemetryModule is renamed to GrpcOpenTelemetry. The Builder is now
`final`, although that should only impact mocks as it had a private
constructor.
Fixes #10591
The optional label API was added in 4c78a974 and xds_cluster_impl was
plumbed in 077dcbf9.
From gRFC A78:
> ### Optional xDS Locality Label
>
> When xDS is used, it is desirable for some metrics to include an optional
> label indicating which xDS locality the metrics are associated with.
> We want to provide this optional label for the metrics in both the
> existing per-call metrics defined in [A66] and in the new metrics for
> the WRR LB policy, described below.
>
> If locality information is available, the value of this label will be of
> the form `{region="${REGION}", zone="${ZONE}", sub_zone="${SUB_ZONE}"}`,
> where `${REGION}`, `${ZONE}`, and `${SUB_ZONE}` are replaced with the
> actual values. If no locality information is available, the label will
> be set to the empty string.
>
> #### Per-Call Metrics
>
> To support the locality label in the per-call metrics, we will provide
> a mechanism for LB picker to add optional labels to the call attempt
> tracer. We will then use this mechanism in the `xds_cluster_impl`
> policy's picker to set the locality label. ...
>
> This label will be available on the following per-call metrics:
> - `grpc.client.attempt.duration`
> - `grpc.client.attempt.sent_total_compressed_message_size`
> - `grpc.client.attempt.rcvd_total_compressed_message_size`
This is needed by gRFC A78 for xds metrics, and for RLS metrics. Since
gauges need to acquire a lock (or other synchronization) in the
callback, the callback allows batching multiple gauges together to avoid
acquiring-and-reacquiring such locks.
Unlike other metrics, gauges are reported on-demand to the MetricSink.
This means not all sinks will receive the same data, as the sinks will
ask for the gauges at different times.
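The batching idea can be sketched as follows: one callback reports several gauges after taking the lock a single time. All names here (the class, the gauge names, the use of ObjLongConsumer as the recorder) are hypothetical and are not the gRPC metrics API.

```java
import java.util.function.ObjLongConsumer;

final class BatchedGaugeExample {
  private final Object lock = new Object();
  private long entries;
  private long bytes;

  void update(long newEntries, long newBytes) {
    synchronized (lock) {
      entries = newEntries;
      bytes = newBytes;
    }
  }

  // A single on-demand callback that reports both gauges. The lock is
  // acquired once for the whole batch instead of once per gauge.
  void reportGauges(ObjLongConsumer<String> recorder) {
    long entriesSnapshot;
    long bytesSnapshot;
    synchronized (lock) {
      entriesSnapshot = entries;
      bytesSnapshot = bytes;
    }
    recorder.accept("example.cache.entries", entriesSnapshot);
    recorder.accept("example.cache.bytes", bytesSnapshot);
  }
}
```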
This will be used for gRFC A66's OTel per-RPC metric label:
> `grpc.target` : Canonicalized target URI used when creating gRPC
> Channel, e.g. "dns:///pubsub.googleapis.com:443",
> "xds:///helloworld-gke:8000". Canonicalized target URI is the form
> with the scheme included if the user didn't mention the scheme
> (`scheme://[authority]/path`).
The majority of the changes are to move target computation from
ManagedChannelImpl into the builder. A small hack API was added to
ManagedChannelBuilder to get the target to create an interceptor.
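To illustrate what "canonicalized" means in the quoted definition, here is a simplified sketch. It is not the ManagedChannelImpl logic; the set of known schemes and the default scheme are passed in as assumptions, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Set;

final class CanonicalTargetExample {
  // Returns target unchanged when it already starts with a known scheme;
  // otherwise prefixes it with the default scheme.
  static String canonicalize(
      String target, Set<String> knownSchemes, String defaultScheme) {
    try {
      String scheme = new URI(target).getScheme();
      if (scheme != null && knownSchemes.contains(scheme)) {
        return target;
      }
    } catch (URISyntaxException e) {
      // Not parseable as a URI; treat it like a target without a scheme.
    }
    return defaultScheme + ":///" + target;
  }
}
```

Note that "pubsub.googleapis.com:443" parses as a URI whose scheme is "pubsub.googleapis.com", which is why the sketch checks the scheme against a known set rather than only checking for its presence.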
This should preserve all the existing behavior of GlobalInterceptors as
used by grpc-gcp-observability, including it disabling the implicit
OpenCensus integration.
Both the old and new API are internal. I hid Configurator and
ConfiguratorRegistry behind Internal-prefixed classes, as had been
done with GlobalInterceptors to further discourage use until the API is
ready.
GlobalInterceptorsTest was modified to become ConfiguratorRegistryTest.
As part of gRFC A78:
> To support the locality label in the WRR metrics, we will extend the
> `weighted_target` LB policy (see A28) to define a resolver attribute
> that indicates the name of its child. This attribute will be passed
> down to each of its children with the appropriate value, so that any
> LB policy that sits underneath the `weighted_target` policy will be
> able to use it.
xds_cluster_impl is involved because it uses the child names in the
AddressFilter, which must match the names used by weighted_target.
Instead of using Locality.toString() in multiple policies and assuming
the policies agree, we now have xds_cluster_impl decide the locality's
name and pass it down explicitly. This allows us to change the name
format to match gRFC A78:
> If locality information is available, the value of this label will be
> of the form `{region="${REGION}", zone="${ZONE}",
> sub_zone="${SUB_ZONE}"}`, where `${REGION}`, `${ZONE}`, and
> `${SUB_ZONE}` are replaced with the actual values. If no locality
> information is available, the label will be set to the empty string.
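The label format quoted above can be produced with a small formatter. This is a sketch only; the helper name and the convention of passing null for "no locality information" are assumptions, not the xds_cluster_impl code.

```java
final class LocalityLabelExample {
  // Hypothetical helper: all-null arguments mean no locality information.
  static String label(String region, String zone, String subZone) {
    if (region == null && zone == null && subZone == null) {
      return "";
    }
    return String.format(
        "{region=\"%s\", zone=\"%s\", sub_zone=\"%s\"}",
        region == null ? "" : region,
        zone == null ? "" : zone,
        subZone == null ? "" : subZone);
  }
}
```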
This adds the following components that are required for gRPC A79
non-per-call metrics architecture.
- MetricSink implementation for gRPC OpenTelemetry
- Configurator for plumbing per call metrics ClientInterceptor and
ServerStreamTracer.Factory via unified OpenTelemetryModule.
Integrates the new features of the Kokoro PSM Interop install library introduced in grpc/psm-interop#73.
Nearly all common functionality was moved from per-language/per-branch PSM Interop build scripts to [psm_interop_kokoro_lib.sh](https://github.com/grpc/psm-interop/blob/main/.kokoro/psm_interop_kokoro_lib.sh):
1. The list of tests in each test suite
2. Per-test-suite flag customization
3. `run_test` methods
4. `build_docker_images_if_needed` methods
5. Generic `build_test_app_docker_images` methods (simple docker build + docker push + docker tag). grpc-java is one exception, as it doesn't run docker directly, but a cloudbuild flow.
Now all PSM Interop jobs share the same buildscripts across all test suites:
1. buildscript that invokes the test: `psm-interop-test-{language}.sh` (configured as `build_file` in the build cfg)
2. buildscript that builds the xDS test client/server and publishes them as a Docker image: `psm-interop-build-{language}.sh` (conventional name called from `psm_interop_kokoro_lib.sh`)
`psm-interop-test-{language}.sh`:
1. Sets `GRPC_LANGUAGE`, `BUILD_SCRIPT_DIR` environment variables.
2. Downloads the shared `psm_interop_kokoro_lib.sh` from the main branch of the psm-interop repo.
3. Sources `psm-interop-build-{language}.sh`
4. Calls `psm::run "${PSM_TEST_SUITE}"` (`PSM_TEST_SUITE` configured in the cfg file).
`psm-interop-build-{language}.sh`:
1. Defines `psm::lang::build_docker_images` which is called from `psm_interop_kokoro_lib.sh`.
2. Invokes any repo-specific logic.
3. May use `psm::build::docker_images_generic` for generic Docker build, tag, push, or implement its own build/publish method.
References:
- b/288578634
- See the full list of the new features at grpc/psm-interop#73.
- Additional fixes to the shared lib: grpc/psm-interop#78, grpc/psm-interop#79
gRFC A78 has WRR and pick-first include a `grpc.target` label, defined
in A66:
> `grpc.target` : Canonicalized target URI used when creating gRPC
> Channel, e.g. "dns:///pubsub.googleapis.com:443",
> "xds:///helloworld-gke:8000". Canonicalized target URI is the form
> with the scheme included if the user didn't mention the scheme
> (`scheme://[authority]/path`). For channels such as inprocess channels
> where a target URI is not available, implementations can synthesize a
> target URI.
Since 06df25b65d, WRR has been calling this method, and it will get an
exception. We don't want WRR to be broken until we have MetricRecorder
fully plumbed.
As part of gRFC A78:
> To support the locality label in the per-call metrics, we will provide
> a mechanism for LB picker to add optional labels to the call attempt
> tracer.
* added MetricRecorderImpl and unit tests for MetricInstrumentRegistry
* updated MetricInstrumentRegistry to use array instead of ArrayList
* renamed record<>Counter APIs to add<>Counter. Added check for mismatched label values
* added lock for instruments array
`getMinEvictionTime()` was fixed to make sure only deltas were used for
comparisons (`a < b` is broken; `a - b < 0` is okay). It had also
returned `0` by default, which was meaningless as there is no epoch for
`System.nanoTime()`. LinkedHashLruCache now passes the current time into
a few more functions since the implementations need it and it was
sometimes already available. This made it easier to make some classes
static.
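The difference between the two comparison forms shows up when System.nanoTime() values straddle a long overflow. A small sketch with hypothetical helper names:

```java
final class NanoTimeCompareExample {
  // Broken for nanoTime values: a direct comparison fails across overflow.
  static boolean isBeforeBroken(long a, long b) {
    return a < b;
  }

  // Correct as long as the two instants are less than about 292 years apart:
  // the subtraction wraps the same way the clock does.
  static boolean isBefore(long a, long b) {
    return a - b < 0;
  }
}
```

With `a = Long.MAX_VALUE - 10` and `b = a + 20` (which wraps to a negative value), `isBefore(a, b)` is true while `isBeforeBroken(a, b)` is false.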
Instead of having docs in RefCountedChildPolicyWrapperFactory saying
that every method was guarded by a lock, I added `@GuardedBy("lock")`
within CachingRlsLbClient, so now it is clearly not thread-safe and the
lock protects access. The AtomicLong was replaced with a long since
1) there was no multi-threading and 2) the logic was not atomic-safe
which was misleading.
In OpenCensus recording an attempt was delayed in order to wait for
inboundUncompressedSize(). But we don't need that in OpenTelemetry, and
could have removed this code when copying from OpenCensus.
Today, deframer errors cancel the stream without communicating a status code
to the peer. This change causes deframer errors to trigger a best-effort
attempt to send trailers with a status code so that the peer understands
why the stream is being closed.
Fixes #3996
`sendGrpcFrame` owns the buffer in `SendGrpcFrameCommand`. If the frame is not handed off to netty, it needs to be released in the method.
https://github.com/grpc/grpc-java/issues/11115
It is easy to manage these things outside of MultiChildLb and it makes
the shared code easier and use less memory. In particular, we don't want
to use many instances of GracefulSwitchLb in virtually every policy
simply because it was needed in one or two cases.
Adds interfaces required for recording metrics from gRPC components. Also adds an API to get `MetricRecorder` in `LoadBalancer.Helper` and to add a `MetricSink` to `ManagedChannelBuilder`.
Handles Netty write frame failures caused by issues in Netty
itself.
Normally we don't need to do anything on frame write failures because
the cause of a failed future would be an IO error that resulted in
the stream closure. Prior to this PR we treated these issues as a
noop, except the initial headers write on the client side.
However, a case like netty/netty#13805 (a bug in generating next
stream id) resulted in an unclosed stream on our side. This PR adds
write frame future failure handlers that ensure the stream is
cancelled, and the cause is propagated via Status.
Fixes #10849
The text between the GRPC_DEPS_{START,END} must be identical in
formatting. Probably not a problem in general and not necessarily bad.
But it is simplistic.
Eric waking up this morning:
> We need more sed.
This started with combining handleNewRequest with asyncRlsCall, but that
emphasized pre-existing synchronization issues and trying to fix those
exposed others. It was hard to split this into smaller commits because
they were interconnected.
handleNewRequest was combined with asyncRlsCall to use a single code
flow for handling the completed future while also failing the pick
immediately for throttled requests. That flow was then reused for
refreshing after backoff and data stale. It no longer optimizes the RPC
completing immediately because that would not happen in real life; it
only happens in tests because of inprocess+directExecutor() and we don't
want to test a different code flow in tests. This did require updating
some of the tests.
One small behavior change to share the combined asyncRlsCall with
backoff is we now always invalidate an entry after the backoff.
Previously the code could replace the entry with its new value in one
operation if the asyncRlsCall future completed immediately. That only
mattered to a single test which now sees an EXPLICIT eviction.
SynchronizationContext used to provide atomic scheduling in
BackoffCacheEntry, but it was not guaranteeing the scheduledRunnable was
only accessed from the sync context. The same was true for calling up
the LB tree with `updateBalancingState()`. In particular, adding entries
to the cache during a pick could evict entries without running the
cleanup methods within the context, as well as the RLS channel
transitioning from TRANSIENT_FAILURE to READY. This was replaced with
using a bare Future with a lock to provide atomicity.
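A minimal sketch of the "bare Future with a lock" pattern, assuming a plain `ScheduledExecutorService`. `BackoffTimerSketch` and its methods are hypothetical and are not the CachingRlsLbClient code; the point is only that every access to the future goes through one lock.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical single-use holder for a backoff timer. */
final class BackoffTimerSketch {
  private final Object lock = new Object();
  private final ScheduledExecutorService scheduler;
  // All three fields are guarded by lock.
  private ScheduledFuture<?> pendingTimer;
  private boolean expired;
  private boolean cancelled;

  BackoffTimerSketch(ScheduledExecutorService scheduler) {
    this.scheduler = scheduler;
  }

  /** Schedules the timer at most once. Returns true if this call scheduled it. */
  boolean scheduleOnce(long delayMillis, Runnable onExpire) {
    synchronized (lock) {
      if (pendingTimer != null || expired || cancelled) {
        return false;
      }
      // Scheduling and storing the future happen atomically under the lock,
      // so the task below cannot observe a half-initialized state.
      pendingTimer = scheduler.schedule(() -> {
        synchronized (lock) {
          if (cancelled) {
            return; // lost the race with cancel()
          }
          pendingTimer = null;
          expired = true;
        }
        onExpire.run(); // callback runs outside the lock
      }, delayMillis, TimeUnit.MILLISECONDS);
      return true;
    }
  }

  /** Cancels a pending timer. Returns true if a timer was pending. */
  boolean cancel() {
    synchronized (lock) {
      if (pendingTimer == null) {
        return false;
      }
      cancelled = true;
      pendingTimer.cancel(false);
      pendingTimer = null;
      return true;
    }
  }
}
```

The `cancelled` flag covers the case where the task has already started and is waiting on the lock when `cancel()` runs.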
BackoffCacheEntry no longer uses the current time and instead waits for
the backoff timer to actually run before considering itself expired.
Previously, it could race with periodic cleanup and get evicted before
the timer ran, which would cancel the timer and forget the
backoffPolicy. Since the backoff timer invalidates the entry, it is
likely useless to claim it ever expires, but that level of behavior was
preserved since I didn't look into the LRU cache deeply.
propagateRlsError() was moved out of asyncRlsCall because it was not
guaranteed to run after the cache was updated. If something was already
running on the sync context, then RPCs would hang until another update
caused updateBalancingState().
Some methods were moved out of the CacheEntry classes to avoid
shared-state mutation in constructors. But if we add something in a
factory method, we want to remove it in a sibling method to the factory
method, so additional code is moved for symmetry. Moving shared-state
mutation out of constructors is important because 1) it is surprising
and 2) ErrorProne doesn't validate locking within constructors. In
general, having shared-state methods in CacheEntries also has the
problem that ErrorProne can't validate CachingRlsLbClient calls to
CacheEntry. ErrorProne can't know that "lock" is already held because
CacheEntry could have been created from a _different instance_ of
CachingRlsLbClient and there's no way for us to let ErrorProne prove it
is the same instance of "lock".
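The factory-method symmetry described above can be sketched as follows. This is an illustrative stand-in with hypothetical names (`EntryRegistrySketch`, `createEntry`, `removeEntry`), not the CachingRlsLbClient code; in real code the map would likely carry an ErrorProne `@GuardedBy("lock")` annotation, which is shown here only as a comment to keep the sketch dependency-free.

```java
import java.util.HashMap;
import java.util.Map;

final class EntryRegistrySketch {
  private final Object lock = new Object();
  // Guarded by lock.
  private final Map<String, Entry> entries = new HashMap<>();

  /** Plain data holder: its constructor touches no shared state. */
  static final class Entry {
    final String key;

    Entry(String key) {
      this.key = key;
    }
  }

  /** Factory method: creates the entry and registers it under the lock. */
  Entry createEntry(String key) {
    synchronized (lock) {
      Entry entry = new Entry(key);
      entries.put(key, entry);
      return entry;
    }
  }

  /** Sibling of createEntry(): undoes the registration, also under the lock. */
  boolean removeEntry(Entry entry) {
    synchronized (lock) {
      // Two-argument remove only removes if the key still maps to this entry.
      return entries.remove(entry.key, entry);
    }
  }

  int size() {
    synchronized (lock) {
      return entries.size();
    }
  }
}
```

Because both the add and the remove live in the owning class, a lock checker only has to reason about one instance of `lock`.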
DataCacheEntry still mutates global state that requires a lock in its
constructor, but it is less severe of a problem and it requires more
choices to address.
According to the docs, I can use bazel to build examples, but
retry-example is not supported in the bazel config. So if you try to
build this example with bazel, you'll get an error:
```
examples git:(master) ✗ bazel build :retrying-hello-world-server :retrying-hello-world-client
ERROR: Skipping ':retrying-hello-world-client': no such target '//:retrying-hello-world-client': target 'retrying-hello-world-client' not declared in package '' defined by /Users/rostik404/projects/grpc-java/examples/BUILD.bazel
ERROR: no such target '//:retrying-hello-world-client': target 'retrying-hello-world-client' not declared in package '' defined by /Users/rostik404/projects/grpc-java/examples/BUILD.bazel
INFO: Elapsed time: 0.331s
INFO: 0 processes.
ERROR: Build did NOT complete successfully
```
* Add option in OkHttpServerBuilder
* Add value as MAX_CONCURRENT_STREAM setting in settings frame sent by the server to the client per connection
* Enforce limit by sending a RST frame with REFUSED_STREAM error
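The enforcement step can be sketched as a per-connection counter. This is a simplified illustration with hypothetical names (`StreamLimiterSketch`, `tryOpenStream`, `onStreamClosed`), not the OkHttp server transport code; it assumes all calls for one connection happen on a single thread. The only value taken from the HTTP/2 specification is the REFUSED_STREAM error code.

```java
final class StreamLimiterSketch {
  /** HTTP/2 error code REFUSED_STREAM, as defined by the HTTP/2 specification. */
  static final int REFUSED_STREAM = 0x7;

  private final int maxConcurrentStreams;
  private int activeStreams;

  StreamLimiterSketch(int maxConcurrentStreams) {
    this.maxConcurrentStreams = maxConcurrentStreams;
  }

  /**
   * Called when the client opens a stream. Returns true if the stream is
   * accepted; false means the caller should reply with RST_STREAM carrying
   * REFUSED_STREAM.
   */
  boolean tryOpenStream() {
    if (activeStreams >= maxConcurrentStreams) {
      return false;
    }
    activeStreams++;
    return true;
  }

  /** Called when an accepted stream finishes, freeing a slot. */
  void onStreamClosed() {
    if (activeStreams > 0) {
      activeStreams--;
    }
  }
}
```

A client that honors the advertised setting should rarely hit the refusal path; the check guards against clients that open streams beyond the limit.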
The recommended way to load dependencies from `rules_jvm_external`
is to make use of the `@maven` workspace, and the most readable
way of doing that is to use the `artifact` macro it provides.
This removes the need to generate the "compat" namespaces, which
`rules_jvm_external` provided for backwards compatibility with
older releases. This change also sets things up for supporting
`bzlmod`: this requires all workspaces accessed by a library to
be named "up front" in the `MODULE.bazel` file. This way, the
only repo that needs to be exported is `@maven`, rather than the
current huge list.
The name resolver takes some time before it returns addresses. While waiting the channel will be IDLE instead of the proper CONNECTING. This generally doesn't matter since RPCs behave similarly for IDLE and CONNECTING, but is confusing for users when watching channel.getState() closely.
Fixes #10517.
Including a Status description makes it easier to debug subchannel
closure issues if it's clear that a subchannel became unavailable because
of an outlier detection ejection.
We provided extra details when the RPC is killed by CallOptions'
Deadline, but didn't do the same for Context.
To avoid duplicating code, things were restructured, including the
threading. There are more code flows now, but I think the
multi-threading came out more obvious and less error-prone. I didn't
change the status when the deadline is already expired, because the
text is shared with DelayedClientCall and AbstractInteropTest doesn't
distinguish between the two cases.
This is a roll-forward that avoids a NPE when cancel() is called
without an earlier call to start().
As seen at b/300991330
* Allow the queued byte threshold for a Stream to be ready to be configurable
- on clients this is exposed by setting a CallOption
- on servers this is configured by calling a method on ServerCall or ServerStreamListener
* Have EDS resource parse the additional addresses from envoy message
* Update repositories.bzl to point to current grpc-proto instead of a 2021 version.
* Update repositories.bzl to point to recent cncf/xds and envoyproxy/data-plane-api
* Add cncf_upda to repositories.bzl
We provided extra details when the RPC is killed by CallOptions'
Deadline, but didn't do the same for Context.
To avoid duplicating code, things were restructured, including the
threading. There are more code flows now, but I think the
multi-threading came out more obvious and less error-prone. I didn't
change the status when the deadline is already expired, because the
text is shared with DelayedClientCall and AbstractInteropTest doesn't
distinguish between the two cases.
As seen at b/300991330
To make it stable, this PR hides protobuf from being exposed via the
API.
Note: this breaks ABI of `CsdsService.streamClientStatus` and
`CsdsService.fetchClientStatus`, but these methods should not
normally be called by the user.
Closes #8016.
The gradle wrapper was removed from example-oauth because we don't want
to maintain the wrapper copy in each example (at least right now it
doesn't make sense for it to be the only other example to have the
gradle wrapper).
The local race passes `rlsPicker` to the channel before
CachingRlsLbClient has finished constructing. `RlsPicker` can use
several fields that are not yet initialized. This seems not to be
happening in practice, because it appears like it would break things
very loudly (e.g., NPE).
The remote race seems incredibly hard to hit, because it requires an RPC
to complete before the pending data tracking the RPC is added to a map.
But if a system is at 100% CPU utilization, maybe it can be hit. If
it is hit, all RPCs needing the impacted cache entry will forever be
buffered.
This removes a grpc-ism environment variable. Note that the logger is
still registered under XdsClientImpl. That could maybe change, but it is
a bit unclear what it should become and it seemed better for this to
have no behavior changes.
This method is only needed sometimes, and with time will be needed less
and less. Don't require new types to implement it, instead relying on
control planes to use the new approach.
`project(':grpc-netty').configurations` requires the grpc-netty project
to be configured, which requires evaluationDependsOn. Without
evaluationDependsOn, project loading order is arbitrary and you can get
random errors after small configuration changes.
Or rather, server response is ambiguous and this usage is not generally
what we mean when we say it. The example shows how to get an error for
any failed RPC, not just those coming from a failing server.
The existing comment caused confusion at
https://stackoverflow.com/a/78104828
The xDS library only honored names retrieved from the inner resource
containers, but for wrapped resources the outer layer could contain the
required name. This commit prefers the name on the wrapped container
over the inner resource name.
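The preference order can be sketched as below. `ResourceNameSketch`, `Wrapped`, and `Inner` are hypothetical stand-ins for the xDS proto types, and falling back to the inner name when the outer one is empty is an assumption about how "prefers" plays out, not a statement of the actual implementation.

```java
final class ResourceNameSketch {
  /** Hypothetical inner resource; only its name matters here. */
  static final class Inner {
    final String name;

    Inner(String name) {
      this.name = name;
    }
  }

  /** Hypothetical wrapper whose own name may be left empty. */
  static final class Wrapped {
    final String name;
    final Inner inner;

    Wrapped(String name, Inner inner) {
      this.name = name;
      this.inner = inner;
    }
  }

  /** Prefers the wrapper's name; uses the inner name only when the wrapper has none. */
  static String resolveName(Wrapped wrapped) {
    if (wrapped.name != null && !wrapped.name.isEmpty()) {
      return wrapped.name;
    }
    return wrapped.inner.name;
  }
}
```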
SelfSignedCertificate is not available on Java 17 because
OpenJdkSelfSignedCertGenerator is not available. This only impacted
tests.
AccessController is being removed, and these locations are doing simple
reflection which is unlikely to require it even when a security policy
is in effect. There's other places we do reflection without the
AccessController, so either no security policies care or the users can
update their policies to allow it.
xDS v2 support was dropped about a year ago, but the xds package still
had a few xDS v2 usages. This PR:
- Removes all leftover usages of xDS v2 classes in grpc-xds
- Removes all imported xDS v2 protos and their leaf dependencies:
- Removes xDS v2 generated services
- Makes minor improvements to the xds import script output
### Before
```sh
# Imported 154 protos.
❯ find . -iname "*xds*.jar" -exec du -h {} \; | col -x
13M ./build/libs/grpc-xds-1.63.0-SNAPSHOT-original.jar
6.1M ./build/libs/grpc-xds-1.63.0-SNAPSHOT-sources.jar
388K ./build/libs/grpc-xds-1.63.0-SNAPSHOT-javadoc.jar
14M ./build/libs/grpc-xds-1.63.0-SNAPSHOT.jar
```
### After
```sh
# Imported 86 protos.
❯ find . -iname "*xds*.jar" -exec du -h {} \; | col -x
9.1M ./build/libs/grpc-xds-1.63.0-SNAPSHOT-original.jar
4.1M ./build/libs/grpc-xds-1.63.0-SNAPSHOT-sources.jar
388K ./build/libs/grpc-xds-1.63.0-SNAPSHOT-javadoc.jar
9.1M ./build/libs/grpc-xds-1.63.0-SNAPSHOT.jar
```
Reduction:
- Number of protos: 44%
- Jar size: 35%
In addition to removing a test that only applies to KitKat, switch tests
that require SDK 19 to not specify the SDK version, as we only support min
sdk version 21, which has the required API.
Also removes the SDK version check from isProfileOwner, to trigger a
runtime exception when too low of an SDK version is used.
This allows the checkForUpdates task to notice the dependencies and
suggest updates.
I plan to upgrade some of the servers after this change in hopes it
reduces test flakiness.
We can just compare the Deadline instances instead of asserting that
very little time has passed during the test. Real time probably still
matters in the test, but only insofar that the deadline is not expired
by the time ClientCallImpl sees it.
This fixes a test failure seen in the emulated aarch64 CI. Note that the
message says "ns" when it should say "ms", but this change deletes the
code with that typo.
```
java.lang.AssertionError: timeout: 548 ns
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at io.grpc.internal.ClientCallImplTest.assertTimeoutBetween(ClientCallImplTest.java:1102)
at io.grpc.internal.ClientCallImplTest.contextDeadlineShouldBePropagatedToStream(ClientCallImplTest.java:828)
```
`isolatedResourceDeletions()` has failed with a timeout waiting on
onChanged when running under TSAN. TSAN can slow things down, so let's
increase the timeout to ensure it isn't just timeout flake.
`-link` does I/O to download the package list, for every javadoc
invocation. There is no caching, so this happens many times per build.
Swap to offline mode to avoid spamming the servers, and avoid build
failures if the servers aren't entirely healthy.
We had 'test.dependsOn', but it is only run if the golden tasks are
configured, which they generally won't be because nothing depends on
them. This prevented the test from actually running. This bug was
introduced in 0ff9f37b because previously the golden tasks were eagerly
constructed.
Remove the "extraPackage" argument because it is a constant and it
confused me for a bit wondering when it was necessary.
It wasn't actually being used. Since Java 8u252 in early 2020 we've been
using ALPN from the JDK. The Jetty ALPN Agent has been a noop.
We do keep the Jetty ALPN support in the code and tests, but we don't
have the infrastructure to actually run it.
When we implement A71, we're no longer going to have a single xds
client, but instead one per channel target. Add that parameter now, even
though it is unused, to avoid managing the (internal) API breakage when
we implement fallback.
The DNS lookups are taking considerable time on the Windows CI (~11s),
which causes the test to time out:
```
Wanted but not invoked:
ldsResourceWatcher.onError(<any>);
-> at io.grpc.xds.XdsClientImplTestBase.sendToNonexistentHost(XdsClientImplTestBase.java:3733)
Actually, there were zero interactions with this mock.
at io.grpc.xds.XdsClientImplTestBase.sendToNonexistentHost(XdsClientImplTestBase.java:3733)
```
The ARM build, which uses an emulator, has had this test succeed, so the
failure seems unrelated to CPU usage. We want to avoid external I/O
anyway during tests, so removing the DNS lookup is good.
The TSAN comment referenced XdsClientImplTestBase.sendToNonexistentHost,
but the test no longer calls fakeClock.forwardTime so the comment was
out-of-date. Change the comment to make clear the race involved.
The Java 8 runtime is end of support. Leaving this as gae-jdk8 for now.
The gae-jdk8 was because AppEngine changed dramatically from Java 7 to
Java 8. Nowadays the versions are more in line with OpenJDK and not very
different from each other.
Fixes #10925
I'm trying to upgrade to a newer Windows Kokoro image, but the new one
has an old vswhere installed that breaks Gradle. Our old image doesn't
have vswhere at all. If vswhere isn't found, this rename prints some
errors, but the bat script continues executing. So this change is
compatible with both the older and newer image.
In 0d39bf50 the ReadyPicker was changed from holding List<Subchannel> to
List<ChildLbState>, but ChildLbState mutates over time and is not
synchronized. We want the picker to have a snapshot of the data, so copy
the data from ChildLbState instead of using it directly.
Unfortunately the tests depended on the ChildLbState a bit, so we need
to save the EAG only to use it in tests. That's okay for now, but in the
future we'll probably want to remove that unnecessary memory usage.
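A minimal sketch of the snapshot idea, using hypothetical simplified types (`ChildState` with a single mutable `address` field, and a round-robin `ReadyPicker`). It is not the grpc-java picker; it only shows copying the needed data at construction time instead of holding references to mutable state.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class SnapshotPickerSketch {
  /** Hypothetical mutable child state, changed by the balancer over time. */
  static final class ChildState {
    String address; // mutated later without synchronization

    ChildState(String address) {
      this.address = address;
    }
  }

  /** Picker that copies what it needs when it is built. */
  static final class ReadyPicker {
    private final List<String> addresses; // immutable snapshot
    private final AtomicInteger index = new AtomicInteger();

    /** Assumes a non-empty list of children. */
    ReadyPicker(List<ChildState> children) {
      List<String> copy = new ArrayList<>();
      for (ChildState child : children) {
        copy.add(child.address); // copy the value, not the ChildState reference
      }
      this.addresses = Collections.unmodifiableList(copy);
    }

    /** Round-robin over the snapshot; later ChildState changes are not visible. */
    String pick() {
      int i = Math.floorMod(index.getAndIncrement(), addresses.size());
      return addresses.get(i);
    }
  }
}
```

Mutating a `ChildState` or the original list after the picker is built has no effect on what `pick()` returns; a new picker is needed to reflect new state.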
Add the 'fake' dependency to grpc-netty instead of grpc-core.
grpc-okhttp already depends on grpc-util and probably would be fine
without round_robin on Android.
There's not actually a circular dependency, but some tools can't handle
the compile vs runtime distinction. Such tools are broken, but fixes
have been slow and this approach works with no real downfalls.
Works around #10576, #10701
* netty: improve server handling of writes to reset streams
A server stream can be reset by the client while server writes are still queued. After the stream is reset, the netty connection will forget the stream object. The `NettyServerHandler` must deal with that situation. `sendResponseHandlers` already had some code to do that. This change standardizes that code and adds it to `sendGrpcFrame`. This fixes a potential bug where a `SendGrpcFrameCommand` with `endOfStream=true` would raise an `AssertionError` if written to a reset stream. (This bug is not currently reachable because `endOfStream=false` for all server `SendGrpcFrameCommand` objects.)
* Do not call into the encoder when we know the stream is gone.
* add final, change method permissions, add javadoc, cleanup unneeded, move updateOverallBalancingState to ClusterManagerLB and make it abstract
* Restructure to eliminate the flags as protected methods
* Move methods around so that the candidates for override are near the top.
* Reorder picker methods lower
Http2OkHttp is now unnecessary, as Http2Test tests OkHttp client to
Netty server. receivedDataForFinishedStream() was the only remaining
unique test and it seems already covered by AbstractInteropTest these
days.
We prefer to test using the stable APIs, as they are what our users
should be using. Http2Client continues using NettyChannelBuilder because
it is intended to test grpc-netty.
Retryable was added in google-auth-library 1.5.3 to make clear the
situations that deserve a retry of the RPC. Upgrading to that caused
problems because of transitive dependency issues syncing into Google so
it was reverted in 369f87be. google-auth-library 1.11.0 changed the
approach to avoid the transitive dependency updates. cl/601545581
upgraded to 1.22.0 inside Google. Bump to that version and swap away
from the imprecise IOException heuristic. go/auth-correct-retry
Fixes#6808
The RetryTest was flaky, and it seems to have been caused by the client
and server getting assigned to the same event loop. Separating the two
reduces the flake rate from ~3% to less than 0.1% (no flakes in 1000 runs).
While I was here fixing the executors, I reduced the number of threads
created and shut down the threads after they are no longer used. This
had no impact on the flake rate (no flakes in 1000 runs).
`envoyproxy/envoy`: Sync protos to the latest imported version
147e6b9523
(commit 2024-01-24, cl/604403196).
Should be a noop, just a routine xDS proto update to make upcoming
RLQS-related imports simpler.
Minor refactor to the tlsContextManager to not expose itself on the xdsClientImpl constructor.
This is to allow people who plug in their own xdsTransportFactory to use the API easily.
Currently a few of the interfaces needed to define and start a watch for an xDS resource are package-private, so they can't be used outside of io.grpc.xds. Exposing them allows users to define their own custom resources and start a watch along with the default supported resources.
Also as part of this change, move an Exception defined in XdsClientImpl into XdsResourceType. As XdsClientImpl is an implementation class, it makes more sense to expose the exception via the XdsResourceType class.
I initially omitted the visibility modifier because this class began as an interface. Since it moved to an abstract class, we must make it public so it can be overridden by subclasses in the integrator's packages.
Part of #10566.
Splits the :grpc-android-interop-testing:assembleDebug and
:grpc-android-interop-testing:assembleDebugAndroidTest build
targets with hopes of avoiding OOMs.
- Multiple test cases assumed all messages would arrive on a single MessageProducer but this isn't guaranteed by the API contract.
- testBadTransactionStreamThroughput_b163053382 was writing `serverCallsCompleted` on one thread and reading it on another without synchronization. A deeper problem was that waiting for the call to complete on the server doesn't mean messages are immediately available on the client.
- Replaced 100ms polling loops with wait()/notifyAll()
- Close() InputStreams that we read as required by the MessageProducer#next contract.
Passes with --runs_per_test=1000
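Replacing a polling loop with wait()/notifyAll() can be sketched like this. `MessageLatchSketch` is a hypothetical helper, not the test code from this change; it shows the usual guarded-wait loop with a deadline, so that spurious wakeups and timeouts are both handled.

```java
import java.util.concurrent.TimeUnit;

final class MessageLatchSketch {
  private final Object lock = new Object();
  private int received; // guarded by lock

  /** Called by the producer thread for every delivered message. */
  void onMessage() {
    synchronized (lock) {
      received++;
      lock.notifyAll(); // wake any thread blocked in awaitMessages()
    }
  }

  /** Blocks until at least {@code expected} messages arrived or the timeout elapses. */
  boolean awaitMessages(int expected, long timeoutMillis) {
    long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    synchronized (lock) {
      while (received < expected) { // re-check the condition after every wakeup
        long remainingNanos = deadlineNanos - System.nanoTime();
        if (remainingNanos <= 0) {
          return false; // timed out
        }
        try {
          // +1 keeps the argument positive; wait(0) would block indefinitely.
          lock.wait(TimeUnit.NANOSECONDS.toMillis(remainingNanos) + 1);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return false;
        }
      }
      return true;
    }
  }
}
```

Unlike a 100ms polling loop, the waiter wakes as soon as the condition changes, and the shared counter is only read and written while holding the lock.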
To support runfiles, the rule has to track more than just the
executable. `files_to_run` has both the runfile and executable
information (as separate fields), as does `files`, (combined as depset).
So using those when able is inherently "safe." `files_to_run.executable`
is only the executable, so does not propagate dependency information,
so we make sure to pass `files` to the rule in addition.
(`files_to_run.executable` is formatted into a string, so it wouldn't
carry depset information anyway.)
As originally noticed in cl/597962426
* services: Remove deprecated `io.grpc.services.BinaryLogs`
* services: Remove //services:binarylog bazel target
Was deprecated more than 2 years ago in bab1fe3.
`io.grpc.protobuf.services.BinaryLogs` should be used instead.
Experimental tracking ticket:
https://github.com/grpc/grpc-java/issues/4017
6efa9ee3 added `volatile` to `attributes` after TSAN detected a data
race that was added in 91d15ce4. The race was because attributes may be
read from another thread after `transportReady()`, and the
post-filtering assignment occurred after `transportReady()`. The code
now filters the attributes separately so they are updated before calling
`transportReady()`.
Original TSAN failure:
```
Read of size 4 at 0x0000cd44769c by thread T23:
#0 io.grpc.netty.NettyClientHandler.getAttributes()Lio/grpc/Attributes; NettyClientHandler.java:327
#1 io.grpc.netty.NettyClientTransport.getAttributes()Lio/grpc/Attributes; NettyClientTransport.java:363
#2 io.grpc.netty.NettyClientTransport.newStream(Lio/grpc/MethodDescriptor;Lio/grpc/Metadata;Lio/grpc/CallOptions;[Lio/grpc/ClientStreamTracer;)Lio/grpc/internal/ClientStream; NettyClientTransport.java:183
#3 io.grpc.internal.MetadataApplierImpl.apply(Lio/grpc/Metadata;)V MetadataApplierImpl.java:74
#4 io.grpc.auth.GoogleAuthLibraryCallCredentials$1.onSuccess(Ljava/util/Map;)V GoogleAuthLibraryCallCredentials.java:141
#5 com.google.auth.oauth2.OAuth2Credentials$FutureCallbackToMetadataCallbackAdapter.onSuccess(Lcom/google/auth/oauth2/OAuth2Credentials$OAuthValue;)V OAuth2Credentials.java:534
#6 com.google.auth.oauth2.OAuth2Credentials$FutureCallbackToMetadataCallbackAdapter.onSuccess(Ljava/lang/Object;)V OAuth2Credentials.java:525
...
Previous write of size 4 at 0x0000cd44769c by thread T24:
#0 io.grpc.netty.NettyClientHandler$FrameListener.onSettingsRead(Lio/netty/channel/ChannelHandlerContext;Lio/netty/handler/codec/http2/Http2Settings;)V NettyClientHandler.java:920
#1 io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onSettingsRead(Lio/netty/channel/ChannelHandlerContext;Lio/netty/handler/codec/http2/Http2Settings;)V DefaultHttp2ConnectionDecoder.java:515
...
```
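The ordering fix described before the TSAN trace can be sketched as follows. The types are hypothetical simplifications (`ReadyOrderingSketch`, a `Listener` with `filterAttributes` and `transportReady`, and a String in place of the real Attributes type); the sketch only shows that the filtered value is stored before readiness is announced, so nothing is written to the field after other threads may start reading it.

```java
final class ReadyOrderingSketch {
  /** Hypothetical listener; the real interface and method names differ. */
  interface Listener {
    String filterAttributes(String attributes);

    void transportReady();
  }

  private String attributes = "base";

  /** Called once the connection is usable, for example after settings are received. */
  void onSettingsRead(Listener listener) {
    // 1) Filter and store the attributes first.
    attributes = listener.filterAttributes(attributes);
    // 2) Only then announce readiness; no write to attributes follows.
    listener.transportReady();
  }

  String getAttributes() {
    return attributes;
  }
}
```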
io.grpc.util.CertificateUtils does much of the same thing as xds's
CertificateUtils, but also supports EC keys. The xds code pre-dates the
grpc-util class, so it isn't surprising it wasn't using it.
There's a good number of usages of the xds CertificateUtils, so I just
got rid of the duplicate implementation, but didn't yet bother changing
callers to io.grpc.util.
* Provide a default implementation for new method added to ManagedTransport.Listener to support ClientTransportFilters
* Relax test constraint to reduce flakiness due to timing.
* Add test for listener.filterTransport.
* View -> Tool Windows -> Gradle -> Edit Run Configuration -> Defaults -> JUnit -> Before launch -> + -> Run Gradle task, enter the task in the build.gradle that sets the javaagent.
Step 1 must be taken; otherwise, with the default JUnit Test Runner, running a single test in the IDE will trigger all the tests.
## Guidelines for Pull Requests
How to get your contributions merged smoothly and quickly.
- Create **small PRs** that are narrowly focused on **addressing a single concern**. We often receive PRs that try to fix several things at once; if only one fix is considered acceptable, nothing gets merged and both the author's and reviewer's time is wasted. Create more PRs to address different concerns and everyone will be happy.
- For speculative changes, consider opening an issue and discussing it first. If you are suggesting a behavioral or API change, consider starting with a [gRFC proposal](https://github.com/grpc/proposal).
- Provide a good **PR description** as a record of **what** change is being made and **why** it was made. Link to a github issue if it exists.
- Don't fix code style and formatting unless you are already changing that line to address an issue. PRs with irrelevant changes won't be merged. If you do want to fix formatting or style, do that in a separate PR.
- Unless your PR is trivial, you should expect there will be reviewer comments that you'll need to address before merging. We expect you to be reasonably responsive to those comments, otherwise the PR will be closed after 2-3 weeks of inactivity.
- Maintain **clean commit history** and use **meaningful commit messages**. See [maintaining clean commit history](#maintaining-clean-commit-history) for details.
- For speculative changes, consider opening an issue and discussing it to avoid
wasting time on an inappropriate approach. If you are suggesting a behavioral
structure. Have a good **commit description** as a record of **what** and
**why** the change is being made. Link to a GitHub issue if it exists. The
commit description makes a good PR description and is auto-copied by GitHub if
you have a single commit when creating the PR.
If your change is mostly for a single module (e.g., other module changes are
trivial), prefix your commit summary with the module name changed. Instead of
"Add HTTP/2 faster-than-light support to gRPC Netty" it is more terse as
"netty: Add faster-than-light support".
- Don't fix code style and formatting unless you are already changing that line
to address an issue. If you do want to fix formatting or style, do that in a
separate PR.
- Unless your PR is trivial, you should expect there will be reviewer comments
that you'll need to address before merging. Address comments with additional
commits so the reviewer can review just the changes; do not squash reviewed
commits unless the reviewer agrees. PRs are squashed when merging.
- Keep your PR up to date with upstream/master (if there are merge conflicts, we can't really merge your change).
- **All tests need to be passing** before your change can be merged. We recommend you **run tests locally** before creating your PR to catch breakages early on. Also, `./gradlew build` (`gradlew build` on Windows) **must not introduce any new warnings**.