Just an is a8de9f0, lack of equals causes cluster_resolver to consider every update a different configuration and restart itself.
Handling NaN should really be prevented with validation, but it looks like that
would lead to yak shaving at the moment.
b/435208946
Since c4256add4 we no longer fabricate a TRANSIENT_FAILURE update from
children. However, previously that would have set
seenReadyOrIdleSinceTransientFailure = false and prevented future timer
creation. If a LB policy gives extraneous updates with state CONNECTING,
then it was possible to re-create failOverTimer which would then wait
the 10 seconds for the child to finish CONNECTING. We only want to give
the child one opportunity after transitioning out of READY/IDLE.
https://github.com/grpc/proposal/pull/509
Http2RstCounterEncoder has to be constructed before
NettyServerHandler/Http2ConnectionHandler so it must be static. Thus the
code/counters were moved into RstStreamCounter which then can be
constructed earlier and shared.
This depends on Netty 4.1.124 for a bug fix to actually call the
encoder:
be53dc3c9a
This implicitly disables NettyAdaptiveCumulator (#11284), which can have a
performance impact. We delayed upgrading Netty to give time to rework
the optimization, but we've gone too long already without upgrading
which causes problems for vulnerability tracking.
Notably, protobuf to 3.25.8, opentelemetry to 1.52.0. Protobuf in Bazel
has 25.5 in the BCR and it seems better to align the WORKSPACE
with that version. But we can't actually use 25.5 in BCR because it is
incompatible with Bazel 7.
This allows a server with access to PeerUid to check additional application-layer security policy *after* the call itself is authorized by the transport layer. Cross cutting application-layer checks could be done from a ServerInterceptor (RPC method level policy, say). Checks based on the substance of a request message could be done by the individual RPC method implementations themselves.
Instead of representing an aggregate cluster as a single cluster whose
priorities come from different underlying clusters, represent an aggregate cluster as an instance of a priority LB policy where each child is a cds LB policy for the underlying
cluster.
Avoiding so many deps will allow us to upgrade the protos without being
forced to upgrade to protobuf-java 4.x. It also removes the remaining
non-bzlmod dependencies.
It'd be really easy to get this wrong, so we do two things 1) mirror the
gradle configuration as much as possible, as that sees a lot of testing,
and 2) run the fake control plane with the _results_ of jarjar. There's
lots of classes that we could mess up, but that at least kicks the tires.
XdsTestUtils.buildRouteConfiguration() was moved to ControlPlaneRule to
stop the unnecessary circular dependency between the classes and to
avoid the many dependencies of XdsTestUtils.
I'm totally hacking java_grpc_library to improve the dependency
situation. Long-term, I think we will stop building Java libraries with
Bazel and require users to rely entirely on Maven Central. That seems to
be the direction Bazel is going and it will greatly simplify the
problems we've seen with protobuf having a single repository for many
languages. So while the hack isn't too bad, I hope we won't have to live
with it long-term.
The resource subscription to the fallback target was done only at the time of falling back, which can cause rpcs to fail. This change makes the fallback target to be subscribed and cached earlier, similar to C++ and go gRPC implementations.
The PriorityLB predates A56. tryNextPriority() now matches
ChoosePriority() from the gRFC.
The biggest change is waiting on CONNECTING children instead of failing
after the failOverTimer fires. The failOverTimer should be used to start
lower priorities more eagerly, but shouldn't cause the overall
connectivity state to become TRANSIENT_FAILURE on its own. The prior
behavior of creating the "Connection timeout for priority" failing
picker was particularly strange, because it didn't update child's
connectivity state. This previous behavior was creating errors because
of the failOverTimer with no way to diagnose what was going wrong.
b/428517222
The main reason I made a change here was to fix the tense from the
deadline "will be exceeded in" to "was exceeded after". But we really
don't want to be doing the string formatting unless the deadline is
actually exceeded. There were a few more changes to make some variables
effectively final.
Fix HashSet / HashMap initializations to have sufficient capacity allocated based on the number of keys to be inserted, without which it would always lead to a rehash / resize operation.
In #12185, RPCs were randomly hanging. In #12207 this was tracked down
to the headers promise completing successfully, but the netty stream
was null. This was because the headers write hadn't completed but
stream.close() had been called by goingAway().
In observed cases, whether RST_STREAM or another failure from netty or
the server, listeners can fail to be notified when a connection yields a
null stream for the selected streamId. This causes hangs in clients,
despite deadlines, with no obvious resolution.
Tests which relied upon this promise succeeding must now change.
LoadBalancers shouldn't be called after shutdown(), but RingHashLb could
have enqueued work to the SynchronizationContext that executed after
shutdown(). This commit fixes problems discovered when auditing all LBs
usage of the syncContext for that type of problem.
Similarly, PickFirstLb could have requested a new connection after
shutdown(). We want to avoid that sort of thing too.
RingHashLb's test changed from CONNECTING to TRANSIENT_FAILURE to get
the latest picker. Because two subchannels have failed it will be in
TRANSIENT_FAILURE. Previously the test was using an older picker with
out-of-date subchannelView, and the verifyConnection() was too imprecise
to notice it was creating the wrong subchannel.
As discovered in b/430347751, where ClusterImplLb was seeing a new
subchannel being called after the child LB was shutdown (the shutdown
itself had been caused by RingHashConfig not implementing equals() and
was fixed by a8de9f07ab, which caused ClusterResolverLb to replace its
state):
```
java.lang.NullPointerException
at io.grpc.xds.ClusterImplLoadBalancer$ClusterImplLbHelper.createClusterLocalityFromAttributes(ClusterImplLoadBalancer.java:322)
at io.grpc.xds.ClusterImplLoadBalancer$ClusterImplLbHelper.createSubchannel(ClusterImplLoadBalancer.java:236)
at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:47)
at io.grpc.util.ForwardingLoadBalancerHelper.createSubchannel(ForwardingLoadBalancerHelper.java:47)
at io.grpc.internal.PickFirstLeafLoadBalancer.createNewSubchannel(PickFirstLeafLoadBalancer.java:527)
at io.grpc.internal.PickFirstLeafLoadBalancer.requestConnection(PickFirstLeafLoadBalancer.java:459)
at io.grpc.internal.PickFirstLeafLoadBalancer.acceptResolvedAddresses(PickFirstLeafLoadBalancer.java:174)
at io.grpc.xds.LazyLoadBalancer$LazyDelegate.activate(LazyLoadBalancer.java:64)
at io.grpc.xds.LazyLoadBalancer$LazyDelegate.requestConnection(LazyLoadBalancer.java:97)
at io.grpc.util.ForwardingLoadBalancer.requestConnection(ForwardingLoadBalancer.java:61)
at io.grpc.xds.RingHashLoadBalancer$RingHashPicker.lambda$pickSubchannel$0(RingHashLoadBalancer.java:440)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:96)
at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:128)
at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.onData(XdsClientImpl.java:817)
```
grpc-binder clients authorize servers by checking the UID of the sender of the SETUP_TRANSPORT Binder transaction against some SecurityPolicy. But merely binding to an unauthorized server to learn its UID can enable "keep-alive" and "background activity launch" abuse, even if security policy ultimately decides the connection is unauthorized. Pre-authorization mitigates this kind of abuse by looking up and authorizing a candidate server Application's UID before binding to it. Pre-auth is especially important when the server's address is not fixed in advance but discovered by PackageManager lookup.
PROTOCOL-HTTP2.md specifies "TimeoutValue → {positive integer as ASCII
string of at most 8 digits}". Zero is not positive, so it should be
avoided. So make sure timeouts are at least 1 nanosecond instead of 0
nanoseconds.
grpc-go recently began disallowing zero timeouts in
https://github.com/grpc/grpc-go/pull/8290 which caused a regression as
grpc-java can generate such timeouts. Apparently no gRPC implementation
had previously been checking for zero timeouts.
Instead of changing the max(0) to max(1) everywhere, just move the max
handling into TimeoutMarshaller, since every caller of TIMEOUT_KEY was
doing the same max() handling.
Before fd8fd517d (in 2016!), grpc-java actually behaved correctly, as it
failed RPCs with timeouts "<= 0". The commit changed the handling to the
max(0) handling we see now.
b/427338711
297ab05ef converted CDS to XdsDependencyManager. This caused three
regressions:
* CdsLB2 as a RLS child would always fail with "Unable to find
non-dynamic root cluster" because is_dynamic=true was missing in
its service config
* XdsNameResolver only propagated resolution updates when the clusters
changed, so a CdsUpdate change would be ignored. This caused a hang
for RLS even with is_dynamic=true. For non-RLS the lack config update
broke the circuit breaking psm interop test. This would have been
more severe if ClusterResolverLb had been converted to
XdsDependenceManager, as it would have ignored EDS updates
* RLS did not propagate resolution updates, so CdsLB2 even with
is_dynamic=true the CdsUpdate for the new cluster would never arrive,
causing a hang
b/428120265
b/427912384
The @SystemApi runtime visibility requirement isn't really new. It has always been implicit in the required INTERACT_ACROSS_USERS permission, which (in production) can only be held by system apps.
The SDK_INT >= 30 requirement was also always present, via @RequiresApi() on BinderChannelBuilder#bindAsUser. This change just updates its replacement APIs (AndroidComponentAddress and TARGET_ANDROID_USER) to require it too.
The previous code did a ping-pong to make sure the transport had enough
time to process, but then proceeded to sleep 5 seconds. That sleep would
have been needed without the ping-pong, but with the ping-pong we are
confident all events have been drained from the transport. Deleting the
unnecessary sleeps saves 10 seconds, for each of the 9 instances of this
test.
ClusterResolverLb is still doing DNS itself, so disable it in XdsDepMan
until that migration has finished. EDS is fine in XdsDepman, because
XdsClient will share the result with ClusterResolverLb.
ClusterResolverLb gets the NameResolverRegistry from
LoadBalancer.Helper, so a new API was added in NameResover.Args to
propagate the same object to the name resolver tree.
RetryingNameResolver was exposed to xds. This is expected to be
temporary, as the retrying is being removed from ManagedChannelImpl and
moved into the resolvers. At that point, DnsNameResolverProvider would
wrap DnsNameResolver with a similar API to RetryingNameResolver and xds
would no longer be responsible.
This should often not matter much, but in b/412468630 it was cleary
visible that child creation order can skew load for the first batch of
RPCs. This doesn't solve all the cases, as further-away backends will
still be less likely chosen initially and it is ignorant of the LB
policy. But this doesn't impact correctness, is easy, and is one fewer
cases to worry about.
This is missing behavior defined in gRFC A74:
> As per gRFC A31, the ConfigSelector gives each RPC a ref to the
> cluster that was selected for it to ensure that the cluster is not
> removed from the xds_cluster_manager LB policy config before the RPC
> is done with its LB picks. These cluster refs will also hold a
> subscription for the cluster from the XdsDependencyManager, so that
> the XdsDependencyManager will not stop watching the cluster resource
> until the cluster is removed from the xds_cluster_manager LB policy
> config.
Without the logic, RPCs can race and see the error:
> INTERNAL: CdsLb for cluster0: Unable to find non-dynamic root cluster
Fixes#12152. This fixes the regression introduced in 297ab05e
TimeProvider provides wall time. That can move forward and backward as time is adjusted. OutlierDetection is measuring durations, so it should use a monotonic clock.
Fixes#11622
This will be used for logical dns clusters as part of gRFC A74. Swapping
to EnumMap wasn't really necessary, but was easy given the new type
system.
I can't say I'm particularly happy with the name of the new
TrackedWatcher type, but XdsConfigWatcher prevented using "Watcher"
because it won't implement the new interface, and ResourceWatcher
already exists in XdsClient. So we have TrackedWatcher, WatcherTracer,
TypeWatchers, and TrackedWatcherType.
It was introduced in fcb5c54e4 because at the time we didn't change the
API to communicate the status. When onResult2() was introduced in
90d0fabb1 this hack stopped being necessary.
The watchers can be completely regular, so the base class can do the
cache management while the subclasses are only concerned with
subscribing to children.
We here address the following obstacles in grpc-java to using Bazel's
--incompatible_disable_target_default_provider_fields flag:
```
ERROR: /private/var/tmp/_bazel_dws/7fd3cd5077fbf76d9e2ae421c39ef7ed/external/googleapis+/google/devtools/build/v1/BUILD.bazel:81:18: in _java_grpc_library rule @@googleapis+//google/devtools/build/v1:build_java_grpc:
Traceback (most recent call last):
File "/private/var/tmp/_bazel_dws/7fd3cd5077fbf76d9e2ae421c39ef7ed/external/grpc-java+/java_grpc_library.bzl", line 94, column 30, in _java_rpc_library_impl
args.add(toolchain.plugin.files_to_run.executable, format = "--plugin=protoc-gen-rpc-plugin=%s")
Error: Accessing the default provider in this manner is deprecated and will be removed soon. It may be temporarily re-enabled by setting --incompatible_disable_target_default_provider_fields=false. See https://github.com/bazelbuild/bazel/issues/20183 for details.
ERROR: /private/var/tmp/_bazel_dws/7fd3cd5077fbf76d9e2ae421c39ef7ed/external/googleapis+/google/devtools/build/v1/BUILD.bazel:81:18: Analysis of target '@@googleapis+//google/devtools/build/v1:build_java_grpc' failed
ERROR: Analysis of target '//src:bazel' failed; build aborted: Analysis failed
```
Just use a regular method instead of reusing the EvictionListener API.
Fix a few comments as well. Both of these changes were based on review
comments to pre-existing code in #11203.
Contributes to #11243
I noticed we deviated from gRFC A37 in some ways. It turned out those
were added to the gRFC later in https://github.com/grpc/proposal/pull/344:
- NACKing empty aggregate clusters
- Failing aggregate cluster when children could not be loaded
- Recusion limit of 16. We had this behavior already, but it was
ascribed to matching C++
There's disagreement on whether we should actually fail the aggregate
cluster for bad children, so I'm preserving the pre-existing behavior
for now.
The code is now doing a depth-first leaf traversal, not breadth-first.
This was odd to see, but the code was also pretty old, so the reasoning
seems lost to history. Since we haven't seen more than a single level of
aggregate clusters in practice, this wouldn't have been noticed by
users.
XdsDependencyManager.start() was created to guarantee that the callback
could not be called before returning from the constructor. Otherwise
XDS_CLUSTER_SUBSCRIPT_REGISTRY could potentially be null.
We can easily compute the rdsName and avoiding the state means we don't
need to override onResourceDoesNotExist() to keep the cache in-sync with
the config.
1fd29bc80 replaced cancelWatcher() with watcher.close(). But setting
cancelled was missing. Because the config update checks for shutdown,
the cancelled flag no longer avoids exceptions. But it seems best to
continue avoiding any processing after close to avoid surprises.
Reference counting doesn't release cycles, so swap to a tracing garbage
collector. This greatly simplifies the code as well, as diffing is no
longer necessary. (If vanilla reference counting was used, diffing
wouldn't have been necessary either as you just increment all the new
objects and decrement the old ones. But that doesn't work when use a set
instead of an integer.)
- Use @BinderThread to document restrictions on methods and certain fields.
- Make TransactionHandler non-public since only Android should call it.
- Replace an unnecessary AtomicLong with a plain old long.
The children of aggregate clusters have a priority order, so we can't
ever throw them in an ordinary set for later iteration.
This now detects recusion limits only after subscribing, but that
matches our existing behavior in CdsLoadBalancer2. We don't get much
value detecting the limit before subscribing and doing so makes watcher
types more different.
Loops are still a bit broken as they won't be unwatched when orphaned,
as they will form a reference loop. In CdsLoadBalancer2, duplicate
clusters had duplicate watchers so there was single-ownership and
reference cycles couldn't form. Fixing that is a bigger change.
Intermediate aggregate clusters are now included in XdsConfig, just for
simplicity. It doesn't hurt anything whether they are present or
missing. but it required updates to some tests.
* xds: Don't allow hostnames in address field
gRFC A27 specifies they must be IPv4 or IPv6 addresses. Certainly doing
a DNS lookup hidden inside the config object is asking for trouble.
The tests were accidentally doing a lot of failing DNS requests greatly
slowing them down. On my desktop, which made the problem most obvious
with five search paths in /etc/resolv.conf, :grpc-xds:test decreased
from 66s to 29s. The majority of that is XdsDependencyManagerTest which
went from 33s to .1s, as it generated a UUID for the in-process
transport each test and then used it as a hostname, which defeated
Java's DNS (negative) cache. The slowness was noticed because
XdsDependencyManagerTest should have run quickly as a single thread
without I/O, but was particularly slow on my desktop.
The cleanup caused me to audit serverName and the weird places it went.
I think some of them were tricks for XdsClientFallbackTest to squirrel
away something distinguishing, although reusing the serverName is asking
for confusion as is including the tricks in "shared" utilities.
XdsClientFallbackTest does have some non-trivial changes, but this seems
to fix some pre-existing bugs in the tests.
* Add failing hostname unit test
SOTW is unique in that it can become absent after being found. But if we
NACK when initially loading the resource, we don't want to delay, depend
on the resource timeout, and then give a poor error.
This was noticed while adding the EDS restriction that address is not a
hostname and some tests started hanging instead of failing quickly.
The optimization makes the code more complicated. Yes, we know that
maybePublishConfig() will do no work because of an outstanding watch,
but we don't do this for other new watchers created and doing so would
just make the code more bug-prone. This removes a difference in how
different watcher types are handled.
This provides better type and missing-map handling. Note that
getWatchers() now implicitly creates the map if it doesn't exist,
instead of just returning an empty map. That makes it a bit easier to
use and more importantly avoids accidents where a bug tries to modify
the immutable map.
The most important change here is to handle subscribeToCluster() calls
after shutdown(), and preventing the internal state from being heavily
confused as the assumption is there are no watchers after shutdown().
ClusterSubscription.closed isn't strictly necessary, but I don't want
the code to depend on double-deregistration being safe.
maybePublishConfig() isn't being called after shutdown(), but adding the
protection avoids a class of bugs that would cause channel panic.
gRPC doesn't create the CronetEngine, so even though streaming is
observing the CronetEngine's User-Agent, we don't have control of that.
In addition, CronetEngines are commonly shared between gRPC and normal
HTTP traffic, so we don't actually expect users to set gRPC in engine's
user agent. The existing behavior seems to be working as well as
feasible.
Fixes#11582
android-interop has been failing to build since 46485c8 because it
didn't have cmake installed and defined LDFLAGS/CXXFLAGS with pkg-config
before make_dependencies.sh had been run.
Android-interop didn't verify the codegen is up-to-date. Building the
codegen was just a relic from when android was its own separate gradle
build. Avoiding codegen means we don't have to compile absl/protobuf and
have a C++ toolchain.
After many years of issue 9179 being open, there's been nothing to show
that we need the javax.annotations.Generated annotation. Most tools use
file paths and a few check for annotations with "Generated" in the name.
ErrorProne has a few that check for javax.annotations.Generated, but
only UnnecessarilyFullyQualified looks like it'd be a problem and it is
disabled by default. We're not getting any more information, no users
have reported issues with `@generated=omit`, and the existing dependency
is annoying users, so just drop it.
Given we will still retain the GrpcGenerated annotation, it seems highly
likely things are already okay. Even if there are problems they would
probably be addressed by adding a io.grpc.stub.annotations.Generated
annotation or small tweaks. In the short-term, (non-Bazel) users can use
`@generated=javax`, but long-term we could consider removing the option
assuming we've resolved any outstanding issues.
We will want to update the examples and the README to remove the
org.apache.tomcat:annotations-api dependency after the next release.
Fixes#9179
This version runs way faster than BinderTransportTest and doesn't require an actual Android device/emulator. It'll allow future tests to simulate things that are difficult/impossible on real Android, at the price of some realism.
The plugin now outputs to "generated/sources". The IDE configuration
explicitly adding the folders to the source sets hasn't been needed for
some years.
This prevents a NPE and subsequent channel panic when trying to build a
config (because there are no watchers, so waitingOnResource==false)
without any listener and route.
```
java.lang.NullPointerException: Cannot invoke "io.grpc.xds.XdsDependencyManager$RdsUpdateSupplier.getRdsUpdate()" because "routeSource" is null
at io.grpc.xds.XdsDependencyManager.buildUpdate(XdsDependencyManager.java:295)
at io.grpc.xds.XdsDependencyManager.maybePublishConfig(XdsDependencyManager.java:266)
at io.grpc.xds.XdsDependencyManager$EdsWatcher.onChanged(XdsDependencyManager.java:899)
at io.grpc.xds.XdsDependencyManager$EdsWatcher.onChanged(XdsDependencyManager.java:888)
at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.notifyWatcher(XdsClientImpl.java:929)
at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.lambda$onData$0(XdsClientImpl.java:837)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:96)
```
I think this fully-fixes the problem today, but not tomorrow.
subscribeToCluster() is racy as well, but not yet used.
This was noticed when idleTimeout was firing, with some other code
calling getState(true) to wake the channel back up. That may have made
this panic more visible than it would be otherwise, but that has not
been investigated.
b/412474567
The security code referenced fields removed from gRFC A29 before it was
finalized.
Note that this fixes a bug in CommonTlsContextUtil where
CombinedValidationContext was not checked. I believe this was the only
location with such a bug as I audited all non-test usages of
has/getValidationContext() and confirmed they have have a corresponding
has/getCombinedValidationContext().
Generating a KeyPair is very expensive when running with TSAN, because
TSAN keeps the JVM in interpreted mode. This speeds up the test running
on my desktop from .368s to .151s; faster, but nobody cares. With TSAN,
the speedup is from 150-500s to 4-6s. Within Google the test was timing
out because it was taking so long. While we can increase the timeout,
it seems better to speed up the test in this easy way.
"EXP" stood for experimental and all documentation that referenced it made it clear it was experimental. It's been some years since we started logging a message when it was used to say it will be deleted. There's no time like the present to delete it.
I think at some point there were more usages in the tests. But now it
is pretty easy.
PriorityLb.ChildLbState.picker is initialized to
FixedResultPicker(NoResult). So now that GracefulSwitchLb is using the
same picker, equals() is able to de-dup an update.
Previously it would wait for the new LB to enter READY. However, that
prevents there being an upper-bound on how long the old policy will
continue to be used. The point of graceful switch is to avoid RPCs
seeing increased latency when we swap config. We don't want it to
prevent the system from becoming eventually consistent.
Some clients watching health status can cancel their watch and `HealthService` when trying to notify these watchers were getting CANCELLED exception because there was no cancellation handler set on the `StreamObserver`. This change sets the cancellation handler that removes the watcher from the set of watcher clients to be notified of the health status.
panic() calls a good amount of code, so it could get another exception.
The SynchronizationContext is running on an arbitrary thread and we
don't want to propagate this secondary exception up its stack (to be
handled by its UncaughtExceptionHandler); it we wanted that we'd
propagate the original exception.
This second exception will only be seen in the logs; the first exception
was logged and will be used to fail RPCs.
Also related to http://yaqs/8493785598685872128 and b692b9d26
This is to support gRFC A83 xDS GCP Authentication Filter:
> Otherwise, the filter will look in the CDS resource's metadata for a
> key corresponding to the filter's instance name.
It is much harder to debug refcounting problems when we ignore
impossible situations. So make such impossible cases complain loudly so
the bug is obvious.
This allows Filters to access the xds configuration for their own
processing. From gRFC A83:
> This data is available via the XdsConfig attribute introduced in A74.
> If the xDS ConfigSelector is not already passing that attribute to the
> filters, it will need to be changed to do so.
Contributes to the gRFC A74 effort.
https://github.com/grpc/proposal/blob/master/A74-xds-config-tears.md
The alternative to using Mockito's ArgumentMatcher is to use Hamcrest.
However, Hamcrest did not impress me. ArgumentMatcher is trivial if you
don't care about the error message.
This fixes a pre-existing issue where ConfigSelector.releaseCluster
could revert the LB config back to using cluster manager after releasing
all RPCs using a cluster have committed.
Co-authored-by: Larry Safran <lsafran@google.com>
It is confusing/harder to read the logs when the
activations/deactivations because of the config happen before the log
entry describing the new config.
Listener2.onResult() doesn't require running in the sync context, so
when called from the sync context it is guaranteed not to do its
processing immediately (instead, it schedules work into the sync
context).
The code was doing an update dance: 1) update service config to add new
cluster, 2) update config selector to use new cluster, 3) update service
config to remove old clusters. But the onResult() wasn't being processed
immediately, so the actual execution order was 2, 1, 3 which has a small
window where RPCs will fail. But onResult2() does run immediately. And
since ca4819ac6, updateBalancingState() updates the picker immediately.
cleanUpRoutes() was also racy because it updated the routingConfig
before swapping to the new config selector, so RPCs could fail saying
there was no route instead of the useful error message. Even with the
opposite order, some RPCs may be executing the while loop of
selectConfig(), trying to acquire a cluster. The code unreffed the
clusters before updating the routingConfig, so those RPCs could go into
a tight loop until the routingConfig was updated. Also, once the
routingConfig was updated to EMPTY those RPCs would similarly
see the wrong error message. To give the correct error message,
selectConfig() must fail such RPCs directly, and once it can do that
there's no need to stop using the config selector in error cases. This
has the benefit of fewer moving parts and more consistent threading
among cases.
The added test was able to detect the race 2% of the time. The slower
the code/machine, the more reliable the test failed. ca4819ac6 along
with this commit reduced it to 0 failures in 1000 runs.
Discovered when investigating b/394850611
ffcc360ba adjusted updateBalancingState() to require being run within
the sync context. However, it still queued the work into the sync
context, which was unnecessary. This re-entering the sync context
unnecessarily delays the new state from being used.
This PR adds support filter state retention in Java. The mechanism
will be similar to the one described in [A83]
(https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md#filter-call-credentials-cache)
for C-core, and will serve the same purpose. However, the
implementation details are very different due to the different nature
of xDS HTTP filter support in C-core and Java.
### Filter instance lifecycle
#### xDS gRPC clients
New filter instances are created per combination of:
1. `XdsNameResolver` instance,
2. Filter name+typeUrl as configured in
HttpConnectionManager (HCM) http_filters.
Existing client-side filter instances are shutdown:
- A single a filter instance is shutdown when an LDS update contains
HCM that is missing filter configuration for name+typeUrl
combination of this instance.
- All filter instances when watched LDS resource is missing from an
LDS update.
- All filter instances name resolver shutdown.
#### xDS-enabled gRPC servers
New filter instances are created per combination of:
1. Server instance,
2. FilterChain name,
3. Filter name+typeUrl as configured in FilterChain's HCM.http_filters
Filter instances of Default Filter Chain is tracked separately per:
1. Server instance,
2. Filter name+typeUrl in default_filter_chain's HCM.http_filters.
Existing server-side filter instances are shutdown:
- A single a filter instance is shutdown when an LDS update contains
FilterChain with HCM.http_filters that is missing configuration for
filter name+typeUrl.
- All filter instances associated with the FilterChain when an LDS
update no longer contains FilterChain's name.
- All filter instances when watched LDS resource is missing from an
LDS update.
- All filter instances on server shutdown.
### Related
- Part 1: #11883
`XdsServerWrapper#generatePerRouteInterceptors` was always intended
to be executed within a sync context. This PR ensures that by calling
`syncContext.throwIfNotInThisSynchronizationContext()`.
This change is needed for upcoming xDS filter state retention because
the new tests in XdsServerWrapperTest flake with this NPE:
> `Cannot invoke "io.grpc.xds.client.XdsClient$ResourceWatcher.onChanged(io.grpc.xds.client.XdsClient$ResourceUpdate)" because "this.ldsWatcher" is null`
We want to move away from handleResolvedAddresses(). These are "easy" in
that they need no logic. LBs extending ForwardingLoadBalancer had the
method duplicated from handleResolvedAddresses() and swapped away from
`super` because ForwardingLoadBalancer only forwards
handleResolvedAddresses() reliably today. Duplicating small methods was
less bug-prone than dealing with ForwardingLoadBalancer.
Implements [gRFC A76: explicitly setting the request hash key for the
ring hash LB policy][A76]
* Explictly setting the request hash key is guarded by the
`GRPC_EXPERIMENTAL_RING_HASH_SET_REQUEST_HASH_KEY` environment
variable until API stabilized.
Tested:
* Verified end-to-end by spinning up multiple gRPC servers and a gRPC
client that injects a custom service (load balancing) config with
`ring_hash_experimental` and a custom `request_hash_header` (with
NO associated value in the metadata headers) which generates a random
hash for each request to the ring hash LB. Verified picks/RPCs are
split evenly/uniformly across all backends.
* Ran affected unit tests with thread sanitizer and 1000 iterations to
prevent data races.
[A76]: https://github.com/grpc/proposal/blob/master/A76-ring-hash-improvements.md#explicitly-setting-the-request-hash-key
This is the first step towards supporting filter state retention in
Java. The mechanism will be similar to the one described in [A83]
(https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md#filter-call-credentials-cache)
for C-core, and will serve the same purpose. However, the
implementation details are very different due to the different nature
of xDS HTTP filter support in C-core and Java.
In Java, xDS HTTP filters are backed by classes implementing
`io.grpc.xds.Filter`, from here just called "Filters". To support
Filter state retention (next PR), Java's xDS implementation must be
able to create unique Filter instances per:
- Per HCM
`envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager`
- Per filter name as specified in
`envoy.extensions.filters.network.http_connection_manager.v3.HttpFilter.name`
This PR **does not** implements Filter state retention, but lays the
groundwork for it by changing how filters are registered and
instantiated. To achieve this, all existing Filter classes had to be
updated to the new instantiation mechanism described below.
Prior to these this PR, Filters had no livecycle. FilterRegistry
provided singleton instances for a given typeUrl. This PR introduces
a new interface `Filter.Provider`, which instantiates Filter classes.
All functionality that doesn't need an instance of a Filter is moved
to the Filter.Provider. This includes parsing filter config proto
into FilterConfig and determining the filter kind
(client-side, server-side, or both).
This PR is limited to refactoring, and there's no changes to the
existing behavior. Note that all Filter Providers still return
singleton Filter instances. However, with this PR, it is now possible
to create Providers that return a new Filter instance each time
`newInstance` is called.
A failing Status from acceptResolvedAddresses means something is wrong
with the config, but parts of the config may still have been applied.
Thus there are now two possible flows: errors that should prevent
updateOverallBalancingState() and errors that should have no effect
other than the return code. To manage that, MultChildLb must always be
responsible for calling updateOverallBalancingState().
acceptResolvedAddressesInternal() was inlined to make that error
processing easier. No existing usages actually needed to have logic
between updating the children and regenerating the picker.
RingHashLb already was verifying that the address list was not empty, so
the short-circuiting when acceptResolvedAddressesInternal() returned an
error was impossible to trigger. WrrLb's updateWeightTask() calls the
last picker, so it can run before acceptResolvedAddressesInternal(); the
only part that matters is re-creating the weightUpdateTimer.
S2AStub is an internal API and shouldn't be used outside of s2a. It is
still available for tests.
IntegrationTest was moved to io.grpc.s2a. It uses a io.grpc.s2a class,
so shouldn't be in internal.handler
Switched to using 8192 which is the current value of Segment.SIZE and just have a test check that they are equal.
The reason for doing this is that Segment.SIZE is Kotlin internal so shouldn't be used outside of its module.
To try to aid failure when building android-interop-testing
```
The Daemon will expire after the build after running out of JVM heap space.
The project memory settings are likely not configured or are configured to an insufficient value.
The daemon will restart for the next build, which may increase subsequent build times.
These settings can be adjusted by setting 'org.gradle.jvmargs' in 'gradle.properties'.
The currently configured max heap space is '512 MiB' and the configured max metaspace is '384 MiB'.
...
Exception in thread "Daemon client event forwarder" java.lang.OutOfMemoryError: Java heap space
...
> Task :grpc-android-interop-testing:mergeDexDebug FAILED
ERROR:D8: java.lang.OutOfMemoryError: Java heap space
com.android.builder.dexing.DexArchiveMergerException: Error while merging dex archives:
```
* don't process resourceDoesNotExist for watchers that have been cancelled.
* Change test to use an ArgumentMatcher instead of expecting that only the final result will be sent since depending on timing there may be configs sent for clusters being removed with their entries as errors.
Currently this improves 2 flows
1. Known length message which length is greater than 1Mb. Previously the
first buffer was 1Mb, and then many buffers of 4096 bytes (from
CodedOutputStream), now subsequent buffers are also up to 1Mb
2. In case of compression, the first write is always 10 bytes buffer
(gzip header), but worth allocating more space
* Change XdsConfig to match spec with a `children` object holding either `a list of leaf cluster names` or `an EdsUpdate`. Removed intermediate aggregate nodes from `XdsConfig.clusters`.
* core: updates the backoff range being used from [0, 1] to [0.8, 1.2] as per the A6 redefinition
* adds a flag for experimental jitter
* xds: Allow FaultFilter's interceptor to be reused
This is the only usage of PickSubchannelArgs when creating a filter's
ClientInterceptor, and a follow-up commit will remove the argument and
actually reuse the interceptors. Other filter's interceptors can
already be reused.
There doesn't seem to be any significant loss of legibility by making
FaultFilter a more ordinary interceptor, but the change does cause the
ForwardingClientCall to be present when faultDelay is configured,
independent of whether the fault delay ends up being triggered.
Reusing interceptors will move more state management out of the RPC path
which will be more relevant with RLQS.
* netty: Removed 4096 min buffer size (#11856)
* netty: Removed 4096 min buffer size
* turns the flag in a var for better efficiency
---------
Co-authored-by: Eric Anderson <ejona@google.com>
The variables from the do-while are no longer initialized to let the
compiler verify that the loop sets each. Unnecessary comparisons to null
are also removed and is more obvious as the variables are never set to
null. Added a minor optimization of computing the RPCs path once instead
of once for each route. The variable declarations were also sorted to
match their initialization order.
This does fix an unlikely bug where if the old code could successfully
matched a route but fail to retain the cluster, then when trying a
second time if the route was _not_ matched it would re-use the prior route
and thus infinite-loop failing to retain that same cluster.
It also adds a missing cast to unsigned long for a uint32 weight. The old
code would detect if the _sum_ was negative, but a weight using 32 bits
would have been negative and never selected.
The Kokoro aarch64 build runs on x86 with an emulator, and has always
been flaky due to the slow execution speed. At present it is continually
failing due to deadline exceededs. GitHub Actions is running on aarch64
hardware, so is much faster (4 minutes vs 30 minutes, without including
the speedup from GitHub Action's caching).
This adds a createFrom(Attributes) to mirror the check(Attributes) added
in ba8ab79. It also adds conveniences for ClientCall for both
createFrom() and check(). This allows getting peer information from
ClientCall and CallCredentials.RequestInfo, as was already available
from ServerCall.
The tests were reworked to test the Attribute-based methods and then
only basic tests for client/server.
Fixes#11042
The field was made final in 4b52639aa but was soon reverted in 3ebb3e192
because of what I assume was a bad merge conflict resolution. The field
has contained an immutable object since its introduction in d25f5acf1,
so it is pretty likely to remain a constant in the future.
This moves the interceptor creation from the ConfigSelector to the
resource update handling.
The code structure changes will make adding support for filter
lifecycles (for RLQS) a bit easier. The filter lifecycles will allow
filters to share state across interceptors, and constructing all the
interceptors on a single thread will mean filters wouldn't need to be
thread-safe (but their interceptors would be thread-safe).
Setting the authority is only useful when creating a real stream, as
there will be a following pick otherwise. In addition, DelayedStream
will buffer each call to setAuthority() in a list and we don't want that
memory usage. Note that no LBs are using this feature yet, so users
would not have been exposed to the memory use.
We also needed to setAuthority() when the LB selected a subchannel on
the first pick attempt.
This is the only usage of PickSubchannelArgs when creating a filter's
ClientInterceptor, and a follow-up commit will remove the argument and
actually reuse the interceptors. Other filter's interceptors can
already be reused.
There doesn't seem to be any significant loss of legibility by making
FaultFilter a more ordinary interceptor, but the change does cause the
ForwardingClientCall to be present when faultDelay is configured,
independent of whether the fault delay ends up being triggered.
Reusing interceptors will move more state management out of the RPC path
which will be more relevant with RLQS.
* Have acceptResolvedAddresses() do a seek when in CONNECTING state and cleanup removed subchannels when a seek was successful.
Move cleanup of removed subchannels into a method so it can be called from 2 places in acceptResolvedAddresses.
Since the seek could mean we never looked at the first address, if we go off the end of the index and haven't looked at the all of the addresses then instead of scheduleBackoff() we reset the index and request a connection.
d65d3942e increased the test speed of
connect_then_mainServerDown_fallbackServerUp by using FakeClock.
However, it introduced a data race because FakeClock is not thread-safe.
This change injects a single thread for gRPC callbacks such that
syncContext is run on a thread under the test's control.
A simpler approach would be to expose syncContext from XdsClientImpl for
testing. However, this test is in a different package and I wanted to
avoid adding a public method.
```
Read of size 8 at 0x00008dec9d50 by thread T25:
#0 io.grpc.internal.FakeClock$ScheduledExecutorImpl.schedule(Lio/grpc/internal/FakeClock$ScheduledTask;JLjava/util/concurrent/TimeUnit;)V FakeClock.java:140
#1 io.grpc.internal.FakeClock$ScheduledExecutorImpl.schedule(Ljava/lang/Runnable;JLjava/util/concurrent/TimeUnit;)Ljava/util/concurrent/ScheduledFuture; FakeClock.java:150
#2 io.grpc.SynchronizationContext.schedule(Ljava/lang/Runnable;JLjava/util/concurrent/TimeUnit;Ljava/util/concurrent/ScheduledExecutorService;)Lio/grpc/SynchronizationContext$ScheduledHandle; SynchronizationContext.java:153
#3 io.grpc.xds.client.ControlPlaneClient$AdsStream.handleRpcStreamClosed(Lio/grpc/Status;)V ControlPlaneClient.java:491
#4 io.grpc.xds.client.ControlPlaneClient$AdsStream.lambda$onStatusReceived$0(Lio/grpc/Status;)V ControlPlaneClient.java:429
#5 io.grpc.xds.client.ControlPlaneClient$AdsStream$$Lambda+0x00000001004a95d0.run()V ??
#6 io.grpc.SynchronizationContext.drain()V SynchronizationContext.java:96
#7 io.grpc.SynchronizationContext.execute(Ljava/lang/Runnable;)V SynchronizationContext.java:128
#8 io.grpc.xds.client.ControlPlaneClient$AdsStream.onStatusReceived(Lio/grpc/Status;)V ControlPlaneClient.java:428
#9 io.grpc.xds.GrpcXdsTransportFactory$EventHandlerToCallListenerAdapter.onClose(Lio/grpc/Status;Lio/grpc/Metadata;)V GrpcXdsTransportFactory.java:149
#10 io.grpc.PartialForwardingClientCallListener.onClose(Lio/grpc/Status;Lio/grpc/Metadata;)V PartialForwardingClientCallListener.java:39
...
Previous write of size 8 at 0x00008dec9d50 by thread T4 (mutexes: write M0, write M1, write M2, write M3):
#0 io.grpc.internal.FakeClock.forwardTime(JLjava/util/concurrent/TimeUnit;)I FakeClock.java:368
#1 io.grpc.xds.XdsClientFallbackTest.connect_then_mainServerDown_fallbackServerUp()V XdsClientFallbackTest.java:358
...
```
Protobuf is interested in using absl::string_view instead of const
std::string&. Just copy to std::string as the C++17 build isn't yet
operational and that level of performance doesn't matter.
cl/711732759 b/353571051
The soak code grew considerably in 6a92a2a22e. Since it isn't a JUnit
test and doesn't resemble the other tests, it doesn't belong in
AbstractInteropTest. AbstractInteropTest has lots of users, including it
being re-compiled for use on Android, so moving it out makes the
remaining code more clear for the more common cases.
This PR resolves an issue with peer address extraction in the soak
test.
In current `TestServiceClient` implementation, the same
`clientCallCapture` atomic is shared across threads, leading to
incorrect peer extraction. This fix ensures that each thread uses a
local variable for capturing the client call.
This is in service to gRFC A89. Since the gRFC isn't finalized this
purposefully doesn't really do anything yet. The grpc-opentelemetry
change to use this optional label will be done after the gRFC is merged.
grpc-opentelemetry currently has a hard-coded list (one entry) of labels
that it looks for, and this label will need to be added.
b/356167676
These changes reduce connect_then_mainServerDown_fallbackServerUp test
time from 20 seconds to 5 s by faking time for the the does-no-exist
timer.
XdsClientImpl only uses the TimeProvider for CSDS cache details, so any
implementation should be fine. FakeXdsClient provides an implementation,
so might as well use it as it is one less clock to think about.
I noticed an old JDK 8u275 failed on the test because the modification
time's resolution was one second. A newer JDK 8u432 worked fine, so it's
not really a problem for me, but increasing the time difference is
cheap. I used two seconds as that's the resolution available on FAT
(which is unlikely to be TMPDIR, even on Windows).
Since approximately the LBv2 API (the current API) was introduced, gRPC
won't use a transport until it is ready. Long ago, transports could be
used before they were ready and these old tests were not waiting for the
negotiator to complete before starting. We need them to wait for the
handshake to complete to avoid a test-only data race in getAttributes()
noticed by TSAN.
Throwing away data frames in the Noop handshaker is necessary to act
like a normal handshaker; they don't allow data frames to pass until the
handshake is complete. Without the handling, it goes through invalid
code paths in NettyClientHandler where a terminated transport becomes
ready, and a similar data race.
```
Write of size 4 at 0x00008db31e2c by thread T37:
#0 io.grpc.netty.NettyClientHandler.handleProtocolNegotiationCompleted(Lio/grpc/Attributes;Lio/grpc/InternalChannelz$Security;)V NettyClientHandler.java:517
#1 io.grpc.netty.ProtocolNegotiators$GrpcNegotiationHandler.userEventTriggered(Lio/netty/channel/ChannelHandlerContext;Ljava/lang/Object;)V ProtocolNegotiators.java:937
#2 io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(Ljava/lang/Object;)V AbstractChannelHandlerContext.java:398
#3 io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(Lio/netty/channel/AbstractChannelHandlerContext;Ljava/lang/Object;)V AbstractChannelHandlerContext.java:376
#4 io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(Ljava/lang/Object;)Lio/netty/channel/ChannelHandlerContext; AbstractChannelHandlerContext.java:368
#5 io.grpc.netty.ProtocolNegotiators$ProtocolNegotiationHandler.fireProtocolNegotiationEvent(Lio/netty/channel/ChannelHandlerContext;)V ProtocolNegotiators.java:1107
#6 io.grpc.netty.ProtocolNegotiators$WaitUntilActiveHandler.channelActive(Lio/netty/channel/ChannelHandlerContext;)V ProtocolNegotiators.java:1011
...
Previous read of size 4 at 0x00008db31e2c by thread T4 (mutexes: write M0, write M1, write M2, write M3):
#0 io.grpc.netty.NettyClientHandler.getAttributes()Lio/grpc/Attributes; NettyClientHandler.java:345
#1 io.grpc.netty.NettyClientTransport.getAttributes()Lio/grpc/Attributes; NettyClientTransport.java:387
#2 io.grpc.netty.NettyClientTransport.newStream(Lio/grpc/MethodDescriptor;Lio/grpc/Metadata;Lio/grpc/CallOptions;[Lio/grpc/ClientStreamTracer;)Lio/grpc/internal/ClientStream; NettyClientTransport.java:198
#3 io.grpc.netty.NettyClientTransportTest$Rpc.<init>(Lio/grpc/netty/NettyClientTransport;Lio/grpc/Metadata;)V NettyClientTransportTest.java:953
#4 io.grpc.netty.NettyClientTransportTest.huffmanCodingShouldNotBePerformed()V NettyClientTransportTest.java:631
...
```
```
Read of size 4 at 0x00008f983a3c by thread T4 (mutexes: write M0, write M1):
#0 io.grpc.netty.NettyClientHandler.getAttributes()Lio/grpc/Attributes; NettyClientHandler.java:345
#1 io.grpc.netty.NettyClientTransport.getAttributes()Lio/grpc/Attributes; NettyClientTransport.java:387
#2 io.grpc.netty.NettyClientTransport.newStream(Lio/grpc/MethodDescriptor;Lio/grpc/Metadata;Lio/grpc/CallOptions;[Lio/grpc/ClientStreamTracer;)Lio/grpc/internal/ClientStream; NettyClientTransport.java:198
#3 io.grpc.netty.NettyClientTransportTest$Rpc.<init>(Lio/grpc/netty/NettyClientTransport;Lio/grpc/Metadata;)V NettyClientTransportTest.java:973
#4 io.grpc.netty.NettyClientTransportTest$Rpc.<init>(Lio/grpc/netty/NettyClientTransport;)V NettyClientTransportTest.java:969
#5 io.grpc.netty.NettyClientTransportTest.handlerExceptionDuringNegotiatonPropagatesToStatus()V NettyClientTransportTest.java:425
...
Previous write of size 4 at 0x00008f983a3c by thread T56:
#0 io.grpc.netty.NettyClientHandler$FrameListener.onSettingsRead(Lio/netty/channel/ChannelHandlerContext;Lio/netty/handler/codec/http2/Http2Settings;)V NettyClientHandler.java:960
...
```
* Move creating the retry timer in handleRpcStreamClosed to as late as possible and call `close` so that the `call` is cancelled.
Also add some debug logging.
If the control plane sends a resource type the client doesn't understand
at-the-moment, the control plane will still expect the client to include
the nonce if the client subscribes to the type in the future.
This most easily happens when unsubscribing the last resource of a type.
Which meant 1cf1927d1 was insufficient.
Internal* classes should generally be accessors that are used outside of
the package/project. Only one attribute was used outside of xds, so
leave only that one attribute in InternalXdsAttributes. One attribute
was used by the internal.security package, so move the definition to the
same package to reduce the circular dependencies.
There's no reason to use the interface outside of
XdsClientImpl/ControlPlaneClient. Since XdsClientImpl implements the
interface directly, its methods are still public. That can be a future
cleanup.
The module metadata in Guava causes the -jre version to be selected even
when you choose the -android version. Gradle did not give any clues that
this was happening, and while
`println(configurations.compileClasspath.resolve())` shows the different
jar in use, most other diagonstics don't. dependencyInsight can show you
this is happening, but only if you know which dependency has a problem
and read Guava's module metadata first to understand the significance of
the results.
You could argue this is a Guava-specific problem. I was able to get
parts of our build working with attributes and resolutionStrategy
configurations mentioned at
https://github.com/google/guava/releases/tag/v32.1.0 , so that only
Guava would be changed. But it was fickle giving poor error messages or
silently swapping back to the -jre version.
Given the weak debuggability, the added complexity, and the lack of
value module metadata is providing us, disabling module metadata for our
entire build seems prudent.
See https://github.com/google/guava/issues/7575
These repositories are already included from the main build.gradle, so
they don't do anything. Much less do they need to be defined twice in
the same file.
In e08b9db20 we added `@DoNotCall` annotations to some call sites, but
Bazel used an older version of ErrorProne that complained at times it
shouldn't. The minimum version of Bazel we test/support is now Bazel 6,
well past Bazel 3.4+.
This avoids the dependency on animalsniffer-annotations. grpc-api, and
particularly grpc-context, are used many low-level places and it is
beneficial for them to be very low dependency. This brings grpc-context
back to zero-dependency.
grpc-binder's upcoming AndroidIntentNameResolver needs to know the target Android user so it can resolve target URIs in the correct place. Unfortunately, Android's built in intent:// URI scheme has no way to specify a user and in fact the android.os.UserHandle object can't reasonably be encoded as a String at all.
We solve this problem by extending NameResolver.Args with the same type-safe and domain-specific Key<T> pattern used by CallOptions, Context and CreateSubchannelArgs. New "custom" arguments could apply to all NameResolvers of a certain URI scheme, to all NameResolvers producing a particular type of java.net.SocketAddress, or even to a specific NameResolver subclass.
In 61f19d707a I swapped the signatures to use the version catalog. But I
failed to preserve the `@signature` extension and it all seemed to
work... But in fact all the animalsniffer tasks were completing as
SKIPPED as they lacked signatures. The build.gradle changes in this
commit are to fix that while still using version catalog.
But while it was broken violations crept in. Most violations weren't
too important and we're not surprised went unnoticed. For example, Netty
with TLS has long required the Java 8 API
`setEndpointIdentificationAlgorithm()`, so using `Optional` in the same
code path didn't harm anything in particular. I still swapped it to
Guava's `Optional` to avoid overuse of `@IgnoreJRERequirement`.
One important violation has not been fixed and instead I've disabled the
android signature in api/build.gradle for the moment. The violation is
in StatusException using the `fillInStackTrace` overload of Exception.
This problem [had been noticed][PR11066], but we couldn't figure out
what was going on. AnimalSniffer is now noticing this and agreeing with
the internal linter. There is still a question of why our interop tests
failed to notice this, but given they are no longer running on pre-API
level 24, that may forever be a mystery.
[PR11066]: https://github.com/grpc/grpc-java/pull/11066
StructOrError is a more generic API, but we have StatusOr now so we
don't want new usages of StructOrError. Moving StructOrError out of
io.grpc.xds.client will make it easier to delete StructOrError once
we've migrated to StatusOr in the future.
TRANSPORT_SOCKET_NAME_TLS should also move, but it wasn't immediately
clear to me where it should go.
ObjectPool is our standard solution for dealing with the
sometimes-shutdown resources. This was implemented by a contributor not
familiar with regular tools.
There are wider changes that can be made here, but I chose to just do a
smaller change because this class is used by GrpclbNameResolver.
The channel log is shared by many components and is poorly suited to
the noise of per-RPC events. This commit restricts RLS usage of the
logger to no more frequent than cache entry events. This may still be
too frequent, but should substantially improve the signal-to-noise and
we can do further rework as needed.
Many of the log entries were poor because they lacked enough context.
They weren't even clear they were from RLS. The cache entry events now
regularly include the request key in the logs, allowing you to follow
events for specific keys. I would have preferred using the hash code,
but NumberFormat is annoying and toString() may be acceptable given its
convenience.
This commit reverts much of eba699ad. Those logs have not proven to be
helpful as they produce more output than can be reasonably stored.
This was noticed because of a CallOptionsTest flake that had a
surprising error:
```
expected : 59.983387319
but was : 59.983387319
outside tolerance in seconds: 0.01
```
The target UserHandle is best modeled as part of the SocketAddress not the Channel since it's part of the server's location.
This change allows a NameResolver to select different target users over time within a single Channel.
The goal of this PR is to increase the test coverage of the C2P E2E load test by improving the rpc_soak and channel_soak tests to support concurrency.
**rpc_soak:**
The client performs many large_unary RPCs in sequence over the same channel. The test can run in either a concurrent or non-concurrent mode, depending on the number of threads specified (soak_num_threads):
- Non-Concurrent Mode: When soak_num_threads = 1, all RPCs are performed sequentially on a single thread.
- Concurrent Mode: When soak_num_threads > 1, the client uses multiple threads to distribute the workload. Each thread performs a portion of the total soak_iterations, executing its own set of RPCs concurrently.
**channel_soak:**
Similar to rpc_soak, but this time each RPC is performed on a new channel. The channel is created just before each RPC and is destroyed just after. Note on Concurrent Execution and Channel Creation: In a concurrent execution setting (i.e., when soak_num_threads > 1), each thread performs a portion of the total soak_iterations and creates and destroys its own channel for each RPC iteration.
- createNewChannel Function: In channel_soak, the createNewChannel function is used by each thread to create a new channel before every RPC. This function ensures that each RPC has a separate channel, preventing race conditions by isolating channels between threads. It shuts down the previous channel (if any) and creates a new one for each iteration, ensuring accurate latency measurement per RPC.
- Thread-specific logs will include the thread_id, helping to track performance across threads, especially when each thread is managing its own channel lifecycle.
Generated code for v1alpha was ignored, but not v1. Ignoring v1 reduces
lines being checked from 16,145 to 6,303, significantly improving the
overall code coverage and removing noise. This was noticed because there
was a very clear drop at 0aa976c4 visible in the coveralls.io coverage
graph, the point when v1 was introduced.
When spiffe support was added it caused
tlsClientServer_useSystemRootCerts_validationContext to become flaky.
This is because test execution order was important for whether the race
would occur.
Fixes#11678
This is a step toward removing ResolvedAddresses from ChildLbState,
which isn't actually used by MultiChildLb. Most usages of the EAG usages
can be served more directly without peering into MultiChildLb's
internals or even accessing ChildLbStates, which make the tests less
sensitive to implementation changes. Some changes do leverage the new
behavior of MultiChildLb where it preserves the order of the entries.
This does fix an important bug in shutdown tests. The tests looped over
the ChildLbStates after shutdown, but shutdown deleted all the children
so it looped over an entry collection. Fixing that exposed that
deliverSubchannelState() didn't function after shutdown, as the listener
was removed from the map when the subchannel was shut down. Moving the
listener onto the TestSubchannel allowed having access to the listener
even after shutdown.
A few places in LeastRequestLb lines were just deleted, but that's
because an existing assertion already provided the same check but
without digging into MultiChildLb.
When forwarding from Listener onAddresses to Listener2 continue to use onResult and not onResult2 because the latter requires to be called from within synchronization context and it breaks existing code that didn't need to do so when using the old Listener interface.
feab4e54 removed xds v2 for the Gradle build. Testing with a deploy.jar,
I see the same 4 MB size reduction (31 -> 27 MB) here.
While an orca dependency is deleted in this commit, it is only a direct
dependency. It remains in the :orca target, so doesn't contribute a size
reduction.
PAUSED Looper mode has been the default for many years, maybe around
robolectric 4.5 (9ae9f0b6a6). Explicitly specifying PAUSED Looper mode
is not necessary.
cl/690684542
When java.time.Instant is available use the timestamp from this class in nano precision rather than using System.currentTimeInMillis and converting it to nanos.
Fixes#5494.
Allow using system root certs for server cert validation rather than CA root certs provided by the control plane when the validation context provided by the control plane specifies so.
This reverts commit 99f86835ed.
The change doesn't handle `null` messages, which don't happen with
protobuf, but can happen with other marshallers, especially in tests.
See cl/689445172
This will reopen#5969.
Callers are frequently confused by this message and waste time looking for problems in the client when the root cause is simply a server crash. See b/371447460 for more context.
It is the `Executor appExecutor` that should be given an asynchronous
task, not `CallCredentials.MetadataApplier applier`.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
This had been used for a time with a combined inprocess+binder server.
However, just having multiple servers worked fine and this is no longer
used/needed.
If a panic is followed a panic, we'd ignore the second. But if an
exception happens while entering panic mode we may fail to update the
picker with the first error. This is "fine" from a correctness
standpoint; all bets are off when panicking and we've already logged the
first error. But failing RPCs can often be more easily seen than just
the log.
Noticed because of http://yaqs/8493785598685872128
* Add S2AStub cleanup handler.
* Give TLS and Cleanup handlers name + update comment.
* Don't add TLS handler twice.
* Don't remove explicitly, since done by fireProtocolNegotiationEvent.
* plumb S2AStub close to handshake end + add integration test.
* close stub when TLS negotiation fails.
When an ADS stream in closed with a non-OK status after receiving a response, new status will be updated to OK status. This makes the fail behavior consistent with gRFC A57.
* throw IllegalArgumentException in ProtoUtil.
* throw exception in TrustManager in more standard way.
* handle IllegalArgumentException in SslContextFactory.
* Don't throw error on unknown TLS version.
Combined success / error status passed via ResolutionResult to the NameResolver.Listener2 interface's onResult2 method - Addresses in the success case or address resolution error in the failure case now get set in ResolutionResult::addressesOrError by the internal name resolvers.
* Change PickFirstLeafLoadBalancer to only have 1 subchannel at a time if environment variable GRPC_SERIALIZE_RETRIES == true.
Cache serializingRetries value so that it doesn't have to look up the flag every time.
Clear the correct task when READY in processSubchannelState and move the logic to cancelScheduledTasks
Cleanup based on PR review
remove unneeded checks for shutdown.
* Fix previously broken tests
* Shutdown previous subchannel when run off end of index.
* Provide option to disable subchannel retries to let PFLeafLB take control of retries.
* InternalSubchannel internally goes to IDLE when sees TF when reconnect is disabled.
Remove an extra index.increment in LeafLB
When running on the JDK, it is quite normal for Conscrypt not to be
present. We'll end up using the JDK 9 ALPN API and everything will be
fine. On Android, it would be extremely rare for someone to completely
remove the default Android security providers, so the warning was almost
never going to trigger on that platform anyway.
A map of children is still needed, but is created temporarily on update.
The order of children is currently preserved, but we could use regular
HashMaps if that is not useful.
* Combine MtlsToS2ChannelCredentials and S2AChannelCredentials.
* Check if file exists.
* S2AChannelCredentials API requires credentials used for client-s2a channel.
* remove MtlsToS2A library in BUILD.
* Don't check state twice.
* Don't check for file existence in tests.
Instead of doing a dance of supplementing config so the later
createChildAddressesMap() won't delete children, just look at the
existing children and don't delete any that shouldn't be deleted.
* Use StandardCharsets in FakeS2AServerTest.
* Use add instead of offer in S2AStub.
* remove dead code in ProtoUtil.java.
* Mark convertTlsProtocolVersion as VisibleForTesting.
* S2AStub doesn't return responses at front of queue.
* Remove global SHARED_RESOURCE_CHANNELS.
* Don't suppress RethrowReflectiveOperationExceptionAsLinkageError.
* Update javadoc.
* Make clear which certs are used in tests + add how to regenerate.
1. Removing $ when looking for the commit 'Start of development cycle...' because it produces empty result with the $. It seems how the squash was done may influence whether $ will work or not.
2. Added an explicit git push instruction at step 5 of tagging and what base branch to use, since it will cause conflict with the default base branch used of master.
The main goal was to make sure subchannels went CONNECTING only after a
connection was requested (since the test doesn't transition to
CONNECTING from TF). That helps guarantee that the test is using the
expected subchannel.
The missing ClusterImplLB.requestConnection() doesn't actually matter
much, as cluster manager doesn't propagate connection requests.
* Added null check for xdsClient in onSubChannelState. This avoids NPE
for xdsClient when LB is shutdown and onSubChannelState is called later
as part of listener callback. As shutdown is racy and eventually consistent,
this check would avoid calculating locality after LB is shutdown.
* Mark S2A public APIs as experimental.
* Rename S2AChannelCredentials createBuilder API to newBuilder.
* Remove usage of AdvancedTls.
* Use InsecureChannelCredentials.create instead of Optional.
* Invoke Thread.currentThread().interrupt() in a InterruptedException block.
* S2AHandshakerServiceChannel doesn't use custom event loop.
* use executorPool.
* log when channel not shutdown.
* use a cached threadpool.
* update non-executor version.
Move unused and unimportant fields to local variables. pickUnusedPort()
is inherently racy, so avoid using it when unnecessary. The channel's
default executor is fine to use, but if you don't like it
directExecutor() would be an option too. But blocking stub doesn't even
use the executor for unary RPCs. Thread.join() does not propagate
exceptions from the Thread; it just waits for the thread to exit.
Add opentelemetry tracing API, guarded by environmental variable(disabled by default).
Use server interceptor to explicitly propagate span to the application thread.
unix.sh is shared by multiple OSes and environments. Clear JAVA_HOME,
since we never want to use that as PATH is more reliable, better
supported, and more typical.
* use an attribute from resolved addresses IS_PETIOLE_POLICY to control whether or not health checking is supported so that top level versions can't do any health checking, while those under petiole policies can.
Fixes#11413
Detachable lets a buffer outlive its original lifetime. The new lifetime
is application-controlled. If the application fails to read/close the
stream, then the leak detector wouldn't make clear what code was
responsible for the buffer's lifetime. With this touch, we'll be able to
see detach() was called and thus know the application needs debugging.
Realized when looking at b/364531464, although I think the issue is
unrelated.
This makes ClusterManagerLB more straight-forward, focusing on just the
things that are relevant to it, and it avoids specialized map key
handling in updateChildrenWithResolvedAddresses().
The child policy config should be refreshed every address update, so it
shouldn't be stored in the ChildLbState. In addition, none of the
current usages actually used what was stored in the ChildLbState in a
meaningful way (it was always null).
ResolvedAddresses was also removed from createChildLbState(), as nothing
in it should be needed for creation; it varies over time and the values
passed at creation are immutable.
While child LB policies are unlikey to change for each cluster name (RLS
returns regular cluster names, so should be unique), and the
configuration for CDS policies won't change, RLS configuration can
definitely change.
It doesn't do anything.
Call scheduleNextConnection() unconditionally since it is responsible
for checking if `enableHappyEyeballs == true`. It's also surprising to
check in the CONNECTING case but not the IDLE case.
It is trivial to avoid the exception from
addressIndex.getCurrentAddress(). The log message was inaccurate, as the
subchannel might have been TRANSIENT_FAILURE. The only important part of
the condition was whether the subchannel was the current subchannel.
It will never throw, because it would only throw if helper is null, but
helper is checkNotNull()ed in the constructor. It could have checked for
a null return value instead; since it hasn't been, it is clear we don't
need this check.
Bazel had the dependency added because of #5046, where Guava was
depending on it as compile-only and Bazel build have "unknown enum
constant" warnings. Guava now has a compile dependency on j2objc, so
this workaround is no longer needed. There are currently no version skew
issues in Gradle, which was the only usage.
Since 04474970 RingHashLB has not used
acceptResolvedAddressesInternal(). At the time that was needed because
deactivated children were part of MultiChildLB. But in 9de8e443, the
logic of RingHashLB and MultiChildLB.acceptResolvedAddressesInternal()
converged, so it can now swap back to using the base class for more
logic.
One LB no longer needs to extend ChildLbState and one has to start, so
it is a bit of a wash. There are more LBs that need the auto-request
logic, but if we have an API where subclasses override it without
calling super then we can't change the implementation in the future.
Adding behavior on top of a base class allows subclasses to call super,
which lets the base class change over time.
There was no point to using subchannels as keys to
subchannelToReportListenerMap, as the listener is per-child. That meant
the keys would be guaranteed to be known ahead-of-time and the
unsynchronized getOrCreateOrcaListener() during picking was unnecessary.
The picker still stores ChildLbStates to make sure that updating weights
uses the correct children, but the picker itself no longer references
ChildLbStates except in the constructor. That means weight calculation
is moved into the LB policy, as child.getWeight() is unsynchronized, and
the picker no longer needs a reference to helper.
A package-private class isn't visible and `@Internal` is stronger than
experimental. The only way users should use WRR is via the
weight_round_robin string, and that's already not suffixed with
_experimental.
Closes#9885
* otel tracing: add binary format, grpcTraceBinContextPropagator
* exception handling, use api base64 encoder omit padding
remove binary format abstract class in favor of binary marshaller
Some addresses are equal even though their toString is different
(InetSocketAddress ignores the hostname when it has an address). And
some addresses are not equal even though their toString might be the
same (AnonymousInProcessSocketAddress doesn't override toString()).
InetSocketAddress/InetAddress do not cache the toString() result. Thus,
even in the worst case that uses a HashSet, this should use less memory
than the earlier approach, as no strings are formatted. It probably also
significantly improves performance in the reasonably common case when an
Endpoint is created just for looking up a key, because the string
creation in the constructor isn't then amorized.
updateChildrenWithResolvedAddresses(), for example, creates n^2 Endpoint
objects for lookups.
It came up in #11073, and I saw it could use a little updating. Notably,
I'm linking to a guide to what Git commit messages should look like. I
also tried to make the language less heavy-handed and demanding.
This reverts commit 9ba2f9dec5.
It causes a channel panic due to unimplemented onResult2().
```
java.lang.UnsupportedOperationException: Not implemented.
at io.grpc.NameResolver$Listener2.onResult2(NameResolver.java:257)
at io.grpc.internal.DnsNameResolver$Resolve.lambda$run$0(DnsNameResolver.java:334)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:94)
at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:126)
at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:333)
```
b/356669977
They share very little code, and we really don't want RoundRobinLb to be
public and non-final. Originally, WRR was expected to share much more
code with RR, and even delegated to RR at times. The delegation was
removed in 111ff60e. After dca89b25, most of the sharing has been moved
out into general-purpose tools that can be used by any LB policy.
FixedResultPicker now has equals to makes it as a EmptyPicker
replacement. RoundRobinLb still uses EmptyPicker because fixing its
tests is a larger change. OutlierDetectionLbTest was changed because
FixedResultPicker is used by PickFirstLeafLb, and now RoundRobinLb can
squelch some of its updates for ready pickers.
`cncf/xds`: Sync protos to the latest imported version
cncf/xds@024c85f (commit 2024-07-23, cl/655545156).
Should be a noop, just a routine xDS proto update to make upcoming
RLQS-related imports simpler, see related #11401.
Note that CEL is only added as a bazel dependency as now it's required
to build cncf/xds. Actual third-party source import will be done in
the follow up PR, where RLQS dependencies are added to the import
scripts.
Otherwise, the server will continue sending updates and if we
re-subscribe to the last resource, the server won't re-send it. Also
completely remove the per-type state, as it could only add confusion.
`envoyproxy/envoy`: Sync protos to the latest imported version
ab911ac2ff
(commit 2024-07-06, cl/651956889).
Should be a noop, just a routine xDS proto update to make upcoming
RLQS-related imports simpler.
Introducing NameResolver listener method "Status Listener2::onResult2(ResolutionResult)" that returns Status of the acceptance of the name resolution by the load balancer, and the Name Resolver will call this method for both success and error cases.
From gRFC A58:
> When less than two subchannels have load info, all subchannels will
> get the same weight and the policy will behave the same as round_robin
We don't include protobuf in IO_GRPC_GRPC_JAVA_ARTIFACTS, so there might
not actually be an alias available for it to @com_google_protobuf. While
we could add it, it is easier to use the @com_google_protobuf references
directly.
This was preventing `bazel query 'deps(//...)' from succeeding, because
it couldn't find javalite.
Since Bazel 6 [1], Bazel has used com_google_protobuf for javalite. We
only used the other repo because Bazel expected it, which was because
Protobuf split out javalite to a separate branch for a while. Since
everything is now reunified, we can use a singular protobuf repo.
1. abdb1d6bfe
V1 version of the proto reflection service, as the v1.alpha service has been deprecated.
* Create V1 alpha service wrapping underlying V1 service, by modifying the ServerServiceDefinition.
* Create ProtoReflectionService for the v1alpha proto by producing a ServerServiceDefinition constructed from that of the v1 service but with the service and method names and proto descriptors modified.
Issue #6724.
Java 8 isn't installed, and was needed by the old Android SDK. With the
current SDK, it can work on Java 11 but it needs some dependencies
installed.
Python 2.7 isn't available any more, but instead of porting to Python 3,
it was just replaced with a curl command.
The GSON upgrade slightly changed an error string, so the test was
updated to be less of a change detector.
Some OpenTelemetry dependencies are alpha versions, so needed an
adjustment in build.gradle to accept the versions. Similarly, Undertow
includes Final in its version numbers which needs to be accepted.
CentOS 7 became end-of-life on July 1st and is no longer working. We now
dynamically link against libstdc++, as RHEL 8 doesn't support static
linking: https://access.redhat.com/articles/rhel8-abi-compatibility
We now use objdump in check-artifact for all linux architectures. This
avoids using a mix of objdump and ldd. ldd shows transitive
dependencies, which is less convenient.
This is to replace switchTo(), to allow composing GracefulSwitchLb with
other helpers like MultiChildLb. It also prevents users of
GracefulSwitchLb from needing to use ServiceConfigUtil.
opencensus-proto is old generated code, which is not compatible with
protobuf-java 4.27.2 and may not be fixed since the project is dead.
Since it is unused, I think this doesn't cause any trouble for
downstream users trying to use protobuf-java 4.x. Related to #11015.
Add gRPC OpenTelemetry example. The example uses Prometheus exporter to export metrics and can be verified locally.
It also provides an example using LoggingMetricExporter to export and log the metrics using java.util.logging.
* Eliminate NPE after recovering from a temporary name resolution failure.
* Add test case for 2 failing subchannels to make sure it causes channel to go into TF.
Allocating this executor before BinderServer even exists is convoluted and actually leaks if the built server is never actually start()ed. Instead, have BinderServer own this executor directly, with a lifetime from start() until termination. Pass it to the ServerAuthInterceptor via TransportAuthorizationState Attribute instead of at construction time.
Using --runs_per_test=1000, this changes the flake rate of TlsTest from
2% to 0%.
While I believe it is possible to write a reliable test for this
(including noticing the SSLSocket behavior), it was becoming too
invasive so I gave up.
Fixes#11012
These are overrides of BinderTransport itself, so not used elsewhere.
They are essentially private. It was scary seeing `@GuardedBy` for a
public method. I copied the annotation to the base class to make sure
ErrorProne could verify the calls.
The test was added in e4e7f3a06 when InProcess stopped returning a
Runnable from start(). In c5a63a1 we realized (indirectly) that there's
no point in using the Runnable any more.
This test failed with Binder (which seems to have been using the
Runnable unnecessarily), and InProcess, Netty, and OkHttp don't use the
Runnable. Instead of fixing it, we'll just move toward stopping using
Runnable.
I'm not removing the Runnable usage from Binder in this commit because
this test is currently causing CI failures and I don't want to do a
behavior change when fixing it.
There are no longer any devices (virtual or otherwise) that support API
level 21, 22, or 23. Google Play services is still supporting API level
21 (although there is a pattern of notifying of dropped levels in July,
and dropping them in August).
This hasn't been needed since f8f569e07, when InternalSubchannel stopped
calling start() with a lock held. Note that also means no transport
needs to return a Runnable (but some still are).
I had noticed in e4e7f3a06 that it was safe for InProcess to call the
listener directly within start(), but I didn't notice this Javadoc that
said it wasn't allowed.
CachingRlsLbClient already calls it with a lock held. The only reason
the cache needs to manage the lock itself is for the periodic cleanup.
Let the consumer of the cache handle the timer.
Returning the runnable did nothing, as both the start method and the
runnable are run within the synchronization context. I believe the
Runnable used to be required in the previous implementation of
ManagedChannelImpl (the lock-based implementation before we created
SynchronizationContext).
This fixes a NPE seen in ServerImpl because the server expects proper
ordering of transport lifecycle events.
```
Uncaught exception in the SynchronizationContext. Panic!
java.lang.NullPointerException: Cannot invoke "java.util.concurrent.Future.cancel(boolean)" because "this.handshakeTimeoutFuture" is null
at io.grpc.internal.ServerImpl$ServerTransportListenerImpl.transportReady(ServerImpl.java:440)
at io.grpc.inprocess.InProcessTransport$4.run(InProcessTransport.java:215)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:94)
```
b/338445186
fea577c80 disabled an optimization that some tests notice, as it can
change execution order. This restores the old behavior, at slight
expense to seeing relationship between in-use tracking and idle mode.
This will be used by the metadata exchange of CSM. When recording
per-attempt metrics, we really need per-attempt data and can't leverage
ClientInterceptors.
8844cf7b8 triggered a regression where a new RPC wouldn't cause the
channel to exit idle mode, if an RPC was still progressing on an old
transport. This was already possible previously, but was racy.
8844cf7b8 made it less racy and more obvious.
The two added `exitIdleMode()` calls in this commit are companions to
those in `enterIdleMode()`, which detect whether the channel should
immediately exit idle mode.
Noticed in cl/635819804.
Verifies that latest versions of Tomcat/Undertow/Jetty pass
integration tests - I manually verified that all ignored tests still
fail.
Two tests failed in Jetty, it appears that the integration test
anticipates that the server implementation is willing to send larger
trailers than the client SETTINGS frame allows for. Since the server
refuses to send too large of headers/trailers, the client does not
receive the too-large payloads, and doesn't fail with the expected
message. This change was introduced in Jetty 10.0.15/11.0.11. Those
tests are ignored.
* Fix 3d party dependency use_repo
* remove protobuf as it is already added as module dep
* fix
* fix
* fix
* return com_google_protobuf_javalite archive and use it in MODULE.bazel
DelayedClientTransport already had to handle all the cases, so
ManagedChannelImpl picking was acting only as an optimization.
Optimizing DelayedClientTransport to avoid the lock when not queuing
makes ManagedChannelImpl picking entirely redundant, and allows us to
remove the duplicate race-handling logic.
This avoids double-picking when queuing, where ManagedChannelImpl does a
pick, decides to queue, and then DelayedClientTransport re-performs the
pick because it doesn't know which pick version was used. This was
noticed with RLS, which mutates state within the picker.
Previously, picker was likely null if entering backoff soon after
start-up. This prevented the picker from being updated and directing
queued RPCs to the fallback. It would work for new RPCs if RLS returned
extremely rapidly; both ManagedChannelImpl and DelayedClientTransport do
a pick before enqueuing so the ManagedChannelImpl pick could request
from RLS and DelayedClientTransport could use the response. So the test
uses a delay to purposefully avoid that unlikely-in-real-life case.
Creating a resolving OOB channel for InProcess doesn't actually change
the destination from the parent, because InProcess uses directaddress.
Thus the fakeRlsServiceImpl is now being added to the fake backend
server, because the same server is used for RLS within the test.
b/333185213
Some APIs were marked experimental but had internal APIs in their
surface. These were all changed to internal. And then the internal APIs
were mostly hidden from generated documentation.
All these APIs will eventually become public and maybe even stable. But
they need some iteration before we're ready for others to start using
them.
* Change HappyEyeballs flag default value to false since some G3 users are seeing problems.
Put the flag logic in a common place for PickFirstLeafLoadBalancer & WRR's test.
* Set expected requestConnection count based on whether happy eyeballs is enabled or not
* Disable new PickFirstLB
* Fix test expectations to handle both new and old PF LB paths.
OpenTelemetryModule is renamed to GrpcOpenTelemetry. The Builder is now
`final`, although that should only impact mocks as it had a private
constructor.
Fixes#10591
The optional label API was added in 4c78a974 and xds_cluster_impl was
plumbed in 077dcbf9.
From gRFC A78:
> ### Optional xDS Locality Label
>
> When xDS is used, it is desirable for some metrics to include an optional
> label indicating which xDS locality the metrics are associated with.
> We want to provide this optional label for the metrics in both the
> existing per-call metrics defined in [A66] and in the new metrics for
> the WRR LB policy, described below.
>
> If locality information is available, the value of this label will be of
> the form `{region="${REGION}", zone="${ZONE}", sub_zone="${SUB_ZONE}"}`,
> where `${REGION}`, `${ZONE}`, and `${SUB_ZONE}` are replaced with the
> actual values. If no locality information is available, the label will
> be set to the empty string.
>
> #### Per-Call Metrics
>
> To support the locality label in the per-call metrics, we will provide
> a mechanism for LB picker to add optional labels to the call attempt
> tracer. We will then use this mechanism in the `xds_cluster_impl`
> policy's picker to set the locality label. ...
>
> This label will be available on the following per-call metrics:
> - `grpc.client.attempt.duration`
> - `grpc.client.attempt.sent_total_compressed_message_size`
> - `grpc.client.attempt.rcvd_total_compressed_message_size`
This is needed by gRFC A78 for xds metrics, and for RLS metrics. Since
gauges need to acquire a lock (or other synchronization) in the
callback, the callback allows batching multiple gauges together to avoid
acquiring-and-requiring such locks.
Unlike other metrics, gauges are reported on-demand to the MetricSink.
This means not all sinks will receive the same data, as the sinks will
ask for the gauges at different times.
This will be used for gRFC A66's OTel per-RPC metric label:
> `grpc.target` : Canonicalized target URI used when creating gRPC
> Channel, e.g. "dns:///pubsub.googleapis.com:443",
> "xds:///helloworld-gke:8000". Canonicalized target URI is the form
> with the scheme included if the user didn't mention the scheme
> (`scheme://[authority]/path`).
The majority of the changes are to move target computation from
ManagedChannelImpl into the builder. A small hack API was added to
ManagedChannelBuilder to get the target to create an interceptor.
This should preserve all the existing behavior of GlobalInterceptors as
used by grpc-gcp-observability, including it disabling the implicit
OpenCensus integration.
Both the old and new API are internal. I hid Configurator and
ConfiguratorRegistry behind Internal-prefixed classes, like had been
done with GlobalInterceptors to further discourage use until the API is
ready.
GlobalInterceptorsTest was modified to become ConfiguratorRegistryTest.
As part of gRFC A78:
> To support the locality label in the WRR metrics, we will extend the
> `weighted_target` LB policy (see A28) to define a resolver attribute
> that indicates the name of its child. This attribute will be passed
> down to each of its children with the appropriate value, so that any
> LB policy that sits underneath the `weighted_target` policy will be
> able to use it.
xds_cluster_impl is involved because it uses the child names in the
AddressFilter, which must match the names used by weighted_target.
Instead of using Locality.toString() in multiple policies and assuming
the policies agree, we now have xds_cluster_impl decide the locality's
name and pass it down explicitly. This allows us to change the name
format to match gRFC A78:
> If locality information is available, the value of this label will be
> of the form `{region="${REGION}", zone="${ZONE}",
> sub_zone="${SUB_ZONE}"}`, where `${REGION}`, `${ZONE}`, and
> `${SUB_ZONE}` are replaced with the actual values. If no locality
> information is available, the label will be set to the empty string.
This adds the following components that are required for gRPC A79
non-per-call metrics architecture.
- MetricSink implementation for gRPC OpenTelemetry
- Configurator for plumbing per call metrics ClientInterceptor and
ServerStreamTracer.Factory via unified OpenTelemetryModule.
Integrates the new features of the the Kokoro PSM Interop install library introduced in grpc/psm-interop#73.
Nearly all common functionality was moved from per-language/per-branch PSM Interop build scripts to [psm_interop_kokoro_lib.sh](https://github.com/grpc/psm-interop/blob/main/.kokoro/psm_interop_kokoro_lib.sh):
1. The list of tests in the each test suite
2. Per-test-suite flag customization
3. `run_test` methods
4. `build_docker_images_if_needed` methods
5. Generic `build_test_app_docker_images` methods (simple docker build + docker push + docker tag). grpc-java is one exception, as it doesn't run docker directly, but a cloudbuild flow.
Now all PSM Interop jobs share the same buildscripts by all test suites:
1. buildscript that invokes the test: `psm-interop-test-{language}.sh` (configured as `build_file` in the build cfg)
2. buildscript that builds the xDS test client/server and publishes them as a Docker image: `psm-interop-build-{language}.sh` (conventional name called from `psm_interop_kokoro_lib.sh`)
`psm-interop-test-{language}.sh`:
1. Sets `GRPC_LANGUAGE`, `BUILD_SCRIPT_DIR` environment variables.
2. Downloads the shared `psm_interop_kokoro_lib.sh` from the main branch of the psm-interop repo.
3. Sources `psm-interop-build-{language}.sh`
4. Calls `psm::run "${PSM_TEST_SUITE}"` (`PSM_TEST_SUITE` configured in the cfg file).
`psm-interop-build-{language}.sh`:
1. Defines `psm::lang::build_docker_images` which is called from `psm_interop_kokoro_lib.sh`.
2. Invokes any repo-specific logic.
3. May use `psm::build::docker_images_generic` for generic Docker build, tag, push, or provide implement its own build/publish method.
References:
- b/288578634
- See the full list of the new features at grpc/psm-interop#73.
- Additional fixes to the shared lib: grpc/psm-interop#78, grpc/psm-interop#79
gRFC A78 has WRR and pick-first include a `grpc.target` label, defined
in A66:
> `grpc.target` : Canonicalized target URI used when creating gRPC
> Channel, e.g. "dns:///pubsub.googleapis.com:443",
> "xds:///helloworld-gke:8000". Canonicalized target URI is the form
> with the scheme included if the user didn't mention the scheme
> (`scheme://[authority]/path`). For channels such as inprocess channels
> where a target URI is not available, implementations can synthesize a
> target URI.
Since 06df25b65d, WRR has been calling this method, and it will get an
exception. We don't want WRR to be broken until we have MetricRecorder
fully plumbed.
As part of gRFC A78:
> To support the locality label in the per-call metrics, we will provide
> a mechanism for LB picker to add optional labels to the call attempt
> tracer.
* added MetricRecorderImpl and unit tests for MetricInstrumentRegistry
* updated MetricInstrumentRegistry to use array instead of ArrayList
* renamed record<>Counter APIs to add<>Counter. Added check for mismatched label values
* added lock for instruments array
`getMinEvictionTime()` was fixed to make sure only deltas were used for
comparisons (`a < b` is broken; `a - b < 0` is okay). It had also
returned `0` by default, which was meaningless as there is no epoch for
`System.nanoTime()`. LinkedHashLruCache now passes the current time into
a few more functions since the implementations need it and it was
sometimes already available. This made it easier to make some classes
static.
Instead of having docs in RefCountedChildPolicyWrapperFactory saying
that every method was guarded by a lock, I added `@GuardedBy("lock")`
within CachingRlsLbClient, so now it is clearly not thread-safe and the
lock protects access. The AtomicLong was replaced with a long since
1) there was no multi-threading and 2) the logic was not atomic-safe
which was misleading.
In OpenCensus recording an attempt was delayed in order to wait for
inboundUncompressedSize(). But we don't need that in OpenTelemetry, and
could have removed this code when copying from OpenCensus.
Today, deframer errors cancel the stream without communicating a status code
to the peer. This change causes deframer errors to trigger a best-effort
attempt to send trailers with a status code so that the peer understands
why the stream is being closed.
Fixes#3996
`sendGrpcFrame` owns the buffer in `SendGrpcFrameCommand`. If the frame is not handed off to netty, it needs to be released in the method.
https://github.com/grpc/grpc-java/issues/11115
It is easy to manage these things outside of MultiChildLb and it makes
the shared code easier and use less memory. In particular, we don't want
to use many instances of GracefulSwitchLb in virtually every policy
simply because it was needed in one or two cases.
Adds interfaces required for recording metrics from gRPC components. And added API to get `MetricRecorder` in `LoadBalancer.Helper` and add `MetricSink` to `ManagedChannelBuilder`.
Handles Netty write frame failures caused by issues in the Netty
itself.
Normally we don't need to do anything on frame write failures because
the cause of a failed future would be an IO error that resulted in
the stream closure. Prior to this PR we treated these issues as a
noop, except the initial headers write on the client side.
However, a case like netty/netty#13805 (a bug in generating next
stream id) resulted in an unclosed stream on our side. This PR adds
write frame future failure handlers that ensures the stream is
cancelled, and the cause is propagated via Status.
Fixes#10849
The text between the GRPC_DEPS_{START,END} must be identical in
formatting. Probably not a problem in general and not necessarily bad.
But it is simplistic.
Eric waking up this morning:
> We need more sed.
This started with combining handleNewRequest with asyncRlsCall, but that
emphasized pre-existing synchronization issues and trying to fix those
exposed others. It was hard to split this into smaller commits because
they were interconnected.
handleNewRequest was combined with asyncRlsCall to use a single code
flow for handling the completed future while also failing the pick
immediately for thottled requests. That flow was then reused for
refreshing after backoff and data stale. It no longer optimizes the RPC
completing immediately because that would not happen in real life; it
only happens in tests because of inprocess+directExecutor() and we don't
want to test a different code flow in tests. This did require updating
some of the tests.
One small behavior change to share the combined asyncRlsCall with
backoff is we now always invalidate an entry after the backoff.
Previously the code could replace the entry with its new value in one
operation if the asyncRlsCall future completed immediately. That only
mattered to a single test which now sees an EXPLICIT eviction.
SynchronizationContext used to provide atomic scheduling in
BackoffCacheEntry, but it was not guaranteeing the scheduledRunnable was
only accessed from the sync context. The same was true for calling up
the LB tree with `updateBalancingState()`. In particular, adding entries
to the cache during a pick could evict entries without running the
cleanup methods within the context, as well as the RLS channel
transitioning from TRANSIENT_FAILURE to READY. This was replaced with
using a bare Future with a lock to provide atomicity.
BackoffCacheEntry no longer uses the current time and instead waits for
the backoff timer to actually run before considering itself expired.
Previously, it could race with periodic cleanup and get evicted before
the timer ran, which would cancel the timer and forget the
backoffPolicy. Since the backoff timer invalidates the entry, it is
likely useless to claim it ever expires, but that level of behavior was
preserved since I didn't look into the LRU cache deeply.
propagateRlsError() was moved out of asyncRlsCall because it was not
guaranteed to run after the cache was updated. If something was already
running on the sync context, then RPCs would hang until another update
caused updateBalancingState().
Some methods were moved out of the CacheEntry classes to avoid
shared-state mutation in constructors. But if we add something in a
factory method, we want to remove it in a sibling method to the factory
method, so additional code is moved for symmetry. Moving shared-state
mutation ouf of constructors is important because 1) it is surprising
and 2) ErrorProne doesn't validate locking within constructors. In
general, having shared-state methods in CacheEntries also has the
problem that ErrorProne can't validate CachingRlsLbClient calls to
CacheEntry. ErrorProne can't know that "lock" is already held because
CacheEntry could have been created from a _different instance_ of
CachingRlsLbClient and there's no way for us to let ErrorProne prove it
is the same instance of "lock".
DataCacheEntry still mutates global state that requires a lock in its
constructor, but it is less severe of a problem and it requires more
choices to address.
According to the docs, I can use bazel to build examples, but
retry-example is not supported in bazel config. So If you'll try to
build this example with bazel, you'll get an error:
```
examples git:(master) ✗ bazel build :retrying-hello-world-server :retrying-hello-world-client
ERROR: Skipping ':retrying-hello-world-client': no such target '//:retrying-hello-world-client': target 'retrying-hello-world-client' not declared in package '' defined by /Users/rostik404/projects/grpc-java/examples/BUILD.bazel
ERROR: no such target '//:retrying-hello-world-client': target 'retrying-hello-world-client' not declared in package '' defined by /Users/rostik404/projects/grpc-java/examples/BUILD.bazel
INFO: Elapsed time: 0.331s
INFO: 0 processes.
ERROR: Build did NOT complete successfully
```
* Add option in OkHttpServerBuilder
* Add value as MAX_CONCURRENT_STREAM setting in settings frame sent by the server to the client per connection
* Enforce limit by sending a RST frame with REFUSED_STREAM error
The recommended way to load dependencies from `rules_jvm_external`
is to make use of the `@maven` workspace, and the most readable
way of doing that is to use the `artifact` macro provides.
This removes the need to generate the "compat" namespaces, which
`rules_jvm_external` provided for backwards compatibility with
older releases. This change also sets things up for supporting
`bzlmod`: this requires all workspaces accessed by a library to
be named "up front" in the `MODULE.bazel` file. This way, the
only repo that needs to be exported is `@maven`, rather than the
current huge list.
The name resolver takes some time before it returns addresses. While waiting the channel will be IDLE instead of the proper CONNECTING. This generally doesn't matter since RPCs behave similarly for IDLE and CONNECTING, but is confusing for users when watching channel.getState() closely.
Fixes#10517.
Including a Status description makes it easier to debug subchannel
closure issues if it's clear that a subchannel became unavailable because
of an outlier detection ejection.
We provided extra details when the RPC is killed by CallOptions'
Deadline, but didn't do the same for Context.
To avoid duplicating code, things were restructured, including the
threading. There are more code flows now, but I think the
multi-threading came out more obvious and less error-prone. I didn't
change the status when the deadline is already expired, because the
text is shared with DelayedClientCall and AbstractInteropTest doesn't
distinguish between the two cases.
This is a roll-forward that avoids a NPE when cancel() is called
without an earlier call to start().
As seen at b/300991330
* Allow the queued byte threshold for a Stream to be ready to be configurable
- on clients this is exposed by setting a CallOption
- on servers this is configured by calling a method on ServerCall or ServerStreamListener
* Have EDS resource parse the additional addresses from envoy message
* Update respositories.bzl to point to current grpc-proto instead of a 2021 version.
* Update respositories.bzl to point to recent cncf/xds and envoyproxy/data-plane-api
* Add cncf_upda to repositories.bzl
We provided extra details when the RPC is killed by CallOptions'
Deadline, but didn't do the same for Context.
To avoid duplicating code, things were restructured, including the
threading. There are more code flows now, but I think the
multi-threading came out more obvious and less error-prone. I didn't
change the status when the deadline is already expired, because the
text is shared with DelayedClientCall and AbstractInteropTest doesn't
distinguish between the two cases.
As seen at b/300991330
To make it stable, this PR hides protobuf from being exposed via the
API.
Note: this breaks ABI of `CsdsService.streamClientStatus` and
`CsdsService.fetchClientStatus`, but these methods should not
normally be called by the user.
Closes#8016.
The gradle wrapper was removed from example-oauth because we don't want
to maintain the wrapper copy in each example (at least right now it
doesn't make sense for it to be the only other example to have the
gradle wrapper).
The local race passes `rlsPicker` to the channel before
CachingRlsLbClient is finished constructing. `RlsPicker` can use
multiple of the fields not yet initialized. This seems not to be
happening in practice, because it appears like it would break things
very loudly (e.g., NPE).
The remote race seems incredibly hard to hit, because it requires an RPC
to complete before the pending data tracking the RPC is added to a map.
But with if a system is at 100% CPU utilization, maybe it can be hit. If
it is hit, all RPCs needing the impacted cache entry will forever be
buffered.
This removes a grpc-ism environment variable. Note that the logger is
still registered under XdsClientImpl. That could maybe change, but it is
a bit unclear what it should become and it seemed better for this to
have no behavior changes.
This method is only needed sometimes, and with time will be needed less
and less. Don't require new types to implement it, instead relying on
control planes to use the new approach.
`project(':grpc-netty').configurations` requires the grpc-netty project
to be configured, which requires evaluationDependsOn. Without
evaluationDependsOn, project loading order is arbitrary and you can get
random errors after small configuration changes.
Or rather, server response is ambiguous and this usage is not generally
what we mean when we say it. The example shows how to get an error for
any failed RPC, not just those coming from a failing server.
The existing comment caused confusion at
https://stackoverflow.com/a/78104828
The xDS library only honored names retrieved from the inner resource
containers, but for wrapped resources the outer layer could contain the
required name. This commit prefers the name on the wrapped container
over the inner resource name.
SelfSignedCertificate is not available on Java 17 because
OpenJdkSelfSignedCertGenerator is not available. This only impacted
tests.
AccessController is being removed, and these locations are doing simple
reflection which is unlikely to require it even when a security policy
is in effect. There's other places we do reflection without the
AccessController, so either no security policies care or the users can
update their policies to allow it.
xDS v2 support was dropped about a year ago, but the xds package still
had a few xDS v2 usages. This PR:
- Removes all leftover usages of xDS v2 classes in gprc-xds
- Removes all imported xDS v2 protos and their leaf dependencies:
- Removes xDS v2 generated services
- Makes minor improvements to the xds import script output
### Before
```sh
# Imported 154 protos.
❯ find . -iname "*xds*.jar" -exec du -h {} \; | col -x
13M ./build/libs/grpc-xds-1.63.0-SNAPSHOT-original.jar
6.1M ./build/libs/grpc-xds-1.63.0-SNAPSHOT-sources.jar
388K ./build/libs/grpc-xds-1.63.0-SNAPSHOT-javadoc.jar
14M ./build/libs/grpc-xds-1.63.0-SNAPSHOT.jar
```
### After
```sh
# Imported 86 protos.
❯ find . -iname "*xds*.jar" -exec du -h {} \; | col -x
9.1M ./build/libs/grpc-xds-1.63.0-SNAPSHOT-original.jar
4.1M ./build/libs/grpc-xds-1.63.0-SNAPSHOT-sources.jar
388K ./build/libs/grpc-xds-1.63.0-SNAPSHOT-javadoc.jar
9.1M ./build/libs/grpc-xds-1.63.0-SNAPSHOT.jar ```
Reduction:
- Number of protos: 44%
- Jar size: 35%
In addition to removing a test that only applies to KitKat, switch tests
that require 19 to not specifying the SDK version as we only support min
sdk version 21, which has the required API.
Also removes the SDK version check from isProfileOwner, to trigger a
runtime exception when too low of an SDK version is used.
This allows the checkForUpdates task to notice the dependencies and
suggest updates.
I plan to upgrade some of the servers after this change in hopes it
reduces test flakiness.
We can just compare the Deadline instances instead of asserting that
very little time has passed during the test. Real time probably still
matters in the test, but only insofar that the deadline is not expired
by the time ClientCallImpl sees it.
This fixes a test failure seen in the emulated aarch64 CI. Note that the
message says "ns" when it should say "ms", but this change deletes the
code with that typo.
```
java.lang.AssertionError: timeout: 548 ns
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at io.grpc.internal.ClientCallImplTest.assertTimeoutBetween(ClientCallImplTest.java:1102)
at io.grpc.internal.ClientCallImplTest.contextDeadlineShouldBePropagatedToStream(ClientCallImplTest.java:828)
```
`isolatedResourceDeletions()` has failed with a timeout waiting on
onChanged when running under TSAN. TSAN can slow things down, so let's
increase the timeout to ensure it isn't just timeout flake.
`-link` does I/O to download the package list, for every javadoc
invocation. There is no caching, so this happens many times per build.
Swap to offline mode to avoid spamming the servers, and avoid build
failures if the servers aren't entirely healthy.
We had 'test.dependsOn', but it is only run if the golden tasks are
configured, which they generally won't be because nothing depends on
them. This prevented the test from actually running. This bug was
introduced in 0ff9f37b because previously the golden tasks were eagerly
constructed.
Remove the "extraPackage" argument because it is a constant and it
confused me for a bit wondering when it was necessary.
It wasn't actually being used. Since Java 8u252 in early 2020 we've been
using ALPN from the JDK. The Jetty ALPN Agent has been a noop.
We do keep the Jetty ALPN support in the code and tests, but we don't
have the infrastructure to actually run it.
When we implement A71, we're no longer going to have a single xds
client, but instead one per channel target. Add that parameter now, even
though it is unused, to avoid managing the (internal) API breakage when
we implement fallback.
The DNS lookups are taking considerable time on the Windows CI (~11s),
which causes the test to time out:
```
Wanted but not invoked:
ldsResourceWatcher.onError(<any>);
-> at io.grpc.xds.XdsClientImplTestBase.sendToNonexistentHost(XdsClientImplTestBase.java:3733)
Actually, there were zero interactions with this mock.
at io.grpc.xds.XdsClientImplTestBase.sendToNonexistentHost(XdsClientImplTestBase.java:3733)
```
The ARM build, which uses an emulator, has had this test succeed, so the
failure seems unrelated to CPU usage. We want to avoid external I/O
anyway during tests, so removing the DNS lookup is good.
The TSAN comment referenced XdsClientImplTestBase.sendToNonexistentHost,
but the test no longer calls fakeClock.forwardTime so the comment was
out-of-date. Change the comment to make clear the race involved.
The Java 8 runtime is end of support. Leaving this a gae-jdk8 for now.
The gae-jdk8 was because AppEngine changed dramatically from Java 7 to
Java 8. Nowadays the versions are more in line with OpenJDK and not very
different from each other.
Fixes#10925
I'm trying to upgrade to a newer Windows Kokoro image, but the new one
has an old vswhere installed that breaks Gradle. Our old image doesn't
have vswhere at all. If vswhere isn't found, this rename prints some
errors, but the bat script continues executing. So this change is
compatible with both the older and newer image.
In 0d39bf50 the ReadyPicker was changed holding List<Subchannel> to
List<ChildLbState>, but ChildLbState mutates over time and is not
synchronized. We want the picker to have a snapshot of the data, so copy
the data from ChildLbState instead of using it directly.
Unfortunately the tests depended on the ChildLbState a bit, so we need
to save the EAG only to use it in tests. That's okay for now, but in the
future we'll probably want to remove that unnecessary memory usage.
Add the 'fake' dependency to grpc-netty instead of grpc-core.
grpc-okhttp already depends on grpc-util and probably would be fine
without round_robin on Android.
There's not actually a circular dependency, but some tools can't handle
the compile vs runtime distinction. Such tools are broken, but fixes
have been slow and this approach works with no real downfalls.
Works around #10576#10701
* netty: improve server handling of writes to reset streams
A server stream can be reset by the client while server writes are still queued. After the stream is reset, the netty connection will forget the stream object. The `NettyServerHandler` must deal with that situation. `sendResponseHandlers` already had some code to do that. This change standardizes that code and adds it to `sendGrpcFrame`. This fixes a potential bug where a `SendGrpcFrameCommand` with `endOfStream=true` would raise an `AssertionError` if written to a reset stream. (This bug is not currently reachable because `endOfStream=false` for all server `SendGrpcFrameCommand` objects.)
* Do not call into the encoder when we know the stream is gone.
* add final, change method permissions, add javadoc, cleanup unneeded, move updateOverallBalancingState to ClusterManagerLB and make it abstract
* Restructure to eliminate the flags as protected methods
* Move methods around so that the candidates for override are near the top.
* Reorder picker methods lower
Http2OkHttp is now unnecessary, as Http2Test tests OkHttp client to
Netty server. receivedDataForFinishedStream() was the only remaining
unique test and it seems already covered by AbstractInteropTest these
days.
We prefer to test using the stable APIs, as they are what our users
should be using. Http2Client continues using NettyChannelBuilder because
it is intended to test grpc-netty.
Retryable was added in google-auth-library 1.5.3 to make clear the
situations that deserve a retry of the RPC. Upgrading to that caused
problems because of transitive dependency issues syncing into Google so
it was reverted in 369f87be. google-auth-library 1.11.0 changed the
approach to avoid the transitive dependency updates. cl/601545581
upgraded to 1.22.0 inside Google. Bump to that version and swap away
from the imprecise IOException heuristic. go/auth-correct-retry
Fixes#6808
The RetryTest was flaky, and it seems to have been caused by the client
and server getting assigned to the same event loop. Separating the two
reduces the flake rate from ~3% to less than 0.1% (no flakes in a 1000).
While I was here fixing the executors, I reduced the number of threads
created and shut down the threads after they are no longer used. This
had no impact to the flake rate (no flakes in 1000).
`envoyproxy/envoy`: Sync protos to the latest imported version
147e6b9523
(commit 2024-01-24, cl/604403196).
Should be a noop, just a routine xDS proto update to make upcoming
RLQS-related imports simpler.
Minor refactor to the tlsContextManager to not expose itself on the xdsClientImpl constructor.
This is to allow people who plugins xdsTransportFactory to use the API easily.
Currently few of the interfaces needed to define and start a watch for a xDS resource are package private, which can't be used externally outside of io.grpc.xds. Exposing them outside allows users to define their own custom resources and start a watch along with the default supported resources.
Also as part of this change, move an Exception defined in the XdsClientImpl into XdsResourceType. As XdsClientImpl is an implementation package, it makes more sense to expose it via the XdsResourceType class.
I initially omitted the visibility modifier because this class began as an interface. Since it moved to an abstract class, we must make it public so it can be overriden by subclasses in the integrator's packages.
Part of #10566.
Splits the :grpc-android-interop-testing:assembleDebug and
:grpc-android-interop-testing:assembleDebugAndroidTest build
targets with hopes of avoiding OOMs.
- Multiple test cases assumed all messages would arrive on a single MessageProducer but this isn't guaranteed by the API contract.
- testBadTransactionStreamThroughput_b163053382 was writing `serverCallsCompleted` on one thread and reading it on another without synchronization. A deeper problem was that waiting for the call to complete on the server doesn't mean messages are immediately available on the client.
- Replaced 100ms polling loops with wait()/notifyAll()
- Close() InputStreams that we read as required by the MessageProducer#next contract.
Passes with --runs_per_test=1000
To support runfiles, the rule has to track more than just the
executable. `files_to_run` has both the runfile and executable
information (as separate fields), as does `files`, (combined as depset).
So using those when able is inherently "safe." `files_to_run.executable`
is only the executable, so does not propagate dependency information,
so we make sure to pass `files` to the rule in addition.
(`files_to_run.executable` is formatted into a string, so it wouldn't
carry depset information anyway.)
As originally noticed in cl/597962426
* services: Remove deprecated `io.grpc.services.BinaryLogs`
* services: Remove //services:binarylog bazel target
Was deprecated more than 2 years ago in bab1fe3.
`io.grpc.protobuf.services.BinaryLogs` should be used instead.
Experimental tracking ticket:
https://github.com/grpc/grpc-java/issues/4017
6efa9ee3 added `volatile` to `attributes` after TSAN detected a data
race that was added in 91d15ce4. The race was because attributes may be
read from another thread after `transportReady()`, and the
post-filtering assignment occurred after `transportReady()`. The code
now filters the attributes separately so they are updated before calling
`transportReady()`.
Original TSAN failure:
```
Read of size 4 at 0x0000cd44769c by thread T23:
#0 io.grpc.netty.NettyClientHandler.getAttributes()Lio/grpc/Attributes; NettyClientHandler.java:327
#1 io.grpc.netty.NettyClientTransport.getAttributes()Lio/grpc/Attributes; NettyClientTransport.java:363
#2 io.grpc.netty.NettyClientTransport.newStream(Lio/grpc/MethodDescriptor;Lio/grpc/Metadata;Lio/grpc/CallOptions;[Lio/grpc/ClientStreamTracer;)Lio/grpc/internal/ClientStream; NettyClientTransport.java:183
#3 io.grpc.internal.MetadataApplierImpl.apply(Lio/grpc/Metadata;)V MetadataApplierImpl.java:74
#4 io.grpc.auth.GoogleAuthLibraryCallCredentials$1.onSuccess(Ljava/util/Map;)V GoogleAuthLibraryCallCredentials.java:141
#5 com.google.auth.oauth2.OAuth2Credentials$FutureCallbackToMetadataCallbackAdapter.onSuccess(Lcom/google/auth/oauth2/OAuth2Credentials$OAuthValue;)V OAuth2Credentials.java:534
#6 com.google.auth.oauth2.OAuth2Credentials$FutureCallbackToMetadataCallbackAdapter.onSuccess(Ljava/lang/Object;)V OAuth2Credentials.java:525
...
Previous write of size 4 at 0x0000cd44769c by thread T24:
#0 io.grpc.netty.NettyClientHandler$FrameListener.onSettingsRead(Lio/netty/channel/ChannelHandlerContext;Lio/netty/handler/codec/http2/Http2Settings;)V NettyClientHandler.java:920
#1 io.netty.handler.codec.http2.DefaultHttp2ConnectionDecoder$FrameReadListener.onSettingsRead(Lio/netty/channel/ChannelHandlerContext;Lio/netty/handler/codec/http2/Http2Settings;)V DefaultHttp2ConnectionDecoder.java:515
...
```
io.grpc.util.CertificateUtils does much of the same thing as xds's
CertificateUtils, but also supports EC keys. The xds code pre-dates the
grpc-util class, so it isn't surprising it wasn't using it.
There's a good number of usages of the xds CertificateUtils, so I just
got rid of the duplicate implementation, but didn't yet bother changing
callers io.grpc.util.
* Provide a default implementation for new method added to ManagedTransport.Listener to support ClientTransportFilters
* Relax test constraint to reduce flakiness due to timing.
* Add test for listener.filterTransport.
Currently, if caching is enabled (as is often the case) and AsyncSecurityPolicy returns a failed future, then this future is cached forever, without giving the SecurityPolicy implementation a chance to be retried. Going forward, new invocations will trigger new security checks if the last one could not be completed successfuly.
This fixes two bugs in outbound message size checking:
* When thet checke failed, the thrown StatusRuntimeException with a status code
of RESOURCE_EXHAUSTED was been caught and rewrapped in another
StatusRuntimeException but this time with status code UNKNOWN.
* This applies the max outbound message size check to messages prior to, and
after compression, since compression of a message that is smaller than the
maximum send size can result in a larger message that exceeds the maximum
send size.
gcr.io/distroless/java:8 is no longer being updated. Java 8 isn't a
distroless option any more. Java 11 and 17 are options, but only Java 17
with Debian 12. The main alternative is to stick with Java 8 and use
something like docker.io/library/eclipse-temurin:8-jre . But there
doesn't seem to be much need to use an old JDK for these containers.
jib needed updating to support the oci manifest format used in the
updated image.
Originally you had to confirm that awaitTermination() returned true, but
that was annoying and useless, especially after calling shutdownNow().
The behavior was changed in ce2ae1fb because the awaitTermination()
detection logic could prevent the channel from getting garbage
collected.
Fixes#10732
The KeepAliveManager clamps the keepalive and keepalive time values such that the result is larger than the minimum values specified here. Therefore, a second check here is unnecessary.
All the changes outside libs.versions.toml and examples were
because of ErrorProne. It didn't actually find anything to fix; signal
vs noise has gotten pretty bad with the newer checks.
Status was changed for ErrorProne's SuperCallToObjectMethod. With the
old code it didn't notice the trivial implementation. The fail-for-test
code wasn't used, so it was easiest to just remove it.
Some of the libs had their versions inlined; now that we have
:checkForUpdates it isn't much of a risk for versions to diverge when
there's only a few artifacts sharing a version. If we need 4+ artifacts
to have the same version, then it makes sense to still use a shared
version.
Dependencies not upgraded: google-auth-libray, mockito, netty, cronet
Removing the $ prompt makes it easier to copy+paste. At no point are we
running as root, so there's no # vs $ distinction, not that many people
would even notice the difference.
There is a risk here that not all commands end up getting run. When
pasting multiple commands at once, gradle or another tool might read in
stdin and discard them. But it's probably not worth continuing the
copy-each-command-separately-and-avoiding-the-$.
This change has health checking consumer (new pick first) to install a listener through and health checking producer (outlier detection and client health checking) producing health checks. Health notification chain is built reusing the previous connectivity state chain.
Pickfirst installs the health listener, and is capable of detecting when no health checking producer is installed in the system. In that case, it sets health status to be READY so that health system is no-op.
This commit makes a small change to BinlogHelper to make it compatible with the Protobuf Java Lite runtime.
In the Lite runtime, the `addXBuilder` for repeated fields is not available. Instead, the `addX` method must be used with a manually-constructed Builder.
* Handle slow security policies without blocking gRPC threads.
- Introduce PendingAuthListener to handle a ListenableFuture<Status>, progressing the gRPC through each stage in sequence once the future completes and is OK.
- Move unit tests away from `checkAuthorizationForService` and into `checkAuthorizationForServiceAsync` since that should be the only method called in production now.
- Some tests in `ServerSecurityPolicyTest` had their expectations updated; they previously called synchornous APIs that transformed failed `ListenableFuture<Status>` into one or another status. Now, we call the sync API, so those transformations do not happen anymore, thus the test needs to deal with failed futures directly.
- I couldn't figure out if this PR needs extra tests. AFAICT `BinderSecurityTest` should already cover the new codepaths, but please let me know otherwise.
This removes the benefit of including the PR number in the title without
also requiring using github APIs to query the PR number. It still
provides the same details about the change, and indirectly links to the
PR if the user wants to see the review.
Gradle is forcing a move away from using 'project' during task excution
and because of some interactions there, this is easiest by making them
real classes. That makes them start looking quite strange in the build
file, so they are now moved to buildSrc/. We could have continued using
Groovy, but it is weird in some ways that are more apparent when making
classes and not just scripting. Instead, they were converted to Java.
They are compatible with delayed configuration resolution as well.
We already do this for WRR. Notably, we are no longer trying to avoid
the modulus each pick. It was of questionable value, and removing it is
necessary to continue sharing the same integer when the list size
changes.
The change means we can implement a stronger isEquivalentTo() by
comparing the AtomicInteger references. It is strong enough that the
operation aligns with normal equals(). Using equals() instead of
isEquivalentTo() also made more obvious an equals() optimization that
uses the hashCode() before the more expensive HashSet creation; equals()
should now be very fast except when they are (very likely) equal.
They are a lot faster. Instead of 1-3 minutes of test execution, I now
see 2-22 seconds. There still may be 3 minutes of overhead for the
gcloud command to complete, but the reduction is noticable in the total
execution time. And it seems the tests are actually being run, as there
is some flakiness. The flakiness appears to be at a lower rate.
The script was slightly reorganized to make it easier to copy commands
to run locally.
Note that this uses Pixel2 and Pixel3. Also swap 26-27 from Nexus6P to
Pixel2. We tend to prefer the latest (virtual) device for each API
level.
The current models and their supported API levels are available via:
```
gcloud firebase test android models list --filter=form=virtual
```
Pixel2.arm supports 31-32, but is beta, so I didn't swap to it. It also
supports the preview 33.
This may help some to move closer to Providers. It especially helps
cases where `NameResolverFactory`s aren't returning `InetSocketAddress`,
as it allows them to override `getProducedSocketAddressTypes()`, which
will now fail starting in 15fc70be.
The behavior purposefully mirrors that of Netty's
AbstractHttp2ConnectionHandlerBuilder.decoderEnforceMaxRstFramesPerWindow().
That API is not available to our code as we extend the
Http2ConnectionHandler, but we want our API to be able to delegate to
Netty's in the future if that ever becomes possible.
An OutlierDetectionLoadBalancer child load balancer might decided to
shut down any subchannel it is tracking. We need to make sure that those
subchannels are removed from the outlier detection tracker map to avoid
a memory leak.
Try to manage the fact that runtime permissions could be granted externally by the user after a hasPermissions() SecurityPolicy check has already been made on a transport.
Huffman in the datacenter doesn't add much value in the common cases. It could be useful to turn on huffman based on the connection latency (say, >10ms means "assume cross-datacenter") but the Netty API doesn't lend it to that. The savings here aren't huge and it is expensive; the table provides the biggest savings.
Adds a new module grpc-opentelemetry that integrates OpenTelemetry and focuses on metrics.
OpenTelemetry APIs are used for instrumenting metrics collection. Users are expected to provide SDK with implementations.
If no SDK is passed, by default gRPC uses OpenTelemetry.noop().
It was introduced in 15fc70be but unused. It could be "used" from
inprocess: targets, but the in-process transport wasn't registered, so
would fail.
We do want an in-process name resolver, but we need to agree no the URI
format cross-language before we introduce it.
* Update picker logic per A61 that it no longer pays attention to the first 2 elements, but rather takes the first ring element not in TF and uses that.
---------
Pulled in by rebase:
Eric Anderson (android: Remove unneeded proguard rule 44723b6)
Terry Wilson (stub: Deprecate StreamObservers b5434e8)
Per https://developer.android.com/about/versions/14#beta-3, Android U has reached the platform stability milestone which means that all external APIs are finalized.
We can thus bump the compileSdkVersion to 34 (U) and begin using APIs added there. We leave targetSdkVersion unchanged for now to avoid the broader evaluation of whether deeper changes may be necessary as part of the upgrade; this simply allows compile-time access to newer APIs without changing runtime behavior.
See b/274061424
This is currently the only place where we return Status.UNKNOWN with no description, which makes is harder to debug and differentiate from statuses originated from non-grpc sources.
This PR enriches ServerImpl's #internalClose `Status.UNKNOWN` with description `Application error processing RPC`.
* core, netty, okhttp: implement new logic for nameResolverFactory API in channelBuilder
fix ManagedChannelImpl to use NameResolverRegistry instead of NameResolverFactory
fix the ManagedChannelImplBuilder and remove nameResolverFactory
* Integrate target parsing and NameResolverProvider searching
Actually creating the name resolver is now delayed to the end of
ManagedChannelImpl.getNameResolver; we don't want to call into the name
resolver to determine if we should use the name resolver.
Added getDefaultScheme() to NameResolverRegistry to avoid needing
NameResolver.Factory.
---------
Co-authored-by: Eric Anderson <ejona@google.com>
Instead of a boolean, we now return a Status object. Status.OK
represents accepted addresses and other non-acceptance. This allows the
LB to provide more information about why a set of addresses were not
acceptable.
The status will later be sent to the name resolver as well to allow it
to also better react to to bad addresses.
This is the async variant of SecurityPolicy, allowing callers to implement security checks based on slow calls that aren't meant to block the gRPC thread.
BinderTransportSecurity.checkAuthorization **STILL** blocks while attempting to resolve the ListenableFuture<Status> it gets from the policy object. That will still be adressed in a follow-up.
Relate issue: #10566
Logging to the static instance would result in application logs filling
up if the Orca service is not available.
We'd like to have the logging on the subchannelLogger, so we make it
visible on demand.
Also succeed Orca logging test if log message present. Using
contains over containsExactly seems more reasonable.
Allow a security policy to returns a `ListenableFuture<Status>` that
callers can implement to perform slower auth checks (like network
calls, disk I/O etc.) without necessarily blocking the gRPC calling
thread.
Partially addresses: https://github.com/grpc/grpc-java/issues/10566
This adds the ability to disable the installation of stream tracers in
each test call in AbstractInteropTest. The tracers are stored in a list
and used to make assertions by normal tests, but a long running stress
test will accumulate too many entries in this list and the heap gets
wuickly filled up.
Adds support for specifying either google default or compute engine
"custom" credentials on the command line. This works like it does in
TestServiceClient.
Another feature from TestServiceClient is also included - the channel
builder is created differently when the server port is 0. This avoids
some ipv6 address parsing shenanigans.
Starting from version 4.5.0 Mockito uses the Java stream APIs, which are
not available on Android API levels < 24. This has been causing the
Android integration tests for API levels 21, 22 and 23 to fail.
- `ForwardingServerBuilder`
- `ForwardingChannelBuilder` - will be deprecated immidiatelly after
stabilization
- `ForwardingChannelBuilder2` - should be used instead of
`ForwardingChannelBuilder`
Introduces a flag to enable the logging of collected metrics. In some
cases it can be more practical to analyze the metrics from a log file
after the fact, instead of connecting to the metrics server to collect
the metrics as the client is running.
This breaks the ABI of the classes listed below.
Users that recompiled their code using grpc-java [`v1.36.0`]
(https://github.com/grpc/grpc-java/releases/tag/v1.36.0) (released on
Feb 23, 2021) and later, ARE NOT AFFECTED.
Users that compiled their source using grpc-java earlier than
[`v1.36.0`]
(https://github.com/grpc/grpc-java/releases/tag/v1.36.0) need to
recompile when upgrading to grpc-java `v1.59.0`. Otherwise the code
will fail on runtime with `NoSuchMethodError`. For example, code:
```java
NettyChannelBuilder.forTarget("localhost:100").maxRetryAttempts(2);
```
Will fail with
> `java.lang.NoSuchMethodError: 'io.grpc.internal.AbstractManagedChannelImplBuilder
io.grpc.netty.NettyChannelBuilder.maxRetryAttempts(int)'`
**Affected classes**
Class `AbstractManagedChannelImplBuilder` is deleted, and no longer in
the class hierarchy of the channel builders:
- `io.grpc.netty.NettyChannelBuilder`
- `io.grpc.okhttp.OkhttpChannelBuilder`
- `grpc.cronet.CronetChannelBuilder`
This breaks the ABI of the classes listed below.
Users that recompiled their code using grpc-java [`v1.36.0`]
(https://github.com/grpc/grpc-java/releases/tag/v1.36.0) (released on
Feb 23, 2021) and later, ARE NOT AFFECTED.
Users that compiled their source using grpc-java earlier than
[`v1.36.0`]
(https://github.com/grpc/grpc-java/releases/tag/v1.36.0) need to
recompile when upgrading to grpc-java `v1.59.0`. Otherwise the code
will fail on runtime with `NoSuchMethodError`. For example, code:
```java
NettyServerBuilder.forPort(80).directExecutor();
```
Will fail with
> `java.lang.NoSuchMethodError: 'io.grpc.internal.AbstractServerImplBuilder
io.grpc.netty.NettyServerBuilder.directExecutor()'`
**Affected classes**
Class `AbstractServerImplBuilder` is deleted, and no longer in the
class hierarchy of the server builders:
- `io.grpc.netty.NettyServerBuilder`
- `io.grpc.inprocess.InProcessServerBuilder`
This prevents grpc-util from being exposed on the classpath when
compiling code using grpc-okhttp. grpc-core is still needed because of
AbstractManagedChannelImplBuilder.
FixedResultPicker can be used in more situations. Note that
WrrLocalityLoadBalancerTest's test was changed non-trivially. The
noChildLb test was particularly nasty as it assumed
LoadBalancer.ErrorPicker had same toString() as
GracefulSwitchLoadBalancer's ErrorPicker.
This reverts commit 5f480de2ee. We're
seeing it thrown in (b/301552213):
```
java.util.concurrent.RejectedExecutionException
at io.grpc.stub.ClientCalls$ThreadlessExecutor.execute(ClientCalls.java:761)
at io.grpc.internal.RetriableStream$Sublistener.closed(RetriableStream.java:954)
at io.grpc.internal.ForwardingClientStreamListener.closed(ForwardingClientStreamListener.java:34)
at io.grpc.internal.InternalSubchannel$CallTracingTransport$1$1.closed(InternalSubchannel.java:691)
at io.grpc.internal.DelayedStream$DelayedStreamListener$4.run(DelayedStream.java:510)
at io.grpc.internal.DelayedStream$DelayedStreamListener.delayOrExecute(DelayedStream.java:462)
at io.grpc.internal.DelayedStream$DelayedStreamListener.closed(DelayedStream.java:507)
at io.grpc.internal.AbstractClientStream$TransportState.closeListener(AbstractClientStream.java:458)
at io.grpc.internal.AbstractClientStream$TransportState.-$$Nest$mcloseListener(AbstractClientStream.java)
at io.grpc.internal.AbstractClientStream$TransportState$1.run(AbstractClientStream.java:441)
at io.grpc.internal.AbstractClientStream$TransportState.deframerClosed(AbstractClientStream.java:278)
at io.grpc.okhttp.OkHttpClientStream$TransportState.deframerClosed(OkHttpClientStream.java:297)
at io.grpc.internal.MessageDeframer.close(MessageDeframer.java:234)
at io.grpc.internal.MessageDeframer.closeWhenComplete(MessageDeframer.java:192)
at io.grpc.internal.AbstractStream$TransportState.closeDeframer(AbstractStream.java:201)
at io.grpc.internal.AbstractClientStream$TransportState.transportReportStatus(AbstractClientStream.java:444)
at io.grpc.okhttp.OkHttpClientTransport.startGoAway(OkHttpClientTransport.java:899)
at io.grpc.okhttp.OkHttpClientTransport.-$$Nest$mstartGoAway(OkHttpClientTransport.java)
at io.grpc.okhttp.OkHttpClientTransport$ClientFrameHandler.run(OkHttpClientTransport.java:1123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:644)
at java.lang.Thread.run(Thread.java:1012)
```
In ac35ab6 the logic in xDS Name resolver was changed to support encoded
authorities. This seems to cause an issue for xdstp replacements which
would percent encode the authority for the replacement causing double
encoding.
For example:
URI = xds:///path/to/service
Authority = path%2Fto%2Fservice
xdstp resource = xdstp:///envoy.config.listener.v3.Listener/path%252Fto%252Fservice
Here the authority is encoded due to slashes and during replacement we
percent encode it again causing %2F to change to %252F. To avoid this
issue, use the encoded authority only for the getServiceAuthority() API
and for all other use cases retain the unencoded authority.
It is slow, and just tries a bunch of possibilities. Leave it around for
developers to run locally, but it isn't precise enough to run on every
build. This change reduces the build of servlet in isolation by
1 minute. I suspect the gains are larger in a full grpc build because it
is very CPU-intensive; it uses 600% CPU on my 8-thread CPU when running
by itself.
AndroidComponentAddress now accepts an Intent with merely a package
restriction, not a full ComponentName. This lets clients avoid hard
coding Service class names that they don't control.
Fixes#9062
When a memory leak occurs, it is really helpful to have access records
to understand where the buffer was being held when it leaked. retain()
when we create the NettyReadableBuffer already creates an access record
the ByteBuf, so here we track when the ByteBuf is passed to another
thread.
See #8330
Blocking can be confused with the blocking stub, which is unrelated. I'm
purposefully not saying "it is used only for X," as that isn't what we
need from the API. But it is still helpful to users to describe the
sorts of things that use it.
Fixes#10508
OverrideAuthority() is the modern alternative to rewriting the
InetAddress, and can also work if we change the channel creation to use
target strings instead. We can use test CAs now without resorting to the
Netty-specific APIs. After this change, only 4 classes in
interop-testing depend on Netty.
* Use internalClose instead of close when sendMessage has a RuntimeException.
* Change argument to internalClose to a Throwable instead of a Status.
* Rename internalClose to handleInternalError
Interop-testing is both binaries and a library. It hadn't been updated
to use java-library and to expose API-surface dependencies to
dependents.
This fixes the error:
```
> Task :grpc-gae-interop-testing-jdk8:javadoc FAILED
javadoc: error - An internal exception has occurred.
(com.sun.tools.javac.code.Symbol$CompletionFailure: class file for io.grpc.internal.testing.StreamRecorder not found)
Please file a bug against the javadoc tool via the Java bug reporting page
(http://bugreport.java.com) after checking the Bug Database (http://bugs.java.com)
for duplicates. Include error messages and the following diagnostic in your report. Thank you.
com.sun.tools.javac.code.Symbol$CompletionFailure: class file for io.grpc.internal.testing.StreamRecorder not found
1 error
```
Instead of adding explicit dependsOn to compileJava and sourcesJar,
adding the sync task to the sourceSet automatically propagates the
dependency to all consumers of the code.
* Mark MultiChildLoadBalancer as Internal. Cannot move to the internal package because of its use of classes in the util package.
* Exclude MultiChildLoadBalancer from javadoc generation.
* Fix javadoc creation.
The scripts used `git rev-parse --show-toplevel` so it appeared they
could be used from any directory. But references to "GIT_BASE_DIR"
weren't absolute, so it did matter the starting directory. And it
mattered in a big way for xds/import.sh as if you ran it from the
grpc-java directory it would delete the xds directory in grpc-java, not
third_party.
The trap that deleted the GIT_BASE_DIR was very broken. In addition to
potentially deleting the wrong directory, it was unnecessary because
that directory was in tmpdir. But you can only have one trap per signal,
so this unnecessary trap disabled the trap that deleted tmpdir.
The script needed a full clone because it needed to check out a specific
commit. To work with --depth 1 you have to use some convoluted syntax.
But just downloading a tar.gz is easy and seems should work fine on Mac.
protoc-gen-validate/import.sh didn't have the trap problem, but seemed
to have drifted from the other scritps. All the scripts were synced to
match.
Even though we don't really use japicmp, the configuration was resolving
artifacts even when the task was not executed. After some clean up, the
configuration looks more ordinary.
While not easy to use because java_grpc_library() uses a fixed
toolchain, it is possible for downstream users to apply a patch to the
repo to add their own annotation processors. This feature was added
inside Google so exporting it reduces the diff between internal and
external and causes no harm.
cl/280287611
This allows okhttp to service the Grpc.newServerBuilderForPort() API.
Note that, unlike Netty, it will throw if you try to use
ServerBuilder.forPort().
This fixes android-interop-testing which was broken by c9864a119.
OutlierDetectionLoadBalancer did not delegate calls to an existing
ClientStreamTracer from the tracer it installed. This change has the OD
tracer delegate all calls to the underlying one.
We have the OkHttp server these days, so we don't need to use Netty. Use
the generic API instead of hard-coding OkHttp.
We've seen some recent interop failures. We aren't entirely sure what is
going on, but we have seen some Netty usages in logcat. Since we don't
even want Netty on Android, just get rid of it and even if it doesn't
help with the failures things are better dependency-wise.
The README describes the released version of gRPC. But the most recent
release doesn't use protobuf 3.23.4. Telling people to use protoc 3.23.4
is asking for breakages because they will be running with protobuf-java
3.22.3, which is a downgrade.
The bump to 3.23.4 was incorrectly done in #10359.
Most changes are migrating from conventions to the equivalent
extensions. JMH, AppEngine, and Jib plugins trigger future compatibility
warnings with `--warning-mode all`.
The movement of configurations was to allow sourceSets to create the
configurations and then we just configure them. When configurations were
before sourceSets, we'd implicitly create the configuration.
The examples were _not_ updated to the newer Gradle, although the
non-Android examples work with the newer Gradle. The Android examples
use an older Android Gradle Plugin which will need to be upgraded first.
https://github.com/grpc/grpc-java/issues/10445
While JacocoReportBase.getExecutionData() is properly annotated with
`@InputFiles` and is a FileCollection, which can propagate dependencies,
the JacocoPlugin's configuration of the task [goes through a
File][barefile], which doesn't propagate task dependency information.
This is overcome in the plugin by mustRunAfter. This may have been on
purpose to allow jacocoTestReport to report on the subset of tests
already run, but is atypical these days.
Because of that, we need to copy the mustRunAfter from the other
JacocoTestReports. The dependsOn weren't doing anything because there
were no dependsOn; that configuration is quite old so it's hard to say
when they stopped functioning.
This fixes warnings like:
```
- Gradle detected a problem with the following location: 'grpc-java/core/build/jacoco/test.exec'. Reason: Task ':grpc-all:jacocoTestReport' uses this output of task ':grpc-core:test' without declaring an explicit or implicit dependency. This can lead to incorrect results being produced, depending on what order the tasks are executed. Please refer to https://docs.gradle.org/7.6/userguide/validation_problems.html#implicit_dependency for more details about this problem.
- Gradle detected a problem with the following location: 'grpc-java/testing/build/resources/main'. Reason: Task ':grpc-all:jacocoTestReport' uses this output of task ':grpc-testing:processResources' without declaring an explicit or implicit dependency. This can lead to incorrect results being produced, depending on what order the tasks are executed. Please refer to https://docs.gradle.org/7.6/userguide/validation_problems.html#implicit_dependency for more details about this problem.
```
I was able to trigger the warnings by:
```
./gradlew clean
./gradlew :grpc-core:test
./gradlew :grpc-all:jacocoTestReport
./gradlew :grpc-core:test :grpc-all:jacocoTestReport
```
[barefile]: bc4029063c/subprojects/jacoco/src/main/java/org/gradle/testing/jacoco/tasks/JacocoReportBase.java (L162-L168)
In d654707 we swapped compiling the uploaded artifacts to Java 11. This
caused ABI issues with ByteBuffer, like clear() returning ByteBuffer
instead of Buffer.
There are source-level approaches to avoid the accidental ABI dependency
on Java 11, but we have no tool able to detect such breakages.
We use Animalsniffer for similar cases, but it fails to detect these[1].
Since we have no tool, source-level approaches can't gain the necessary
confidence that all incompatibility fixes have been resolved.
Java has had javac-level ways to address this, but they used to require
setting bootclasspath. Since Java 9, though, they made it easier and we
can use --release, which does exactly what we need.
Fixes#10432
1. https://github.com/mojohaus/animal-sniffer/issues/77
The bootstrapping code currently does not log zone and subZone from locality correctly, and only logs region. This commit fixes the logging message format.
* Eliminate NPE by skipping further processing when stream is defined, but doesn't have a property for streamKey (header processing identified an error)
Fixes#10364
* Add unit test for missing content type
Our benchmarks saw a serious performance decrease with the upgrade from
Netty 4.1.88 to 4.1.94. The problem was tracked down to a single PR in
4.1.94, so we avoid 4.1.94 for now.
Fixes#10401
* Eliminate NPE by skipping further processing when stream is defined, but doesn't have a property for streamKey (header processing identified an error)
Fixes#10364
* Add unit test for missing content type
Also convert to testFixtures, like we did elsewhere.
grpc-testing had previously been listed, but because of the missing
comma, it was not an actual dependency.
Guava 32.1.0 added Gradle Module Metadata to their artifact (they still
use Maven for their build). Because of that, we're now seeing less
normal dependencies that were confusing checkUpperBoundDeps. Obviously
'null' is not the version we were looking for.
We aren't upgrading to Guava 32.1.x yet as there's still other problems.
```
Execution failed for task ':grpc-census:checkUpperBoundDeps'.
> Maven version skew: com.google.guava:guava-parent (32.1.1-jre != null) Bad version dependency path: [project ':grpc-census', com.google.guava:guava:32.1.1-android] Run './gradlew :grpc-census:dependencies --configuration runtimeClasspath' to diagnose
```
grpc-context is empty, so isn't needed. Including it also causes the
warning:
```
> Task :grpc-all:javadoc
Resolution of the configuration :grpc-context:compileClasspath was attempted from a context different than the project context. Have a look at the documentation to understand why this is a problem and how it can be resolved. This behaviour has been deprecated and is scheduled to be removed in Gradle 8.0. See https://docs.gradle.org/7.6/userguide/viewing_debugging_dependencies.html#sub:resolving-unsafe-configuration-resolution-errors for more details.
```
Encode the service authority before passing it into gRPC util in the xDS name resolver to handle xDS requests which might contain multiple slashes. Example: xds:///path/to/service:port.
As currently the underlying Java URI library does not break the encoded authority into host/port correctly simplify the check to just look for '@' as we are only interested in checking for user info to validate the authority for HTTP.
This change also leads to few changes in unit tests that relied on this check for invalid authorities which now will be considered valid.
Just like #9376, depending on Guava packages such as URLEscapers or PercentEscapers leads to internal failures(Ex: Unresolvable reference to com.google.common.escape.Escaper from io.grpc.internal.GrpcUtil). To avoid these issues create an in house version that is heavily inspired by grpc-go/grpc.
Wrapping the DnsNameResolver in DnsNameResolverProvider can cause
problems to external name resolvers that delegate to a DnsResolver
already wrapped in RetryingNameResolver. ManagedChannelImpl would
end up wrapping these name resolvers again, causing an exception
later from a RetryingNameResolver safeguard that checks for double
wrapping.
* Sort the policies in a rule by policy name when parsing from proto. This fixes the server sending a GOAWAY when an LDS update with no changes other than ordering is received.
* Remove use of deprecated method setSourceIp
* Fix style issues
* Update RbacFilterTest.java
`StatusException` thrown from `checkedStart()` may have `trailers`. Therefore, `CheckedForwardingClientCall` should pass the `trailers` to `responseListener.onClose()`.
I moved items that could be done immediately after the release up into
the main release flow. I also stripped some outdated or unnecessary text
to make it quicker to follow.
Instead of assignment, it now seems to be a FileCollection that you
modify. This has probably been broken since f458f22.
```
> Could not create task ':grpc-auth:japicmp'.
> Cannot set the value of read-only property 'oldClasspath' for task ':grpc-auth:japicmp' of type me.champeau.gradle.japicmp.JapicmpTask.
```
The change works with or without the files(), but the conversion is just
not needed anymore as it is handled by FileCollection.
Motivation:
When multiple NameResolvers are created, the Classloader is scanned every time trying to figure out if the Platform is Android. This expensive work could be done only once.
Modification:
Cache isAndroid resolution in a constant.
Result:
Less expensive multiple NameResolvers instantiation.
ManagedCahnnelImpl did not make sure to use a RetryingNameResolver if
authority was not overriden. This was not a problem for DNS name
resolution as the DNS name resolver factory explicitly returns a
RetryingNameResolver. For polling name resolvers that do not do this in
their factories (like the grpclb name resolver) this meant not having retry
at all.
Currently, the gRPC compiler isn't properly using the fully qualified
string name `java.lang.String` instead of `String`. Update the generator
to use the `$String$` alias to avoid compile issues with protobuf
messages called String.
Fixes#10316.
There's been minor version skew between Java and C++ many times because
certain releases are one-language-only. And now we have more severe
skew, where we can't readily upgrade to newer C++ Protobuf versions
because of build complexity. Let's just remove the version, and have the
canonical C++ Protobuf version live in COMPILING.md.
See #10317
* context, all: move Context classes to grpc-api
clean up grpc-context since it has no source code: only add dep on grpc-api
add exclusion for all transitive deps of grpc-api - only guava
exclude grpc-context as a dependency from grpc-alts because all context code is in grpc-api now
api: 1.7 as target Java version for Context source-set of grpc-api
* core, census: fix the issues with android project pulling in old grpc-context version
* api,context: make changes to bazel build files to account for context code moving from context to api
Note that this changes the JDK used to compile releases to Java 11. That
should only impact the appearance of the Javadoc.
This adds the Android SDK to the build container, removing the
dependency on the Android SDK being available on the CI host. This
allows running on newer Kokoro images. 'Android' and 'Android interop'
CIs still depend on the Android SDK being available on the host, but
since they aren't used as part of the release process, they can more
easily migrate off Kokoro as part of future work.
This also causes Android components to now be built with -Werror, as we
use -PfailOnWarnings=true in unix.sh but were missing it from the
Android build invocations.
Gradle will auto-download the necessary version of build-tools. We don't
want to download it ourselves because the version we specify might not
even be used. Looking at logs, we were previously downloading a version
that was unused.
We now fork javac to avoid OOM. The build fails 2/3 times before the
forking, and 0/3 after.
Plumbing through sourceSet lets cross-project dependencies work the same
way as artifacts published to Maven. This fixes an issue for
interop-testing where build/install would include all the raw files from
thirdparty in addition to the grpc-xds.jar. For example:
build/install/grpc-interop-testing/lib/com/github/xds/data/orca/v3/OrcaLoadReport$1.class
b/288577812
Since 44847bf4e, when we upgraded our JUnit version, the JUnit
exclusions have probably not been necessary. e0ac97c4f upgraded
Robolectric to a version that had the auto.service problem fixed.
Apparently our Kokoro image has this done already, and my laptop as
well. But the newer Kokoro image and other computers like my desktop
don't have it already.
* Suppress duplicate children and NACK if detect loops (child is an ancestors of the current CDS aggregate).
* Handle diamond shaped aggregations (same cluster appears under 2 distinct parents that doesn't create a loop).
The PipeSocket was convenient and avoided real I/O, but the
shutdown/close while connecting/handshaking tests were triggering a
Socket bug in Java (https://bugs.openjdk.org/browse/JDK-8278326). Using
a real socket doesn't trigger the bug because the test stops sharing
state with the code under test.
Fixes#10228
```
Details
==================
WARNING: ThreadSanitizer: data race (pid=4528)
Write of size 1 at 0x0000cfb9d5f4 by thread T36 (mutexes: write M0):
#0 java.net.Socket.setCreated()V Socket.java:687
#1 java.net.AbstractPlainSocketImpl.create(Z)V AbstractPlainSocketImpl.java:149
#2 java.net.Socket.createImpl(Z)V Socket.java:477
#3 java.net.Socket.getImpl()Ljava/net/SocketImpl; Socket.java:540
#4 java.net.Socket.setTcpNoDelay(Z)V Socket.java:998
#5 io.grpc.okhttp.OkHttpServerTransport.startIo(Lio/grpc/internal/SerializingExecutor;)V OkHttpServerTransport.java:164
#6 io.grpc.okhttp.OkHttpServerTransport.lambda$start$0(Lio/grpc/internal/SerializingExecutor;)V OkHttpServerTransport.java:159
#7 io.grpc.okhttp.OkHttpServerTransport$$Lambda$56.run()V ??
#8 io.grpc.internal.SerializingExecutor.run()V SerializingExecutor.java:133
#9 java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V ThreadPoolExecutor.java:1130
#10 java.util.concurrent.ThreadPoolExecutor$Worker.run()V ThreadPoolExecutor.java:630
#11 java.lang.Thread.run()V Thread.java:830
#12 (Generated Stub) <null>
Previous read of size 1 at 0x0000cfb9d5f4 by thread T35 (mutexes: write M1, write M2):
#0 java.net.Socket.close()V Socket.java:1512
#1 io.grpc.okhttp.OkHttpServerTransportTest$PipeSocket.close()V OkHttpServerTransportTest.java:1384
#2 io.grpc.okhttp.OkHttpServerTransportTest.clientCloseDuringHandshake()V OkHttpServerTransportTest.java:290
```
This updates the version of boringssl and removes the dependency on APR.
netty-tcnative 2.0.56.Final uses APR 1.7.0, so is in scope for
CVE-2021-35940, CVE-2022-28331, and CVE-2022-24963. netty-tcnative is
not actually vulnerable. The binary does not include apr_socket_sendv(),
apr_encode_*(), apr_pencode_*(), apr_decode_*(), apr_pdecode_*(). The
binary does include apr_time_exp_*() but it is unused code.
Unfortunately --gc-sections wasn't used during compilation.
apr_time_now() is used, but that just calls gettimeofday() and is not
vulnerable.
There's no panic here, but this updates netty-tcnative just a few weeks
before we would have ordinarily done so. Bumping the version makes life
easier for everyone.
These code locations just needed a generic gRPC server, without any
transport-specific configuration. Use vanilla ManagedServerBuilder
instead of hard-coding to Netty, as we would suggest to our users.
We configured TestGrid to file bug separately for each
failed sub-target, if we still fail the main target,
TestGrid will fail duplicate bugs.
The same change in core:
https://github.com/grpc/grpc/pull/33222.
The flag was documented in https://github.com/grpc/grpc/pull/33180 .
This flag will be useful for some RLS integration tests where we need
to have the client set certain headers for routing purposes.
The previous syntax for just adding your own keys doesn't seem to work,
but was similar to the approach of using `dict(d, foo=bar)`. You can't
have '.' and ':' in a key that way though. The doc was written before
Bazel 1.0 and in newer Bazel versions you can just use | to concatenate.
Fixes#10203
Ciphers have been "fast enough" for testing since early Java 8 updates;
we haven't needed to override ciphers since we dropped Java 7 support.
Java 8u252 had ALPN, so Conscrypt or Jetty ALPN hasn't been necessary
for basic testing for a while. We still want specialized testing for
Conscrypt, but only tests testing Conscrypt need to care.
The thisUsesUnstableApi() method was earlier deprecated and the
@ExperimentalApi annotation removed. Adding @ExperimentalApi back
to make it clear that this method can (and will) later be removed.
This fixes the warning during the test run:
```
warning: No SupportedSourceVersion annotation found on io.grpc.testing.protobuf.SimpleServiceTest$AnnotationProcessor, returning RELEASE_6.
warning: Supported source version 'RELEASE_6' from annotation processor 'io.grpc.testing.protobuf.SimpleServiceTest$AnnotationProcessor' less than -source '1.8'
```
It is forbidden internally. Error message:
Mocking types which have complex contracts or are easy to construct by
other means is forbidden:
'executor' is mocking 'interface java.util.concurrent.Executor'. Use a
real executor. Mocks of Executor don't execute submitted tasks at all,
which leads to suppressed errors, deadlocks, and brittle tests..
- Update to xDS Test Client and xDS test server Docker images to `eclipse-temurin:11-jre`.
- Perform software update so that we install patches for latest vulnerabilities.
This avoids the (often missing) evaluationDependsOn and fixes using
results from other projects without propagating those through
Configuration. It also reduces the number of useless classes pulled in
by down-stream tests, reducing the probability of rebuilds.
The expectation of fixtures is they help testing down-stream code that
use the classes in main. That applies to all the classes here except for
FakeClock and StaticTestingClassLoader. It would also apply to many
internal classes in grpc-testing, but let's consider cleaning that up
future work.
The pinning is unreliable in Maven and ignored by Gradle. I'm not at all
convinced that we are pinning/not pinning in appropriate projects. The
pinning also serves less of a purpose since we started encouraging the
BOM and grpc-netty-shaded. Netty's HTTP/2 API has also become somewhat
stable compared to its earlier history. If we notice an up-tick in
version skew, we can reinstate it.
The pinning is annoying in the build.gradle code and causes Maven/Gradle
to download the version list once a day, which can be troublesome to
users unaware of how to tell the tools to work offline.
It also opens our users to platform issues like seen in #10043
and #10086 where Maven Central's version list was incorrectly generated.
Or like #9664 where Gradle Plugin's repository caches packages from
JCenter but the version list is not as cachable so exposed us to JCenter
instability.
This fixes#8357, by way of "we think we won't worry any more." See
90db93b9 when it was originally introduced. And issues
like #8337, #3634.
LoadBalancers in general should never throw, but
parseLoadBalancingConfig() in particular has a return value to
communicate the error. Throwing can be a bit unpredictable, but at its
most trivial form causes a channel panic. There's no reason to throw
explicitly and calls to JsonUtil have to be protected by a try-catch
because it can throw.
TlsTesting.loadCert() is a public API and so should be preferred over
our internal utility. It avoids creating a temp file that has to be
deleted by a shutdown hook. Usages that needed a file were not migrated.
ReadyPicker hasn't been necessary since 111ff60e, when we stopped
calling super.pickSubchannel(). EmptyPicker was only used in tests, and
we can just compare the class name instead of doing an instanceof check.
Unfortunately, calling getClass() caused Java to start casting the
return value of pickerCaptor.getValue() based on its generics. Captors
don't verify the type they capture, so using any type other than
SubchannelPicker for the pickerCaptor is misleading and hides a cast.
each+configure serves no purpose, because the task has already been
realized with the each. This was just a faulty conversion in 0ff9f37b9e,
although it was because I thought the each was specialized API from the
protobuf plugin, not the normal container each.
We have two versions of multiple projects, the non-shadow and the
shadow. When using project() references, things are mostly fine,
although we may test with a different version than our users would use.
However, when a dependency from Maven Central depends on a grpc
artifact, it pulls in the non-shadow configuration and messes up
gradle's tracking. We shouldn't be using the non-shadow configuration,
so this issue is sort of a blessing to improve the test reliability.
Gradle 7 will notice that there's a missing dependency, but it is not
deterministic and is very slow to test. Starting in Gradle 8 it is much
more reliably triggered which greatly helped the testing efforts. The
error in Gradle 7 looks like:
```
> Task :grpc-gcp-observability:interop:startScripts
Execution optimizations have been disabled for task ':grpc-gcp-observability:interop:startScripts' to ensure correctness due to the following reasons:
- Gradle detected a problem with the following location: 'grpc-java/netty/shaded/build/libs/grpc-netty-shaded-1.54.0-SNAPSHOT-original.jar'. Reason: Task ':grpc-gcp-observability:interop:startScripts' uses this output of task ':grpc-netty-shaded:jar' without declaring an explicit or implicit dependency. This can lead to incorrect results being produced, depending on what order the tasks are executed. Please refer to https://docs.gradle.org/7.6/userguide/validation_problems.html#implicit_dependency for more details about this problem.
```
Fixes#9992
* xds: handle the handlerRemoved callback to skip updateSslContext processing
In handlerAdded we submit a callback to updateSslContext but before the
callback is executed the handler could be removed (e.g. bad connection)
in which case the callback should skip all of the processing.
Also added a unit test to check there is no exception.
If provided with the new PickFirstLoadBalancerConfig,
PickFirstLoadBalancer will shuffle the list of addresses it receives
from the name resolver.
PickFirstLoadBalancerProvider will now support the new config
if enabled by an env variable.
When the subchannel is transitioning from TRANSIENT_FAILURE to either
IDLE or CONNECTING we will not update the LB state. Additionally, if
the subchannel becomes idle we request a new connection so that the
subchannel will keep on trying to establish a connection.
All BinderTransport transactions are oneway which means uncaught
Exceptions during processing are merely logged locally and not
propagated to the peer. Instead, we add a top level catch block
that handles the unexpected by shutting down the whole transport. This
makes our peer aware of the problem immediately (instead of relying on a
deadline) and gives clients a fresh transport instance to handle any
retries.
Under normal conditions the child LB of `ClusterImplLoadBalancer` does
not fluctuate, based on the field used to configure load balancing in
the xDS `Cluster` proto it is either:
1. `WrrLocalityLoadBalancer` if the newer `load_balancing_policy` field
is used
2. `WeightedTargetLoadBalancer` if the legacy `lb_policy` field is used
`ClusterImplLoadBalancer` currently assumes that this child does not
change and so does not change the child LB when the name resolver sends an
update. If the control plane does switch to using a different field for
LB config, that update will have an LB config meant for the other child
LB type. This will result in a ClassCastException and a channel panic.
To address this, `ClusterImplLoadBalancer` will now use
`GracefulSwitchLoadBalancer` and makes sure if the child policy changes
the correct LB implementation is switched to.
The internal build rule for this client already adds these dependencies, and is what related tests for these dependencies have been relying on for a while.
We're starting to use the per-release published interop client docker images for those tests too now, and these use the gradle build rules.
The xds server can take a really long time to start if the xds resources
are slow to load. Ideally the management server would be available
during this time so we can inspect the server. The server health still
won't go to SERVING until the xds server starts, which is appropriate.
There's still plenty more that could be done, but I want to keep this on
the simpler/less-invasive side and it'd just delay these changes for no
real benefit.
Previously builds were done with Ubuntu 16.04, and now we are using
18.04. Thus the generated binaries will no longer work for
Ubuntu 16.04 and Debian 9 users, both of which are outside of their
support window and aren't supported by Abseil. RHEL users are
unaffected, as the binaries already didn't work on RHEL 7 and they will
remain working with RHEL 8. FWIW, Ubuntu 18.04 will leave its support
window in June.
The point of the sorting is to reduce the chances of merge conflicts. I
greatly prefer verboseness over cleverness in examples, but the tasks
can only be sorted manually and there's so many of them.
It is counter-productive to do this for the examples that have their own
project folder, as there's so few tasks in that case that they don't
need to be ordered.
The version used by protoc-gen-grpc-java will be upgraded separately,
because of large C++ build changes necessary. But that won't impact
users at all. We are upgrading to protoc 22.3; only the grpc plugin is
not upgraded.
Bazel is upgraded for both Java and C++.
If a child load balancer rejects the addresses it if given all we can do
is to trigger a name resolution refresh and hope for a better set of
addresses.
This removes some steps from the release process. These two locations
aren't special enough in way that deserves manually changing the version
each release.
This will replace the kokoro-based CI. We need to upgrade the image
Kokoro runs on, but it is easier to add a GitHub Actions CI than to run
a one-off test on a new image with Kokoro.
A clean run takes 10 minutes and a cached run takes 8.5. So it is a
little bit slower than the 5.5 minutes on Kokoro, but still pretty
quick.
The problem was one hedge was committed before another had drained
start(). This was not testable because HedgingRunnable checks whether
scheduledHedgingRef is cancelled, which is racy, but there's no way to
deterministically trigger either race.
The same problem couldn't be triggered with retries because only one
attempt will be draining at a time. Retries with cancellation also
couldn't trigger it, for the surprising reason that the noop stream used
in cancel() wasn't considered drained.
This commit marks the noop stream as drained with cancel(), which allows
memory to be garbage collected sooner and exposes the race for tests.
That then showed the stream as hanging, because inFlightSubStreams
wasn't being decremented.
Fixes#9185
This flag is added in the U SDK, which is still under development. Since it's just a numeric constant, we copy the value until it is stable and mark the API is experimental, with appropriate warnings about depending on it from production code.
A follow-up change will be made after SDK finalization to point to the official constant (or otherwise update to match any SDK changes), at which point we can remove the `@ExternalApi` annotation.
See b/274061424
There was recently a failure with the Tomcat test in servlet/jakarta:
```
io.grpc.servlet.jakarta.TomcatInteropTest > pingPong FAILED
java.lang.AssertionError at AbstractInteropTest.java:845
Caused by: io.grpc.StatusRuntimeException at Status.java:539
...
* What went wrong:
Execution failed for task ':grpc-servlet-jakarta:tomcat10Test'.
> There were failing tests. See the report at: file:///home/runner/work/grpc-java/grpc-java/servlet/jakarta/build/reports/tests/tomcat10Test/index.html
```
But we couldn't get more details because servlet/jakarta didn't match
the artifact glob.
LoadWorkerTest.runUnaryBlockingClosedLoop and Http2NettyTest.tlsInfo are
failing every CI run. It appears they are the unfortunate tests run
first, so are slowest to start as classloading proceeds. There's
definitely other tests that probably need adjustment, but fixing these
two gives us some hope of having a green run occasionally.
* removed populating monitored resource to k8s_conatiner by default for logging; Delegating the resource detection to cloud logging library instead (enabled by default)
* remove kubernetes resource detection logic from observability
Currently the code maintains one LoadStatsManager2 that collects all
stats. The problem with this is that in a federation situation there
will be multiple LrsClients that will be periodically picking up stats
from the manager and sending them to their respective control planes.
This creates a first-come-first-serve situation where the stats get
randomly distributed across the control planes.
This change creates separate LoadStatsManagers dedicated to their own
control planes, thus assuring no stats will get lost.
xds: Correctly start LRS clients in federation situations
The old code used a single member variable to indicate if load reporting
had already been started by XdsClientImpl. This boolean was used to
avoid starting a LoadReportClient more than twice. This works fine with
a single control plane server.
The problem occurs in federation situations where there is more than one
control plane and thus more than one LoadReportClient. Once the first
LoadReportClient is started, the member variable boolean is flipped to
true and no other LoadReportClients would be started.
This change removes the boolean member variable and relies on the fact
that starting an already started LoadReportClient is a no-op.
Provides a server with both a greet service and the health service.
Client has an example of using the health service directly through the unary call
<a href="https://github.com/grpc/grpc-java/blob/master/services/src/main/proto/grpc/health/v1/health.proto">check</a>
to get the current health. It also utilizes the health of the server's greet service
indirectly through the round robin load balancer, which uses the streaming rpc
<strong>watch</strong> (you can see how it is done in
{@link io.grpc.protobuf.services.HealthCheckingLoadBalancerFactory}).
* Fix order dependent test by changing the initializations and comparison so that elapsed time isn't as significant in identifying whether it was the context or call option's duration that was used.
fixes b/271122310
The coveralls task has been silently failing since we migrated to GitHub
Actions, away form Travis-CI:
```
no COVERALLS_REPO_TOKEN environmental variable found
no available CI service
> Task :grpc-all:coveralls
BUILD SUCCESSFUL in 23s
7 actionable tasks: 1 executed, 6 up-to-date
```
We'd rather not deal with private tokens, but the Coveralls GitHub
Action [only supports lcov][1] which makes it unhelpful for Java.
Looking deeper, yep, we [aren't the only ones impacted][2]:
[1]: https://github.com/marketplace/actions/coveralls-github-action
[2]: https://github.com/coverallsapp/github-action/issues/22
* Added s390x platform support
* Adapt to existing platform naming scheme
* Updated s390_64 library whitelist
* Use g++ compiler version 8.x for s390x
* Introduced dedicated Docker container for building s390x artifacts Minor fix
---------
Signed-off-by: Dirk Haubenreisser <haubenr@de.ibm.com>
Co-authored-by: Eric Anderson <ejona@google.com>
This PR adds a default custom tag for metrics, irrespective of custom
tags being present in the observability configuration.
OpenCensus by default adds a custom tag
[opencenus_task](https://docs.google.com/document/d/1sWC-XD277cM0PXxAhzJKY2X0Uj2W7bVoSv-jvnA0N8Q/edit?resourcekey=0-l-wqh1fctxZXHCUrvZv2BQ#heading=h.xy85j580eik0)
for metrics which gets overriden if custom tags are set.
The unique custom tag is required to ensure the uniqueness of the
Timeseries. The format of the default custom tag is:
`java-{PID}@{HOSTNAME}`, if `{PID}` is not available a random number
will be used.
This commit adds sleep in `close()` for metrics and/or traces to be
flushed before closing observability.
Currently sleep is set to 2 * [Metrics export interval (30 secs)].
This commit adds trace information (TraceId, SpanId and TraceSampled)
fields to LogEntry, when both logging and tracing are enabled in
gcp-observability.
For server-side logs, span information was readily available using
Span.getContext() propagated via `io.grpc.Context`. Similar approach is
not feasible for client-side architecture.
Client SpanContext which has all the information required to be added
to logs is propagated to the logging interceptor via `io.grpc.CallOptions`.
This provides an example on how a client can specify a deadline for an RPC. Also covers how deadlines are propagated to further RPCs a server might make.
Extensive README, a server that exposes channelz and has pauses, and a client that uses multiple channels also exposes channelz service and has a 30 second delay to allow people to run the grpcdebug tool.
Fixit b/259286633
This commit uses [OpenCensus Annotation][] to report message size
[bytes] for inbound/received messages in traces.
`addMessageEvent` API which is currently used expects both uncompressed
and compressed message (optional) sizes to be reported at the same.
Since decompression for messages happens at a later point in time,
reporting compressed message as is and reporting uncompressed size as
`-1` renders the size as _0 bytes received_ in cloud tracing front end.
As a workaround, we add _two annotations for each received message_:
* For compressed message size
* For uncompressed message size (when it is available)
This commit also removes `addMessageEvents` a flag introduced in
PR #9485 to temporarily suppress message events for gcp-observability.
[OpenCensus Annotation]: https://www.javadoc.io/static/io.opencensus/opencensus-api/0.31.0/io/opencensus/trace/Annotation.html
Allows using Android's LocalSocket via a Socket adapter. Such an adapter
isn't generally 100% safe, since some methods may not have any effect,
but we know what methods are called by gRPC's okhttp transport and can
update the adapter or the transport as appropriate.
After #9937 was merged, the Java observability tests start to fail.
This System.exit(0) call in the existing Interop client main() method
prevented execution to continue in the new combined Observability
Interop test binary here. (The new binary is calling the old binary's
main() method.)
This is the latest version of the plugin supported by the Gradle version
in use at the moment (7.6).
Note that this also upgrades the R8 optimizer to a version (4.0.48) that
now uses "full mode" optimization by default.
This also splits off Android projects to run under Java 11 (Gradle
plugin requirement) while the other projects continue to run under Java
8.
We will be running this with Java 11, which has elevated some earlier
warnings to errors that fail the build. We don't want the build to fail
because of this.
* View -> Tool Windows -> Gradle -> Edit Run Configuration -> Defaults -> JUnit -> Before lauch -> + -> Run Gradle task, enter the task in the build.gradle that sets the javaagent.
Step 1 must be taken, otherwise by the default JUnit Test Runner running a single test in IDE will trigger all the tests.
## Guidelines for Pull Requests
How to get your contributions merged smoothly and quickly.
- Create **small PRs** that are narrowly focused on **addressing a single concern**. We often times receive PRs that are trying to fix several things at a time, but only one fix is considered acceptable, nothing gets merged and both author's & review's time is wasted. Create more PRs to address different concerns and everyone will be happy.
- For speculative changes, consider opening an issue and discussing it first. If you are suggesting a behavioral or API change, consider starting with a [gRFC proposal](https://github.com/grpc/proposal).
- Provide a good **PR description** as a record of **what** change is being made and **why** it was made. Link to a github issue if it exists.
- Don't fix code style and formatting unless you are already changing that line to address an issue. PRs with irrelevant changes won't be merged. If you do want to fix formatting or style, do that in a separate PR.
- Unless your PR is trivial, you should expect there will be reviewer comments that you'll need to address before merging. We expect you to be reasonably responsive to those comments, otherwise the PR will be closed after 2-3 weeks of inactivity.
- Maintain **clean commit history** and use **meaningful commit messages**. See [maintaining clean commit history](#maintaining-clean-commit-history) for details.
- For speculative changes, consider opening an issue and discussing it to avoid
wasting time on an inappropriate approach. If you are suggesting a behavioral
structure. Have a good **commit description** as a record of **what** and
**why** the change is being made. Link to a GitHub issue if it exists. The
commit description makes a good PR description and is auto-copied by GitHub if
you have a single commit when creating the PR.
If your change is mostly for a single module (e.g., other module changes are
trivial), prefix your commit summary with the module name changed. Instead of
"Add HTTP/2 faster-than-light support to gRPC Netty" it is more terse as
"netty: Add faster-than-light support".
- Don't fix code style and formatting unless you are already changing that line
to address an issue. If you do want to fix formatting or style, do that in a
separate PR.
- Unless your PR is trivial, you should expect there will be reviewer comments
that you'll need to address before merging. Address comments with additional
commits so the reviewer can review just the changes; do not squash reviewed
commits unless the reviewer agrees. PRs are squashed when merging.
- Keep your PR up to date with upstream/master (if there are merge conflicts, we can't really merge your change).
- **All tests need to be passing** before your change can be merged. We recommend you **run tests locally** before creating your PR to catch breakages early on. Also, `./gradlew build` (`gradlew build` on Windows) **must not introduce any new warnings**.
@ -13,14 +13,14 @@ gRPC-Java - An RPC library and framework
</table>
[](https://gitter.im/grpc/grpc?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[](https://github.com/grpc/grpc-java/actions/workflows/testing.yml?branch=master)
For information on gRPC Security Policy and reporting potentional security issues, please see [gRPC CVE Process](https://github.com/grpc/proposal/blob/master/P4-grpc-cve-process.md).
For information on gRPC Security Policy and reporting potentional security
gRPC supports a number of different mechanisms for asserting identity between an client and server. This document provides code samples demonstrating how to provide SSL/TLS encryption support and identity assertions in Java, as well as passing OAuth2 tokens to services that support it.
gRPC supports a number of different mechanisms for asserting identity between an
client and server. This document provides code samples demonstrating how to
provide SSL/TLS encryption support and identity assertions in Java, as well as
passing OAuth2 tokens to services that support it.
Provider][] to ensure your application has an up-to-date OpenSSL library with
the necessary cipher-suites and a reliable ALPN implementation. This requires
[updating the security provider at runtime][config-psdsp].
Although ALPN mostly works on newer Android releases (especially since 5.0),
there are bugs and discovered security vulnerabilities that are only fixed by
upgrading the security provider. Thus, we recommend using the Play Service
Dynamic Security Provider for all Android versions.
*Note: The Dynamic Security Provider must be installed **before** creating a gRPC OkHttp channel. gRPC's OkHttpProtocolNegotiator statically initializes the security protocol(s) available to gRPC, which means that changes to the security provider after the first channel is created will not be picked up by gRPC.*
*Note: The Dynamic Security Provider must be installed **before** creating a
gRPC OkHttp channel. gRPC statically initializes the security protocol(s)
available, which means that changes to the security provider after the first
gRPC historically supported Jetty ALPN for ALPN on Java 8. While functional, it
suffers from poor performance and breakages when the JRE is upgraded.
When mis-matched to the JRE version, it can also produce unpredictable errors
that are hard to diagnose. When using it, it became common practice that any
time we saw a TLS failure that made no sense we would blame a Jetty ALPN/JRE
version mismatch and we were overwhelmingly correct. The Jetty ALPN agent makes
it much easier to use, but we still strongly discourage Jetty ALPN's use.
When using Jetty ALPN with Java 8, realize that performance will be 2-10% that
of the other options due to a slow AES GCM implementation in Java.
#### Configuring Jetty ALPN in Web Containers
Some web containers, such as [Jetty](https://www.eclipse.org/jetty/documentation/current/jetty-classloading.html) restrict access to server classes for web applications. A gRPC client running within such a container must be properly configured to allow access to the ALPN classes. In Jetty, this is done by including a `WEB-INF/jetty-env.xml` file containing the following:
```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Configure PUBLIC "-//Mort Bay Consulting//DTD Configure//EN" "http://www.eclipse.org/jetty/configure.dtd">
Server server = Grpc.newServerBuilderForPort(8443, creds)
.addService(serviceImplementation)
.build();
server.start();
.build()
.start();
```
If the issuing certificate authority is not known to the client then a properly
configured SslContext or SSLSocketFactory should be provided to the
NettyChannelBuilder or OkHttpChannelBuilder, respectively.
configured trust manager should be provided to TlsChannelCredentials and used to
construct the channel.
## Mutual TLS
[Mutual authentication][] (or "client-side authentication") configuration is similar to the server by providing truststores, a client certificate and private key to the client channel. The server must also be configured to request a certificate from clients, as well as truststores for which client certificates it should allow.
Negotiated client certificates are available in the SSLSession, which is found in the `TRANSPORT_ATTR_SSL_SESSION` attribute of <ahref="https://github.com/grpc/grpc-java/blob/master/core/src/main/java/io/grpc/Grpc.java">Grpc</a>. A server interceptor can provide details in the current Context.
Negotiated client certificates are available in the SSLSession, which is found
in the `Grpc.TRANSPORT_ATTR_SSL_SESSION` attribute of the call. A server
interceptor can provide details in the current Context.
```java
// The application uses this in its handlers
public final static Context.Key<SSLSession> SSL_SESSION_CONTEXT = Context.key("SSLSession");
// The application uses this in its handlers.
public static final Context.Key<MySecurityInfo> SECURITY_INFO = Context.key("my.security.Info");
@Override
public <ReqT,RespT> ServerCall.Listener<ReqT> interceptCall(ServerCall<ReqT,RespT> call,
@ -318,8 +299,12 @@ public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(ServerCall<ReqT, Re
if (sslSession == null) {
return next.startCall(call, headers);
}
// This interceptor can provide a centralized policy to process the client's
// certificate. Avoid exposing low-level details (like SSLSession) and
// instead provide a higher-level concept like "authenticated user."
@ -358,10 +343,6 @@ If on Fedora 30 or later and you see "libcrypt.so.1: cannot open shared object
file: No such file or directory". Run `dnf -y install libxcrypt-compat` to
install the necessary dependency.
If you are running inside of an embedded Tomcat runtime (e.g., Spring Boot),
then some versions of `netty-tcnative-boringssl-static` will have conflicts and
won't work. You must use gRPC 1.4.0 or later.
Most dependency versioning problems can be solved by using
`io.grpc:grpc-netty-shaded` instead of `io.grpc:grpc-netty`, although this also
limits your usage of the Netty-specific APIs. `io.grpc:grpc-netty-shaded`
@ -413,24 +394,24 @@ grpc-netty version | netty-handler version | netty-tcnative-boringssl-static ver
1.44.x-1.47.x | 4.1.72.Final | 2.0.46.Final
1.48.x-1.49.x | 4.1.77.Final | 2.0.53.Final
1.50.x-1.53.x | 4.1.79.Final | 2.0.54.Final
1.54.x- | 4.1.87.Final | 2.0.56.Final
1.54.x-1.55.x | 4.1.87.Final | 2.0.56.Final
1.56.x | 4.1.87.Final | 2.0.61.Final
1.57.x-1.58.x | 4.1.93.Final | 2.0.61.Final
1.59.x | 4.1.97.Final | 2.0.61.Final
1.60.x-1.66.x | 4.1.100.Final | 2.0.61.Final
1.67.x-1.70.x | 4.1.110.Final | 2.0.65.Final
1.71.x-1.74.x | 4.1.110.Final | 2.0.70.Final
1.75.x- | 4.1.124.Final | 2.0.72.Final
_(grpc-netty-shaded avoids issues with keeping these versions in sync.)_
### OkHttp
If you are using gRPC on Android devices, you are most likely using `grpc-okhttp` transport.
If you are using gRPC on Android devices, you are most likely using
`grpc-okhttp` transport.
Find the dependency tree (e.g., `mvn dependency:tree`), and look for versions of:
- `io.grpc:grpc-okhttp`
- `com.squareup.okhttp:okhttp`
If you don't have `grpc-okhttp`, you should add it as a dependency.
If you have both `io.grpc:grpc-netty` and `io.grpc:grpc-okhttp`, you may also have issues. Remove `grpc-netty` if you are on Android.
If you have `okhttp` version below 2.5.0, then it may not work with gRPC.
It is OK to have both `okhttp` 2.x and 3.x since they have different group name and under different packages.
Find the dependency tree (e.g., `mvn dependency:tree`), and look for
`io.grpc:grpc-okhttp`. If you don't have `grpc-okhttp`, you should add it as a
dependency.
# gRPC over plaintext
@ -441,17 +422,12 @@ An option is provided to use gRPC over plaintext without TLS. While this is conv
The following code snippet shows how you can call the Google Cloud PubSub API using gRPC with a service account. The credentials are loaded from a key stored in a well-known location or by detecting that the application is running in an environment that can provide one automatically, e.g. Google Compute Engine. While this example is specific to Google and it's services, similar patterns can be followed for other service providers.