6837 Commits

Author SHA1 Message Date
Eric Anderson 25199e9df9
xds: XdsDepManager should ignore updates after shutdown
This prevents an NPE and subsequent channel panic when trying to build a
config (because there are no watchers, so waitingOnResource==false)
without any listener and route.
```
java.lang.NullPointerException: Cannot invoke "io.grpc.xds.XdsDependencyManager$RdsUpdateSupplier.getRdsUpdate()" because "routeSource" is null
    at io.grpc.xds.XdsDependencyManager.buildUpdate(XdsDependencyManager.java:295)
    at io.grpc.xds.XdsDependencyManager.maybePublishConfig(XdsDependencyManager.java:266)
    at io.grpc.xds.XdsDependencyManager$EdsWatcher.onChanged(XdsDependencyManager.java:899)
    at io.grpc.xds.XdsDependencyManager$EdsWatcher.onChanged(XdsDependencyManager.java:888)
    at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.notifyWatcher(XdsClientImpl.java:929)
    at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.lambda$onData$0(XdsClientImpl.java:837)
    at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:96)
```

I think this fully fixes the problem today, but not tomorrow.
subscribeToCluster() is racy as well, but not yet used.

This was noticed when idleTimeout was firing, with some other code
calling getState(true) to wake the channel back up. That may have made
this panic more visible than it would be otherwise, but that has not
been investigated.

b/412474567
2025-04-23 09:18:08 -07:00
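The guard described in commit 25199e9df9 above can be illustrated with a minimal, self-contained sketch. The names here (`DepManagerSketch`, `onChanged`, `shutdown`, `published`) are hypothetical stand-ins, not the actual `XdsDependencyManager` code. The sketch only shows the idea of dropping a late watcher callback once shutdown has cleared the state that building a config depends on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a dependency manager; not the real XdsDependencyManager.
final class DepManagerSketch {
  private boolean shutdown;
  private String routeSource = "route-config";  // cleared on shutdown
  private final List<String> published = new ArrayList<>();

  // Called when a watched resource changes. Assumed to run on a single
  // serializing executor, so no locking is shown.
  void onChanged(String update) {
    if (shutdown) {
      return;  // Late update: dereferencing routeSource below would throw an NPE.
    }
    published.add(routeSource.concat("/").concat(update));
  }

  void shutdown() {
    shutdown = true;
    routeSource = null;
  }

  List<String> published() {
    return published;
  }
}
```

Without the `shutdown` check, an update delivered after `shutdown()` would dereference the cleared `routeSource`, which mirrors the failure mode in the stack trace.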
Kannan J 7952afdd56
Add some documentation to StatusOr.equals regarding how underlying statuses are compared, to avoid any confusion, as suggested in issue #11949. (#12036)
2025-04-23 18:42:24 +05:30
Abhishek Agrawal 6cd007d0d0
xds: add the missing xds.authority metric (#12018)
This completes the [XDS client metrics](https://github.com/grpc/proposal/blob/master/A78-grpc-metrics-wrr-pf-xds.md#xdsclient) by adding the remaining grpc.xds.authority metric.
2025-04-22 14:34:51 +05:30
Eric Anderson 9619453799
Implement grpc.lb.backend_service optional label
This completes gRFC A89. 7162d2d66 and fc86084df had already implemented
the LB plumbing for the optional label on RPC metrics. This observes the
value in OpenTelemetry and adds it to WRR metrics as well.

https://github.com/grpc/proposal/blob/master/A89-backend-service-metric-label.md
2025-04-21 06:17:43 -07:00
Abhishek Agrawal 53de8a72ca
Update README etc to reference 1.72.0 (#12025) 2025-04-17 17:37:08 +05:30
MV Shiva 7a08fdb7f9
xds: float LRU cache across interceptors (#11992) 2025-04-17 07:26:40 +05:30
Kurt Alfred Kluever 84bd01454b context: Remove mention of "epoch" from Ticker.nanoTime() javadocs, plus other minor touchups
In Java, when people hear "epoch", they think "unix epoch".

cl/747082451
2025-04-15 13:00:35 -07:00
Eric Anderson 65d0bb8a4d
xds: Enable deprecation warnings
The security code referenced fields removed from gRFC A29 before it was
finalized.

Note that this fixes a bug in CommonTlsContextUtil where
CombinedValidationContext was not checked. I believe this was the only
location with such a bug as I audited all non-test usages of
has/getValidationContext() and confirmed they have a corresponding
has/getCombinedValidationContext().
2025-04-11 08:25:21 -07:00
Eric Anderson f79ab2f16f api: Remove deprecated SubchannelPicker.requestConnection()
It has been deprecated since cec9ee368, six years ago. It was replaced
with LoadBalancer.requestConnection().
2025-04-09 12:51:33 -07:00
Eric Anderson a6aec2769e auth: Use pre-existing private key in test
Generating a KeyPair is very expensive when running with TSAN, because
TSAN keeps the JVM in interpreted mode. This speeds up the test running
on my desktop from .368s to .151s; faster, but nobody cares. With TSAN,
the speedup is from 150-500s to 4-6s. Within Google the test was timing
out because it was taking so long. While we can increase the timeout,
it seems better to speed up the test in this easy way.
2025-04-08 11:03:45 -07:00
Eric Anderson 2db4852e23 core: Loop over interceptors when computing effective interceptors
A post-merge review of 8516cfef9 suggested this change and the comment
had been lost in my inbox.
2025-04-07 21:33:55 -07:00
jiangyuan 54d37839a3
stub: trailersFromThrowable() metadata should be copied (#11979)
If the same exception is passed to multiple RPCs, then the results will
race.

Fixes #11973
2025-04-07 14:34:57 -07:00
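The race described in commit 54d37839a3 above comes from several RPCs sharing one mutable trailers object. A minimal sketch of the defensive copy follows. It uses a plain `Map` and a hypothetical `TrailersException` in place of gRPC's `Metadata` and status exceptions, so it illustrates the idea rather than the actual change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical exception carrying mutable trailers; a stand-in, not a gRPC type.
final class TrailersException extends RuntimeException {
  final Map<String, String> trailers = new HashMap<>();
}

final class TrailersSketch {
  // Walks the cause chain and returns a fresh copy of the first trailers found,
  // so each caller can mutate its result independently. Returns null if none.
  static Map<String, String> trailersFromThrowable(Throwable t) {
    while (t != null) {
      if (t instanceof TrailersException) {
        return new HashMap<>(((TrailersException) t).trailers);
      }
      t = t.getCause();
    }
    return null;
  }
}
```

Because every call returns its own copy, passing the same exception to two RPCs no longer lets one RPC's trailer mutations leak into the other.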
Eric Anderson aae52de3b8 stub: Add RunWith(JUnit4) to support varied environments
Some JUnit environments require the RunWith annotation. Notably
Blaze/Bazel needs it.
2025-04-04 12:28:37 -07:00
Kannan J a13fca2bf2
xds: ClusterResolverLoadBalancer handle update for both resolved addresses and errors via ResolutionResult (#11997) 2025-04-04 22:08:29 +05:30
Alex Panchenko edc2bf7346
stub: Utility method StreamObservers.nextAndComplete() that does both onNext and onComplete (#11778) 2025-04-04 19:39:35 +05:30
Eric Anderson 5ca4d852ae
core: Avoid Set.removeAll() when passing a possibly-large List (#11994)
See #11958
2025-04-04 17:46:37 +05:30
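The pitfall behind commit 5ca4d852ae above is documented JDK behavior: `AbstractSet.removeAll()` iterates over the set and calls `contains()` on the argument whenever the set is not larger than the argument, and `contains()` is linear when the argument is a `List`. A minimal sketch of the usual workaround follows. `RemoveAllSketch` and `removeEach` are illustrative names, not taken from the commit.

```java
import java.util.Collection;
import java.util.Set;

final class RemoveAllSketch {
  // Removes every element of toRemove from set using one set lookup per
  // element, regardless of which collection is larger.
  static <T> void removeEach(Set<T> set, Collection<? extends T> toRemove) {
    for (T item : toRemove) {
      set.remove(item);
    }
  }
}
```

With a `HashSet`, this keeps the cost proportional to the size of `toRemove` instead of the product of the two sizes.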
Abhishek Agrawal d4c46a7f1f
refactor: prevents global stats config freeze in ConfiguratorRegistry.getConfigurators() (#11991) 2025-04-04 11:23:08 +05:30
MV Shiva c8d1e6e39c
xds: listener type validation (#11933) 2025-04-03 11:22:26 +05:30
MV Shiva 84c7713b2f
xds: propagate audience from cluster resource in gcp auth filter (#11972) 2025-04-02 16:29:55 +05:30
Eric Anderson 908f9f19cd
core: Delete the long-deprecated GRPC_PROXY_EXP (#11988)
"EXP" stood for experimental and all documentation that referenced it made it clear it was experimental. It's been some years since we started logging a message when it was used to say it will be deleted. There's no time like the present to delete it.
2025-04-02 16:24:32 +05:30
Eric Anderson 8ca7c4ef1f
core: Delete stale SuppressWarnings("deprecated") for ATTR_LOAD_BALANCING_CONFIG (#11982)
ATTR_LOAD_BALANCING_CONFIG was deleted in bf7a42dbd.
2025-04-02 16:22:00 +05:30
Kannan J c28a7e3e06
okhttp: Per-rpc call option authority verification (#11754) 2025-04-02 10:10:41 +05:30
Abhishek Agrawal 8f6a16f846
Start 1.73.0 development cycle (#11987) 2025-04-01 16:27:39 +05:30
Eric Anderson 2448c8b6b9
util: Replace BUFFER_PICKER with FixedResultPicker
I think at some point there were more usages in the tests. But now it
is pretty easy.

PriorityLb.ChildLbState.picker is initialized to
FixedResultPicker(NoResult). So now that GracefulSwitchLb is using the
same picker, equals() is able to de-dup an update.
2025-03-28 12:49:36 -07:00
Eric Anderson 2e260a4bbc util: Graceful switch to new LB when leaving CONNECTING
Previously it would wait for the new LB to enter READY. However, that
prevents there being an upper-bound on how long the old policy will
continue to be used. The point of graceful switch is to avoid RPCs
seeing increased latency when we swap config. We don't want it to
prevent the system from becoming eventually consistent.
2025-03-28 15:18:10 +00:00
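The rule change in commit 2e260a4bbc above can be sketched as two predicates over the pending policy's state. The enum and method names are hypothetical, and the real graceful-switch logic may consider more than the pending policy's state; this only contrasts the "wait for READY" rule with the "leave CONNECTING" rule described in the message.

```java
// Hypothetical connectivity states for the sketch.
enum ConnState { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

final class GracefulSwitchSketch {
  // Previous rule as described: swap only once the new policy is READY.
  static boolean shouldSwapOld(ConnState pendingState) {
    return pendingState == ConnState.READY;
  }

  // New rule as described: swap as soon as the new policy leaves CONNECTING.
  static boolean shouldSwapNew(ConnState pendingState) {
    return pendingState != ConnState.CONNECTING;
  }
}
```

Under the old rule, a new policy stuck in `TRANSIENT_FAILURE` would never replace the old one. Under the new rule it does, which bounds how long the old policy stays in use.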
Alex Panchenko 7507a9ec06
core: Use java.time.Instant.getNano in InstantTimeProvider without reflection (#11977)
Fixes #11975
2025-03-26 13:49:21 +05:30
Abhishek Agrawal a332eddc13
fix: cleans up FileWatcherCertificateProvider in XdsSecurityClientServerTest 2025-03-26 11:43:05 +05:30
jiangyuan 350f90e1a3
services: Avoid cancellation exceptions when notifying watchers that already have their connections cancelled (#11934)
Some clients watching health status can cancel their watch, and `HealthService` was getting a CANCELLED exception when trying to notify these watchers because there was no cancellation handler set on the `StreamObserver`. This change sets a cancellation handler that removes the watcher from the set of watcher clients to be notified of the health status.
2025-03-25 17:42:28 +05:30
Eric Anderson 3961a923ac
core: Log any exception during panic because of exception
panic() calls a good amount of code, so it could get another exception.
The SynchronizationContext is running on an arbitrary thread and we
don't want to propagate this secondary exception up its stack (to be
handled by its UncaughtExceptionHandler); if we wanted that, we'd
propagate the original exception.

This second exception will only be seen in the logs; the first exception
was logged and will be used to fail RPCs.

Also related to http://yaqs/8493785598685872128 and b692b9d26
2025-03-24 14:32:53 -07:00
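The handling described in commit 3961a923ac above amounts to wrapping the panic path in a catch-all that only logs. A minimal sketch under stated assumptions: `PanicSketch`, `handleUncaught`, and the `cleanup` hook are hypothetical, and a list stands in for the logger.

```java
import java.util.ArrayList;
import java.util.List;

final class PanicSketch {
  final List<Throwable> logged = new ArrayList<>();  // stand-in for a logger
  Throwable panicCause;                               // used to fail RPCs
  Runnable cleanup = () -> {};                        // hypothetical work done during panic

  // Hypothetical entry point for an exception escaping the serialized executor.
  void handleUncaught(Throwable original) {
    logged.add(original);
    try {
      panic(original);
    } catch (Throwable secondary) {
      // Only log; do not rethrow onto the arbitrary thread running the executor.
      logged.add(secondary);
    }
  }

  private void panic(Throwable cause) {
    panicCause = cause;
    cleanup.run();  // may itself throw
  }
}
```

The first exception remains the recorded cause, and the second one appears only in the log.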
Ashley Zhang 1958e42370
xds: add support for custom per-target credentials on the transport (#11951) 2025-03-21 15:19:40 -07:00
yifeizhuang 94f8e93691
otel tracing: fix span names (#11974) 2025-03-21 15:19:25 -07:00
Alex Panchenko d60e6fc251
Replace usages of deprecated ExpectedException in grpc-api and grpc-core (#11962) 2025-03-21 13:00:24 +05:30
Eric Anderson d2d72cda83
xds: Expose filter names to filter instances (#11971)
This is to support gRFC A83 xDS GCP Authentication Filter:
> Otherwise, the filter will look in the CDS resource's metadata for a
> key corresponding to the filter's instance name.
2025-03-21 11:01:16 +05:30
Eric Anderson bb120a8cbb xds: Assert XdsNR's cluster ref counting is consistent
It is much harder to debug refcounting problems when we ignore
impossible situations. So make such impossible cases complain loudly so
the bug is obvious.
2025-03-19 13:47:02 -07:00
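The "complain loudly" approach from commit bb120a8cbb above can be sketched with a small reference counter that throws on states that should be impossible instead of ignoring them. `ClusterRefsSketch` and its methods are illustrative names, not the XdsNameResolver code.

```java
import java.util.HashMap;
import java.util.Map;

final class ClusterRefsSketch {
  private final Map<String, Integer> refs = new HashMap<>();

  void retain(String cluster) {
    refs.merge(cluster, 1, Integer::sum);
  }

  void release(String cluster) {
    Integer count = refs.get(cluster);
    if (count == null) {
      // An unmatched release indicates a refcounting bug; fail instead of ignoring it.
      throw new IllegalStateException("release of untracked cluster: " + cluster);
    }
    if (count == 1) {
      refs.remove(cluster);
    } else {
      refs.put(cluster, count - 1);
    }
  }

  int count(String cluster) {
    return refs.getOrDefault(cluster, 0);
  }
}
```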
Eric Anderson bc3c764058 xds: Include XdsConfig as a CallOption
This allows Filters to access the xds configuration for their own
processing. From gRFC A83:

> This data is available via the XdsConfig attribute introduced in A74.
> If the xDS ConfigSelector is not already passing that attribute to the
> filters, it will need to be changed to do so.
2025-03-19 09:04:27 -07:00
Abhishek Agrawal a57c14a51e
refactor: Stops exception allocation on channel shutdown
This fixes #11955.

Stops exception allocation and its propagation on channel shutdown.
2025-03-19 09:27:34 +05:30
Eric Anderson e80c197455
xds: Use XdsDependencyManager for XdsNameResolver
Contributes to the gRFC A74 effort.
https://github.com/grpc/proposal/blob/master/A74-xds-config-tears.md

The alternative to using Mockito's ArgumentMatcher is to use Hamcrest.
However, Hamcrest did not impress me. ArgumentMatcher is trivial if you
don't care about the error message.

This fixes a pre-existing issue where ConfigSelector.releaseCluster
could revert the LB config back to using cluster manager after
all RPCs using a cluster have committed.

Co-authored-by: Larry Safran <lsafran@google.com>
2025-03-18 14:05:01 -07:00
MV Shiva e388ef3975
documentation: upgrade to junit 4.13.2 (#11967) 2025-03-18 18:43:03 +05:30
Dennis Shao b69bd64ce7
Populate the pb::java feature extension to grpc proto plugin (#11885)
Populate the pb::java feature extension to the protoc plugins that require Protobuf Java feature resolution for the edition.
2025-03-17 18:46:28 +05:30
Alex Panchenko fca1d3cf43
servlet: set description for CANCELLED status (#11927) 2025-03-12 14:09:49 +05:30
MV Shiva 2f52a00364
netty: Swap to UniformStreamByteDistributor (#11954) 2025-03-11 22:39:54 +05:30
Kannan J 2191557582
Update README etc to reference 1.71.0 (#11940) 2025-03-11 16:05:39 +05:30
Arjan Singh Bal 4933cddd00
Fix typo in dualstack example (#11916) 2025-03-11 16:05:05 +05:30
Kannan J 24b9f6ff0d
Update psm-dualstack.cfg (#11950)
120 minutes has not been sufficient, causing frequent VM timeout errors in the test runs: https://testgrid.corp.google.com/grpc-psm-java#v1.67.x&width=20&graph-metrics=test-duration-minutes&include-filter-by-regex=psm-dualstack$
2025-03-10 12:33:45 +05:30
Emmanuel Ferdman 61a110d962
examples: Update in-process sources in examples (#11952)
Update in-process sources location in examples since they have been migrated from core artifacts.
2025-03-10 05:20:20 +00:00
Eric Anderson f3f054a0a4 xds: Log cluster_manager config update before applying config
The logs are confusing and harder to read when the
activations/deactivations caused by the config happen before the log
entry describing the new config.
2025-03-07 14:37:37 -08:00
Eric Anderson d82613a74c
xds: Fix cluster selection races when updating config selector
Listener2.onResult() doesn't require running in the sync context, so
when called from the sync context it is guaranteed not to do its
processing immediately (instead, it schedules work into the sync
context).

The code was doing an update dance: 1) update service config to add new
cluster, 2) update config selector to use new cluster, 3) update service
config to remove old clusters. But the onResult() wasn't being processed
immediately, so the actual execution order was 2, 1, 3 which has a small
window where RPCs will fail. But onResult2() does run immediately. And
since ca4819ac6, updateBalancingState() updates the picker immediately.

cleanUpRoutes() was also racy because it updated the routingConfig
before swapping to the new config selector, so RPCs could fail saying
there was no route instead of the useful error message. Even with the
opposite order, some RPCs may be executing the while loop of
selectConfig(), trying to acquire a cluster. The code unreffed the
clusters before updating the routingConfig, so those RPCs could go into
a tight loop until the routingConfig was updated. Also, once the
routingConfig was updated to EMPTY those RPCs would similarly
see the wrong error message. To give the correct error message,
selectConfig() must fail such RPCs directly, and once it can do that
there's no need to stop using the config selector in error cases. This
has the benefit of fewer moving parts and more consistent threading
among cases.

The added test was able to detect the race 2% of the time. The slower
the code/machine, the more reliably the test failed. ca4819ac6 along
with this commit reduced it to 0 failures in 1000 runs.

Discovered when investigating b/394850611
2025-03-07 10:33:35 -08:00
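The 2, 1, 3 ordering described in commit d82613a74c above follows from how a serializing executor treats work submitted from inside a running task. A minimal single-threaded sketch follows. `SyncContextSketch` is a simplified stand-in, not gRPC's `SynchronizationContext`, and the step labels are only illustrative.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Minimal single-threaded stand-in for a serializing executor.
final class SyncContextSketch {
  private final Queue<Runnable> queue = new ArrayDeque<>();
  private boolean draining;

  void execute(Runnable task) {
    queue.add(task);
    if (draining) {
      return;  // Already inside the context: the task waits its turn.
    }
    draining = true;
    try {
      Runnable next;
      while ((next = queue.poll()) != null) {
        next.run();
      }
    } finally {
      draining = false;
    }
  }
}

final class UpdateDanceSketch {
  // Runs the three steps from inside the context and records the actual order.
  static List<String> run() {
    List<String> order = new ArrayList<>();
    SyncContextSketch ctx = new SyncContextSketch();
    ctx.execute(() -> {
      ctx.execute(() -> order.add("1: add new cluster"));     // queued, not run yet
      order.add("2: swap config selector");                   // runs immediately
      ctx.execute(() -> order.add("3: remove old clusters")); // queued
    });
    return order;
  }
}
```

Calling `UpdateDanceSketch.run()` records step 2 before step 1, which is the window the commit describes. Running step 1 immediately instead of queueing it restores the intended 1, 2, 3 order.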
Eric Anderson ca4819ac6d core: Apply ManagedChannelImpl's updateBalancingState() immediately
ffcc360ba adjusted updateBalancingState() to require being run within
the sync context. However, it still queued the work into the sync
context, which was unnecessary. This re-entering the sync context
unnecessarily delays the new state from being used.
2025-03-06 12:31:10 -08:00
Sergii Tkachenko a6a041e415
xds: Support filter state retention
This PR adds support for filter state retention in Java. The mechanism
will be similar to the one described in
[A83](https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md#filter-call-credentials-cache)
for C-core, and will serve the same purpose. However, the
implementation details are very different due to the different nature
of xDS HTTP filter support in C-core and Java.

### Filter instance lifecycle
#### xDS gRPC clients
New filter instances are created per combination of:
1. `XdsNameResolver` instance,
2. Filter name+typeUrl as configured in 
   HttpConnectionManager (HCM) http_filters.

Existing client-side filter instances are shut down:
- A single filter instance is shut down when an LDS update contains
  an HCM that is missing filter configuration for the name+typeUrl
  combination of this instance.
- All filter instances when the watched LDS resource is missing from an
  LDS update.
- All filter instances on name resolver shutdown.

#### xDS-enabled gRPC servers
New filter instances are created per combination of:
1. Server instance,
2. FilterChain name,
3. Filter name+typeUrl as configured in FilterChain's HCM.http_filters

Filter instances of the Default Filter Chain are tracked separately per:
1. Server instance,
2. Filter name+typeUrl in default_filter_chain's HCM.http_filters.

Existing server-side filter instances are shut down:
- A single filter instance is shut down when an LDS update contains
  a FilterChain with HCM.http_filters that is missing configuration for
  filter name+typeUrl.
- All filter instances associated with the FilterChain when an LDS
  update no longer contains FilterChain's name.
- All filter instances when watched LDS resource is missing from an
  LDS update.
- All filter instances on server shutdown.

### Related
- Part 1: #11883
2025-03-06 10:32:08 -08:00
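The client-side lifecycle rules in commit a6a041e415 above can be sketched as a registry keyed by filter name+typeUrl. This is a simplified illustration under stated assumptions: `FilterRegistrySketch`, `FilterInstance`, and the `"name|typeUrl"` key format are hypothetical, and the server-side variant would add the FilterChain name to the key.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

final class FilterRegistrySketch {
  // Hypothetical filter instance that only records whether it was shut down.
  static final class FilterInstance {
    final String key;
    boolean closed;

    FilterInstance(String key) {
      this.key = key;
    }

    void close() {
      closed = true;
    }
  }

  private final Map<String, FilterInstance> active = new HashMap<>();

  // Applies the filter keys ("name|typeUrl") present in the latest update:
  // instances whose key disappeared are shut down, retained keys keep their
  // existing instance, and new keys get a new instance.
  void onUpdate(Set<String> configuredKeys) {
    Iterator<Map.Entry<String, FilterInstance>> it = active.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, FilterInstance> entry = it.next();
      if (!configuredKeys.contains(entry.getKey())) {
        entry.getValue().close();
        it.remove();
      }
    }
    for (String key : configuredKeys) {
      active.computeIfAbsent(key, FilterInstance::new);
    }
  }

  // Used when the watched resource is missing or the owner shuts down.
  void shutdownAll() {
    for (FilterInstance filter : active.values()) {
      filter.close();
    }
    active.clear();
  }

  FilterInstance get(String key) {
    return active.get(key);
  }
}
```

Keeping the same instance across updates is what allows a filter to retain state, such as a credentials cache, for as long as its name+typeUrl stays configured.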
MV Shiva 602aece081
xds: avoid unnecessary dns lookup (#11932) 2025-03-06 16:04:53 +05:30