Commit Graph

1118 Commits

Author SHA1 Message Date
MV Shiva bfb55b5553
xds: Add GcpAuthenticationFilter to FilterRegistry (#12075) (#12086) 2025-05-23 14:57:04 +05:30
MV Shiva e772265530
xds: Enable least request by default (#12054) (#12062) 2025-05-15 10:56:14 +05:30
vinodhabib 6baac45bd2
xds: Fix pretty-print of Cluster with WrrLocality and LB policies (#12037) 2025-05-12 12:44:14 +05:30
Eric Anderson 80cc988b3c
xds: Use acceptResolvedAddresses() for WeightedTarget children (#12053)
Convert the tests to use acceptResolvedAddresses() as well.
2025-05-08 11:34:16 +05:30
Kim Jin Young 12aaf88d86
Fix comment's typo (#12045) 2025-05-05 22:32:31 +05:30
Eric Anderson 25199e9df9
xds: XdsDepManager should ignore updates after shutdown
This prevents a NPE and subsequent channel panic when trying to build a
config (because there are no watchers, so waitingOnResource==false)
without any listener and route.
```
java.lang.NullPointerException: Cannot invoke "io.grpc.xds.XdsDependencyManager$RdsUpdateSupplier.getRdsUpdate()" because "routeSource" is null
    at io.grpc.xds.XdsDependencyManager.buildUpdate(XdsDependencyManager.java:295)
    at io.grpc.xds.XdsDependencyManager.maybePublishConfig(XdsDependencyManager.java:266)
    at io.grpc.xds.XdsDependencyManager$EdsWatcher.onChanged(XdsDependencyManager.java:899)
    at io.grpc.xds.XdsDependencyManager$EdsWatcher.onChanged(XdsDependencyManager.java:888)
    at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.notifyWatcher(XdsClientImpl.java:929)
    at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.lambda$onData$0(XdsClientImpl.java:837)
    at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:96)
```

I think this fully-fixes the problem today, but not tomorrow.
subscribeToCluster() is racy as well, but not yet used.

This was noticed when idleTimeout was firing, with some other code
calling getState(true) to wake the channel back up. That may have made
this panic more visible than it would be otherwise, but that has not
been investigated.

b/412474567
2025-04-23 09:18:08 -07:00
Abhishek Agrawal 6cd007d0d0
xds: add the missing xds.authority metric (#12018)
This completes the [XDS client metrics](https://github.com/grpc/proposal/blob/master/A78-grpc-metrics-wrr-pf-xds.md#xdsclient) by adding the remaining grpc.xds.authority metric.
2025-04-22 14:34:51 +05:30
Eric Anderson 9619453799
Implement grpc.lb.backend_service optional label
This completes gRFC A89. 7162d2d66 and fc86084df had already implemented
the LB plumbing for the optional label on RPC metrics. This observes the
value in OpenTelemetry and adds it to WRR metrics as well.

https://github.com/grpc/proposal/blob/master/A89-backend-service-metric-label.md
2025-04-21 06:17:43 -07:00
MV Shiva 7a08fdb7f9
xds: float LRU cache across interceptors (#11992) 2025-04-17 07:26:40 +05:30
Eric Anderson 65d0bb8a4d
xds: Enable deprecation warnings
The security code referenced fields removed from gRFC A29 before it was
finalized.

Note that this fixes a bug in CommonTlsContextUtil where
CombinedValidationContext was not checked. I believe this was the only
location with such a bug as I audited all non-test usages of
has/getValidationContext() and confirmed they have a corresponding
has/getCombinedValidationContext().
2025-04-11 08:25:21 -07:00
Eric Anderson f79ab2f16f api: Remove deprecated SubchannelPicker.requestConnection()
It has been deprecated since cec9ee368, six years ago. It was replaced
with LoadBalancer.requestConnection().
2025-04-09 12:51:33 -07:00
Kannan J a13fca2bf2
xds: ClusterResolverLoadBalancer handle update for both resolved addresses and errors via ResolutionResult (#11997) 2025-04-04 22:08:29 +05:30
MV Shiva c8d1e6e39c
xds: listener type validation (#11933) 2025-04-03 11:22:26 +05:30
MV Shiva 84c7713b2f
xds: propagate audience from cluster resource in gcp auth filter (#11972) 2025-04-02 16:29:55 +05:30
Eric Anderson 2448c8b6b9
util: Replace BUFFER_PICKER with FixedResultPicker
I think at some point there were more usages in the tests. But now it
is pretty easy.

PriorityLb.ChildLbState.picker is initialized to
FixedResultPicker(NoResult). So now that GracefulSwitchLb is using the
same picker, equals() is able to de-dup an update.
2025-03-28 12:49:36 -07:00
Abhishek Agrawal a332eddc13
fix: cleans up FileWatcherCertificateProvider in XdsSecurityClientServerTest 2025-03-26 11:43:05 +05:30
Ashley Zhang 1958e42370
xds: add support for custom per-target credentials on the transport (#11951) 2025-03-21 15:19:40 -07:00
Eric Anderson d2d72cda83
xds: Expose filter names to filter instances (#11971)
This is to support gRFC A83 xDS GCP Authentication Filter:
> Otherwise, the filter will look in the CDS resource's metadata for a
> key corresponding to the filter's instance name.
2025-03-21 11:01:16 +05:30
Eric Anderson bb120a8cbb xds: Assert XdsNR's cluster ref counting is consistent
It is much harder to debug refcounting problems when we ignore
impossible situations. So make such impossible cases complain loudly so
the bug is obvious.
2025-03-19 13:47:02 -07:00
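The fail-loudly approach described in bb120a8cbb above can be sketched as follows. This is a minimal illustration, not the XdsNameResolver code: the class `ClusterRefCounts` and its methods are hypothetical, and it assumes single-threaded access.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical ref-count table that complains about "impossible" states. */
final class ClusterRefCounts {
  private final Map<String, Integer> refs = new HashMap<>();

  /** Adds one reference to the named cluster. */
  void retain(String cluster) {
    refs.merge(cluster, 1, Integer::sum);
  }

  /** Drops one reference; returns true when the last reference was released. */
  boolean release(String cluster) {
    Integer count = refs.get(cluster);
    if (count == null) {
      // Silently ignoring this would hide the refcounting bug; fail loudly instead.
      throw new AssertionError("release of untracked cluster: " + cluster);
    }
    if (count == 1) {
      refs.remove(cluster);
      return true;
    }
    refs.put(cluster, count - 1);
    return false;
  }

  public static void main(String[] args) {
    ClusterRefCounts counts = new ClusterRefCounts();
    counts.retain("cluster-a");
    System.out.println(counts.release("cluster-a"));
    try {
      counts.release("cluster-a"); // one release too many
    } catch (AssertionError e) {
      System.out.println(e.getMessage());
    }
  }
}
```

The unbalanced release surfaces at the call site that caused it rather than as a confusing failure later.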
Eric Anderson bc3c764058 xds: Include XdsConfig as a CallOption
This allows Filters to access the xds configuration for their own
processing. From gRFC A83:

> This data is available via the XdsConfig attribute introduced in A74.
> If the xDS ConfigSelector is not already passing that attribute to the
> filters, it will need to be changed to do so.
2025-03-19 09:04:27 -07:00
Eric Anderson e80c197455
xds: Use XdsDependencyManager for XdsNameResolver
Contributes to the gRFC A74 effort.
https://github.com/grpc/proposal/blob/master/A74-xds-config-tears.md

The alternative to using Mockito's ArgumentMatcher is to use Hamcrest.
However, Hamcrest did not impress me. ArgumentMatcher is trivial if you
don't care about the error message.

This fixes a pre-existing issue where ConfigSelector.releaseCluster
could revert the LB config back to using cluster manager after
all RPCs using a cluster have committed.

Co-authored-by: Larry Safran <lsafran@google.com>
2025-03-18 14:05:01 -07:00
Eric Anderson f3f054a0a4 xds: Log cluster_manager config update before applying config
The logs are confusing/harder to read when the activations/deactivations
caused by the config happen before the log entry describing the new
config.
2025-03-07 14:37:37 -08:00
Eric Anderson d82613a74c
xds: Fix cluster selection races when updating config selector
Listener2.onResult() doesn't require running in the sync context, so
when called from the sync context it is guaranteed not to do its
processing immediately (instead, it schedules work into the sync
context).

The code was doing an update dance: 1) update service config to add new
cluster, 2) update config selector to use new cluster, 3) update service
config to remove old clusters. But the onResult() wasn't being processed
immediately, so the actual execution order was 2, 1, 3 which has a small
window where RPCs will fail. But onResult2() does run immediately. And
since ca4819ac6, updateBalancingState() updates the picker immediately.

cleanUpRoutes() was also racy because it updated the routingConfig
before swapping to the new config selector, so RPCs could fail saying
there was no route instead of the useful error message. Even with the
opposite order, some RPCs may be executing the while loop of
selectConfig(), trying to acquire a cluster. The code unreffed the
clusters before updating the routingConfig, so those RPCs could go into
a tight loop until the routingConfig was updated. Also, once the
routingConfig was updated to EMPTY those RPCs would similarly
see the wrong error message. To give the correct error message,
selectConfig() must fail such RPCs directly, and once it can do that
there's no need to stop using the config selector in error cases. This
has the benefit of fewer moving parts and more consistent threading
among cases.

The added test was able to detect the race 2% of the time. The slower
the code/machine, the more reliably the test failed. ca4819ac6 along
with this commit reduced it to 0 failures in 1000 runs.

Discovered when investigating b/394850611
2025-03-07 10:33:35 -08:00
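The 2, 1, 3 execution order described in d82613a74c above can be reproduced with a small model. This is a sketch under stated assumptions: `SerialContext` is a hypothetical single-threaded stand-in for a serializing executor, not `io.grpc.SynchronizationContext`, and the three steps are reduced to appending their numbers to a list.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class UpdateOrderSketch {
  /** Hypothetical stand-in: runs tasks one at a time, queueing re-entrant submissions. */
  static final class SerialContext {
    private final Queue<Runnable> queue = new ArrayDeque<>();
    private boolean draining;

    void execute(Runnable task) {
      queue.add(task);
      if (draining) {
        return; // already inside the context: the task runs after the current one
      }
      draining = true;
      try {
        Runnable next;
        while ((next = queue.poll()) != null) {
          next.run();
        }
      } finally {
        draining = false;
      }
    }
  }

  /** Runs the three-step update from inside the context and records the actual order. */
  static List<Integer> run(boolean deferServiceConfigUpdates) {
    SerialContext ctx = new SerialContext();
    List<Integer> order = new ArrayList<>();
    ctx.execute(() -> {
      apply(ctx, deferServiceConfigUpdates, () -> order.add(1)); // add new cluster
      order.add(2);                                              // swap config selector
      apply(ctx, deferServiceConfigUpdates, () -> order.add(3)); // remove old clusters
    });
    return order;
  }

  private static void apply(SerialContext ctx, boolean defer, Runnable update) {
    if (defer) {
      ctx.execute(update); // models a call that schedules its work into the context
    } else {
      update.run();        // models a call that does its processing immediately
    }
  }

  public static void main(String[] args) {
    System.out.println(run(true));  // prints [2, 1, 3]
    System.out.println(run(false)); // prints [1, 2, 3]
  }
}
```

With deferred updates, step 2 takes effect while the new cluster from step 1 is not yet in the service config, which is the window where RPCs can fail.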
Sergii Tkachenko a6a041e415
xds: Support filter state retention
This PR adds support for filter state retention in Java. The mechanism
will be similar to the one described in [A83]
(https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md#filter-call-credentials-cache)
for C-core, and will serve the same purpose. However, the
implementation details are very different due to the different nature
of xDS HTTP filter support in C-core and Java.

### Filter instance lifecycle
#### xDS gRPC clients
New filter instances are created per combination of:
1. `XdsNameResolver` instance,
2. Filter name+typeUrl as configured in 
   HttpConnectionManager (HCM) http_filters.

Existing client-side filter instances are shut down:
- A single filter instance is shut down when an LDS update contains
  HCM that is missing filter configuration for name+typeUrl
  combination of this instance.
- All filter instances when watched LDS resource is missing from an
  LDS update.
- All filter instances on name resolver shutdown.

#### xDS-enabled gRPC servers
New filter instances are created per combination of:
1. Server instance,
2. FilterChain name,
3. Filter name+typeUrl as configured in FilterChain's HCM.http_filters

Filter instances of Default Filter Chain are tracked separately per:
1. Server instance,
2. Filter name+typeUrl in default_filter_chain's HCM.http_filters.

Existing server-side filter instances are shut down:
- A single filter instance is shut down when an LDS update contains
  FilterChain with HCM.http_filters that is missing configuration for
  filter name+typeUrl.
- All filter instances associated with the FilterChain when an LDS
  update no longer contains FilterChain's name.
- All filter instances when watched LDS resource is missing from an
  LDS update.
- All filter instances on server shutdown.

### Related
- Part 1: #11883
2025-03-06 10:32:08 -08:00
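The client-side lifecycle rules listed in a6a041e415 above can be sketched as a map of retained instances reconciled against each update. This is an illustration only: `FakeFilter`, `FilterRetentionSketch`, and the `"name|typeUrl"` string key are hypothetical and do not reflect the actual `io.grpc.xds` classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FilterRetentionSketch {
  /** Hypothetical stateful filter; not io.grpc.xds.Filter. */
  static final class FakeFilter implements AutoCloseable {
    final String key;
    boolean closed;

    FakeFilter(String key) {
      this.key = key;
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  // Retained instances, keyed by an assumed "name|typeUrl" string.
  private final Map<String, FakeFilter> active = new HashMap<>();

  /** Reconciles retained filters with the keys in a new update; returns those shut down. */
  List<FakeFilter> onUpdate(Set<String> keys) {
    List<FakeFilter> removed = new ArrayList<>();
    active.entrySet().removeIf(entry -> {
      if (keys.contains(entry.getKey())) {
        return false; // still configured: keep the instance and its state
      }
      entry.getValue().close();
      removed.add(entry.getValue());
      return true;
    });
    for (String key : keys) {
      active.computeIfAbsent(key, FakeFilter::new); // create only for new keys
    }
    return removed;
  }

  /** Models "resource missing" or resolver shutdown: every instance is shut down. */
  List<FakeFilter> shutdownAll() {
    return onUpdate(Set.of());
  }

  FakeFilter get(String key) {
    return active.get(key);
  }

  public static void main(String[] args) {
    FilterRetentionSketch sketch = new FilterRetentionSketch();
    sketch.onUpdate(Set.of("authn|typeA", "fault|typeB"));
    FakeFilter retained = sketch.get("authn|typeA");
    List<FakeFilter> removed = sketch.onUpdate(Set.of("authn|typeA"));
    System.out.println(retained == sketch.get("authn|typeA")); // same instance survives
    System.out.println(removed.get(0).key + " closed=" + removed.get(0).closed);
  }
}
```

The server-side rules would add the FilterChain name (and the server instance) to the key, with the same reconcile step per chain.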
MV Shiva 602aece081
xds: avoid unnecessary dns lookup (#11932) 2025-03-06 16:04:53 +05:30
MV Shiva 12197065fe
xds: xDS-based HTTP CONNECT configuration (#11861) 2025-03-06 13:40:18 +05:30
Sergii Tkachenko 1a2285b527
xds: ensure server interceptors are created in a sync context (#11930)
`XdsServerWrapper#generatePerRouteInterceptors` was always intended
to be executed within a sync context. This PR ensures that by calling
`syncContext.throwIfNotInThisSynchronizationContext()`.

This change is needed for upcoming xDS filter state retention because
the new tests in XdsServerWrapperTest flake with this NPE:

> `Cannot invoke "io.grpc.xds.client.XdsClient$ResourceWatcher.onChanged(io.grpc.xds.client.XdsClient$ResourceUpdate)" because "this.ldsWatcher" is null`
2025-03-03 14:28:36 -08:00
Eric Anderson 57124d6b29 Use acceptResolvedAddresses() in easy cases
We want to move away from handleResolvedAddresses(). These are "easy" in
that they need no logic. LBs extending ForwardingLoadBalancer had the
method duplicated from handleResolvedAddresses() and swapped away from
`super` because ForwardingLoadBalancer only forwards
handleResolvedAddresses() reliably today. Duplicating small methods was
less bug-prone than dealing with ForwardingLoadBalancer.
2025-02-20 21:25:55 -08:00
Eric Anderson 110c1ff0d6 xds: Use acceptResolvedAddresses() for PriorityLb children
PriorityLb should propagate config problems up to the name resolver so
it can refresh.
2025-02-20 16:35:54 -08:00
Daniel Liu 892144dcac
xds: explicitly set request hash key for the ring hash LB policy
Implements [gRFC A76: explicitly setting the request hash key for the
ring hash LB policy][A76]
* Explicitly setting the request hash key is guarded by the
  `GRPC_EXPERIMENTAL_RING_HASH_SET_REQUEST_HASH_KEY` environment
  variable until the API is stabilized.

Tested:
* Verified end-to-end by spinning up multiple gRPC servers and a gRPC
  client that injects a custom service (load balancing) config with
  `ring_hash_experimental` and a custom `request_hash_header` (with
  NO associated value in the metadata headers) which generates a random
  hash for each request to the ring hash LB. Verified picks/RPCs are
  split evenly/uniformly across all backends.
* Ran affected unit tests with thread sanitizer and 1000 iterations to
  prevent data races.

[A76]: https://github.com/grpc/proposal/blob/master/A76-ring-hash-improvements.md#explicitly-setting-the-request-hash-key
2025-02-19 20:25:33 -08:00
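The behavior exercised in the end-to-end test of 892144dcac above (a configured `request_hash_header` with no value in the metadata yields a random hash per request) can be sketched as below. Everything here is an assumption for illustration: the class and method names are hypothetical, `String.hashCode()` is only a placeholder for the real hash function, and parsing the environment variable with `Boolean.parseBoolean` is a guess at how the flag might be read.

```java
import java.util.Map;
import java.util.Random;

final class RequestHashSketch {
  static final String FLAG = "GRPC_EXPERIMENTAL_RING_HASH_SET_REQUEST_HASH_KEY";

  /** Assumed flag parsing: enabled only when the variable is set to "true". */
  static boolean featureEnabled(Map<String, String> env) {
    return Boolean.parseBoolean(env.get(FLAG));
  }

  /** Hashes the configured header's value, or picks a random hash when it is absent. */
  static long requestHash(Map<String, String> headers, String requestHashHeader, Random random) {
    String value = headers.get(requestHashHeader);
    if (value == null) {
      return random.nextLong(); // no value: spread requests across the ring
    }
    return value.hashCode(); // placeholder hash, not the one used by ring hash
  }

  public static void main(String[] args) {
    Random random = new Random();
    Map<String, String> withKey = Map.of("x-session", "abc");
    // Same header value gives the same hash, so requests stick to one backend.
    System.out.println(requestHash(withKey, "x-session", random)
        == requestHash(withKey, "x-session", random));
    System.out.println(featureEnabled(System.getenv()));
  }
}
```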
Sergii Tkachenko 2b87b01651
xds: Change how xDS filters are created by introducing Filter.Provider (#11883)
This is the first step towards supporting filter state retention in
Java. The mechanism will be similar to the one described in [A83]
(https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md#filter-call-credentials-cache)
for C-core, and will serve the same purpose. However, the
implementation details are very different due to the different nature
of xDS HTTP filter support in C-core and Java.

In Java, xDS HTTP filters are backed by classes implementing
`io.grpc.xds.Filter`, from here just called "Filters". To support
Filter state retention (next PR), Java's xDS implementation must be
able to create unique Filter instances:
- Per HCM
  `envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager`
- Per filter name as specified in
  `envoy.extensions.filters.network.http_connection_manager.v3.HttpFilter.name`

This PR **does not** implement Filter state retention, but lays the
groundwork for it by changing how filters are registered and
instantiated. To achieve this, all existing Filter classes had to be
updated to the new instantiation mechanism described below.

Prior to this PR, Filters had no lifecycle. FilterRegistry
provided singleton instances for a given typeUrl. This PR introduces
a new interface `Filter.Provider`, which instantiates Filter classes.
All functionality that doesn't need an instance of a Filter is moved
to the Filter.Provider. This includes parsing filter config proto
into FilterConfig and determining the filter kind
(client-side, server-side, or both).

This PR is limited to refactoring, and there are no changes to the
existing behavior. Note that all Filter Providers still return
singleton Filter instances. However, with this PR, it is now possible
to create Providers that return a new Filter instance each time
`newInstance` is called.
2025-02-18 10:47:01 -08:00
Eric Anderson 713607056e util: Use acceptResolvedAddresses() for MultiChildLb children
A failing Status from acceptResolvedAddresses means something is wrong
with the config, but parts of the config may still have been applied.
Thus there are now two possible flows: errors that should prevent
updateOverallBalancingState() and errors that should have no effect
other than the return code. To manage that, MultiChildLb must always be
responsible for calling updateOverallBalancingState().
acceptResolvedAddressesInternal() was inlined to make that error
processing easier. No existing usages actually needed to have logic
between updating the children and regenerating the picker.

RingHashLb already was verifying that the address list was not empty, so
the short-circuiting when acceptResolvedAddressesInternal() returned an
error was impossible to trigger. WrrLb's updateWeightTask() calls the
last picker, so it can run before acceptResolvedAddressesInternal(); the
only part that matters is re-creating the weightUpdateTimer.
2025-02-18 07:33:49 -08:00
Larry Safran 41dd0c6d73
xds: Cleanup to reduce test flakiness (#11895)
* Don't process resourceDoesNotExist for watchers that have been cancelled.

* Change test to use an ArgumentMatcher instead of expecting that only the final result will be sent, since, depending on timing, there may be configs sent for clusters being removed with their entries as errors.
2025-02-14 10:23:54 -08:00
Larry Safran 764a4e3f08
xds: Cleanup by moving methods in XdsDependencyManager ahead of classes (#11890)
* Move private methods ahead of classes
2025-02-11 14:34:46 -08:00
Larry Safran ade2dd2038
xds: Change XdsClusterConfig to have children field instead of endpoint (#11888)
* Change XdsConfig to match spec with a `children` object holding either `a list of leaf cluster names` or `an EdsUpdate`.  Removed intermediate aggregate nodes from `XdsConfig.clusters`.
2025-02-11 12:38:52 -08:00
Sergii Tkachenko bd6af59221
xds: improve code readability of server FilterChain parsing
- Improve code flow and variable names
- Reduce nesting
- Add comments between logical blocks
- Add comments explaining some xDS/gRPC nuances
2025-02-10 17:14:07 -08:00
Larry Safran 67fc2e156a
Add new classes for eliminating xds config tears (#11740)
* Framework definition to support A74
2025-02-07 16:33:17 -08:00
Eric Anderson 199a7ea3e8
xds: Improve XdsNR's selectConfig() variable handling
The variables from the do-while are no longer initialized to let the
compiler verify that the loop sets each. Unnecessary comparisons to null
are also removed, which makes it more obvious that the variables are never
set to null. Added a minor optimization of computing the RPC's path once
instead of once for each route. The variable declarations were also sorted
to match their initialization order.

This does fix an unlikely bug where, if the old code successfully
matched a route but failed to retain the cluster, then on the second
attempt, if the route was _not_ matched, it would re-use the prior route
and thus infinite-loop failing to retain that same cluster.

It also adds a missing cast to unsigned long for a uint32 weight. The old
code would detect if the _sum_ was negative, but a weight using 32 bits
would have been negative and never selected.
2025-02-05 10:37:22 -08:00
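The uint32 weight issue from the last paragraph of 199a7ea3e8 above can be illustrated with a minimal sketch. It assumes the weight is a protobuf uint32 held in a Java `int`; the class and method names are hypothetical and this is not the XdsNameResolver code.

```java
/** Hypothetical sketch: weighted selection over uint32 weights stored in Java ints. */
final class UnsignedWeightSketch {
  /** Sums weights as unsigned values; a plain widening cast would sign-extend. */
  static long totalWeight(int[] weights) {
    long sum = 0;
    for (int weight : weights) {
      sum += Integer.toUnsignedLong(weight);
    }
    return sum;
  }

  /** Returns the index whose cumulative unsigned weight range contains {@code point}. */
  static int select(int[] weights, long point) {
    long accumulated = 0;
    for (int i = 0; i < weights.length; i++) {
      accumulated += Integer.toUnsignedLong(weights[i]);
      if (point < accumulated) {
        return i;
      }
    }
    throw new IllegalArgumentException("point is outside the total weight");
  }

  public static void main(String[] args) {
    int big = (int) 3_000_000_000L; // needs all 32 bits, so it is negative as a signed int
    int[] weights = {1, big};
    System.out.println(big < 0);                         // prints true
    System.out.println(totalWeight(weights));            // prints 3000000001
    System.out.println(select(weights, 2_000_000_000L)); // prints 1
  }
}
```

Without `Integer.toUnsignedLong`, the large weight would contribute a negative amount to the running total and its entry could never be selected, even when the sum itself stays positive.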
Eric Anderson 04f1cc5845 xds: Make XdsNR.RoutingConfig.empty a constant
The field was made final in 4b52639aa but was soon reverted in 3ebb3e192
because of what I assume was a bad merge conflict resolution. The field
has contained an immutable object since its introduction in d25f5acf1,
so it is pretty likely to remain a constant in the future.
2025-01-30 15:10:12 -08:00
Eric Anderson c506190b0f
xds: Reuse filter interceptors across RPCs
This moves the interceptor creation from the ConfigSelector to the
resource update handling.

The code structure changes will make adding support for filter
lifecycles (for RLQS) a bit easier. The filter lifecycles will allow
filters to share state across interceptors, and constructing all the
interceptors on a single thread will mean filters wouldn't need to be
thread-safe (but their interceptors would be thread-safe).
2025-01-30 12:43:51 -08:00
Eric Anderson b3db8c2489 xds: Allow FaultFilter's interceptor to be reused
This is the only usage of PickSubchannelArgs when creating a filter's
ClientInterceptor, and a follow-up commit will remove the argument and
actually reuse the interceptors. Other filters' interceptors can
already be reused.

There doesn't seem to be any significant loss of legibility by making
FaultFilter a more ordinary interceptor, but the change does cause the
ForwardingClientCall to be present when faultDelay is configured,
independent of whether the fault delay ends up being triggered.

Reusing interceptors will move more state management out of the RPC path
which will be more relevant with RLQS.
2025-01-29 14:21:53 -08:00
Kannan J 0f5503ebb1
xds: Include max concurrent request limit in the error status for concurre… (#11845)
Include max concurrent request limit in the error status for concurrent connections limit exceeded
2025-01-23 21:40:21 +05:30
Eric Anderson 495a8906b2 xds: Fix fallback test FakeClock TSAN failure
d65d3942e increased the test speed of
connect_then_mainServerDown_fallbackServerUp by using FakeClock.
However, it introduced a data race because FakeClock is not thread-safe.
This change injects a single thread for gRPC callbacks such that
syncContext is run on a thread under the test's control.

A simpler approach would be to expose syncContext from XdsClientImpl for
testing. However, this test is in a different package and I wanted to
avoid adding a public method.

```
  Read of size 8 at 0x00008dec9d50 by thread T25:
    #0 io.grpc.internal.FakeClock$ScheduledExecutorImpl.schedule(Lio/grpc/internal/FakeClock$ScheduledTask;JLjava/util/concurrent/TimeUnit;)V FakeClock.java:140
    #1 io.grpc.internal.FakeClock$ScheduledExecutorImpl.schedule(Ljava/lang/Runnable;JLjava/util/concurrent/TimeUnit;)Ljava/util/concurrent/ScheduledFuture; FakeClock.java:150
    #2 io.grpc.SynchronizationContext.schedule(Ljava/lang/Runnable;JLjava/util/concurrent/TimeUnit;Ljava/util/concurrent/ScheduledExecutorService;)Lio/grpc/SynchronizationContext$ScheduledHandle; SynchronizationContext.java:153
    #3 io.grpc.xds.client.ControlPlaneClient$AdsStream.handleRpcStreamClosed(Lio/grpc/Status;)V ControlPlaneClient.java:491
    #4 io.grpc.xds.client.ControlPlaneClient$AdsStream.lambda$onStatusReceived$0(Lio/grpc/Status;)V ControlPlaneClient.java:429
    #5 io.grpc.xds.client.ControlPlaneClient$AdsStream$$Lambda+0x00000001004a95d0.run()V ??
    #6 io.grpc.SynchronizationContext.drain()V SynchronizationContext.java:96
    #7 io.grpc.SynchronizationContext.execute(Ljava/lang/Runnable;)V SynchronizationContext.java:128
    #8 io.grpc.xds.client.ControlPlaneClient$AdsStream.onStatusReceived(Lio/grpc/Status;)V ControlPlaneClient.java:428
    #9 io.grpc.xds.GrpcXdsTransportFactory$EventHandlerToCallListenerAdapter.onClose(Lio/grpc/Status;Lio/grpc/Metadata;)V GrpcXdsTransportFactory.java:149
    #10 io.grpc.PartialForwardingClientCallListener.onClose(Lio/grpc/Status;Lio/grpc/Metadata;)V PartialForwardingClientCallListener.java:39
    ...

  Previous write of size 8 at 0x00008dec9d50 by thread T4 (mutexes: write M0, write M1, write M2, write M3):
    #0 io.grpc.internal.FakeClock.forwardTime(JLjava/util/concurrent/TimeUnit;)I FakeClock.java:368
    #1 io.grpc.xds.XdsClientFallbackTest.connect_then_mainServerDown_fallbackServerUp()V XdsClientFallbackTest.java:358
    ...
```
2025-01-22 16:00:00 -08:00
Eric Anderson fc86084df5 xds: Rename grpc.xds.cluster to grpc.lb.backend_service
The name is being changed to allow the value to be used in more metrics
where xds-specifics are awkward.
2025-01-17 17:16:32 -08:00
MV Shiva b44ebce45d
xds: Envoy proto sync to 2024-11-11 (#11816) 2025-01-17 14:58:52 +05:30
Eric Anderson 7162d2d661 xds: Pass grpc.xds.cluster label to tracer
This is in service to gRFC A89. Since the gRFC isn't finalized this
purposefully doesn't really do anything yet. The grpc-opentelemetry
change to use this optional label will be done after the gRFC is merged.
grpc-opentelemetry currently has a hard-coded list (one entry) of labels
that it looks for, and this label will need to be added.

b/356167676
2025-01-13 11:54:36 -08:00
Larry Safran 176f3eed12
xds: Enable Xds Client Fallback by default (#11817) 2025-01-10 15:25:15 -08:00
Eric Anderson d65d3942e6 xds: Increase speed of fallback test
These changes reduce connect_then_mainServerDown_fallbackServerUp test
time from 20 seconds to 5 seconds by faking time for the does-not-exist
timer.

XdsClientImpl only uses the TimeProvider for CSDS cache details, so any
implementation should be fine. FakeXdsClient provides an implementation,
so might as well use it as it is one less clock to think about.
2025-01-10 08:28:48 -08:00
Eric Anderson 70825adce6 Replace jsr305's GuardedBy with Error Prone's
We should avoid jsr305, and Error Prone's version has the same semantics.
2025-01-10 08:16:48 -08:00
Eric Anderson 7b5d0692cc
Replace jsr305's CheckReturnValue with Error Prone's (#11811)
We should avoid jsr305, and Error Prone's version has the same semantics.

Fixes #8687
2025-01-09 13:45:35 -08:00