1097 Commits

Author SHA1 Message Date
Eric Anderson f3f054a0a4 xds: Log cluster_manager config update before applying config
The logs are confusing and harder to read when the
activations/deactivations caused by the config happen before the log
entry describing the new config.
2025-03-07 14:37:37 -08:00
Eric Anderson d82613a74c
xds: Fix cluster selection races when updating config selector
Listener2.onResult() doesn't require running in the sync context, so
when called from the sync context it is guaranteed not to do its
processing immediately (instead, it schedules work into the sync
context).

The code was doing an update dance: 1) update service config to add new
cluster, 2) update config selector to use new cluster, 3) update service
config to remove old clusters. But the onResult() wasn't being processed
immediately, so the actual execution order was 2, 1, 3 which has a small
window where RPCs will fail. But onResult2() does run immediately. And
since ca4819ac6, updateBalancingState() updates the picker immediately.
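
The ordering hazard described above can be illustrated with a toy model. This is only a sketch: the class and method names are hypothetical and do not correspond to the XdsNameResolver code. It assumes an RPC succeeds only when the cluster chosen by the config selector is present in the service config.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the three-step update; every name here is hypothetical. */
final class UpdateOrderSketch {
  // Clusters known to the simulated service config.
  private final Set<String> serviceConfigClusters = new HashSet<>();
  // Cluster that the simulated config selector routes new RPCs to.
  private String selectedCluster;

  UpdateOrderSketch(String initialCluster) {
    serviceConfigClusters.add(initialCluster);
    selectedCluster = initialCluster;
  }

  void addCluster(String cluster) { serviceConfigClusters.add(cluster); }       // step 1
  void useCluster(String cluster) { selectedCluster = cluster; }                // step 2
  void removeCluster(String cluster) { serviceConfigClusters.remove(cluster); } // step 3

  /** An RPC succeeds only if the selected cluster is present in the service config. */
  boolean rpcWouldSucceed() {
    return serviceConfigClusters.contains(selectedCluster);
  }

  /** Applies the steps in the given order; true if no intermediate state would fail RPCs. */
  static boolean safeThroughout(int... order) {
    UpdateOrderSketch s = new UpdateOrderSketch("old");
    boolean ok = s.rpcWouldSucceed();
    for (int step : order) {
      switch (step) {
        case 1: s.addCluster("new"); break;
        case 2: s.useCluster("new"); break;
        case 3: s.removeCluster("old"); break;
        default: throw new IllegalArgumentException("unknown step " + step);
      }
      ok &= s.rpcWouldSucceed();
    }
    return ok;
  }

  public static void main(String[] args) {
    System.out.println("1,2,3 safe: " + safeThroughout(1, 2, 3));
    System.out.println("2,1,3 safe: " + safeThroughout(2, 1, 3));
  }
}
```

In this model the order 1, 2, 3 never leaves the selector pointing at an unknown cluster, while 2, 1, 3 does so between its first and second steps.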

cleanUpRoutes() was also racy because it updated the routingConfig
before swapping to the new config selector, so RPCs could fail saying
there was no route instead of the useful error message. Even with the
opposite order, some RPCs may be executing the while loop of
selectConfig(), trying to acquire a cluster. The code unreffed the
clusters before updating the routingConfig, so those RPCs could go into
a tight loop until the routingConfig was updated. Also, once the
routingConfig was updated to EMPTY those RPCs would similarly
see the wrong error message. To give the correct error message,
selectConfig() must fail such RPCs directly, and once it can do that
there's no need to stop using the config selector in error cases. This
has the benefit of fewer moving parts and more consistent threading
among cases.
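
As a rough illustration of failing an RPC directly instead of retrying, here is a minimal reference-counting sketch. The class, the `select` helper, and the error text are assumptions made for this example, not the actual selectConfig() implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical reference-counted cluster entry; not the actual XdsNameResolver type. */
final class RefCountedClusterSketch {
  // Starts at 1: the reference held by the current routing config.
  private final AtomicInteger refCount = new AtomicInteger(1);

  /** Tries to take a reference; fails once the count has already dropped to zero. */
  boolean retain() {
    while (true) {
      int current = refCount.get();
      if (current == 0) {
        return false; // already released; callers must not spin on this entry
      }
      if (refCount.compareAndSet(current, current + 1)) {
        return true;
      }
    }
  }

  /** Drops a reference; returns true if this call released the last one. */
  boolean release() {
    return refCount.decrementAndGet() == 0;
  }

  /** Returns the cluster name on success, or an error string instead of retrying forever. */
  static String select(RefCountedClusterSketch cluster, String name) {
    if (cluster != null && cluster.retain()) {
      return name;
    }
    // Illustrative error text only.
    return "UNAVAILABLE: cluster " + name + " is no longer available";
  }
}
```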

The added test was able to detect the race 2% of the time. The slower
the code/machine, the more reliably the test failed. ca4819ac6 along
with this commit reduced it to 0 failures in 1000 runs.

Discovered when investigating b/394850611
2025-03-07 10:33:35 -08:00
Sergii Tkachenko a6a041e415
xds: Support filter state retention
This PR adds support for filter state retention in Java. The mechanism
will be similar to the one described in [A83]
(https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md#filter-call-credentials-cache)
for C-core, and will serve the same purpose. However, the
implementation details are very different due to the different nature
of xDS HTTP filter support in C-core and Java.

### Filter instance lifecycle
#### xDS gRPC clients
New filter instances are created per combination of:
1. `XdsNameResolver` instance,
2. Filter name+typeUrl as configured in 
   HttpConnectionManager (HCM) http_filters.

Existing client-side filter instances are shut down:
- A single filter instance is shut down when an LDS update contains an
  HCM that is missing the filter configuration for the name+typeUrl
  combination of this instance.
- All filter instances, when the watched LDS resource is missing from an
  LDS update.
- All filter instances, on name resolver shutdown.
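
The client-side rules above can be sketched as a small cache keyed by name+typeUrl. This is a simplified illustration under stated assumptions: `FakeFilter`, `onLdsUpdate`, and `shutdownAll` are hypothetical names, and the real implementation works with `io.grpc.xds.Filter` instances and parsed LDS resources.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical per-resolver cache of filter instances keyed by name+typeUrl. */
final class FilterInstanceCacheSketch {
  /** Stand-in for a stateful filter; not io.grpc.xds.Filter. */
  static final class FakeFilter {
    final String key;
    boolean closed;
    FakeFilter(String key) { this.key = key; }
    void close() { closed = true; }
  }

  private final Map<String, FakeFilter> active = new HashMap<>();

  static String key(String name, String typeUrl) {
    return name + "|" + typeUrl;
  }

  /** Keeps instances that are still configured, creates missing ones, closes the rest. */
  List<FakeFilter> onLdsUpdate(Set<String> configuredKeys) {
    List<FakeFilter> closed = new ArrayList<>();
    Iterator<Map.Entry<String, FakeFilter>> it = active.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, FakeFilter> entry = it.next();
      if (!configuredKeys.contains(entry.getKey())) {
        entry.getValue().close();
        closed.add(entry.getValue());
        it.remove();
      }
    }
    for (String k : configuredKeys) {
      active.computeIfAbsent(k, FakeFilter::new);
    }
    return closed;
  }

  /** LDS resource missing, or name resolver shutdown: close every instance. */
  List<FakeFilter> shutdownAll() {
    return onLdsUpdate(Collections.<String>emptySet());
  }

  FakeFilter get(String k) {
    return active.get(k);
  }
}
```

An instance whose key is still present after an update is retained, which is what allows it to carry state across updates.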

#### xDS-enabled gRPC servers
New filter instances are created per combination of:
1. Server instance,
2. FilterChain name,
3. Filter name+typeUrl as configured in FilterChain's HCM.http_filters

Filter instances of the Default Filter Chain are tracked separately per:
1. Server instance,
2. Filter name+typeUrl in default_filter_chain's HCM.http_filters.

Existing server-side filter instances are shut down:
- A single filter instance is shut down when an LDS update contains a
  FilterChain with HCM.http_filters that is missing configuration for
  the filter name+typeUrl.
- All filter instances associated with the FilterChain, when an LDS
  update no longer contains the FilterChain's name.
- All filter instances, when the watched LDS resource is missing from an
  LDS update.
- All filter instances, on server shutdown.

### Related
- Part 1: #11883
2025-03-06 10:32:08 -08:00
MV Shiva 602aece081
xds: avoid unnecessary dns lookup (#11932) 2025-03-06 16:04:53 +05:30
MV Shiva 12197065fe
xds: xDS-based HTTP CONNECT configuration (#11861) 2025-03-06 13:40:18 +05:30
Sergii Tkachenko 1a2285b527
xds: ensure server interceptors are created in a sync context (#11930)
`XdsServerWrapper#generatePerRouteInterceptors` was always intended
to be executed within a sync context. This PR ensures that by calling
`syncContext.throwIfNotInThisSynchronizationContext()`.

This change is needed for upcoming xDS filter state retention because
the new tests in XdsServerWrapperTest flake with this NPE:

> `Cannot invoke "io.grpc.xds.client.XdsClient$ResourceWatcher.onChanged(io.grpc.xds.client.XdsClient$ResourceUpdate)" because "this.ldsWatcher" is null`
2025-03-03 14:28:36 -08:00
Eric Anderson 57124d6b29 Use acceptResolvedAddresses() in easy cases
We want to move away from handleResolvedAddresses(). These are "easy" in
that they need no logic. LBs extending ForwardingLoadBalancer had the
method duplicated from handleResolvedAddresses() and swapped away from
`super` because ForwardingLoadBalancer only forwards
handleResolvedAddresses() reliably today. Duplicating small methods was
less bug-prone than dealing with ForwardingLoadBalancer.
2025-02-20 21:25:55 -08:00
Eric Anderson 110c1ff0d6 xds: Use acceptResolvedAddresses() for PriorityLb children
PriorityLb should propagate config problems up to the name resolver so
it can refresh.
2025-02-20 16:35:54 -08:00
Daniel Liu 892144dcac
xds: explicitly set request hash key for the ring hash LB policy
Implements [gRFC A76: explicitly setting the request hash key for the
ring hash LB policy][A76]
* Explicitly setting the request hash key is guarded by the
  `GRPC_EXPERIMENTAL_RING_HASH_SET_REQUEST_HASH_KEY` environment
  variable until the API is stabilized.

Tested:
* Verified end-to-end by spinning up multiple gRPC servers and a gRPC
  client that injects a custom service (load balancing) config with
  `ring_hash_experimental` and a custom `request_hash_header` (with
  NO associated value in the metadata headers) which generates a random
  hash for each request to the ring hash LB. Verified picks/RPCs are
  split evenly/uniformly across all backends.
* Ran affected unit tests with thread sanitizer and 1000 iterations to
  prevent data races.

[A76]: https://github.com/grpc/proposal/blob/master/A76-ring-hash-improvements.md#explicitly-setting-the-request-hash-key
2025-02-19 20:25:33 -08:00
Sergii Tkachenko 2b87b01651
xds: Change how xDS filters are created by introducing Filter.Provider (#11883)
This is the first step towards supporting filter state retention in
Java. The mechanism will be similar to the one described in [A83]
(https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md#filter-call-credentials-cache)
for C-core, and will serve the same purpose. However, the
implementation details are very different due to the different nature
of xDS HTTP filter support in C-core and Java.

In Java, xDS HTTP filters are backed by classes implementing
`io.grpc.xds.Filter`, from here just called "Filters". To support
Filter state retention (next PR), Java's xDS implementation must be
able to create unique Filter instances per:
- HCM
  `envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager`
- Filter name as specified in
  `envoy.extensions.filters.network.http_connection_manager.v3.HttpFilter.name`

This PR **does not** implement Filter state retention, but lays the
groundwork for it by changing how filters are registered and
instantiated. To achieve this, all existing Filter classes had to be
updated to the new instantiation mechanism described below.

Prior to this PR, Filters had no lifecycle. FilterRegistry
provided singleton instances for a given typeUrl. This PR introduces
a new interface `Filter.Provider`, which instantiates Filter classes.
All functionality that doesn't need an instance of a Filter is moved
to the Filter.Provider. This includes parsing filter config proto
into FilterConfig and determining the filter kind
(client-side, server-side, or both).

This PR is limited to refactoring, and there are no changes to the
existing behavior. Note that all Filter Providers still return
singleton Filter instances. However, with this PR, it is now possible
to create Providers that return a new Filter instance each time
`newInstance` is called.
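
The split between a provider and its filter instances might look roughly like the sketch below. The interfaces are simplified stand-ins written for this example; only `newInstance` is named in the text above, and the remaining method names and type URLs are assumptions rather than the real `io.grpc.xds.Filter` API.

```java
/** Simplified stand-ins for Filter and Filter.Provider; names are illustrative. */
final class FilterProviderSketch {
  interface Filter {
    String describe();
  }

  /** Work that needs no filter instance lives on the provider. */
  interface Provider {
    String typeUrl();
    boolean isClientFilter();
    boolean isServerFilter();
    Filter newInstance();
  }

  /** Mirrors the behavior after this refactoring: one shared instance. */
  static final class SingletonProvider implements Provider {
    private static final Filter INSTANCE = new Filter() {
      @Override public String describe() { return "shared"; }
    };
    @Override public String typeUrl() { return "type.example/Stateless"; }
    @Override public boolean isClientFilter() { return true; }
    @Override public boolean isServerFilter() { return true; }
    @Override public Filter newInstance() { return INSTANCE; }
  }

  /** What the new mechanism makes possible: a fresh instance per call. */
  static final class PerInstanceProvider implements Provider {
    private int created;
    @Override public String typeUrl() { return "type.example/Stateful"; }
    @Override public boolean isClientFilter() { return true; }
    @Override public boolean isServerFilter() { return false; }
    @Override public Filter newInstance() {
      final int id = ++created;
      return new Filter() {
        @Override public String describe() { return "instance-" + id; }
      };
    }
  }
}
```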
2025-02-18 10:47:01 -08:00
Eric Anderson 713607056e util: Use acceptResolvedAddresses() for MultiChildLb children
A failing Status from acceptResolvedAddresses means something is wrong
with the config, but parts of the config may still have been applied.
Thus there are now two possible flows: errors that should prevent
updateOverallBalancingState() and errors that should have no effect
other than the return code. To manage that, MultiChildLb must always be
responsible for calling updateOverallBalancingState().
acceptResolvedAddressesInternal() was inlined to make that error
processing easier. No existing usages actually needed to have logic
between updating the children and regenerating the picker.

RingHashLb already was verifying that the address list was not empty, so
the short-circuiting when acceptResolvedAddressesInternal() returned an
error was impossible to trigger. WrrLb's updateWeightTask() calls the
last picker, so it can run before acceptResolvedAddressesInternal(); the
only part that matters is re-creating the weightUpdateTimer.
2025-02-18 07:33:49 -08:00
Larry Safran 41dd0c6d73
xds:Cleanup to reduce test flakiness (#11895)
* Don't process resourceDoesNotExist for watchers that have been cancelled.

* Change test to use an ArgumentMatcher instead of expecting that only the final result will be sent, since, depending on timing, there may be configs sent for clusters being removed with their entries as errors.
2025-02-14 10:23:54 -08:00
Larry Safran 764a4e3f08
xds: Cleanup by moving methods in XdsDependencyManager ahead of classes (#11890)
* Move private methods ahead of classes
2025-02-11 14:34:46 -08:00
Larry Safran ade2dd2038
xds: Change XdsClusterConfig to have children field instead of endpoint (#11888)
* Change XdsConfig to match spec with a `children` object holding either `a list of leaf cluster names` or `an EdsUpdate`.  Removed intermediate aggregate nodes from `XdsConfig.clusters`.
2025-02-11 12:38:52 -08:00
Sergii Tkachenko bd6af59221
xds: improve code readability of server FilterChain parsing
- Improve code flow and variable names
- Reduce nesting
- Add comments between logical blocks
- Add comments explaining some xDS/gRPC nuances
2025-02-10 17:14:07 -08:00
Larry Safran 67fc2e156a
Add new classes for eliminating xds config tears (#11740)
* Framework definition to support A74
2025-02-07 16:33:17 -08:00
Eric Anderson 199a7ea3e8
xds: Improve XdsNR's selectConfig() variable handling
The variables from the do-while are no longer initialized to let the
compiler verify that the loop sets each. Unnecessary comparisons to null
are also removed, which is more obviously safe as the variables are never
set to null. Added a minor optimization of computing the RPC's path once instead
of once for each route. The variable declarations were also sorted to
match their initialization order.

This does fix an unlikely bug: if the old code successfully matched a
route but failed to retain the cluster, then on the second attempt, if
the route was _not_ matched, it would re-use the prior route and thus
infinite-loop failing to retain that same cluster.

It also adds a missing cast to unsigned long for a uint32 weight. The old
code would detect if the _sum_ was negative, but a weight using 32 bits
would have been negative and never selected.
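
The uint32 issue can be shown with a short sketch. The helper names are hypothetical; the only library call relied on is `Integer.toUnsignedLong`, which reinterprets the 32 bits of an `int` as an unsigned value.

```java
/** Hypothetical weighted-pick helpers that treat int-stored weights as uint32. */
final class UnsignedWeightSketch {
  /** Sums uint32 weights stored in Java ints, treating each as unsigned. */
  static long totalWeight(int[] uint32Weights) {
    long sum = 0;
    for (int w : uint32Weights) {
      sum += Integer.toUnsignedLong(w);
    }
    return sum;
  }

  /** Picks the index whose cumulative unsigned weight range contains point. */
  static int pick(int[] uint32Weights, long point) {
    long accumulated = 0;
    for (int i = 0; i < uint32Weights.length; i++) {
      accumulated += Integer.toUnsignedLong(uint32Weights[i]);
      if (point < accumulated) {
        return i;
      }
    }
    throw new IllegalArgumentException("point is outside the total weight");
  }
}
```

Without the conversion, a weight such as `0x80000000` would be added as a negative number, so its entry could never be reached by the cumulative comparison.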
2025-02-05 10:37:22 -08:00
Eric Anderson 04f1cc5845 xds: Make XdsNR.RoutingConfig.empty a constant
The field was made final in 4b52639aa but was soon reverted in 3ebb3e192
because of what I assume was a bad merge conflict resolution. The field
has contained an immutable object since its introduction in d25f5acf1,
so it is pretty likely to remain a constant in the future.
2025-01-30 15:10:12 -08:00
Eric Anderson c506190b0f
xds: Reuse filter interceptors across RPCs
This moves the interceptor creation from the ConfigSelector to the
resource update handling.

The code structure changes will make adding support for filter
lifecycles (for RLQS) a bit easier. The filter lifecycles will allow
filters to share state across interceptors, and constructing all the
interceptors on a single thread will mean filters wouldn't need to be
thread-safe (but their interceptors would be thread-safe).
2025-01-30 12:43:51 -08:00
Eric Anderson b3db8c2489 xds: Allow FaultFilter's interceptor to be reused
This is the only usage of PickSubchannelArgs when creating a filter's
ClientInterceptor, and a follow-up commit will remove the argument and
actually reuse the interceptors. Other filter's interceptors can
already be reused.

There doesn't seem to be any significant loss of legibility by making
FaultFilter a more ordinary interceptor, but the change does cause the
ForwardingClientCall to be present when faultDelay is configured,
independent of whether the fault delay ends up being triggered.

Reusing interceptors will move more state management out of the RPC path
which will be more relevant with RLQS.
2025-01-29 14:21:53 -08:00
Kannan J 0f5503ebb1
xds: Include max concurrent request limit in the error status for concurre… (#11845)
Include max concurrent request limit in the error status for concurrent connections limit exceeded
2025-01-23 21:40:21 +05:30
Eric Anderson 495a8906b2 xds: Fix fallback test FakeClock TSAN failure
d65d3942e increased the test speed of
connect_then_mainServerDown_fallbackServerUp by using FakeClock.
However, it introduced a data race because FakeClock is not thread-safe.
This change injects a single thread for gRPC callbacks such that
syncContext is run on a thread under the test's control.

A simpler approach would be to expose syncContext from XdsClientImpl for
testing. However, this test is in a different package and I wanted to
avoid adding a public method.

```
  Read of size 8 at 0x00008dec9d50 by thread T25:
    #0 io.grpc.internal.FakeClock$ScheduledExecutorImpl.schedule(Lio/grpc/internal/FakeClock$ScheduledTask;JLjava/util/concurrent/TimeUnit;)V FakeClock.java:140
    #1 io.grpc.internal.FakeClock$ScheduledExecutorImpl.schedule(Ljava/lang/Runnable;JLjava/util/concurrent/TimeUnit;)Ljava/util/concurrent/ScheduledFuture; FakeClock.java:150
    #2 io.grpc.SynchronizationContext.schedule(Ljava/lang/Runnable;JLjava/util/concurrent/TimeUnit;Ljava/util/concurrent/ScheduledExecutorService;)Lio/grpc/SynchronizationContext$ScheduledHandle; SynchronizationContext.java:153
    #3 io.grpc.xds.client.ControlPlaneClient$AdsStream.handleRpcStreamClosed(Lio/grpc/Status;)V ControlPlaneClient.java:491
    #4 io.grpc.xds.client.ControlPlaneClient$AdsStream.lambda$onStatusReceived$0(Lio/grpc/Status;)V ControlPlaneClient.java:429
    #5 io.grpc.xds.client.ControlPlaneClient$AdsStream$$Lambda+0x00000001004a95d0.run()V ??
    #6 io.grpc.SynchronizationContext.drain()V SynchronizationContext.java:96
    #7 io.grpc.SynchronizationContext.execute(Ljava/lang/Runnable;)V SynchronizationContext.java:128
    #8 io.grpc.xds.client.ControlPlaneClient$AdsStream.onStatusReceived(Lio/grpc/Status;)V ControlPlaneClient.java:428
    #9 io.grpc.xds.GrpcXdsTransportFactory$EventHandlerToCallListenerAdapter.onClose(Lio/grpc/Status;Lio/grpc/Metadata;)V GrpcXdsTransportFactory.java:149
    #10 io.grpc.PartialForwardingClientCallListener.onClose(Lio/grpc/Status;Lio/grpc/Metadata;)V PartialForwardingClientCallListener.java:39
    ...

  Previous write of size 8 at 0x00008dec9d50 by thread T4 (mutexes: write M0, write M1, write M2, write M3):
    #0 io.grpc.internal.FakeClock.forwardTime(JLjava/util/concurrent/TimeUnit;)I FakeClock.java:368
    #1 io.grpc.xds.XdsClientFallbackTest.connect_then_mainServerDown_fallbackServerUp()V XdsClientFallbackTest.java:358
    ...
```
2025-01-22 16:00:00 -08:00
Eric Anderson fc86084df5 xds: Rename grpc.xds.cluster to grpc.lb.backend_service
The name is being changed to allow the value to be used in more metrics
where xds-specifics are awkward.
2025-01-17 17:16:32 -08:00
MV Shiva b44ebce45d
xds: Envoy proto sync to 2024-11-11 (#11816) 2025-01-17 14:58:52 +05:30
Eric Anderson 7162d2d661 xds: Pass grpc.xds.cluster label to tracer
This is in service to gRFC A89. Since the gRFC isn't finalized this
purposefully doesn't really do anything yet. The grpc-opentelemetry
change to use this optional label will be done after the gRFC is merged.
grpc-opentelemetry currently has a hard-coded list (one entry) of labels
that it looks for, and this label will need to be added.

b/356167676
2025-01-13 11:54:36 -08:00
Larry Safran 176f3eed12
xds: Enable Xds Client Fallback by default (#11817) 2025-01-10 15:25:15 -08:00
Eric Anderson d65d3942e6 xds: Increase speed of fallback test
These changes reduce connect_then_mainServerDown_fallbackServerUp test
time from 20 seconds to 5 seconds by faking time for the does-not-exist
timer.

XdsClientImpl only uses the TimeProvider for CSDS cache details, so any
implementation should be fine. FakeXdsClient provides an implementation,
so might as well use it as it is one less clock to think about.
2025-01-10 08:28:48 -08:00
Eric Anderson 70825adce6 Replace jsr305's GuardedBy with Error Prone's
We should avoid jsr305, and Error Prone's version has the same semantics.
2025-01-10 08:16:48 -08:00
Eric Anderson 7b5d0692cc
Replace jsr305's CheckReturnValue with Error Prone's (#11811)
We should avoid jsr305, and Error Prone's version has the same semantics.

Fixes #8687
2025-01-09 13:45:35 -08:00
MV Shiva 1edc4d84d4
xds: Parsing xDS Cluster Metadata (#11741) 2025-01-07 10:03:13 +05:30
Larry Safran 4222f77587
xds:Move creating the retry timer in handleRpcStreamClosed to as late as possible and call close() (#11776)
* Move creating the retry timer in handleRpcStreamClosed to as late as possible and call `close` so that the `call` is cancelled.
Also add some debug logging.
2025-01-06 13:09:42 -08:00
Eric Anderson 6c12c2bd24 xds: Remember nonces for unknown types
If the control plane sends a resource type the client doesn't understand
at-the-moment, the control plane will still expect the client to include
the nonce if the client subscribes to the type in the future.

This most easily happens when unsubscribing the last resource of a type,
which meant 1cf1927d1 was insufficient.
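
The idea of remembering nonces regardless of subscription state can be sketched as follows. The class and method names are hypothetical and the model is much simpler than ControlPlaneClient; it assumes that a request for a type with no previously received response carries an empty nonce.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical tracker that remembers the last nonce per type URL. */
final class NonceTrackerSketch {
  private final Set<String> subscribedTypes = new HashSet<>();
  private final Map<String, String> lastNonceByType = new HashMap<>();

  void subscribe(String typeUrl) {
    subscribedTypes.add(typeUrl);
  }

  /** Unsubscribing deliberately leaves the remembered nonce in place. */
  void unsubscribe(String typeUrl) {
    subscribedTypes.remove(typeUrl);
  }

  /**
   * Records the nonce for every response, even for types that are not
   * currently subscribed. Returns whether the response should be processed.
   */
  boolean onResponse(String typeUrl, String nonce) {
    lastNonceByType.put(typeUrl, nonce);
    return subscribedTypes.contains(typeUrl);
  }

  /** Nonce to echo on the next request for this type; empty if none was received. */
  String nonceForNextRequest(String typeUrl) {
    return lastNonceByType.getOrDefault(typeUrl, "");
  }
}
```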
2025-01-06 11:54:35 -08:00
Eric Anderson 4a0f707331 xds: Avoid depending on io.grpc.xds.Internal* classes
Internal* classes should generally be accessors that are used outside of
the package/project. Only one attribute was used outside of xds, so
leave only that one attribute in InternalXdsAttributes. One attribute
was used by the internal.security package, so move the definition to the
same package to reduce the circular dependencies.
2025-01-03 16:01:10 -08:00
Eric Anderson 1cf1927d1a
xds: Preserve nonce when unsubscribing type
This fixes a regression introduced in 19c9b998.

b/374697875
2025-01-03 12:34:47 -08:00
Eric Anderson 9a712c3f77 xds: Make XdsClient.ResourceStore package-private
There's no reason to use the interface outside of
XdsClientImpl/ControlPlaneClient. Since XdsClientImpl implements the
interface directly, its methods are still public. That can be a future
cleanup.
2025-01-03 11:45:55 -08:00
Benjamin Peterson 8c261c3f28
Fix typo in deprecated blocking stub javadoc. (#11772) 2024-12-26 13:31:34 -08:00
Vindhya Ningegowda 6516c7387e
xds: Remove xds authority label from metric registration (#11760)
* Remove `grpc.xds.authority` label while registering `grpc.xds_client.resources` gauge, until the label value is available to record.
2024-12-20 19:50:09 -08:00
Larry Safran ea8c31c305
Bidi Blocking Stub (#10318) 2024-12-20 16:16:17 -08:00
Larry Safran ef7c2d59c1
xds: Fix XDS control plane client retry timer backoff duration when connection closes after results are received (#11766)
* Fix retry timer backoff duration.

* Reset stopwatch when we had results on AdsStream rather than change the delay calculation logic.
2024-12-19 14:46:58 -08:00
Eric Anderson 8ea3629378
Re-enable animalsniffer, fixing violations
In 61f19d707a I swapped the signatures to use the version catalog. But I
failed to preserve the `@signature` extension and it all seemed to
work... But in fact all the animalsniffer tasks were completing as
SKIPPED as they lacked signatures. The build.gradle changes in this
commit are to fix that while still using version catalog.

But while it was broken, violations crept in. Most violations weren't
too important and we're not surprised they went unnoticed. For example, Netty
with TLS has long required the Java 8 API
`setEndpointIdentificationAlgorithm()`, so using `Optional` in the same
code path didn't harm anything in particular. I still swapped it to
Guava's `Optional` to avoid overuse of `@IgnoreJRERequirement`.

One important violation has not been fixed and instead I've disabled the
android signature in api/build.gradle for the moment.  The violation is
in StatusException using the `fillInStackTrace` overload of Exception.
This problem [had been noticed][PR11066], but we couldn't figure out
what was going on. AnimalSniffer is now noticing this and agreeing with
the internal linter. There is still a question of why our interop tests
failed to notice this, but given they are no longer running on pre-API
level 24, that may forever be a mystery.

[PR11066]: https://github.com/grpc/grpc-java/pull/11066
2024-12-19 07:54:54 -08:00
vinodhabib f8f613984f
xds: fixed unsupported unsigned 32 bits issue for circuit breaker (#11735)
Changed circuit breaking to convert the signed 32-bit int MaxRequest value to a 64-bit long, interpreting it as unsigned, to handle values that would otherwise be negative (-1).

Fixes #11695
2024-12-16 21:37:22 -08:00
Eric Anderson fe752a290e xds: Move specialized APIs out of XdsResourceType
StructOrError is a more generic API, but we have StatusOr now so we
don't want new usages of StructOrError. Moving StructOrError out of
io.grpc.xds.client will make it easier to delete StructOrError once
we've migrated to StatusOr in the future.

TRANSPORT_SOCKET_NAME_TLS should also move, but it wasn't immediately
clear to me where it should go.
2024-12-16 16:03:59 -08:00
Eric Anderson e8ff6da2cf xds: Unexpected types in server_features should be ignored
It was clearly defined in gRFC A30. The relevant text was copied as a
comment in the code.

As discovered due to grpc/grpc-go#7932
2024-12-16 07:29:51 -08:00
Larry Safran 486b8ba67f
Fix tsan error (#11742)
Eliminate unneeded fakeClock.forwardTime() that was causing the conflict.
2024-12-11 17:48:19 -08:00
Larry Safran 210f9c083e
Xds fallback (#11254)
* XDS Client Fallback
2024-12-09 15:42:27 -08:00
Vindhya Ningegowda ebb43a69e7
Add "#server" as dataplane target value for xDS enabled gRPC servers. (#11715)
As mentioned in [A71 xDS Fallback](https://github.com/grpc/proposal/blob/master/A71-xds-fallback.md#update-csds-to-aggregate-configs-from-multiple-xdsclient-instances):
updated dataplane target to "#server" for xDS-enabled gRPC servers.
2024-11-27 10:59:54 -08:00
Vindhya Ningegowda 20d09cee57
xds: Add counter and gauge metrics (#11661)
Adds the following xDS client metrics defined in [A78](https://github.com/grpc/proposal/blob/master/A78-grpc-metrics-wrr-pf-xds.md#xdsclient).

Counters
- grpc.xds_client.server_failure
- grpc.xds_client.resource_updates_valid
- grpc.xds_client.resource_updates_invalid

Gauges
- grpc.xds_client.connected
- grpc.xds_client.resources
2024-11-25 16:47:32 -08:00
Eric Anderson 1f159d7899 xds: Fix XdsSecurityClientServerTest TrustManagerStore race
When spiffe support was added it caused
tlsClientServer_useSystemRootCerts_validationContext to become flaky.
This is because test execution order was important for whether the race
would occur.

Fixes #11678
2024-11-14 22:01:38 -08:00
Eric Anderson 4e8f7df589
util: Remove resolvedAddresses from MultiChildLb.ChildLbState
It isn't actually used by MultiChildLb, and using the health API gives
us more confidence that health is properly plumbed.
2024-11-14 12:56:24 -08:00
Eric Anderson 8237ae270a util: Remove EAG conveniences from MultiChildLb
This is a step toward removing ResolvedAddresses from ChildLbState,
which isn't actually used by MultiChildLb. Most of the EAG usages
can be served more directly without peering into MultiChildLb's
internals or even accessing ChildLbStates, which makes the tests less
sensitive to implementation changes. Some changes do leverage the new
behavior of MultiChildLb where it preserves the order of the entries.

This does fix an important bug in shutdown tests. The tests looped over
the ChildLbStates after shutdown, but shutdown deleted all the children
so it looped over an empty collection. Fixing that exposed that
deliverSubchannelState() didn't function after shutdown, as the listener
was removed from the map when the subchannel was shut down. Moving the
listener onto the TestSubchannel allowed having access to the listener
even after shutdown.
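
The class of test bug described above, a verification loop that passes because it never runs, can be reproduced in isolation. This sketch uses hypothetical helper names and is unrelated to the actual MultiChildLb test utilities.

```java
import java.util.Collection;
import java.util.function.Predicate;

/** Hypothetical helpers contrasting a vacuous check with a guarded one. */
final class VacuousLoopSketch {
  /** Passes trivially when items is empty, which can hide a broken test. */
  static <T> boolean allMatch(Collection<T> items, Predicate<T> check) {
    for (T item : items) {
      if (!check.test(item)) {
        return false;
      }
    }
    return true;
  }

  /** Same check, but refuses to pass without having inspected anything. */
  static <T> boolean allMatchNonEmpty(Collection<T> items, Predicate<T> check) {
    return !items.isEmpty() && allMatch(items, check);
  }
}
```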

In a few places in LeastRequestLb, lines were just deleted, but that's
because an existing assertion already provided the same check but
without digging into MultiChildLb.
2024-11-11 13:16:21 -08:00