Circuit breakers should be applied to clusters in the global scope. However, the LB hierarchy might cause the LB policy that applies circuit breaking (currently EDS, but cluster_impl in the future) to be duplicated. Also, in multi-channel cases, the circuit breaking threshold should still be shared across all channels in the process.
This change creates a global map for accessing the circuit breaking atomics that count the number of outstanding requests on a per-cluster basis. Atomics in the global map are held via WeakReference, so LB policies/Pickers/StreamTracers do not need to manage the counters' lifecycle or refcounting.
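As a rough sketch of the idea (class and method names here are illustrative, not the actual grpc-java API), the map hands out strongly referenced counters while itself holding only weak references:

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Process-wide map from cluster name to its in-flight request counter.
// The map holds counters via WeakReference, so a counter becomes eligible
// for GC once no LB policy/Picker/StreamTracer strongly references it.
final class SharedCallCounters {
  private static final ConcurrentMap<String, WeakReference<AtomicLong>> counters =
      new ConcurrentHashMap<>();

  private SharedCallCounters() {}

  // Returns the shared counter for the cluster, recreating it if it was
  // never created or has already been garbage collected.
  static synchronized AtomicLong getOrCreate(String cluster) {
    WeakReference<AtomicLong> ref = counters.get(cluster);
    AtomicLong counter = ref == null ? null : ref.get();
    if (counter == null) {
      counter = new AtomicLong();
      counters.put(cluster, new WeakReference<>(counter));
    }
    return counter;  // callers must keep a strong reference while in use
  }
}
```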
The proto field is named num_failures, but its comment says it counts the number of RPCs that failed to record a remote peer. Previously, "RPC failed" and "RPC failed to record a remote peer" were equivalent (so no existing tests should be affected by this change) because the server completed RPCs immediately. That is no longer true now that the server is capable of keeping the call open/delayed.
This change clarifies the proto definition for the stats RPC: rpcs_by_peer records RPCs that succeeded and num_failures records RPCs that failed. RPCs still in flight when the stats call times out are not counted toward either stat.
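For illustration only, one way the clarified semantics could be accounted for on the client (hypothetical names, not the interop client's actual code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Succeeded RPCs are recorded per peer (rpcs_by_peer), failed RPCs bump
// num_failures, and RPCs that never complete before the stats call times
// out are simply never recorded in either stat.
final class StatsAccumulator {
  final Map<String, AtomicInteger> rpcsByPeer = new ConcurrentHashMap<>();
  final AtomicInteger numFailures = new AtomicInteger();

  void onRpcCompleted(String peer, boolean succeeded) {
    if (succeeded) {
      rpcsByPeer.computeIfAbsent(peer, k -> new AtomicInteger()).incrementAndGet();
    } else {
      numFailures.incrementAndGet();  // a failed RPC records no peer
    }
  }
}
```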
Update the xDS interop test proto to aggregate accumulated stats by RPC method (mirroring 643e5bcd1e8db931cf76a3be19cd9bba223ee987 in C-core). Update the xDS interop test client to support querying accumulated stats aggregated by RPC method.
Previously the EDS LB policy did not propagate an updated picker using the new circuit breaker threshold and drop policies when those values changed. As a result, new circuit breaker/drop policies were not dynamically applied to new RPCs unless the subchannel state changed. This change fixes that problem: whenever the EDS LB policy receives a config update, it immediately pushes a picker with the corresponding circuit breakers and drop policies to the channel, so the channel always picks up the latest configuration.
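A minimal sketch of the fix, assuming a hypothetical EdsConfig type and a circuit-breaking picker wrapper (the real grpc-java classes differ); the key point is that handleResolvedAddresses() pushes a freshly wrapped picker right away instead of waiting for a subchannel state change:

```java
import io.grpc.ConnectivityState;
import io.grpc.LoadBalancer;
import io.grpc.Status;

final class EdsLoadBalancerSketch extends LoadBalancer {
  // Hypothetical parsed LB config carrying the cluster-level settings.
  static final class EdsConfig {
    final long maxConcurrentRequests;
    EdsConfig(long maxConcurrentRequests) { this.maxConcurrentRequests = maxConcurrentRequests; }
  }

  private final Helper helper;
  private ConnectivityState state = ConnectivityState.CONNECTING;
  private SubchannelPicker delegate = new SubchannelPicker() {
    @Override public PickResult pickSubchannel(PickSubchannelArgs args) {
      return PickResult.withNoResult();
    }
  };

  EdsLoadBalancerSketch(Helper helper) { this.helper = helper; }

  @Override
  public void handleResolvedAddresses(ResolvedAddresses resolvedAddresses) {
    EdsConfig config = (EdsConfig) resolvedAddresses.getLoadBalancingPolicyConfig();
    // The fix: immediately hand the channel a picker reflecting the new
    // config, so new RPCs see the latest circuit breakers and drop policies.
    helper.updateBalancingState(state, wrapWithCircuitBreaking(delegate, config));
  }

  private SubchannelPicker wrapWithCircuitBreaking(SubchannelPicker picker, EdsConfig config) {
    return picker;  // limiting logic elided; see the picker sketch further below
  }

  @Override public void handleNameResolutionError(Status error) {}
  @Override public void shutdown() {}
}
```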
Round robin keeps using READY subchannels even when there is a name resolution error. However, it moves the Channel state to TRANSIENT_FAILURE.
In hierarchical load balancers, an upstream LB policy may need to aggregate pickers from multiple downstream round_robin LB policies while filtering out non-ready subchannels. It cannot infer whether a subchannel is usable from the SubchannelPicker interface alone; it relies on the state that round_robin intends to set the channel to.
So this change matches the readiness of the picker/subchannels with the state that round_robin reports: name resolution errors are completely ignored as long as there are READY subchannels.
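A simplified sketch of the intended behavior (not the actual round_robin implementation; the subchannel bookkeeping is condensed). The reported state always matches what the picker can actually do:

```java
import io.grpc.ConnectivityState;
import io.grpc.LoadBalancer;
import io.grpc.Status;
import java.util.ArrayList;
import java.util.List;

final class RoundRobinSketch extends LoadBalancer {
  private final Helper helper;
  // Maintained from subchannel state callbacks (elided in this sketch).
  private final List<Subchannel> readySubchannels = new ArrayList<>();

  RoundRobinSketch(Helper helper) { this.helper = helper; }

  @Override
  public void handleNameResolutionError(Status error) {
    if (!readySubchannels.isEmpty()) {
      // Keep using READY subchannels and keep reporting READY, so an
      // upstream policy aggregating this picker still sees a usable one.
      helper.updateBalancingState(ConnectivityState.READY, new ReadyPicker(readySubchannels));
      return;
    }
    // Only with no READY subchannels does the error surface to the channel.
    helper.updateBalancingState(
        ConnectivityState.TRANSIENT_FAILURE,
        new SubchannelPicker() {
          @Override public PickResult pickSubchannel(PickSubchannelArgs args) {
            return PickResult.withError(error);
          }
        });
  }

  @Override public void shutdown() {}

  // Round-robins over READY subchannels only.
  static final class ReadyPicker extends SubchannelPicker {
    private final List<Subchannel> list;
    private int index;

    ReadyPicker(List<Subchannel> list) { this.list = new ArrayList<>(list); }

    @Override public synchronized PickResult pickSubchannel(PickSubchannelArgs args) {
      Subchannel subchannel = list.get(index);
      index = (index + 1) % list.size();
      return PickResult.withSubchannel(subchannel);
    }
  }
}
```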
This change refactors the client-side XdsClient's unit tests. The main testing logic (test cases) lives in an abstract class, while the extending classes provide xDS version-specific services and messages. With this approach, we do not have to maintain two copies of the test logic to cover both the v2 and v3 xDS protocols: whenever XdsClient's own logic changes, we only need to modify the corresponding test logic in the abstract class. This approach should also be sustainable for future xDS protocol version upgrades without the need to re-implement the test logic.
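The pattern looks roughly like this (illustrative names; the real test classes carry much more machinery):

```java
import org.junit.Test;

// Shared test cases live in the abstract base; only message construction
// and type URLs are version-specific.
public abstract class XdsClientTestBase {
  protected abstract Object buildDiscoveryResponse(String versionInfo, String typeUrl);

  protected abstract String clusterTypeUrl();

  @Test
  public void resourceUpdate_deliveredToWatcher() {
    Object response = buildDiscoveryResponse("0", clusterTypeUrl());
    // ... assertions exercising XdsClient's version-independent logic ...
  }
}

// The v2 flavor only supplies v2 messages; a v3 subclass does the same
// with the envoy.config.cluster.v3 types.
class XdsClientV2Test extends XdsClientTestBase {
  @Override protected Object buildDiscoveryResponse(String versionInfo, String typeUrl) {
    return null;  // build an envoy.api.v2.DiscoveryResponse here
  }

  @Override protected String clusterTypeUrl() {
    return "type.googleapis.com/envoy.api.v2.Cluster";
  }
}
```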
Implemented xDS circuit breaking for the maximum number of requests that can be in flight. The threshold is retrieved from CDS responses and is configured at the cluster level. It is implemented by wrapping the picker spawned by the EDS LB policy (which resolves endpoints for a single cluster) with stream-limiting logic: when the picker tries to create a new stream (that is, a new call), it is gated by the number of open streams created by the current EDS LB policy. RPCs dropped by circuit breakers are recorded in the cluster-level total drop count and reported to TD via LRS.
In the future, multiple gRPC channels may be load balancing requests to the same (global) cluster. Those requests should share the same quota for the maximum number of in-flight requests, so we will use a global counter to aggregate the number of currently in-flight requests per cluster.
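Putting the pieces together, a sketch of the stream-limiting picker (a hypothetical class, not grpc-java's actual implementation) that consults such a shared per-cluster counter:

```java
import io.grpc.LoadBalancer.PickResult;
import io.grpc.LoadBalancer.PickSubchannelArgs;
import io.grpc.LoadBalancer.SubchannelPicker;
import io.grpc.Status;
import java.util.concurrent.atomic.AtomicLong;

// Wraps the EDS policy's picker; the AtomicLong is the shared per-cluster
// counter (e.g. from the SharedCallCounters sketch above), so every channel
// load balancing to the same cluster draws from one in-flight quota.
final class CircuitBreakingPicker extends SubchannelPicker {
  private final SubchannelPicker delegate;
  private final AtomicLong inFlight;
  private final long maxConcurrentRequests;

  CircuitBreakingPicker(SubchannelPicker delegate, AtomicLong inFlight, long maxConcurrentRequests) {
    this.delegate = delegate;
    this.inFlight = inFlight;
    this.maxConcurrentRequests = maxConcurrentRequests;
  }

  @Override
  public PickResult pickSubchannel(PickSubchannelArgs args) {
    if (inFlight.get() >= maxConcurrentRequests) {
      // Counted as a cluster-level drop; reported to TD via LRS in the
      // real implementation.
      return PickResult.withDrop(
          Status.UNAVAILABLE.withDescription("Cluster max concurrent requests limit exceeded"));
    }
    // A stream tracer (elided) increments the counter when the stream
    // starts and decrements it when the stream completes.
    return delegate.pickSubchannel(args);
  }
}
```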
The xDS resource version info persists across ADS stream recreation, so the management server may choose not to resend resources that were already sent on the previous stream. The client should therefore not consider previously received (resolved) resources nonexistent just because it does not receive them on the new ADS stream. Initial resource fetch timers should only be scheduled for still-unresolved resources when the ADS stream is recreated.
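An illustrative sketch of the timer logic (hypothetical names; the 15-second value stands in for the initial fetch timeout):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

final class ResourceTimerSketch {
  private final ScheduledExecutorService timerService =
      Executors.newSingleThreadScheduledExecutor();
  private final Map<String, Object> resolvedResources = new ConcurrentHashMap<>();
  private final Map<String, ScheduledFuture<?>> fetchTimers = new ConcurrentHashMap<>();

  void onAdsStreamRecreated(Iterable<String> subscribedResources) {
    for (String name : subscribedResources) {
      if (resolvedResources.containsKey(name)) {
        // Resolved on a previous stream: the server may legitimately not
        // resend it, so do not start an absence timer.
        continue;
      }
      fetchTimers.computeIfAbsent(
          name, n -> timerService.schedule(() -> onResourceTimeout(n), 15, TimeUnit.SECONDS));
    }
  }

  private void onResourceTimeout(String name) {
    // Notify watchers that the resource does not exist (elided).
  }
}
```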
* fix channel builders ABI backward compatibility broken in v1.33.0
* fix server builders ABI backward compatibility broken in v1.33.0
* make ForwardingServerBuilder package-private
With this, it will be clear whether the RPC failed because the server didn't
use a double-GOAWAY, because of MAX_CONCURRENT_STREAMS, or because of a
local race. It also fixes the status code to be UNAVAILABLE, except for the
RPCs included in the GOAWAY error (modulo the Netty bug).
Fixes #5855
Use a global factory to create a shared XdsClient object pool that can be used by multiple client channels. The object pool is thread-safe and holds a single XdsClient that it returns to each client channel, so at most one XdsClient instance is created per process and shared among client channels.
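A refcounted-pool sketch of the idea (names are hypothetical; the actual grpc-java classes differ): the first channel to ask creates the XdsClient, and the last channel to return it shuts it down.

```java
// Thread-safe, process-wide pool that hands every channel the same
// XdsClient instance. Object stands in for the real XdsClient type.
final class SharedXdsClientPoolSketch {
  private static final SharedXdsClientPoolSketch INSTANCE = new SharedXdsClientPoolSketch();

  private Object xdsClient;
  private int refCount;

  static SharedXdsClientPoolSketch getInstance() { return INSTANCE; }

  // Each channel borrows the shared client; created lazily on first use.
  synchronized Object getObject() {
    if (xdsClient == null) {
      xdsClient = createXdsClient();  // at most one instance per process
    }
    refCount++;
    return xdsClient;
  }

  // Channels return the client on shutdown; the last return tears it down.
  synchronized void returnObject() {
    if (--refCount == 0) {
      shutdownXdsClient(xdsClient);
      xdsClient = null;
    }
  }

  private Object createXdsClient() { return new Object(); }
  private void shutdownXdsClient(Object client) {}
}
```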
LoadReportClient is a subcomponent of XdsClient. Since XdsClient uses a SynchronizationContext to synchronize its operations, calls to LoadReportClient APIs should all come from that SynchronizationContext. Hence, we can pass that SynchronizationContext into LoadReportClient to synchronize its RPC operations as well. This eliminates the synchronization that LoadReportClient would otherwise need itself.
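A sketch of the resulting shape (constructor and method names are illustrative): API calls are assumed to arrive in the XdsClient's SynchronizationContext, and stream callbacks hop onto it, so the class needs no locks of its own.

```java
import io.grpc.SynchronizationContext;

final class LoadReportClientSketch {
  private final SynchronizationContext syncContext;
  private boolean started;  // guarded by syncContext, not by a lock

  LoadReportClientSketch(SynchronizationContext syncContext) {
    this.syncContext = syncContext;
  }

  // Assumed to be called from the XdsClient's syncContext.
  void startLoadReporting() {
    started = true;
    // ... open the LRS stream ...
  }

  // gRPC stream callbacks arrive on arbitrary threads, so hop over.
  void onLrsResponse(Object response) {
    syncContext.execute(() -> handleResponse(response));
  }

  private void handleResponse(Object response) {
    // Safe: all state access happens within the single syncContext.
  }
}
```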
A LoadStatsStore instance is used for recording client stats for a global cluster. A single instance may be shared by multiple client channels, so it must be thread-safe.
Two major changes are involved:
- Separate the client- and server-side XdsClient code paths. Currently the single XdsClientImpl2 implementation runs separate code paths for client-side and server-side usages. Due to different implementation progress on the two sides, the client and server implementations diverge in support for multiple watchers, watcher removal, response data caching, the synchronization model, etc. It became cumbersome to keep them in a single class. The separation effectively duplicates the XdsClientImpl2 class for client and server so that the two sides can be developed independently, but we introduced an AbstractXdsClient to reuse common code, such as the logic for the xDS RPC stream. More details can be found in go/separate-client-server-xds-client.
- Change the synchronization model for the client-side APIs. Multiple gRPC Channels will share a single XdsClient instance, so the client-side APIs need to be thread-safe. The XdsClient also needs to synchronize API calls and xDS RPC callbacks without relying on a particular Channel's SynchronizationContext. This is done by using the XdsClient's own lock, as sketched below.
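A sketch of the lock-based model (hypothetical shape): watcher registrations from any channel's thread and updates from the ADS stream all funnel through the XdsClient's own lock.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class XdsClientSketch {
  private final Object lock = new Object();
  private final Map<String, Set<Object>> watchers = new HashMap<>();

  // May be called concurrently from any channel's thread.
  void watchResource(String resourceName, Object watcher) {
    synchronized (lock) {
      watchers.computeIfAbsent(resourceName, n -> new HashSet<>()).add(watcher);
    }
  }

  // ADS stream callback; synchronizes on the same lock, independent of any
  // particular Channel's SynchronizationContext.
  void onResourceUpdate(String resourceName, Object update) {
    synchronized (lock) {
      for (Object watcher : watchers.getOrDefault(resourceName, Collections.emptySet())) {
        // deliver the update to the watcher (delivery details elided)
      }
    }
  }
}
```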
The stream creation was failing because the stream id was disallowed:
Caused by: io.grpc.StatusRuntimeException: INTERNAL: http2 exception
at io.grpc.Status.asRuntimeException(Status.java:533)
at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:629)
... 16 more
Caused by: io.grpc.netty.shaded.io.netty.handler.codec.http2.Http2Exception$StreamException: Cannot create stream 222691 greater than Last-Stream-ID 222689 from GOAWAY.
The problem was introduced in 9ead606. Fixes #7357