Not updating the example WORKSPACE because it doesn't have any
Bazel-enabled build that depends on xds, so it doesn't need the
additional repository dependencies.
Fixes #9162
The test appears to be slow because of classloading. The failure cases
were very slow at 14-16 seconds, but other logs show it succeeding
after 12 seconds. It is the first test in the class, and the other tests
run much faster. This could be solved with warmup code, but increasing
the RPC deadline is easier.
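As a sketch of that easier fix (the stub type and numbers are illustrative, not the actual test code):
```java
import java.util.concurrent.TimeUnit;

import io.grpc.Channel;

// Illustrative only: GreeterGrpc stands in for whatever generated stub the
// test uses. A 20s deadline leaves headroom for the 12-16s classloading cost.
final class DeadlineHeadroomExample {
  static GreeterGrpc.GreeterBlockingStub stubWithHeadroom(Channel channel) {
    return GreeterGrpc.newBlockingStub(channel)
        .withDeadlineAfter(20, TimeUnit.SECONDS);
  }
}
```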
Two back-to-back failures on aarch64:
https://source.cloud.google.com/results/invocations/c4612a28-d594-42e9-b8ab-12c999690b40/targets
https://source.cloud.google.com/results/invocations/3d5d1dc2-6b47-493d-b15c-e99458067d73/targets
```
expected to be true
at app//io.grpc.rls.CachingRlsLbClientTest.rls_withCustomRlsChannelServiceConfig(CachingRlsLbClientTest.java:267)
```
And the next run failed on a different line, but it seems to be the same cause:
https://source.cloud.google.com/results/invocations/546b83d1-cd26-4b87-8871-a7a06a60dc06/targets
```
expected to be true
at app//io.grpc.rls.CachingRlsLbClientTest.rls_withCustomRlsChannelServiceConfig(CachingRlsLbClientTest.java:273)
```
Reproduced with:
```diff
diff --git a/rls/src/test/java/io/grpc/rls/CachingRlsLbClientTest.java b/rls/src/test/java/io/grpc/rls/CachingRlsLbClientTest.java
index 9fac852fa..631d632eb 100644
--- a/rls/src/test/java/io/grpc/rls/CachingRlsLbClientTest.java
+++ b/rls/src/test/java/io/grpc/rls/CachingRlsLbClientTest.java
@@ -264,6 +264,11 @@ public class CachingRlsLbClientTest {
     // initial request
     CachedRouteLookupResponse resp = getInSyncContext(routeLookupRequest);
+    try {
+      Thread.sleep(2000);
+    } catch (Exception e) {
+      throw new RuntimeException(e);
+    }
     assertThat(resp.isPending()).isTrue();
     // server response
```
Ticker is powered by System.nanoTime() which is CLOCK_MONOTONIC.
TimeProvider is powered by System.currentTimeMillis() which is
CLOCK_REALTIME. For durations, the monotonic clock is appropriate, not
the wall clock, which can jump around.
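As a minimal sketch of why that matters when measuring a duration:
```java
// Measure elapsed time with the monotonic clock. System.nanoTime() is backed
// by CLOCK_MONOTONIC, so NTP corrections or manual clock changes cannot make
// the measured duration negative or wildly wrong.
final class DurationExample {
  static long elapsedNanos(Runnable work) {
    long startNanos = System.nanoTime();
    work.run();
    return System.nanoTime() - startNanos;
    // Using System.currentTimeMillis() (CLOCK_REALTIME) here would misreport
    // the duration whenever the wall clock is stepped mid-measurement.
  }
}
```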
There was an attempt to use different epochs for the wall clock and the
monotonic clock. However, the offset used, 123456789 nanoseconds, is
actually less than a second. We want the gap between the clocks to be at
least a day. This issue was discovered in #8968.
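To sketch the intent (constants are illustrative, not the actual values):
```java
import java.util.concurrent.TimeUnit;

// Illustrative epoch constants for a fake clock: wall time and ticker start
// far enough apart that any code mixing the two clocks is off by days, not
// by ~0.12s, so tests fail loudly instead of passing by accident.
final class FakeClockEpochs {
  static final long WALL_TIME_EPOCH_NANOS = TimeUnit.DAYS.toNanos(1000);
  static final long TICKER_EPOCH_NANOS =
      WALL_TIME_EPOCH_NANOS + TimeUnit.DAYS.toNanos(10);
}
```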
This separation found a bug in an RLS test that was mixing epochs.
However, it was only a problem in the test. The code under test wrongly
uses the wall clock for calculating durations, but that seems to be a
widespread problem and will need to be handled separately.
We shouldn't require addresses to be non-empty for the child lb of rls_lb. That might be the right requirement when the child lb is grpclb, but in our new use case the child lb will be the cds lb, which will only work if an empty address list is allowed.
Fixing b/223866089#comment24
Refactor to use `@AutoValue` for data types. This reduces human mistakes on `equals()`, `hashCode()`, and `toString()` while we are constantly adding and changing member fields of the data type.
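For illustration, a generic sketch (not one of the actual RLS types) of what `@AutoValue` buys us:
```java
import com.google.auto.value.AutoValue;

@AutoValue
abstract class ExampleRouteLookupRequest {
  abstract String server();
  abstract String path();

  static ExampleRouteLookupRequest create(String server, String path) {
    // AutoValue_ExampleRouteLookupRequest is generated at compile time with
    // field-by-field equals(), hashCode(), and toString(), so adding a field
    // can't silently leave those methods out of sync.
    return new AutoValue_ExampleRouteLookupRequest(server, path);
  }
}
```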
Implementing the latest change for RLS lb config.
```
The configuration for the LB policy will be of the following form:
{
"routeLookupConfig": <JSON form of RouteLookupConfig proto>,
"routeLookupChannelServiceConfig": {...service config JSON...},
"childPolicy": [
{"<policy name>": {...child policy config...}}
],
"childPolicyConfigTargetFieldName": "<name of field>"
}
```
> If the `routeLookupChannelServiceConfig` field is present, we will pass the specified service config to the RLS control plane channel, and we will disable fetching service config via that channel's resolver.
When the client channel shuts down, the RlsLoadBalancer shuts down as well. However, the child load balancers of RlsLoadBalancer were not shut down. This causes b/209831670.
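A minimal sketch of the fix (type and field names are assumptions, not the actual RLS code):
```java
import io.grpc.LoadBalancer;
import java.util.List;

// Illustrative: without this cascade, the children's subchannels outlive
// the channel that created them.
final class CascadingShutdownExample {
  private final List<LoadBalancer> childBalancers;

  CascadingShutdownExample(List<LoadBalancer> childBalancers) {
    this.childBalancers = childBalancers;
  }

  void shutdown() {
    for (LoadBalancer child : childBalancers) {
      child.shutdown();  // releases the child's subchannels and resources
    }
  }
}
```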
The `RlsProtoData.RouteLookupConfig` class is out of date.
- Some of the fields were `long` but are now of `Duration` type.
- Some of the fields have been deleted.
- The validation of some of the fields has either changed or was wrong from the beginning.
Now overhaul all the fields in the `RlsProtoData.RouteLookupConfig` class based on the spec http://go/grpc-rls-lb-policy-design#heading=h.y3h669gfpown.
Also move the validation logic into JSON parsing rather than the constructor of `RouteLookupConfig`, as sketched below.
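A rough sketch of the parsing-time validation (helper and field names here are hypothetical):
```java
import static com.google.common.base.Preconditions.checkArgument;

import java.util.Map;

// Hypothetical parsing-time validation, in the spirit of the change: reject
// bad values while converting the raw JSON map, so an invalid config is
// rejected before a RouteLookupConfig instance ever exists.
final class RouteLookupConfigParsingExample {
  static String parseLookupService(Map<String, ?> rawJson) {
    Object lookupService = rawJson.get("lookupService");
    checkArgument(
        lookupService instanceof String && !((String) lookupService).isEmpty(),
        "lookupService must be a non-empty string");
    return (String) lookupService;
  }
}
```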
- Partially revert the change to RlsProtoData.java in #8612 by removing the `public` accessors
- Have grpc-xds no longer strongly depend on grpc-rls. Applications will need grpc-rls as a runtime dependency if they need the route lookup feature in xds.
- Parse RouteLookupServiceClusterSpecifierPlugin config to the Json/Map representation of `io.grpc.lookup.v1.RouteLookupClusterSpecifier` instead of `io.grpc.rls.RlsProtoData.RouteLookupConfig`
Add RlsClusterSpecifierPlugin as per the [design doc](http://go/grpc-rls-in-xds#heading=h.dmyrvi6ohebx)
The structure of `ClusterSpecifierPlugin` is very similar to `io.grpc.xds.Filter`.
The following changes to the existing code are made:
- move `ConfigOrError` class out of `Filter` class to be shared with `ClusterSpecifierPlugin`
- make `io.grpc.rls.RlsProtoData` public to be accessible by `io.grpc.xds`
- treat an empty defaultTarget in `io.grpc.rls.RlsProtoData.RouteLookupConfig` as null to support both JSON and proto configs without the defaultTarget field specified.
Fix connectivity state aggregation as per http://go/grpc-rls-lb-policy-design#heading=h.6e8tt7xcwcdn
> Note that, for the purposes of aggregation, when a child policy reports TRANSIENT_FAILURE, we consider it to continue to be in that state until it reports READY (i.e., we ignore CONNECTING in between the two, no matter how many times it bounces back and forth between TRANSIENT_FAILURE and CONNECTING).
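A sketch of that rule (names are illustrative; this is not the actual aggregation code):
```java
import io.grpc.ConnectivityState;

// Illustrative per-child tracker implementing the "sticky TRANSIENT_FAILURE"
// rule quoted above: after a failure, CONNECTING still counts as failure
// until the child reports READY.
final class ChildStateTracker {
  private ConnectivityState rawState = ConnectivityState.CONNECTING;
  private boolean readySinceLastFailure = true;

  void onStateChange(ConnectivityState newState) {
    if (newState == ConnectivityState.READY) {
      readySinceLastFailure = true;
    } else if (newState == ConnectivityState.TRANSIENT_FAILURE) {
      readySinceLastFailure = false;
    }
    rawState = newState;
  }

  /** The state to feed into aggregation. */
  ConnectivityState effectiveState() {
    if (!readySinceLastFailure && rawState == ConnectivityState.CONNECTING) {
      return ConnectivityState.TRANSIENT_FAILURE;
    }
    return rawState;
  }
}
```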
This can be used by annotation processors to avoid processing the
gRPC-generated code. The normal Generated annotation only has SOURCE
retention, so it isn't available to annotation processors.
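Roughly the shape of such an annotation (name and retention here are illustrative; the actual annotation may differ):
```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// CLASS retention survives into the .class file, so annotation processors and
// bytecode tools can see it; SOURCE retention would be stripped by javac.
@Retention(RetentionPolicy.CLASS)
@Target(ElementType.TYPE)
public @interface ExampleGrpcGenerated {}
```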
I don't include the service name within the annotation as that assumes
we'll never have need for any other type of generated class. If there's
a request for exposing service name via an annotation in the future, we
can make an RpcService annotation or the like.
Fixes #8158
failOnVersionConflict has never been good for us. It is equivalent to
Maven dependencyConvergence, which we discourage our users from using because
it is too temperamental and _creates_ version skew issues over time.
However, we had no real alternative for determining whether our deps would be
misinterpreted by Maven.
failOnVersionConflict has been a constant drain and makes it really hard
to do seemingly-trivial upgrades. As evidenced by protobuf/build.gradle
in this change, it also caused _us_ to introduce a version downgrade.
This introduces our own custom requireUpperBoundDeps implementation so
that we can get back to simple dependency upgrades _and_ increase our
confidence in a consistent dependency tree.
Currently each subchannel implicitly refreshes the name resolution when its state changes to IDLE or TRANSIENT_FAILURE. That is, this feature is built into the subchannel's internal implementation. Although it eliminates the burden of having LB implementations refresh the resolver when connections to backends are broken, it gives LB policies no chance to disable or override this refresh (e.g., in some complex load balancing hierarchy like xDS, LB policies may embed a resolver for resolving backends, so the refresh operation should be hooked to the resolver embedded in the LB policy instead of the one in the Channel).
To make this transition smooth, we add a check to SubchannelImpl that checks whether the LoadBalancer has explicitly called Helper.refreshNameResolution for broken subchannels created by it. If not, it logs a warning and does the refresh.
A temporary LoadBalancer.Helper API ignoreRefreshNameResolution() is added to avoid false-positive warnings for xDS, which intentionally does not want a refresh. Once the migration is done, this should be deleted.
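Roughly what a migrated LB policy is expected to do itself (a sketch, not the actual migration code):
```java
import io.grpc.ConnectivityState;
import io.grpc.LoadBalancer;

// Illustrative: refresh the resolver when a subchannel this policy created
// breaks, so the channel-level fallback (and its warning) never triggers.
final class RefreshOnBrokenSubchannel {
  static void watch(LoadBalancer.Helper helper, LoadBalancer.Subchannel subchannel) {
    subchannel.start(stateInfo -> {
      ConnectivityState state = stateInfo.getState();
      if (state == ConnectivityState.IDLE
          || state == ConnectivityState.TRANSIENT_FAILURE) {
        helper.refreshNameResolution();
      }
      // ...propagate the state change to the picker as usual...
    });
  }
}
```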
The serviceName field in the oobChannel grpclb config should not be null; otherwise it will default to lbHelper.getAuthority(), which previously defaulted to the lookup service before #7852, but has been overridden to the backend service for authentication in #7852.
The previous fix #7878 didn't work because the server field is expected to be the full hostname (without the port number). Need to strip the port from the authority.
The server field in the lookup request, as specified in go/dynamic-request-routing/#heading=h.eqjtcpo6u8ep, should be the original target, not the RLS server where the lookup request is sent.
The RLS RPC deadline is configured by the service config and could be extremely long. When the RLS lb is shut down, any pending RLS RPC should be cancelled. Now using shutdownNow() to forcefully close the RLS channel.
Resolves #7741
Some of the static methods in generated code have the same method name but different package names, such as `ClientCalls.asyncClientStreamingCall` and `ServerCalls.asyncClientStreamingCall`. Using a static import is less readable than using the fully-qualified method name in place.
Clean up `toString()` for cache entries, and print more debug information about the cache entry in `pickSubchannel()`. This will be more helpful for debugging.
The `default_target` field can be unset per the [spec](http://go/grpc-rls-lb-policy-design)
Also fixed a synchronization bug (related to #7460): `createOrGet()` should be guarded by the lock.
`RlsPicker.pickSubchannel()` does not run in the SynchronizationContext, but it calls `CachingRlsLbClient.get()`, which assumed it was running in the SynchronizationContext. Fixed by removing `synchronizationContext.throwIfNotInThisSynchronizationContext()`. `CachingRlsLbClient.get()` is actually thread-safe in the sense that it's guarded by a lock, and `DataCacheEntry`'s fields are final.
`ChildPolicyWrapper.picker` was not thread-safe. Fixed by making it volatile.
Changed the test a bit since the old test didn't really exercise the behavior well.
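A sketch of the volatile-picker fix (class trimmed down to the relevant field):
```java
import io.grpc.LoadBalancer.SubchannelPicker;

// The picker is written from the LB's synchronization context but read by
// pickSubchannel() on arbitrary RPC threads; volatile guarantees readers see
// the latest published picker.
final class PickerHolderExample {
  private volatile SubchannelPicker picker;

  void updatePicker(SubchannelPicker newPicker) {
    this.picker = newPicker;  // safe publication via volatile write
  }

  SubchannelPicker currentPicker() {
    return picker;  // volatile read
  }
}
```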