If a child load balancer rejects the addresses it is given, all we can do
is trigger a name resolution refresh and hope for a better set of
addresses.
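Roughly, the parent side looks like the following minimal sketch, assuming the acceptResolvedAddresses() API described later in this log; the wrapper class and its wiring are hypothetical, not the actual grpc-java code.
```java
import io.grpc.LoadBalancer;
import io.grpc.LoadBalancer.Helper;
import io.grpc.LoadBalancer.ResolvedAddresses;

// Hypothetical sketch: forward addresses to a child policy and fall back to a
// name resolution refresh when the child rejects them.
final class ChildAddressForwarder {
  private final LoadBalancer childLb;
  private final Helper helper;

  ChildAddressForwarder(LoadBalancer childLb, Helper helper) {
    this.childLb = childLb;
    this.helper = helper;
  }

  void forwardAddresses(ResolvedAddresses resolvedAddresses) {
    if (!childLb.acceptResolvedAddresses(resolvedAddresses)) {
      // The child could not use these addresses; all we can do is ask the resolver again.
      helper.refreshNameResolution();
    }
  }
}
```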
Introduce an AsyncService interface in the generated code and move the methods from <service>ImplBase to default implementations of the interface.
* Update pom files to allow Java 1.8
* Add a bindService(<service>Async) method
* Change TestServiceImpl to use the interface and include a bind method instead of extending TestServiceImplBase (see the sketch after this list).
* Correct the value being passed to the throttler, which had been backwards.
* Fix flaky test.
* Add a test using AdaptiveThrottler with a CachingRlsLBClient.
* Address test flakiness.
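Referring back to the AsyncService items above, here is a hypothetical sketch of the resulting service shape, using the helloworld Greeter service as an example; the message and method names are assumptions, not part of this change.
```java
import io.grpc.ServerServiceDefinition;
import io.grpc.stub.StreamObserver;

// Implement the generated AsyncService interface instead of extending GreeterImplBase.
class GreeterImpl implements GreeterGrpc.AsyncService {
  @Override
  public void sayHello(HelloRequest request, StreamObserver<HelloReply> responseObserver) {
    responseObserver.onNext(
        HelloReply.newBuilder().setMessage("Hello " + request.getName()).build());
    responseObserver.onCompleted();
  }

  // Bind through the new static bindService(<service>Async) overload in the generated class.
  ServerServiceDefinition bindService() {
    return GreeterGrpc.bindService(this);
  }
}
```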
Second attempt at this, now with the understanding that RLS actually can
accept empty address lists.
This seems contrary to the behavior this LB advertises with the canHandleEmptyAddressListFromNameResolution() method. This method is not overridden, so the default response of false is preserved. Empty address lists are supported though, and the parent LB never called the canHandleEmptyAddressListFromNameResolution() method.
Introduces a new acceptResolvedAddresses() method in LoadBalancer that
is to ultimately replace the existing handleResolvedAddresses(). The new
method allows the LoadBalancer implementation to reject the given
addresses by returning false from the method.
The long-deprecated handleResolvedAddressGroups() is also removed.
rls: Update implementation to match spec.
* Clean up the cache if it exceeds the max size when adding an entry (sketched after this list). Make cache entry size calculations more accurate.
* Trigger pending RPC processing if unexpired backoff entries were removed from the cache, by triggering the helper to call its parent's updateBalancingState with the same state and picker.
* Introduce a minimum time before eviction (5 seconds).
* Change the default accept ratio for AdaptiveThrottler from 1.2 to 2.0.
* Configuration validation.
* When checking key names for duplicates, also look at headers.
* Check extra keys for duplicates.
See analysis of implementation versus spec at https://docs.google.com/spreadsheets/d/18w5s1TEebRumWzk1pvWnjiHFGKc6MW-vt8tRLY4eNs0/
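A simplified sketch of the cache-cleanup and minimum-eviction-time rules above; the names and structure are illustrative, not the actual CachingRlsLbClient internals.
```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative bounded cache: evict only when adding an entry exceeds the max size,
// and never evict an entry younger than the minimum eviction time (5 seconds).
final class BoundedCacheSketch<K, V> {
  private static final long MIN_EVICTION_NANOS = 5_000_000_000L;

  // Access-ordered map so iteration starts at the least recently used entry.
  private final LinkedHashMap<K, Entry<V>> cache = new LinkedHashMap<>(16, 0.75f, true);
  private final long maxSizeBytes;
  private long currentSizeBytes;

  BoundedCacheSketch(long maxSizeBytes) {
    this.maxSizeBytes = maxSizeBytes;
  }

  void put(K key, V value, long sizeBytes, long nowNanos) {
    Entry<V> previous = cache.put(key, new Entry<>(value, sizeBytes, nowNanos));
    if (previous != null) {
      currentSizeBytes -= previous.sizeBytes;
    }
    currentSizeBytes += sizeBytes;
    // Clean up only when adding an entry pushes the cache over its max size.
    Iterator<Map.Entry<K, Entry<V>>> it = cache.entrySet().iterator();
    while (currentSizeBytes > maxSizeBytes && it.hasNext()) {
      Entry<V> candidate = it.next().getValue();
      if (nowNanos - candidate.createdNanos < MIN_EVICTION_NANOS) {
        break;  // too young to evict; respect the minimum time before eviction
      }
      it.remove();
      currentSizeBytes -= candidate.sizeBytes;
    }
  }

  private static final class Entry<V> {
    final V value;
    final long sizeBytes;
    final long createdNanos;

    Entry(V value, long sizeBytes, long createdNanos) {
      this.value = value;
      this.sizeBytes = sizeBytes;
      this.createdNanos = createdNanos;
    }
  }
}
```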
rls: Avoid returning status codes that the status spec document says the library will never return when talking to the RLS server. Instead, always return UNAVAILABLE on errors.
* Provide context around error message from RLS server
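A sketch of the status conversion, assuming a helper along these lines; it is not necessarily the exact internal method.
```java
import io.grpc.Status;

final class RlsStatusConversion {
  // Collapse any error from the RLS server into UNAVAILABLE, but keep the original
  // code and description as context so the failure is still debuggable.
  static Status convertRlsServerStatus(Status status, String rlsServerName) {
    return Status.UNAVAILABLE
        .withCause(status.getCause())
        .withDescription(
            "Unable to retrieve RLS targets from RLS server " + rlsServerName
                + ". RLS server returned: " + status.getCode() + ": " + status.getDescription());
  }
}
```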
Introduces a new acceptResolvedAddresses() method to the LoadBalancer.
This will now be the preferred way to handle addresses from the NameResolver. The existing handleResolvedAddresses() will eventually be deprecated.
The new method returns a boolean based on the LoadBalancer's ability to use the provided addresses; if it does not accept them, false is returned. LoadBalancer implementations using the new method should no longer implement canHandleEmptyAddressListFromNameResolution(), which will eventually be removed along with handleResolvedAddresses().
Backward compatibility will be maintained, so existing load balancers using handleResolvedAddresses() will continue to work.
Additionally, the previously deprecated handleResolvedAddressGroups() method is removed.
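A minimal sketch of a policy built on the new method; subchannel and picker management are elided, and this is not a real grpc-java policy.
```java
import io.grpc.ConnectivityState;
import io.grpc.LoadBalancer;
import io.grpc.Status;

class ExampleLoadBalancer extends LoadBalancer {
  private final Helper helper;

  ExampleLoadBalancer(Helper helper) {
    this.helper = helper;
  }

  @Override
  public boolean acceptResolvedAddresses(ResolvedAddresses resolvedAddresses) {
    if (resolvedAddresses.getAddresses().isEmpty()) {
      // Unusable addresses: returning false lets the channel react (e.g. refresh resolution)
      // without this policy needing canHandleEmptyAddressListFromNameResolution().
      return false;
    }
    // ... create or update subchannels and the picker here ...
    return true;
  }

  @Override
  public void handleNameResolutionError(final Status error) {
    helper.updateBalancingState(ConnectivityState.TRANSIENT_FAILURE, new SubchannelPicker() {
      @Override
      public PickResult pickSubchannel(PickSubchannelArgs args) {
        return PickResult.withError(error);
      }
    });
  }

  @Override
  public void shutdown() {}
}
```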
rls: Change AdaptiveThrottler to use Ticker instead of TimeProvider
* Use a slot being null to mark invalid rather than relying on the slot's endNanos value.
Fixes #9048
* rls: Support multiple returned targets from RLS Server
Pick the first target that is not in TRANSIENT_FAILURE state. If none, use the first target.
Also initialize all targets returned from RLS so DataCache will contain a list of child policy wrappers.
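Sketched selection logic for the behavior just described; the generic state lookup stands in for the real child policy wrapper API, which is not shown here.
```java
import io.grpc.ConnectivityState;
import java.util.List;
import java.util.function.Function;

final class TargetSelection {
  // Pick the first target whose child policy is not in TRANSIENT_FAILURE;
  // if every target is failing, fall back to the first one.
  static <T> T selectTarget(List<T> targets, Function<T, ConnectivityState> stateOf) {
    for (T target : targets) {
      if (stateOf.apply(target) != ConnectivityState.TRANSIENT_FAILURE) {
        return target;
      }
    }
    return targets.get(0);
  }
}
```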
Fixes #9236
INVALID_ARGUMENT is propagated to the data plane if no previous config
is available. INVALID_ARGUMENT is reserved for application use; LBs
should pretty much use UNAVAILABLE exclusively.
While most of the changes are in xds, there do not appear to be likely
xds code paths that would propagate a bad status to the data plane.
Internal policies either don't use parseLoadBalancingPolicyConfig() and
instead have their configuration objects constructed directly, or are
constructed transitively through the cluster manager, which uses INTERNAL
if there's a child failure. There was a worrisome hole before this
commit for StatusRuntimeExceptions received by the cluster manager, but
the audit didn't find any locations throwing such an exception.
User-selected policies produce a NACK and are protected from the
existing xds client watcher paths. The worst that appears could happen
is the channel could panic (which uses INTERNAL) if a bug let a bad
configuration through.
This can avoid creating an additional 736 tasks (previously 502 out of
1591 were not created). That's not all that important as the build time
is essentially the same, but this lets us see the poor behavior of the
protobuf plugin in our own project and increase our understanding of how
to avoid task creation when developing the plugin. Of the tasks still
being created, protobuf is the highest contributor with 165 tasks,
followed by maven-publish with 76 and appengine with 53. The remaining
59 are from our own build, but indirectly caused by maven-publish.
This moves our dependencies into a plain file that can be read and
updated by tooling. While the current tooling is not particularly better
than just using gradle-versions-plugin, it should put us on better
footing. gradle-versions-plugin is actually pretty nice, but will be
incompatible with Gradle 8, so we need to wait a bit to see what the
future holds.
Left libraries as an alias for libs to reduce the commit size and make
it easier to revert if we don't end up liking this approach.
We're using Gradle 7.3.3, where it was an incubating feature. In
Gradle 7.4 it became stable.
Not updating the example WORKSPACE because it doesn't have any
Bazel-enabled build that depends on xds and so doesn't need the
additional repository dependencies.
Fixes #9162
The test appears to be slow because of classloading. The failure cases
were very slow at 14-16 seconds, but looking at other logs it succeeds
after 12 seconds. It is the first test in the class, and the other tests
run much faster. This could be solved with warmup code, but increasing
the RPC deadline is easier.
Two back-to-back failures on aarch64:
https://source.cloud.google.com/results/invocations/c4612a28-d594-42e9-b8ab-12c999690b40/targets
https://source.cloud.google.com/results/invocations/3d5d1dc2-6b47-493d-b15c-e99458067d73/targets
```
expected to be true
at app//io.grpc.rls.CachingRlsLbClientTest.rls_withCustomRlsChannelServiceConfig(CachingRlsLbClientTest.java:267)
```
And the next run failed on a different line but seems the same cause:
https://source.cloud.google.com/results/invocations/546b83d1-cd26-4b87-8871-a7a06a60dc06/targets
```
expected to be true
at app//io.grpc.rls.CachingRlsLbClientTest.rls_withCustomRlsChannelServiceConfig(CachingRlsLbClientTest.java:273)
```
Reproduced with:
```diff
diff --git a/rls/src/test/java/io/grpc/rls/CachingRlsLbClientTest.java b/rls/src/test/java/io/grpc/rls/CachingRlsLbClientTest.java
index 9fac852fa..631d632eb 100644
--- a/rls/src/test/java/io/grpc/rls/CachingRlsLbClientTest.java
+++ b/rls/src/test/java/io/grpc/rls/CachingRlsLbClientTest.java
@@ -264,6 +264,11 @@ public class CachingRlsLbClientTest {
// initial request
CachedRouteLookupResponse resp = getInSyncContext(routeLookupRequest);
+ try {
+ Thread.sleep(2000);
+ } catch (Exception e) {
+ throw new RuntimeException(e);
+ }
assertThat(resp.isPending()).isTrue();
// server response
```
Ticker is powered by System.nanoTime() which is CLOCK_MONOTONIC.
TimeProvider is powered by System.currentTimeMillis() which is
CLOCK_REALTIME. For durations, the monotonic clock is appropriate, not
the wall time which can jump around.
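For illustration, here is a duration measured with Guava's Ticker, which reads System.nanoTime(); the comments note what using the wall clock would risk instead.
```java
import com.google.common.base.Ticker;

final class DurationMeasurement {
  // Measure an elapsed duration with the monotonic clock; Ticker.systemTicker() reads
  // System.nanoTime(), so it cannot jump when the wall clock is adjusted.
  static long elapsedNanos(Ticker ticker, Runnable work) {
    long start = ticker.read();
    work.run();
    return ticker.read() - start;
    // By contrast, System.currentTimeMillis() is CLOCK_REALTIME: an NTP step or manual
    // clock change during `work` could make the difference negative or wildly inflated.
  }

  public static void main(String[] args) {
    System.out.println(elapsedNanos(Ticker.systemTicker(), () -> { }));
  }
}
```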
There was an attempt to use different epochs for the wall clock and the
monotonic clock. However, 123456789 is actually less than a second.
We want the gap between clocks to be at least a day. This issue was
discovered in #8968.
This separation found a bug in an RLS test where it was mixing epochs.
However, it was only a problem in the test. The code under test is
wrongly using the wall clock for calculating durations, but that seems to be
a widespread problem and will need to be handled separately.
We shouldn't require addresses to be non-empty for the child LB of rls_lb. That might be the right requirement when the child LB is grpclb, but in our new use case the child LB will be the cds LB, which only works if empty address lists are allowed.
Fixing b/223866089#comment24
Refactor to use `@AutoValue` for data types. This reduces human mistakes on `equals()`, `hashCode()`, and `toString()` while we are constantly adding and changing member fields of the data type.
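For example, a data type along these lines; the field names are illustrative, not the actual RLS types.
```java
import com.google.auto.value.AutoValue;

// AutoValue generates equals(), hashCode(), and toString(), so adding or removing a
// field cannot silently leave one of them out of date.
@AutoValue
abstract class RouteLookupRequestExample {
  abstract String server();
  abstract String path();

  static RouteLookupRequestExample create(String server, String path) {
    return new AutoValue_RouteLookupRequestExample(server, path);
  }
}
```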
Implementing the latest change for RLS lb config.
```
The configuration for the LB policy will be of the following form:
{
"routeLookupConfig": <JSON form of RouteLookupConfig proto>,
"routeLookupChannelServiceConfig": {...service config JSON...},
"childPolicy": [
{"<policy name>": {...child policy config...}}
],
"childPolicyConfigTargetFieldName": "<name of field>"
}
```
> If the routeLookupChannelServiceConfig field is present, we will pass the specified service config to the RLS control plane channel, and we will disable fetching service config via that channel's resolver.
When the client channel is shutting down, the RlsLoadBalancer is shut down. However, the child load balancers of RlsLoadBalancer are not shut down. This is causing issue b/209831670.
The `RlsProtoData.RouteLookupConfig` class is out of date.
- Some of the fields were long, but are now of `Duration` type.
- Some of the fields have been deleted.
- The validation of some of the fields has either been changed or was wrong from the beginning.
Now overhaul all the fields in the `RlsProtoData.RouteLookupConfig` class based on the spec http://go/grpc-rls-lb-policy-design#heading=h.y3h669gfpown.
Also move the validation logic into JSON parsing rather than the constructor of `RouteLookupConfig`.
- Partially revert the change to RlsProtoData.java in #8612 by removing the `public` accessor
- Have grpc-xds no longer strongly depend on grpc-rls. The application will need grpc-rls as a runtime dependency if it needs the route lookup feature in xds.
- Parse the RouteLookupServiceClusterSpecifierPlugin config to the JSON/Map representation of `io.grpc.lookup.v1.RouteLookupClusterSpecifier` instead of `io.grpc.rls.RlsProtoData.RouteLookupConfig`
Add RlsClusterSpecifierPlugin as per the [design doc](http://go/grpc-rls-in-xds#heading=h.dmyrvi6ohebx)
The structure of `ClusterSpecifierPlugin` is very similar to `io.grpc.xds.Filter`.
The following changes to the existing code are made:
- move `ConfigOrError` class out of `Filter` class to be shared with `ClusterSpecifierPlugin`
- make `io.grpc.rls.RlsProtoData` public to be accessible by `io.grpc.xds`
- treat empty defaultTarget in `io.grpc.rls.RlsProtoData.RouteLookupConfig` as null to support both JSON and proto configs without the defaultTarget field specified.
Fix connectivity state aggregation as per http://go/grpc-rls-lb-policy-design#heading=h.6e8tt7xcwcdn
> Note that, for the purposes of aggregation, when a child policy reports TRANSIENT_FAILURE, we consider it to continue to be in that state until it reports READY (i.e., we ignore CONNECTING in between the two, no matter how many times it bounces back and forth between TRANSIENT_FAILURE and CONNECTING).
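A sketch of that aggregation rule for a single child; this is illustrative, not the actual RLS code.
```java
import io.grpc.ConnectivityState;

final class StickyTransientFailureTracker {
  private ConnectivityState effectiveState = ConnectivityState.IDLE;

  // Treat a child as TRANSIENT_FAILURE until it actually reports READY, ignoring any
  // CONNECTING states it bounces through in between.
  ConnectivityState onChildStateChange(ConnectivityState newState) {
    if (effectiveState == ConnectivityState.TRANSIENT_FAILURE
        && newState != ConnectivityState.READY) {
      return effectiveState;  // stay in TRANSIENT_FAILURE
    }
    effectiveState = newState;
    return effectiveState;
  }
}
```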
This can be used by annotation processors to avoid processing the
gRPC-generated code. The normal Generated annotation only has SOURCE
retention, so it isn't available to annotation processors.
I don't include the service name within the annotation as that assumes
we'll never have need for any other type of generated class. If there's
a request for exposing service name via an annotation in the future, we
can make an RpcService annotation or the like.
Fixes #8158
failOnVersionConflict has never been good for us. It is equivalent to
Maven dependencyConvergence, which we discourage our users from using because
it is too temperamental and _creates_ version skew issues over time.
However, we had no real alternative for determining if our deps would be
misinterpreted by Maven.
failOnVersionConflict has been a constant drain and makes it really hard
to do seemingly-trivial upgrades. As evidenced by protobuf/build.gradle
in this change, it also caused _us_ to introduce a version downgrade.
This introduces our own custom requireUpperBoundDeps implementation so
that we can get back to simple dependency upgrades _and_ increase our
confidence in a consistent dependency tree.
Currently each subchannel implicitly refreshes the name resolution when its state changes to IDLE or TRANSIENT_FAILURE. That is, this feature is built into the subchannel's internal implementation. Although it eliminates the burden of having LB implementations refresh the resolver when connections to backends are broken, it gives LB policies no chance to disable or override this refresh (e.g., in some complex load balancing hierarchy like xDS, LB policies may embed a resolver inside for resolving backends, so the refresh operation should be hooked to the resolver embedded in the LB policy instead of the one in the Channel).
In order to make this transition smooth, we add a check to SubchannelImpl that checks whether the LoadBalancer has explicitly called Helper.refreshNameResolution for broken subchannels created by it. If not, it logs a warning and does the refresh.
A temporary LoadBalancer.Helper API, ignoreRefreshNameResolution(), is added to avoid false-positive warnings for xDS, which intentionally does not want a refresh. Once the migration is done, this should be deleted.
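For a policy managing its own subchannels, the explicit refresh looks roughly like this sketch; it is not the exact grpc-java code.
```java
import io.grpc.ConnectivityState;
import io.grpc.ConnectivityStateInfo;
import io.grpc.LoadBalancer.Helper;
import io.grpc.LoadBalancer.Subchannel;

final class SubchannelStateListenerSketch {
  private final Helper helper;

  SubchannelStateListenerSketch(Helper helper) {
    this.helper = helper;
  }

  void onSubchannelState(Subchannel subchannel, ConnectivityStateInfo stateInfo) {
    ConnectivityState state = stateInfo.getState();
    if (state == ConnectivityState.IDLE || state == ConnectivityState.TRANSIENT_FAILURE) {
      // The LB policy now refreshes explicitly instead of relying on the subchannel's
      // implicit refresh (which would otherwise log a warning and refresh for us).
      helper.refreshNameResolution();
    }
    // ... update the picker / balancing state as usual ...
  }
}
```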
The serviceName field in the oobChannel grpclb config should not be null; otherwise it will default to lbHelper.getAuthority(), which previously defaulted to the lookup service before #7852, but has been overridden to the backend service for authentication in #7852.
The previous fix #7878 didn't work because the server field is expected to be the full hostname (without the port number). We need to strip the port part from the authority.