This should preserve all the existing behavior of GlobalInterceptors as
used by grpc-gcp-observability, including its disabling of the implicit
OpenCensus integration.
Both the old and new API are internal. I hid Configurator and
ConfiguratorRegistry behind Internal-prefixed classes, as had been
done with GlobalInterceptors, to further discourage use until the API is
ready.
GlobalInterceptorsTest was modified to become ConfiguratorRegistryTest.
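For context, the sketch below illustrates the general shape of a set-once registry like the one being hidden here; the class and method names are illustrative assumptions, not the actual io.grpc internals.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a configurator registry; names are illustrative,
// not the real io.grpc internals.
final class ExampleConfiguratorRegistry {
  private static final ExampleConfiguratorRegistry DEFAULT = new ExampleConfiguratorRegistry();

  private final Object lock = new Object();
  private List<ExampleConfigurator> configurators = Collections.emptyList();
  private boolean wasSet;

  static ExampleConfiguratorRegistry getDefaultRegistry() {
    return DEFAULT;
  }

  // Like GlobalInterceptors, allow the global configuration to be set only once.
  void setConfigurators(List<ExampleConfigurator> configurators) {
    synchronized (lock) {
      if (wasSet) {
        throw new IllegalStateException("Configurators are already set");
      }
      this.configurators = Collections.unmodifiableList(new ArrayList<>(configurators));
      wasSet = true;
    }
  }

  List<ExampleConfigurator> getConfigurators() {
    synchronized (lock) {
      return configurators;
    }
  }

  // A configurator hooks into channel construction; real hooks would cover
  // interceptors, stream tracers, etc.
  interface ExampleConfigurator {
    void configureChannelBuilder(io.grpc.ManagedChannelBuilder<?> builder);
  }
}
```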
As part of gRFC A78:
> To support the locality label in the WRR metrics, we will extend the
> `weighted_target` LB policy (see A28) to define a resolver attribute
> that indicates the name of its child. This attribute will be passed
> down to each of its children with the appropriate value, so that any
> LB policy that sits underneath the `weighted_target` policy will be
> able to use it.
xds_cluster_impl is involved because it uses the child names in the
AddressFilter, which must match the names used by weighted_target.
Instead of using Locality.toString() in multiple policies and assuming
the policies agree, we now have xds_cluster_impl decide the locality's
name and pass it down explicitly. This allows us to change the name
format to match gRFC A78:
> If locality information is available, the value of this label will be
> of the form `{region="${REGION}", zone="${ZONE}",
> sub_zone="${SUB_ZONE}"}`, where `${REGION}`, `${ZONE}`, and
> `${SUB_ZONE}` are replaced with the actual values. If no locality
> information is available, the label will be set to the empty string.
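A minimal sketch of producing a label in the quoted format from a locality's region/zone/sub-zone values; the helper below is hypothetical and only shows the string shape, not grpc-java's actual implementation.

```java
// Hypothetical helper showing the label format from gRFC A78; not the
// actual grpc-java code.
final class LocalityLabel {
  static String format(String region, String zone, String subZone) {
    // If no locality information is available, the label is the empty string.
    if (region == null && zone == null && subZone == null) {
      return "";
    }
    return String.format(
        "{region=\"%s\", zone=\"%s\", sub_zone=\"%s\"}",
        nullToEmpty(region), nullToEmpty(zone), nullToEmpty(subZone));
  }

  private static String nullToEmpty(String s) {
    return s == null ? "" : s;
  }
}
```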
This adds the following components required for the gRFC A79
non-per-call metrics architecture:
- MetricSink implementation for gRPC OpenTelemetry
- Configurator for plumbing the per-call metrics ClientInterceptor and
ServerStreamTracer.Factory via the unified OpenTelemetryModule.
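As a rough sketch of what that plumbing provides, the hypothetical module below exposes the two per-call instrumentation pieces a configurator would install on channels and servers; the class and accessor names are illustrative, not the actual OpenTelemetryModule API.

```java
import io.grpc.ClientInterceptor;
import io.grpc.ServerStreamTracer;

// Illustrative module holding both per-call instrumentation pieces; a
// configurator would install these on each channel and server it configures.
final class ExampleOtelModule {
  private final ClientInterceptor clientInterceptor;
  private final ServerStreamTracer.Factory serverStreamTracerFactory;

  ExampleOtelModule(
      ClientInterceptor clientInterceptor,
      ServerStreamTracer.Factory serverStreamTracerFactory) {
    this.clientInterceptor = clientInterceptor;
    this.serverStreamTracerFactory = serverStreamTracerFactory;
  }

  ClientInterceptor getClientInterceptor() {
    return clientInterceptor;
  }

  ServerStreamTracer.Factory getServerStreamTracerFactory() {
    return serverStreamTracerFactory;
  }
}
```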
Integrates the new features of the Kokoro PSM Interop install library introduced in grpc/psm-interop#73.
Nearly all common functionality was moved from per-language/per-branch PSM Interop build scripts to [psm_interop_kokoro_lib.sh](https://github.com/grpc/psm-interop/blob/main/.kokoro/psm_interop_kokoro_lib.sh):
1. The list of tests in each test suite
2. Per-test-suite flag customization
3. `run_test` methods
4. `build_docker_images_if_needed` methods
5. Generic `build_test_app_docker_images` methods (simple docker build + docker push + docker tag). grpc-java is one exception, as it doesn't run docker directly, but uses a Cloud Build flow.
Now all PSM Interop jobs across all test suites share the same buildscripts:
1. buildscript that invokes the test: `psm-interop-test-{language}.sh` (configured as `build_file` in the build cfg)
2. buildscript that builds the xDS test client/server and publishes them as a Docker image: `psm-interop-build-{language}.sh` (conventional name called from `psm_interop_kokoro_lib.sh`)
`psm-interop-test-{language}.sh`:
1. Sets `GRPC_LANGUAGE`, `BUILD_SCRIPT_DIR` environment variables.
2. Downloads the shared `psm_interop_kokoro_lib.sh` from the main branch of the psm-interop repo.
3. Sources `psm-interop-build-{language}.sh`
4. Calls `psm::run "${PSM_TEST_SUITE}"` (`PSM_TEST_SUITE` configured in the cfg file).
`psm-interop-build-{language}.sh`:
1. Defines `psm::lang::build_docker_images` which is called from `psm_interop_kokoro_lib.sh`.
2. Invokes any repo-specific logic.
3. May use `psm::build::docker_images_generic` for a generic Docker build, tag, and push, or implement its own build/publish method.
References:
- b/288578634
- See the full list of the new features at grpc/psm-interop#73.
- Additional fixes to the shared lib: grpc/psm-interop#78, grpc/psm-interop#79
gRFC A78 has WRR and pick-first include a `grpc.target` label, defined
in A66:
> `grpc.target` : Canonicalized target URI used when creating gRPC
> Channel, e.g. "dns:///pubsub.googleapis.com:443",
> "xds:///helloworld-gke:8000". Canonicalized target URI is the form
> with the scheme included if the user didn't mention the scheme
> (`scheme://[authority]/path`). For channels such as inprocess channels
> where a target URI is not available, implementations can synthesize a
> target URI.
Since 06df25b65d, WRR has been calling this method, and it will get an
exception. We don't want WRR to be broken until we have MetricRecorder
fully plumbed.
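As a rough illustration of the canonicalization described in the quoted A66 text, the sketch below prepends a default scheme when the user-supplied target has none; the helper and the scheme list are assumptions for illustration, not grpc-java's actual logic.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of target canonicalization per the quoted A66 text;
// the scheme list and helper are illustrative, not grpc-java's implementation.
final class TargetCanonicalizer {
  private static final List<String> KNOWN_SCHEMES = Arrays.asList("dns", "xds", "unix");

  static String canonicalize(String target) {
    for (String scheme : KNOWN_SCHEMES) {
      if (target.startsWith(scheme + ":")) {
        return target;  // Already in scheme://[authority]/path form.
      }
    }
    // No scheme given: synthesize the default, e.g.
    // "pubsub.googleapis.com:443" -> "dns:///pubsub.googleapis.com:443".
    return "dns:///" + target;
  }
}
```

For example, `canonicalize("pubsub.googleapis.com:443")` would return `"dns:///pubsub.googleapis.com:443"`, while an already-schemed target such as `"xds:///helloworld-gke:8000"` passes through unchanged.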
As part of gRFC A78:
> To support the locality label in the per-call metrics, we will provide
> a mechanism for LB picker to add optional labels to the call attempt
> tracer.
* added MetricRecorderImpl and unit tests for MetricInstrumentRegistry
* updated MetricInstrumentRegistry to use an array instead of an ArrayList
* renamed the record<>Counter APIs to add<>Counter and added a check for mismatched label values
* added a lock for the instruments array
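A minimal sketch of the kind of mismatched-label-values check mentioned above, assuming a hypothetical add-counter signature that takes required and optional label values; the names and signatures are illustrative, not the actual MetricRecorder API.

```java
import java.util.List;

// Hypothetical add<>Counter-style method showing a mismatched-label-values
// check; not the actual grpc-java MetricRecorder API.
final class ExampleMetricRecorder {
  void addLongCounter(
      ExampleCounterInstrument instrument,
      long value,
      List<String> requiredLabelValues,
      List<String> optionalLabelValues) {
    if (requiredLabelValues.size() != instrument.requiredLabelKeys().size()
        || optionalLabelValues.size() != instrument.optionalLabelKeys().size()) {
      throw new IllegalArgumentException(
          "Label values must match the label keys the instrument was registered with");
    }
    // ... record the value against the configured sinks ...
  }

  interface ExampleCounterInstrument {
    List<String> requiredLabelKeys();
    List<String> optionalLabelKeys();
  }
}
```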
`getMinEvictionTime()` was fixed to make sure only deltas were used for
comparisons (`a < b` is broken; `a - b < 0` is okay). It had also
returned `0` by default, which was meaningless as there is no epoch for
`System.nanoTime()`. LinkedHashLruCache now passes the current time into
a few more functions since the implementations need it and it was
sometimes already available. This made it easier to make some classes
static.
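To see why `a < b` is broken for `System.nanoTime()` values while `a - b < 0` is safe: nanoTime can return any long, including values that wrap past Long.MAX_VALUE, so only deltas are meaningful. A small self-contained example:

```java
// Comparing System.nanoTime() values directly breaks once the values wrap
// around Long.MAX_VALUE; comparing the delta stays correct because the
// subtraction also wraps.
public final class NanoTimeComparison {
  public static void main(String[] args) {
    long earlier = Long.MAX_VALUE - 5;   // a timestamp just before wraparound
    long later = earlier + 10;           // 10ns later; overflows to a negative long

    System.out.println(later < earlier);      // true: direct comparison gives the wrong answer
    System.out.println(later - earlier > 0);  // true: the delta is still a positive 10ns
  }
}
```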
Instead of having docs in RefCountedChildPolicyWrapperFactory saying
that every method was guarded by a lock, I added `@GuardedBy("lock")`
within CachingRlsLbClient, so now it is clearly not thread-safe and the
lock protects access. The AtomicLong was replaced with a long since
1) there was no multi-threading and 2) the logic was not atomic-safe,
which was misleading.
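A minimal sketch of the `@GuardedBy` pattern described above, using ErrorProne's annotation so the analyzer can check that the field is only touched with the lock held; the class and field names are illustrative.

```java
import com.google.errorprone.annotations.concurrent.GuardedBy;

// Illustrative only: a field annotated with @GuardedBy("lock") lets
// ErrorProne flag any access made without holding that lock.
final class ExampleClient {
  private final Object lock = new Object();

  @GuardedBy("lock")
  private long pendingRequests;  // plain long; all access goes through the lock

  void onRequestStarted() {
    synchronized (lock) {
      pendingRequests++;
    }
  }

  long getPendingRequests() {
    synchronized (lock) {
      return pendingRequests;
    }
  }
}
```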
In OpenCensus, recording an attempt was delayed in order to wait for
inboundUncompressedSize(). But we don't need that in OpenTelemetry, and
this code could have been removed when copying from OpenCensus.
Today, deframer errors cancel the stream without communicating a status code
to the peer. This change causes deframer errors to trigger a best-effort
attempt to send trailers with a status code so that the peer understands
why the stream is being closed.
Fixes #3996
`sendGrpcFrame` owns the buffer in `SendGrpcFrameCommand`. If the frame is not handed off to Netty, it needs to be released in the method.
https://github.com/grpc/grpc-java/issues/11115
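A sketch of the ownership rule being described: if a reference-counted buffer is not passed on to Netty, the method that owns it must release it. The command and write method below are hypothetical; only `ReferenceCountUtil.release` and `Channel.write` are real Netty APIs.

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;
import io.netty.util.ReferenceCountUtil;

// Illustrative ownership pattern: release the buffer on every path that does
// not hand it off to Netty's write pipeline.
final class FrameWriter {
  private final Channel channel;

  FrameWriter(Channel channel) {
    this.channel = channel;
  }

  void sendFrame(ByteBuf frame, boolean streamClosed) {
    if (streamClosed) {
      // Not handing the frame off to Netty, so we must release it here.
      ReferenceCountUtil.release(frame);
      return;
    }
    // Ownership transfers to Netty; the pipeline releases it after writing.
    channel.write(frame);
  }
}
```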
It is easy to manage these things outside of MultiChildLb, and doing so
makes the shared code simpler and uses less memory. In particular, we don't want
to use many instances of GracefulSwitchLb in virtually every policy
simply because it was needed in one or two cases.
Adds interfaces required for recording metrics from gRPC components. Also adds an API to get a `MetricRecorder` from `LoadBalancer.Helper` and to add a `MetricSink` to `ManagedChannelBuilder`.
Handles Netty frame write failures caused by issues in Netty
itself.
Normally we don't need to do anything on frame write failures because
the cause of a failed future would be an IO error that resulted in
the stream closure. Prior to this PR we treated these issues as a
no-op, except for the initial headers write on the client side.
However, a case like netty/netty#13805 (a bug in generating the next
stream id) resulted in an unclosed stream on our side. This PR adds
frame write future failure handlers that ensure the stream is
cancelled and the cause is propagated via Status.
Fixes #10849
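A minimal sketch of the handler pattern described above: attach a listener to the write future and, on failure, cancel the stream with a Status carrying the cause. The stream type and cancel method are hypothetical; `ChannelFuture.addListener` and `Status` are real APIs.

```java
import io.grpc.Status;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;

// Illustrative write-failure handler; the stream abstraction is hypothetical.
final class WriteFailureHandler {
  interface ExampleStream {
    void cancel(Status status);
  }

  static void handleWriteFuture(ChannelFuture writeFuture, ExampleStream stream) {
    writeFuture.addListener((ChannelFutureListener) future -> {
      if (!future.isSuccess()) {
        // Propagate the failure cause so the stream does not stay open silently.
        stream.cancel(Status.INTERNAL
            .withDescription("Failed to write frame")
            .withCause(future.cause()));
      }
    });
  }
}
```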
The text between the GRPC_DEPS_{START,END} markers must be identical in
formatting. Probably not a problem in general and not necessarily bad.
But it is simplistic.
Eric waking up this morning:
> We need more sed.
This started with combining handleNewRequest with asyncRlsCall, but that
emphasized pre-existing synchronization issues and trying to fix those
exposed others. It was hard to split this into smaller commits because
they were interconnected.
handleNewRequest was combined with asyncRlsCall to use a single code
flow for handling the completed future while also failing the pick
immediately for throttled requests. That flow was then reused for
refreshing after backoff and when data is stale. It no longer optimizes the RPC
completing immediately because that would not happen in real life; it
only happens in tests because of inprocess+directExecutor() and we don't
want to test a different code flow in tests. This did require updating
some of the tests.
One small behavior change to share the combined asyncRlsCall with
backoff is we now always invalidate an entry after the backoff.
Previously the code could replace the entry with its new value in one
operation if the asyncRlsCall future completed immediately. That only
mattered to a single test which now sees an EXPLICIT eviction.
SynchronizationContext used to provide atomic scheduling in
BackoffCacheEntry, but it was not guaranteeing the scheduledRunnable was
only accessed from the sync context. The same was true for calling up
the LB tree with `updateBalancingState()`. In particular, adding entries
to the cache during a pick could evict entries without running the
cleanup methods within the context, as well as the RLS channel
transitioning from TRANSIENT_FAILURE to READY. This was replaced with
using a bare Future with a lock to provide atomicity.
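A rough sketch of the "bare Future with a lock" pattern described here: the pending timer is stored in a field guarded by a lock so that scheduling and cancellation are atomic without relying on always being on the sync context. All names below are illustrative.

```java
import com.google.errorprone.annotations.concurrent.GuardedBy;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only: a Future guarded by a lock gives atomic
// schedule/cancel semantics without requiring a synchronization context.
final class BackoffTimer {
  private final Object lock = new Object();
  private final ScheduledExecutorService scheduler;

  @GuardedBy("lock")
  private Future<?> pendingBackoff;

  BackoffTimer(ScheduledExecutorService scheduler) {
    this.scheduler = scheduler;
  }

  void scheduleBackoff(Runnable onBackoffDone, long delayNanos) {
    synchronized (lock) {
      if (pendingBackoff != null) {
        pendingBackoff.cancel(false);
      }
      pendingBackoff = scheduler.schedule(onBackoffDone, delayNanos, TimeUnit.NANOSECONDS);
    }
  }

  void cancel() {
    synchronized (lock) {
      if (pendingBackoff != null) {
        pendingBackoff.cancel(false);
        pendingBackoff = null;
      }
    }
  }
}
```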
BackoffCacheEntry no longer uses the current time and instead waits for
the backoff timer to actually run before considering itself expired.
Previously, it could race with periodic cleanup and get evicted before
the timer ran, which would cancel the timer and forget the
backoffPolicy. Since the backoff timer invalidates the entry, it is
likely useless to claim it ever expires, but that level of behavior was
preserved since I didn't look into the LRU cache deeply.
propagateRlsError() was moved out of asyncRlsCall because it was not
guaranteed to run after the cache was updated. If something was already
running on the sync context, then RPCs would hang until another update
caused updateBalancingState().
Some methods were moved out of the CacheEntry classes to avoid
shared-state mutation in constructors. But if we add something in a
factory method, we want to remove it in a sibling method to the factory
method, so additional code is moved for symmetry. Moving shared-state
mutation out of constructors is important because 1) it is surprising
and 2) ErrorProne doesn't validate locking within constructors. In
general, having shared-state methods in CacheEntries also has the
problem that ErrorProne can't validate CachingRlsLbClient calls to
CacheEntry. ErrorProne can't know that "lock" is already held because
CacheEntry could have been created from a _different instance_ of
CachingRlsLbClient and there's no way for us to let ErrorProne prove it
is the same instance of "lock".
DataCacheEntry still mutates global state that requires a lock in its
constructor, but it is less severe of a problem and it requires more
choices to address.
According to the docs, I can use Bazel to build the examples, but the
retry example is not declared in the Bazel config. So if you try to
build this example with Bazel, you get an error:
```
examples git:(master) ✗ bazel build :retrying-hello-world-server :retrying-hello-world-client
ERROR: Skipping ':retrying-hello-world-client': no such target '//:retrying-hello-world-client': target 'retrying-hello-world-client' not declared in package '' defined by /Users/rostik404/projects/grpc-java/examples/BUILD.bazel
ERROR: no such target '//:retrying-hello-world-client': target 'retrying-hello-world-client' not declared in package '' defined by /Users/rostik404/projects/grpc-java/examples/BUILD.bazel
INFO: Elapsed time: 0.331s
INFO: 0 processes.
ERROR: Build did NOT complete successfully
```
* Add option in OkHttpServerBuilder
* Add the value as the MAX_CONCURRENT_STREAMS setting in the SETTINGS frame sent by the server to the client per connection
* Enforce the limit by sending an RST_STREAM frame with the REFUSED_STREAM error
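Assuming the new option follows the existing server-builder convention (e.g. `maxConcurrentCallsPerConnection`, as on NettyServerBuilder), usage might look like the sketch below; the exact method name on OkHttpServerBuilder is an assumption.

```java
import io.grpc.InsecureServerCredentials;
import io.grpc.Server;
import io.grpc.okhttp.OkHttpServerBuilder;

public final class MaxStreamsServer {
  public static void main(String[] args) throws Exception {
    // Hypothetical usage: the builder option name mirrors other transports'
    // maxConcurrentCallsPerConnection; verify against the actual API.
    Server server = OkHttpServerBuilder.forPort(50051, InsecureServerCredentials.create())
        .maxConcurrentCallsPerConnection(100)  // advertised as MAX_CONCURRENT_STREAMS
        .build()
        .start();
    server.awaitTermination();
  }
}
```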
The recommended way to load dependencies from `rules_jvm_external`
is to make use of the `@maven` workspace, and the most readable
way of doing that is to use the `artifact` macro it provides.
This removes the need to generate the "compat" namespaces, which
`rules_jvm_external` provided for backwards compatibility with
older releases. This change also sets things up for supporting
`bzlmod`: this requires all workspaces accessed by a library to
be named "up front" in the `MODULE.bazel` file. This way, the
only repo that needs to be exported is `@maven`, rather than the
current huge list.