* Eliminate NPE after recovering from a temporary name resolution failure.
* Add test case for 2 failing subchannels to make sure it causes channel to go into TF.
The test was added in e4e7f3a06 when InProcess stopped returning a
Runnable from start(). In c5a63a1 we realized (indirectly) that there's
no point in using the Runnable any more.
This test failed with Binder (which seems to have been using the
Runnable unnecessarily), and InProcess, Netty, and OkHttp don't use the
Runnable. Instead of fixing it, we'll just move toward stopping using
Runnable.
I'm not removing the Runnable usage from Binder in this commit because
this test is currently causing CI failures and I don't want to do a
behavior change when fixing it.
This hasn't been needed since f8f569e07, when InternalSubchannel stopped
calling start() with a lock held. Note that also means no transport
needs to return a Runnable (but some still are).
I had noticed in e4e7f3a06 that it was safe for InProcess to call the
listener directly within start(), but I didn't notice this Javadoc that
said it wasn't allowed.
Returning the runnable did nothing, as both the start method and the
runnable are run within the synchronization context. I believe the
Runnable used to be required in the previous implementation of
ManagedChannelImpl (the lock-based implementation before we created
SynchronizationContext).
This fixes a NPE seen in ServerImpl because the server expects proper
ordering of transport lifecycle events.
```
Uncaught exception in the SynchronizationContext. Panic!
java.lang.NullPointerException: Cannot invoke "java.util.concurrent.Future.cancel(boolean)" because "this.handshakeTimeoutFuture" is null
at io.grpc.internal.ServerImpl$ServerTransportListenerImpl.transportReady(ServerImpl.java:440)
at io.grpc.inprocess.InProcessTransport$4.run(InProcessTransport.java:215)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:94)
```
b/338445186
fea577c80 disabled an optimization that some tests notice, as it can
change execution order. This restores the old behavior, at slight
expense to seeing relationship between in-use tracking and idle mode.
This will be used by the metadata exchange of CSM. When recording
per-attempt metrics, we really need per-attempt data and can't leverage
ClientInterceptors.
8844cf7b8 triggered a regression where a new RPC wouldn't cause the
channel to exit idle mode, if an RPC was still progressing on an old
transport. This was already possible previously, but was racy.
8844cf7b8 made it less racy and more obvious.
The two added `exitIdleMode()` calls in this commit are companions to
those in `enterIdleMode()`, which detect whether the channel should
immediately exit idle mode.
Noticed in cl/635819804.
DelayedClientTransport already had to handle all the cases, so
ManagedChannelImpl picking was acting only as an optimization.
Optimizing DelayedClientTransport to avoid the lock when not queuing
makes ManagedChannelImpl picking entirely redundant, and allows us to
remove the duplicate race-handling logic.
This avoids double-picking when queuing, where ManagedChannelImpl does a
pick, decides to queue, and then DelayedClientTransport re-performs the
pick because it doesn't know which pick version was used. This was
noticed with RLS, which mutates state within the picker.
Some APIs were marked experimental but had internal APIs in their
surface. These were all changed to internal. And then the internal APIs
were mostly hidden from generated documentation.
All these APIs will eventually become public and maybe even stable. But
they need some iteration before we're ready for others to start using
them.
* Change HappyEyeballs flag default value to false since some G3 users are seeing problems.
Put the flag logic in a common place for PickFirstLeafLoadBalancer & WRR's test.
* Set expected requestConnection count based on whether happy eyeballs is enabled or not
* Disable new PickFirstLB
* Fix test expectations to handle both new and old PF LB paths.
This is needed by gRFC A78 for xds metrics, and for RLS metrics. Since
gauges need to acquire a lock (or other synchronization) in the
callback, the callback allows batching multiple gauges together to avoid
acquiring-and-requiring such locks.
Unlike other metrics, gauges are reported on-demand to the MetricSink.
This means not all sinks will receive the same data, as the sinks will
ask for the gauges at different times.
This will be used for gRFC A66's OTel per-RPC metric label:
> `grpc.target` : Canonicalized target URI used when creating gRPC
> Channel, e.g. "dns:///pubsub.googleapis.com:443",
> "xds:///helloworld-gke:8000". Canonicalized target URI is the form
> with the scheme included if the user didn't mention the scheme
> (`scheme://[authority]/path`).
The majority of the changes are to move target computation from
ManagedChannelImpl into the builder. A small hack API was added to
ManagedChannelBuilder to get the target to create an interceptor.
This should preserve all the existing behavior of GlobalInterceptors as
used by grpc-gcp-observability, including it disabling the implicit
OpenCensus integration.
Both the old and new API are internal. I hid Configurator and
ConfiguratorRegistry behind Internal-prefixed classes, like had been
done with GlobalInterceptors to further discourage use until the API is
ready.
GlobalInterceptorsTest was modified to become ConfiguratorRegistryTest.
This adds the following components that are required for gRPC A79
non-per-call metrics architecture.
- MetricSink implementation for gRPC OpenTelemetry
- Configurator for plumbing per call metrics ClientInterceptor and
ServerStreamTracer.Factory via unified OpenTelemetryModule.
gRFC A78 has WRR and pick-first include a `grpc.target` label, defined
in A66:
> `grpc.target` : Canonicalized target URI used when creating gRPC
> Channel, e.g. "dns:///pubsub.googleapis.com:443",
> "xds:///helloworld-gke:8000". Canonicalized target URI is the form
> with the scheme included if the user didn't mention the scheme
> (`scheme://[authority]/path`). For channels such as inprocess channels
> where a target URI is not available, implementations can synthesize a
> target URI.
As part of gRFC A78:
> To support the locality label in the per-call metrics, we will provide
> a mechanism for LB picker to add optional labels to the call attempt
> tracer.
* added MetricRecorderImpl and unit tests for MetricInstrumentRegistry
* updated MetricInstrumentRegistry to use array instead of ArrayList
* renamed record<>Counter APIs to add<>Counter. Added check for mismatched label values
* added lock for instruments array
Handles Netty write frame failures caused by issues in the Netty
itself.
Normally we don't need to do anything on frame write failures because
the cause of a failed future would be an IO error that resulted in
the stream closure. Prior to this PR we treated these issues as a
noop, except the initial headers write on the client side.
However, a case like netty/netty#13805 (a bug in generating next
stream id) resulted in an unclosed stream on our side. This PR adds
write frame future failure handlers that ensures the stream is
cancelled, and the cause is propagated via Status.
Fixes#10849
The recommended way to load dependencies from `rules_jvm_external`
is to make use of the `@maven` workspace, and the most readable
way of doing that is to use the `artifact` macro provides.
This removes the need to generate the "compat" namespaces, which
`rules_jvm_external` provided for backwards compatibility with
older releases. This change also sets things up for supporting
`bzlmod`: this requires all workspaces accessed by a library to
be named "up front" in the `MODULE.bazel` file. This way, the
only repo that needs to be exported is `@maven`, rather than the
current huge list.
The name resolver takes some time before it returns addresses. While waiting the channel will be IDLE instead of the proper CONNECTING. This generally doesn't matter since RPCs behave similarly for IDLE and CONNECTING, but is confusing for users when watching channel.getState() closely.
Fixes#10517.
We provided extra details when the RPC is killed by CallOptions'
Deadline, but didn't do the same for Context.
To avoid duplicating code, things were restructured, including the
threading. There are more code flows now, but I think the
multi-threading came out more obvious and less error-prone. I didn't
change the status when the deadline is already expired, because the
text is shared with DelayedClientCall and AbstractInteropTest doesn't
distinguish between the two cases.
This is a roll-forward that avoids a NPE when cancel() is called
without an earlier call to start().
As seen at b/300991330
* Allow the queued byte threshold for a Stream to be ready to be configurable
- on clients this is exposed by setting a CallOption
- on servers this is configured by calling a method on ServerCall or ServerStreamListener
We provided extra details when the RPC is killed by CallOptions'
Deadline, but didn't do the same for Context.
To avoid duplicating code, things were restructured, including the
threading. There are more code flows now, but I think the
multi-threading came out more obvious and less error-prone. I didn't
change the status when the deadline is already expired, because the
text is shared with DelayedClientCall and AbstractInteropTest doesn't
distinguish between the two cases.
As seen at b/300991330
SelfSignedCertificate is not available on Java 17 because
OpenJdkSelfSignedCertGenerator is not available. This only impacted
tests.
AccessController is being removed, and these locations are doing simple
reflection which is unlikely to require it even when a security policy
is in effect. There's other places we do reflection without the
AccessController, so either no security policies care or the users can
update their policies to allow it.
We can just compare the Deadline instances instead of asserting that
very little time has passed during the test. Real time probably still
matters in the test, but only insofar that the deadline is not expired
by the time ClientCallImpl sees it.
This fixes a test failure seen in the emulated aarch64 CI. Note that the
message says "ns" when it should say "ms", but this change deletes the
code with that typo.
```
java.lang.AssertionError: timeout: 548 ns
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at io.grpc.internal.ClientCallImplTest.assertTimeoutBetween(ClientCallImplTest.java:1102)
at io.grpc.internal.ClientCallImplTest.contextDeadlineShouldBePropagatedToStream(ClientCallImplTest.java:828)
```
Add the 'fake' dependency to grpc-netty instead of grpc-core.
grpc-okhttp already depends on grpc-util and probably would be fine
without round_robin on Android.
There's not actually a circular dependency, but some tools can't handle
the compile vs runtime distinction. Such tools are broken, but fixes
have been slow and this approach works with no real downfalls.
Works around #10576#10701
* Provide a default implementation for new method added to ManagedTransport.Listener to support ClientTransportFilters
* Relax test constraint to reduce flakiness due to timing.
* Add test for listener.filterTransport.
This fixes two bugs in outbound message size checking:
* When thet checke failed, the thrown StatusRuntimeException with a status code
of RESOURCE_EXHAUSTED was been caught and rewrapped in another
StatusRuntimeException but this time with status code UNKNOWN.
* This applies the max outbound message size check to messages prior to, and
after compression, since compression of a message that is smaller than the
maximum send size can result in a larger message that exceeds the maximum
send size.
Originally you had to confirm that awaitTermination() returned true, but
that was annoying and useless, especially after calling shutdownNow().
The behavior was changed in ce2ae1fb because the awaitTermination()
detection logic could prevent the channel from getting garbage
collected.
Fixes#10732
This change has health checking consumer (new pick first) to install a listener through and health checking producer (outlier detection and client health checking) producing health checks. Health notification chain is built reusing the previous connectivity state chain.
Pickfirst installs the health listener, and is capable of detecting when no health checking producer is installed in the system. In that case, it sets health status to be READY so that health system is no-op.
This may help some to move closer to Providers. It especially helps
cases where `NameResolverFactory`s aren't returning `InetSocketAddress`,
as it allows them to override `getProducedSocketAddressTypes()`, which
will now fail starting in 15fc70be.
Adds a new module grpc-opentelemetry that integrates OpenTelemetry and focuses on metrics.
OpenTelemetry APIs are used for instrumenting metrics collection. Users are expected to provide SDK with implementations.
If no SDK is passed, by default gRPC uses OpenTelemetry.noop().
* Update picker logic per A61 that it no longer pays attention to the first 2 elements, but rather takes the first ring element not in TF and uses that.
---------
Pulled in by rebase:
Eric Anderson (android: Remove unneeded proguard rule 44723b6)
Terry Wilson (stub: Deprecate StreamObservers b5434e8)
This is currently the only place where we return Status.UNKNOWN with no description, which makes is harder to debug and differentiate from statuses originated from non-grpc sources.
This PR enriches ServerImpl's #internalClose `Status.UNKNOWN` with description `Application error processing RPC`.
* core, netty, okhttp: implement new logic for nameResolverFactory API in channelBuilder
fix ManagedChannelImpl to use NameResolverRegistry instead of NameResolverFactory
fix the ManagedChannelImplBuilder and remove nameResolverFactory
* Integrate target parsing and NameResolverProvider searching
Actually creating the name resolver is now delayed to the end of
ManagedChannelImpl.getNameResolver; we don't want to call into the name
resolver to determine if we should use the name resolver.
Added getDefaultScheme() to NameResolverRegistry to avoid needing
NameResolver.Factory.
---------
Co-authored-by: Eric Anderson <ejona@google.com>
Instead of a boolean, we now return a Status object. Status.OK
represents accepted addresses and other non-acceptance. This allows the
LB to provide more information about why a set of addresses were not
acceptable.
The status will later be sent to the name resolver as well to allow it
to also better react to to bad addresses.
This breaks the ABI of the classes listed below.
Users that recompiled their code using grpc-java [`v1.36.0`]
(https://github.com/grpc/grpc-java/releases/tag/v1.36.0) (released on
Feb 23, 2021) and later, ARE NOT AFFECTED.
Users that compiled their source using grpc-java earlier than
[`v1.36.0`]
(https://github.com/grpc/grpc-java/releases/tag/v1.36.0) need to
recompile when upgrading to grpc-java `v1.59.0`. Otherwise the code
will fail on runtime with `NoSuchMethodError`. For example, code:
```java
NettyChannelBuilder.forTarget("localhost:100").maxRetryAttempts(2);
```
Will fail with
> `java.lang.NoSuchMethodError: 'io.grpc.internal.AbstractManagedChannelImplBuilder
io.grpc.netty.NettyChannelBuilder.maxRetryAttempts(int)'`
**Affected classes**
Class `AbstractManagedChannelImplBuilder` is deleted, and no longer in
the class hierarchy of the channel builders:
- `io.grpc.netty.NettyChannelBuilder`
- `io.grpc.okhttp.OkhttpChannelBuilder`
- `grpc.cronet.CronetChannelBuilder`
This breaks the ABI of the classes listed below.
Users that recompiled their code using grpc-java [`v1.36.0`]
(https://github.com/grpc/grpc-java/releases/tag/v1.36.0) (released on
Feb 23, 2021) and later, ARE NOT AFFECTED.
Users that compiled their source using grpc-java earlier than
[`v1.36.0`]
(https://github.com/grpc/grpc-java/releases/tag/v1.36.0) need to
recompile when upgrading to grpc-java `v1.59.0`. Otherwise the code
will fail on runtime with `NoSuchMethodError`. For example, code:
```java
NettyServerBuilder.forPort(80).directExecutor();
```
Will fail with
> `java.lang.NoSuchMethodError: 'io.grpc.internal.AbstractServerImplBuilder
io.grpc.netty.NettyServerBuilder.directExecutor()'`
**Affected classes**
Class `AbstractServerImplBuilder` is deleted, and no longer in the
class hierarchy of the server builders:
- `io.grpc.netty.NettyServerBuilder`
- `io.grpc.inprocess.InProcessServerBuilder`
When a memory leak occurs, it is really helpful to have access records
to understand where the buffer was being held when it leaked. retain()
when we create the NettyReadableBuffer already creates an access record
the ByteBuf, so here we track when the ByteBuf is passed to another
thread.
See #8330
* Use internalClose instead of close when sendMessage has a RuntimeException.
* Change argument to internalClose to a Throwable instead of a Status.
* Rename internalClose to handleInternalError
* Eliminate NPE by skipping further processing when stream is defined, but doesn't have a property for streamKey (header processing identified an error)
Fixes#10364
* Add unit test for missing content type
Encode the service authority before passing it into gRPC util in the xDS name resolver to handle xDS requests which might contain multiple slashes. Example: xds:///path/to/service:port.
As currently the underlying Java URI library does not break the encoded authority into host/port correctly simplify the check to just look for '@' as we are only interested in checking for user info to validate the authority for HTTP.
This change also leads to few changes in unit tests that relied on this check for invalid authorities which now will be considered valid.
Just like #9376, depending on Guava packages such as URLEscapers or PercentEscapers leads to internal failures(Ex: Unresolvable reference to com.google.common.escape.Escaper from io.grpc.internal.GrpcUtil). To avoid these issues create an in house version that is heavily inspired by grpc-go/grpc.
Wrapping the DnsNameResolver in DnsNameResolverProvider can cause
problems to external name resolvers that delegate to a DnsResolver
already wrapped in RetryingNameResolver. ManagedChannelImpl would
end up wrapping these name resolvers again, causing an exception
later from a RetryingNameResolver safeguard that checks for double
wrapping.
Motivation:
When multiple NameResolvers are created, the Classloader is scanned every time trying to figure out if the Platform is Android. This expensive work could be done only once.
Modification:
Cache isAndroid resolution in a constant.
Result:
Less expensive multiple NameResolvers instantiation.
ManagedCahnnelImpl did not make sure to use a RetryingNameResolver if
authority was not overriden. This was not a problem for DNS name
resolution as the DNS name resolver factory explicitly returns a
RetryingNameResolver. For polling name resolvers that do not do this in
their factories (like the grpclb name resolver) this meant not having retry
at all.
* context, all: move Context classes to grpc-api
clean up grpc-context since it has no source code: only add dep on grpc-api
add exclusion for all transitive deps of grpc-api - only guava
exclude grpc-context as a dependency from grpc-alts because all context code is in grpc-api now
api: 1.7 as target Java version for Context source-set of grpc-api
* core, census: fix the issues with android project pulling in old grpc-context version
* api,context: make changes to bazel build files to account for context code moving from context to api
Since 44847bf4e, when we upgraded our JUnit version, the JUnit
exclusions have probably not been necessary. e0ac97c4f upgraded
Robolectric to a version that had the auto.service problem fixed.
This avoids the (often missing) evaluationDependsOn and fixes using
results from other projects without propagating those through
Configuration. It also reduces the number of useless classes pulled in
by down-stream tests, reducing the probability of rebuilds.
The expectation of fixtures is they help testing down-stream code that
use the classes in main. That applies to all the classes here except for
FakeClock and StaticTestingClassLoader. It would also apply to many
internal classes in grpc-testing, but let's consider cleaning that up
future work.
LoadBalancers in general should never throw, but
parseLoadBalancingConfig() in particular has a return value to
communicate the error. Throwing can be a bit unpredictable, but at its
most trivial form causes a channel panic. There's no reason to throw
explicitly and calls to JsonUtil have to be protected by a try-catch
because it can throw.
ReadyPicker hasn't been necessary since 111ff60e, when we stopped
calling super.pickSubchannel(). EmptyPicker was only used in tests, and
we can just compare the class name instead of doing an instanceof check.
Unfortunately, calling getClass() caused Java to start casting the
return value of pickerCaptor.getValue() based on its generics. Captors
don't verify the type they capture, so using any type other than
SubchannelPicker for the pickerCaptor is misleading and hides a cast.
If provided with the new PickFirstLoadBalancerConfig,
PickFirstLoadBalancer will shuffle the list of addresses it receives
from the name resolver.
PickFirstLoadBalancerProvider will now support the new config
if enabled by an env variable.
When the subchannel is transitioning from TRANSIENT_FAILURE to either
IDLE or CONNECTING we will not update the LB state. Additionally, if
the subchannel becomes idle we request a new connection so that the
subchannel will keep on trying to establish a connection.
The problem was one hedge was committed before another had drained
start(). This was not testable because HedgingRunnable checks whether
scheduledHedgingRef is cancelled, which is racy, but there's no way to
deterministically trigger either race.
The same problem couldn't be triggered with retries because only one
attempt will be draining at a time. Retries with cancellation also
couldn't trigger it, for the surprising reason that the noop stream used
in cancel() wasn't considered drained.
This commit marks the noop stream as drained with cancel(), which allows
memory to be garbage collected sooner and exposes the race for tests.
That then showed the stream as hanging, because inFlightSubStreams
wasn't being decremented.
Fixes#9185
This change has these main aspects to it:
1. Removal of any name resolution responsibility from ManagedChannelImpl
2. Creation of a new RetryScheduler to own generic retry logic
- Can also be used outside the name resolution context
3. Creation of a new RetryingNameScheduler that can be used to wrap any
polling name resolver to add retry capability
4. A new facility in NameResolver to allow implementations to notify
listeners on the success of name resolution attempts
- RetryingNameScheduler relies on this
ExpectedException is deprecated, so I fixed the new warnings. However,
we are still using ExpectedException many places and had previously
supressed the warning. See
https://github.com/grpc/grpc-java/issues/7467 . I did not fix those
existing instances that had suppressed the warning, since it is
unrelated to the upgrade and we have been free to fix them at any time
since we dropped Java 7.
Use a volatile to publish the value even though it is final. TSAN
ignores the final aspect of a field, which is fair since another thread
may not see the parent's pointer become updated and use a stale value.
The lack of synchronization was clearly against lastServiceConfig's
documentation.
Fixes#9267
This helps to prevent retryable stream from using calloptions.executor when it shouldn't, e.g. call is already notified closed. It is done by delaying notifying upper layer (through masterListener).
Trying to upgrade Gradle to 7.6 improved the checkstyle plugin such that
it appears to have been running in new occasions. That in turn exposed
us to https://github.com/checkstyle/checkstyle/issues/5088. That bug was
fixed in 8.28, which also fixed lots of other bugs. So now we have
better checking and some existing volations needed fixing. Since the
code style fixes generated a lot of noise, this is a pre-fix to reduce
the size of a Gradle upgrade.
I did not upgrade past 8.28 because at some point some other bugs were
introduced, in particular with the Indentation module. I chose the
oldest version that had the particular bug impacting me fixed. Upgrading
to this old-but-newer version still makes it easier to upgrade to a
newer version in the future.
If an artifact on Maven Central exposes a type from gRPC on its API
surface, then consumers of that artifact need that gRPC API in the
compile classpath. Bazel handles this by making hjars for transitive
dependencies, but if the dependencies are runtime_deps then Bazel won't
generate hjars containing the needed symbols.
We don't export netty-shaded because the classes already don't match
Maven Central. If an artifact on Maven Central is exposing a
netty-shaded class on its API surface, it wouldn't work anyway since the
class simply doesn't exist for the Bazel build.
Fixes#9772
This change has these main aspects to it:
1. Removal of any name resolution responsibility from ManagedChannelImpl
2. Creation of a new RetryScheduler to own generic retry logic
- Can also be used outside the name resolution context
3. Creation of a new RetryingNameScheduler that can be used to wrap any
polling name resolver to add retry capability
4. A new facility in NameResolver to allow implementations to notify
listeners on the success of name resolution attempts
- RetryingNameScheduler relies on this
updateAndGet is only available in API level 24+. It is unclear why
AnimalSniffer didn't detect this.
This was noticed when doing the import, but the Android CI has also been
failing because of it.
Add big negative integer to pending stream count when cancelled. The count is used to delay closing master listener until streams fully drained.
Increment pending stream count before creating one. The count is also used to indicate callExecutor is safe to be used. New stream will not be created if big negative number was added, i.e. stream cancelled. New stream is created if not cancelled, callExecutor is safe to be used, because cancel will be delayed.
Create new streams (retry, hedging) is moved to the main thread, before callExecutor calls drain.
Minor refactor the masterListener.close() scenario.