internal: resetTransport connect deadline is across addresses
Currently, the connect deadline is recalculated per-address. This PR amends
that behavior such that all addresses for a single connection attempt share
the same deadline.
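
A minimal sketch of the shared-deadline idea, not the actual resetTransport code. The names `connectAny`, `dialFunc`, and `demo` are hypothetical; the point is only that the deadline is computed once, before the address loop, rather than once per address.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// dialFunc is a hypothetical stand-in for whatever establishes a transport.
type dialFunc func(ctx context.Context, addr string) error

// connectAny tries each address in order. The deadline is set once, before
// the loop, so every address draws from the same time budget.
func connectAny(ctx context.Context, addrs []string, timeout time.Duration, dial dialFunc) (string, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	var lastErr error = errors.New("no addresses")
	for _, addr := range addrs {
		if err := ctx.Err(); err != nil {
			return "", err // the shared budget is already spent
		}
		if err := dial(ctx, addr); err != nil {
			lastErr = err
			continue // try the next address under the same deadline
		}
		return addr, nil
	}
	return "", lastErr
}

// demo runs connectAny against a fake dialer that only accepts "good:443".
func demo() (string, error) {
	dial := func(ctx context.Context, addr string) error {
		if addr == "good:443" {
			return nil
		}
		return fmt.Errorf("dial %s: refused", addr)
	}
	return connectAny(context.Background(), []string{"bad:443", "good:443"}, time.Second, dial)
}

func main() {
	addr, err := demo()
	fmt.Println(addr, err)
}
```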
Fixes #2462
- Remove the slice of servers approach, since there's specific
logic at server 2 that's different from server 1. This has the
advantage of making the test more readable without sacrificing
anything (given the previous point).
- Defer server close at initialization time instead of at the
end.
- Remove a time.Sleep(time.Second): use timeout + select around
serverDone instead.
- Use a goroutine to keep the connection reading, instead of
using a for loop in the server goroutine. This causes the
defer close(server2Done) to happen immediately after preface
is sent, which combined with the aforementioned time.Sleep
removal causes the test to go from 1.00s to ~0.05s.
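
The timeout + select pattern mentioned above might look roughly like this. `waitDone` is a hypothetical helper, not something taken from the test itself; the idea is that the test returns as soon as the server goroutine finishes instead of always sleeping for a full second.

```go
package main

import (
	"fmt"
	"time"
)

// waitDone blocks until done is closed or the timeout elapses.
// It reports whether done was closed in time.
func waitDone(done <-chan struct{}, timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	serverDone := make(chan struct{})
	go func() {
		// Deferred at the top so the channel is closed on every exit path.
		defer close(serverDone)
		// ... hypothetical server work: accept, send preface ...
	}()
	fmt.Println(waitDone(serverDone, time.Second))
}
```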
Previously, the transport was able to reset via the retry loop,
or via the event closures calling resetTransport. This meant
a very large amount of synchronization was necessary: one
reset meant the other had to not reset; state had to be kept
at the addrconn; and very subtle interactions were hard to
reason about.
This change removes the ability for event closures to directly
reset the transport. Instead, they signal to the retry
loop about the event, and the retry loop is always the single
place that retries occur.
This also allows us to refactor the address switching logic
into a much simpler for loop inside the retry loop instead of
using addrConn state to keep track of an index.
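
A rough sketch of that shape, under the assumption that a close event can be modeled as a send on a channel. `retryLoop` and the `dial` callback are hypothetical names; the real addrConn carries far more state. The closure handed to the transport only signals, and the loop is the only code that moves on to the next address.

```go
package main

import (
	"errors"
	"fmt"
)

// retryLoop is the single place where (re)connects happen. It walks addrs
// with a plain for loop instead of tracking an index in shared state.
func retryLoop(addrs []string, rounds int, dial func(addr string, onClose func()) error) []string {
	var connected []string
	for r := 0; r < rounds; r++ {
		for _, addr := range addrs {
			closed := make(chan struct{}, 1)
			// onClose never reconnects by itself; it only notifies the loop.
			onClose := func() {
				select {
				case closed <- struct{}{}:
				default: // a close was already reported
				}
			}
			if err := dial(addr, onClose); err != nil {
				continue // try the next address
			}
			connected = append(connected, addr)
			<-closed // wait until the transport reports that it closed
		}
	}
	return connected
}

// demo uses a fake dialer: "a" always fails, "b" connects and then closes.
func demo() []string {
	dial := func(addr string, onClose func()) error {
		if addr == "a" {
			return errors.New("refused")
		}
		go onClose() // simulate the transport closing later
		return nil
	}
	return retryLoop([]string{"a", "b"}, 2, dial)
}

func main() {
	fmt.Println(demo())
}
```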
Possible settings of this environment variable:
- "hybrid" (default; removed after the 1.17 release): do not wait for handshake before considering a connection ready, but wait for it before considering the connection successful.
- "on" (default after the 1.17 release): wait for handshake before considering a connection ready/successful.
- "off": do not wait for handshake before considering a connection ready/successful.
This setting will be completely removed after the 1.18 release, and "on" will be the only supported behavior.
internal: fix client send preface problems
This CL fixes three problems:
- In clientconn_state_transitions_test.go, sometimes tests would flake because there's not enough buffer to send client side settings, causing the connection to unpredictably enter TRANSIENT FAILURE. Each time we set up a server to send SETTINGS, we should also set up the server to read. This allows the client to successfully send its SETTINGS, unflaking the test.
- In clientconn.go, we incorrectly transitioned into TRANSIENT FAILURE when creating an http2client returned an error. This should be handled in the outer resetTransport main reset loop. This became a problem because the outer resetTransport has very specific conditions for when to transition into TRANSIENT FAILURE, and the extra transition did not check them. So, it could transition into TRANSIENT FAILURE after failing to dial, even if it was trying to connect to a non-final address in the list of addresses.
- In clientconn.go, we incorrectly stayed in CONNECTING after `createTransport` when a server sent its connection preface but the client was not able to send its own. This CL causes the addrconn to correctly enter TRANSIENT FAILURE when `createTransport` fails, even if a server preface was received. It does so by making ac.successfulHandshake consider both whether the server preface was received and whether the client preface was sent.
internal: clean up and unflake state transitions test
Switches state transitions test to using a notification from a custom load
balancer, instead of relying on waiting for laggy balancer state updates.
Also generally adds more coverage around state transitions and a framework
for easily adding more of these kinds of tests.
Fixes #2348
Closing `ClientConn` sets `balancerWrapper` to nil.
If service config switches balancer, the new balancer will be notified of the existing addresses.
When these two happen together, there's a chance that a method will be called on the nil `balancerWrapper`. This change adds a check to make sure that never happens.
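
A simplified illustration of such a guard. The types here are hypothetical stand-ins, much smaller than the real `ClientConn` and `balancerWrapper`; the sketch assumes the field is read and cleared under the same mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// balancerWrapper is a hypothetical, stripped-down stand-in.
type balancerWrapper struct{ addrs []string }

func (b *balancerWrapper) handleAddrs(a []string) { b.addrs = a }

type clientConn struct {
	mu sync.Mutex
	bw *balancerWrapper // nil once the connection is closed
}

// close clears the wrapper, as closing the ClientConn does.
func (cc *clientConn) close() {
	cc.mu.Lock()
	cc.bw = nil
	cc.mu.Unlock()
}

// notify forwards addresses to the balancer unless the connection has
// already been closed. It reports whether the balancer was notified.
func (cc *clientConn) notify(addrs []string) bool {
	cc.mu.Lock()
	defer cc.mu.Unlock()
	if cc.bw == nil {
		return false // closed concurrently: skip instead of dereferencing nil
	}
	cc.bw.handleAddrs(addrs)
	return true
}

func main() {
	cc := &clientConn{bw: &balancerWrapper{}}
	fmt.Println(cc.notify([]string{"10.0.0.1:443"}))
	cc.close()
	fmt.Println(cc.notify([]string{"10.0.0.2:443"}))
}
```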
Fixes #2367
internal: fix onClose state transitions
When onClose occurs during WaitForHandshake, it should immediately
exit createTransport instead of leaving CONNECTING and entering READY.
Furthermore, when any onClose happens, the state should change to
TRANSIENT FAILURE.
Fixes #2340
Fixes #2341
Also fixes an unreported bug in which entering READY causes a
Dial call to end prematurely, instead of blocking until a READY
transport is found.
This fixes a race in ac.tearDown and ac.resetTransport. If ac.resetTransport
is in backoff when ac.tearDown occurs, there's a race between the state
changing to Shutdown and ac.resetTransport calling ac.createTransport.
This fixes it by returning when ac.resetTransport encounters an error
during ac.nextAddr (specifically ac.ctx.Err()). It also fixes it by
making sure that ac.tearDown changes state to Shutdown before canceling
the context.
Both fixes were implemented because they both seem to be valuable
standalone additions: the former makes ac.resetTransport more
understandable and less dependent on behavior happening elsewhere,
and the latter makes ac.tearDown more correct.
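
The two fixes can be sketched together as follows. This is an illustration only: the `addrConn` below is a hypothetical reduction with just a state, a context, and a mutex, and `nextAttempt` stands in for the check at the top of one retry iteration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

type state int

const (
	connecting state = iota
	shutdown
)

// addrConn is a hypothetical reduction of the real type.
type addrConn struct {
	mu     sync.Mutex
	state  state
	ctx    context.Context
	cancel context.CancelFunc
}

func newAddrConn() *addrConn {
	ctx, cancel := context.WithCancel(context.Background())
	return &addrConn{ctx: ctx, cancel: cancel}
}

// tearDown records Shutdown before canceling the context, so anything woken
// by the cancellation already observes the final state.
func (ac *addrConn) tearDown() {
	ac.mu.Lock()
	ac.state = shutdown
	ac.mu.Unlock()
	ac.cancel()
}

// nextAttempt models the start of one retry iteration: if the context is
// done, it returns the error so the caller stops instead of going on to
// create a transport.
func (ac *addrConn) nextAttempt() error {
	return ac.ctx.Err()
}

func main() {
	ac := newAddrConn()
	fmt.Println(ac.nextAttempt())
	ac.tearDown()
	fmt.Println(ac.nextAttempt())
}
```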
Finally, TestDialParseTargetUnknownScheme had its buffer removed; the
buffer was likely added a while ago to assuage this issue. It should
not be necessary anymore.
tests: fix goroutine leak
If TestResetConnectBackoff fails, the resetTransport goroutine will be
stuck dialing and subsequently the goroutine will be leaked. This is
all despite the test including `defer cc.Close()`:
- defer cc.Close() will cause ac.cancel to be called
- ac.context will be appropriately cancelled
- ac.context is correctly the context that gets passed to the dialer
- However, the WithDialer throws away the context and only passes its
deadline, which for `backoffForever{}` is math.MaxInt64. So, even
though teardown occurs, the resetTransport goroutine will still be
stuck dialing.
This CL adds a small amendment: before performing leakcheck, attempt
to take an item off the synchronous `dials` channel. Either the tests
passed and there is no item, or the tests failed and there is one.
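
To illustrate why cancellation does not reach the dialer, here is a small sketch of a timeout-only dialer wrapped into a context-aware one. `legacyDialer`, `adapt`, and `demo` are hypothetical names that mirror the shape of the problem; they are not the gRPC implementation. The final receive from `dials` plays the role of the amendment described above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// legacyDialer has the timeout-only shape: it never sees a context.
type legacyDialer func(addr string, timeout time.Duration) error

// adapt wraps a legacyDialer. Only the deadline survives the conversion;
// cancellation of ctx is invisible to f.
func adapt(f legacyDialer) func(ctx context.Context, addr string) error {
	return func(ctx context.Context, addr string) error {
		if d, ok := ctx.Deadline(); ok {
			return f(addr, time.Until(d))
		}
		return f(addr, 0)
	}
}

// demo reports whether the dial is still blocked after its context is
// canceled, then unblocks it by receiving from the dials channel.
func demo() bool {
	dials := make(chan struct{}) // unbuffered, so a send blocks until received
	dialer := adapt(func(addr string, timeout time.Duration) error {
		dials <- struct{}{}
		return nil
	})

	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go func() {
		defer close(done)
		_ = dialer(ctx, "example:443")
	}()
	cancel()

	blocked := false
	select {
	case <-done:
	case <-time.After(50 * time.Millisecond):
		blocked = true // canceling the context did not stop the dial
	}

	// Take the pending item off the channel so the goroutine can exit.
	select {
	case <-dials:
	case <-time.After(time.Second):
	}
	<-done
	return blocked
}

func main() {
	fmt.Println("still dialing after cancel:", demo())
}
```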
A leak happens when DialContext times out before a balancer returns any
addresses or before a successful connection is established.
The loop in ClientConn.lbWatcher breaks and doneChan never gets closed.
This modifies the WithBlock behavior somewhat to block until there is at least
one valid connection. Previously, each connection would be made serially until
all had completed successfully, with any errors returned to the caller. Errors
are now only returned due to connecting to a backend if a balancer is not used,
or if there is an error starting the balancer itself.
Fixes #976
The :authority pseudo-header for a gRPC Client defaults to the host
portion of the dialed target and can only be overwritten by providing a
TransportCredentials. However, there are cases where setting this header
independent of any transport security is valid. In my particular case,
in order to leverage Envoy for request routing, the cluster/service name
must be provided in the :authority header. This may also be useful in a
testing context.
This patch adds a DialOption to overwrite the authority header,
even if TransportCredentials are provided (I'd imagine you'd only ever
need to specify one or the other).
* Add the initial service config support
* start scWatcher later
* remove timeoutCh
* address the comments
* deal with dial timeout
* defer cancel for the newly created context for correct lifetime management
* fix the defer order
* added other 2 missing cancels
To enforce immutability of the `DefaultBackoffConfig`, we've made it a
concrete value. While fields can still be set directly on the value,
taking a copy will not incidentally pull a reference to the variable.
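
The difference between the two forms can be shown with a small sketch. The struct and the two package variables below are hypothetical stand-ins, and the delay values are illustrative only.

```go
package main

import (
	"fmt"
	"time"
)

// BackoffConfig is a reduced, hypothetical version of the real struct.
type BackoffConfig struct {
	MaxDelay time.Duration
}

var (
	defaultPtr = &BackoffConfig{MaxDelay: 120 * time.Second} // pointer form
	defaultVal = BackoffConfig{MaxDelay: 120 * time.Second}  // value form
)

// demo takes a "copy" of each default, edits the copy, and returns what
// the defaults look like afterwards.
func demo() (ptrDefault, valDefault time.Duration) {
	p := defaultPtr // copies only the pointer: p aliases the default
	p.MaxDelay = time.Second

	v := defaultVal // copies the struct: v is independent
	v.MaxDelay = time.Second

	return defaultPtr.MaxDelay, defaultVal.MaxDelay
}

func main() {
	p, v := demo()
	fmt.Println("pointer default:", p)
	fmt.Println("value default:", v)
}
```

With the pointer form, the edit leaks into the shared default; with the value form, the default keeps its original delay.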
Signed-off-by: Stephen J Day <stephen.day@docker.com>
Because most of the fields on `BackoffConfig` are unexported, correctly
using the config requires copying from the default. This sets the
defaults appropriately and falls back to a default if MaxDelay is
negative or zero.
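
A minimal sketch of that fallback, assuming a reduced struct with hypothetical unexported fields `baseDelay` and `factor` and illustrative default values. `withDefaults` is an invented helper name, not the function added by this change.

```go
package main

import (
	"fmt"
	"time"
)

// BackoffConfig is a reduced, hypothetical version of the real struct.
type BackoffConfig struct {
	MaxDelay  time.Duration
	baseDelay time.Duration // unexported: callers cannot set it directly
	factor    float64
}

// DefaultBackoffConfig holds illustrative defaults.
var DefaultBackoffConfig = BackoffConfig{
	MaxDelay:  120 * time.Second,
	baseDelay: time.Second,
	factor:    1.6,
}

// withDefaults starts from a copy of the default, so the unexported fields
// are always populated, and keeps the caller's MaxDelay only if positive.
func withDefaults(bc BackoffConfig) BackoffConfig {
	out := DefaultBackoffConfig // value copy
	if bc.MaxDelay > 0 {
		out.MaxDelay = bc.MaxDelay
	}
	return out
}

func main() {
	fmt.Println(withDefaults(BackoffConfig{}).MaxDelay)
	fmt.Println(withDefaults(BackoffConfig{MaxDelay: 5 * time.Second}).MaxDelay)
}
```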
Tests are added to ensure that the backoff is set correctly in common
use cases.
Signed-off-by: Stephen J Day <stephen.day@docker.com>