linkerd2-proxy/linkerd/retry
Eliza Weisman 7f817b5741
retry gRPC requests are immediately terminated by trailers (#1706)
Currently, gRPC responses are classified for metrics based on the value
of the `grpc-status` header in either the initial `HEADERS` frame *or*
in the `TRAILERS` frame that terminates the response body. However, this
is *not* the case for retries. Because we don't want to buffer the
entire response body before forwarding it to the client (introducing
latency, and potentially breaking client behavior in the case of
long-running streams), we make retry decisions for gRPC based only on
`grpc-status` in the initial `HEADERS` frame.

Unfortunately, some gRPC implementations (such as Akka) will send the
`grpc-status` header in a `TRAILERS` frame even when the body is empty
(see linkerd/linkerd2#7701). In that case, Linkerd won't retry the
request, even though it wouldn't actually have to wait to buffer a long
body before making a retry decision.

This branch resolves this issue by changing the retry implementation so
that the proxy will wait for a single additional frame after the first
`HEADERS` frame before making a retry decision. If the next frame is a
`TRAILERS` frame that terminates the response stream, we can include the
value of the `grpc-status` in that `TRAILERS` frame in the retry
decision. Otherwise, if the next frame is a `DATA` frame, we buffer only
a single frame, give up on trying to retry, and forward that frame to
the client, followed by the rest of the body.

Unfortunately, this change is a bit complicated, as it was necessary to
introduce the ability for retries to modify the response body as well as
the request body. This then implies that we need to be able to erase
response body types as well, when they may differ based on whether or
not retries are enabled for a route. I tried a number of different ways
of structuring this change, such as passing an additional `Layer` into
the `Retry` service that wraps the service constructed for retries, but
this was kind of a mess. Ultimately, I felt like the cleanest way of
doing this was changing the `linkerd_retry::PrepareRequest` trait to a
`PrepareRetry` trait that can also modify response bodies.

I'm open to suggestions for a nicer way of structuing this change. I
also considered the alternative of giving up on `tower::retry` entirely
and writing our own retry service from scratch, but I wanted to try to
avoid that if it was possible.

Depends on #1705 and #1726.

Fixes linkerd/linkerd2#7701.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2022-06-22 15:08:21 -07:00
..
src retry gRPC requests are immediately terminated by trailers (#1706) 2022-06-22 15:08:21 -07:00
Cargo.toml cargo: Omit patch versions from Cargo.toml files (#1504) 2022-02-17 11:55:43 -08:00