Commit Graph

421 Commits

Author SHA1 Message Date
Eliza Weisman 6df55c0059
Update h2 to 0.1.15 (#172)
carllerche/h2#338 fixes a deadlock in stream reference counts that could
potentially impact the proxy. This branch updates our `h2` dependency to a
version which includes this change.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2019-01-16 15:10:44 -08:00
Eliza Weisman 08b2a23ca8
Update to trust-dns-resolver 0.10.1 (#169)
An upstream bug in the `trust-dns-proto` library can cause
`trust-dns-resolver` to leak UDP sockets when DNS queries time out. This
issue appears to be the cause of the memory leak described in
linkerd/linkerd2#2012.

This branch updates the `trust-dns` dependency to pick up the change in
bluejekyll/trust-dns#635, which fixes the UDP socket leak.

I confirmed that the socket leak was fixed by modifying the proxy to
hard-code a 0-second DNS timeout, sending requests to the proxy's
outbound listener, and using

```
lsof -p $(pgrep linkerd2-proxy)
```

to count the number of open UDP sockets. On master, every request to a
different DNS name that times out leaves behind an additional open UDP
socket, which shows up in `lsof`, while on this branch, only TCP sockets
remain open after the request ends.

In addition, I'm running a test in GCP to watch the memory and file
descriptor use of the proxy over a long period of time. This is still in
progress, but given the above, I strongly believe this branch fixes the
leak.

Fixes linkerd/linkerd2#2012.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2019-01-09 16:11:12 -08:00
Jon Richards 204fc408de Update for linkerd2 slack channel (#168)
See f833f5659c

Signed-off-by: Jon Richards <jon.richards@nordstrom.com>
2019-01-03 08:43:24 +05:30
Sean McArthur 2f2050537d add Route retries to Service Profiles
Signed-off-by: Sean McArthur <sean@buoyant.io>
2018-12-19 18:31:43 -08:00
Oliver Gould 6649630db7
Improve load balancer configuration (#167)
Adopts changes from https://github.com/tower-rs/tower/pull/134

> balance: Consider new nodes more readily
>
> When a PeakEwma Balancer discovers a single new endpoint, it will not
> dispatch requests to the new endpoint until the RTT estimate for an
> existing endpoint exceeds _one second_. This misconfiguration leads to
> unexpected behavior.
>
> When more than one endpoint is discovered, the balancer may eventually
> dispatch traffic to some of--but not all of--the new endpoints.
>
> This change alters the PeakEwma balancer in two ways:
>
> First, the previous DEFAULT_RTT_ESTIMATE of 1s has been changed to be
> configurable (and required). The library should not hard code a default
> here.
>
> Second, the initial RTT value is now decayed over time so that new
> endpoints will eventually be considered, even when other endpoints are
> less loaded than the default RTT estimate.
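
The decay described in the quoted message can be pictured with a small sketch. This is not tower's code: the exponential form, the function name `decayed_estimate`, and the 10-second time constant are all assumptions chosen for illustration. It only shows why a decaying default lets a new endpoint eventually look cheaper than busy existing ones.

```rust
use std::time::Duration;

/// Hypothetical helper: exponentially decay an RTT estimate that has not been
/// refreshed for `elapsed`, using the time constant `decay`.
fn decayed_estimate(initial_rtt: Duration, elapsed: Duration, decay: Duration) -> Duration {
    let factor = (-(elapsed.as_secs_f64() / decay.as_secs_f64())).exp();
    initial_rtt.mul_f64(factor)
}

fn main() {
    let initial = Duration::from_secs(1); // the configured default RTT
    let decay = Duration::from_secs(10); // assumed decay window
    for secs in &[0u64, 10, 30] {
        let estimate = decayed_estimate(initial, Duration::from_secs(*secs), decay);
        println!("after {:>2}s: {:?}", secs, estimate);
    }
}
```

With these assumed values the estimate starts at the configured default and shrinks toward zero, so an endpoint that has never served a request is eventually tried.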
2018-12-19 16:35:04 -08:00
Sean McArthur 5b00bcf40e Update to latest tower and tower-grpc
Signed-off-by: Sean McArthur <sean@buoyant.io>
2018-12-19 13:18:57 -08:00
Jon Richards efe5299575 Makefile: Add missing fetch for test-flakey (#163)
Signed-off-by: Jon Richards <jon.richards@nordstrom.com>
2018-12-18 15:15:01 -08:00
Jon Richards 9c0a94987d Add clean target (#161)
Adds a `make clean` target that invokes `cargo clean`.
2018-12-18 15:13:54 -08:00
Sean McArthur 792c04b7d1 Replace tower-h2 tap service with hyper
This untangles some of the HTTP/gRPC glue, providing services/stacks
that have more specific focuses. The `HyperServerSvc` now *only*
converts to a `tower::Service`, and the HTTP/1.1 and Upgrade pieces were
moved to a specific `proxy::http::upgrade::Service`.

Several stack modules were added to `proxy::grpc`, which can map request
and response bodies into `Payload`, or into `grpc::Body`, as needed.

Signed-off-by: Sean McArthur <sean@buoyant.io>
2018-12-18 12:05:50 -08:00
Eliza Weisman 0a085f98e8
Actually fix the master CI build (#164)
It turns out that increasing the recursion limit for the `tap` test
crate _actually_ fixes the compiler error that's broken the last several
builds on master. 

Since I'm now able to locally reproduce the error (which occurs only
when running the tests in release mode), I've verified that this
actually does fix the issue. Thus, I've also reverted the previous
commit (7c35f27ad3) which claimed to fix
this issue, as it turns out that was not actually necessary.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-12-14 12:18:59 -08:00
Eliza Weisman 7c35f27ad3
Workaround for Rust 1.31 crash with verbose mode enabled (#162)
This branch replaces the `export CARGO_VERBOSE=1` on CI release-mode
builds with the `travis_wait` script. Verbose mode was previously being
set to prevent long release-mode builds from timing out. However, there
appears to be a bug in `rustc` 1.31.0, which causes the compiler to
crash when building the proxy with verbose mode enabled. Hopefully, this
will fix the build on master.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-12-13 14:38:13 -08:00
Eliza Weisman 761a08e4ac
Make TLS accept logic compatible with disabled protocol detection (#158)
This branch changes the proxy's accept logic so that the proxy will no
longer attempt to terminate TLS on ports which are configured to skip
protocol detection. This means that a Linkerd deployment with
`--tls optional` will no longer break server-speaks-first protocols like
MySQL (although that traffic will not be encrypted).

Since it's necessary to get the connection's original destination to
determine if it's on a port which should skip protocol detection, I've
moved the SO_ORIGINAL_DST call down the stack from `Server` to
`BoundPort`. However, to avoid making an additional unnecessary syscall,
the original destination is propagated to the server, along with the
information about whether or not protocol detection is enabled. This is
the approach described in
https://github.com/linkerd/linkerd2/issues/1270#issuecomment-406124236.

I've also written a new integration test for server-speaks-first
protocols with TLS enabled. This test is essentially the same as the
existing `transparency::tcp_server_first` test, but with TLS enabled for
the test proxy. I've confirmed that this fails against master.
Furthermore, I've validated this change by deploying the `booksapp` demo
with MySQL with TLS enabled, which [previously didn't work](https://github.com/linkerd/linkerd2/issues/1648#issuecomment-432867702).

Closes linkerd/linkerd2#1270

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-12-13 12:31:13 -08:00
Jon Richards d8d1b040f9 Fix test-flakey target (#157)
Signed-off-by: Jon Richards <jon.richards@nordstrom.com>
2018-12-11 11:14:58 -08:00
Oliver Gould 0065c13751
profiles: Drive profile discovery on a daemon task (#156)
The profile router currently is responsible for driving the state of
profile discovery; but this means that, if a service is not polled for
traffic, the proxy may not drive discovery (so that requests may
time out, etc).

This change moves this discovery onto a daemon task that sends profile
updates to the service over an mpsc with capacity of 1.
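
As a rough analogy for the pattern described above, the sketch below uses a standard-library thread and `sync_channel(1)` in place of the proxy's futures-based task and mpsc. The function name `spawn_discovery` and the string updates are hypothetical. The point is only that the producer runs on its own, independent of how often the consumer is polled, and that at most one update is buffered.

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

/// Spawn a background thread that "discovers" profile updates and pushes them
/// through a channel that buffers at most one pending update.
fn spawn_discovery(updates: Vec<String>) -> Receiver<String> {
    let (tx, rx) = sync_channel(1);
    thread::spawn(move || {
        for update in updates {
            // Blocks while the single slot is full; stops once the receiver
            // (the "service") has been dropped.
            if tx.send(update).is_err() {
                break;
            }
        }
    });
    rx
}

fn main() {
    let rx = spawn_discovery(vec!["routes-v1".to_string(), "routes-v2".to_string()]);
    for update in rx {
        println!("service observed {}", update);
    }
}
```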
2018-12-05 12:40:29 -08:00
Sean McArthur b9ffbb7f93 Update h2 to v0.1.14
Signed-off-by: Sean McArthur <sean@buoyant.io>
2018-12-05 11:01:49 -08:00
Sean McArthur 3ac6b72c48
Add basic tap integration tests (#154)
Signed-off-by: Sean McArthur <sean@buoyant.io>
2018-12-04 18:37:26 -08:00
Oliver Gould 68f42c337f
Log discovery updates in the outbound proxy (#153)
When debugging issues that users believe are related to discovery, it's
helpful to get a narrow set of logs out to determine whether the proxy
is observing discovery updates.

With this change, a user can inject the proxy with
```
LINKERD2_PROXY_LOG='warn,linkerd2_proxy=info,linkerd2_proxy::app::outbound::discovery=debug'
```
and the proxy's logs will include messages like:

```
DBUG voting-svc.emojivoto.svc.cluster.local:8080 linkerd2_proxy::app::outbound::discovery adding 10.233.70.98:8080 to voting-svc.emojivoto.svc.cluster.local:8080
DBUG voting-svc.emojivoto.svc.cluster.local:8080 linkerd2_proxy::app::outbound::discovery removing 10.233.66.36:8080 from voting-svc.emojivoto.svc.cluster.local:8080
```

This change also turns down some overly chatty INFO logging in main.
2018-12-04 07:45:20 -08:00
Oliver Gould f3f959b854
Record grpc-status from response headers (#152)
As we well know, gRPC responses may include the `grpc-status` header
when there is no response payload.

This change ensures that tap response end events include this value when
it is set on response headers, since grpc-status is handled specially
in the Tap API.
2018-12-03 16:05:27 -08:00
Oliver Gould da9736c9da
Update h2 to v0.1.13 (#151)
* 80b4ec5 (tag: v0.1.13) Bump version to v0.1.13 (#324)
* 6b23542 Add client support for server push (#314)
* 6d8554a Reassign capacity from reset streams. (#320)
* b116605 Check whether the send side is not idle, not the recv side (#313)
* a4ed615 Check minimal versions (#322)
* ea8b8ac Avoid prematurely unlinking streams in `send_reset`, in some cases. (#319)
* 9bbbe7e Disable length_delimited deprecation warning. (#321)
* 00ca534 Update examples to use new Tokio (#316)
* 12e0d26 Added functions to access io::Error in h2::Error (#311)
* 586106a Fix push promise frame parsing (#309)
* 2b960b8 Add Reset::INTERNAL_ERROR helper to test support (#308)
* d464c6b set deny(warnings) only when cfg(test) (#307)
* b0db515 fix some autolinks that weren't resolving in docs (#305)
* 66a5d11 Shutdown the stream along with connection (#304)
2018-12-03 13:50:12 -08:00
Oliver Gould cbaed8af71
profiles: Add anchors to control-plane provided regexes (#150)
@adleong suggested that profile matching should always be anchored
so that users must be explicit about unexpected path components.

This change modifies the Profile client to always build anchored
regular expressions.
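
One way to picture the anchoring is the string transformation below. The helper name `anchored` and the use of a non-capturing group are assumptions; the commit message does not say how the proxy builds the anchored pattern.

```rust
/// Hypothetical helper: wrap a user-supplied pattern so that it must match the
/// whole input. The non-capturing group keeps alternations such as `a|b` from
/// escaping the anchors.
fn anchored(pattern: &str) -> String {
    format!("^(?:{})$", pattern)
}

fn main() {
    // A profile author who wants to match sub-paths must now say so explicitly,
    // for example by ending the pattern with `(/.*)?`.
    println!("{}", anchored("/users/[^/]+"));
}
```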
2018-12-03 13:08:00 -08:00
Oliver Gould 872f78df31
Expose route labels via tap (#147)
Route labels are not queryable by tap, nor are they exposed in tap
events.

This change uses the newly-added fields in linkerd/linkerd2-proxy-api#17
to make Tap route-aware.
2018-12-03 12:29:39 -08:00
khappucino 87fb677cdf Minor spelling error in connection.rs comments (#145)
Signed-off-by: David Capino <david.capino@gmail.com>
2018-11-29 20:02:03 -08:00
Oliver Gould 52a2bf5a3e
canonicalize: Drive resolution on a background task (#146)
canonicalize::Service::poll_ready may not be called enough to drive
resolution, so a background task must be spawned to watch DNS.

Updates are published into service over an mpsc, so the task exits
gracefully when the service is dropped.
2018-11-29 17:28:51 -08:00
Oliver Gould 82524e4a1f
Apply tapping logic only when taps are active (#142)
Previously, as the proxy processed requests, it would:

1. Obtain the taps mutex ~4x per request to determine whether taps are active.
2. Construct an "event" ~4x per request, regardless of whether any taps were
   active.

Furthermore, this relied on fragile caching logic, where the grpc server
manages individual stream states in a Map to determine when all streams have
been completed. And, beyond the complexity of caching, this approach makes it
difficult to expand Tap functionality (for instance, to support tapping of
payloads).

This change entirely rewrites the proxy's Tap logic to (1) prevent the need
to acquire mutexes in the request path, (2) only produce events as needed to
satisfy tap requests, and (3) provide clear (private) API boundaries between
the Tap server and Stack, completely hiding gRPC details from the tap service.

The tap::service module now provides a middleware that is generic over a
way to discover Taps; and the tap::grpc module (previously,
control::observe), implements a gRPC service that advertises Taps such that
their lifetimes are managed properly, leveraging RAII instead of hand-rolled
map-based caching.
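
The RAII idea can be sketched with `Arc` and `Weak` from the standard library. The `Tap` and `Registry` types here are hypothetical and much simpler than the proxy's; the sketch covers only the lifetime aspect (a tap stops being active when its owner drops it), not the lock-free request path.

```rust
use std::sync::{Arc, Weak};

/// A hypothetical tap handle; the gRPC stream would own the `Arc`.
struct Tap {
    id: usize,
}

/// The request path only ever sees weak references.
#[derive(Default)]
struct Registry {
    taps: Vec<Weak<Tap>>,
}

impl Registry {
    /// Register a tap and hand ownership of it to the caller.
    fn register(&mut self, id: usize) -> Arc<Tap> {
        let tap = Arc::new(Tap { id });
        self.taps.push(Arc::downgrade(&tap));
        tap
    }

    /// Prune taps whose owners have gone away and list the live ones.
    fn active(&mut self) -> Vec<usize> {
        self.taps.retain(|t| t.upgrade().is_some());
        self.taps
            .iter()
            .filter_map(|t| t.upgrade())
            .map(|t| t.id)
            .collect()
    }
}

fn main() {
    let mut registry = Registry::default();
    let first = registry.register(0);
    let second = registry.register(1);
    drop(first); // e.g. the tap client hung up
    println!("{:?}", registry.active());
    drop(second);
}
```

No explicit "remove from map" call is needed: dropping the handle is what ends the tap.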

There is one user-facing change: tap stream IDs are now calculated relative to
the tap server. The base id is assigned from the count of tap requests that have
been made to the proxy; and the stream ID corresponds to an integer on [0, limit).
2018-11-29 17:12:48 -08:00
Sean McArthur edd124fa9f
Dockerfile: copy all lib crates before fetching dependencies (#143)
Signed-off-by: Sean McArthur <sean@buoyant.io>
2018-11-28 11:26:08 -08:00
Sean McArthur 88340cadf3 replace proxy::http usage of tower-h2 with hyper
Signed-off-by: Sean McArthur <sean@buoyant.io>
2018-11-27 17:29:18 -08:00
Oliver Gould 5d93690045
tests: Replace `unimplemented!` with `unreachable!` (#141)
When developing, it's convenient to use `unimplemented!` as a
placeholder for work that has not yet been done. However, we also use
`unimplemented!` in various tests in stubbed methods; so searching the
project for `unimplemented` produces false positives.

This change replaces these with `unreachable!`, which is functionally
equivalent, but better indicates that the current usage does not reach
these methods and disambiguates usage of `unimplemented!`.
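
A minimal illustration of the convention, using a hypothetical `Resolver` trait and `StubResolver` test double that are not taken from the proxy's test suite:

```rust
trait Resolver {
    fn resolve(&self, name: &str) -> String;
    fn shutdown(&self);
}

/// A test stub that only needs `resolve`.
struct StubResolver;

impl Resolver for StubResolver {
    fn resolve(&self, name: &str) -> String {
        format!("10.0.0.1 ({})", name)
    }

    fn shutdown(&self) {
        // The test never calls this. `unreachable!` documents that, and still
        // panics (as `unimplemented!` would) if the assumption breaks.
        unreachable!("tests do not shut the resolver down")
    }
}

fn main() {
    println!("{}", StubResolver.resolve("web"));
}
```

Searching the project for `unimplemented` then finds only genuine placeholders.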
2018-11-27 14:27:30 -08:00
Luca Bruno 4f9adf9ca4 metrics/counter: wrap values over 2^53 (#139)
This implements Prometheus reset semantics for counters, in order to
preserve precision when deriving rate of increase.
Wrapping is based on the fact that Prometheus models counters as `f64`
(52-bit mantissa), thus integer values over 2^53 are not guaranteed to
be correctly exposed.
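
The wrapping can be sketched as follows. The `Counter` type, the constant name, and the choice to wrap modulo 2^53 + 1 are assumptions made for illustration; the commit message only states that values over 2^53 wrap.

```rust
/// 2^53: every integer up to this magnitude is exactly representable as `f64`.
const MAX_PRECISE_VALUE: u64 = 1 << 53;

#[derive(Debug, Default)]
struct Counter(u64);

impl Counter {
    /// Add `n`, wrapping back toward zero once the value would exceed 2^53.
    fn add(&mut self, n: u64) {
        // Widen to u128 so the addition itself cannot overflow.
        let sum = u128::from(self.0) + u128::from(n);
        self.0 = (sum % (u128::from(MAX_PRECISE_VALUE) + 1)) as u64;
    }

    /// The value as exported; exact because it never exceeds 2^53.
    fn value(&self) -> f64 {
        self.0 as f64
    }
}

fn main() {
    let mut counter = Counter(MAX_PRECISE_VALUE);
    counter.add(1); // wraps, which Prometheus interprets as a counter reset
    println!("{}", counter.value());
}
```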

Signed-off-by: Luca Bruno <luca.bruno@coreos.com>
2018-11-18 08:05:15 -08:00
Oliver Gould 663eab43dc
Reduce log level for orig-proto-downgrade (#138)
Using a downgrade stack is not sufficiently important to log at the INFO
level. Log it at DEBUG.
2018-11-16 13:19:06 -08:00
Oliver Gould 2ab7ce2e67
Delete the ctx module (#137)
The `ctx` module is no longer used. It can safely be deleted.
2018-11-16 12:28:28 -08:00
Sean McArthur cde7675a9a Ensure l5d-orig-proto header is removed before returning responses
Signed-off-by: Sean McArthur <sean@buoyant.io>
2018-11-16 12:10:29 -08:00
Sean McArthur f37c9e5128 Update all tower pieces to use Service<Request> (#132)
Signed-off-by: Sean McArthur <sean@buoyant.io>
2018-11-16 11:19:17 -08:00
Oliver Gould 7add0db68e
Disable debug symbols for tests (#135)
Our test artifacts end up consuming several GB of disk space, largely
due to debug symbols. This can prevent CI from passing, as CI hosts only
have about 9G of real estate.

By disabling debug symbols, we reduce artifact size by >90% (and total
target directory size from 14G to 4G).
2018-11-16 10:03:42 -08:00
Toby Lawrence 1cd340ccaa Fix Slack signup link in README. (#133)
Signed-off-by: Toby Lawrence <toby@nuclearfurnace.com>
2018-11-15 20:12:46 -08:00
Oliver Gould 9a9d929e34
Add controller client metrics (#131)
There is no telemetry from the controller client currently.

This change adds a new scope (`control_`) of metrics including HTTP
metrics for the client to the proxy-api.
2018-11-15 15:30:12 -08:00
Alex Leong c970a8c173 Add never lib to Dockerfile (#130)
Signed-off-by: Alex Leong <alex@buoyant.io>
2018-11-15 11:45:28 -08:00
Oliver Gould d5e2ff2cb7
Canonicalize outbound names via DNS for inbound profiles (#129)
When the inbound proxy receives requests, these requests may have
relative `:authority` values like _web:8080_. Because these requests can
come from hosts with a variety of DNS configurations, the inbound proxy
can't make a sufficient guess about the fully qualified name (e.g.
_web.ns.svc.cluster.local._).

In order for the inbound proxy to discover inbound service profiles, we
need to establish some means for the inbound proxy to determine the
"canonical" name of the service for each request.

This change introduces a new `l5d-dst-canonical` header that is set by
the outbound proxy and used by the remote inbound proxy to determine
which profile should be used.

The outbound proxy determines the canonical destination by performing
DNS resolution as requests are routed and uses this name for profile and
address discovery. This change removes the proxy's hardcoded Kubernetes
dependency.

The `LINKERD2_PROXY_DESTINATION_GET_SUFFIXES` and
`LINKERD2_PROXY_DESTINATION_PROFILE_SUFFIXES` environment variables
control which domains may be discovered via the destination service.

Finally, HTTP settings detection has been moved into a dedicated routing
layer at the "bottom" of the stack. This is done so that
canonicalization and discovery need not be done redundantly for each set
of HTTP settings. Now, HTTP settings only configure the HTTP client
stack within an endpoint.

Fixes linkerd/linkerd2#1798
2018-11-15 11:41:17 -08:00
Sean McArthur 21887e57e4 change Inbound to always use localhost
Signed-off-by: Sean McArthur <sean@buoyant.io>
2018-11-14 15:59:48 -08:00
Oliver Gould fbadd969ce
Remove the `timestamp_request_open` module (#128)
The `timestamp_request_open` module is no longer used.  It can safely be
removed.
2018-11-13 15:35:55 -08:00
Sean McArthur 1595b2457d convert several Stack unit errors into Never
Since these stack pieces will never error, we can mark their
`Error`s with a type that can "never" be created. When seeing an `Error
= ()`, it can either mean the error never happens, or that the detailed
error is dealt with elsewhere and only a unit is passed on. When seeing
`Error = Never`, it is clearer that the error case never happens.
Besides helping humans, LLVM can also remove the error branches entirely.
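
The idea can be sketched with an uninhabited enum. The `Never` definition below is a plausible shape for such a type rather than the proxy's exact code, and `make_service` and `into_ok` are hypothetical helpers.

```rust
use std::fmt;

/// An uninhabited type: it has no variants, so no value of it can exist.
#[derive(Debug)]
enum Never {}

impl fmt::Display for Never {
    fn fmt(&self, _: &mut fmt::Formatter<'_>) -> fmt::Result {
        // An empty match is exhaustive because there are no variants.
        match *self {}
    }
}

/// A hypothetical stack piece whose construction cannot fail.
fn make_service(name: &str) -> Result<String, Never> {
    Ok(format!("svc:{}", name))
}

/// Unwrap without a panic path: the `Err` arm is statically impossible.
fn into_ok<T>(res: Result<T, Never>) -> T {
    match res {
        Ok(value) => value,
        Err(never) => match never {},
    }
}

fn main() {
    println!("{}", into_ok(make_service("web")));
}
```

A reader who sees `Result<_, Never>` knows the error case is impossible, which `Result<_, ()>` cannot express.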

Signed-off-by: Sean McArthur <sean@buoyant.io>
2018-11-13 11:55:41 -08:00
Oliver Gould 00b4009525
Allow routers to be implemented with a closure (#126)
The router's `Recognize` trait is now essentially a function.

This change provides an implementation of `Recognize` over a `Fn` so
that it's possible to implement routers without defining 0-point marker
types that implement `Recognize`.
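
A simplified sketch of such a blanket implementation is shown below. The trait here is a stand-in: the real `Recognize` trait's exact signature and bounds are not given in this log, and `route` is a hypothetical caller.

```rust
/// A simplified stand-in for the router's `Recognize` trait.
trait Recognize<Req> {
    type Target;
    fn recognize(&self, req: &Req) -> Option<Self::Target>;
}

/// Any closure from a request to an optional target is a recognizer.
impl<Req, T, F> Recognize<Req> for F
where
    F: Fn(&Req) -> Option<T>,
{
    type Target = T;

    fn recognize(&self, req: &Req) -> Option<T> {
        (self)(req)
    }
}

/// Generic code only needs the trait bound, not a named marker type.
fn route<R: Recognize<String>>(rec: &R, req: &String) -> Option<R::Target> {
    rec.recognize(req)
}

fn main() {
    let by_len = |req: &String| if req.is_empty() { None } else { Some(req.len()) };
    println!("{:?}", route(&by_len, &"web:8080".to_string()));
}
```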
2018-11-13 09:57:08 -08:00
Oliver Gould 2d2d209e4e
Implement Error for svc::Either (#125)
The `linkerd2_stack::Either` type is used to implement Layer, Stack, and
Service for alternate underlying implementations. However, the Service
implementation requires that both inner services emit the same type of
Error.

In order to allow the underlying types to emit different errors, this
change uses `Either` to wrap the underlying errors, and implements
`Error` for `Either`.
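
A minimal sketch of an `Either` that implements `Error` when both sides do. This mirrors the description above but is not copied from `linkerd2_stack`; the variant names and the error types used in `main` are assumptions.

```rust
use std::error::Error;
use std::fmt;
use std::net::{AddrParseError, IpAddr};
use std::num::ParseIntError;

#[derive(Debug)]
enum Either<A, B> {
    A(A),
    B(B),
}

impl<A: fmt::Display, B: fmt::Display> fmt::Display for Either<A, B> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Delegate to whichever inner error is present.
        match self {
            Either::A(a) => a.fmt(f),
            Either::B(b) => b.fmt(f),
        }
    }
}

impl<A: Error, B: Error> Error for Either<A, B> {}

fn main() {
    // Two alternatives that fail with different error types.
    let left: Either<ParseIntError, AddrParseError> =
        Either::A("x".parse::<u16>().unwrap_err());
    let right: Either<ParseIntError, AddrParseError> =
        Either::B("x".parse::<IpAddr>().unwrap_err());
    let errors: Vec<Box<dyn Error>> = vec![Box::new(left), Box::new(right)];
    for e in &errors {
        println!("{}", e);
    }
}
```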
2018-11-13 09:56:46 -08:00
Oliver Gould 4d3e0abd41
Ensure metrics are not evicted for active routes (#124)
It was possible for a metrics scope to be deregistered for active
routes. This could cause metrics to disappear and never be recorded in
some situations.

This change ensures that metrics are only evicted for scopes that are not
active (i.e. in a router, load balancer, etc).
2018-11-12 18:50:36 -08:00
Oliver Gould d396acda6d
Fall-back to former classification logic with profiles (#123)
With the introduction of profile-based classification, the proxy would
not perform normal gRPC classification in some cases when it could &
should.

This change simplifies our default classifier logic and falls back to
the default grpc-aware behavior whenever another classification cannot
be performed.

Furthermore, this change moves the `proxy::http::classify` module to
`proxy::http::metrics::classify`, as these modules should only be relied
on for metrics classification. Other modules (for instance, retries)
should provide their own abstractions.

Finally, this change fixes a test error-formatting issue.
2018-11-12 18:22:12 -08:00
Oliver Gould c4b3765574
Unify Name/Host/Addr types under Addr (#120)
Currently, the proxy uses a variety of types to represent the logical
destination of a request. Outbound destinations use a `NameAddr` type
which may be either a `DnsNameAndPort` or a `SocketAddr`. Other parts of
the code used a `HostAndPort` enum that always contained a port and also
contained a `Host` which could either be a `dns::Name` or an `IpAddr`.
Furthermore, we coerce these types into a `http::uri::Authority` in many
cases.

All of these types represent the same thing; and it's not clear when/why
it's appropriate to use a given variant.

In order to simplify the situation, a new `addr` module has been
introduced with `Addr` and `NameAddr` types. An `Addr` may
contain either a `NameAddr` or a `SocketAddr`.
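
The shape of the unified type might look roughly like the sketch below. Only the `Addr`/`NameAddr`/`SocketAddr` relationship comes from the message above; the field names, the `parse` helper, and its lenient handling of names (no DNS validation) are assumptions.

```rust
use std::net::SocketAddr;

/// A DNS-like name plus a port (field names are hypothetical).
#[derive(Clone, Debug, PartialEq, Eq)]
struct NameAddr {
    name: String,
    port: u16,
}

/// A logical destination: either a named address or a concrete socket address.
#[derive(Clone, Debug, PartialEq, Eq)]
enum Addr {
    Name(NameAddr),
    Socket(SocketAddr),
}

impl Addr {
    /// Hypothetical parser: try `ip:port` first, then fall back to `name:port`.
    fn parse(s: &str) -> Option<Addr> {
        if let Ok(socket) = s.parse::<SocketAddr>() {
            return Some(Addr::Socket(socket));
        }
        let idx = s.rfind(':')?;
        let (name, port) = (&s[..idx], &s[idx + 1..]);
        if name.is_empty() {
            return None;
        }
        let port = port.parse().ok()?;
        Some(Addr::Name(NameAddr { name: name.to_string(), port }))
    }
}

fn main() {
    println!("{:?}", Addr::parse("10.0.0.1:8080"));
    println!("{:?}", Addr::parse("web.ns.svc.cluster.local:8080"));
}
```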

The `Host` value has been removed from the `Settings::Http1` type,
replaced by a boolean, as it's redundant information stored elsewhere in
the route key.

There is one small change in behavior: The `authority` metrics label is
now omitted only for requests that include an `:authority` or `Host`
with a _name_ (i.e., not an IP address).
2018-11-08 14:49:42 -08:00
Oliver Gould 5e0a15b8a7
Introduce outbound route metrics (#117)
The Destination Profile API---provided by linkerd2-proxy-api v0.1.3---
allows the proxy to discover route information for an HTTP service. As
the proxy processes outbound requests, in addition to doing address
resolution through the Destination service, the proxy may also discover
profiles including route patterns and labels.

When the proxy has route information for a destination, it applies the
RequestMatch for each route to find the first-matching route. The
route's labels are used to expose `route_`-prefixed HTTP metrics (and
each label is prefixed with `rt_`).
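
First-match route selection and label prefixing can be sketched as below. The `Route` struct is hypothetical, and a path prefix stands in for the API's `RequestMatch`, which this log does not describe in detail.

```rust
/// A hypothetical route: a path prefix stands in for a `RequestMatch`.
struct Route {
    path_prefix: &'static str,
    labels: Vec<(&'static str, &'static str)>,
}

/// Find the first route that matches and return its labels with the `rt_`
/// prefix applied. Requests that match no route get no route labels.
fn route_labels(routes: &[Route], path: &str) -> Vec<(String, String)> {
    routes
        .iter()
        .find(|route| path.starts_with(route.path_prefix))
        .map(|route| {
            route
                .labels
                .iter()
                .map(|(k, v)| (format!("rt_{}", k), v.to_string()))
                .collect::<Vec<_>>()
        })
        .unwrap_or_default()
}

fn main() {
    let routes = vec![
        Route { path_prefix: "/api/users", labels: vec![("route", "users")] },
        Route { path_prefix: "/api", labels: vec![("route", "api")] },
    ];
    println!("{:?}", route_labels(&routes, "/api/users/7"));
}
```

Because the first match wins, more specific routes need to be listed before more general ones in this sketch.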

Furthermore, if a route includes ResponseMatches, they are used to
perform classification (i.e. for the `response_total` and
`route_response_total` metrics).

A new `proxy::http::profiles` module implements a router that consumes
routes from an infinite stream of route lists.

The `app::profiles` module implements a client that continually and
repeatedly tries to establish a watch for the destination's routes (with
some backoff).

Route discovery does not _block_ routing; that is, the first request to
a destination will likely be processed before the route information is
retrieved from the controller (i.e. on the default route). Route
configuration is applied in a best-effort fashion.
2018-11-05 16:30:39 -08:00
Oliver Gould 0b6e35857b
Upgrade trust-dns-resolver to 0.10.0 (#118)
2018-11-02 11:28:22 -07:00
Oliver Gould 8fca9ebde2
Only use the classification label for response_total (#116)
As described in https://github.com/linkerd/linkerd2/issues/1832, our eager
classification is too complicated.

This changes the `classification` label to only be used with the `response_total` metric.

The following changes have been made:
1. response_latency metrics only include a status_code and not a classification.
2. response_total metrics include classification labels.
3. transport metrics no longer expose a `classification` label (since it's misleading).
   Now the `errno` label is set to be empty when there is no error.
4. gRPC classification applies only when the request's content type starts
   with `application/grpc+`.

The `proxy::http::classify` APIs have been changed so that classifiers cannot
return a classification before the classifier is fully consumed.
2018-11-01 14:59:44 -07:00
Oliver Gould 19606bd528
refactor: Use a stack-based controller client (#115)
The controller's client is instantiated in the
`control::destination::background` module and is tightly coupled to its
use for address resolution.

In order to share this client across different modules---and to bring it
into line with the rest of the proxy's modular layout---the controller
client is now configured and instantiated in `app::main`. The
`app::control` module includes additional stack modules needed to
configure this client.

Our dependency on tower-buffer has been updated so that buffered
services may be cloned.

The `proxy::reconnect` module has been extended to support a
configurable fixed reconnect backoff; and this backoff delay has been
made configurable via the environment.
2018-11-01 11:21:38 -07:00
Oliver Gould f97239baf0
Use the grpc-status response header for classification (#113)
When a gRPC service fails a request eagerly, before it begins sending a
response, a `grpc-status` header is simply added to the initial response
header (rather than added to trailers).

This change ensures that classification honors these status codes.
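
A small sketch of the lookup order: check the initial response headers for `grpc-status` first, and fall back to trailers. The `Class` enum, the string-keyed maps, and the rule that any non-zero code is a failure are simplifying assumptions, not the proxy's actual classifier.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Class {
    Success,
    Failure,
}

/// Read a numeric `grpc-status` value from a header or trailer map.
fn grpc_status(map: &HashMap<String, String>) -> Option<u32> {
    map.get("grpc-status")?.parse().ok()
}

/// Prefer a status sent eagerly in the response headers, then the trailers.
fn classify(
    headers: &HashMap<String, String>,
    trailers: Option<&HashMap<String, String>>,
) -> Option<Class> {
    let code = grpc_status(headers).or_else(|| trailers.and_then(grpc_status))?;
    // Assumption: status 0 (OK) is a success, anything else a failure.
    Some(if code == 0 { Class::Success } else { Class::Failure })
}

/// Convenience constructor for the examples.
fn headers(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

fn main() {
    // A service that fails eagerly sends the status with its headers.
    let eager = headers(&[("grpc-status", "14")]);
    println!("{:?}", classify(&eager, None));
}
```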

Fixes linkerd/linkerd2#1819
2018-10-30 11:03:31 -07:00