The `tcp_connect_err` tests do not pass on Linux.
I initially observed that the errno produced on Linux is ECONNREFUSED,
not EXFULL, but even after accounting for this, the tests do not pass
reliably, as the socket teardown semantics appear to differ slightly.
To prevent spurious test failures during development, this change marks
these two tests as macOS-specific.
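A minimal sketch of restricting a test to macOS with a `cfg` attribute; the test name and body below are placeholders, not the actual `tcp_connect_err` tests:

```rust
#[test]
#[cfg(target_os = "macos")]
fn tcp_connect_err_refused() {
    // Exercise a refused TCP connection and assert on the resulting
    // telemetry; this test is only compiled and run on macOS.
}
```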
When the destination service returns a hint that an endpoint is another
proxy, eligible HTTP/1 requests are translated into HTTP/2 and sent over
an HTTP/2 connection. The original protocol details are encoded in a
header, `l5d-orig-proto`. When a proxy receives an inbound HTTP/2
request with this header, the request is translated back into its HTTP/1
representation before being passed to the internal service.
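A minimal sketch of the idea using the `http` crate; the header name comes from the description above, but the value encoding and function names here are illustrative:

```rust
use http::{header::HeaderValue, Request, Version};

// Outbound side: record the original protocol before upgrading to HTTP/2.
fn upgrade<B>(mut req: Request<B>) -> Request<B> {
    let orig = match req.version() {
        Version::HTTP_10 => "HTTP/1.0",
        _ => "HTTP/1.1",
    };
    req.headers_mut()
        .insert("l5d-orig-proto", HeaderValue::from_static(orig));
    *req.version_mut() = Version::HTTP_2;
    req
}

// Inbound side: strip the header and restore the original HTTP/1 version.
fn downgrade<B>(mut req: Request<B>) -> Request<B> {
    if let Some(orig) = req.headers_mut().remove("l5d-orig-proto") {
        *req.version_mut() = if orig == "HTTP/1.0" {
            Version::HTTP_10
        } else {
            Version::HTTP_11
        };
    }
    req
}
```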
Signed-off-by: Sean McArthur <sean@buoyant.io>
Required for linkerd/linkerd2#1322.
Currently, the proxy places a limit on the number of active routes
in the route cache. This limit defaults to 100 routes, and is intended
to prevent the proxy from requesting more than 100 lookups from the
Destination service.
However, in some cases, such as Prometheus scraping a large number of
pods, the proxy hits this limit even though none of those requests
actually result in requests to service discovery (since Prometheus
scrapes pods by their IP addresses).
This branch implements @briansmith's suggestion in
https://github.com/linkerd/linkerd2/issues/1322#issuecomment-407161829.
It splits the router capacity limit into two separate, configurable
limits: one that sets an upper bound on the number of concurrently
active Destination lookups, and one that limits the capacity of the
router cache.
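For illustration, the split might be configured along these lines; the environment variable names and defaults below are placeholders, not the proxy's actual configuration keys:

```rust
use std::env;

// Sketch of the two separate limits.
struct RouterConfig {
    /// Maximum number of routes kept in the router cache.
    cache_capacity: usize,
    /// Maximum number of concurrently active Destination service lookups.
    max_dst_queries: usize,
}

fn load_config() -> RouterConfig {
    let parse = |key: &str, default: usize| {
        env::var(key)
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(default)
    };
    RouterConfig {
        cache_capacity: parse("EXAMPLE_ROUTER_CAPACITY", 100),
        max_dst_queries: parse("EXAMPLE_DST_QUERY_CONCURRENCY", 100),
    }
}
```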
I've done some preliminary testing using the `lifecycle` tests, where a
single Prometheus instance is configured to scrape a very large number
of proxies. In these tests, neither limit is reached. Furthermore, I've added
integration tests in `tests/discovery` to exercise the destination service
query limit. These tests ensure that query capacity is released when inactive
routes that created queries are evicted from the router cache, and that the
limit does _not_ affect DNS queries.
This branch obsoletes and closes #27, which contained an earlier version of
these changes.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Both `tcp_connect_err` tests frequently fail, even during local
development. This seems to happen because the proxy doesn't necessarily
observe the socket closure.
Instead of shutting down the socket gracefully, we can just drop it!
This helps the tests pass much more reliably.
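A minimal sketch of the difference in the test support code; the function names here are illustrative:

```rust
use std::net::{Shutdown, TcpStream};

// The previous approach: explicitly shut down both halves of the connection.
fn close_gracefully(sock: TcpStream) -> std::io::Result<()> {
    sock.shutdown(Shutdown::Both)
}

// What this change does instead: simply drop the handle, which closes the
// descriptor immediately.
fn close_by_dropping(sock: TcpStream) {
    drop(sock);
}
```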
The proxy's telemetry system is implemented with a channel: the proxy thread
generates events and the control thread consumes these events to record
metrics and satisfy Tap observations. This design was intended to minimize
latency overhead in the data path.
However, this design leads to substantial CPU overhead: the control thread's
work scales with the proxy thread's work, leading to resource contention in
busy, resource-limited deployments. This design also has allocation-related
drawbacks and makes it difficult to implement planned features such as
payload-aware Tapping.
This change removes the event channel so that all telemetry is recorded
instantaneously in the data path, setting up for further simplifications so
that, eventually, the metrics registry properly uses service lifetimes to
support eviction.
This change has a potentially negative side effect: metrics scrapes acquire
the same lock that the data path uses to write metrics, so heavy traffic to
the metrics server can directly impact proxy latency. These effects will be
ameliorated by future changes that reduce the need for the Mutex in the
proxy thread.
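For illustration, the post-change shape looks roughly like this: the data path updates a shared registry in place, and the metrics server takes the same lock to render a scrape. The names and structure below are simplified, not the proxy's actual types:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type Labels = String;

#[derive(Default)]
struct Registry {
    request_total: HashMap<Labels, u64>,
}

#[derive(Clone, Default)]
struct Metrics(Arc<Mutex<Registry>>);

impl Metrics {
    // Called inline on the proxy thread for each request.
    fn record_request(&self, labels: &str) {
        let mut reg = self.0.lock().unwrap();
        *reg.request_total.entry(labels.to_string()).or_insert(0) += 1;
    }

    // Called by the metrics server; contends on the same lock as the data path.
    fn render(&self) -> String {
        let reg = self.0.lock().unwrap();
        reg.request_total
            .iter()
            .map(|(labels, v)| format!("request_total{{{}}} {}\n", labels, v))
            .collect()
    }
}
```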
Specifically, proxied bodies would hit an optimization in hyper that left the
connection unaware that the body had finished (the information was tracked
internally but never propagated), so the connection was closed instead of
being kept alive. hyper 0.12.7 includes the fix.
As part of this upgrade, the keep-alive tests have been adjusted to send
a small body, since an empty body did not trigger this case.
This branch adds a label to the metrics generated for TCP connection
failures, displaying the Unix error code name (or the raw error code on
other operating systems, or when the code is not recognized).
It also adds a couple of tests for label formatting.
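For illustration, the mapping might look like this sketch, which assumes the `libc` crate and covers only a few codes:

```rust
use std::io;

// Label a connect error with its errno name, falling back to the raw code
// when the error isn't one we recognize.
fn errno_label(err: &io::Error) -> String {
    match err.raw_os_error() {
        Some(libc::ECONNREFUSED) => "ECONNREFUSED".to_string(),
        Some(libc::ECONNRESET) => "ECONNRESET".to_string(),
        Some(libc::ETIMEDOUT) => "ETIMEDOUT".to_string(),
        Some(code) => code.to_string(),
        None => "UNKNOWN".to_string(),
    }
}
```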
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
The `Backoff` service wrapper is used for the controller client service
so that if the proxy can't reach the controller (there is a connection
error), it doesn't keep retrying in a tight loop, but instead waits a
couple of seconds before trying again, on the assumption that the control
plane is rebooting.
When "backing off", a timer would be set, but it wasn't polled, so the
task was never registered to wake up after the delay. This turns out to
not have been a problem in practice, since the background destination
task was joined with other tasks that were constantly waking up,
allowing it to try again anyway.
To make this testable, a new `ENV_CONTROL_BACKOFF_DELAY` config value
has been added, so that the tests don't have to wait the default 5
seconds.
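A minimal sketch of the fix in current Tokio terms (the proxy at the time used futures 0.1; the names and structure here are illustrative, not the actual `Backoff` implementation):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Duration;

use tokio::time::{sleep, Sleep};

/// Illustrative backoff state: after a connect error, wait `delay` before
/// reporting readiness again.
struct Backoff {
    delay: Duration,
    waiting: Option<Pin<Box<Sleep>>>,
}

impl Backoff {
    fn start(&mut self) {
        self.waiting = Some(Box::pin(sleep(self.delay)));
    }

    fn poll_backoff(&mut self, cx: &mut Context<'_>) -> Poll<()> {
        if let Some(timer) = self.waiting.as_mut() {
            // The bug: the timer was created but never polled, so no waker was
            // registered and the task was never rescheduled. Polling it here
            // registers the wakeup for when the delay elapses.
            match timer.as_mut().poll(cx) {
                Poll::Pending => return Poll::Pending,
                Poll::Ready(()) => self.waiting = None,
            }
        }
        Poll::Ready(())
    }
}
```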
Signed-off-by: Sean McArthur <sean@buoyant.io>
There are various comments, examples, and documentation that refer to
Conduit. This change replaces or removes these references.
CONTRIBUTING.md has been updated to refer to GOVERNANCE/MAINTAINERS.
Fixes linkerd/linkerd2#1276.
Currently, metrics with the `tls="no_identity"` label are duplicated.
This is because that label is generated from the `tls_status` label on
the `TransportLabels` struct, which is either `Some(())` or a
`ReasonForNoTls`. `ReasonForNoTls` has a
variant, `ReasonForNoTls::NoIdentity`, which contains a
`ReasonForNoIdentity`, but when we format that variant as a label, we
always just produce the string "no_identity", regardless of the value of
the `ReasonForNoIdentity`.
However, label types are _also_ used as hash map keys into the map that
stores the metrics scopes, so although two instances of
`ReasonForNoTls::NoIdentity` with different `ReasonForNoIdentity`s
produce the same formatted label output, they aren't _equal_, since that
field differs, so they correspond to different metrics.
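For illustration, the mismatch between formatting and equality boils down to something like the following; the variants and metric name below are simplified placeholders:

```rust
use std::collections::HashMap;
use std::fmt;

// The label type hashes on the full variant, but formats several distinct
// values identically.
#[derive(Clone, PartialEq, Eq, Hash)]
enum ReasonForNoIdentity {
    NotHttp,
    NoPeerName,
}

#[derive(Clone, PartialEq, Eq, Hash)]
enum TlsStatus {
    Some,
    NoIdentity(ReasonForNoIdentity),
}

impl fmt::Display for TlsStatus {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TlsStatus::Some => f.write_str("tls=\"true\""),
            // Every `NoIdentity` value renders the same label text...
            TlsStatus::NoIdentity(_) => f.write_str("tls=\"no_identity\""),
        }
    }
}

fn main() {
    let mut scopes: HashMap<TlsStatus, u64> = HashMap::new();
    *scopes.entry(TlsStatus::Some).or_default() += 1;
    // ...but distinct `NoIdentity` values are distinct hash-map keys, so two
    // separate scopes are created and the rendered output contains two
    // `tls="no_identity"` lines.
    *scopes.entry(TlsStatus::NoIdentity(ReasonForNoIdentity::NotHttp)).or_default() += 1;
    *scopes.entry(TlsStatus::NoIdentity(ReasonForNoIdentity::NoPeerName)).or_default() += 1;
    for (labels, count) in &scopes {
        println!("tcp_open_total{{{}}} {}", labels, count);
    }
}
```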
This branch resolves this issue by adding an additional label to these
metrics, based on the `ReasonForNoIdentity`. Now, the separate lines in
the metrics output that correspond to each `ReasonForNoIdentity` have a
label differentiating them from each other.
Note that the `NotImplementedForTap` and `NotImplementedForMetrics`
reasons will currently never show up in metrics labels, since we don't
gather metrics from the tap and metrics servers.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>