Fixes#846
The proxy `metrics_compression` test contained an assertion that a compressed scrape contained the `request_duration_ms_count` metric. This was chosen completely arbitrarily, and was only intended as an assertion that metrics were updated between compressed scrapes. Unfortunately, that metric was removed in d9112abc93, so when #665 merged to master, this test broke. CI didn't catch this since we don't build merges for PRs --- we should probably (re)enable this in Travis?
This PR fixes the test to assert on a metric that wasn't removed. Sorry for the ❌s!
Closes#598.
According to the Prometheus documentation, metrics export endpoints should support serving metrics compressed using GZIP. I've modified the proxy's `/metrics` endpoint to serve metrics compressed with GZIP when an `Accept-Encoding: gzip` request header is sent.
I've also added a new unit test that attempts to get the proxy's metrics endpoint as GZIP, and asserts that the metrics are decompressed successfully.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
The `controller` part of the proxy will now use a default, removing the
need to pass the exact same `controller::new().run()` in every test
case.
The TCP server and client will include their socket addresses in some
panics.
Signed-off-by: Sean McArthur <sean@seanmonstar.com>
This PR removes the unused `request_duration_ms` and `response_duration_ms` histogram metrics from the proxy. It also removes them from the `simulate-proxy` script's output, and from `docs/proxy-metrics.md`
Closes#821
Fixes#831.
Proxy metrics tests `transport::inbound_tcp_accept` and `transport::inbound_tcp_duration` are known to be flaky and should be ignored on CI. Note that the outbound versions of these tests were already marked as flaky, so this was almost certainly either an oversight or the result of an incorrect merge.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
The refactoring of how metrics are formatted in 674ce87588 inadvertently introduced a bug that caused the `process_start_time_seconds` metric to be formatted as just a number without the metric name. This causes Prometheus to fail with a parse error rather than accepting the metrics.
I've fixed this issue, and added a unit test to detect regressions in the future.
This PR adds a `classification` label to transport level metrics collected on transport close. Like the `classification` label on HTTP response metrics, the value may be either `"success"` or `"failure"`. The label value is determined based on the `clean` field on the `TransportClose` event, which indicates whether a transport closed cleanly or due to an error.
I've updated the tests for transport-level metrics to reflect the addition of the new label. I'd like to also modify the test support code to allow us to close transports with errors, in order to test that the errors are correctly classified as failures.
This branch adds all the transport-level Prometheus metrics as described in #742, with the exception of the `tcp_connections_open` gauge (to be added in a subsequent branch).
A brief description of the metrics added in this branch:
* `tcp_accept_open_total`: counter of the number of connections accepted by the proxy
* `tcp_accept_close_total`: counter of the number of accepted connections that have closed
* `tcp_connect_open_total`: counter of the number of connections opened by the proxy
* `tcp_connect_close_total`: counter of the number of connections opened by the proxy that have been closed.
* `tcp_connection_duration_ms`: histogram of the total duration of each TCP connection (incremented on connection close)
* `sent_bytes`: counter of the total number of bytes sent on TCP connections (incremented on connection close)
* `received_bytes`: counter of the total number of bytes received on TCP connections (incremented on connection close)
These metrics are labeled with the direction (inbound or outbound) and whether the connection was proxied as raw TCP or corresponds to an HTTP request.
Additionally, I've added several proxy tests for these metrics. Note that there are some cases which are currently untested; in particular, while there are tests for the `tcp_accept_close_total` counter, it's more difficult to test the `tcp_connect_close_total` counter, due to connection pooling. I'd like to improve the tests for this code in additional branches.
Currently, the mock controller, which is used in tests, takes all of its
updates a priori, which makes it hard to control when an update occurs within a
test.
Now, the controller exposes a `DstSender`, which wraps an unbounded channel of
destination updates. This allows tests to trigger updates at a specific point
in the test.
In order to accomplish this, the controller's hand-rolled gRPC server
implementation has been discarded in favor of a real gRPC destination service.
This requires that the `controller-grpc` project now builds both clients
and servers for the destination service. Additionally, we now build a tap
client as well (assuming that we'll want to write tests against our tap
server).
This is a fairly minor refactor to the proxy telemetry tests. b07b554d2b added a `Fixture` in the Destination service labeling tests added in #661 to reduce the repetition of copied and pasted code in those tests. I've refactored most of the other telemetry tests to also use the test fixture. Significantly less code is copied and pasted now.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
The tests for label metadata updates from the control plane are flaky on CI. This is likely due to the CI containers not having enough cores to execute the test proxy thread, the test proxy's controller client thread, the mock controller thread, and the test server thread simultaneously --- see #751 for more information.
For now, I'm ignoring these on CI. Eventually, I'd like to change the mock controller code in test support so that we can trigger it to send a second metadata update only after the request has finished.
I think this issue also makes merging #738 a higher priority, so that we can still have some tests running on CI that exercise some part of the label update behaviour.
PR #654 adds pod-based metric labels to the Destination API responses for cluster-local services.
This PR modifies the proxy to actually add these labels to reported Prometheus metrics for outbound requests to local services.
It enhances the proxy's `control::discovery` module to track these labels and add a `LabelRequest` middleware to the service stack built in `Bind` for labeled services. Requests transiting `LabelRequest` are given an `Extension` which contains these labels, which are then added to events produced by the `Sensors` for these requests. When these events are aggregated to Prometheus metrics, the labels are added.
I've also added some tests in `test/telemetry.rs` ensuring that these metrics are added correctly when the Destination service provides labels.
Closes#660
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
This PR adds a `classification` label to proxy response metrics, as @olix0r described in https://github.com/runconduit/conduit/issues/634#issuecomment-376964083. The label is either "success" or "failure", depending on the following rules:
+ **if** the response had a gRPC status code, *then*
- gRPC status code 0 is considered a success
- all others are considered failures
+ **else if** the response had an HTTP status code, *then*
- status codes < 500 are considered success,
- status codes >= 500 are considered failures
+ **else if** the response stream failed **then**
- the response is a failure.
I've also added end-to-end tests for the classification of HTTP responses (with some work towards classifying gRPC responses as well). Additionally, I've updated `doc/proxy_metrics.md` to reflect the added `classification` label.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
The metrics endpoint tests are flaky because there are no guarantees
that the metrics pipeline has processed events before the metrics
endpoint is read. This can cause CI to fail spuriously.
Disable these tests from running in CI until #613 is resolved.
This PR adds the `request_duration_ms` metric to the Prometheus metrics exported by the proxy. It also modifies the `request_total` metric so that it is incremented when a request stream finishes, rather than when it opens, for consistency with how the `response_total` metric is generated.
Making this change required modifying `telemetry::sensors::http` to generate a `StreamRequestEnd` event similar to the `StreamResponseEnd` event. This is done similarly to how sensors are added to response bodies, by generalizing the `ResponseBody` type into a `MeasuredBody` type that can wrap a request or response body. Since this changed the type of request bodies, it necessitated changing request types pretty much everywhere else in the proxy codebase in order to fix the resulting type errors, which is why the diff for this PR is so large.
Closes#570
Fixes#600
The proxy metrics endpoint has a bug where metrics recorded in the outbound direction can contain two commas in a row when no outbound label is present. This occurs because the code for formatting the outbound direction label mistakenly assumed that there would always be a destination pod owner label as well, but the proxy isn't currently aware of the destination's pod owner (waiting for #429).
I've fixed this issue by moving the place where the comma is output from the `fmt::Display` impl for `RequestLabels` to the `fmt::Display` impl for `OutboudnLabels`. This way, the comma between the `direction` and `dst_*` labels is only output when the `dst_*` label is present.
This bug made it to master since all of the proxy end-to-end tests for metrics only test the inbound router. I've rectified this issue by adding tests on the outbound router as well (which would fail against the current master due to the double comma bug). I've also added a test that asserts there are no double commas in exported metrics, to protect against regressions to this bug.
This PR adds an endpoint to the proxy that serves metrics in Prometheus' text exposition format. The endpoint currently serves the `request_total`, `response_total`, `response_latency_ms`, and `response_duration_ms metrics`, as described in #536. The endpoint's port and address are configurable with the `CONDUIT_PROXY_METRICS_LISTENER` environment variable.
Tests have been added in t`ests/telemetry.rs`
Stop initializing env_logger in every test. In env_logger 0.5, it
may only be initialized once per process.
Also, Prost will soon upgrade to env_logger 0.5 and this will
(eventually) help reduce the number of versions of env_logger we
have to build. Turning off the regex feature will (eventually) also
reduce the number of dependencies we have to build. Unfortunately,
as it is now, the number of dependencies has increased because
env_logger increased its dependencies in 0.5.
Signed-off-by: Brian Smith <brian@briansmith.org>
This PR adds a `flaky_tests` cargo feature to control whether or not to ignore tests that are timing-dependent. This feature is enabled by default in local builds, but disabled on CI and in all Docker builds.
Closes#440
The proxy will check that the requested authority looks like a local service, and if it doesn't, it will no longer ask the Destination service about the request, instead just using the SO_ORIGINAL_DST, enabling egress naturally.
The rules used to determine if it looks like a local service come from this comment:
> If default_zone.is_none() and the name is in the form $a.$b.svc, or if !default_zone.is_none() and the name is in the form $a.$b.svc.$default_zone, for some a and some b, then use the Destination service. Otherwise, use the IP given.
Removed the `method` label from Prometheus, and removed HTTP methods from reports. Removed `StreamSummary` from reports and replaced it with a `u32` count of streams.
Closes#266
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Follow-up from #315.
Now that the UIs don't report per-path metrics, we can remove the path label from Prometheus, the path aggregation and filtering options from the telemetry API, and the path field from the proxy report API.
I've modified the tests to no longer expect the removed fields, and manually verified that Conduit still works after making these changes.
Closes#265
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
The proxy currently stores latency values in an `OrderMap` and reports every observed latency value to the controller's telemetry API since the last report. The telemetry API then sends each individual value to Prometheus. This doesn't scale well when there are a large number of proxies making reports.
I've modified the proxy to use a fixed-size histogram that matches the histogram buckets in Prometheus. Each report now includes an array indicating the histogram bounds, and each response scope contains a set of counts corresponding to each index in the bounds array, indicating the number of times a latency in that bucket was observed. The controller then reports the upper bound of each bucket to Prometheus, and can use the proxy's reported set of bucket bounds so that the observed values will be correct even if the bounds in the control plane are changed independently of those set in the proxy.
I've also modified `simulate-proxy` to generate the new report structure, and added tests in the proxy's telemetry test suite validating the new behaviour.
Response End events were only triggered after polling the trailers of
a response, but when the Response is given to a hyper h1 server, it
doesn't know about trailers, so they were never polled!
The fix is that the `BodyStream` glue will now poll the wrapped body for
trailers after it sees the end of the data, before telling hyper the
stream is over. This ensures a ResponseEnd event is emitted.
Includes a proxy telemetry test over h1 connections.
We’ve built Conduit from the ground up to be the fastest, lightest,
simplest, and most secure service mesh in the world. It features an
incredibly fast and safe data plane written in Rust, a simple yet
powerful control plane written in Go, and a design that’s focused on
performance, security, and usability. Most importantly, Conduit
incorporates the many lessons we’ve learned from over 18 months of
production service mesh experience with Linkerd.
This repository contains a few tightly-related components:
- `proxy` -- an HTTP/2 proxy written in Rust;
- `controller` -- a control plane written in Go with gRPC;
- `web` -- a UI written in React, served by Go.