The proxy Dockerfile includes test execution. While the intentions of
this are good, it has unintended consequences: we can ship code linked
with test dependencies.
Because we have other means for testing proxy code (cargo, locally; and
CI runs tests outside of Docker), it is fine to remove these tests.
The proxy is now configured with the CONDUIT_PROXY_METRICS_RETAIN_IDLE
environment variable that dictates the amount of time that the proxy will retain
metrics that have not been updated.
A timestamp is maintained for each unique set of labels, indicating the last time
that the scope was updated. Then, when metrics are read, all metrics older than
CONDUIT_PROXY_METRICS_RETAIN_IDLE are dropped from the stats registry.
A ctx::test_utils module has been added to aid testing.
Fixes#819
Previously, we maintained a map of labels for each metric. Because the same keys are used
in multiple scopes, this causes redundant hashing & map lookup when updating metrics.
With this change, there is now only one map per unique label scope and all of the metrics
for each scope are stored in the value. This makes metrics inserting faster and prepares
for eviction of idle metrics.
The Metric type has been split into Metric, which now only holds metric metadata and is
responsible for printing a given metric, and Scopes which holds groupings of metrics by
label.
The metrics! macro is provided to make it easy to define Metric instances statically.
In order to set up for a refactor that removes the `Metric` type, the
`FmtMetric` trait--implemented by `Counter`, `Gauge`, and
`Histogram`--is introduced to push prometheus formatting down into each
type.
With this change, the `Histogram` type now relies on `Counter` (and its
metric formatting) more heavily.
Enables filtering by one or more namespaces. Table updates are prevented
when the filter menu is open, as table updates will rerender the menu,
unselecting anything the user has selected but not confirmed.
This PR removes the `Arc`s from the various label types in the proxy's
`metrics` modules. This should make the write side of the metrics code
much more efficient (and makes the code much simpler! :D).
This change was particularly easy to implement for the TCP `TransportLabels`
and `TransportCloseLabels`, which consisted of only `struct`s and `enum`s,
and could easily be changed to derive `Copy`.
For protocol-level `RequestLabels`, the request's authority was a `String`,
which still needs to be reference-counted, as the overhead of cloning `String`s
is almost certainly worse than that added by ref-counting. However, rather than
adding an additional `Arc<str>`, I changed `RequestLabels` to store the
authority as a `http::uri::Authority`, which is backed by a `ByteStr` and thus
already ref-counted. Now, when constructing `RequestLabels`, we just take
another reference to the `Authority` already stored in the request context.
Since `Authority` implements `fmt::Display` already, formatting the labels
still works.
`ResponseLabels` already store the `DstLabels` string in an `Arc`, so no
additional changes there were necessary. By removing the outer `Arc` around
`ResponseLabels`, we now only have to ref-count the portion of the label type
that would actually be inefficient to clone.
@olix0r ran the benchmarks from #874 against this branch, and it seems to be
a small but noticeable improvement:
```
test record_many_dsts ... bench: 151,076 ns/iter (+/- 182,151)
test record_one_conn_request ... bench: 1,599 ns/iter (+/- 209)
test record_response_end ... bench: 676 ns/iter (+/- 144)
```
before:
```
test record_many_dsts ... bench: 158,403 ns/iter (+/- 130,241)
test record_one_conn_request ... bench: 1,823 ns/iter (+/- 1,408)
test record_response_end ... bench: 547 ns/iter (+/- 70)
```
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Conduit 0.4.1 contained some rough edges in the Grafana deployment.
This PR include the following:
- bump Grafana to 5.1.0
- fix deployment and rc graphs when no data present
- fix some text sections overlapping due to scrolling
Fixes#705
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Before changing the telemetry implementation, we should have a means to
understand the impacts of such changes.
To run, you must use a nightly toolchain:
```
rustup run nightly cargo bench -p conduit-proxy -- record
```
This PR adds the unit tests for the proxy metrics module's Histogram
implementation that I wrote in #775 to @olix0r's Histogram implementation
added in #868. The tests weren't too difficult to adapt for the new code,
and everything seems to work correctly!
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
In order to support histograms measured in, for instance, microseconds,
Histogram should blind store integers without being aware of the unit.
In order to accomplish this, we make `Histogram` generic over a `V:
Into<u64>`, such that all values added to the histogram must be of type
`V`.
In doing this, we also make the histogram buckets configurable, though
we maintain the same defaults used for latency values.
The `Histogram` type has been moved to a new module, and the `Bucket`
and `Bounds` helper types have been introduced to help make histogram
logic clearer and latency-agnostic.
In case there are any errors while peeking the connection to do protocol
detection, the sensors will now be in place to detect them. Besides just
errors, this will also allow reporting about connections that are
accepted, but then immediately closed.
Additionally:
- add write_buf implementation for Transport sensor, can help
performance for http1/http2
- add better logs for tcp connections errors
- add printlns for when tests fail
Signed-off-by: Sean McArthur <sean@seanmonstar.com>
In preparation for a larger metrics refactor, this change splits the
Counter and Gauge types into their own modules.
Furthermore, this makes the minor change to these types: incr() and
decr() no longer return `self`. We were not actually ever using the
returned self references, and I find the unit return type to more
obviously indicate the side-effecty-ness of these calls. #smpfy
* Add a namespace column to the metrics tables, support long resource names
* Add a test for GrafanaLink
* Change the PodList.jsx component to not use the ListPods api
This PR adds a description of the transport level (TCP) metrics
that the Conduit proxy now exposes as of 6ad0960.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
We removed individual Deployment pages a while ago, but left the autocomplete search bar in. Clicking on searches goes to a 404 because we don't have /deployment any more.
This will be revisited in the future with direct links to grafana dashboards to all the
resources we support.
This PR adds the TCP metrics added in #785 and #790 to the Grafana deployment dashboards. I've added three new charts in the "Inbound Traffic" and "Outbound Traffic" headings:
+ "TCP Connection Failures": plots the number of failed TCP connections over time
+ "TCP Connections Open": shows the number of accepted and opened connections currently open
+ "TCP Connection Duration": a heatmap of connection durations over time
I'm planning on adding similar graphs to other dashboards as well in subsequent PRs.
* Add a Replication Controllers page in the Web UI
@siggy pointed out that we don't need to use the PodsList api any more, since the new stats endpoint (#671) includes meshedPodCount and totalPodCount, which is all we need to determine whether the deployment/rc has been added to the mesh (which is what we were using ListPods to determine).
This PR modifies deployments to not use the pods api any more, and adds a Replication Controllers page. This page is quite similar to the Deployments page in logic, so I've made a PodOwnersList component to share the code.
I haven't added Replication Controllers to the Service Mesh page yet, because that page does require a list of component pods. Also, we don't need the calls to Prometheus for the Service Mesh page, so I don't want to use the existing stat apis for it. I figure that is a large enough change for a separate PR.
After this was implemented we found that ExternalName services are
represented in DNS as CNAMEs, which means that the proxy's DNS
fallback logic can be used instead of doing DNS in the control
plane. Besides simplifying the controller, this will also increase
fidelity with the proxied pods' DNS configuration (improve
transparency).
Signed-off-by: Brian Smith <brian@briansmith.org>
Previously, the proxy exposed separate _accept_ and _connect_ metrics
for some metric types, but not for all. This leads to confusing
aggregations, particularly for read and write taotals.
This change primarily introduces the `peer` prometheus label (with
possible values _src_ or _dst_) to indicate which side of the proxy the
metric reflects.
Additionally, the `received_bytes` and `sent_bytes` metrics have been
renamed as `tcp_read_bytes_total` and `tcp_write_bytes_total`,
resectively. This more naturally fits into existing idioms. Stream
classification is not applied to these metrics, as we plan to increment
them throughout stream lifetime and not only on close.
The `tcp_connections_open` metric has also been renamed to
`tcp_open_connections` to reflect Prometheus idioms.
Finally, `msg1` and `msg2` have been constified in telemetry test
fixtures so that tests are somewhat easier to read.
trust-dns-resolver is a more complete implementation. In particular,
it supports CNAMES correctly, which is needed for PR #764. It also
supports /etc/hosts, which will help with issue #62.
Use the 0.8.2 pre-release since it hasn't been released yet. It was
created at our request.
Signed-off-by: Brian Smith <brian@briansmith.org>
The `conduit tap` command is now deprecated.
Replace `conduit tap` with `connduit tapByResource`. Rename tapByResource
to tap. The underlying protobuf for tap remains, the tap gRPC endpoint now
returns Unimplemented.
Fixes#804
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Fixes#846
The proxy `metrics_compression` test contained an assertion that a compressed scrape contained the `request_duration_ms_count` metric. This was chosen completely arbitrarily, and was only intended as an assertion that metrics were updated between compressed scrapes. Unfortunately, that metric was removed in d9112abc93, so when #665 merged to master, this test broke. CI didn't catch this since we don't build merges for PRs --- we should probably (re)enable this in Travis?
This PR fixes the test to assert on a metric that wasn't removed. Sorry for the ❌s!
The TapByResource command now has access to destination labels from the
proxy, but was not outputting them on the cli.
Modify the TapByResource output to print the destination pod label,
rather than the ip, when available.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
public-api and and tap were both using their own implementations of
the Kubernetes Informer/Lister APIs.
This change factors out all Informer/Lister usage into the Lister
module. This also introduces a new `Lister.GetObjects` method.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Closes#598.
According to the Prometheus documentation, metrics export endpoints should support serving metrics compressed using GZIP. I've modified the proxy's `/metrics` endpoint to serve metrics compressed with GZIP when an `Accept-Encoding: gzip` request header is sent.
I've also added a new unit test that attempts to get the proxy's metrics endpoint as GZIP, and asserts that the metrics are decompressed successfully.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
The `controller` part of the proxy will now use a default, removing the
need to pass the exact same `controller::new().run()` in every test
case.
The TCP server and client will include their socket addresses in some
panics.
Signed-off-by: Sean McArthur <sean@seanmonstar.com>
The Kubernetes client-go Informer/Lister APIs are implemented in several
parts of the code base.
This change introduces a Lister module, providing Informer/Lister
capability through a simple interface. Once this merges, we can follow
up with moving public-api and tap onto Lister.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The TapByResource endpoint was previously a stub.
Implement end-to-end tapByResource functionality, with support for
specifying any kubernetes resource(s) as target and destination.
Fixes#803, #49
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
This PR removes the unused `request_duration_ms` and `response_duration_ms` histogram metrics from the proxy. It also removes them from the `simulate-proxy` script's output, and from `docs/proxy-metrics.md`
Closes#821
The Tap command leveraged new cli parsing code, enabling Kubernetes
resources specified as `(TYPE [NAME] | TYPE/NAME)`. The Stat command
did not use this.
Modify the Stat command to use the same cli flag parsing code as Tap.
Remove the to/from-resource flags from Stat.
Fixes#792
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Fixes#831.
Proxy metrics tests `transport::inbound_tcp_accept` and `transport::inbound_tcp_duration` are known to be flaky and should be ignored on CI. Note that the outbound versions of these tests were already marked as flaky, so this was almost certainly either an oversight or the result of an incorrect merge.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
The refactoring of how metrics are formatted in 674ce87588 inadvertently introduced a bug that caused the `process_start_time_seconds` metric to be formatted as just a number without the metric name. This causes Prometheus to fail with a parse error rather than accepting the metrics.
I've fixed this issue, and added a unit test to detect regressions in the future.
This PR makes two changes to the `simulate-proxy` script:
1. Removed the `protocol={"http", "tcp"}` label from TCP metrics. The proxy no longer adds this label (see https://github.com/runconduit/conduit/pull/785#discussion_r182563499).
2. Fixed failed responses being labeled with `classification="fail"` rather than `classification="failure"` (the label the proxy sets). I noticed that while I was here and decided to fix it as well.
Note that the first change required some minor changes to the `proxyMetricCollectors` struct in `simulate-proxy`; since the label cardinality for TCP open stats decreased by one due to removing the `protocol` label, it's no longer necessary for that struct to `haveCounterVec`/`GaugeVec` pointers for these stats. It now owns the actual `Counter`/`Gauge` instead. This means that the metric vecs that are created to be labeled for `inbound` and `outbound` are now stored as variables in the `newSimulatedProxy` function rather than going in a `proxyMetricCollectors` struct first. This shouldn't impact behaviour at all.
This PR adds a `classification` label to transport level metrics collected on transport close. Like the `classification` label on HTTP response metrics, the value may be either `"success"` or `"failure"`. The label value is determined based on the `clean` field on the `TransportClose` event, which indicates whether a transport closed cleanly or due to an error.
I've updated the tests for transport-level metrics to reflect the addition of the new label. I'd like to also modify the test support code to allow us to close transports with errors, in order to test that the errors are correctly classified as failures.
The top-line single stat numbers were not calculated properly, resulting
in inflated counts.
Modify the underlying Prometheus queries to ensure accurate counts of
Deployments, Pods, and Namespaces.
Fixes#801.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
There have been a number of performance improvements and bug fixes since
v2.1.0.
Bump our Prometheus container to v2.2.1.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Now, the tap server may specify that requests should be matched by destination
label.
For example, if the controller's Destination service returns the labels:
`{"service": "users", "namespace": "prod"}` for an endpoint, then tap would be
able to specify a match like `namespace=prod` to match requests destined to
that namespace.
Conduit's Grafana currencly provides Top-line, Deployment, Pod, and Mesh
Health dashboards.
This change adds a new Conduit Service dashboard. In addition to
top-line information, this dashboards focuses primarily on requests to a
Service, as only dst_service is available in our metrics.
Part of #706
Signed-off-by: Andrew Seigner <siggy@buoyant.io>