Building on #796, this creates a new `Endpoint` type that wraps `SocketAddr`.
Still, no functional change has been introduced, but this sets up to move
destination labels into the bind stack directly (by adding the labels watch to
the `Endpoint` type).
Currently, the mock controller, which is used in tests, takes all of its
updates a priori, which makes it hard to control when an update occurs within a
test.
Now, the controller exposes a `DstSender`, which wraps an unbounded channel of
destination updates. This allows tests to trigger updates at a specific point
in the test.
In order to accomplish this, the controller's hand-rolled gRPC server
implementation has been discarded in favor of a real gRPC destination service.
This requires that the `controller-grpc` project now builds both clients
and servers for the destination service. Additionally, we now build a tap
client as well (assuming that we'll want to write tests against our tap
server).
Previously, `Bind` required that it bind to `SocketAddr` (and `SocketAddr`
only). This makes it hard to pass additional information from service discovery
into the client's stack.
To resolve this, `Bind` now has an additional `Endpoint` trait-generic type,
and `Bind::bind` accepts an `Endpoint` rather than a `SocketAddr`.
No additional endpoints have been introduced yet. There are no functional
changes in this refactor.
This changes the public api to have a new rpc type, `TapByResource`.
This api supersedes the Tap api. `TapByResource` is richer, more closely
reflecting the proxy's capabilities.
The proxy's Tap api is extended to select over destination labels,
corresponding with those returned by the Destination api.
Now both `Tap` and `TapByResource`'s responses may include destination
labels.
This change avoids breaking backwards compatibility by:
* introducing the new `TapByResource` rpc type, opting not to change Tap
* extending the proxy's Match type with a new, optional, `destination_label` field.
* `TapEvent` is extended with a new, optional, `destination_meta`.
Currently, the request open timestamp, which is used for calculating latency, is captured in the `sensor::http::Http` middleware. However, the sensor middleware is placed fairly low in the stack, below some of the proxy's components that can add measurable latency (e.g. the router).
This PR moves the request_open timestamp out of the `Http` middleware and into a new `TimestampRequestOpen` middleware, which is installed at the top of the stack (before the router). The `TimestampRequestOpen` middleware adds the timestamp as a request extension, so that it can later be consumed by the `Http` sensor to generate the request stats.
By moving the timestamping to the top of the stack, the timestamp should more accurately cover the overhead of the proxy, but a majority of the telemetry work can still be done where it was previously.
I'd like to have included unit tests for this change, but since the expected improvement is in the accuracy of latency measurements, there's no easy way to test this programmatically.
This is a fairly minor refactor to the proxy telemetry tests. b07b554d2bdb4b92a1feeed22a79bd71e87856eb added a `Fixture` in the Destination service labeling tests added in #661 to reduce the repetition of copied and pasted code in those tests. I've refactored most of the other telemetry tests to also use the test fixture. Significantly less code is copied and pasted now.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
The proxy `telemetry::metrics::prometheus` module was initially added in order to give the Prometheus metrics export code a separate namespace from the controller push metrics. Since the controller push metrics code was removed from the proxy in #616, we no longer need a separate module for the Prometheus-specific metrics code. Therefore, I've moved that code to the root `telemetry::metrics` module, which should hopefully make the proxy source tree structure a little simpler.
This is a fairly trivial refactor.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Closes#713. This is a follow-up from #688.
This PR makes a number of refactorings to the proxy's `control::Cache` module and removes all but one of the `clone` calls.
The `CacheChange` enum now contains the changed key and a reference to the changed value when applicable. This simplifies `on_change` functions, which no longer have to take both a tuple of `(K, V)` and a `CacheChange` and can now simply destructure the `CacheChange`, and since the changed value is passed as a reference, the `on_change` function can now decide whether or not it should be cloned. This means that we can remove a majority of the clones previously present here.
I've also rewritten `Cache::update_union` so that it no longer clones values (twice if the cache was invalidated). There's still one `clone` call in `Cache::update_intersection`, but it seems like it will be fairly tricky to remove. However, I've moved the `V: Clone` bound to that function specifically. `Cache::clear` and `Cache::update_union` so that they no longer call `Cache::update_intersection` internally, so they don't need a `V: Clone` bound.
In addition, I've added some unit tests that test that `on_change` is called with the correct `CacheChange`s when key/value pairs are modified.
The tests for label metadata updates from the control plane are flaky on CI. This is likely due to the CI containers not having enough cores to execute the test proxy thread, the test proxy's controller client thread, the mock controller thread, and the test server thread simultaneously --- see #751 for more information.
For now, I'm ignoring these on CI. Eventually, I'd like to change the mock controller code in test support so that we can trigger it to send a second metadata update only after the request has finished.
I think this issue also makes merging #738 a higher priority, so that we can still have some tests running on CI that exercise some part of the label update behaviour.
PR #654 adds pod-based metric labels to the Destination API responses for cluster-local services.
This PR modifies the proxy to actually add these labels to reported Prometheus metrics for outbound requests to local services.
It enhances the proxy's `control::discovery` module to track these labels and add a `LabelRequest` middleware to the service stack built in `Bind` for labeled services. Requests transiting `LabelRequest` are given an `Extension` which contains these labels, which are then added to events produced by the `Sensors` for these requests. When these events are aggregated to Prometheus metrics, the labels are added.
I've also added some tests in `test/telemetry.rs` ensuring that these metrics are added correctly when the Destination service provides labels.
Closes#660
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
- The listener is immediately closed on receipt of a shutdown signal.
- All in-progress server connections are now counted, and the process will
not shutdown until the connection count has dropped to zero.
- In the case of HTTP1, idle connections are closed. In the case of HTTP2,
the HTTP2 graceful shutdown steps are followed of sending various
GOAWAYs.
Previously when the proxy could tell, by parsing, the request-target
is not in the cluster, it would not override the destination. That is,
load balancing would be disabled for such destinations.
With this change, the proxy will do L7 load balancing for all HTTP
services as long as the request-target has a DNS name.
Signed-off-by: Brian Smith <brian@briansmith.org>
No change in behavior is intended here.
Split poll_destination() into two parts, one that operates locally
on the DestinationSet, and the other that operates on data that isn't
wholly local to the DestinationSet. This makes the code easier to
understand. This is being done in preparation for adding DNS fallback
polling to poll_destination().
Signed-off-by: Brian Smith <brian@briansmith.org>
Only the destination service needs normalized names (and even then,
that's just temporary). The rest of the code needs the name as it was
given, except case-normalized (lowercased). Because DNS fallack isn't
implemented in service discovery yet, Outbound still a temporary
workaround using FullyQualifiedName to keep things working; thta will
be removed once DNS fallback is implemented in service discovery.
Signed-off-by: Brian Smith <brian@briansmith.org>
This PR changes the proxy's `control::Cache` module from a set to a key-value map.
This change is made in order to use the values in the map to store metadata from the Destination API, but allow evictions and insertions to be based only on the `SocketAddr` of the destination entry. This will make code in PR #661 much simpler, by removing the need to wrap `SocketAddr`s in the cache in a `Labeled` struct for storing metadata, and the need for custom `Borrow` implementations on that type.
Furthermore, I've changed from using a standard library `HashSet`/`HashMap` as the underlying collection to using `IndexMap`, as we suspect that this will result in performance improvements.
Currently, as `master` has no additional metadata to associate with cache entries, the type of the values in the map is `()`. When #661 merges, the values will actually contain metadata.
If we suspect that there are many other use-cases for `control::Cache` where it will be treated as a set rather than a map, we may want to provide a separate set of impls for `Cache<T, ()>` (like `std::HashSet`) to make the API more ergonomic in this case.
This PR adds the pretty-printing for durations I added in #676 to the panic message from the `assert_eventually!` macro added in #669.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
No change in behavior is intended here.
Split poll_destination() into two parts, one that operates locally
on the DestinationSet, and the other that operates on data that isn't
wholly local to the DestinationSet. This makes the code easier to
understand. This is being done in preparation for adding DNS fallback
polling to poll_destination().
Signed-off-by: Brian Smith <brian@briansmith.org>
Proxy: Refactor DNS name parsing and normalization
Only the destination service needs normalized names (and even then,
that's just temporary). The rest of the code needs the name as it was
given, except case-normalized (lowercased). Because DNS fallack isn't
implemented in service discovery yet, Outbound still a temporary
workaround using FullyQualifiedName to keep things working; thta will
be removed once DNS fallback is implemented in service discovery.
Signed-off-by: Brian Smith <brian@briansmith.org>
This branch adds simple pretty-printing to duration in log timeout messages. If the duration is >= 1 second, it's printed in seconds with a fractional part. If the duration is less than 1 second, it is printed in milliseconds. This simple formatting may not be sufficient as a formatting rule for all cases, but should be sufficient for printing our relatively small timeouts.
Log messages now look something like this:
```
ERROR 2018-04-04T20:05:49Z: conduit_proxy: turning operation timed out after 100 ms into 500
```
Previously, they looked like this:
```
ERROR 2018-04-04T20:07:26Z: conduit_proxy: turning operation timed out after Duration { secs: 0, nanos: 100000000 } into 500
```
I made this change partially because I wanted to make the panics from the `eventually!` macro added in #669 more readable.
The proxy's `telemetry/metrics/prometheus.rs` file was starting to get long and hard to find one's way around in. I split the prometheus labels code out into a separate submodule and `RequestLabels` and `ResponseLabels` public. This seems like a reasonable division of the code, and the resultant files are much easier to read.
The proxy's control::discovery module is becoming a bit dense in terms
of what it implements.
In order to make this code more understandable, and to be able to use a
similar caching strategy in other parts of the controller, the
`control::cache` module now holds discovery's cache implementation.
This module is only visible within the `control` module, and it now
exposes two new public methods: `values()` and
`set_reset_on_next_modification()`.
* Extracted logic from destination server
* Make tests follow style used elsewhere in the code
* Extract single interface for resolvers
* Add tests for k8s and ipv4 resolvers
* Fix small usability issues
* Update dep
* Act on feedback
* Add pod-based metric_labels to destinations response
* Add documentation on running control plane to BUILD.md
Signed-off-by: Phil Calcado <phil@buoyant.io>
* Fix mock controller in proxy tests (#656)
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
* Address review feedback
* Rename files in the destination package
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
- Adds environment variables to configure a set of ports that, when an
incoming connection has an SO_ORIGINAL_DST with a port matching, will
disable protocol detection for that connection and immediately start a
TCP proxy.
- Adds a default list of well known ports: SMTP and MySQL.
Closes#339
Previosuly, when the proxy was disconnected from the Destination
service and then reconnects, the proxy would not forget old, outdated
entries in its cache of endpoints. If those endpoints had been removed
while the proxy was disconnected then the proxy would never become
aware of that.
Instead, on the first message after a reconnection, replace the entire
set of cached entries with the new set, which may be empty.
Prior to this change, the new test
outbound_destinations_reset_on_reconnect_followed_by_no_endpoints_exists
passed already
but outbound_destinations_reset_on_reconnect_followed_by_add_none
and outbound_destinations_reset_on_reconnect_followed_by_remove_none
failed. Now all these tests pass.
Fixes#573
Signed-off-by: Brian Smith <brian@briansmith.org>
* Proxy: Factor out Destination service connection logic
Centralize the connection initiation logic for the Destination service
to make it easier to maintain. Clarify that the `rx` field isn't needed
prior to a (re)connect.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Rename `rx` to `query`.
Signed-off-by: Brian Smith <brian@briansmith.org>
* "recoonect" -> "reconnect"
Signed-off-by: Brian Smith <brian@briansmith.org>
This PR adds a `classification` label to proxy response metrics, as @olix0r described in https://github.com/runconduit/conduit/issues/634#issuecomment-376964083. The label is either "success" or "failure", depending on the following rules:
+ **if** the response had a gRPC status code, *then*
- gRPC status code 0 is considered a success
- all others are considered failures
+ **else if** the response had an HTTP status code, *then*
- status codes < 500 are considered success,
- status codes >= 500 are considered failures
+ **else if** the response stream failed **then**
- the response is a failure.
I've also added end-to-end tests for the classification of HTTP responses (with some work towards classifying gRPC responses as well). Additionally, I've updated `doc/proxy_metrics.md` to reflect the added `classification` label.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
In #602, @olix0r suggested that telemetry counters should wrap on overflows, as "most timeseries systems (like prometheus) are designed to handle this case gracefully."
This PR changes counters to use explicitly wrapping arithmetic.
Closes#602.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Have the controller tell the client whether the service exists, not
just what are available. This way we can implement fallback logic to
alternate service discovery mechanisms for ambigious names.
Signed-off-by: Brian Smith <brian@briansmith.org>
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
As described in #619. `process_start_time_seconds` is the idiomatic way of reporting to Prometheus the uptime of a process. It should contain the time in seconds since the beginning of the Unix epoch.
The proxy now exports this metric:
```
➜ http get localhost:4191/metrics
HTTP/1.1 200 OK
Content-Length: 902
Content-Type: text/plain; charset=utf-8
Date: Mon, 26 Mar 2018 22:09:55 GMT
# HELP request_total A counter of the number of requests the proxy has received.
# TYPE request_total counter
# HELP request_duration_ms A histogram of the duration of a request. This is measured from when the request headers are received to when the request stream has completed.
# TYPE request_duration_ms histogram
# HELP response_total A counter of the number of responses the proxy has received.
# TYPE response_total counter
# HELP response_duration_ms A histogram of the duration of a response. This is measured from when theresponse headers are received to when the response stream has completed.
# TYPE response_duration_ms histogram
# HELP response_latency_ms A histogram of the total latency of a response. This is measured from whenthe request headers are received to when the response stream has completed.
# TYPE response_latency_ms histogram
process_start_time_seconds 1522102089
```
Closes#619
Use `VecDeqeue` to make the queue structure clear. Follow good practice
by minimizing the amount of time the lock is held. Clarify how
defaulting logic works.
Signed-off-by: Brian Smith <brian@briansmith.org>
The metrics endpoint tests are flaky because there are no guarantees
that the metrics pipeline has processed events before the metrics
endpoint is read. This can cause CI to fail spuriously.
Disable these tests from running in CI until #613 is resolved.
The inject code detects the object it is being injected into, and writes
self-identifying information into the CONDUIT_PROMETHEUS_LABELS
environment variable, so that conduit-proxy may read this information
and report it to Prometheus at collection time.
This change puts the self-identifying information directly into
Kubernetes labels, which Prometheus already collects, removing the need
for conduit-proxy to be aware of this information. The resulting label
in Prometheus is recorded in the form `k8s_deployment`.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
This PR adds the `request_duration_ms` metric to the Prometheus metrics exported by the proxy. It also modifies the `request_total` metric so that it is incremented when a request stream finishes, rather than when it opens, for consistency with how the `response_total` metric is generated.
Making this change required modifying `telemetry::sensors::http` to generate a `StreamRequestEnd` event similar to the `StreamResponseEnd` event. This is done similarly to how sensors are added to response bodies, by generalizing the `ResponseBody` type into a `MeasuredBody` type that can wrap a request or response body. Since this changed the type of request bodies, it necessitated changing request types pretty much everywhere else in the proxy codebase in order to fix the resulting type errors, which is why the diff for this PR is so large.
Closes#570
Fixes#600
The proxy metrics endpoint has a bug where metrics recorded in the outbound direction can contain two commas in a row when no outbound label is present. This occurs because the code for formatting the outbound direction label mistakenly assumed that there would always be a destination pod owner label as well, but the proxy isn't currently aware of the destination's pod owner (waiting for #429).
I've fixed this issue by moving the place where the comma is output from the `fmt::Display` impl for `RequestLabels` to the `fmt::Display` impl for `OutboudnLabels`. This way, the comma between the `direction` and `dst_*` labels is only output when the `dst_*` label is present.
This bug made it to master since all of the proxy end-to-end tests for metrics only test the inbound router. I've rectified this issue by adding tests on the outbound router as well (which would fail against the current master due to the double comma bug). I've also added a test that asserts there are no double commas in exported metrics, to protect against regressions to this bug.
This PR adds an endpoint to the proxy that serves metrics in Prometheus' text exposition format. The endpoint currently serves the `request_total`, `response_total`, `response_latency_ms`, and `response_duration_ms metrics`, as described in #536. The endpoint's port and address are configurable with the `CONDUIT_PROXY_METRICS_LISTENER` environment variable.
Tests have been added in t`ests/telemetry.rs`
Replace an unconditional dependency on windows-specific crates in
tempdir (via its update of its remove_dir_all dependency), which
eliminates the need to download any windows-specific crates during
the build when targetting non-Windows platforms.
Also, when targetting Windows platforms, replace a winapi 0.2.x
dependency with a winapi 0.3.x dependency.
This results in two fewer downloads during Docker builds:
```diff
- Downloading winapi v0.2.8
- Downloading winapi-build v0.1.1
```
Signed-off-by: Brian Smith <brian@briansmith.org>