simulate-proxy increments a single set of metrics on each iteration, and
also randomizes http status codes, leaving counters unchanged across
several collections.
Modify simuilate-proxy to increment all metrics on each iteration,
provide a 90% success rate, ensure a pod does not call itself, and
increase proxy count from 3 to 10.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* Add tests/utils/scripts for running integration tests
Add a suite of integration tests in the `test/` directory, as well as
utilities for testing in the `testutil/` directory.
You can use the `bin/test-run` script to run the full suite of tests,
and the `bin/test-cleanup` script to cleanup after the tests.
The test/README.md file has more information about running tests.
@pcalcado, @franziskagoltz, and @rmars also contributed to this change.
* Create TEST.md file at the root of the repo
* Update based on review feedback
* Relax external service IP timeout for GKE
* Update TEST.md with more info about different types of test runs
* More updates to TEST.md based on review feedback
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
The Prometheus scrape config collects from Conduit proxies, and maps
Kubernetes labels to Prometheus labels, appending "k8s_".
This change keeps the resultant Prometheus labels consistent with their
source Kubernetes labels. For example: "deployment" and
"pod_template_hash".
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
In #602, @olix0r suggested that telemetry counters should wrap on overflows, as "most timeseries systems (like prometheus) are designed to handle this case gracefully."
This PR changes counters to use explicitly wrapping arithmetic.
Closes#602.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Have the controller tell the client whether the service exists, not
just what are available. This way we can implement fallback logic to
alternate service discovery mechanisms for ambigious names.
Signed-off-by: Brian Smith <brian@briansmith.org>
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
As described in #619. `process_start_time_seconds` is the idiomatic way of reporting to Prometheus the uptime of a process. It should contain the time in seconds since the beginning of the Unix epoch.
The proxy now exports this metric:
```
➜ http get localhost:4191/metrics
HTTP/1.1 200 OK
Content-Length: 902
Content-Type: text/plain; charset=utf-8
Date: Mon, 26 Mar 2018 22:09:55 GMT
# HELP request_total A counter of the number of requests the proxy has received.
# TYPE request_total counter
# HELP request_duration_ms A histogram of the duration of a request. This is measured from when the request headers are received to when the request stream has completed.
# TYPE request_duration_ms histogram
# HELP response_total A counter of the number of responses the proxy has received.
# TYPE response_total counter
# HELP response_duration_ms A histogram of the duration of a response. This is measured from when theresponse headers are received to when the response stream has completed.
# TYPE response_duration_ms histogram
# HELP response_latency_ms A histogram of the total latency of a response. This is measured from whenthe request headers are received to when the response stream has completed.
# TYPE response_latency_ms histogram
process_start_time_seconds 1522102089
```
Closes#619
Flaky proxy tests were not actually being ignored properly. This is due to our use of a Cargo workspace; as it turns out that Cargo doesn't propagate feature flags from the workspace to the crates in the workspace (see rust-lang/cargo#4753).
If I run `cargo test --no-default-features` in the root directory, the `flaky_tests` feature is still passed, and the flaky tests still run:
```
➜ cargo test --no-default-features
Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
Running target/debug/deps/conduit_proxy-0e0ab2829c6b743f
running 13 tests
test fully_qualified_authority::tests::test_normalized_authority ... ok
test ctx::transport::tests::same_addr_ip6_compat_ipv4 ... ok
test ctx::transport::tests::same_addr_ipv4 ... ok
test ctx::transport::tests::same_addr_ip6_mapped_ipv4 ... ok
test ctx::transport::tests::same_addr_ipv6 ... ok
test telemetry::tap::match_::tests::http_from_proto ... ok
test inbound::tests::recognize_default_no_ctx ... ok
test telemetry::tap::match_::tests::tcp_from_proto ... ok
test telemetry::tap::match_::tests::tcp_matches ... ok
test inbound::tests::recognize_default_no_loop ... ok
test transparency::tcp::tests::duplex_doesnt_hang_when_one_half_finishes ... ok
test inbound::tests::recognize_default_no_orig_dst ... ok
test inbound::tests::recognize_orig_dst ... ok
test result: ok. 13 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/debug/deps/conduit_proxy-74584a35ef749a60
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/debug/deps/discovery-73cd0b65bd7a45ae
running 16 tests
test http1::absolute_uris::outbound_reconnects_if_controller_stream_ends ... ok
test http1::outbound_reconnects_if_controller_stream_ends ... ok
test http1::absolute_uris::outbound_uses_orig_dst_if_not_local_svc ... ok
test http1::outbound_asks_controller_without_orig_dst ... ok
test http1::absolute_uris::outbound_asks_controller_api ... ok
test http1::outbound_asks_controller_api ... ok
test http1::absolute_uris::outbound_asks_controller_without_orig_dst ... ok
test http2::outbound_reconnects_if_controller_stream_ends ... ok
test http2::outbound_asks_controller_api ... ok
test http2::outbound_asks_controller_without_orig_dst ... ok
test http1::outbound_uses_orig_dst_if_not_local_svc ... ok
server h1 error: invalid HTTP version specified
test http2::outbound_uses_orig_dst_if_not_local_svc ... ok
ERROR 2018-03-26T20:54:09Z: conduit_proxy: turning Error caused by underlying HTTP/2 error: protocol error: frame with invalid size into 500
test outbound_updates_newer_services ... ok
ERROR 2018-03-26T20:54:09Z: conduit_proxy: turning operation timed out after Duration { secs: 0, nanos: 100000000 } into 500
test http1::absolute_uris::outbound_times_out ... ok
ERROR 2018-03-26T20:54:09Z: conduit_proxy: turning operation timed out after Duration { secs: 0, nanos: 100000000 } into 500
test http2::outbound_times_out ... ok
ERROR 2018-03-26T20:54:09Z: conduit_proxy: turning operation timed out after Duration { secs: 0, nanos: 100000000 } into 500
test http1::outbound_times_out ... ok
test result: ok. 16 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/debug/deps/telemetry-cb5bee2d2b94332c
running 12 tests
test metrics_endpoint_inbound_request_count ... ok
test metrics_endpoint_inbound_request_duration ... ok
test metrics_endpoint_outbound_request_count ... ok
test records_latency_statistics ... ignored
test telemetry_report_errors_are_ignored ... ok
test metrics_endpoint_outbound_request_duration ... ok
test metrics_have_no_double_commas ... ok
test http1_inbound_sends_telemetry ... ok
test inbound_sends_telemetry ... ok
test inbound_aggregates_telemetry_over_several_requests ... ok
test metrics_endpoint_inbound_response_latency ... ok
test metrics_endpoint_outbound_response_latency ... ok
test result: ok. 11 passed; 0 failed; 1 ignored; 0 measured; 0 filtered out
Running target/debug/deps/transparency-9d14bf92d8ba3700
running 19 tests
ERROR 2018-03-26T20:54:10Z: conduit_proxy: turning Error caused by underlying HTTP/2 error: protocol error: unexpected internal error encountered into 500
test http11_upgrade_not_supported ... ok
test http11_absolute_uri_differs_from_host ... ok
test http10_without_host ... ok
test http1_head_responses ... ok
test http10_with_host ... ok
test http1_connect_not_supported ... ok
test http1_bodyless_responses ... ok
test http1_content_length_zero_is_preserved ... ok
test http1_removes_connection_headers ... ok
test http1_one_connection_per_host ... ok
test inbound_http1 ... ok
test inbound_tcp ... ok
test http1_requests_without_body_doesnt_add_transfer_encoding ... ok
test http1_response_end_of_file ... ok
test http1_requests_without_host_have_unique_connections ... ok
test outbound_tcp ... ok
test tcp_with_no_orig_dst ... ok
test tcp_connections_close_if_client_closes ... ok
test outbound_http1 ... ok
test result: ok. 19 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/debug/deps/conduit_proxy_controller_grpc-7fdac3528475b1dc
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/debug/deps/conduit_proxy_router-024926cac5d328ee
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/debug/deps/convert-ae9bd3b8fee21c85
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/debug/deps/futures_mpsc_lossy-4afd31454ff77b40
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests conduit-proxy
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests conduit-proxy-controller-grpc
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests convert
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests conduit-proxy-router
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests futures-mpsc-lossy
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
```
This also happens if the `-p` flag is used to run tests only in the `conduit-proxy` crate:
```
➜ cargo test -p conduit-proxy --no-default-features
Compiling conduit-proxy v0.3.0 (file:///Users/eliza/Code/go/src/github.com/runconduit/conduit/proxy)
Finished dev [unoptimized + debuginfo] target(s) in 17.27 secs
Running target/debug/deps/conduit_proxy-0e0ab2829c6b743f
running 13 tests
test fully_qualified_authority::tests::test_normalized_authority ... ok
test ctx::transport::tests::same_addr_ip6_mapped_ipv4 ... ok
test ctx::transport::tests::same_addr_ipv6 ... ok
test ctx::transport::tests::same_addr_ipv4 ... ok
test ctx::transport::tests::same_addr_ip6_compat_ipv4 ... ok
test inbound::tests::recognize_default_no_loop ... ok
test telemetry::tap::match_::tests::http_from_proto ... ok
test inbound::tests::recognize_default_no_orig_dst ... ok
test inbound::tests::recognize_default_no_ctx ... ok
test transparency::tcp::tests::duplex_doesnt_hang_when_one_half_finishes ... ok
test telemetry::tap::match_::tests::tcp_from_proto ... ok
test inbound::tests::recognize_orig_dst ... ok
test telemetry::tap::match_::tests::tcp_matches ... ok
test result: ok. 13 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/debug/deps/conduit_proxy-74584a35ef749a60
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/debug/deps/discovery-73cd0b65bd7a45ae
running 16 tests
test http1::absolute_uris::outbound_reconnects_if_controller_stream_ends ... ok
test http1::outbound_reconnects_if_controller_stream_ends ... ok
test http1::absolute_uris::outbound_asks_controller_without_orig_dst ... ok
test http1::absolute_uris::outbound_uses_orig_dst_if_not_local_svc ... ok
test http1::outbound_asks_controller_without_orig_dst ... ok
test http1::absolute_uris::outbound_asks_controller_api ... ok
test http1::outbound_asks_controller_api ... ok
test http1::outbound_uses_orig_dst_if_not_local_svc ... ok
test http2::outbound_reconnects_if_controller_stream_ends ... ok
test http2::outbound_asks_controller_without_orig_dst ... ok
test http2::outbound_asks_controller_api ... ok
test http2::outbound_uses_orig_dst_if_not_local_svc ... ok
server h1 error: invalid HTTP version specified
ERROR 2018-03-26T20:56:50Z: conduit_proxy: turning Error caused by underlying HTTP/2 error: protocol error: frame with invalid size into 500
test outbound_updates_newer_services ... ok
ERROR 2018-03-26T20:56:50Z: conduit_proxy: turning operation timed out after Duration { secs: 0, nanos: 100000000 } into 500
test http1::absolute_uris::outbound_times_out ... ok
ERROR 2018-03-26T20:56:50Z: conduit_proxy: turning operation timed out after Duration { secs: 0, nanos: 100000000 } into 500
test http1::outbound_times_out ... ok
ERROR 2018-03-26T20:56:50Z: conduit_proxy: turning operation timed out after Duration { secs: 0, nanos: 100000000 } into 500
test http2::outbound_times_out ... ok
test result: ok. 16 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/debug/deps/telemetry-cb5bee2d2b94332c
running 12 tests
test metrics_endpoint_inbound_request_duration ... ok
test metrics_endpoint_inbound_request_count ... ok
test metrics_endpoint_outbound_request_count ... ok
test metrics_endpoint_outbound_request_duration ... ok
test telemetry_report_errors_are_ignored ... ok
test metrics_have_no_double_commas ... ok
test inbound_sends_telemetry ... ok
test http1_inbound_sends_telemetry ... ok
test inbound_aggregates_telemetry_over_several_requests ... ok
test metrics_endpoint_inbound_response_latency ... ok
test metrics_endpoint_outbound_response_latency ... ok
test records_latency_statistics ... ok
test result: ok. 12 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running target/debug/deps/transparency-9d14bf92d8ba3700
running 19 tests
ERROR 2018-03-26T20:56:55Z: conduit_proxy: turning Error caused by underlying HTTP/2 error: protocol error: unexpected internal error encountered into 500
test http1_connect_not_supported ... ok
test http11_upgrade_not_supported ... ok
test http10_without_host ... ok
test http11_absolute_uri_differs_from_host ... ok
test http1_head_responses ... ok
test http10_with_host ... ok
test http1_bodyless_responses ... ok
test http1_content_length_zero_is_preserved ... ok
test http1_removes_connection_headers ... ok
test http1_one_connection_per_host ... ok
test http1_response_end_of_file ... ok
test http1_requests_without_host_have_unique_connections ... ok
test inbound_http1 ... ok
test inbound_tcp ... ok
test http1_requests_without_body_doesnt_add_transfer_encoding ... ok
test outbound_tcp ... ok
test tcp_with_no_orig_dst ... ok
test tcp_connections_close_if_client_closes ... ok
test outbound_http1 ... ok
test result: ok. 19 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests conduit-proxy
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
```
However, if I `cd` into the `proxy` directory (so that Cargo treats the `conduit-proxy` crate as the root project, rather than the workspace) and pass the `--no-default-features` flag, the flaky tests are skipped as expected:
```
➜ (cd proxy && exec cargo test --no-default-features)
Finished dev [unoptimized + debuginfo] target(s) in 0.0 secs
Running /Users/eliza/Code/go/src/github.com/runconduit/conduit/target/debug/deps/conduit_proxy-ac198a96228a056e
running 13 tests
test fully_qualified_authority::tests::test_normalized_authority ... ok
test ctx::transport::tests::same_addr_ipv4 ... ok
test ctx::transport::tests::same_addr_ip6_compat_ipv4 ... ok
test ctx::transport::tests::same_addr_ipv6 ... ok
test ctx::transport::tests::same_addr_ip6_mapped_ipv4 ... ok
test telemetry::tap::match_::tests::tcp_from_proto ... ok
test telemetry::tap::match_::tests::http_from_proto ... ok
test transparency::tcp::tests::duplex_doesnt_hang_when_one_half_finishes ... ok
test telemetry::tap::match_::tests::tcp_matches ... ok
test inbound::tests::recognize_default_no_ctx ... ok
test inbound::tests::recognize_default_no_loop ... ok
test inbound::tests::recognize_default_no_orig_dst ... ok
test inbound::tests::recognize_orig_dst ... ok
test result: ok. 13 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running /Users/eliza/Code/go/src/github.com/runconduit/conduit/target/debug/deps/conduit_proxy-41e0f900f97e194b
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Running /Users/eliza/Code/go/src/github.com/runconduit/conduit/target/debug/deps/discovery-7ba7fe16345a347a
running 16 tests
test http1::absolute_uris::outbound_times_out ... ignored
test http1::outbound_times_out ... ignored
test http1::absolute_uris::outbound_reconnects_if_controller_stream_ends ... ok
test http1::outbound_reconnects_if_controller_stream_ends ... ok
test http1::absolute_uris::outbound_uses_orig_dst_if_not_local_svc ... ok
test http1::outbound_uses_orig_dst_if_not_local_svc ... ok
test http1::absolute_uris::outbound_asks_controller_without_orig_dst ... ok
test http1::outbound_asks_controller_without_orig_dst ... ok
test http1::outbound_asks_controller_api ... ok
test http1::absolute_uris::outbound_asks_controller_api ... ok
test http2::outbound_times_out ... ignored
server h1 error: invalid HTTP version specified
ERROR 2018-03-26T21:48:32Z: conduit_proxy: turning Error caused by underlying HTTP/2 error: protocol error: frame with invalid size into 500
test http2::outbound_reconnects_if_controller_stream_ends ... ok
test http2::outbound_uses_orig_dst_if_not_local_svc ... ok
test http2::outbound_asks_controller_api ... ok
test http2::outbound_asks_controller_without_orig_dst ... ok
test outbound_updates_newer_services ... ok
test result: ok. 13 passed; 0 failed; 3 ignored; 0 measured; 0 filtered out
Running /Users/eliza/Code/go/src/github.com/runconduit/conduit/target/debug/deps/telemetry-b0763b64edd8fc68
running 12 tests
test metrics_endpoint_inbound_request_count ... ignored
test metrics_endpoint_inbound_request_duration ... ignored
test metrics_endpoint_inbound_response_latency ... ignored
test metrics_endpoint_outbound_request_count ... ignored
test metrics_endpoint_outbound_request_duration ... ignored
test metrics_endpoint_outbound_response_latency ... ignored
test records_latency_statistics ... ignored
test telemetry_report_errors_are_ignored ... ok
test metrics_have_no_double_commas ... ok
test http1_inbound_sends_telemetry ... ok
test inbound_sends_telemetry ... ok
test inbound_aggregates_telemetry_over_several_requests ... ok
test result: ok. 5 passed; 0 failed; 7 ignored; 0 measured; 0 filtered out
Running /Users/eliza/Code/go/src/github.com/runconduit/conduit/target/debug/deps/transparency-300fd801daa85ccf
running 19 tests
ERROR 2018-03-26T21:48:32Z: conduit_proxy: turning Error caused by underlying HTTP/2 error: protocol error: unexpected internal error encountered into 500
test http1_connect_not_supported ... ok
test http11_upgrade_not_supported ... ok
test http10_without_host ... ok
test http10_with_host ... ok
test http11_absolute_uri_differs_from_host ... ok
test http1_head_responses ... ok
test http1_bodyless_responses ... ok
test http1_removes_connection_headers ... ok
test http1_content_length_zero_is_preserved ... ok
test http1_one_connection_per_host ... ok
test http1_response_end_of_file ... ok
test http1_requests_without_body_doesnt_add_transfer_encoding ... ok
test inbound_tcp ... ok
test inbound_http1 ... ok
test http1_requests_without_host_have_unique_connections ... ok
test outbound_tcp ... ok
test tcp_connections_close_if_client_closes ... ok
test tcp_with_no_orig_dst ... ok
test outbound_http1 ... ok
test result: ok. 19 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
Doc-tests conduit-proxy
running 0 tests
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
```
I'm wrapping the `cd` and `cargo test` command in a subshell so that the CWD on Travis is still in the repo root when the command exits, but the return value from `cargo test` is propagated.
Closes#625
Use `VecDeqeue` to make the queue structure clear. Follow good practice
by minimizing the amount of time the lock is held. Clarify how
defaulting logic works.
Signed-off-by: Brian Smith <brian@briansmith.org>
The metrics endpoint tests are flaky because there are no guarantees
that the metrics pipeline has processed events before the metrics
endpoint is read. This can cause CI to fail spuriously.
Disable these tests from running in CI until #613 is resolved.
CI builds on master have been failing to publish `cli-bin` images because the
`docker-push` script still refers to the `cli` image, though it was removed in
e7c4a9d4b9.
This change removes references to the `cli` image from all scripts.
The Prometheus config in the docker-compose environment had fallen
behind the prod setup.
This change updates the docker-compose environment in the following
ways:
- Prometheus config more closely matches prod, based on #583
- simulate-proxy labels matches prod, based on #605
- add Grafana container
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The inject code detects the object it is being injected into, and writes
self-identifying information into the CONDUIT_PROMETHEUS_LABELS
environment variable, so that conduit-proxy may read this information
and report it to Prometheus at collection time.
This change puts the self-identifying information directly into
Kubernetes labels, which Prometheus already collects, removing the need
for conduit-proxy to be aware of this information. The resulting label
in Prometheus is recorded in the form `k8s_deployment`.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
This PR adds the `request_duration_ms` metric to the Prometheus metrics exported by the proxy. It also modifies the `request_total` metric so that it is incremented when a request stream finishes, rather than when it opens, for consistency with how the `response_total` metric is generated.
Making this change required modifying `telemetry::sensors::http` to generate a `StreamRequestEnd` event similar to the `StreamResponseEnd` event. This is done similarly to how sensors are added to response bodies, by generalizing the `ResponseBody` type into a `MeasuredBody` type that can wrap a request or response body. Since this changed the type of request bodies, it necessitated changing request types pretty much everywhere else in the proxy codebase in order to fix the resulting type errors, which is why the diff for this PR is so large.
Closes#570
In addition to dashboards display service health, we need a dashboard to
display health of the Conduit service mesh itself.
This change introduces a conduit-health dashboard. It currently only
displays health metrics for the control plane components. Proxy health
will come later.
Fixes#502
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Update the inject command to set a CONDUIT_PROMETHEUS_LABELS proxy environment variable with the name of the pod spec that the proxy is injected into. This will later be used as a label value when the proxy is exposing metrics.
Fixes: #426
Signed-off-by: Alex Leong <alex@buoyant.io>
Fixes#600
The proxy metrics endpoint has a bug where metrics recorded in the outbound direction can contain two commas in a row when no outbound label is present. This occurs because the code for formatting the outbound direction label mistakenly assumed that there would always be a destination pod owner label as well, but the proxy isn't currently aware of the destination's pod owner (waiting for #429).
I've fixed this issue by moving the place where the comma is output from the `fmt::Display` impl for `RequestLabels` to the `fmt::Display` impl for `OutboudnLabels`. This way, the comma between the `direction` and `dst_*` labels is only output when the `dst_*` label is present.
This bug made it to master since all of the proxy end-to-end tests for metrics only test the inbound router. I've rectified this issue by adding tests on the outbound router as well (which would fail against the current master due to the double comma bug). I've also added a test that asserts there are no double commas in exported metrics, to protect against regressions to this bug.
The existing telemetry pipeline relies on Prometheus scraping the
Telemetry service, which will soon be removed.
This change configures Prometheus to scrape the conduit proxies directly
for telemetry data, and the control plane components for control-plane
health information. This affects the output of both conduit install
and conduit inject.
Fixes#428, #501
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The simulate-proxy script pushes metrics to the telemetry service. This PR modifies the script to expose metrics to a prometheus endpoint. This functionality creates a server that randomly generates response_total, request_totals, response_duration_ms and response_latency_ms. The server reads pod information from a k8s cluster and picks a random namespace to use for all exposed metrics.
Tested out these changes with a locally running prometheus server. I also ran the docker-compose.yml to make sure metrics were being recorded by the prometheus docker container.
fixes#498
Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
This PR adds an endpoint to the proxy that serves metrics in Prometheus' text exposition format. The endpoint currently serves the `request_total`, `response_total`, `response_latency_ms`, and `response_duration_ms metrics`, as described in #536. The endpoint's port and address are configurable with the `CONDUIT_PROXY_METRICS_LISTENER` environment variable.
Tests have been added in t`ests/telemetry.rs`
Follow up
https://github.com/runconduit/conduit/pull/593 by avoiding
`cargo fetch` in favor of the implicit fetch done in `cargo build`
to work around the lack of a `--target` flag in `cargo fetch`. This
should at least slightly improve the speed and reliability of Travis
CI runs.
Signed-off-by: Brian Smith <brian@briansmith.org>
Replace an unconditional dependency on windows-specific crates in
tempdir (via its update of its remove_dir_all dependency), which
eliminates the need to download any windows-specific crates during
the build when targetting non-Windows platforms.
Also, when targetting Windows platforms, replace a winapi 0.2.x
dependency with a winapi 0.3.x dependency.
This results in two fewer downloads during Docker builds:
```diff
- Downloading winapi v0.2.8
- Downloading winapi-build v0.1.1
```
Signed-off-by: Brian Smith <brian@briansmith.org>
`cargo fetch` doesn't consider the target platform and downloads
all crates needed to build for any target.
Stop using `cargo fetch` and instead use the implicit fetch done
by `cargo build`, which *does* consider the target platform.
This change results in 12 (soon 15) fewer crates downloaded.
This is a non-trivial savings in build time for a full rebuild
since cargo downloads crates in parallel.
```diff
- Downloading bitflags v1.0.1
- Downloading fuchsia-zircon v0.3.3
- Downloading fuchsia-zircon-sys v0.3.3
- Downloading miow v0.2.1
- Downloading redox_syscall v0.1.37
- Downloading redox_termios v0.1.1
- Downloading termion v1.5.1
- Downloading winapi v0.3.4
- Downloading winapi-i686-pc-windows-gnu v0.4.0
- Downloading winapi-x86_64-pc-windows-gnu v0.4.0
- Downloading wincolor v0.1.6
- Downloading ws2_32-sys v0.2.1
```
I verified that no downloads are done during an incremental
build.
Signed-off-by: Brian Smith <brian@briansmith.org>
The existing `conduit dashboard` command supported opening the conduit
dashboard, or displaying the conduit dashboard URL, via a `url` boolean
flag.
Replace the `url` boolean flag with a `show` string flag, with three
modes:
`conduit dashboard --show conduit`: default, open conduit dashboard
`conduit dashboard --show grafana`: open grafana dashboard
`conduit dashboard --show url`: display dashboard URLs
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Successful `conduit check` commands now take into account `[ok]`
and `\n` tokens when constraining line length.
Fixes#554
Signed-off-by: Andy Hume <andyhume@gmail.com>
Existing Grafana configuration contained no dashboards, just a skeleton
for testing.
Introduce two Grafana dashboards:
1) Top Line: Overall health of all Conduit-enabled services
2) Deployment: Health of a specific conduit-enabled deployment
Fixes#500
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Applies timeout of 5s to check request contexts. This overrides
30s timeout applied at client transport level, and stops the
conduit check command from taking > 90s to complete.
Fixes#553
Signed-off-by: Andy Hume <andyhume@gmail.com>
When the proxy is run in a Docker container, it runs as PID 1, with
no default signal handlers setup. In order to react to signals from
Kubernetes about shutting down, we need to set up explicit handlers.
This adds handlers for SIGTERM and SIGINT.
Closes#549
This image isn't used. It references its base image using the `latest` tag, which
is wrong; it should have been using the tag that the base image was built with. It
is likely that the last few iterations of this image that we've published have
wrong and useless contents.
With that in mind, just remove the image.
Fixes#578.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Update release notes for v0.3.1.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Add PR authorship for non-Buoyant contributors.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Wrap lines at 100 characters.
Signed-off-by: Brian Smith <brian@briansmith.org>
We should review changes to Cargo.lock to ensure we're not adding
unexpected and/or unnecessary dependencies. (Maybe we should do the
same for Gopkg.lock, but I'm not in a position to say for sure.)
Signed-off-by: Brian Smith <brian@briansmith.org>
* Most controller listeners should only bind on localhost
* Use default listening addresses in controller components
* Review feedback
* Revert test_helper change
* Revert use of absolute domains
Signed-off-by: Alex Leong <alex@buoyant.io>
In order to ensure we catch discovery and routing issues arising from different logic for HTTP/1 and HTTP/2 requests, I've modified tests/discovery.rs to run all applicable tests with both HTTP/1 and HTTP/2 requests. The tests themselves are largely unchanged, but now there are separate modules containing HTTP/1 and HTTP/2 versions of a majority of the tests.
Commit 569d6939a7 introduced a regression that caused the proxy to stop using the Destination service for outbound HTTP/1 requests with no authority in the request URI but a valid authority in the `Host:` header.
The bug is due to some code in `Outbound::recognize` which assumed that a request had already been passed through `normalize_our_view_of_uri`. This was valid at one point while I was writing #492, as URIs were normalized prior to `recognize` and a request `Extension` was used to mark that they had been rewritten, and the host header and request URI could be assumed to be in agreement, but after merging #514 into the dev branch for #492, this behaviour changed and I forgot to update the logic in `recognize`.
I've fixed the issue by adding the logic for routing on `Host:` headers back into `Outbound::recognize`.
@seanmonstar added a test in `discovery.rs`, `outbound_http1_asks_controller_about_host`, which should exercise this case. I've added a couple more unit tests in that file to try and ensure we cover more of the different cases that can occur here.
Fixes#552
In some cases, we would adjust an existing Host header, or add one. And in all cases when an HTTP/1 request was received with an absolute-form target, it was not passed on.
Now, the Host header is never changed. And if the Uri was in absolute-form, it is sent in the same format.
Closes#518
The h2 crate (HTTP/2.0 client and server) has a new release which
includes bug fixes and stability improvements.
This updates the Cargo.lock file to include the new release.
Closes#538
Signed-off-by: Carl Lerche <me@carllerche.com>
An infinite loop exists in the TCP proxy, which could be triggered by any raw TCP connection (including HTTPS requests). The connection will be proxied successfully, but instead of closing, it will remain open, and the proxy's CPU usage will remain extremely high indefinitely.
Since `Duplex::poll` will call `half_in.copy_into()`/`half_out.copy_into()` repeatedly, even after they return `Async::Ready`, when one half has shut down and returned ready, it may still be polled again, as `Duplex::poll` waits until _both_ halves have returned `Ready`. Because of the guard that `!dst.is_shutdown`, intended to prevent the destination from shutting down twice, the function will not return if it is polled again after returning `Async::Ready` once.
I've fixed this by moving the guard against double shutdowns out of the loop, so that the function will return `Async::Ready` again if it is polled after shutting down the destination.
I've also included a unit test against regressions to this bug. The unit test fails against master.
Fixes#519
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Co-Authored-By: Andrew Seigner <andrew@sig.gy>
* Proxy: Don't resolve absolute names outside zone using Destinations service
Many absolute names were being resolved using the Destinations service due to logic error
in the proxy's matching of the zone to the default zone.
Fix that bug.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Temporarily stop trying to support configurable zones in the proxy.
None of the zone configuration is tested and lots of things assume the cluster
zone is `cluster.local`. Further, how exactly the proxy will actually learn the
cluster zone hasn't been decided yet.
Just hard-code the zone as "cluster.local" in the proxy until configurable zones
are fully implemented and tested to be working correctly.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Remove the CONDUIT_PROXY_DESTINATIONS_AUTOCOMPLETE_FQDN setting
The way that Kubernetes configures DNS search suffixes has some negative
consequences as some names like "example.com" are ambiguous: depending on
whether there is a service "example" in the "com" namespace, "example.com"
may refer to an external service or an internal service, and this can
fluctuate over time. In recognition of that we added the
CONDUIT_PROXY_DESTINATIONS_AUTOCOMPLETE_FQDN setting, thinking this would
be part of a solution for users to opt out of the unfortunate behavior
if their applications didn't depend on the DNS search suffix feature.
It turns out similar effects can be acheived using a custom dnsConfig,
starting in Kubernetes 1.10 when dnsConfig reaches the beta stability level.
Now any CONDUIT_PROXY_DESTINATIONS_AUTOCOMPLETE_FQDN-based seems duplicative.
Further, attempting to support it optionally made the code complex and hard
to read.
Therefore, let's just remove it. If/when somebody actually requests this
functionality then we can add it back, if dnsConfig isn't a valid alternative
for them.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Further hard-code "cluster.local" as the zone, temporarily.
Addresses review feedback.
Signed-off-by: Brian Smith <brian@briansmith.org>
Shortly after conduit is installed in k8s environment. The control plane component that establishes a watch endpoint with k8s run in to networking issues during proxy initialization. During failure, each watcher fails to retry its connection to k8s watch endpoint which leads to timeouts and eventually, multiple controller pod restarts.
This PR adds retry logic to each "watch" enabled package.
fixes#478
Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
Wwe will be able to simplify service discovery in the near future if we
can rely on the namespace being available.
Signed-off-by: Brian Smith <brian@briansmith.org>
Pick up https://github.com/danburkert/prost/pull/87, which results in the
following reduction in build dependencies for the proxy:
Removing failure_derive v0.1.1
Adding prost-derive v0.3.2 (https://github.com/danburkert/prost#3427352e)
Removing prost-derive v0.3.2
Removing quote v0.3.15
Removing syn v0.11.11
Removing synom v0.11.3
Removing synstructure v0.6.1
Removing unicode-xid v0.0.4
Signed-off-by: Brian Smith <brian@briansmith.org>
Kubernetes will do multiple DNS lookups for a name like
`proxy-api.conduit.svc.cluster.local` based on the default search settings
in /etc/resolv.conf for each container:
1. proxy-api.conduit.svc.cluster.local.conduit.svc.cluster.local. IN A
2. proxy-api.conduit.svc.cluster.local.svc.cluster.local. IN A
3. proxy-api.conduit.svc.cluster.local.cluster.local. IN A
4. proxy-api.conduit.svc.cluster.local. IN A
We do not need or want this search to be done, so avoid it by making each
name absolute by appending a period so that the first three DNS queries
are skipped for each name.
The case for `localhost` is even worse because we expect that `localhost` will
always resolve to 127.0.0.1 and/or ::1, but this is not guaranteed if the default
search is done:
1. localhost.conduit.svc.cluster.local. IN A
2. localhost.svc.cluster.local. IN A
3. localhost.cluster.local. IN A
4. localhost. IN A
Avoid these unnecessary DNS queries by making each name absolute, so that the
first three DNS queries are skipped for each name.
Signed-off-by: Brian Smith <brian@briansmith.org>