The Prometheus config in the docker-compose environment had fallen
behind the prod setup.
This change updates the docker-compose environment in the following
ways:
- Prometheus config more closely matches prod, based on #583
- simulate-proxy labels matches prod, based on #605
- add Grafana container
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The inject code detects the object it is being injected into, and writes
self-identifying information into the CONDUIT_PROMETHEUS_LABELS
environment variable, so that conduit-proxy may read this information
and report it to Prometheus at collection time.
This change puts the self-identifying information directly into
Kubernetes labels, which Prometheus already collects, removing the need
for conduit-proxy to be aware of this information. The resulting label
in Prometheus is recorded in the form `k8s_deployment`.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
This PR adds the `request_duration_ms` metric to the Prometheus metrics exported by the proxy. It also modifies the `request_total` metric so that it is incremented when a request stream finishes, rather than when it opens, for consistency with how the `response_total` metric is generated.
Making this change required modifying `telemetry::sensors::http` to generate a `StreamRequestEnd` event similar to the `StreamResponseEnd` event. This is done similarly to how sensors are added to response bodies, by generalizing the `ResponseBody` type into a `MeasuredBody` type that can wrap a request or response body. Since this changed the type of request bodies, it necessitated changing request types pretty much everywhere else in the proxy codebase in order to fix the resulting type errors, which is why the diff for this PR is so large.
Closes#570
In addition to dashboards display service health, we need a dashboard to
display health of the Conduit service mesh itself.
This change introduces a conduit-health dashboard. It currently only
displays health metrics for the control plane components. Proxy health
will come later.
Fixes#502
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Update the inject command to set a CONDUIT_PROMETHEUS_LABELS proxy environment variable with the name of the pod spec that the proxy is injected into. This will later be used as a label value when the proxy is exposing metrics.
Fixes: #426
Signed-off-by: Alex Leong <alex@buoyant.io>
Fixes#600
The proxy metrics endpoint has a bug where metrics recorded in the outbound direction can contain two commas in a row when no outbound label is present. This occurs because the code for formatting the outbound direction label mistakenly assumed that there would always be a destination pod owner label as well, but the proxy isn't currently aware of the destination's pod owner (waiting for #429).
I've fixed this issue by moving the place where the comma is output from the `fmt::Display` impl for `RequestLabels` to the `fmt::Display` impl for `OutboudnLabels`. This way, the comma between the `direction` and `dst_*` labels is only output when the `dst_*` label is present.
This bug made it to master since all of the proxy end-to-end tests for metrics only test the inbound router. I've rectified this issue by adding tests on the outbound router as well (which would fail against the current master due to the double comma bug). I've also added a test that asserts there are no double commas in exported metrics, to protect against regressions to this bug.
The existing telemetry pipeline relies on Prometheus scraping the
Telemetry service, which will soon be removed.
This change configures Prometheus to scrape the conduit proxies directly
for telemetry data, and the control plane components for control-plane
health information. This affects the output of both conduit install
and conduit inject.
Fixes#428, #501
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The simulate-proxy script pushes metrics to the telemetry service. This PR modifies the script to expose metrics to a prometheus endpoint. This functionality creates a server that randomly generates response_total, request_totals, response_duration_ms and response_latency_ms. The server reads pod information from a k8s cluster and picks a random namespace to use for all exposed metrics.
Tested out these changes with a locally running prometheus server. I also ran the docker-compose.yml to make sure metrics were being recorded by the prometheus docker container.
fixes#498
Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
This PR adds an endpoint to the proxy that serves metrics in Prometheus' text exposition format. The endpoint currently serves the `request_total`, `response_total`, `response_latency_ms`, and `response_duration_ms metrics`, as described in #536. The endpoint's port and address are configurable with the `CONDUIT_PROXY_METRICS_LISTENER` environment variable.
Tests have been added in t`ests/telemetry.rs`
Follow up
https://github.com/runconduit/conduit/pull/593 by avoiding
`cargo fetch` in favor of the implicit fetch done in `cargo build`
to work around the lack of a `--target` flag in `cargo fetch`. This
should at least slightly improve the speed and reliability of Travis
CI runs.
Signed-off-by: Brian Smith <brian@briansmith.org>
Replace an unconditional dependency on windows-specific crates in
tempdir (via its update of its remove_dir_all dependency), which
eliminates the need to download any windows-specific crates during
the build when targetting non-Windows platforms.
Also, when targetting Windows platforms, replace a winapi 0.2.x
dependency with a winapi 0.3.x dependency.
This results in two fewer downloads during Docker builds:
```diff
- Downloading winapi v0.2.8
- Downloading winapi-build v0.1.1
```
Signed-off-by: Brian Smith <brian@briansmith.org>
`cargo fetch` doesn't consider the target platform and downloads
all crates needed to build for any target.
Stop using `cargo fetch` and instead use the implicit fetch done
by `cargo build`, which *does* consider the target platform.
This change results in 12 (soon 15) fewer crates downloaded.
This is a non-trivial savings in build time for a full rebuild
since cargo downloads crates in parallel.
```diff
- Downloading bitflags v1.0.1
- Downloading fuchsia-zircon v0.3.3
- Downloading fuchsia-zircon-sys v0.3.3
- Downloading miow v0.2.1
- Downloading redox_syscall v0.1.37
- Downloading redox_termios v0.1.1
- Downloading termion v1.5.1
- Downloading winapi v0.3.4
- Downloading winapi-i686-pc-windows-gnu v0.4.0
- Downloading winapi-x86_64-pc-windows-gnu v0.4.0
- Downloading wincolor v0.1.6
- Downloading ws2_32-sys v0.2.1
```
I verified that no downloads are done during an incremental
build.
Signed-off-by: Brian Smith <brian@briansmith.org>
The existing `conduit dashboard` command supported opening the conduit
dashboard, or displaying the conduit dashboard URL, via a `url` boolean
flag.
Replace the `url` boolean flag with a `show` string flag, with three
modes:
`conduit dashboard --show conduit`: default, open conduit dashboard
`conduit dashboard --show grafana`: open grafana dashboard
`conduit dashboard --show url`: display dashboard URLs
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Successful `conduit check` commands now take into account `[ok]`
and `\n` tokens when constraining line length.
Fixes#554
Signed-off-by: Andy Hume <andyhume@gmail.com>
Existing Grafana configuration contained no dashboards, just a skeleton
for testing.
Introduce two Grafana dashboards:
1) Top Line: Overall health of all Conduit-enabled services
2) Deployment: Health of a specific conduit-enabled deployment
Fixes#500
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Applies timeout of 5s to check request contexts. This overrides
30s timeout applied at client transport level, and stops the
conduit check command from taking > 90s to complete.
Fixes#553
Signed-off-by: Andy Hume <andyhume@gmail.com>
When the proxy is run in a Docker container, it runs as PID 1, with
no default signal handlers setup. In order to react to signals from
Kubernetes about shutting down, we need to set up explicit handlers.
This adds handlers for SIGTERM and SIGINT.
Closes#549
This image isn't used. It references its base image using the `latest` tag, which
is wrong; it should have been using the tag that the base image was built with. It
is likely that the last few iterations of this image that we've published have
wrong and useless contents.
With that in mind, just remove the image.
Fixes#578.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Update release notes for v0.3.1.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Add PR authorship for non-Buoyant contributors.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Wrap lines at 100 characters.
Signed-off-by: Brian Smith <brian@briansmith.org>
We should review changes to Cargo.lock to ensure we're not adding
unexpected and/or unnecessary dependencies. (Maybe we should do the
same for Gopkg.lock, but I'm not in a position to say for sure.)
Signed-off-by: Brian Smith <brian@briansmith.org>
* Most controller listeners should only bind on localhost
* Use default listening addresses in controller components
* Review feedback
* Revert test_helper change
* Revert use of absolute domains
Signed-off-by: Alex Leong <alex@buoyant.io>
In order to ensure we catch discovery and routing issues arising from different logic for HTTP/1 and HTTP/2 requests, I've modified tests/discovery.rs to run all applicable tests with both HTTP/1 and HTTP/2 requests. The tests themselves are largely unchanged, but now there are separate modules containing HTTP/1 and HTTP/2 versions of a majority of the tests.
Commit 569d6939a7 introduced a regression that caused the proxy to stop using the Destination service for outbound HTTP/1 requests with no authority in the request URI but a valid authority in the `Host:` header.
The bug is due to some code in `Outbound::recognize` which assumed that a request had already been passed through `normalize_our_view_of_uri`. This was valid at one point while I was writing #492, as URIs were normalized prior to `recognize` and a request `Extension` was used to mark that they had been rewritten, and the host header and request URI could be assumed to be in agreement, but after merging #514 into the dev branch for #492, this behaviour changed and I forgot to update the logic in `recognize`.
I've fixed the issue by adding the logic for routing on `Host:` headers back into `Outbound::recognize`.
@seanmonstar added a test in `discovery.rs`, `outbound_http1_asks_controller_about_host`, which should exercise this case. I've added a couple more unit tests in that file to try and ensure we cover more of the different cases that can occur here.
Fixes#552
In some cases, we would adjust an existing Host header, or add one. And in all cases when an HTTP/1 request was received with an absolute-form target, it was not passed on.
Now, the Host header is never changed. And if the Uri was in absolute-form, it is sent in the same format.
Closes#518
The h2 crate (HTTP/2.0 client and server) has a new release which
includes bug fixes and stability improvements.
This updates the Cargo.lock file to include the new release.
Closes#538
Signed-off-by: Carl Lerche <me@carllerche.com>
An infinite loop exists in the TCP proxy, which could be triggered by any raw TCP connection (including HTTPS requests). The connection will be proxied successfully, but instead of closing, it will remain open, and the proxy's CPU usage will remain extremely high indefinitely.
Since `Duplex::poll` will call `half_in.copy_into()`/`half_out.copy_into()` repeatedly, even after they return `Async::Ready`, when one half has shut down and returned ready, it may still be polled again, as `Duplex::poll` waits until _both_ halves have returned `Ready`. Because of the guard that `!dst.is_shutdown`, intended to prevent the destination from shutting down twice, the function will not return if it is polled again after returning `Async::Ready` once.
I've fixed this by moving the guard against double shutdowns out of the loop, so that the function will return `Async::Ready` again if it is polled after shutting down the destination.
I've also included a unit test against regressions to this bug. The unit test fails against master.
Fixes#519
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Co-Authored-By: Andrew Seigner <andrew@sig.gy>
* Proxy: Don't resolve absolute names outside zone using Destinations service
Many absolute names were being resolved using the Destinations service due to logic error
in the proxy's matching of the zone to the default zone.
Fix that bug.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Temporarily stop trying to support configurable zones in the proxy.
None of the zone configuration is tested and lots of things assume the cluster
zone is `cluster.local`. Further, how exactly the proxy will actually learn the
cluster zone hasn't been decided yet.
Just hard-code the zone as "cluster.local" in the proxy until configurable zones
are fully implemented and tested to be working correctly.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Remove the CONDUIT_PROXY_DESTINATIONS_AUTOCOMPLETE_FQDN setting
The way that Kubernetes configures DNS search suffixes has some negative
consequences as some names like "example.com" are ambiguous: depending on
whether there is a service "example" in the "com" namespace, "example.com"
may refer to an external service or an internal service, and this can
fluctuate over time. In recognition of that we added the
CONDUIT_PROXY_DESTINATIONS_AUTOCOMPLETE_FQDN setting, thinking this would
be part of a solution for users to opt out of the unfortunate behavior
if their applications didn't depend on the DNS search suffix feature.
It turns out similar effects can be acheived using a custom dnsConfig,
starting in Kubernetes 1.10 when dnsConfig reaches the beta stability level.
Now any CONDUIT_PROXY_DESTINATIONS_AUTOCOMPLETE_FQDN-based seems duplicative.
Further, attempting to support it optionally made the code complex and hard
to read.
Therefore, let's just remove it. If/when somebody actually requests this
functionality then we can add it back, if dnsConfig isn't a valid alternative
for them.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Further hard-code "cluster.local" as the zone, temporarily.
Addresses review feedback.
Signed-off-by: Brian Smith <brian@briansmith.org>
Shortly after conduit is installed in k8s environment. The control plane component that establishes a watch endpoint with k8s run in to networking issues during proxy initialization. During failure, each watcher fails to retry its connection to k8s watch endpoint which leads to timeouts and eventually, multiple controller pod restarts.
This PR adds retry logic to each "watch" enabled package.
fixes#478
Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
Wwe will be able to simplify service discovery in the near future if we
can rely on the namespace being available.
Signed-off-by: Brian Smith <brian@briansmith.org>
Pick up https://github.com/danburkert/prost/pull/87, which results in the
following reduction in build dependencies for the proxy:
Removing failure_derive v0.1.1
Adding prost-derive v0.3.2 (https://github.com/danburkert/prost#3427352e)
Removing prost-derive v0.3.2
Removing quote v0.3.15
Removing syn v0.11.11
Removing synom v0.11.3
Removing synstructure v0.6.1
Removing unicode-xid v0.0.4
Signed-off-by: Brian Smith <brian@briansmith.org>
Kubernetes will do multiple DNS lookups for a name like
`proxy-api.conduit.svc.cluster.local` based on the default search settings
in /etc/resolv.conf for each container:
1. proxy-api.conduit.svc.cluster.local.conduit.svc.cluster.local. IN A
2. proxy-api.conduit.svc.cluster.local.svc.cluster.local. IN A
3. proxy-api.conduit.svc.cluster.local.cluster.local. IN A
4. proxy-api.conduit.svc.cluster.local. IN A
We do not need or want this search to be done, so avoid it by making each
name absolute by appending a period so that the first three DNS queries
are skipped for each name.
The case for `localhost` is even worse because we expect that `localhost` will
always resolve to 127.0.0.1 and/or ::1, but this is not guaranteed if the default
search is done:
1. localhost.conduit.svc.cluster.local. IN A
2. localhost.svc.cluster.local. IN A
3. localhost.cluster.local. IN A
4. localhost. IN A
Avoid these unnecessary DNS queries by making each name absolute, so that the
first three DNS queries are skipped for each name.
Signed-off-by: Brian Smith <brian@briansmith.org>
This PR ensures that the mapping of requests to outbound connections is segregated by `Host:` header values. In most cases, the desired behavior is provided by Hyper's connection pooling. However, Hyper does not handle the case where a request had no `Host:` header and the request URI had no authority part, and the request was routed based on the SO_ORIGINAL_DST in the desired manner. We would like these requests to each have their own outbound connection, but Hyper will reuse the same connection for such requests.
Therefore, I have modified `conduit_proxy_router::Recognize` to allow implementations of `Recognize` to indicate whether the service for a given key can be cached, and to only cache the service when it is marked as cachable. I've also changed the `reconstruct_uri` function, which rewrites HTTP/1 requests, to mark when a request had no authority and no `Host:` header, and the authority was rewritten to be the request's ORIGINAL_DST. When this is the case, the `Recognize` implementations for `Inbound` and `Outbound` will mark these requests as non-cachable.
I've also added unit tests ensuring that A, connections are created per `Host:` header, and B, that requests with no `Host:` header each create a new connection. The first test passes without any additional changes, but the second only passes on this branch. The tests were added in PR #489, but this branch supersedes that branch.
Fixes#415. Closes#489.
Grafana by default calls out to grafana.com to check for updates. As
user's of Conduit do not have direct control over updating Grafana
directly, this update check is not needed.
Disable Grafana's update check via grafana.ini.
This is also a workaround for #155, root cause of #519.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The main goal with this change is to make it clear, from looking at
Cargo.lock, that the Conduit proxy doesn't depend on OpenSSL. Although
the *ring* crate attempts to avoid conflicts with symbols defined in
OpenSSL, that is a manual process that doesn't have automatic
verification yet.
The secondary goal is to reduce the total number of dependencies to
make (at least) full from-scratch builds, such as those in CI, faster.
As a result of this PR, following the following upstream PRs we
submitted to prost, as well as some similar PRs in other upstream
projects and in conduit inself, our usage of prost now results in us
depending on many fewer crates:
* https://github.com/danburkert/prost/pull/78
* https://github.com/danburkert/prost/pull/79
* https://github.com/danburkert/prost/pull/82
* https://github.com/danburkert/prost/pull/84
Here are the crate dependencies that are removed:
* adler32
* aho-corasick
* build_const
* bzip2
* bzip2-sys
* crc
* curl
* curl-sys
* env_logger (0.4)
* flate2
* lazy_static
* libz-sys
* memchr
* miniz_oxide
* miniz_oxide_c_api
* msdos_time
* openssl-probe
* openssl-sys
* pkg-config
* podio
* regex
* regex-syntax
* schannel
* socket2
* thread_local
* unreachable
* utf8-ranges
* vcpkg
* zip
Pretty much all of these are build dependencies, but Cargo.lock doesn't
distinguish between build dependencies and regular dependencies.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Add --expected-version flag for conduit check command
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
* Update build instructions
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
As a goal of being a transparent proxy, we want to proxy requests and
responses with as little modification as possible. Basically, servers
and clients should see messages that look the same whether the proxy was
injected or not.
With that goal in mind, we want to make sure that body headers (things
like `Content-Length`, `Transfer-Encoding`, etc) are left alone. Prior
to this commit, we at times were changing behavior. Sometimes
`Transfer-Encoding` was added to requests, or `Content-Length: 0` may
have been removed. While RC 7230 defines that differences are
semantically the same, implementations may not handle them correctly.
Now, we've added some fixes to prevent any of these header changes
from occurring, along with tests to make sure library updates don't
regress.
For requests:
- With no message body, `Transfer-Encoding: chunked` should no longer be
added.
- With `Content-Length: 0`, the header is forwarded untouched.
For responses:
- Tests were added that responses not allowed to have bodies (to HEAD
requests, 204, 304) did not have `Transfer-Encoding` added.
- Tests that `Content-Length: 0` is preserved.
- Tests that HTTP/1.0 responses with no body headers do not have
`Transfer-Encoding` added.
- Tests that `HEAD` responses forward `Content-Length` headers (but not
an actual body).
Closes#447
Signed-off-by: Sean McArthur <sean@seanmonstar.com>
When the conduit proxy is injected into the controller pod, we observe controller pod proxy stats show up as an "outbound" deployment for an unrelated upstream deployment. This may cause confusion when monitoring deployments in the service mesh.
This PR filters out this "misleading" stat in the public api whenever the dashboard requests metric information for a specific deployment.
* exclude telemetry generated by the control plane when requesting deployment metrics
fixes#370
Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
The dns_test had assumed DNS changes were deterministically ordered, but
util.DiffAddresses uses a map and therefore does not guarantee ordering.
Fix dns_test to sort TCP Addresses prior to comparison.
Fixes#515
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Grafana dashboards will not be available for the 0.3.1 release, but
BUILD.md provides an (incorrect) way to access Grafana.
Remove mention of Grafana for now. Re-add when dashboards are integrated
into Conduit.
Part of #420.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Currently, the `Reconnect` middleware does not reconnect on connection errors (see #491) and treats them as request errors. This means that when a connection timeout is wrapped in a `Reconnect`, timeout errors are treated as request errors, and the request returns HTTP 500. Since this is not the desired behavior, the connection timeouts should be removed, at least until their errors can be handled differently.
This PR removes the connect timeouts from `Bind`, as described in https://github.com/runconduit/conduit/pull/483#issuecomment-369380003.
It removes the `CONDUIT_PROXY_PUBLIC_CONNECT_TIMEOUT_MS` environment variable, but _not_ the `CONDUIT_PROXY_PRIVATE_CONNECT_TIMEOUT_MS` variable, since this is also used for the TCP connect timeouts. If we want also want to remove the TCP connection timeouts, I can do that as well.
Closes#483. Fixes#491.
In the initial review for this code (preceding the creation of the
runconduit/conduit repository), it was noted that this container is not
actually used, so this is actually dead code.
Further, this container actualy causes a minor problem, as it doesn't
implement any retry logic, thus it will sometimes often cause errors to
be logged. See
https://github.com/runconduit/conduit/issues/496#issuecomment-370105328.
Further, this is a "buoyantio/" branded container. IF we actually need
such a container then it should be a Conduit-branded container.
See https://github.com/runconduit/conduit/issues/478 for additional
context.
Signed-off-by: Brian Smith <brian@briansmith.org>