The existing telemetry pipeline relies on Prometheus scraping the
Telemetry service, which will soon be removed.
This change configures Prometheus to scrape the conduit proxies directly
for telemetry data, and the control plane components for control-plane
health information. This affects the output of both conduit install
and conduit inject.
Fixes#428, #501
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The simulate-proxy script pushes metrics to the telemetry service. This PR modifies the script to expose metrics to a prometheus endpoint. This functionality creates a server that randomly generates response_total, request_totals, response_duration_ms and response_latency_ms. The server reads pod information from a k8s cluster and picks a random namespace to use for all exposed metrics.
Tested out these changes with a locally running prometheus server. I also ran the docker-compose.yml to make sure metrics were being recorded by the prometheus docker container.
fixes#498
Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
This PR adds an endpoint to the proxy that serves metrics in Prometheus' text exposition format. The endpoint currently serves the `request_total`, `response_total`, `response_latency_ms`, and `response_duration_ms metrics`, as described in #536. The endpoint's port and address are configurable with the `CONDUIT_PROXY_METRICS_LISTENER` environment variable.
Tests have been added in t`ests/telemetry.rs`
Follow up
https://github.com/runconduit/conduit/pull/593 by avoiding
`cargo fetch` in favor of the implicit fetch done in `cargo build`
to work around the lack of a `--target` flag in `cargo fetch`. This
should at least slightly improve the speed and reliability of Travis
CI runs.
Signed-off-by: Brian Smith <brian@briansmith.org>
Replace an unconditional dependency on windows-specific crates in
tempdir (via its update of its remove_dir_all dependency), which
eliminates the need to download any windows-specific crates during
the build when targetting non-Windows platforms.
Also, when targetting Windows platforms, replace a winapi 0.2.x
dependency with a winapi 0.3.x dependency.
This results in two fewer downloads during Docker builds:
```diff
- Downloading winapi v0.2.8
- Downloading winapi-build v0.1.1
```
Signed-off-by: Brian Smith <brian@briansmith.org>
`cargo fetch` doesn't consider the target platform and downloads
all crates needed to build for any target.
Stop using `cargo fetch` and instead use the implicit fetch done
by `cargo build`, which *does* consider the target platform.
This change results in 12 (soon 15) fewer crates downloaded.
This is a non-trivial savings in build time for a full rebuild
since cargo downloads crates in parallel.
```diff
- Downloading bitflags v1.0.1
- Downloading fuchsia-zircon v0.3.3
- Downloading fuchsia-zircon-sys v0.3.3
- Downloading miow v0.2.1
- Downloading redox_syscall v0.1.37
- Downloading redox_termios v0.1.1
- Downloading termion v1.5.1
- Downloading winapi v0.3.4
- Downloading winapi-i686-pc-windows-gnu v0.4.0
- Downloading winapi-x86_64-pc-windows-gnu v0.4.0
- Downloading wincolor v0.1.6
- Downloading ws2_32-sys v0.2.1
```
I verified that no downloads are done during an incremental
build.
Signed-off-by: Brian Smith <brian@briansmith.org>
The existing `conduit dashboard` command supported opening the conduit
dashboard, or displaying the conduit dashboard URL, via a `url` boolean
flag.
Replace the `url` boolean flag with a `show` string flag, with three
modes:
`conduit dashboard --show conduit`: default, open conduit dashboard
`conduit dashboard --show grafana`: open grafana dashboard
`conduit dashboard --show url`: display dashboard URLs
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Successful `conduit check` commands now take into account `[ok]`
and `\n` tokens when constraining line length.
Fixes#554
Signed-off-by: Andy Hume <andyhume@gmail.com>
Existing Grafana configuration contained no dashboards, just a skeleton
for testing.
Introduce two Grafana dashboards:
1) Top Line: Overall health of all Conduit-enabled services
2) Deployment: Health of a specific conduit-enabled deployment
Fixes#500
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Applies timeout of 5s to check request contexts. This overrides
30s timeout applied at client transport level, and stops the
conduit check command from taking > 90s to complete.
Fixes#553
Signed-off-by: Andy Hume <andyhume@gmail.com>
When the proxy is run in a Docker container, it runs as PID 1, with
no default signal handlers setup. In order to react to signals from
Kubernetes about shutting down, we need to set up explicit handlers.
This adds handlers for SIGTERM and SIGINT.
Closes#549
This image isn't used. It references its base image using the `latest` tag, which
is wrong; it should have been using the tag that the base image was built with. It
is likely that the last few iterations of this image that we've published have
wrong and useless contents.
With that in mind, just remove the image.
Fixes#578.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Update release notes for v0.3.1.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Add PR authorship for non-Buoyant contributors.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Wrap lines at 100 characters.
Signed-off-by: Brian Smith <brian@briansmith.org>
We should review changes to Cargo.lock to ensure we're not adding
unexpected and/or unnecessary dependencies. (Maybe we should do the
same for Gopkg.lock, but I'm not in a position to say for sure.)
Signed-off-by: Brian Smith <brian@briansmith.org>
* Most controller listeners should only bind on localhost
* Use default listening addresses in controller components
* Review feedback
* Revert test_helper change
* Revert use of absolute domains
Signed-off-by: Alex Leong <alex@buoyant.io>
In order to ensure we catch discovery and routing issues arising from different logic for HTTP/1 and HTTP/2 requests, I've modified tests/discovery.rs to run all applicable tests with both HTTP/1 and HTTP/2 requests. The tests themselves are largely unchanged, but now there are separate modules containing HTTP/1 and HTTP/2 versions of a majority of the tests.
Commit 569d6939a7 introduced a regression that caused the proxy to stop using the Destination service for outbound HTTP/1 requests with no authority in the request URI but a valid authority in the `Host:` header.
The bug is due to some code in `Outbound::recognize` which assumed that a request had already been passed through `normalize_our_view_of_uri`. This was valid at one point while I was writing #492, as URIs were normalized prior to `recognize` and a request `Extension` was used to mark that they had been rewritten, and the host header and request URI could be assumed to be in agreement, but after merging #514 into the dev branch for #492, this behaviour changed and I forgot to update the logic in `recognize`.
I've fixed the issue by adding the logic for routing on `Host:` headers back into `Outbound::recognize`.
@seanmonstar added a test in `discovery.rs`, `outbound_http1_asks_controller_about_host`, which should exercise this case. I've added a couple more unit tests in that file to try and ensure we cover more of the different cases that can occur here.
Fixes#552
In some cases, we would adjust an existing Host header, or add one. And in all cases when an HTTP/1 request was received with an absolute-form target, it was not passed on.
Now, the Host header is never changed. And if the Uri was in absolute-form, it is sent in the same format.
Closes#518
The h2 crate (HTTP/2.0 client and server) has a new release which
includes bug fixes and stability improvements.
This updates the Cargo.lock file to include the new release.
Closes#538
Signed-off-by: Carl Lerche <me@carllerche.com>
An infinite loop exists in the TCP proxy, which could be triggered by any raw TCP connection (including HTTPS requests). The connection will be proxied successfully, but instead of closing, it will remain open, and the proxy's CPU usage will remain extremely high indefinitely.
Since `Duplex::poll` will call `half_in.copy_into()`/`half_out.copy_into()` repeatedly, even after they return `Async::Ready`, when one half has shut down and returned ready, it may still be polled again, as `Duplex::poll` waits until _both_ halves have returned `Ready`. Because of the guard that `!dst.is_shutdown`, intended to prevent the destination from shutting down twice, the function will not return if it is polled again after returning `Async::Ready` once.
I've fixed this by moving the guard against double shutdowns out of the loop, so that the function will return `Async::Ready` again if it is polled after shutting down the destination.
I've also included a unit test against regressions to this bug. The unit test fails against master.
Fixes#519
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Co-Authored-By: Andrew Seigner <andrew@sig.gy>
* Proxy: Don't resolve absolute names outside zone using Destinations service
Many absolute names were being resolved using the Destinations service due to logic error
in the proxy's matching of the zone to the default zone.
Fix that bug.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Temporarily stop trying to support configurable zones in the proxy.
None of the zone configuration is tested and lots of things assume the cluster
zone is `cluster.local`. Further, how exactly the proxy will actually learn the
cluster zone hasn't been decided yet.
Just hard-code the zone as "cluster.local" in the proxy until configurable zones
are fully implemented and tested to be working correctly.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Remove the CONDUIT_PROXY_DESTINATIONS_AUTOCOMPLETE_FQDN setting
The way that Kubernetes configures DNS search suffixes has some negative
consequences as some names like "example.com" are ambiguous: depending on
whether there is a service "example" in the "com" namespace, "example.com"
may refer to an external service or an internal service, and this can
fluctuate over time. In recognition of that we added the
CONDUIT_PROXY_DESTINATIONS_AUTOCOMPLETE_FQDN setting, thinking this would
be part of a solution for users to opt out of the unfortunate behavior
if their applications didn't depend on the DNS search suffix feature.
It turns out similar effects can be acheived using a custom dnsConfig,
starting in Kubernetes 1.10 when dnsConfig reaches the beta stability level.
Now any CONDUIT_PROXY_DESTINATIONS_AUTOCOMPLETE_FQDN-based seems duplicative.
Further, attempting to support it optionally made the code complex and hard
to read.
Therefore, let's just remove it. If/when somebody actually requests this
functionality then we can add it back, if dnsConfig isn't a valid alternative
for them.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Further hard-code "cluster.local" as the zone, temporarily.
Addresses review feedback.
Signed-off-by: Brian Smith <brian@briansmith.org>
Shortly after conduit is installed in k8s environment. The control plane component that establishes a watch endpoint with k8s run in to networking issues during proxy initialization. During failure, each watcher fails to retry its connection to k8s watch endpoint which leads to timeouts and eventually, multiple controller pod restarts.
This PR adds retry logic to each "watch" enabled package.
fixes#478
Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
Wwe will be able to simplify service discovery in the near future if we
can rely on the namespace being available.
Signed-off-by: Brian Smith <brian@briansmith.org>
Pick up https://github.com/danburkert/prost/pull/87, which results in the
following reduction in build dependencies for the proxy:
Removing failure_derive v0.1.1
Adding prost-derive v0.3.2 (https://github.com/danburkert/prost#3427352e)
Removing prost-derive v0.3.2
Removing quote v0.3.15
Removing syn v0.11.11
Removing synom v0.11.3
Removing synstructure v0.6.1
Removing unicode-xid v0.0.4
Signed-off-by: Brian Smith <brian@briansmith.org>
Kubernetes will do multiple DNS lookups for a name like
`proxy-api.conduit.svc.cluster.local` based on the default search settings
in /etc/resolv.conf for each container:
1. proxy-api.conduit.svc.cluster.local.conduit.svc.cluster.local. IN A
2. proxy-api.conduit.svc.cluster.local.svc.cluster.local. IN A
3. proxy-api.conduit.svc.cluster.local.cluster.local. IN A
4. proxy-api.conduit.svc.cluster.local. IN A
We do not need or want this search to be done, so avoid it by making each
name absolute by appending a period so that the first three DNS queries
are skipped for each name.
The case for `localhost` is even worse because we expect that `localhost` will
always resolve to 127.0.0.1 and/or ::1, but this is not guaranteed if the default
search is done:
1. localhost.conduit.svc.cluster.local. IN A
2. localhost.svc.cluster.local. IN A
3. localhost.cluster.local. IN A
4. localhost. IN A
Avoid these unnecessary DNS queries by making each name absolute, so that the
first three DNS queries are skipped for each name.
Signed-off-by: Brian Smith <brian@briansmith.org>
This PR ensures that the mapping of requests to outbound connections is segregated by `Host:` header values. In most cases, the desired behavior is provided by Hyper's connection pooling. However, Hyper does not handle the case where a request had no `Host:` header and the request URI had no authority part, and the request was routed based on the SO_ORIGINAL_DST in the desired manner. We would like these requests to each have their own outbound connection, but Hyper will reuse the same connection for such requests.
Therefore, I have modified `conduit_proxy_router::Recognize` to allow implementations of `Recognize` to indicate whether the service for a given key can be cached, and to only cache the service when it is marked as cachable. I've also changed the `reconstruct_uri` function, which rewrites HTTP/1 requests, to mark when a request had no authority and no `Host:` header, and the authority was rewritten to be the request's ORIGINAL_DST. When this is the case, the `Recognize` implementations for `Inbound` and `Outbound` will mark these requests as non-cachable.
I've also added unit tests ensuring that A, connections are created per `Host:` header, and B, that requests with no `Host:` header each create a new connection. The first test passes without any additional changes, but the second only passes on this branch. The tests were added in PR #489, but this branch supersedes that branch.
Fixes#415. Closes#489.
Grafana by default calls out to grafana.com to check for updates. As
user's of Conduit do not have direct control over updating Grafana
directly, this update check is not needed.
Disable Grafana's update check via grafana.ini.
This is also a workaround for #155, root cause of #519.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The main goal with this change is to make it clear, from looking at
Cargo.lock, that the Conduit proxy doesn't depend on OpenSSL. Although
the *ring* crate attempts to avoid conflicts with symbols defined in
OpenSSL, that is a manual process that doesn't have automatic
verification yet.
The secondary goal is to reduce the total number of dependencies to
make (at least) full from-scratch builds, such as those in CI, faster.
As a result of this PR, following the following upstream PRs we
submitted to prost, as well as some similar PRs in other upstream
projects and in conduit inself, our usage of prost now results in us
depending on many fewer crates:
* https://github.com/danburkert/prost/pull/78
* https://github.com/danburkert/prost/pull/79
* https://github.com/danburkert/prost/pull/82
* https://github.com/danburkert/prost/pull/84
Here are the crate dependencies that are removed:
* adler32
* aho-corasick
* build_const
* bzip2
* bzip2-sys
* crc
* curl
* curl-sys
* env_logger (0.4)
* flate2
* lazy_static
* libz-sys
* memchr
* miniz_oxide
* miniz_oxide_c_api
* msdos_time
* openssl-probe
* openssl-sys
* pkg-config
* podio
* regex
* regex-syntax
* schannel
* socket2
* thread_local
* unreachable
* utf8-ranges
* vcpkg
* zip
Pretty much all of these are build dependencies, but Cargo.lock doesn't
distinguish between build dependencies and regular dependencies.
Signed-off-by: Brian Smith <brian@briansmith.org>
* Add --expected-version flag for conduit check command
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
* Update build instructions
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
As a goal of being a transparent proxy, we want to proxy requests and
responses with as little modification as possible. Basically, servers
and clients should see messages that look the same whether the proxy was
injected or not.
With that goal in mind, we want to make sure that body headers (things
like `Content-Length`, `Transfer-Encoding`, etc) are left alone. Prior
to this commit, we at times were changing behavior. Sometimes
`Transfer-Encoding` was added to requests, or `Content-Length: 0` may
have been removed. While RC 7230 defines that differences are
semantically the same, implementations may not handle them correctly.
Now, we've added some fixes to prevent any of these header changes
from occurring, along with tests to make sure library updates don't
regress.
For requests:
- With no message body, `Transfer-Encoding: chunked` should no longer be
added.
- With `Content-Length: 0`, the header is forwarded untouched.
For responses:
- Tests were added that responses not allowed to have bodies (to HEAD
requests, 204, 304) did not have `Transfer-Encoding` added.
- Tests that `Content-Length: 0` is preserved.
- Tests that HTTP/1.0 responses with no body headers do not have
`Transfer-Encoding` added.
- Tests that `HEAD` responses forward `Content-Length` headers (but not
an actual body).
Closes#447
Signed-off-by: Sean McArthur <sean@seanmonstar.com>
When the conduit proxy is injected into the controller pod, we observe controller pod proxy stats show up as an "outbound" deployment for an unrelated upstream deployment. This may cause confusion when monitoring deployments in the service mesh.
This PR filters out this "misleading" stat in the public api whenever the dashboard requests metric information for a specific deployment.
* exclude telemetry generated by the control plane when requesting deployment metrics
fixes#370
Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
The dns_test had assumed DNS changes were deterministically ordered, but
util.DiffAddresses uses a map and therefore does not guarantee ordering.
Fix dns_test to sort TCP Addresses prior to comparison.
Fixes#515
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Grafana dashboards will not be available for the 0.3.1 release, but
BUILD.md provides an (incorrect) way to access Grafana.
Remove mention of Grafana for now. Re-add when dashboards are integrated
into Conduit.
Part of #420.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Currently, the `Reconnect` middleware does not reconnect on connection errors (see #491) and treats them as request errors. This means that when a connection timeout is wrapped in a `Reconnect`, timeout errors are treated as request errors, and the request returns HTTP 500. Since this is not the desired behavior, the connection timeouts should be removed, at least until their errors can be handled differently.
This PR removes the connect timeouts from `Bind`, as described in https://github.com/runconduit/conduit/pull/483#issuecomment-369380003.
It removes the `CONDUIT_PROXY_PUBLIC_CONNECT_TIMEOUT_MS` environment variable, but _not_ the `CONDUIT_PROXY_PRIVATE_CONNECT_TIMEOUT_MS` variable, since this is also used for the TCP connect timeouts. If we want also want to remove the TCP connection timeouts, I can do that as well.
Closes#483. Fixes#491.
In the initial review for this code (preceding the creation of the
runconduit/conduit repository), it was noted that this container is not
actually used, so this is actually dead code.
Further, this container actualy causes a minor problem, as it doesn't
implement any retry logic, thus it will sometimes often cause errors to
be logged. See
https://github.com/runconduit/conduit/issues/496#issuecomment-370105328.
Further, this is a "buoyantio/" branded container. IF we actually need
such a container then it should be a Conduit-branded container.
See https://github.com/runconduit/conduit/issues/478 for additional
context.
Signed-off-by: Brian Smith <brian@briansmith.org>
This PR changes the proxy to log error messages using `fmt::Display` whenever possible, which should lead to much more readable and meaningful error messages
This is part of the work I started last week on issue #442. While I haven't finished everything for that issue (all errors still are mapped to HTTP 500 error codes), I wanted to go ahead and open a PR for the more readable error messages. This is partially because I found myself merging these changes into other branches to aid in debugging, and because I figured we may as well have the nicer logging on master.
We previously `join`ed on piping data from both sides, meaning
that the future didn't complete until **both** sides had disconnected.
Even if the client disconnected, it was possible the server never knew,
and we "leaked" this future.
To fix this, the `join` is replaced with a `Duplex` future, which pipes
from both ends into the other, while also detecting when one side shuts
down. When a side does shutdown, a write shutdown is forwarded to the
other side, to allow draining to occur for deployments that half-close
sockets.
Closes#434
The Markdown files were all originally named "$x/_index.md"; I renamed
them as follows:
```
for x in `ls ~/conduit-site/conduit.io/content`; do
cp ~/conduit-site/conduit.io/content/$x/_index.md doc/$x.md
done
mv doc/doc.md doc/overview.md
```
When we publish the files on conduit.io we need to do the inverse
transformation to avoid breaking existing links.
The images were embedded using a syntax GitHub doesn't support. Also, the
images were not originally in a subdirectory of docs/.
Use normal Markdown syntax for image embedding, and reference the docs
using relative links to the images/ subdirectory. This way they will show
up in the GitHub UI. When we publish the docs on conduit.io we'll need to
figure out how to deal with this change.
I took the liberty of renaming data-plane.png to dashboard-data-plane.png to
clarify it a bit.
There is no other roadmap so there's no need to qualify this one as
"public." Before it was made public we marked it "public" to emphasize
that it would become public, but that isn't needed now.
Signed-off-by: Brian Smith <brian@briansmith.org>