Problem
If you navigate directly to (or do a hard refresh on) a path with more than one segment,
e.g. http://localhost:8084/namespaces/conduit, the dashboard JS is not served.
Pages with multi-segment paths have to be reached by loading the dashboard at a
different path and then clicking through.
When accessing the dashboard via `conduit dashboard`, we append a path prefix so that
we can connect using the k8s proxy. This means that moving the dashboard to serve
images off relative paths won't work, because we need to serve images whether the
dashboard is loaded from http://localhost:8084/namespaces/conduit or
from http://localhost:8084/namespaces.
Solution
Check whether we're serving the dashboard via the proxy URL, and if we are, adjust
the URL from which we serve the index bundle.
I've also added a very manual override for when the Conduit logo can't be found at the usual URL.
While preparing #946, I was again struck by the `discovery` module being very weighty
(nearly 800 dense lines). The intent of this change is only to improve readability. There
are no functional changes. The following aesthetic changes have been made:
* `control::discovery` has been renamed to `control::destination` to be more consistent
with the rest of conduit's terminology (destinations aren't the only thing that needs
to be discovered).
* In that vein, the `Discovery` type has been renamed `Resolver` (since it exposes one
function, `resolve`).
* The `Watch` type has been renamed `Resolution`. This disambiguates the type from
`futures_watch::Watch` (which is used in the same code) and makes it more clearly the
product of a `Resolver`.
* The `Background` and `DiscoveryWork` names were very opaque. `Background` is now
`background::Config` to indicate that it can't actually _do_ anything; and
`DiscoveryWork` is now `background::Process` to indicate that it's responsible for
processing destination updates.
* `DestinationSet` is now a private implementation detail in the `background` module.
* An internal `ResolveRequest` type replaces an unnamed tuple (now that it's used across
files).
* `rustfmt` has been run on `background.rs` and `endpoint.rs`.
This is in preparation for landing the Tokio upgrade.
In the upcoming Hyper release, the handling of absolute-form request URIs has
moved from `hyper::Request` to the `hyper::client::connect::Connect` trait.
Once we upgrade to the new Tokio, we will have to upgrade our Hyper
dependency as well.
Currently, Conduit detects whether the request URI is in absolute form in
`h1::normalize_our_view_of_uri` and adds an extension to the request if it is.
This will no longer work with the new Hyper, as that function is called from
the `bind::NormalizeUri` service, which is not constructed until after the
client connection is established. Therefore, it is necessary to move this
information to `bind::Protocol`, so that it can be passed to
`transparency::client::HyperConnect` (our implementation of Hyper's `Connect`
trait) when we are using the newest Hyper.
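As a rough sketch of the shape this takes (the type and field names here are hypothetical, and the `Uri` accessors assume the `http` 0.1 crate of this era):
```rust
extern crate http;

/// Whether the original HTTP/1 request URI was in absolute form is recorded
/// on the protocol descriptor, so it is still available when the client
/// connection is established.
#[derive(Clone, Debug)]
pub enum Protocol {
    Http1 { was_absolute_form: bool },
    Http2,
}

/// An HTTP/1 URI is in absolute form when it carries both a scheme and an
/// authority, e.g. `http://example.com/path`.
fn is_absolute_form(uri: &http::Uri) -> bool {
    uri.scheme_part().is_some() && uri.authority_part().is_some()
}
```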
For now, however, I've left in the `UriIsAbsoluteForm` extension and continued
to set it in `h1::normalize_our_view_of_uri`, since we still use it with the
current Hyper version. I thought it best to minimize the changes to this
existing code, as it will be removed when we migrate to the new Hyper.
This PR shouldn't cause any functional changes.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Problem
Previously we were linking the icon and the expand text of the menu bar separately.
This made the clickable areas of the menus inconsistent, which we were patching
over with CSS; that fix wasn't rendered consistently across browsers.
Fix
Linkify the whole Menu Item rather than linking the icon and text separately.
This enables the removal of the inline-block display for links and
fixes menu items not showing up when the sidebar is expanded on Firefox.
Additionally, it makes clicking menu links much more consistent.
The frontend assets were not optimized, resulting in suboptimal page load times.
Enabled webpack production mode in the Dockerfile; this still allows a good development
and debugging experience when running the web interface locally during development.
Also added minification of the CSS handled by css-loader.
The web interface still works as expected.
The size of the JS file has been reduced from 3.6 MB to 1.2 MB.
And the CSS minification has reduced sidebar.css from 5.71 kB to 4.33 kB,
and styles.css from 4.18 kB to 3.1 kB.
Fixes #378
Signed-off-by: Kim Christensen <kimworking@gmail.com>
This is in preparation for landing the Tokio upgrade.
In order to be generic over Tokio's current thread and threadpool executors,
a number of types in Conduit which were not previously `Send` are now required
to be `Send`. A majority of this work will be done in the main Tokio upgrade
PR, as it is in many cases not possible to make these types `Send` _without_
using the new Tokio API (in order to remove `Handle`s, etc.); however, I'm
factoring out everything possible and trying to land it in separate PRs.
The p2c load balancer constructed in `Outbound` is currently parameterized
over a random number generator. We currently construct it by getting the
thread-local RNG, and passing it to the load balancer constructor. However,
the thread-local RNG is not `Send`. I've fixed this issue by creating a new
zero-sized empty struct type which implements `rand::Rng` simply by calling
`thread_rng()` every time it's called, and passing that to
`choose::power_of_two_choices` instead. Since this is an empty type which
contains no data, and the correct thread-local RNG is accessed whenever
the methods are called, this new type can trivially be `Send`. According to
the `rand` crate's documentation, this is the correct way to use `ThreadRng`
anyway:
> Retrieve the lazily-initialized thread-local random number generator, seeded
> by the system. Intended to be used in method chaining style, e.g.
> `thread_rng().gen::<i32>()`.
> (from https://docs.rs/rand/0.4.2/rand/fn.thread_rng.html)
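A minimal sketch of the new type, assuming rand 0.4, where `next_u32` is the only required method of `Rng` (the struct name here is hypothetical):
```rust
extern crate rand;

use rand::Rng;

/// A zero-sized handle that is trivially `Send`: it holds no state and
/// fetches the thread-local RNG on every call.
#[derive(Copy, Clone, Debug, Default)]
struct LazyThreadRng;

impl Rng for LazyThreadRng {
    fn next_u32(&mut self) -> u32 {
        rand::thread_rng().next_u32()
    }
}
```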
This shouldn't lead to any functional changes.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
This is in preparation for landing the Tokio upgrade.
The test `discovery::outbound_updates_newer_services` currently contains an
assertion that an HTTP/2 request to an HTTP/1 service will return a response
with status code 500. This is because the current version of Hyper on which
Conduit depends does not support protocol upgrades.
However, commit hyperium/hyper@bc6af88a32, which
adds support for this kind of protocol upgrade, was recently merged to Hyper's
master branch. Therefore, this assertion will no longer be correct once we
depend on the upcoming Hyper release. When we migrate to the new Tokio, it will
be necessary to upgrade our Hyper dependency as well, and this test will fail.
I've modified the test to no longer make assertions about the response's status
code, so that it's compatible with both the current and future Hyper versions.
If the response is not `Ok`, the test will still fail, since
`tests::support::Client::request()` `expect`s that the response is successful,
but the status code is ignored. I've added a comment in the test explaining
this.
Eventually, when the master version of Conduit depends on the latest Hyper, we
may want to change this test to assert that the status code is 200 instead. We
may also want to add more tests for Hyper's protocol upgrade functionality, but
that seems out of scope for this PR.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Allow the Stat endpoint in the public-api to accept requests for resourceType "all".
Currently, this queries Pods, Deployments, RCs and Services, but can be modified
to query other resources as well.
Both the CLI and web endpoints now work if you set resourceType to "all",
e.g. `conduit stat all`.
The router's cache has no means to evict unused entries when capacity is
reached.
This change does the following:
- Wraps cache values in a smart pointer that tracks the last access time of
each entry. The smart pointer updates the access time when the reference
to the entry is dropped.
- When capacity is not available, all entries that have not been accessed
within some minimum idle age are dropped.
Accesses and updates to the map are O(1) when capacity is available.
Reclaiming capacity is O(n), so the router should be configured with enough
capacity that reclamation is rarely needed.
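A minimal sketch of the idea, with hypothetical names (the real implementation lives in the router's cache module):
```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::ops::Deref;
use std::time::{Duration, Instant};

/// A cache node pairs the value with the time it was last accessed.
struct Node<V> {
    value: V,
    last_access: Instant,
}

/// The smart pointer handed out by the cache; dropping it refreshes the
/// node's access time, so reads stay O(1).
struct Access<'a, V: 'a>(&'a mut Node<V>);

impl<'a, V> Deref for Access<'a, V> {
    type Target = V;
    fn deref(&self) -> &V {
        &self.0.value
    }
}

impl<'a, V> Drop for Access<'a, V> {
    fn drop(&mut self) {
        self.0.last_access = Instant::now();
    }
}

/// O(n) reclamation, run only when the cache is at capacity: drop every
/// entry that has been idle for at least `min_idle`.
fn reclaim<K: Eq + Hash, V>(cache: &mut HashMap<K, Node<V>>, min_idle: Duration) {
    let now = Instant::now();
    cache.retain(|_, node| now.duration_since(node.last_access) < min_idle);
}
```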
* Fix issue where we were waiting for the next polling interval when switching tabs
When we switch tabs, we update the Props of the ResourceList component, but we weren't
resetting how we poll the server. This meant we'd wait until the end of the current polling interval
(2s) to get the data for the tab we just switched to.
I've added stopServerPolling and startServerPolling methods so that we can cancel the resource
requests of the page we're leaving and immediately start polling for new data if the resource type
changes.
The router's `Inner` type contains a map of routes. Recently, this map's
capacity has been constrained to prevent leaks in long-running
processes.
This change prepares for a fuller LRU implementation by moving the
router's `Inner` type to a new (tested) module, `cache`.
The router stores its cache and `Recognize` implementation within a `Mutex`,
but there is no need for the recognizer to be locked.
This change creates a new `Cache` type that is locked independently of
`Recognize`. In order to accomplish this, `Recognize::bind_service` has
been changed to take an immutable reference to `self`.
The (unused) `Single` type has been removed because it relied on
`bind_service` being mutable.
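Roughly, the split looks like this (the shapes here are hypothetical and heavily simplified):
```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, Mutex};

/// `bind_service` now takes `&self`, so implementations need no lock of
/// their own; only the cache is synchronized.
trait Recognize {
    type Key: Eq + Hash;
    type Service;
    fn bind_service(&self, key: &Self::Key) -> Self::Service;
}

struct Router<R: Recognize> {
    recognize: R,
    cache: Arc<Mutex<HashMap<R::Key, R::Service>>>,
}
```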
The way that git-related version information is linked into Go binaries
busts Docker's cache, such that every commit causes all binaries to be
rebuilt.
In order to ameliorate this, we can build each binary once without
version information first so that its artifacts are cached. When Go
sources are not changed and only the version information changes, builds
are 4.3x faster than before (from 5+ minutes to <90s).
On `master`:
Branch off of master and build (mostly cached):
```
:; time DOCKER_TRACE=1 bin/docker-build
...
DOCKER_TRACE=1 bin/docker-build 9.10s user 6.30s system 5% cpu 4:26.47 total
```
Rebuild without changing anything (highly cached):
```
:; time DOCKER_TRACE=1 bin/docker-build
...
DOCKER_TRACE=1 bin/docker-build 9.23s user 6.04s system 47% cpu 32.017 total
```
Update only the git sha and rebuild:
```
:; git ci -am 'bump it' --allow-empty
[ver/eg 2749eb3] bump it
:; time DOCKER_TRACE=1 bin/docker-build
...
DOCKER_TRACE=1 bin/docker-build 8.55s user 6.08s system 4% cpu 5:22.25 total
```
On this branch:
Rebuild without changing anything (highly cached):
```
:; time DOCKER_TRACE=1 bin/docker-build
...
DOCKER_TRACE=1 bin/docker-build 8.94s user 5.97s system 46% cpu 32.257 total
```
Update only the git sha and rebuild:
```
:; git ci -am 'bump it' --allow-empty
[ver/go-docker-cache-versionless 77a80b5] bump it
:; time DOCKER_TRACE=1 bin/docker-build
...
DOCKER_TRACE=1 bin/docker-build-cli-bin 2.02s user 1.34s system 9% cpu 34.144 total
```
* Turn the status bars red if there exist failed pods in the namespace
* Also use failed pods in conduit component table
Now that the API returns the number of failed pods, use this info to indicate failed pods in
the ServiceMesh page.
The bars will turn red if there are any failed pods present in the namespace.
They'll be green if they have non-zero pods meshed, and grey otherwise.
The proxy's Dockerfile is split into stages: build and runtime.
The build stage includes all of the intermediate build information, and
the runtime stage discards these layers, leaving a small production-ready
image.
In order to improve docker build times, we can save this build layer to
be reused.
This reduces the docker build of the proxy in CI from 15 minutes to
about 7.5 minutes (when the proxy is not changed).
The goals of this change are:
1. Reduce the size/complexity of `control::discovery` in order to ease code reviews.
2. Extract a reusable gRPC streaming utility.
There are no intended functional changes.
`control::discovery::DestinationServiceQuery` is used to track the state of a request (and
streaming response) to the destination service. Very little of this logic is specific to
the destination service.
The `DestinationServiceQuery` and associated `UpdateRx` type have been moved to a new
module, `control::remote_stream`, as `Remote` and `Receiver`, respectively. Both of these
types are generic over the gRPC message type, so it will be possible to use this utility
with additional API endpoints.
The `Receiver::poll` implementation has been simplified to be more idiomatic with the rest
of our code (namely, using `try_ready!`).
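For illustration, here is the shape of that idiom under futures 0.1 (this `Receiver` is a simplified stand-in for the real type, generic over the underlying response stream):
```rust
#[macro_use]
extern crate futures;

use futures::{Async, Poll, Stream};

/// A receiver generic over the underlying gRPC response stream `S`.
pub struct Receiver<S> {
    inner: S,
}

impl<S: Stream> Stream for Receiver<S> {
    type Item = S::Item;
    type Error = S::Error;

    fn poll(&mut self) -> Poll<Option<S::Item>, S::Error> {
        // `try_ready!` returns early on `NotReady` and on errors,
        // flattening what would otherwise be a three-way match.
        let msg = try_ready!(self.inner.poll());
        Ok(Async::Ready(msg))
    }
}
```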
Grafana provides default dashboards for Prometheus and Grafana health.
The community also provides Kubernetes-specific dashboards. Conduit was
not taking advantage of these.
Introduce new Grafana dashboards focused on Grafana, Kubernetes, and
Prometheus health. Tag all Conduit dashboards for easier UI navigation.
Also fix layout in Conduit Health dashboard.
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Add namespaces as a top level resource in the Web UI
This PR does the following:
- Replace the deployments table in the service mesh page with namespaces
- Add a Namespaces index page that lists all namespaces and their stats
- Add an individual namespace page showing all resources for that namespace
- Make the incomplete mesh message more generic to any resource type
- Revamp the rest of the service mesh page to move off ListPods
Configuration values that take durations are currently specified as
time values with no units. So `600` may mean 600ms in some contexts and
10 minutes in others.
In order to avoid this problem, this change now requires that
configurations provide explicit units for time values, such as '600ms' or
'10 minutes'.
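For illustration, a minimal sketch of unit-suffix parsing (this is a hypothetical helper, not the parser as it exists in the codebase):
```rust
use std::time::Duration;

fn parse_duration(s: &str) -> Result<Duration, String> {
    // Split the magnitude from the unit suffix at the last digit.
    let split = s
        .rfind(|c: char| c.is_digit(10))
        .map(|i| i + 1)
        .ok_or_else(|| format!("no digits in {:?}", s))?;
    let (magnitude, unit) = s.split_at(split);
    let magnitude = magnitude.trim().parse::<u64>().map_err(|e| e.to_string())?;
    match unit.trim() {
        "ms" => Ok(Duration::from_millis(magnitude)),
        "s" => Ok(Duration::from_secs(magnitude)),
        "m" => Ok(Duration::from_secs(magnitude * 60)),
        // A bare number like "600" has no unit and is now rejected.
        u => Err(format!("invalid unit: {:?}", u)),
    }
}
```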
Fixes #27.
* Fix bug where `GetPodsFor(pod)` was returning all pods in a namespace
Problem
In `lister.GetPodsFor`, when the input object was a pod, we would return all the pods in the namespace. I would expect `GetPodsFor(pod)` to return only one pod - the pod itself.
Cause
The cause of this is that when the object type was a pod, we were setting `selector = labels.Everything()`, which selects all the pods in the namespace.
Fix
Special-case `GetPodsFor(pod)` to return the pod itself, rather than looking up pods via labels.
Make the sidebar icon-based and collapsed by default
I had to move the version-check call into the sidebar component, and added an
indicator shown when the sidebar is minimized if there is a Conduit update.
Currently I just have letters representing the icons for Deployments, RCs and Pods,
but we can change this in the future.
It's easy to misconfigure default durations, since they're recorded as
integers and converted to Durations separately.
Now, all default constants that represent durations use const `Duration`
instances (enabled by a recent Rust release).
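For example (the constant names here are hypothetical; `Duration`'s constructors are usable in const context on sufficiently recent Rust):
```rust
use std::time::Duration;

// Units are carried by the type, so a misread of "600" as seconds vs.
// milliseconds is no longer possible.
const DEFAULT_METRICS_RETAIN_IDLE: Duration = Duration::from_secs(10 * 60);
const DEFAULT_BIND_TIMEOUT: Duration = Duration::from_millis(10_000);
```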
This fixes #905, which was caused by using the wrong time unit for the
metrics retain time.
PR #898 introduces capacity limits to the balancer. However, because the
router supports "single-use" routes--routes that are bound only for the
life of a single HTTP/1 request--it is easy for a router to exceed its
configured capacity.
In order to fix this, the `Reuse` type is removed from the router
library so that _all_ routes are considered cacheable. It's now the
responsibility of the bound service to enforce policies with regards to
client retention.
Routes were not added to the cache when the service could not be used to
process more than a single request. Now, `Bind` wraps its returned
services in a `Binding` type that dictates whether a single client is
reused or a new one is bound for each request.
This enables all routes to be cached without changing behavior with
regards to connection reuse.
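Schematically, the wrapper might look like this (a sketch with hypothetical names, not the exact type):
```rust
/// `Bind` wraps the services it returns so that every route is cacheable;
/// the service itself decides whether the underlying client is kept.
enum Binding<S> {
    /// A single client, reused across all requests on this route.
    Bound(S),
    /// A fresh client is bound for each request (the old "single-use"
    /// behavior), but the route entry itself stays in the cache.
    BindsPerRequest(Option<S>),
}
```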
* Modify the Stat endpoint to also return the count of failed pods
* Add comments explaining pod count stats
* Rename total pod count to running pod count
This is to support the service mesh overview page, as I'd like to include an indicator of
failed pods there.
Currently, the proxy may cache an unbounded number of routes. In order
to prevent such leaks in production, new configurations are introduced
to limit the number of inbound and outbound HTTP routes. By default, we
support 100 inbound routes and 10K outbound routes.
In a followup, we'll introduce an eviction strategy so that capacity can
be reclaimed gracefully.
The Router's primary `call` implementation is somewhat difficult to
follow.
This change does not introduce any functional changes, but makes the
function easier to reason about.
This is being done in preparation for functional changes.
This PR adds unit tests for `metrics::record`, based on the benchmarks for the
same function. Currently, there is a test that fires a single response end event
and asserts that the metrics state is correct afterward, and a test that fires
all the events to simulate a full connection lifetime, and asserts that the
metrics state is correct afterward. I'd like to also add a test that simulates
multiple events with different labels, but I'll add that in a subsequent PR.
In order to add these tests, it was necessary to add test-only accessors and
to make some `metrics` structs `pub` so that the tests can access them.
I also added some test-only functions to `metrics::Histogram`s, to make
them easier to make assertions about.
When the proxy's Dockerfile ran tests, it was necessary to build
`Arbitrary` types for quickchecking protobuf types.
Now that tests have been disabled, this optional set of dependencies is
no longer required.
Relates to #882.
The proxy's tap server assigns a sequential numeric ID to each inbound
Tap request to assist tap lifecycle management.
The server implementation keeps a local counter to track tap
IDs. However, this implementation is cloned for each individual tap
request, so `0` is the only tap ID ever used.
This change moves the Tap ID to be stored in a shared atomic integer.
Debug logging has been improved as well.
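A minimal sketch of the fix (names hypothetical): the counter lives behind an `Arc`, so every clone of the server hands out IDs from the same sequence.
```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

#[derive(Clone)]
struct TapServer {
    next_id: Arc<AtomicUsize>,
}

impl TapServer {
    fn next_tap_id(&self) -> usize {
        // fetch_add returns the previous value, so IDs are 0, 1, 2, ...
        self.next_id.fetch_add(1, Ordering::SeqCst)
    }
}
```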
The proxy Dockerfile includes test execution. While the intentions of
this are good, it has unintended consequences: we can ship code linked
with test dependencies.
Because we have other means for testing proxy code (cargo, locally; and
CI runs tests outside of Docker), it is fine to remove these tests.
The proxy is now configured with the CONDUIT_PROXY_METRICS_RETAIN_IDLE
environment variable that dictates the amount of time that the proxy will retain
metrics that have not been updated.
A timestamp is maintained for each unique set of labels, indicating the last time
that the scope was updated. Then, when metrics are read, all metrics older than
CONDUIT_PROXY_METRICS_RETAIN_IDLE are dropped from the stats registry.
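In sketch form (names hypothetical), the read-time sweep looks like:
```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

struct Scope {
    /// The last time any metric in this scope was updated.
    last_update: Instant,
    // counters, gauges, histograms ...
}

/// Called when metrics are read: drop every scope that has been idle
/// longer than the configured retain window.
fn retain_active<L: Eq + Hash>(scopes: &mut HashMap<L, Scope>, retain_idle: Duration) {
    let now = Instant::now();
    scopes.retain(|_, scope| now.duration_since(scope.last_update) <= retain_idle);
}
```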
A `ctx::test_utils` module has been added to aid testing.
Fixes #819
Previously, we maintained a map of labels for each metric. Because the same keys are used
in multiple scopes, this caused redundant hashing and map lookups when updating metrics.
With this change, there is now one map entry per unique label scope, and all of the
metrics for each scope are stored in the value. This makes metric insertion faster and
prepares for eviction of idle metrics.
The `Metric` type has been split into `Metric`, which now only holds metric metadata and is
responsible for printing a given metric, and `Scopes`, which holds groupings of metrics by
label.
The `metrics!` macro is provided to make it easy to define `Metric` instances statically.
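The new layout, roughly (field names hypothetical): labels are hashed once per update, and the lookup yields every metric for the scope.
```rust
use std::collections::HashMap;
use std::hash::Hash;

struct Metrics {
    request_total: u64,
    response_total: u64,
    // latencies, etc.
}

struct Scopes<L: Eq + Hash>(HashMap<L, Metrics>);

impl<L: Eq + Hash> Scopes<L> {
    fn record_response(&mut self, labels: L) {
        // One hash and one lookup for the whole scope, not one per metric.
        let metrics = self.0.entry(labels).or_insert(Metrics {
            request_total: 0,
            response_total: 0,
        });
        metrics.response_total += 1;
    }
}
```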
In order to set up for a refactor that removes the `Metric` type, the
`FmtMetric` trait--implemented by `Counter`, `Gauge`, and
`Histogram`--is introduced to push Prometheus formatting down into each
type.
With this change, the `Histogram` type now relies on `Counter` (and its
metric formatting) more heavily.
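The trait's shape, sketched (the exact signature is hypothetical):
```rust
use std::fmt;

/// Each metric type renders its own Prometheus text format.
trait FmtMetric {
    fn fmt_metric(&self, f: &mut fmt::Formatter, name: &str) -> fmt::Result;
}

struct Counter(u64);

impl FmtMetric for Counter {
    fn fmt_metric(&self, f: &mut fmt::Formatter, name: &str) -> fmt::Result {
        writeln!(f, "{} {}", name, self.0)
    }
}

// A histogram can now delegate to `Counter`'s formatting for its buckets
// and totals, which is what "relies on `Counter` more heavily" refers to.
struct Histogram {
    buckets: Vec<(f64, Counter)>,
    sum: Counter,
}
```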
Enables filtering by one or more namespaces. Table updates are prevented
when the filter menu is open, as table updates will rerender the menu,
unselecting anything the user has selected but not confirmed.
This PR removes the `Arc`s from the various label types in the proxy's
`metrics` modules. This should make the write side of the metrics code
much more efficient (and makes the code much simpler! :D).
This change was particularly easy to implement for the TCP `TransportLabels`
and `TransportCloseLabels`, which consisted of only `struct`s and `enum`s,
and could easily be changed to derive `Copy`.
For protocol-level `RequestLabels`, the request's authority was a `String`,
which still needs to be reference-counted, as the overhead of cloning `String`s
is almost certainly worse than that added by ref-counting. However, rather than
adding an additional `Arc<str>`, I changed `RequestLabels` to store the
authority as a `http::uri::Authority`, which is backed by a `ByteStr` and thus
already ref-counted. Now, when constructing `RequestLabels`, we just take
another reference to the `Authority` already stored in the request context.
Since `Authority` implements `fmt::Display` already, formatting the labels
still works.
`ResponseLabels` already store the `DstLabels` string in an `Arc`, so no
additional changes there were necessary. By removing the outer `Arc` around
`ResponseLabels`, we now only have to ref-count the portion of the label type
that would actually be inefficient to clone.
@olix0r ran the benchmarks from #874 against this branch, and it seems to be
a small but noticeable improvement:
```
test record_many_dsts ... bench: 151,076 ns/iter (+/- 182,151)
test record_one_conn_request ... bench: 1,599 ns/iter (+/- 209)
test record_response_end ... bench: 676 ns/iter (+/- 144)
```
before:
```
test record_many_dsts ... bench: 158,403 ns/iter (+/- 130,241)
test record_one_conn_request ... bench: 1,823 ns/iter (+/- 1,408)
test record_response_end ... bench: 547 ns/iter (+/- 70)
```
Signed-off-by: Eliza Weisman <eliza@buoyant.io>