Commit Graph

514 Commits

Author SHA1 Message Date
Risha Mars 0431193e94
Indicate failed pods in ServiceMesh page (#914)
* Turn the status bars red if there exist failed pods in the namespace
* Also use failed pods in conduit component table

Now that the API returns the number of failed pods, use this info to indicate failed pods in 
the ServiceMesh page. 
The bars will turn red if there are any failed pods present in the namespace. 
They'll be green if they have non-zero pods meshed, and grey otherwise.
2018-05-09 10:22:45 -07:00
Oliver Gould e5ad5de975
Reuse the proxy's build stage across CI runs (#891)
The proxy's Dockerfile is split into stages: build and runtime.
The build stage includes all of the intermdiate build information, and
the runtime image discards these layers with a small production-ready
image.

In order to improve docker build times, we can save this build layer to
be reused.

This reduces the docker build of the proxy in CI from 15 minutes to
about 7.5 minutes (when the proxy is not changed).
2018-05-09 09:11:58 -07:00
Sean McArthur b48a26b0b7
proxy: change peek to use reads for eventual support of TLS (#901) 2018-05-08 18:19:12 -07:00
Oliver Gould 8bedd9d38a
rfc: Make DestinationServiceQuery generic (#749)
The goals of this change are:
1. Reduce the size/complexity of `control::discovery` in order to ease code reviews.
2. Extract a reusable grpc streaming utility.

There are no intended functional changes.

`control::discovery::DestinationServiceQuery` is used to track the state of a request (and
streaming response) to the destination service. Very little of this logic is specific to
the destination service.

The `DestinationServiceQuery` and associated `UpdateRx` type have been moved to a new
module, `control::remote_stream`, as `Remote` and `Receiver`, respectively. Both of these
types are generic over the gRPC message type, so it will be possible to use this utility
with additional API endpoints.

The `Receiver::poll` implementation has been simplified to be more idiomatic with the rest
of our code (namely, using `try_ready!`).
2018-05-08 16:54:20 -07:00
Andrew Seigner 1275b1ae89
Introduce Grafana, K8s, and Prom dashboards (#904)
Grafana provides default dashboards for Prometheus and Grafana health.
The community also provides Kubernetes-specific dashboards. Conduit was
not taking advantage of these.

Introduce new Grafana dashboards focused on Grafana, Kubernetes, and
Prometheus health. Tag all Conduit dashboards for easier UI navigation.
Also fix layout in Conduit Health dashboard.

Part of #420

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-05-08 23:11:43 +02:00
Risha Mars 1b0f269a43
Add per namespace pages that show all resource stats for a namespace (#893)
Add namespaces as a top level resource in the Web UI

This PR does the following:
- Replace the deployments table in the service mesh page with namespaces
- Add a Namespaces index page that lists all namespaces and their stats
- Add an individual namespace page showing all resources for that namespace
- Make the incomplete mesh message more generic to any resource type
- Revamp rest of service mesh page to move off ListPods
2018-05-08 14:05:16 -07:00
Oliver Gould 63fbbd6931
proxy: Parse units with duration configurations (#909)
Configuration values that take durations are currently specified as
time values with no units.  So `600` may mean 600ms in some contexts and
10 minutes in others.

In order to avoid this problem, this change now requires that
configurations provide explicit units for time values such as '600ms' or
10 minutes'.

Fixes #27.
2018-05-08 13:54:12 -07:00
Risha Mars 416381cdfd
Fix bug where GetPodsFor(pod) was returning all pods in a namespace (#900)
* Fix bug where GetPodsFor(pod) was returning all pods in a namespace

Problem
In lister.GetPodsFor, when the input object was a pod, we would return all the pods in the namespace. I would expect GetPodsFor(pod) to return only one pod - the pod itself.

Cause
The cause of this is that when the object type was pod we were setting the selector to selector = labels.Everything() which gets all the pods in the namespace.

Fix
Special case GetPodsFor(pod) to return the pod itself, rather than looking up pods via labels.
2018-05-08 13:52:49 -07:00
Risha Mars 0f7d98067d
Make the sidebar icon based and collapsed by default (#897)
Make the sidebar icon based and collapsed by default

I had to move the call to version check into the sidebar component, indicator 
when the sidebar was minimized if there was a conduit update.

Currently I just have letters representing the icons for Deployments, RCs and Pods, 
but we can change this in the future.
2018-05-08 12:30:05 -07:00
Oliver Gould 68e203a2fc
proxy: Use Duration types for config defaults (#906)
It's easy to misconfigure default durations, since they're recorded as
integers and converted to Durations separately.

Now, all default constants that represent durations use const `Duration`
instances (enabled by a recent Rust release).

This fixes #905 which was caused by using the wrong time unit for the
metrics retain time.
2018-05-08 10:58:22 -07:00
Oliver Gould f4dba72cc3
proxy: Track SingleUse services against router capacity (#902)
PR #898 introduces capacity limits to the balancer. However, because the
router supports "single-use" routes--routes that are bound only for the
life of a single HTTP1 request--it is easy for a router to exceed its
configured capacity.

In order to fix this, the `Reuse` type is removed from the router
library so that _all_ routes are considered cacheable. It's now the
responsibility of the bound service to enforce policies with regards to
client retention.

Routes were not added to the cache when the service could not be used to
process more than a single request. Now, `Bind` wraps its returned
services (via the `Binding` type), that dictate whether a single client
is reused or if one is bound for each request.

This enables all routes to be cached without changing behavior with
regards to connection reuse.
2018-05-08 10:57:56 -07:00
Risha Mars f94856e489
Modify the Stat endpoint to also return the number of failed conduit pods (#895)
* Modify the Stat endpoint to also return the count of failed pods
* Add comments explaining pod count stats
* Rename total pod count to running pod count

This is to support the service mesh overview page, as I'd like to include an indicator of
failed pods there.
2018-05-08 10:35:21 -07:00
Oliver Gould 02e6d018d0
proxy: Bound on router capacity (#898)
Currently, the proxy may cache an unbounded number of routes. In order
to prevent such leaks in production, new configurations are introduced
to limit the number of inbound and outbound HTTP routes. By default, we
support 100 inbound routes and 10K outbound routes.

In a followup, we'll introduce an eviction strategy so that capacity can
be reclaimed gracefully.
2018-05-04 16:32:30 -07:00
Oliver Gould 3310968647
proxy: Refactor router implementation (#894)
The Router's primary `call` implementation is somewhat difficult to
follow.

This change does not introduce any functional changes, but makes the
function easier to reason about.

This is being done in preparation for functional changes.
2018-05-02 15:47:36 -07:00
Oliver Gould 7f16079f64
proxy: Upgrade tower dependencies (#892)
In order to pick up https://github.com/tower-rs/tower-grpc/pull/60,
upgrade tower dependencies. This will reduce the cost of updating
for upcoming tower-h2 improvements.
2018-05-02 13:40:55 -07:00
Eliza Weisman 86bb701be8
Add unit tests for `metrics::record` (#890)
This PR adds unit tests for `metrics::record`, based on the benchmarks for the
same function. Currently, there is a test that fires a single response end event
and asserts that the metrics state is correct afterward, and a test that fires
all the events to simulate a full connection lifetime, and asserts that the
metrics state is correct afterward. I'd like to also add a test that simulates
multiple events with different labels, but I'll add that in a subsequent PR,

In order to add these tests, it was necessary to to add test-only accessors
to make some `metrics` structs `pub`` so that the test can access them. 
I also added some test-only functions to `metrics::Histogram`s, to make 
them easier to make assertions about.
2018-05-02 13:26:27 -07:00
Oliver Gould 2578a47617
proxy: Do not build Arbitrary types in Docker (#889)
When the proxy's Dockerfile ran tests, it was necessary to build
Arbitrary types for quickchecking protobuf types.

Now that tests have been disabled, this optional set of dependencies is
no longer required.

Relates to #882.
2018-05-01 14:56:24 -07:00
Oliver Gould 88656ecd9d
proxy: add `bench` tests to CI (#883)
Proxy tests, including benchmarks, are run against run nightly.

Because nightly is not stable, failures are ignored.
2018-05-01 13:05:42 -07:00
Oliver Gould eb08d47347
proxy: Fix Tap ID generation (#885)
The proxy's tap server assigns a sequential numeric ID to each inbound
Tap request to assist tap lifecycle management.

The server implementation keeps a local counter to keep track of tap
IDs. However, this implementation is cloned for each individual tap
requests, so `0` the only tap ID ever used.

This change moves the Tap ID to be stored in a shared atomic integer.

Debug logging has been improved as well.
2018-05-01 11:59:45 -07:00
Oliver Gould 1801118906
Do not run tests in proxy Dockerfile (#882)
The proxy Dockerfile includes test execution. While the intentions of
this are good, it has unintended consequences: we can ship code linked
with test dependencies.

Because we have other means for testing proxy code (cargo, locally; and
CI runs tests outside of Docker), it is fine to remove these tests.
2018-05-01 11:54:02 -07:00
Eliza Weisman dd3b952634 proxy: Fix metrics constructor in benches (#881)
Fixes a test compilation error.
2018-04-30 17:48:07 -07:00
Oliver Gould ada5cb267e
proxy: Expire metrics that have not been updated for 10 minutes (#880)
The proxy is now configured with the CONDUIT_PROXY_METRICS_RETAIN_IDLE
environment variable that dictates the amount of time that the proxy will retain
metrics that have not been updated.

A timestamp is maintained for each unique set of labels, indicating the last time
that the scope was updated. Then, when metrics are read, all metrics older than
CONDUIT_PROXY_METRICS_RETAIN_IDLE are dropped from the stats registry.

A ctx::test_utils module has been added to aid testing.

Fixes #819
2018-04-30 16:11:12 -07:00
Oliver Gould 6512b02780
proxy: Group metrics by label (#879)
Previously, we maintained a map of labels for each metric. Because the same keys are used
in multiple scopes, this causes redundant hashing & map lookup when updating metrics.

With this change, there is now only one map per unique label scope and all of the metrics
for each scope are stored in the value. This makes metrics inserting faster and prepares
for eviction of idle metrics.

The Metric type has been split into Metric, which now only holds metric metadata and is
responsible for printing a given metric, and Scopes which holds groupings of metrics by
label.

The metrics! macro is provided to make it easy to define Metric instances statically.
2018-04-30 15:33:09 -07:00
Risha Mars a2321a1802
Remove repetitive code in dashboard - web UI (#871)
* Rename PodOwnerList.jsx to ResourceList.jsx

* Delete PodsList.jsx
2018-04-30 13:37:09 -07:00
Oliver Gould 7247cb8ddc
proxy: Make each metric type responsible for formatting (#878)
In order to set up for a refactor that removes the `Metric` type, the
`FmtMetric` trait--implemented by `Counter`, `Gauge`, and
`Histogram`--is introduced to push prometheus formatting down into each
type.

With this change, the `Histogram` type now relies on `Counter` (and its
metric formatting) more heavily.
2018-04-30 13:00:21 -07:00
Oliver Gould f73fdc1eae
Move `metrics::Serve` into its own module (#877)
With this change, metrics/mod.rs now contains only metrics types.
2018-04-30 10:52:08 -07:00
Risha Mars 84d131b6b0
Add a filter to the namespace column in the web UI (#866)
Enables filtering by one or more namespaces. Table updates are prevented
when the filter menu is open, as table updates will rerender the menu,
unselecting anything the user has selected but not confirmed.
2018-04-30 10:18:47 -07:00
Eliza Weisman 1fb15a55b3
proxy: Remove Arcs from metric labels (#873)
This PR removes the `Arc`s from the various label types in the proxy's 
`metrics` modules. This should make the write side of the metrics code
much more efficient (and makes the code much simpler! :D).

This change was particularly easy to implement for the TCP `TransportLabels`
and `TransportCloseLabels`, which consisted of only `struct`s and `enum`s,
and could easily be changed to derive `Copy`. 

For protocol-level `RequestLabels`, the request's authority was a `String`,
which still needs to be reference-counted, as the overhead of cloning `String`s
is almost certainly worse than that added by ref-counting. However, rather than
adding an additional `Arc<str>`, I changed `RequestLabels` to store the 
authority as a `http::uri::Authority`, which is backed by a `ByteStr` and thus
already ref-counted. Now, when constructing `RequestLabels`, we just take 
another reference to the `Authority` already stored in the request context.
Since `Authority` implements `fmt::Display` already, formatting the labels 
still works.

`ResponseLabels` already store the `DstLabels` string in an `Arc`, so no
additional changes there were necessary. By removing the outer `Arc` around
`ResponseLabels`, we now only have to ref-count the portion of the label type
that would actually be inefficient to clone.

@olix0r ran the benchmarks from #874 against this branch, and it seems to be 
a small but noticeable improvement:
```
test record_many_dsts        ... bench:     151,076 ns/iter (+/- 182,151)
test record_one_conn_request ... bench:       1,599 ns/iter (+/- 209)
test record_response_end     ... bench:         676 ns/iter (+/- 144)
```

before:
```
test record_many_dsts        ... bench:     158,403 ns/iter (+/- 130,241)
test record_one_conn_request ... bench:       1,823 ns/iter (+/- 1,408)
test record_response_end     ... bench:         547 ns/iter (+/- 70)
```

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-04-29 13:50:27 -07:00
Andrew Seigner 5a5c6d14ab
Update Grafana to 5.1.0, handle missing data (#876)
Conduit 0.4.1 contained some rough edges in the Grafana deployment.

This PR include the following:
- bump Grafana to 5.1.0
- fix deployment and rc graphs when no data present
- fix some text sections overlapping due to scrolling

Fixes #705

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-29 22:24:22 +02:00
Oliver Gould 935ccd5f78
proxy: Implement benchmarks for telemetry recording (#874)
Before changing the telemetry implementation, we should have a means to
understand the impacts of such changes.

To run, you must use a nightly toolchain:
```
rustup run nightly cargo bench -p conduit-proxy -- record
```
2018-04-29 12:55:26 -07:00
Oliver Gould 64a3bb09b2
Rename `metrics::Aggregate` to `metrics::Record` (#875)
Move `Record` into its own file.
2018-04-28 15:35:29 -07:00
Oliver Gould 9c70310406
proxy: Implement a From converter for latency::Ms (#872)
This reduces callsite verbosity for latency measurements at the expense of a fn-level
generic.
2018-04-27 17:55:44 -07:00
Eliza Weisman 9156a80a22
proxy: Add histogram unit tests (#870)
This PR adds the unit tests for the proxy metrics module's Histogram 
implementation that I wrote in #775 to @olix0r's Histogram implementation
added in #868. The tests weren't too difficult to adapt for the new code,
and everything seems to work correctly!

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-04-27 17:02:13 -07:00
Oliver Gould 1cadef9516
proxy: Make `Histogram` generic over its value (#868)
In order to support histograms measured in, for instance, microseconds,
Histogram should blind store integers without being aware of the unit.
In order to accomplish this, we make `Histogram` generic over a `V:
Into<u64>`, such that all values added to the histogram must be of type
`V`.

In doing this, we also make the histogram buckets configurable, though
we maintain the same defaults used for latency values.

The `Histogram` type has been moved to a new module, and the `Bucket`
and `Bounds` helper types have been introduced to help make histogram
logic clearer and latency-agnostic.
2018-04-27 14:43:09 -07:00
Sean McArthur 5fb4695358
proxy: wrap connections in Transport sensor before peeking (#851)
In case there are any errors while peeking the connection to do protocol
detection, the sensors will now be in place to detect them. Besides just
errors, this will also allow reporting about connections that are
accepted, but then immediately closed.

Additionally:

- add write_buf implementation for Transport sensor, can help
  performance for http1/http2
- add better logs for tcp connections errors
- add printlns for when tests fail

Signed-off-by: Sean McArthur <sean@seanmonstar.com>
2018-04-27 14:18:23 -07:00
Oliver Gould 80fdb97f88
Move `Counter` and `Gauge` to their own modules (#861)
In preparation for a larger metrics refactor, this change splits the
Counter and Gauge types into their own modules.

Furthermore, this makes the minor change to these types: incr() and
decr() no longer return `self`. We were not actually ever using the
returned self references, and I find the unit return type to more
obviously indicate the side-effecty-ness of these calls. #smpfy
2018-04-26 16:49:37 -07:00
Risha Mars f85dab8937
Upgrade antd to 3.4.3 (#855) 2018-04-26 16:35:42 -07:00
Risha Mars 4661aaf30d
Add a namespace column to the metrics tables (#854)
* Add a namespace column to the metrics tables, support long resource names

* Add a test for GrafanaLink

* Change the PodList.jsx component to not use the ListPods api
2018-04-26 16:34:59 -07:00
Andrew Seigner 97bf4fcdf2
Release Notes for 0.4.1 release. (#839)
Also update Getting Started and Debugging docs to reflect changes in
`Tap` and `Stat`.

Fixes #838

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-26 13:32:41 -07:00
Eliza Weisman 309ef14a11
Add transport-level metrics to proxy-metrics.md (#742)
This PR adds a description of the transport level (TCP) metrics 
that the Conduit proxy now exposes as of 6ad0960.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-04-26 09:44:52 -07:00
Risha Mars cd8c53ca2d
Remove the deployment search bar from the sidebar (#853)
We removed individual Deployment pages a while ago, but left the autocomplete search bar in. Clicking on searches goes to a 404 because we don't have /deployment any more.

This will be revisited in the future with direct links to grafana dashboards to all the 
resources we support.
2018-04-25 16:28:54 -07:00
Eliza Weisman d55e334a42
Add TCP stats to deployment dashboards (#824)
This PR adds the TCP metrics added in #785 and #790 to the Grafana deployment dashboards. I've added three new charts in the "Inbound Traffic" and "Outbound Traffic" headings:
+ "TCP Connection Failures": plots the number of failed TCP connections over time
+ "TCP Connections Open": shows the number of accepted and opened connections currently open
+ "TCP Connection Duration": a heatmap of connection durations over time

I'm planning on adding similar graphs to other dashboards as well in subsequent PRs.
2018-04-25 16:26:43 -07:00
Risha Mars fbacdd8a05
Add a Replication Controllers page in the Web UI (#850)
* Add a Replication Controllers page in the Web UI


@siggy pointed out that we don't need to use the PodsList api any more, since the new stats endpoint (#671) includes meshedPodCount and totalPodCount, which is all we need to determine whether the deployment/rc has been added to the mesh (which is what we were using ListPods to determine).

This PR modifies deployments to not use the pods api any more, and adds a Replication Controllers page. This page is quite similar to the Deployments page in logic, so I've made a PodOwnersList component to share the code.

I haven't added Replication Controllers to the Service Mesh page yet, because that page does require a list of component pods. Also, we don't need the calls to Prometheus for the Service Mesh page, so I don't want to use the existing stat apis for it. I figure that is a large enough change for a separate PR.
2018-04-25 15:01:06 -07:00
Brian Smith c5d2dab8bd
Remove special support for ExternalName services (#764)
After this was implemented we found that ExternalName services are
represented in DNS as CNAMEs, which means that the proxy's DNS
fallback logic can be used instead of doing DNS in the control
plane. Besides simplifying the controller, this will also increase
fidelity with the proxied pods' DNS configuration (improve
transparency).

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-25 11:53:33 -10:00
Oliver Gould d4d0e579c2
Introduce the `peer` label to transport metrics (#848)
Previously, the proxy exposed separate _accept_ and _connect_ metrics
for some metric types, but not for all. This leads to confusing
aggregations, particularly for read and write taotals.

This change primarily introduces the `peer` prometheus label (with
possible values _src_ or _dst_) to indicate which side of the proxy the
metric reflects.

Additionally, the `received_bytes` and `sent_bytes` metrics have been
renamed as `tcp_read_bytes_total` and `tcp_write_bytes_total`,
resectively. This more naturally fits into existing idioms.  Stream
classification is not applied to these metrics, as we plan to increment
them throughout stream lifetime and not only on close.

The `tcp_connections_open` metric has also been renamed to
`tcp_open_connections` to reflect Prometheus idioms.

Finally, `msg1` and `msg2` have been constified in telemetry test
fixtures so that tests are somewhat easier to read.
2018-04-25 14:06:33 -07:00
Brian Smith 0c23ad416b
Proxy: Use trust-dns-resolver for DNS. (#834)
trust-dns-resolver is a more complete implementation. In particular,
it supports CNAMES correctly, which is needed for PR #764. It also
supports /etc/hosts, which will help with issue #62.

Use the 0.8.2 pre-release since it hasn't been released yet. It was
created at our request.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-25 10:04:49 -10:00
Andrew Seigner dce31b888f
Deprecate Tap, rename TapByResource to Tap (#844)
The `conduit tap` command is now deprecated.

Replace `conduit tap` with `connduit tapByResource`. Rename tapByResource
to tap. The underlying protobuf for tap remains, the tap gRPC endpoint now
returns Unimplemented.

Fixes #804

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-25 12:24:46 -07:00
Eliza Weisman 3a3079d8ed
Fix assertions in metrics_compression test (#847)
Fixes #846 

The proxy `metrics_compression` test contained an assertion that a compressed scrape contained the `request_duration_ms_count` metric. This was chosen completely arbitrarily, and was only intended as an assertion that metrics were updated between compressed scrapes. Unfortunately, that metric was removed in d9112abc93, so when #665 merged to master, this test broke. CI didn't catch this since we don't build merges for PRs --- we should probably (re)enable this in Travis?

This PR fixes the test to assert on a metric that wasn't removed. Sorry for the s!
2018-04-25 11:02:52 -07:00
Risha Mars aca09813fd
Add a Replication Controller grafana dashboard (#843)
* Add a Replication Controller grafana dashboard, very similar to the Deployment one
2018-04-25 10:57:41 -07:00
Carl Lerche 298321a6c1
Bump proxy h2 dependency to v0.1.6. (#845)
This release includes a number of bug fixes related to HTTP/2.0 stream
management.

Signed-off-by: Carl Lerche <me@carllerche.com>
2018-04-25 08:40:42 -07:00