514 Commits

Author SHA1 Message Date
Kevin Lingerfelt fb15fe7c1a
Remove the telemetry service (#757)
* Remove the telemetry service

The telemetry service is no longer needed, now that prometheus scrapes
metrics directly from proxies, and the public-api talks directly to
prometheus. In this branch I'm removing the service itself as well as
all of the telemetry protobuf, and updating the conduit install command
to no longer install the service. I'm also removing the old version of
the stat command, which required the telemetry service, and renaming the
statsummary command to stat.

* Fix time window tests

* Remove deprecated controller scrape config

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-13 11:21:29 -07:00
Oliver Gould efdfc93b50
Stop pushing telemetry reports from the proxy (#616)
Now that the controller does not depend on pushed telemetry reports, the
proxy need not depend on the telemetry API or maintain legacy sampling
logic.
2018-04-12 17:39:29 -07:00
Kevin Lingerfelt 37434d048a
Update web component to use new stat api (#753)
* Update web component to use new stat api
* Address review feedback
* Add external link icon

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-12 17:35:03 -07:00
Andrew Seigner e9b209829d
Handle NaN metrics (#750)
The Prometheus client sometimes returns NaN if a calculation is invalid,
such as histogram_quantile when no requests have occurred.

Add IsNaN check in the public-api and set output to zero.

Fixes #747

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-12 15:21:00 -07:00
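A minimal sketch of the NaN guard described in #750. The controller itself is Go code; this illustration uses Rust, and the function name `sanitize_metric` is hypothetical rather than taken from the repository.

```rust
/// Map a NaN sample (for example, a quantile computed over zero requests)
/// to 0.0 so callers never see NaN. Hypothetical helper, not Conduit's code.
fn sanitize_metric(value: f64) -> f64 {
    if value.is_nan() {
        0.0
    } else {
        value
    }
}
```

For instance, `sanitize_metric(f64::NAN)` yields `0.0`, while finite values pass through unchanged.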
Eliza Weisman b6180d8bfe
Add unit tests for Labeled middleware (#738)
I've added unit tests for the `Labeled` middleware used to add Destination labels in the proxy, as @olix0r requested in https://github.com/runconduit/conduit/pull/661#discussion_r179897783. 

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-04-12 15:10:01 -07:00
Eliza Weisman 61d15a6c3e
Ignore flaky telemetry tests on CI (#752)
The tests for label metadata updates from the control plane are flaky on CI. This is likely due to the CI containers not having enough cores to execute the test proxy thread, the test proxy's controller client thread, the mock controller thread, and the test server thread simultaneously --- see #751 for more information. 

For now, I'm ignoring these on CI. Eventually, I'd like to change the mock controller code in test support so that we can trigger it to send a second metadata update only after the request has finished.

I think this issue also makes merging #738 a higher priority, so that we can still have some tests running on CI that exercise some part of the label update behaviour.
2018-04-12 14:59:17 -07:00
Eliza Weisman b07b554d2b
Add labels from service discovery to proxy metrics reports (#661)
PR #654 adds pod-based metric labels to the Destination API responses for cluster-local services. 

This PR modifies the proxy to actually add these labels to reported Prometheus metrics for outbound requests to local services. 

It enhances the proxy's `control::discovery` module to track these labels and add a `LabelRequest` middleware to the service stack built in `Bind` for labeled services. Requests transiting `LabelRequest` are given an `Extension` which contains these labels, which are then added to events produced by the `Sensors` for these requests. When these events are aggregated to Prometheus metrics, the labels are added.

I've also added some tests in `test/telemetry.rs` ensuring that these metrics are added correctly when the Destination service provides labels.

Closes #660

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-04-12 12:54:38 -07:00
Andrew Seigner 624b87f743
Implement ListPods in public-api (#743)
The ListPods endpoint's logic resides in the telemetry service, which is
going away.

Move ListPods logic into public-api, use new k8s informer APIs.

Fixes #694

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-11 17:53:57 -07:00
Kevin Lingerfelt 47caf1ca07
Add --all-namespaces flag to CLI statsummary command (#745)
* Add --all-namespaces flag to CLI statsummary command

* Fix statsummary output formatting

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-11 16:40:25 -07:00
Andrew Seigner 259fdcd134
Add latency stats in new stat summary endpoint (#737)
The new StatSummary endpoint was only providing request volume and
success rate information.

Add support for retrieving latency stats via StatSummary. Also make
all prometheus calls in parallel, and implement kubernetes test
fixtures.

Fixes #681

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-11 11:58:32 -07:00
Kevin Lingerfelt e1e1b6b599
Controller: add more destination labels, fix service label (#731)
* Add more destination labels, fix service label

* Update owner labels to match proxy metrics docs

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-11 10:44:52 -07:00
Sean McArthur 7f54b5253d
proxy: fix flaky tcp graceful shutdown test (#735)
2018-04-10 19:47:00 -07:00
Sean McArthur 02c6887020
proxy: improve graceful shutdown process (#684)
- The listener is immediately closed on receipt of a shutdown signal.
- All in-progress server connections are now counted, and the process will
  not shutdown until the connection count has dropped to zero.
- In the case of HTTP1, idle connections are closed. In the case of HTTP2,
  the proxy follows the HTTP2 graceful shutdown steps of sending various
  GOAWAYs.
2018-04-10 14:15:37 -07:00
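The connection counting described in #684 could be sketched roughly as follows. This is an illustration under assumptions, not the proxy's actual code: the names `ConnTracker` and `ConnGuard` are hypothetical, and the real implementation is asynchronous, whereas this sketch only shows the counting idea with a drop guard.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Shared count of in-progress server connections (hypothetical type).
#[derive(Clone, Default)]
struct ConnTracker {
    active: Arc<AtomicUsize>,
}

/// Held for the lifetime of one connection; dropping it decrements the count.
struct ConnGuard {
    active: Arc<AtomicUsize>,
}

impl ConnTracker {
    /// Register a new connection and return its guard.
    fn track(&self) -> ConnGuard {
        self.active.fetch_add(1, Ordering::SeqCst);
        ConnGuard { active: Arc::clone(&self.active) }
    }

    /// True once every tracked connection has finished, i.e. the point at
    /// which a shutdown that was waiting on connections may complete.
    fn is_drained(&self) -> bool {
        self.active.load(Ordering::SeqCst) == 0
    }
}

impl Drop for ConnGuard {
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::SeqCst);
    }
}
```

After a shutdown signal, a process using this pattern would stop accepting new connections and then wait until `is_drained()` returns true before exiting.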
Kevin Lingerfelt 91c359e612
Switch public API to use cached k8s resources (#724)
* Switch public API to use cached k8s resources
* Move shared informer code to separate goroutine
* Fix spelling issue

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-10 11:39:31 -07:00
Brian Smith 7319cf648f
Proxy: Do L7 load balancing for all external HTTP services. (#726)
Previously, when the proxy could tell by parsing that the request-target
was not in the cluster, it would not override the destination. That is,
load balancing would be disabled for such destinations.

With this change, the proxy will do L7 load balancing for all HTTP
services as long as the request-target has a DNS name.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-10 08:07:16 -10:00
Andrew Seigner 3a341abe9a
Fix success rate calculation in public api (#723)
The success rate calculation relies on the `classification` label, but
was incorrectly specifying `fail` rather than `failure`.

Fix public api to specify `failure`. Also re-org public api tests for
easier Kubernetes and Prometheus mocking.

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-10 11:04:04 -07:00
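To illustrate why the label value matters in #723, here is a small sketch that computes a success rate from request counts keyed by `classification`. It is written in Rust for illustration only (the public-api is Go), the function name is hypothetical, and the `success` label value is an assumption; only `failure` and the incorrect `fail` appear in the commit message.

```rust
use std::collections::HashMap;

/// Compute success / (success + failure) from counts keyed by the
/// `classification` label value. Returns 0.0 when there were no requests.
/// The "success" key is an assumed label value.
fn success_rate(counts: &HashMap<&str, u64>) -> f64 {
    let success = counts.get("success").copied().unwrap_or(0);
    let failure = counts.get("failure").copied().unwrap_or(0);
    let total = success + failure;
    if total == 0 {
        0.0
    } else {
        success as f64 / total as f64
    }
}
```

If failures are looked up or recorded under the wrong key (`fail` instead of `failure`), they are silently left out of the total and the rate is overstated.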
Brian Smith bc16034fd6
Proxy: Fall back to using DNS when Destination service can't find service. (#692)
Fixes #155.
2018-04-07 18:26:06 -10:00
Brian Smith c25e9c371b
Refactor poll_destination() in service discovery. (#725)
No change in behavior is intended here.

Split poll_destination() into two parts, one that operates locally
on the DestinationSet, and the other that operates on data that isn't
wholly local to the DestinationSet. This makes the code easier to
understand. This is being done in preparation for adding DNS fallback
polling to poll_destination().

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-07 18:15:19 -10:00
Brian Smith 7d3b715c4d
Proxy: Move DNS name normalization to service discovery (#722)
Only the destination service needs normalized names (and even then,
that's just temporary). The rest of the code needs the name as it was
given, except case-normalized (lowercased). Because DNS fallback isn't
implemented in service discovery yet, Outbound still uses a temporary
workaround using FullyQualifiedName to keep things working; that will
be removed once DNS fallback is implemented in service discovery.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-06 15:04:09 -10:00
Andrew Seigner 716b392231
Move StatSummary logic into grpc server (#717)
The StatSummary logic was implemented as a method on http_server.

Move the StatSummary logic into grpc_server, for consistency with the
other endpoints.

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-06 16:46:15 -07:00
Andrew Seigner b6bcdcc059
Namespace-aware Grafana dashboards (#716)
The Grafana dashboards key off of deployment, but had no awareness of
namespaces, causing incorrect metrics aggregation and display.

This change makes the Grafana dashboards key off of namespaces, and also
modifies the Grafana links in the Conduit dashboard to link to
namespace+deployment.

Fixes #704
Part of #420

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-06 15:37:53 -07:00
Kevin Lingerfelt baa4d10c2f
CLI: change conduit namespace shorthand flag to -c (#714)
* CLI: change conduit namespace shorthand flag to -c

All of the conduit CLI subcommands accept a --conduit-namespace flag,
indicating the namespace where conduit is running. Some of the
subcommands also provide a --namespace flag, indicating the kubernetes
namespace where a user's application code is running. To prevent
confusion, I'm changing the shorthand flag for the conduit namespace to
-c, and using the -n shorthand when referring to user namespaces.

As part of this change I've also standardized the capitalization of all
of our command line flags, removed the -r shorthand for the install
--registry flag, and made the global --kubeconfig and --api-addr flags
apply to all subcommands.

* Switch flag descriptions from lowercase to Capital

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-06 14:47:31 -07:00
Eliza Weisman 8bc05472ed
Make `control::Cache` key-value in order to store discovery metadata (#688)
This PR changes the proxy's `control::Cache` module from a set to a key-value map. 

This change is made in order to use the values in the map to store metadata from the Destination API, but allow evictions and insertions to be based only on the `SocketAddr` of the destination entry. This will make code in PR #661 much simpler, by removing the need to wrap `SocketAddr`s in the cache in a `Labeled` struct for storing metadata, and the need for custom `Borrow` implementations on that type.

Furthermore, I've changed from using a standard library `HashSet`/`HashMap` as the underlying collection to using `IndexMap`, as we suspect that this will result in performance improvements. 

Currently, as `master` has no additional metadata to associate with cache entries, the type of the values in the map is `()`. When #661 merges, the values will actually contain metadata.

If we suspect that there are many other use-cases for `control::Cache` where it will be treated as a set rather than a map, we may want to provide a separate set of impls for `Cache<T, ()>` (like `std::HashSet`) to make the API more ergonomic in this case.
2018-04-06 13:54:16 -07:00
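The set-to-map change in #688 can be pictured with the sketch below. It is a simplified stand-in, not the proxy's `control::Cache`: it uses the standard library `HashMap` instead of `IndexMap` to stay dependency-free, and the method names are assumptions.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

/// Hypothetical map-style cache: keyed by address only, with arbitrary
/// metadata `V` stored as the value. `Cache<()>` behaves like a set.
struct Cache<V> {
    entries: HashMap<SocketAddr, V>,
}

impl<V> Cache<V> {
    fn new() -> Self {
        Cache { entries: HashMap::new() }
    }

    /// Insert or update an entry; returns the previous metadata, if any.
    fn insert(&mut self, addr: SocketAddr, meta: V) -> Option<V> {
        self.entries.insert(addr, meta)
    }

    /// Evict by address alone; the metadata takes no part in the lookup.
    fn remove(&mut self, addr: &SocketAddr) -> Option<V> {
        self.entries.remove(addr)
    }

    fn len(&self) -> usize {
        self.entries.len()
    }
}
```

Because the key is just the `SocketAddr`, no wrapper type or custom `Borrow` implementation is needed to look up or evict an entry that carries metadata.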
Andrew Seigner 1cf1a0cb13
Fix public-api config in docker-compose (#712)
The public-api in the docker-compose environment is not configured to
talk to Prometheus or Kubernetes, which is now required with the new
telemetry pipeline.

Modify the public-api config in docker-compose to connect to k8s and
prom.

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-06 12:59:34 -07:00
Andrew Seigner 50c323c617
Use canonical k8s names, fix prom labels (#702)
The new statsummary command accepted friendly k8s names, which worked
for k8s queries, but Prometheus requires a specific key.

Modify the statsummary query to map friendly k8s names to canonical k8s
names when constructing the query. Then during the query, map the
canonical k8s name to a specific Prometheus label.

Fixes #695

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-06 12:34:54 -07:00
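The two-step mapping in #702 might look something like the sketch below. The alias tables are assumptions chosen for illustration (the commit does not list the accepted names), the function names are hypothetical, and the code is Rust rather than the Go used by the controller.

```rust
/// Step 1: map a user-friendly resource name to a canonical k8s name.
/// The accepted aliases here are illustrative assumptions.
fn canonical_name(friendly: &str) -> Option<&'static str> {
    match friendly {
        "deploy" | "deployment" | "deployments" => Some("deployments"),
        "po" | "pod" | "pods" => Some("pods"),
        _ => None,
    }
}

/// Step 2: map the canonical k8s name to the Prometheus label used in queries.
fn prometheus_label(canonical: &str) -> Option<&'static str> {
    match canonical {
        "deployments" => Some("deployment"),
        "pods" => Some("pod"),
        _ => None,
    }
}
```

Chaining the two, `canonical_name("deploy").and_then(prometheus_label)` gives `Some("deployment")`, while an unknown name gives `None` at the first step.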
Brian Smith 15037d9618
Proxy: Improve DNS name parsing (#708)
Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-06 08:45:18 -10:00
Andrew Seigner 836168884e
Link to Grafana from Conduit Dashboard (#678)
* Link to Grafana from Conduit Dashboard

Previously the only way to access the Grafana dashboards was via direct
link, provided by the `conduit dashboard` command.

Add Grafana links throughout the Conduit Dashboard, next to all
Deployment objects. This change also modifies the behavior of the
ConduitLink helper, to enable linking to other deployments proxied by
the `conduit dashboard` command.

Part of #420

Signed-off-by: Andrew Seigner <siggy@buoyant.io>

* review feedback

Signed-off-by: Andrew Seigner <siggy@buoyant.io>

* review feedback, fix console, remove absolute

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-06 10:56:42 -07:00
Eliza Weisman 605e68dff6
Add pretty durations to panics from `assert_eventually!` (#677)
This PR adds the pretty-printing for durations I added in #676 to the panic message from the `assert_eventually!` macro added in #669. 

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-04-06 10:49:17 -07:00
Brian Smith c31f4ba993
Remove unused conversions for Destination. (#701)
These have not been used for a while; they are dead code.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-06 07:35:35 -10:00
Brian Smith 7bc4ffd0a4
Revert "Proxy: Refactor DNS name parsing and normalization (#673)" (#700)
This reverts commit 311ef410a8.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-05 16:49:32 -10:00
Brian Smith 1b223723bc
Revert "Proxy: Refactor poll_destination() in service discovery. (#674)" (#698)
This reverts commit 4fb9877b89.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-05 16:36:01 -10:00
Risha Mars 2f5b5ea5f2
Start implementing conduit stat summary endpoint (#671)
Start implementing new conduit stat summary endpoint. 
Changes the public-api to call prometheus directly instead of the
telemetry service. Wired through to `api/stat` on the web server,
as well as `conduit statsummary` on the CLI. Works for deployments only.

Current implementation just retrieves requests and mesh/total pod count 
(so latency stats are always 0). 

Uses API defined in #663
Example queries the stat endpoint will eventually satisfy in #627

This branch includes commits from @klingerf 

* run ./bin/dep ensure
* run ./bin/update-go-deps-shas
2018-04-05 17:05:06 -07:00
Brian Smith 4fb9877b89
Proxy: Refactor poll_destination() in service discovery. (#674)
No change in behavior is intended here.

Split poll_destination() into two parts, one that operates locally
on the DestinationSet, and the other that operates on data that isn't
wholly local to the DestinationSet. This makes the code easier to
understand. This is being done in preparation for adding DNS fallback
polling to poll_destination().

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-05 13:05:11 -10:00
Brian Smith 311ef410a8
Proxy: Refactor DNS name parsing and normalization (#673)
Proxy: Refactor DNS name parsing and normalization

Only the destination service needs normalized names (and even then,
that's just temporary). The rest of the code needs the name as it was
given, except case-normalized (lowercased). Because DNS fallback isn't
implemented in service discovery yet, Outbound still uses a temporary
workaround using FullyQualifiedName to keep things working; that will
be removed once DNS fallback is implemented in service discovery.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-05 12:32:12 -10:00
Andrew Seigner 28d5007cdf
Harmonize Prometheus label usage (#690)
The Destination service used slightly different labels than the
telemetry pipeline expected, specifically, prefixed with `k8s_*`.

Make all Prometheus labels consistent by dropping `k8s_*`. Also rename
`pod_name` to `pod` for consistency with `deployment`, etc. Also update
and reorganize `proxy-metrics.md` to reflect new labelling.

Fixes #655

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-05 15:09:06 -07:00
Andrew Seigner 65be27c3c0
Fix ci job failing when new Docker image added (#691)
The master ci job executes a `docker-pull master` prior to building, to
bootstrap the Docker image cache. This command fails if the PR being
merged to master introduces a new Docker image, for example:
https://travis-ci.org/runconduit/conduit/jobs/362841328

This changes the master ci job to handle a `docker-pull master` failure
gracefully.

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-05 15:01:54 -07:00
Andrew Seigner 9508e11b45
Build conduit-specific Grafana Docker image (#679)
Using a vanilla Grafana Docker image as part of `conduit install`
avoided maintaining a conduit-specific Grafana Docker image, but made
packaging dashboard json files cumbersome.

Roll our own Grafana Docker image, that includes conduit-specific
dashboard json files. This significantly decreases the `conduit install`
output size, and enables dashboard integration in the docker-compose
environment.

Fixes #567
Part of #420

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-05 14:20:05 -07:00
Eliza Weisman 18fa42ebd0
Pretty-print durations in log messages (#676)
This branch adds simple pretty-printing to duration in log timeout messages. If the duration is >= 1 second, it's printed in seconds with a fractional part. If the duration is less than 1 second, it is printed in milliseconds. This simple formatting may not be sufficient as a formatting rule for all cases, but should be sufficient for printing our relatively small timeouts.

Log messages now look something like this:
```
ERROR 2018-04-04T20:05:49Z: conduit_proxy: turning operation timed out after 100 ms into 500
```

Previously, they looked like this:
```
ERROR 2018-04-04T20:07:26Z: conduit_proxy: turning operation timed out after Duration { secs: 0, nanos: 100000000 } into 500
```

I made this change partially because I wanted to make the panics from the `eventually!` macro added in #669 more readable.
2018-04-05 13:47:19 -07:00
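The formatting rule described in #676 can be sketched as a small `Display` wrapper. This is an illustration, not the proxy's implementation: the type name `HumanDuration` is hypothetical, and the ` s` suffix for the seconds case is an assumption, since the commit message only shows the milliseconds form.

```rust
use std::fmt;
use std::time::Duration;

/// Hypothetical wrapper that prints a duration in a human-friendly form.
struct HumanDuration(Duration);

impl fmt::Display for HumanDuration {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let secs = self.0.as_secs();
        // Sub-second part expressed in (possibly fractional) milliseconds.
        let subsec_ms = f64::from(self.0.subsec_nanos()) / 1_000_000.0;
        if secs == 0 {
            // Less than one second: print milliseconds.
            write!(f, "{} ms", subsec_ms)
        } else {
            // One second or more: print seconds with a fractional part.
            write!(f, "{} s", secs as f64 + subsec_ms / 1000.0)
        }
    }
}
```

With this sketch, a 100 ms timeout formats as `100 ms`, matching the log line shown in the commit message, and a 1500 ms timeout formats as `1.5 s`.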
Eliza Weisman 49bf01b0da
Add `assert_eventually!` macro to help de-flake telemetry tests (#669)
Closes #615.

Based on @olix0r's suggestion in https://github.com/runconduit/conduit/issues/613#issuecomment-376024744, this PR adds an `assert_eventually!` macro to retry an assertion a set number of times, waiting for 15 ms between retries. This is loosely based on ScalaTest's [eventually](http://doc.scalatest.org/1.8/org/scalatest/concurrent/Eventually.html).

I've rewritten the flaky telemetry tests to use the `assert_eventually!` macro, to compensate for delays in the served metrics being updated between client requests and metrics scrapes.
2018-04-05 11:23:34 -07:00
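A rough sketch of a retrying assertion in the spirit of #669 follows. It is not the macro from the repository: the `retries:` argument syntax and the panic message are assumptions, and only the 15 ms wait between attempts comes from the commit message.

```rust
/// Retry a boolean condition up to `retries` additional times, sleeping
/// 15 ms between attempts, and panic if it never becomes true.
/// Illustrative only; the argument syntax is an assumption.
macro_rules! assert_eventually {
    ($cond:expr, retries: $retries:expr) => {{
        let mut ok = false;
        for _ in 0..=$retries {
            if $cond {
                ok = true;
                break;
            }
            ::std::thread::sleep(::std::time::Duration::from_millis(15));
        }
        assert!(
            ok,
            "condition `{}` was not satisfied after {} retries",
            stringify!($cond),
            $retries
        );
    }};
}
```

The condition is re-evaluated on every attempt, so it can poll state that changes between a client request and a later metrics scrape.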
Eliza Weisman 6b370b4466
Split labels out of `prometheus.rs` into its own file (#680)
The proxy's `telemetry/metrics/prometheus.rs` file was starting to get long and hard to find one's way around in. I split the prometheus labels code out into a separate submodule and made `RequestLabels` and `ResponseLabels` public. This seems like a reasonable division of the code, and the resultant files are much easier to read.
2018-04-04 15:49:17 -07:00
Oliver Gould 2dc964c583
Move control::discovery::Cache into its own module (#672)
The proxy's control::discovery module is becoming a bit dense in terms
of what it implements.

In order to make this code more understandable, and to be able to use a
similar caching strategy in other parts of the controller, the
`control::cache` module now holds discovery's cache implementation.

This module is only visible within the `control` module, and it now
exposes two new public methods: `values()` and
`set_reset_on_next_modification()`.
2018-04-04 14:27:04 -07:00
Eliza Weisman 01628bfa43
Fix missing comma in gRPC status code labels (#670)
Fixes the issue caught by @olix0r in https://github.com/runconduit/conduit/pull/661#issuecomment-378431155
2018-04-04 10:41:21 -07:00
Risha Mars d1a39ea6bf
Define a new telemetry Stat API (#663)
* Define a new telemetry Stat API

Proposal definition for a new Stat API, for the purposes of satisfying the queries proposed in #627.
StatSummary will replace Stat once implemented and the original Stat deleted.
2018-04-03 14:45:58 -07:00
Brian Smith 06bf78ccdf
Use Rust 1.25 to build Docker images. (#667)
Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-03 08:22:29 -10:00
Franziska von der Goltz eff848a8cf
fix pod status and count in control plane dashboard (#659)
* fix pod status and count display in control plane dashboard section:
- the control plane would show terminated and stale deployments in the UI, which is confusing and might suggest errors
- this filters out terminated and failed component deploys from the UI
- note that pending deploys will still be counted and represented with a greyed out status dot
- Fixes: #606

Signed-off-by: Franziska von der Goltz <franziska@vdgoltz.eu>
2018-04-03 10:39:35 -07:00
Phil Calçado 19001f8d38
Add pod-based metric_labels to destinations response (#429) (#654)
* Extracted logic from destination server
* Make tests follow style used elsewhere in the code
* Extract single interface for resolvers
* Add tests for k8s and ipv4 resolvers
* Fix small usability issues
* Update dep
* Act on feedback
* Add pod-based metric_labels to destinations response
* Add documentation on running control plane to BUILD.md

Signed-off-by: Phil Calcado <phil@buoyant.io>

* Fix mock controller in proxy tests (#656)

Signed-off-by: Eliza Weisman <eliza@buoyant.io>

* Address review feedback
* Rename files in the destination package

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-02 18:36:57 -07:00
Andrew Seigner ee042e1943
Rename grafana viz to top-line (#666)
The primary Grafana dashboard was named 'viz' from a prototype.

Rename 'viz' to 'Top Line'.

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-02 18:10:35 -07:00
Brian Smith df9ead9c36
Use Go 1.10.1 to build all Go code. (#650)
Go 1.10.1 is a security release.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-02 14:58:30 -10:00
Andrew Seigner bf721466e3
Filter out conduit controller pods from Grafana (#657)
The Grafana dashboards were displaying all proxy-enabled pods, including
conduit controller pods. In the old telemetry pipeline filtering these
out required knowledge of the controller's namespace, which the
dashboards are agnostic to.

This change leverages the new `conduit_io_control_plane_component`
prometheus label to filter out proxy-enabled controller components.

Part of #420

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-02 17:56:12 -07:00
Sean McArthur 47f9665b8e
proxy: allow disabling protocol detection on specific ports (#648)
- Adds environment variables to configure a set of ports that, when an
  incoming connection has an SO_ORIGINAL_DST with a port matching, will
  disable protocol detection for that connection and immediately start a
  TCP proxy.
- Adds a default list of well known ports: SMTP and MySQL.

Closes #339
2018-04-02 14:24:36 -07:00
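The port check described in #648 might be sketched as below. Everything named here is hypothetical: the function names, the comma-separated input format, and the specific default port numbers (25 for SMTP, 3306 for MySQL) are assumptions, and the environment variable names used by the proxy are not shown in the commit message, so the sketch takes the raw string as a parameter.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;
use std::num::ParseIntError;

/// Assumed defaults for "SMTP and MySQL": ports 25 and 3306.
fn default_ports() -> HashSet<u16> {
    [25u16, 3306].iter().cloned().collect()
}

/// Parse a comma-separated port list such as "25, 3306".
/// The input format is an assumption for this sketch.
fn parse_ports(raw: &str) -> Result<HashSet<u16>, ParseIntError> {
    raw.split(',').map(|p| p.trim().parse::<u16>()).collect()
}

/// Decide whether to skip protocol detection and proxy raw TCP, based on
/// the port of the connection's original destination address.
fn skip_protocol_detection(orig_dst: &SocketAddr, ports: &HashSet<u16>) -> bool {
    ports.contains(&orig_dst.port())
}
```

A connection whose original destination is `10.1.2.3:3306` would bypass detection with the assumed defaults, while one to port 8080 would still go through protocol detection.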