* Remove the telemetry service
The telemetry service is no longer needed, now that Prometheus scrapes
metrics directly from proxies and the public-api talks directly to
Prometheus. In this branch I'm removing the service itself as well as
all of the telemetry protobuf, and updating the conduit install command
to no longer install the service. I'm also removing the old version of
the stat command, which required the telemetry service, and renaming the
statsummary command to stat.
* Fix time window tests
* Remove deprecated controller scrape config
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
The Prometheus client sometimes returns NaN if a calculation is invalid,
such as histogram_quantile when no requests have occurred.
Add IsNaN check in the public-api and set output to zero.
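A minimal sketch of the guard in Go; `sanitizeValue` is a hypothetical helper, not the actual public-api function:
```go
package main

import (
	"fmt"
	"math"
)

// sanitizeValue maps NaN to zero so the public-api never reports NaN to
// clients; histogram_quantile over an empty window is one way NaN shows up.
func sanitizeValue(v float64) float64 {
	if math.IsNaN(v) {
		return 0
	}
	return v
}

func main() {
	p99 := math.NaN()               // what histogram_quantile yields with no samples
	fmt.Println(sanitizeValue(p99)) // prints 0
}
```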
Fixes #747
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The tests for label metadata updates from the control plane are flaky on CI. This is likely due to the CI containers not having enough cores to execute the test proxy thread, the test proxy's controller client thread, the mock controller thread, and the test server thread simultaneously --- see #751 for more information.
For now, I'm ignoring these on CI. Eventually, I'd like to change the mock controller code in test support so that we can trigger it to send a second metadata update only after the request has finished.
I think this issue also makes merging #738 a higher priority, so that we can still have some tests running on CI that exercise some part of the label update behaviour.
PR #654 adds pod-based metric labels to the Destination API responses for cluster-local services.
This PR modifies the proxy to actually add these labels to reported Prometheus metrics for outbound requests to local services.
It enhances the proxy's `control::discovery` module to track these labels and add a `LabelRequest` middleware to the service stack built in `Bind` for labeled services. Requests transiting `LabelRequest` are given an `Extension` which contains these labels, which are then added to events produced by the `Sensors` for these requests. When these events are aggregated to Prometheus metrics, the labels are added.
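As a rough illustration of the pattern (the real change lives in the Rust proxy's `Bind`/`Sensors` stack), here is a hedged Go sketch that attaches destination labels to a request-scoped context and reads them back when recording a metric; all names in it are hypothetical:
```go
package main

import (
	"context"
	"fmt"
)

type labelKey struct{}

// withDstLabels plays the role of the LabelRequest middleware: it attaches
// destination labels to the request-scoped context.
func withDstLabels(ctx context.Context, labels map[string]string) context.Context {
	return context.WithValue(ctx, labelKey{}, labels)
}

// recordResponse plays the role of the telemetry aggregation: any labels found
// on the context are added to the reported metric.
func recordResponse(ctx context.Context, status int) {
	labels, _ := ctx.Value(labelKey{}).(map[string]string)
	fmt.Printf("response_total{status=%q", fmt.Sprint(status))
	for k, v := range labels {
		fmt.Printf(",%s=%q", k, v)
	}
	fmt.Println("} 1")
}

func main() {
	ctx := withDstLabels(context.Background(), map[string]string{"deployment": "web"})
	recordResponse(ctx, 200)
}
```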
I've also added some tests in `test/telemetry.rs` ensuring that these metrics are added correctly when the Destination service provides labels.
Closes #660
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
The ListPods endpoint's logic resides in the telemetry service, which is
going away.
Move ListPods logic into public-api, use new k8s informer APIs.
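A hedged sketch of the informer-backed approach, using client-go's shared informer factory; the wiring below is illustrative, not the actual public-api code:
```go
package main

import (
	"fmt"
	"log"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Shared informers keep a local cache of pods in sync with the API server,
	// so a ListPods-style endpoint can be served from memory rather than a
	// live API call per request.
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	podLister := factory.Core().V1().Pods().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	pods, err := podLister.List(labels.Everything())
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range pods {
		fmt.Println(p.Namespace, p.Name, p.Status.Phase)
	}
}
```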
Fixes #694
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The new StatSummary endpoint was only providing request volume and
success rate information.
Add support for retrieving latency stats via StatSummary. Also make
all prometheus calls in parallel, and implement kubernetes test
fixtures.
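For illustration, a minimal Go sketch of fanning out several Prometheus queries concurrently and collecting the results; the `queryFn` abstraction and the PromQL strings (including the `response_latency_ms_bucket` metric name) are stand-ins, not the public-api's actual client code:
```go
package main

import (
	"fmt"
	"sync"
)

type queryFn func(promQL string) (float64, error)

// queryAll fires all queries concurrently and collects results by index.
func queryAll(run queryFn, queries []string) ([]float64, error) {
	results := make([]float64, len(queries))
	errs := make([]error, len(queries))
	var wg sync.WaitGroup
	for i, q := range queries {
		wg.Add(1)
		go func(i int, q string) {
			defer wg.Done()
			results[i], errs[i] = run(q)
		}(i, q)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return results, nil
}

func main() {
	fake := func(q string) (float64, error) { return float64(len(q)), nil }
	latencies, _ := queryAll(fake, []string{
		`histogram_quantile(0.5, sum(irate(response_latency_ms_bucket[1m])) by (le))`,
		`histogram_quantile(0.99, sum(irate(response_latency_ms_bucket[1m])) by (le))`,
	})
	fmt.Println(latencies)
}
```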
Fixes #681
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
- The listener is immediately closed on receipt of a shutdown signal.
- All in-progress server connections are now counted, and the process will
not shut down until the connection count has dropped to zero (see the
sketch below).
- In the case of HTTP1, idle connections are closed. In the case of HTTP2,
the HTTP2 graceful shutdown steps of sending GOAWAY frames are followed.
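The change itself is in the Rust proxy, but the connection-counting idea can be sketched in a few lines of Go, purely as an illustration of the shutdown sequence described above:
```go
package main

import (
	"io"
	"log"
	"net"
	"os"
	"os/signal"
	"sync"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}

	var active sync.WaitGroup
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, os.Interrupt)

	go func() {
		<-stop
		ln.Close() // close the listener as soon as the signal arrives
	}()

	for {
		conn, err := ln.Accept()
		if err != nil {
			break // listener closed by the shutdown signal
		}
		active.Add(1)
		go func(c net.Conn) {
			defer active.Done()
			defer c.Close()
			io.Copy(io.Discard, c) // placeholder for real connection handling
		}(conn)
	}

	active.Wait() // don't exit until the connection count drops to zero
	log.Println("all connections drained; shutting down")
}
```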
* Switch public API to use cached k8s resources
* Move shared informer code to separate goroutine
* Fix spelling issue
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
Previously, when the proxy could tell by parsing that the request-target
was not in the cluster, it would not override the destination. That is,
load balancing was disabled for such destinations.
With this change, the proxy will do L7 load balancing for all HTTP
services as long as the request-target has a DNS name.
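A tiny Go sketch of the decision point as described above: if the request-target's host parses as a literal IP address there is nothing to resolve; otherwise it is treated as a DNS name and goes through service discovery and L7 load balancing. The function name is hypothetical:
```go
package main

import (
	"fmt"
	"net"
)

// shouldLoadBalance reports whether a request-target host should be resolved
// via service discovery and load balanced, i.e. whether it is a DNS name
// rather than a literal IP address.
func shouldLoadBalance(host string) bool {
	return net.ParseIP(host) == nil
}

func main() {
	fmt.Println(shouldLoadBalance("books.default.svc.cluster.local")) // true
	fmt.Println(shouldLoadBalance("example.com"))                     // true
	fmt.Println(shouldLoadBalance("10.0.0.1"))                        // false
}
```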
Signed-off-by: Brian Smith <brian@briansmith.org>
The success rate calculation relies on the `classification` label, but
was incorrectly specifying `fail` rather than `failure`.
Fix public api to specify `failure`. Also re-org public api tests for
easier Kubernetes and Prometheus mocking.
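For reference, an illustrative success-rate calculation of the kind the fix affects, expressed as PromQL strings in Go; the metric name and exact query shape in the public-api may differ:
```go
package main

import "fmt"

func main() {
	// Success rate = successful responses / all responses, keyed off the
	// proxy's `classification` label, which is "success" or "failure".
	success := `sum(irate(responses_total{classification="success"}[1m])) by (deployment)`
	total := `sum(irate(responses_total[1m])) by (deployment)`
	fmt.Printf("(%s) / (%s)\n", success, total)
}
```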
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
No change in behavior is intended here.
Split poll_destination() into two parts, one that operates locally
on the DestinationSet, and the other that operates on data that isn't
wholly local to the DestinationSet. This makes the code easier to
understand. This is being done in preparation for adding DNS fallback
polling to poll_destination().
Signed-off-by: Brian Smith <brian@briansmith.org>
Proxy: Refactor DNS name parsing and normalization
Only the destination service needs normalized names (and even then,
that's just temporary). The rest of the code needs the name as it was
given, except case-normalized (lowercased). Because DNS fallback isn't
implemented in service discovery yet, Outbound still has a temporary
workaround using FullyQualifiedName to keep things working; that will
be removed once DNS fallback is implemented in service discovery.
The StatSummary logic was implemented as a method on http_server.
Move the StatSummary logic into grpc_server, for consistency with the
other endpoints.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The Grafana dashboards key off of deployment, but had no awareness of
namespaces, causing incorrect metrics aggregation and display.
This change makes the Grafana dashboards key off of namespaces, and also
modifies the Grafana links in the Conduit dashboard to link to
namespace+deployment.
Fixes #704
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* CLI: change conduit namespace shorthand flag to -c
All of the conduit CLI subcommands accept a --conduit-namespace flag,
indicating the namespace where conduit is running. Some of the
subcommands also provide a --namespace flag, indicating the kubernetes
namespace where a user's application code is running. To prevent
confusion, I'm changing the shorthand flag for the conduit namespace to
-c, and using the -n shorthand when referring to user namespaces.
As part of this change I've also standardized the capitalization of all
of our command line flags, removed the -r shorthand for the install
--registry flag, and made the global --kubeconfig and --api-addr flags
apply to all subcommands.
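A hypothetical cobra sketch of the resulting flag layout; the flag names match the description above, but the defaults, help strings, and wiring are illustrative rather than the exact conduit source:
```go
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func main() {
	var conduitNamespace, userNamespace, kubeconfig, apiAddr string

	root := &cobra.Command{Use: "conduit"}
	// Global flags apply to all subcommands.
	root.PersistentFlags().StringVarP(&conduitNamespace, "conduit-namespace", "c", "conduit", "Namespace in which conduit is installed")
	root.PersistentFlags().StringVar(&kubeconfig, "kubeconfig", "", "Path to the kubeconfig file to use for requests")
	root.PersistentFlags().StringVar(&apiAddr, "api-addr", "", "Address of the control plane API")

	stat := &cobra.Command{
		Use: "stat",
		Run: func(cmd *cobra.Command, args []string) {
			fmt.Println(conduitNamespace, userNamespace)
		},
	}
	// -n refers to the user's application namespace, not conduit's.
	stat.Flags().StringVarP(&userNamespace, "namespace", "n", "default", "Namespace of the specified resource")
	root.AddCommand(stat)

	root.Execute()
}
```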
* Switch flag descriptions from lowercase to Capital
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
This PR changes the proxy's `control::Cache` module from a set to a key-value map.
This change is made in order to use the values in the map to store metadata from the Destination API, but allow evictions and insertions to be based only on the `SocketAddr` of the destination entry. This will make code in PR #661 much simpler, by removing the need to wrap `SocketAddr`s in the cache in a `Labeled` struct for storing metadata, and the need for custom `Borrow` implementations on that type.
Furthermore, I've changed from using a standard library `HashSet`/`HashMap` as the underlying collection to using `IndexMap`, as we suspect that this will result in performance improvements.
Currently, as `master` has no additional metadata to associate with cache entries, the type of the values in the map is `()`. When #661 merges, the values will actually contain metadata.
If we suspect that there are many other use-cases for `control::Cache` where it will be treated as a set rather than a map, we may want to provide a separate set of impls for `Cache<T, ()>` (like `std::HashSet`) to make the API more ergonomic in this case.
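Conceptually (setting aside the Rust specifics of `IndexMap` and the `Borrow` impls), the change is from a set of addresses to a map keyed by address with metadata as the value; a set is then just the map with an empty value type. A rough Go analogue, with hypothetical names:
```go
package main

import "fmt"

// Metadata is whatever the Destination API attaches to an endpoint; on master
// this is still empty, mirroring the `()` value type mentioned above.
type Metadata struct{}

// Cache maps a destination address to its metadata. Insertions and evictions
// are keyed only by the address, so callers never need to wrap the key.
type Cache map[string]Metadata

func (c Cache) Insert(addr string, meta Metadata) { c[addr] = meta }
func (c Cache) Evict(addr string)                 { delete(c, addr) }

func main() {
	c := Cache{}
	c.Insert("10.1.2.3:8080", Metadata{})
	c.Evict("10.1.2.3:8080")
	fmt.Println(len(c)) // 0
}
```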
The public-api in the docker-compose environment is not configured to
talk to Prometheus or Kubernetes, which is now required with the new
telemetry pipeline.
Modify the public-api config in docker-compose to connect to k8s and
prom.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The new statsummary command accepted friendly k8s names, which worked
for k8s queries, but Prometheus requires a specific key.
Modify the statsummary query to map friendly k8s names to canonical k8s
names when constructing the query. Then during the query, map the
canonical k8s name to a specific Prometheus label.
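A hedged sketch of the two-step mapping; the alias and label tables below are illustrative, not the canonical set used by the CLI and public-api:
```go
package main

import "fmt"

// friendlyToCanonical maps the short names users type to canonical k8s kinds.
var friendlyToCanonical = map[string]string{
	"deploy":      "deployment",
	"deployments": "deployment",
	"po":          "pod",
	"pods":        "pod",
}

// canonicalToPromLabel maps a canonical k8s kind to the Prometheus label the
// query must use.
var canonicalToPromLabel = map[string]string{
	"deployment": "deployment",
	"pod":        "pod",
}

func promLabelFor(friendly string) (string, error) {
	canonical, ok := friendlyToCanonical[friendly]
	if !ok {
		canonical = friendly // already canonical, or unknown
	}
	label, ok := canonicalToPromLabel[canonical]
	if !ok {
		return "", fmt.Errorf("unsupported resource type: %s", friendly)
	}
	return label, nil
}

func main() {
	label, err := promLabelFor("deploy")
	fmt.Println(label, err) // "deployment" <nil>
}
```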
Fixes #695
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* Link to Grafana from Conduit Dashboard
Previously the only way to access the Grafana dashboards was via direct
link, provided by the `conduit dashboard` command.
Add Grafana links throughout the Conduit Dashboard, next to all
Deployment objects. This change also modifies the behavior of the
ConduitLink helper, to enable linking to other deployments proxied by
the `conduit dashboard` command.
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* review feedback
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* review feedback, fix console, remove absolute
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
This PR adds the pretty-printing for durations I added in #676 to the panic message from the `assert_eventually!` macro added in #669.
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Start implementing new conduit stat summary endpoint.
Changes the public-api to call prometheus directly instead of the
telemetry service. Wired through to `api/stat` on the web server,
as well as `conduit statsummary` on the CLI. Works for deployments only.
Current implementation just retrieves requests and mesh/total pod count
(so latency stats are always 0).
Uses the API defined in #663.
Example queries that the stat endpoint will eventually satisfy are listed in #627.
This branch includes commits from @klingerf
* run ./bin/dep ensure
* run ./bin/update-go-deps-shas
The Destination service used slightly different labels than the
telemetry pipeline expected, specifically, prefixed with `k8s_*`.
Make all Prometheus labels consistent by dropping `k8s_*`. Also rename
`pod_name` to `pod` for consistency with `deployment`, etc. Also update
and reorganize `proxy-metrics.md` to reflect new labelling.
Fixes #655
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The master CI job executes a `docker-pull master` prior to building, to
bootstrap the Docker image cache. This command fails if the PR being
merged to master introduces a new Docker image, for example:
https://travis-ci.org/runconduit/conduit/jobs/362841328
This changes the master CI job to handle a `docker-pull master` failure
gracefully.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Using a vanilla Grafana Docker image as part of `conduit install`
avoided maintaining a conduit-specific Grafana Docker image, but made
packaging dashboard json files cumbersome.
Roll our own Grafana Docker image that includes conduit-specific
dashboard json files. This significantly decreases the `conduit install`
output size, and enables dashboard integration in the docker-compose
environment.
Fixes #567
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
This branch adds simple pretty-printing for durations in log timeout messages. If a duration is >= 1 second, it is printed in seconds with a fractional part; if it is less than 1 second, it is printed in milliseconds. This simple formatting may not be sufficient as a rule for all cases, but should be enough for printing our relatively small timeouts.
Log messages now look something like this:
```
ERROR 2018-04-04T20:05:49Z: conduit_proxy: turning operation timed out after 100 ms into 500
```
Previously, they looked like this:
```
ERROR 2018-04-04T20:07:26Z: conduit_proxy: turning operation timed out after Duration { secs: 0, nanos: 100000000 } into 500
```
I made this change partially because I wanted to make the panics from the `eventually!` macro added in #669 more readable.
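The same formatting rule, sketched in Go for illustration (the real implementation is Rust code in the proxy):
```go
package main

import (
	"fmt"
	"time"
)

// prettyDuration prints seconds with a fractional part at >= 1s, and whole
// milliseconds below that.
func prettyDuration(d time.Duration) string {
	if d >= time.Second {
		return fmt.Sprintf("%g s", d.Seconds())
	}
	return fmt.Sprintf("%d ms", d.Milliseconds())
}

func main() {
	fmt.Println(prettyDuration(100 * time.Millisecond))  // "100 ms"
	fmt.Println(prettyDuration(1500 * time.Millisecond)) // "1.5 s"
}
```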
The proxy's `telemetry/metrics/prometheus.rs` file was starting to get long and hard to find one's way around in. I split the Prometheus labels code out into a separate submodule and made `RequestLabels` and `ResponseLabels` public. This seems like a reasonable division of the code, and the resultant files are much easier to read.
The proxy's control::discovery module is becoming a bit dense in terms
of what it implements.
In order to make this code more understandable, and to be able to use a
similar caching strategy in other parts of the controller, the
`control::cache` module now holds discovery's cache implementation.
This module is only visible within the `control` module, and it now
exposes two new public methods: `values()` and
`set_reset_on_next_modification()`.
* Define a new telemetry Stat API
Proposal definition for a new Stat API, for the purposes of satisfying the queries proposed in #627.
StatSummary will replace Stat; once StatSummary is implemented, the original Stat will be deleted.
* Fix pod status and count display in control plane dashboard section:
- The control plane would show terminated and stale deployments in the UI, which is confusing and might suggest errors.
- This filters out terminated and failed component deploys from the UI.
- Note that pending deploys will still be counted and represented with a greyed-out status dot.
- Fixes #606
Signed-off-by: Franziska von der Goltz <franziska@vdgoltz.eu>
* Extracted logic from destination server
* Make tests follow style used elsewhere in the code
* Extract single interface for resolvers
* Add tests for k8s and ipv4 resolvers
* Fix small usability issues
* Update dep
* Act on feedback
* Add pod-based metric_labels to destinations response
* Add documentation on running control plane to BUILD.md
Signed-off-by: Phil Calcado <phil@buoyant.io>
* Fix mock controller in proxy tests (#656)
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
* Address review feedback
* Rename files in the destination package
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
The Grafana dashboards were displaying all proxy-enabled pods, including
conduit controller pods. In the old telemetry pipeline filtering these
out required knowledge of the controller's namespace, which the
dashboards are agnostic to.
This change leverages the new `conduit_io_control_plane_component`
prometheus label to filter out proxy-enabled controller components.
Part of #420
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
- Adds environment variables to configure a set of ports; when an incoming
connection's SO_ORIGINAL_DST port matches one of these, protocol detection
is disabled for that connection and a TCP proxy is started immediately
(see the sketch below).
- Adds a default list of well-known ports: SMTP and MySQL.
Closes #339
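A hedged Go sketch of the idea: parse a comma-separated port list from an environment variable (the variable name below is made up) into a set, and consult it before protocol detection. SMTP (25) and MySQL (3306) stand in for the default list:
```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parsePorts turns a string like "25,3306" into a set of ports.
func parsePorts(s string) (map[uint16]struct{}, error) {
	ports := map[uint16]struct{}{}
	for _, field := range strings.Split(s, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		p, err := strconv.ParseUint(field, 10, 16)
		if err != nil {
			return nil, fmt.Errorf("invalid port %q: %v", field, err)
		}
		ports[uint16(p)] = struct{}{}
	}
	return ports, nil
}

func main() {
	// EXAMPLE_TCP_PORTS_DISABLE_PROTOCOL_DETECTION is a hypothetical name.
	raw := os.Getenv("EXAMPLE_TCP_PORTS_DISABLE_PROTOCOL_DETECTION")
	if raw == "" {
		raw = "25,3306" // default: SMTP and MySQL
	}
	skip, err := parsePorts(raw)
	if err != nil {
		panic(err)
	}

	origDstPort := uint16(3306) // would come from SO_ORIGINAL_DST
	if _, ok := skip[origDstPort]; ok {
		fmt.Println("skipping protocol detection; proxying raw TCP")
	} else {
		fmt.Println("running protocol detection")
	}
}
```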