379 Commits

Author SHA1 Message Date
Andrew Seigner 9e8cce0838
Destination service returns "Running" pod labels (#781)
When the Destination sees an IP address, it looks up Pods by that IP,
and associates Pod label data to it. If the lookup by IP returned more
than one Pod, it simply picked the first one. This is not correct,
specifically in cases where one pod is in a Running state, and others
are not.

Modify the Destination service to only return label data for Pods in the
Running state.
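
A minimal sketch of the selection rule described above, assuming a simplified `Pod` type and a hypothetical `runningPodLabels` helper. Neither is taken from the Conduit source, which works with Kubernetes API objects:

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for a Kubernetes Pod.
type Pod struct {
	Name   string
	IP     string
	Phase  string // e.g. "Pending", "Running", "Succeeded"
	Labels map[string]string
}

// runningPodLabels returns the labels of the first Pod that has the given IP
// and is in the Running phase. The boolean is false when no such Pod exists.
func runningPodLabels(pods []Pod, ip string) (map[string]string, bool) {
	for _, p := range pods {
		if p.IP == ip && p.Phase == "Running" {
			return p.Labels, true
		}
	}
	return nil, false
}

func main() {
	pods := []Pod{
		{Name: "web-old", IP: "10.0.0.5", Phase: "Succeeded", Labels: map[string]string{"deployment": "old"}},
		{Name: "web-new", IP: "10.0.0.5", Phase: "Running", Labels: map[string]string{"deployment": "web"}},
	}
	labels, ok := runningPodLabels(pods, "10.0.0.5")
	fmt.Println(labels, ok)
}
```

Picking the first match regardless of phase would have returned the labels of `web-old` here, which is the behavior the change removes.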

Fixes #773

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-17 14:42:54 -07:00
Eliza Weisman 6121afb6f2
Factor out reused test fixtures from telemetry tests (#782)
This is a fairly minor refactor to the proxy telemetry tests. b07b554d2b added a `Fixture` in the Destination service labeling tests added in #661 to reduce the repetition of copied and pasted code in those tests. I've refactored most of the other telemetry tests to also use the test fixture. Significantly less code is copied and pasted now.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-04-17 14:15:56 -07:00
Sean McArthur 3cd16e8e40
proxy: clean up some logs and a few warnings in proxy tests (#780)
Signed-off-by: Sean McArthur <sean@seanmonstar.com>
2018-04-17 12:53:20 -07:00
Eliza Weisman cf2d7b1d7d
proxy: move metrics::prometheus module to root metrics module (#763)
The proxy `telemetry::metrics::prometheus` module was initially added in order to give the Prometheus metrics export code a separate namespace from the controller push metrics. Since the controller push metrics code was removed from the proxy in #616, we no longer need a separate module for the Prometheus-specific metrics code. Therefore, I've moved that code to the root `telemetry::metrics` module, which should hopefully make the proxy source tree structure a little simpler.

This is a fairly trivial refactor.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-04-17 11:19:27 -07:00
Andrew Seigner 727521f914
Permit arbitrary time windows in public-api (#774)
The public-api previously only permitted 4 hard-coded time windows:
10s, 1m, 10m, 1h. This was primarily a relic of the recently removed
telemetry system.

Modify the public-api to validate the time string, but allow for any
window size, which is then passed through to Prometheus.

Fixes #686

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-16 17:37:17 -07:00
Eliza Weisman 64f4dfe07f
Refactor control::Cache and add tests (#733)
Closes #713. This is a follow-up from #688.

This PR makes a number of refactorings to the proxy's `control::Cache` module and removes all but one of the `clone` calls. 

The `CacheChange` enum now contains the changed key and a reference to the changed value when applicable. This simplifies `on_change` functions, which no longer have to take both a tuple of `(K, V)` and a `CacheChange` and can now simply destructure the `CacheChange`. Since the changed value is passed as a reference, the `on_change` function can now decide whether or not it should be cloned. This means that we can remove a majority of the clones previously present here.

I've also rewritten `Cache::update_union` so that it no longer clones values (twice if the cache was invalidated). There's still one `clone` call in `Cache::update_intersection`, but it seems like it will be fairly tricky to remove. However, I've moved the `V: Clone` bound to that function specifically. `Cache::clear` and `Cache::update_union` have been changed so that they no longer call `Cache::update_intersection` internally, so they don't need a `V: Clone` bound.

In addition, I've added some unit tests that test that `on_change` is called with the correct `CacheChange`s when key/value pairs are modified.
2018-04-16 16:42:55 -07:00
Brian Smith 621f3c2e56
Revert "Avoid `cargo fetch --locked` in proxy/Dockerfile. (#593)" (#767)
This reverts commit d38a2acff8.

The change being reverted here did reduce downloads that occur when
Cargo.lock is updated. However, it had the unwanted side-effect of
invalidating at least part of the Cargo download cache when other
files, including in particular files under proto/, were modified.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-16 13:27:49 -10:00
Kevin Lingerfelt 11a4359e9a
Misc cleanup following the telemetry rewrite (#771)
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-16 15:51:07 -07:00
Oliver Gould cd9f755262
v0.4.0 (#772)
Conduit 0.4.0 overhauls Conduit's telemetry system and improves service discovery
reliability.

* Web UI
  * **New** automatically-configured Grafana dashboards for all Deployments.
* Command-line interface
  * `conduit stat` has been completely rewritten to accept arguments like `kubectl get`.
    The `--to` and `--from` filters can be used to filter traffic by destination and
    source, respectively.  `conduit stat` currently can operate on `Namespace` and
    `Deployment` Kubernetes resources. More resource types will be added in the next
    release!
* Proxy (data plane)
  * **New** Prometheus-formatted metrics are now exposed on `:4191/metrics`, including
    rich destination labeling for outbound HTTP requests. The proxy no longer pushes
    metrics to the control plane.
  * The proxy now handles `SIGINT` or `SIGTERM`, gracefully draining requests until all
    are complete or `SIGQUIT` is received.
  * SMTP and MySQL (ports 25 and 3306) are now treated as opaque TCP by default. You
    should no longer have to specify `--skip-outbound-ports` to communicate with such
    services.
  * When the proxy reconnected to the controller, it could continue to send requests to
    old endpoints. Now, when the proxy reconnects to the controller, it properly removes
    invalid endpoints.
  * A bug impacting some HTTP/2 reset scenarios has been fixed.
* Service Discovery
  * Previously, the proxy failed to resolve some domain names that could be misinterpreted
    as a Kubernetes Service name. This has been fixed by extending the _Destination_ API
    with a negative acknowledgement response.
* Control Plane
  * The _Telemetry_ service and associated APIs have been removed.
* Documentation
  * Updated Roadmap
  * Added prometheus metrics guide
2018-04-16 14:42:15 -07:00
Oliver Gould 800cefdb77
Skip the proxy on the metrics port (#770)
When prometheus queries the proxy for data, these requests are reported
as inbound traffic to the pod. This leads to misleading stats when a pod
otherwise receives little/no traffic.

In order to prevent these requests being proxied, the metrics port is
now added to the default inbound skip-ports list (as is already the case
for the tap server).

Fixes #769
2018-04-16 11:54:58 -07:00
Andrew Seigner c9cdd838dc
Standardize and polish Grafana for 0.4.0 release (#766)
The top-line, deployments, and health Grafana dashboards had
inconsistent layouts and data.

This change standardizes our Grafana dashboards. Every row is composed
of Success Rate, Request Rate, and Latency.

Part of #420.

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-13 18:01:44 -07:00
Brian Smith 0c37067554
Reduce proto dependencies in proxy/Dockerfile (#765)
Reduce the dependencies on files under proto/ to eliminate Docker
detecting false dependencies that trigger rebuilds.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-13 14:49:55 -10:00
Andrew Seigner 77fb6d3709
Add namespace as a resource type in public-api (#760)
* Add namespace as a resource type in public-api

The cli and public-api only supported deployments as a resource type.

This change adds support for namespace as a resource type in the cli and
public-api. This change also includes:
- cli statsummary now prints `-`'s when objects are not in the mesh
- cli statsummary prints `No resources found.` when applicable
- removed `out-` from cli statsummary flags, and analogous proto changes
- switched public-api to use native prometheus label types
- misc error handling and logging fixes

Part of #627

Signed-off-by: Andrew Seigner <siggy@buoyant.io>

* Refactor filter and groupby label formulation

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>

* Rename stat_summary.go to stat.go in cli

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>

* Update rbac privileges for namespace stats

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-13 16:53:01 -07:00
Oliver Gould cc44db054f
Remove NODE_NAME and POD_NAME env usage (#758)
* proxy: Remove pod_name and node_name

* cli: Do not inject POD_NAME and NODE_NAME env vars
2018-04-13 13:09:51 -07:00
Andrew Seigner 21886760c6
Use apps/v1beta2 for Kubernetes 1.8 compatibility (#762)
Conduit was relying on apps/v1 for the Deployment and ReplicaSet APIs.
apps/v1 is not available on Kubernetes 1.8. This prevented the
public-api from starting.

Switch Conduit to use apps/v1beta2. Also increase the Kubernetes API
cache sync timeout from 10 to 60 seconds, as it was taking 11 seconds on
a test cluster.

Fixes #761

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-13 12:08:16 -07:00
Kevin Lingerfelt fb15fe7c1a
Remove the telemetry service (#757)
* Remove the telemetry service

The telemetry service is no longer needed, now that prometheus scrapes
metrics directly from proxies, and the public-api talks directly to
prometheus. In this branch I'm removing the service itself as well as
all of the telemetry protobuf, and updating the conduit install command
to no longer install the service. I'm also removing the old version of
the stat command, which required the telemetry service, and renaming the
statsummary command to stat.

* Fix time window tests

* Remove deprecated controller scrape config

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-13 11:21:29 -07:00
Oliver Gould efdfc93b50
Stop pushing telemetry reports from the proxy (#616)
Now that the controller does not depend on pushed telemetry reports, the
proxy need not depend on the telemetry API or maintain legacy sampling
logic.
2018-04-12 17:39:29 -07:00
Kevin Lingerfelt 37434d048a
Update web component to use new stat api (#753)
* Update web component to use new stat api
* Address review feedback
* Add external link icon

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-12 17:35:03 -07:00
Andrew Seigner e9b209829d
Handle NaN metrics (#750)
The Prometheus client sometimes returns NaN if a calculation is invalid,
such as histogram_quantile when no requests have occurred.

Add IsNaN check in the public-api and set output to zero.
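
The check can be sketched as follows. The `sanitize` name is hypothetical; only the use of an `IsNaN` test and a zero fallback comes from the description above:

```go
package main

import (
	"fmt"
	"math"
)

// sanitize replaces NaN with zero and leaves every other value untouched.
func sanitize(v float64) float64 {
	if math.IsNaN(v) {
		return 0
	}
	return v
}

func main() {
	fmt.Println(sanitize(math.NaN()), sanitize(0.95))
}
```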

Fixes #747

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-12 15:21:00 -07:00
Eliza Weisman b6180d8bfe
Add unit tests for Labeled middleware (#738)
I've added unit tests for the `Labeled` middleware used to add Destination labels in the proxy, as @olix0r requested in https://github.com/runconduit/conduit/pull/661#discussion_r179897783. 

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-04-12 15:10:01 -07:00
Eliza Weisman 61d15a6c3e
Ignore flaky telemetry tests on CI (#752)
The tests for label metadata updates from the control plane are flaky on CI. This is likely due to the CI containers not having enough cores to execute the test proxy thread, the test proxy's controller client thread, the mock controller thread, and the test server thread simultaneously --- see #751 for more information. 

For now, I'm ignoring these on CI. Eventually, I'd like to change the mock controller code in test support so that we can trigger it to send a second metadata update only after the request has finished.

I think this issue also makes merging #738 a higher priority, so that we can still have some tests running on CI that exercise some part of the label update behaviour.
2018-04-12 14:59:17 -07:00
Eliza Weisman b07b554d2b
Add labels from service discovery to proxy metrics reports (#661)
PR #654 adds pod-based metric labels to the Destination API responses for cluster-local services. 

This PR modifies the proxy to actually add these labels to reported Prometheus metrics for outbound requests to local services. 

It enhances the proxy's `control::discovery` module to track these labels and add a `LabelRequest` middleware to the service stack built in `Bind` for labeled services. Requests transiting `LabelRequest` are given an `Extension` which contains these labels, which are then added to events produced by the `Sensors` for these requests. When these events are aggregated to Prometheus metrics, the labels are added.

I've also added some tests in `test/telemetry.rs` ensuring that these metrics are added correctly when the Destination service provides labels.

Closes #660

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-04-12 12:54:38 -07:00
Andrew Seigner 624b87f743
Implement ListPods in public-api (#743)
The ListPods endpoint's logic resides in the telemetry service, which is
going away.

Move ListPods logic into public-api, use new k8s informer APIs.

Fixes #694

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-11 17:53:57 -07:00
Kevin Lingerfelt 47caf1ca07
Add --all-namespaces flag to CLI statsummary command (#745)
* Add --all-namespaces flag to CLI statsummary command

* Fix statsummary output formatting

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-11 16:40:25 -07:00
Andrew Seigner 259fdcd134
Add latency stats in new stat summary endpoint (#737)
The new StatSummary endpoint was only providing request volume and
success rate information.

Add support for retrieving latency stats via StatSummary. Also make
all prometheus calls in parallel, and implement kubernetes test
fixtures.
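
A rough sketch of issuing several queries in parallel and collecting their results. The `queryFunc` type, the `queryAll` helper, and the placeholder query strings are assumptions made for illustration; they are not the public-api's actual types or PromQL:

```go
package main

import (
	"fmt"
	"sync"
)

// queryFunc is a hypothetical stand-in for a call to the Prometheus API.
type queryFunc func(query string) (float64, error)

// queryAll runs every query in its own goroutine and collects the results
// by name. It returns the first error observed, if any.
func queryAll(run queryFunc, queries map[string]string) (map[string]float64, error) {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
	)
	results := make(map[string]float64, len(queries))
	for name, q := range queries {
		wg.Add(1)
		go func(name, q string) {
			defer wg.Done()
			v, err := run(q)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			results[name] = v
		}(name, q)
	}
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return results, nil
}

func main() {
	// A fake query function so that the sketch runs without a Prometheus server.
	fake := func(q string) (float64, error) { return float64(len(q)), nil }
	res, err := queryAll(fake, map[string]string{"p50": "placeholder-a", "p99": "placeholder-bb"})
	fmt.Println(res, err)
}
```

The mutex guards both the result map and the error, since all goroutines write to them concurrently.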

Fixes #681

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-11 11:58:32 -07:00
Kevin Lingerfelt e1e1b6b599
Controller: add more destination labels, fix service label (#731)
* Add more destination labels, fix service label

* Update owner labels to match proxy metrics docs

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-11 10:44:52 -07:00
Sean McArthur 7f54b5253d
proxy: fix flaky tcp graceful shutdown test (#735)
2018-04-10 19:47:00 -07:00
Sean McArthur 02c6887020
proxy: improve graceful shutdown process (#684)
- The listener is immediately closed on receipt of a shutdown signal.
- All in-progress server connections are now counted, and the process will
  not shut down until the connection count has dropped to zero.
- In the case of HTTP1, idle connections are closed. In the case of HTTP2,
  the HTTP2 graceful shutdown steps of sending various GOAWAYs are
  followed.
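
The first two points above can be illustrated with a small connection counter. The proxy itself is written in Rust; the Go sketch below, including the `connTracker` type and its method names, is a hypothetical illustration of the counting idea only and does not cover the HTTP1 or HTTP2 specifics:

```go
package main

import (
	"fmt"
	"sync"
)

// connTracker counts in-progress connections. After Shutdown is called it
// refuses new connections and signals once the count has dropped to zero.
type connTracker struct {
	mu      sync.Mutex
	active  int
	closing bool
	done    bool
	drained chan struct{}
}

func newConnTracker() *connTracker {
	return &connTracker{drained: make(chan struct{})}
}

// Accept registers a new connection. It returns false once shutdown has begun.
func (t *connTracker) Accept() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.closing {
		return false
	}
	t.active++
	return true
}

// Release marks one connection as finished.
func (t *connTracker) Release() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.active--
	t.signalIfDrained()
}

// Shutdown stops accepting connections and returns a channel that is closed
// when no in-progress connections remain.
func (t *connTracker) Shutdown() <-chan struct{} {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.closing = true
	t.signalIfDrained()
	return t.drained
}

// signalIfDrained must be called with the mutex held.
func (t *connTracker) signalIfDrained() {
	if t.closing && t.active == 0 && !t.done {
		t.done = true
		close(t.drained)
	}
}

func main() {
	tr := newConnTracker()
	tr.Accept()
	drained := tr.Shutdown()
	fmt.Println("accepting after shutdown:", tr.Accept())
	tr.Release()
	<-drained
	fmt.Println("all connections drained")
}
```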
2018-04-10 14:15:37 -07:00
Kevin Lingerfelt 91c359e612
Switch public API to use cached k8s resources (#724)
* Switch public API to use cached k8s resources
* Move shared informer code to separate goroutine
* Fix spelling issue

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-10 11:39:31 -07:00
Brian Smith 7319cf648f
Proxy: Do L7 load balancing for all external HTTP services. (#726)
Previously, when the proxy could tell by parsing that the request-target
is not in the cluster, it would not override the destination. That is,
load balancing would be disabled for such destinations.

With this change, the proxy will do L7 load balancing for all HTTP
services as long as the request-target has a DNS name.
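
As a rough illustration of the "has a DNS name" condition, the sketch below treats any host that is not an IP literal as a name. The `hasDNSName` helper is hypothetical, it does not validate DNS syntax, and the proxy (written in Rust) performs its own parsing:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// hasDNSName reports whether the host part of an authority ("host" or
// "host:port") is a name rather than an IPv4 or IPv6 literal.
func hasDNSName(authority string) bool {
	host := authority
	if h, _, err := net.SplitHostPort(authority); err == nil {
		host = h
	}
	// Bracketed IPv6 literals without a port keep their brackets; strip them.
	host = strings.Trim(host, "[]")
	return host != "" && net.ParseIP(host) == nil
}

func main() {
	for _, a := range []string{"example.com:443", "example.com", "10.1.2.3:8080", "[::1]:80"} {
		fmt.Println(a, hasDNSName(a))
	}
}
```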

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-10 08:07:16 -10:00
Andrew Seigner 3a341abe9a
Fix success rate calculation in public api (#723)
The success rate calculation relies on the `classification` label, but
was incorrectly specifying `fail` rather than `failure`.

Fix public api to specify `failure`. Also re-org public api tests for
easier Kubernetes and Prometheus mocking.
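
The effect of the wrong label value can be seen in a small sketch. The `successRate` helper is hypothetical, and the `success` label value is an assumption; only `failure` is named above:

```go
package main

import "fmt"

// successRate computes success / (success + failure) from request counts
// keyed by the value of the `classification` label.
func successRate(counts map[string]float64, failureKey string) float64 {
	success := counts["success"]
	failure := counts[failureKey]
	total := success + failure
	if total == 0 {
		return 0
	}
	return success / total
}

func main() {
	counts := map[string]float64{"success": 3, "failure": 1}
	// Looking up "fail" finds no failures at all, so the rate is overstated.
	fmt.Println(successRate(counts, "fail"))
	fmt.Println(successRate(counts, "failure"))
}
```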

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-10 11:04:04 -07:00
Brian Smith bc16034fd6
Proxy: Fall back to using DNS when Destination service can't find service. (#692)
Fixes #155.
2018-04-07 18:26:06 -10:00
Brian Smith c25e9c371b
Refactor poll_destination() in service discovery. (#725)
No change in behavior is intended here.

Split poll_destination() into two parts, one that operates locally
on the DestinationSet, and the other that operates on data that isn't
wholly local to the DestinationSet. This makes the code easier to
understand. This is being done in preparation for adding DNS fallback
polling to poll_destination().

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-07 18:15:19 -10:00
Brian Smith 7d3b715c4d
Proxy: Move DNS name normalization to service discovery (#722)
Only the destination service needs normalized names (and even then,
that's just temporary). The rest of the code needs the name as it was
given, except case-normalized (lowercased). Because DNS fallback isn't
implemented in service discovery yet, Outbound still uses a temporary
workaround using FullyQualifiedName to keep things working; that will
be removed once DNS fallback is implemented in service discovery.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-06 15:04:09 -10:00
Andrew Seigner 716b392231
Move StatSummary logic into grpc server (#717)
The StatSummary logic was implemented as a method on http_server.

Move the StatSummary logic into grpc_server, for consistency with the
other endpoints.

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-06 16:46:15 -07:00
Andrew Seigner b6bcdcc059
Namespace-aware Grafana dashboards (#716)
The Grafana dashboards key off of deployment, but had no awareness of
namespaces, causing incorrect metrics aggregation and display.

This change makes the Grafana dashboards key off of namespaces, and also
modifies the Grafana links in the Conduit dashboard to link to
namespace+deployment.

Fixes #704
Part of #420

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-06 15:37:53 -07:00
Kevin Lingerfelt baa4d10c2f
CLI: change conduit namespace shorthand flag to -c (#714)
* CLI: change conduit namespace shorthand flag to -c

All of the conduit CLI subcommands accept a --conduit-namespace flag,
indicating the namespace where conduit is running. Some of the
subcommands also provide a --namespace flag, indicating the kubernetes
namespace where a user's application code is running. To prevent
confusion, I'm changing the shorthand flag for the conduit namespace to
-c, and using the -n shorthand when referring to user namespaces.

As part of this change I've also standardized the capitalization of all
of our command line flags, removed the -r shorthand for the install
--registry flag, and made the global --kubeconfig and --api-addr flags
apply to all subcommands.

* Switch flag descriptions from lowercase to Capital

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-04-06 14:47:31 -07:00
Eliza Weisman 8bc05472ed
Make `control::Cache` key-value in order to store discovery metadata (#688)
This PR changes the proxy's `control::Cache` module from a set to a key-value map. 

This change is made in order to use the values in the map to store metadata from the Destination API, but allow evictions and insertions to be based only on the `SocketAddr` of the destination entry. This will make code in PR #661 much simpler, by removing the need to wrap `SocketAddr`s in the cache in a `Labeled` struct for storing metadata, and the need for custom `Borrow` implementations on that type.

Furthermore, I've changed from using a standard library `HashSet`/`HashMap` as the underlying collection to using `IndexMap`, as we suspect that this will result in performance improvements. 

Currently, as `master` has no additional metadata to associate with cache entries, the type of the values in the map is `()`. When #661 merges, the values will actually contain metadata.

If we suspect that there are many other use-cases for `control::Cache` where it will be treated as a set rather than a map, we may want to provide a separate set of impls for `Cache<T, ()>` (like `std::HashSet`) to make the API more ergonomic in this case.
2018-04-06 13:54:16 -07:00
Andrew Seigner 1cf1a0cb13
Fix public-api config in docker-compose (#712)
The public-api in the docker-compose environment is not configured to
talk to Prometheus or Kubernetes, which is now required with the new
telemetry pipeline.

Modify the public-api config in docker-compose to connect to k8s and
prom.

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-06 12:59:34 -07:00
Andrew Seigner 50c323c617
Use canonical k8s names, fix prom labels (#702)
The new statsummary command accepted friendly k8s names, which worked
for k8s queries, but Prometheus requires a specific key.

Modify the statsummary query to map friendly k8s names to canonical k8s
names when constructing the query. Then during the query, map the
canonical k8s name to a specific Prometheus label.
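
The two-step mapping can be sketched with a pair of lookup tables. The table contents and the `canonicalName` and `promLabel` helpers are illustrative assumptions, not the actual values used by the public-api:

```go
package main

import "fmt"

// canonicalNames maps user-friendly resource names to canonical Kubernetes
// names. The entries are illustrative.
var canonicalNames = map[string]string{
	"deploy":      "deployments",
	"deployment":  "deployments",
	"deployments": "deployments",
}

// promLabels maps canonical Kubernetes names to the Prometheus label used
// in queries. The entry is illustrative.
var promLabels = map[string]string{
	"deployments": "deployment",
}

// canonicalName resolves a friendly name when the query is constructed.
func canonicalName(friendly string) (string, error) {
	name, ok := canonicalNames[friendly]
	if !ok {
		return "", fmt.Errorf("unknown resource type %q", friendly)
	}
	return name, nil
}

// promLabel resolves a canonical name to a label when the query is executed.
func promLabel(canonical string) (string, error) {
	label, ok := promLabels[canonical]
	if !ok {
		return "", fmt.Errorf("no Prometheus label for %q", canonical)
	}
	return label, nil
}

func main() {
	canonical, err := canonicalName("deploy")
	if err != nil {
		fmt.Println(err)
		return
	}
	label, err := promLabel(canonical)
	fmt.Println(canonical, label, err)
}
```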

Fixes #695

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-06 12:34:54 -07:00
Brian Smith 15037d9618
Proxy: Improve DNS name parsing (#708)
Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-06 08:45:18 -10:00
Andrew Seigner 836168884e
Link to Grafana from Conduit Dashboard (#678)
* Link to Grafana from Conduit Dashboard

Previously the only way to access the Grafana dashboards was via direct
link, provided by the `conduit dashboard` command.

Add Grafana links throughout the Conduit Dashboard, next to all
Deployment objects. This change also modifies the behavior of the
ConduitLink helper, to enable linking to other deployments proxied by
the `conduit dashboard` command.

Part of #420

Signed-off-by: Andrew Seigner <siggy@buoyant.io>

* review feedback

Signed-off-by: Andrew Seigner <siggy@buoyant.io>

* review feedback, fix console, remove absolute

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-06 10:56:42 -07:00
Eliza Weisman 605e68dff6
Add pretty durations to panics from `assert_eventually!` (#677)
This PR adds the pretty-printing for durations I added in #676 to the panic message from the `assert_eventually!` macro added in #669. 

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
2018-04-06 10:49:17 -07:00
Brian Smith c31f4ba993
Remove unused conversions for Destination. (#701)
These have not been used for a while; they are dead code.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-06 07:35:35 -10:00
Brian Smith 7bc4ffd0a4
Revert "Proxy: Refactor DNS name parsing and normalization (#673)" (#700)
This reverts commit 311ef410a8.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-05 16:49:32 -10:00
Brian Smith 1b223723bc
Revert "Proxy: Refactor poll_destination() in service discovery. (#674)" (#698)
This reverts commit 4fb9877b89.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-05 16:36:01 -10:00
Risha Mars 2f5b5ea5f2
Start implementing conduit stat summary endpoint (#671)
Start implementing new conduit stat summary endpoint. 
Changes the public-api to call prometheus directly instead of the
telemetry service. Wired through to `api/stat` on the web server,
as well as `conduit statsummary` on the CLI. Works for deployments only.

Current implementation just retrieves requests and mesh/total pod count 
(so latency stats are always 0). 

Uses API defined in #663
Example queries that the stat endpoint will eventually satisfy are in #627

This branch includes commits from @klingerf 

* run ./bin/dep ensure
* run ./bin/update-go-deps-shas
2018-04-05 17:05:06 -07:00
Brian Smith 4fb9877b89
Proxy: Refactor poll_destination() in service discovery. (#674)
No change in behavior is intended here.

Split poll_destination() into two parts, one that operates locally
on the DestinationSet, and the other that operates on data that isn't
wholly local to the DestinationSet. This makes the code easier to
understand. This is being done in preparation for adding DNS fallback
polling to poll_destination().

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-05 13:05:11 -10:00
Brian Smith 311ef410a8
Proxy: Refactor DNS name parsing and normalization (#673)
Only the destination service needs normalized names (and even then,
that's just temporary). The rest of the code needs the name as it was
given, except case-normalized (lowercased). Because DNS fallback isn't
implemented in service discovery yet, Outbound still uses a temporary
workaround using FullyQualifiedName to keep things working; that will
be removed once DNS fallback is implemented in service discovery.

Signed-off-by: Brian Smith <brian@briansmith.org>
2018-04-05 12:32:12 -10:00
Andrew Seigner 28d5007cdf
Harmonize Prometheus label usage (#690)
The Destination service used slightly different labels than the
telemetry pipeline expected, specifically, prefixed with `k8s_*`.

Make all Prometheus labels consistent by dropping `k8s_*`. Also rename
`pod_name` to `pod` for consistency with `deployment`, etc. Also update
and reorganize `proxy-metrics.md` to reflect new labelling.
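
The naming rule can be pictured as a mapping over label keys. The `harmonize` helper is hypothetical and only illustrates the rule; it makes no claim about how the change itself was applied in the code:

```go
package main

import (
	"fmt"
	"strings"
)

// harmonize drops the "k8s_" prefix from label keys and renames
// "pod_name" to "pod". Label values are left unchanged.
func harmonize(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		k = strings.TrimPrefix(k, "k8s_")
		if k == "pod_name" {
			k = "pod"
		}
		out[k] = v
	}
	return out
}

func main() {
	fmt.Println(harmonize(map[string]string{
		"k8s_deployment": "web",
		"k8s_pod_name":   "web-1",
	}))
}
```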

Fixes #655

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2018-04-05 15:09:06 -07:00