linkerd2

Commit Graph

Author	SHA1	Message	Date
Eliza Weisman	6eec6256f7	Add transport-level metrics to simulate-proxy (#811 ) This PR adds the transport-level metrics described in #742 to the `simulate-proxy` script. This will be useful while adding these metrics to the Grafana dashboard and/or CLI. Closes #793	2018-04-19 15:18:43 -07:00
Andrew Seigner	293e00bc3e	Introduce tapByResource cli command (#802 ) The existing `tap` command is being deprecated. Introduce a `tapByResource` cli command. It supports tapping a Kubernetes resource or collection of resources, optionally filtered by outbound resources. This command will eventually replace `tap`. Part of #778 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-19 14:44:23 -07:00
Kevin Lingerfelt	653dc6bfaa	Add replication controller stats in CLI (#794 ) * Add replication controller stats in CLI * Fix pod status in stat summary tests Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-18 18:12:14 -07:00
Oliver Gould	06dd8d90ee	Introduce the TapByResource API (#778 ) This changes the public api to have a new rpc type, `TapByResource`. This api supersedes the Tap api. `TapByResource` is richer, more closely reflecting the proxy's capabilities. The proxy's Tap api is extended to select over destination labels, corresponding with those returned by the Destination api. Now both `Tap` and `TapByResource`'s responses may include destination labels. This change avoids breaking backwards compatibility by: * introducing the new `TapByResource` rpc type, opting not to change Tap * extending the proxy's Match type with a new, optional, `destination_label` field. * `TapEvent` is extended with a new, optional, `destination_meta`.	2018-04-18 15:37:07 -07:00
Andrew Seigner	1e4ac8fda8	Destination service provides pod-template-hash (#784 ) The Destination service does not provide ReplicaSet information to the proxy. The `pod-template-hash` label approximates selecting over all pods in a ReplicaSet or ReplicationController. Modify the Destination service to provide this label to the proxy. Relates to #508 and #741 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-18 14:41:27 -07:00
Kevin Lingerfelt	71a51afb40	Expose pod stats in CLI, web UI, and Grafana (#788 ) * Expose pod stats in CLI, web UI, and Grafana * Fix js api helpers test * Add outbound traffic stats to pod dashboard Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-18 11:26:47 -07:00
Andrew Seigner	9e8cce0838	Destination service returns "Running" pod labels (#781 ) When the Destination sees an IP address, it looks up Pods by that IP, and associates Pod label data to it. If the lookup by IP returned more than one Pod, it simply picked the first one. This is not correct, specifically in cases where one pod is in a Running state, and others are not. Modify the Destination service to only return label data for Pods in the Running state. Fixes #773 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-17 14:42:54 -07:00
Andrew Seigner	727521f914	Permit arbitrary time windows in public-api (#774 ) The public-api previously only permitted 4 hard-coded time windows: 10s, 1m, 10m, 1h. This was primarily a relic of the recently removed telemetry system. Modify the public-api to validate the time string, but allow for any window size, which is then passed through to Prometheus. Fixes #686 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-16 17:37:17 -07:00
Kevin Lingerfelt	11a4359e9a	Misc cleanup following the telemetry rewrite (#771 ) Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-16 15:51:07 -07:00
Andrew Seigner	77fb6d3709	Add namespace as a resource type in public-api (#760 ) * Add namespace as a resource type in public-api The cli and public-api only supported deployments as a resource type. This change adds support for namespace as a resource type in the cli and public-api. This also change includes: - cli statsummary now prints `-`'s when objects are not in the mesh - cli statsummary prints `No resources found.` when applicable - removed `out-` from cli statsummary flags, and analagous proto changes - switched public-api to use native prometheus label types - misc error handling and logging fixes Part of #627 Signed-off-by: Andrew Seigner <siggy@buoyant.io> * Refactor filter and groupby label formulation Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Rename stat_summary.go to stat.go in cli Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Update rbac privileges for namespace stats Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-13 16:53:01 -07:00
Andrew Seigner	21886760c6	Use apps/v1beta2 for Kubernetes 1.8 compatibility (#762 ) Conduit was relying on apps/v1 to Deployment and ReplicaSet APIs. apps/v1 is not available on Kubernetes 1.8. This prevented the public-api from starting. Switch Conduit to use apps/v1beta2. Also increase the Kubernetes API cache sync timeout from 10 to 60 seconds, as it was taking 11 seconds on a test cluster. Fixes #761 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-13 12:08:16 -07:00
Kevin Lingerfelt	fb15fe7c1a	Remove the telemetry service (#757 ) * Remove the telemetry service The telemetry service is no longer needed, now that prometheus scrapes metrics directly from proxies, and the public-api talks directly to prometheus. In this branch I'm removing the service itself as well as all of the telemetry protobuf, and updating the conduit install command to no longer install the service. I'm also removing the old version of the stat command, which required the telemetry service, and renaming the statsummary command to stat. * Fix time window tests * Remove deprecated controller scrape config Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-13 11:21:29 -07:00
Andrew Seigner	e9b209829d	Handle NaN metrics (#750 ) The Prometheus client sometimes returns NaN if a calculation is invalid, such as histogram_quantile when no requests have occurred. Add IsNaN check in the public-api and set output to zero. Fixes #747 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-12 15:21:00 -07:00
Andrew Seigner	624b87f743	Implement ListPods in public-api (#743 ) The ListPods endpoint's logic resides in the telemetry service, which is going away. Move ListPods logic into public-api, use new k8s informer APIs. Fixes #694 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-11 17:53:57 -07:00
Kevin Lingerfelt	47caf1ca07	Add --all-namespaces flag to CLI statsummary command (#745 ) * Add --all-namespaces flag to CLI statsummary command * Fix statsummary output formatting Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-11 16:40:25 -07:00
Andrew Seigner	259fdcd134	Add latency stats in new stat summary endpoint (#737 ) The new StatSummary endpoint was only providing request volume and successs rate information. Add support for retrieving latency stats via StatSummary. Also make all prometheus calls in parallel, and implement kubernetes test fixtures. Fixes #681 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-11 11:58:32 -07:00
Kevin Lingerfelt	e1e1b6b599	Controller: add more destination labels, fix service label (#731 ) * Add more destination labels, fix service label * Update owner labels to match proxy metrics docs Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-11 10:44:52 -07:00
Kevin Lingerfelt	91c359e612	Switch public API to use cached k8s resources (#724 ) * Switch public API to use cached k8s resources * Move shared informer code to separate goroutine * Fix spelling issue Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-10 11:39:31 -07:00
Andrew Seigner	3a341abe9a	Fix success rate calculation in public api (#723 ) The success rate calculation relies on the `classification` label, but was incorrectly specifying `fail` rather than `failure`. Fix public api to specify `failure`. Also re-org public api tests for easier Kubernetes and Prometheus mocking. Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-10 11:04:04 -07:00
Andrew Seigner	716b392231	Move StatSummary logic into grpc server (#717 ) The StatSummary logic was implemented as a method on http_server. Move the StatSummary logic into grpc_server, for consistency with the other endpoints. Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-06 16:46:15 -07:00
Andrew Seigner	50c323c617	Use canonical k8s names, fix prom labels (#702 ) The new statsummary command accepted friendly k8s names, which worked for k8s queries, but Prometheus requires a specific key. Modify the statsummary query to map friendly k8s names to canonical k8s names when constructing the query. Then during the query, map the canonical k8s name to a specific Prometheus label. Fixes #695 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-06 12:34:54 -07:00
Risha Mars	2f5b5ea5f2	Start implementing conduit stat summary endpoint (#671 ) Start implementing new conduit stat summary endpoint. Changes the public-api to call prometheus directly instead of the telemetry service. Wired through to `api/stat` on the web server, as well as `conduit statsummary` on the CLI. Works for deployments only. Current implementation just retrieves requests and mesh/total pod count (so latency stats are always 0). Uses API defined in #663 Example queries the stat endpoint will eventually satisfy in #627 This branch includes commits from @klingerf * run ./bin/dep ensure * run ./bin/update-go-deps-shas	2018-04-05 17:05:06 -07:00
Andrew Seigner	28d5007cdf	Harmonize Prometheus label usage (#690 ) The Destination service used slightly different labels than the telemetry pipeline expected, specifically, prefixed with `k8s_`. Make all Prometheus labels consistent by dropping `k8s_`. Also rename `pod_name` to `pod` for consistency with `deployement`, etc. Also update and reorganize `proxy-metrics.md` to reflect new labelling. Fixes #655 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-05 15:09:06 -07:00
Risha Mars	d1a39ea6bf	Define a new telemetry Stat API (#663 ) * Define a new telemetry Stat API Proposal definition for a new Stat API, for the purposes of satisfying the queries proposed in #627. StatSummary will replace Stat once implemented and the original Stat deleted.	2018-04-03 14:45:58 -07:00
Phil Calçado	19001f8d38	Add pod-based metric_labels to destinations response (#429 ) (#654 ) * Extracted logic from destination server * Make tests follow style used elsewhere in the code * Extract single interface for resolvers * Add tests for k8s and ipv4 resolvers * Fix small usability issues * Update dep * Act on feedback * Add pod-based metric_labels to destinations response * Add documentation on running control plane to BUILD.md Signed-off-by: Phil Calcado <phil@buoyant.io> * Fix mock controller in proxy tests (#656) Signed-off-by: Eliza Weisman <eliza@buoyant.io> * Address review feedback * Rename files in the destination package Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-02 18:36:57 -07:00
Brian Smith	df9ead9c36	Use Go 1.10.1 to build all Go code. (#650 ) Go 1.10.1 is a security release. Signed-off-by: Brian Smith <brian@briansmith.org>	2018-04-02 14:58:30 -10:00
Andrew Seigner	97546e0646	Modify simulate-proxy to be more pod-centric (#653 ) simulate-proxy uses a deployment object from kubernetes to simulate each proxy metrics endpoint. Modify simulate-proxy to instead use a pod to simulate each proxy metrics endpoint. This ensures that each metrics endpoint consistently represents a pod in kubernetes, including it's namespace, deployment, and label information. This change also adds support for: - a new `metric-ports` flag, default is `10000-10009`. - `classification`, `pod_name`, and `pod_template_hash` labels Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-03-30 13:28:45 -07:00
Phil Calçado	bbed49c5bd	Refactor destination service and add tests in preparation to add information about labels (#645 ) * Extracted logic from destination server * Make tests follow style used elsewhere in the code * Extract single interface for resolvers * Add tests for k8s and ipv4 resolvers * Fix small usability issues * Update dep * Act on feedback Signed-off-by: Phil Calcado <phil@buoyant.io>	2018-03-30 11:36:48 -07:00
Andrew Seigner	1ed4a93b5e	Higher velocity metrics from simulate-proxy (#635 ) simulate-proxy increments a single set of metrics on each iteration, and also randomizes http status codes, leaving counters unchanged across several collections. Modify simuilate-proxy to increment all metrics on each iteration, provide a 90% success rate, ensure a pod does not call itself, and increase proxy count from 3 to 10. Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-03-28 13:30:02 -07:00
Kevin Lingerfelt	59c75a73a9	Add tests/utils/scripts for running integration tests (#608 ) * Add tests/utils/scripts for running integration tests Add a suite of integration tests in the `test/` directory, as well as utilities for testing in the `testutil/` directory. You can use the `bin/test-run` script to run the full suite of tests, and the `bin/test-cleanup` script to cleanup after the tests. The test/README.md file has more information about running tests. @pcalcado, @franziskagoltz, and @rmars also contributed to this change. * Create TEST.md file at the root of the repo * Update based on review feedback * Relax external service IP timeout for GKE * Update TEST.md with more info about different types of test runs * More updates to TEST.md based on review feedback Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-03-27 15:06:55 -07:00
Andrew Seigner	fe35509406	Clean up Prometheus labels scraped from proxy (#633 ) The Prometheus scrape config collects from Conduit proxies, and maps Kubernetes labels to Prometheus labels, appending "k8s_". This change keeps the resultant Prometheus labels consistent with their source Kubernetes labels. For example: "deployment" and "pod_template_hash". Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-03-27 15:01:08 -07:00
Brian Smith	7dc21f9588	Add the NoEndpoints message to the Destination API (#564 ) Have the controller tell the client whether the service exists, not just what are available. This way we can implement fallback logic to alternate service discovery mechanisms for ambigious names. Signed-off-by: Brian Smith <brian@briansmith.org> Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-03-27 10:45:41 -10:00
Andrew Seigner	12c6531546	Update docker-compose environment to match prod (#609 ) The Prometheus config in the docker-compose environment had fallen behind the prod setup. This change updates the docker-compose environment in the following ways: - Prometheus config more closely matches prod, based on #583 - simulate-proxy labels matches prod, based on #605 - add Grafana container Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-03-23 17:00:39 -07:00
Dennis Adjei-Baah	b90668a0b5	Modify simulate proxy to expose prometheus metrics (#576 ) The simulate-proxy script pushes metrics to the telemetry service. This PR modifies the script to expose metrics to a prometheus endpoint. This functionality creates a server that randomly generates response_total, request_totals, response_duration_ms and response_latency_ms. The server reads pod information from a k8s cluster and picks a random namespace to use for all exposed metrics. Tested out these changes with a locally running prometheus server. I also ran the docker-compose.yml to make sure metrics were being recorded by the prometheus docker container. fixes #498 Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>	2018-03-21 16:40:12 -07:00
Alena Varkockova	b82f89f4d9	Reuse code for metrics serving in controller (#585 ) Signed-off-by: Alena Varkockova varkockova.a@gmail.com	2018-03-19 10:33:25 -07:00
Alex Leong	9eb084c99d	Most controller listeners should only bind on localhost (#494 ) * Most controller listeners should only bind on localhost * Use default listening addresses in controller components * Review feedback * Revert test_helper change * Revert use of absolute domains Signed-off-by: Alex Leong <alex@buoyant.io>	2018-03-12 11:32:20 -07:00
Dennis Adjei-Baah	ad42f2f8ab	Retry k8s watch endpoints on error (#510 ) Shortly after conduit is installed in k8s environment. The control plane component that establishes a watch endpoint with k8s run in to networking issues during proxy initialization. During failure, each watcher fails to retry its connection to k8s watch endpoint which leads to timeouts and eventually, multiple controller pod restarts. This PR adds retry logic to each "watch" enabled package. fixes #478 Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>	2018-03-07 13:40:43 -08:00
Dennis Adjei-Baah	5a4c5aa683	Exclude telemetry generated by the control plane when requesting depl… (#493 ) When the conduit proxy is injected into the controller pod, we observe controller pod proxy stats show up as an "outbound" deployment for an unrelated upstream deployment. This may cause confusion when monitoring deployments in the service mesh. This PR filters out this "misleading" stat in the public api whenever the dashboard requests metric information for a specific deployment. * exclude telemetry generated by the control plane when requesting deployment metrics fixes #370 Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>	2018-03-05 17:58:08 -08:00
Andrew Seigner	698e65da8b	Fix flakey dns_test (#516 ) The dns_test had assumed DNS changes were deterministically ordered, but util.DiffAddresses uses a map and therefore does not guarantee ordering. Fix dns_test to sort TCP Addresses prior to comparison. Fixes #515 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-03-05 16:50:33 -08:00
Kevin Lingerfelt	8e2ef9d658	Handle ExternalName-type svcs in destination service (#490 ) * Handle ExternalName-type svcs in destination service * Move refresh interval to a global var Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-03-02 11:30:53 -08:00
Alex Leong	9b4e847555	Add DNS label validation in destination service (#464 ) Add a validation in the destination service that ensures that DNS destinations consist of valid labels. Signed-off-by: Alex Leong <alex@buoyant.io>	2018-03-01 15:49:49 -08:00
Kevin Lingerfelt	e57e74056e	Run go fix to fix context package imports (#470 ) Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-02-28 13:25:33 -08:00
Alex Leong	84ba1f3017	Ensure tap requests at least 1rps from each pod (#459 ) When attempting to tap N pods when N is greater than the target rps, a rounding error occurs that requests 0 rps from each pod and no tap data is returned. Ensure that tap requests at least 1 rps from each target pod. Tested in Kubernetes on docker-for-desktop with a 15 replica deployment and a maxRps of 10. Signed-off-by: Alex Leong <alex@buoyant.io>	2018-02-27 16:03:47 -08:00
Brian Smith	78ebd5e340	Base control plane Docker images on scratch instead of base. (#368 ) The control plane is proxied through the Conduit proxy. The Conduit proxy is based on the base image, and the control plane containers and the proxy share a networking namespace. This means we don't need the extra base utilities in the controller images since we can use the utilties in the proxy image. This is a step towards building the initial no-networking Conduit CA pod. Since the Conduit CA will not do any networking of its own, we networking debugging utilties are not helpful for it. They are actually an unnecessary risk because they could facilitate the exfiltration of the private key of the CA. (The Conduit CA pod won't have the Conduit Proxy injected into it either.) This also simplifies & slightly speeds up the building of the controller images. This is a stepping stone towards being able to build the controller images without `docker build` to improve build times. Signed-off-by: Brian Smith <brian@briansmith.org>	2018-02-23 13:03:19 -10:00
Brian Smith	cf3c8cd7bc	Use Go 1.10.0 to build Go components. (#408 ) * Use Go 1.10.0 to build Go components. Take advantage of the new build cache in Go 1.10. Future work on improving build performance will utilize the build cache further. Signed-off-by: Brian Smith <brian@briansmith.org>	2018-02-21 14:31:29 -10:00
Brian Smith	e6aad57766	Remove temporary files generated by dep in go-deps image. (#407 ) Previously Dockerfile-go-deps was converted from a multi-stage Dockefile to a single-stage Dockerfile in anticipation of enabling efficient use of `--cache-from` in CI. However, that resulted in the image ballooning in size because it contained the Git repo for every package downloaded by `dep ensure`. Bring the image back down to the proper size by removing the temporary files created. Signed-off-by: Brian Smith <brian@briansmith.org>	2018-02-21 13:06:24 -10:00
Alex Leong	552204366c	Use Prometheus to track added data plane pods. (#338 ) The instance cache that powers the ListPods API is stored in memory in the telemetry service. This means that when there are multiple replicas of the telemetry service, each replica will have a distinct, incomplete view of the added pods based on which pods report to that telemetry replica. This causes the data plane bubbles on the dashboard to not all be filled in, and to flicker with each data refresh. We create a Prometheus counter called reports_total which has pod as a label. Whenever a telemetry service instance receives a report from a pod, it increments reports_total for that pod. This allows us to remove the in-memory instance cache and instead query Prometheus to see if each pod has had a report in the last 30 seconds. Fixes #337 Signed-off-by: Alex Leong <alex@buoyant.io>	2018-02-14 16:09:55 -08:00
Andrew Seigner	1db7d2a2fb	Ensure latency quantile queries match timestamps (#348 ) In PR #298 we moved time window parsing (10s => (time.now - 10s, time.now) down the stack to immediately before the query. This had the unintended effect of creating parallel latency quantile requests with slightly different timestamps. This change parses the time window prior to latency quantile fan out, ensuring all requests have the same timestamp. Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-02-13 16:26:54 -08:00
Andrew Seigner	50f4aa57e5	Require timestamp on all telemetry requests (#342 ) PR #298 moved summary (non-timeseries) requests to Prometheus' Query endpoint, with no timestamp provided. This Query endpoint returns a single data point with whatever timestamp was provided in the request. In the absense of a timestamp, it uses current server time. This causes the Public API to return discreet data points with slightly different timestamps, which is unexpected behavior. Modify the Public API -> Telemetry -> Prometheus request path to always require a timestamp for single data point requests. Fixes #340 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-02-13 13:52:21 -08:00
Brian Smith	b18fe459d4	Precompile large Go libraries in go-deps Docker image. (#332 ) On my system (i9-7960x running Docker natively in Linux) this regularly saves over 11 seconds of build time when a file under pkg/ changes and over 1.5 seconds of build time when a file under controller/ changes. Since most contributors are running Docker in a VM on less powerful computers, the savings for most contributors should be significantly greater. I imagine the savings for web/ and cli/ and proxy-init/ are similar, but I did not measure them. Signed-off-by: Brian Smith <brian@briansmith.org>	2018-02-13 11:35:10 -10:00

1 2 3

101 Commits