linkerd2

Commit Graph

Author	SHA1	Message	Date
Ivan Sim	11d1d55632	Filter out failed and completed pods from stats summary result (#1010 ) (#1065 ) Both the conduit stat command and web UI are showing failed and completed pods. This change filters out those pods before returning the result to the client. Fixes #1010 Signed-off-by: Ivan Sim <ihcsim@gmail.com>	2018-06-05 13:19:48 -07:00
Kevin Lingerfelt	ec2433e9bd	Update controller to use 'tls' metric label (#1044 ) * Update controller to use 'tls' metric label * Fix meshed column formatter Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-06-01 16:44:33 -07:00
Risha Mars	ffabdefc6c	Add queries to prometheus to determine number of fully meshed requests (#983 ) - Update the `response_total` prometheus query of the StatSummary endpoint to also break queries out by a `meshed` label. - Add a 'Secured' column to the web UI/CLI stat displays, which indicate the percentage of traffic starting and ending in the mesh This meshed label is used in the CLI/Web UI to display a column of the percentage of traffic that starts/ends in the mesh. (Which is a proxy indicator for whether that traffic is 'secured' when we add TLS by default for intra mesh requests). The `meshed` label is not yet added anywhere, so until it is supplied by the proxy, all traffic will show up as 0% secured in the web/CLI.	2018-05-24 11:05:09 -07:00
Andrew Seigner	84e6eb5c87	Fix nil pointer dereference in StatSummary (#991 ) The StatSummary endpoint was dereferencing StatSummaryRequest.Selector.Resource, causing a panic when it received an empty request. Fix StatSummary to use the nil-friendly StatSummaryRequest.GetSelector().GetResource() methods, and add a test to validate. Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-05-23 13:21:49 -07:00
Risha Mars	1e6434f6de	Fix bug in the public-api where conduit stat params were ignored (#971 ) * Fix bug where we were dropping parts of the StatSummaryRequest * Add tests for prometheus query strings and for failed cases Problem In #928 I rewrote the stat api to handle 'all' as a resource type. To query for all resource types, we would copy the Resource, LabelSelector and TimeWindow of the original request, and then go through all the resource types and set Resource.Type for each resource we wanted to get. The bug is that while we copy over some fields of the original request, we didn't copy over all of them - namely Resource.Name and the Outbound resource. So the Stat endpoint would ignore any --to or --from flags, and would ignore requests for a specific named resource. Solution Copy over all fields from the request. I've also added tests for this case. In this process I've refactored the stat_summary_test code to make it a bit easier to read/use.	2018-05-18 16:06:06 -07:00
Risha Mars	b8dc83f9d2	Modify the Stat API to handle requests for resource type "all" (#928 ) Allow the Stat endpoint in the public-api to accept requests for resourceType "all". Currently, this queries Pods, Deployments, RCs and Services, but can be modified to query other resources as well. Both the CLI and web endpoints now work if you set resourceType to all. e.g. `conduit stat all`	2018-05-11 14:35:37 -07:00
Kevin Lingerfelt	4e8e1eb84d	CLI: Fix validation for service stats (#935 ) * CLI: Fix validation for service stats * Address review feedback Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-05-11 10:28:49 -07:00
Risha Mars	f94856e489	Modify the Stat endpoint to also return the number of failed conduit pods (#895 ) * Modify the Stat endpoint to also return the count of failed pods * Add comments explaining pod count stats * Rename total pod count to running pod count This is to support the service mesh overview page, as I'd like to include an indicator of failed pods there.	2018-05-08 10:35:21 -07:00
Andrew Seigner	dce31b888f	Deprecate Tap, rename TapByResource to Tap (#844 ) The `conduit tap` command is now deprecated. Replace `conduit tap` with `conduit tapByResource`. Rename tapByResource to tap. The underlying protobuf for tap remains, the tap gRPC endpoint now returns Unimplemented. Fixes #804 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-25 12:24:46 -07:00
Andrew Seigner	a0a9a42e23	Implement Public API and Tap on top of Lister (#835 ) public-api and tap were both using their own implementations of the Kubernetes Informer/Lister APIs. This change factors out all Informer/Lister usage into the Lister module. This also introduces a new `Lister.GetObjects` method. Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-24 18:10:48 -07:00
Andrew Seigner	baf4ea1a5a	Implement TapByResource in Tap Service (#827 ) The TapByResource endpoint was previously a stub. Implement end-to-end tapByResource functionality, with support for specifying any kubernetes resource(s) as target and destination. Fixes #803, #49 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-23 16:13:26 -07:00
Andrew Seigner	79bdc638b3	Service support in stat command (#809 ) The `stat` command did not support `service` as a resource type. This change adds `service` support to the `stat` command. Specifically: - as a destination resource on `--to` commands - as a target resource on `--from` commands Fixes #805 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-19 16:51:20 -07:00
Andrew Seigner	293e00bc3e	Introduce tapByResource cli command (#802 ) The existing `tap` command is being deprecated. Introduce a `tapByResource` cli command. It supports tapping a Kubernetes resource or collection of resources, optionally filtered by outbound resources. This command will eventually replace `tap`. Part of #778 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-19 14:44:23 -07:00
Kevin Lingerfelt	653dc6bfaa	Add replication controller stats in CLI (#794 ) * Add replication controller stats in CLI * Fix pod status in stat summary tests Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-18 18:12:14 -07:00
Oliver Gould	06dd8d90ee	Introduce the TapByResource API (#778 ) This changes the public api to have a new rpc type, `TapByResource`. This api supersedes the Tap api. `TapByResource` is richer, more closely reflecting the proxy's capabilities. The proxy's Tap api is extended to select over destination labels, corresponding with those returned by the Destination api. Now both `Tap` and `TapByResource`'s responses may include destination labels. This change avoids breaking backwards compatibility by: * introducing the new `TapByResource` rpc type, opting not to change Tap * extending the proxy's Match type with a new, optional, `destination_label` field. * `TapEvent` is extended with a new, optional, `destination_meta`.	2018-04-18 15:37:07 -07:00
Kevin Lingerfelt	71a51afb40	Expose pod stats in CLI, web UI, and Grafana (#788 ) * Expose pod stats in CLI, web UI, and Grafana * Fix js api helpers test * Add outbound traffic stats to pod dashboard Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-18 11:26:47 -07:00
Andrew Seigner	727521f914	Permit arbitrary time windows in public-api (#774 ) The public-api previously only permitted 4 hard-coded time windows: 10s, 1m, 10m, 1h. This was primarily a relic of the recently removed telemetry system. Modify the public-api to validate the time string, but allow for any window size, which is then passed through to Prometheus. Fixes #686 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-16 17:37:17 -07:00
Kevin Lingerfelt	11a4359e9a	Misc cleanup following the telemetry rewrite (#771 ) Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-16 15:51:07 -07:00
Andrew Seigner	77fb6d3709	Add namespace as a resource type in public-api (#760 ) * Add namespace as a resource type in public-api The cli and public-api only supported deployments as a resource type. This change adds support for namespace as a resource type in the cli and public-api. This change also includes: - cli statsummary now prints `-`'s when objects are not in the mesh - cli statsummary prints `No resources found.` when applicable - removed `out-` from cli statsummary flags, and analogous proto changes - switched public-api to use native prometheus label types - misc error handling and logging fixes Part of #627 Signed-off-by: Andrew Seigner <siggy@buoyant.io> * Refactor filter and groupby label formulation Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Rename stat_summary.go to stat.go in cli Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Update rbac privileges for namespace stats Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-13 16:53:01 -07:00
Andrew Seigner	21886760c6	Use apps/v1beta2 for Kubernetes 1.8 compatibility (#762 ) Conduit was relying on apps/v1 for the Deployment and ReplicaSet APIs. apps/v1 is not available on Kubernetes 1.8. This prevented the public-api from starting. Switch Conduit to use apps/v1beta2. Also increase the Kubernetes API cache sync timeout from 10 to 60 seconds, as it was taking 11 seconds on a test cluster. Fixes #761 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-13 12:08:16 -07:00
Kevin Lingerfelt	fb15fe7c1a	Remove the telemetry service (#757 ) * Remove the telemetry service The telemetry service is no longer needed, now that prometheus scrapes metrics directly from proxies, and the public-api talks directly to prometheus. In this branch I'm removing the service itself as well as all of the telemetry protobuf, and updating the conduit install command to no longer install the service. I'm also removing the old version of the stat command, which required the telemetry service, and renaming the statsummary command to stat. * Fix time window tests * Remove deprecated controller scrape config Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-13 11:21:29 -07:00
Andrew Seigner	e9b209829d	Handle NaN metrics (#750 ) The Prometheus client sometimes returns NaN if a calculation is invalid, such as histogram_quantile when no requests have occurred. Add IsNaN check in the public-api and set output to zero. Fixes #747 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-12 15:21:00 -07:00
Andrew Seigner	624b87f743	Implement ListPods in public-api (#743 ) The ListPods endpoint's logic resides in the telemetry service, which is going away. Move ListPods logic into public-api, use new k8s informer APIs. Fixes #694 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-11 17:53:57 -07:00
Kevin Lingerfelt	47caf1ca07	Add --all-namespaces flag to CLI statsummary command (#745 ) * Add --all-namespaces flag to CLI statsummary command * Fix statsummary output formatting Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-11 16:40:25 -07:00
Andrew Seigner	259fdcd134	Add latency stats in new stat summary endpoint (#737 ) The new StatSummary endpoint was only providing request volume and success rate information. Add support for retrieving latency stats via StatSummary. Also make all prometheus calls in parallel, and implement kubernetes test fixtures. Fixes #681 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-11 11:58:32 -07:00
Kevin Lingerfelt	91c359e612	Switch public API to use cached k8s resources (#724 ) * Switch public API to use cached k8s resources * Move shared informer code to separate goroutine * Fix spelling issue Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-04-10 11:39:31 -07:00
Andrew Seigner	3a341abe9a	Fix success rate calculation in public api (#723 ) The success rate calculation relies on the `classification` label, but was incorrectly specifying `fail` rather than `failure`. Fix public api to specify `failure`. Also re-org public api tests for easier Kubernetes and Prometheus mocking. Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-10 11:04:04 -07:00
Andrew Seigner	716b392231	Move StatSummary logic into grpc server (#717 ) The StatSummary logic was implemented as a method on http_server. Move the StatSummary logic into grpc_server, for consistency with the other endpoints. Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-06 16:46:15 -07:00
Andrew Seigner	50c323c617	Use canonical k8s names, fix prom labels (#702 ) The new statsummary command accepted friendly k8s names, which worked for k8s queries, but Prometheus requires a specific key. Modify the statsummary query to map friendly k8s names to canonical k8s names when constructing the query. Then during the query, map the canonical k8s name to a specific Prometheus label. Fixes #695 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-04-06 12:34:54 -07:00
Risha Mars	2f5b5ea5f2	Start implementing conduit stat summary endpoint (#671 ) Start implementing new conduit stat summary endpoint. Changes the public-api to call prometheus directly instead of the telemetry service. Wired through to `api/stat` on the web server, as well as `conduit statsummary` on the CLI. Works for deployments only. Current implementation just retrieves requests and mesh/total pod count (so latency stats are always 0). Uses API defined in #663 Example queries the stat endpoint will eventually satisfy in #627 This branch includes commits from @klingerf * run ./bin/dep ensure * run ./bin/update-go-deps-shas	2018-04-05 17:05:06 -07:00
Risha Mars	d1a39ea6bf	Define a new telemetry Stat API (#663 ) * Define a new telemetry Stat API Proposal definition for a new Stat API, for the purposes of satisfying the queries proposed in #627. StatSummary will replace Stat once implemented and the original Stat deleted.	2018-04-03 14:45:58 -07:00
Dennis Adjei-Baah	5a4c5aa683	Exclude telemetry generated by the control plane when requesting depl… (#493 ) When the conduit proxy is injected into the controller pod, we observe controller pod proxy stats show up as an "outbound" deployment for an unrelated upstream deployment. This may cause confusion when monitoring deployments in the service mesh. This PR filters out this "misleading" stat in the public api whenever the dashboard requests metric information for a specific deployment. * exclude telemetry generated by the control plane when requesting deployment metrics fixes #370 Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>	2018-03-05 17:58:08 -08:00
Kevin Lingerfelt	e57e74056e	Run go fix to fix context package imports (#470 ) Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-02-28 13:25:33 -08:00
Andrew Seigner	1db7d2a2fb	Ensure latency quantile queries match timestamps (#348 ) In PR #298 we moved time window parsing (10s => (time.now - 10s, time.now) down the stack to immediately before the query. This had the unintended effect of creating parallel latency quantile requests with slightly different timestamps. This change parses the time window prior to latency quantile fan out, ensuring all requests have the same timestamp. Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-02-13 16:26:54 -08:00
Andrew Seigner	50f4aa57e5	Require timestamp on all telemetry requests (#342 ) PR #298 moved summary (non-timeseries) requests to Prometheus' Query endpoint, with no timestamp provided. This Query endpoint returns a single data point with whatever timestamp was provided in the request. In the absence of a timestamp, it uses current server time. This causes the Public API to return discrete data points with slightly different timestamps, which is unexpected behavior. Modify the Public API -> Telemetry -> Prometheus request path to always require a timestamp for single data point requests. Fixes #340 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-02-13 13:52:21 -08:00
Andrew Seigner	261586b862	Fix pointer copying (#330 ) The Public APIs stat endpoint copies a slice of values to a slice of pointers prior to gRPC response. Go's range clause re-uses the same pointer for each iteration of the loop, causing a slice of {1,2,3} becoming {3,3,3}. Fix the range loop to directly reference pointers in the slice of values, ignoring the range variable. Also add tests to catch this case. Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-02-10 11:04:28 -08:00
Andrew Seigner	bffa5ff3e6	Concurrent Telemetry requests (#323 ) All requests from the public API service to the Telemetry service were done serially. In some cases a single request to the public API's Stat endpoint resulted in 5 serial requests to the Telemetry service. Make all requests from the Public API to Telemetry concurrent. Signed-off-by: Andrew Seigner <siggy@buoyant.io> Part of #299	2018-02-09 17:11:20 -08:00
Eliza Weisman	458e9d2ac5	Remove per-path metrics from telemetry pipeline (#317 ) Follow-up from #315. Now that the UIs don't report per-path metrics, we can remove the path label from Prometheus, the path aggregation and filtering options from the telemetry API, and the path field from the proxy report API. I've modified the tests to no longer expect the removed fields, and manually verified that Conduit still works after making these changes. Closes #265 Signed-off-by: Eliza Weisman <eliza@buoyant.io>	2018-02-09 14:20:28 -08:00
Andrew Seigner	33e3c3ace9	Optimize Prometheus queries (#298 ) Prometheus queries from the Telemetry service were taking seconds or 10s of seconds. Optimize these queries: - Move all summary queries requiring a single data point off of Prometheus' QueryRange() endpoint, onto Query() - Set `defaultVectorRange` to 30s, and also use it regardless of time window Also add tests for grpc_server and telemetry server Signed-off-by: Andrew Seigner <siggy@buoyant.io> Fixes #260	2018-02-09 10:55:07 -08:00
Eliza Weisman	2015d992cc	Remove pod-level metrics from web and CLI (#304 ) This PR updates the web UI to remove the pod detail page, and to remove the links to that page from pod names in metrics tables. It also removes the `pods` option from `conduit stat`, and the `sourcePod` and `targetPod` fields from the controller API proto's `MetricMetadata` message. I've updated the `conduit stat` tests to reflect these changes, and manually verified the web UI changes. Closes #261 Signed-off-by: Eliza Weisman <eliza@buoyant.io>	2018-02-08 19:07:10 -08:00
Phil Calçado	9c03764a29	Remove hardcoded port and shared state for http test (#282 ) We now create a new test HTTP server per test case instead of sharing it across them all. This should solve the data races we have experienced on Travis. Signed-off-by: Phil Calcado <phil@buoyant.io>	2018-02-06 13:48:14 -05:00
Andrew Seigner	4156af786d	Enable race detection in ci (#259 ) We previously did not have race detection enabled because our tests would fail. Following #249, this is no longer the case. Enable race detection in ci and build instructions. This change also fixes client_test.go attempting to allocate a 2GB buffer due to bad test input. Fixes #173 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-02-02 15:04:52 -08:00
Andrew Seigner	277c06cf1e	Simplify and refactor k8s labels and annotations (#227 ) The conduit.io/* k8s labels and annotations were redundant in some cases, and not flexible enough in others. This change modifies the labels in the following ways: `conduit.io/plane: control` => `conduit.io/controller-component: web` `conduit.io/controller: conduit` => `conduit.io/controller-ns: conduit` `conduit.io/plane: data` => (remove, redundant with `conduit.io/controller-ns`) It also centralizes all k8s labels and annotations into pkg/k8s/labels.go, and adds tests for the install command. Part of #201 Signed-off-by: Andrew Seigner <siggy@buoyant.io>	2018-02-01 14:12:06 -08:00
Risha Mars	a9d4a3d74e	Add more prometheus instrumentation (latency, response size) (#174 ) We added basic prometheus instrumentation, but this only encapsulated basic go metrics and request counts. This adds latency and response size metrics exporting as well, to the public-api server, the web server and the telemetry server. Since the util function in grpc.go was basically used to wrap the server creation in a prometheus handler, I added the other prometheus constants in there and renamed the file to prometheus.go. - Add request duration and response size instrumentation to web and public api - Also add latency monitoring to telemetry service requests - Rename util/grpc.go to util/prometheus.go	2018-02-01 09:50:31 -08:00
Kevin Lingerfelt	4a76c6448b	Update cli subcommands to print errors when encountered (#221 ) Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-01-29 11:28:19 -08:00
Kevin Lingerfelt	7399df83f1	Set conduit version to match conduit docker tags (#208 ) * Set conduit version to match conduit docker tags Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Remove --skip-inbound-ports for emojivoto Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Rename git_sha => git_sha_head Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Switch to using the go linker for setting the version Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Log conduit version when go servers start Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Cleanup conduit script Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Add --short flag to head sha command Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Set CONDUIT_VERSION in docker-compose env Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-01-26 11:43:45 -08:00
Phil Calçado	9410da471a	Better error handling for Tap (#177 ) Previously, running `$conduit tap` would return a `Unexpected EOF` error when the server wasn't available. This was due to a few problems with the way we were handling errors all the way down the tap server. This change fixes that and cleans some of the protobuf-over-HTTP code. - first step towards #49 - closes #106	2018-01-25 11:49:38 -05:00
Dennis Adjei-Baah	f7af375e73	Remove scheme requirement for api-addr flag in conduit CLI (#126 ) * Allow external controller public api clients that don't rely on a kubeconfig to interact with Conduit CLI Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>	2018-01-17 17:12:44 -08:00
Kevin Lingerfelt	fd3cfcb5d9	Move healthcheck proto to separate file, use throughout (#150 ) * Move healthcheck proto to separate file, use throughout Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Remove Check message from healthcheck.proto Signed-off-by: Kevin Lingerfelt <kl@buoyant.io> * Standardize healthcheck protobuf import name Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>	2018-01-17 11:15:38 -08:00
Phil Calçado	612bd0f7a0	Add --verbose option to CLI (#154 ) * Use stdout as writer for tap command fixes #136 Signed-off-by: Phil Calcado <phil@buoyant.io> * Add --log-level to command line Signed-off-by: Phil Calcado <phil@buoyant.io>	2018-01-17 12:06:43 -05:00
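
The pointer-copying fix in 261586b862 (#330) describes a common Go pitfall: before Go 1.22, a loop such as `for _, v := range values { ptrs = append(ptrs, &v) }` took the address of one shared range variable, so every pointer ended up referring to the last element. The sketch below illustrates the fix described in that commit message, which is to index into the slice instead. It is not the code from the commit: the function name `toPointers` is hypothetical, and `int` stands in for whatever value type the stat endpoint actually copies.

```go
package main

import "fmt"

// toPointers is a hypothetical helper (not from the linkerd2 source).
// It returns a slice of pointers to the elements of values. It takes the
// address of values[i] rather than of a range variable, so each pointer
// refers to a distinct element on every Go version.
func toPointers(values []int) []*int {
	ptrs := make([]*int, 0, len(values))
	for i := range values {
		ptrs = append(ptrs, &values[i])
	}
	return ptrs
}

func main() {
	// Print each pointed-to value on its own line.
	for _, p := range toPointers([]int{1, 2, 3}) {
		fmt.Println(*p)
	}
}
```

Because the pointers alias the backing array of the input slice, later writes to the slice are visible through them. Since Go 1.22 each loop iteration gets its own variable, so the original bug no longer reproduces in modules that declare that language version, but the index-based form remains correct everywhere.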

58 Commits