linkerd2

Commit Graph

Author	SHA1	Message	Date
Tarun Pothulapati	cd2e911be3	viz: add data-plane and prometheus healthchecks (#5602 ) * viz: add data-plane and prometheus healthchecks Fixes #5325 This branch adds the remaining healthchecks for the viz extension i.e - Data-plane metrics check in Prometheus - `--proxy` mode which also checks for tap injections based on annotations. For this, The following changes were needed - Category.ID is made public so that --proxy toggleness can be allowed - Made tap env key as a field so that it can be re-used for checks simplify viz.NewHealthChecker by removing the need to pass categoryIDs and instead using hc.appendCategories directly at the caller to add the required categories. This is possible by dividing the vizCategories into separate functions Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2021-02-01 23:01:13 +05:30
Matei David	0ce9e84a94	Introduce V1 to CRDs and Mutating Hooks (#5603 ) Closes #5484 ### Changes --- Overview: * Update golden files and make necessary spec changes * Update test files for viz * Add v1 to healthcheck and uninstall * Fix link-crd clusterDomain field validation - To update to v1, I had to change crd schemas to be version-based (i.e each version has to declare its own schema). I noticed an error in the link-crd (`targetClusterDomain` was `targetDomainName`). Also, additionalPrinterColumns are also version-dependent as a field now. - For `admissionregistration` resources I had to add an additional `admissionReviewVersions` field -- I included `v1` and `v1beta1`. - In `healthcheck.go` and `resources.go` (used by `uninstall`) I had to make some changes to the client-go versions (i.e from `v1beta1` to `v1` for admissionreg and apiextension) so that we don't see any warning messages when uninstalling or when we do any install checks. I tested again different cli and k8s versions to have a bit more confidence in the changes (in addition to automated tests), hope the cases below will be enough, if not let me know and I can test further. 
### Tests Linkerd local build CLI + k8s 1.19+ `install/check/mc-check/mc-install/mc-link/viz-install/viz-check/uninstall/` ``` $ kubectl version Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2+k3s1", GitCommit:"1d4adb0301b9a63ceec8cabb11b309e061f43d5f", GitTreeState:"clean", BuildDate:"2021-01-14T23:52:37Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"} $ bin/linkerd version Client version: git-b0fd2ec8 Server version: unavailable $ bin/linkerd install \| kubectl apply -f - - no errors, no version warnings - $ bin/linkerd check --expected-version git-b0fd2ec8 Status check results are :tick: # MC $ bin/linkerd mc install \| k apply -f - - no erros, no version warnings - $ bin/linkerd mc check Status check results are :tick: $ bin/linkerd mc link foo \| k apply -f - # test crd creation # had a validation error here because the schema had targetDomainName instead of targetClusterDomain # changed, rebuilt cli, re-installed mc, tried command again secret/cluster-credentials-foo created link.multicluster.linkerd.io/foo created ... 
# VIZ $ bin/linkerd viz install \| k apply -f - - no errors, no version warnings - $ bin/linkerd viz check - no errors, no version warnings - Status check results are :tick: $ bin/linkerd uninstall \| k delete -f - - no errors, no version warnings - ``` Linkerd local build CLI + k8s 1.17 `check-pre/install/mc-check/mc-install/mc-link/viz-install/viz-check` ``` $ kubectl version Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.17-rc1+k3s1", GitCommit:"e8c9484078bc59f2cd04f4018b095407758073f5", GitTreeState:"clean", BuildDate:"2021-01-14T06:20:56Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"} $ bin/linkerd version Client version: git-3d2d4df1 # made changes to link-crd after prev test case Server version: unavailable $ bin/linkerd check --pre --expected-version git-3d2d4df1 - no errors, no version warnings - Status check results are :tick: $ bin/linkerd install \| k apply -f - - no errors, no version warnings - $ bin/linkerd check --expected-version git-3d2d4df1 - no errors, no version warnings - Status check results are :tick: $ bin/linkerd mc install \| k apply -f - - no errors, no version warnings - $ bin/linkerd mc check - no errors, no version warnings - Status check results are :tick: $ bin/linkerd mc link --cluster-name foo \| k apply -f - bin/linkerd mc link --cluster-name foo \| k apply -f - secret/cluster-credentials-foo created link.multicluster.linkerd.io/foo created # VIZ $ bin/linkerd viz install \| k apply -f - - no errors, no version warnings - $ bin/linkerd viz check - no errors, no version warnings - - hangs up indefinitely after linkerd-viz can talk to Kubernetes ``` Linkerd edge (21.1.3) CLI + k8s 1.17 (already installed) `check` ``` $ linkerd version Client version: edge-21.1.3 Server version: git-3d2d4df1 $ linkerd check - no errors - - warnings: mismatch between cli & control plane, control plane not up to date (both expected) - Status check results are :tick: ``` Linkerd stable (2.9.2) CLI + k8s 1.17 
(already installed) `check/uninstall` ``` $ linkerd version Client version: stable-2.9.2 Server version: git-3d2d4df1 $ linkerd check × control plane ClusterRoles exist missing ClusterRoles: linkerd-linkerd-tap see https://linkerd.io/checks/#l5d-existence-cr for hints Status check results are × # viz wasn't installed, hence the error, installing viz didn't help since # the res is named `viz-tap` now # moving to uninstall $ linkerd uninstall \| k delete -f - - no warnings, no errors - ``` _Note_: I used `go test ./cli/cmd/... --generate` which is why there are so many changes 😨 Signed-off-by: Matei David <matei.david.35@gmail.com>	2021-02-01 09:18:13 -05:00
Hu Shuai	5e3d5190c3	Add unit test for pkg/healthcheck/sidecar.go (#5609 ) Signed-off-by: Hu Shuai <hus.fnst@cn.fujitsu.com>	2021-01-27 16:56:14 -05:00
Tarun Pothulapati	4f0601e632	jaeger: cli and check logic cleanup (#5564 ) This branch cleans up some of the unnecessary logic that is not needed and thus making the check logic similar to that of other extensions, namely viz. Includes the following cleanups: - Remove `namespace` flag in jaeger CLI and make the fetching logic dynamic and use it in check and dashboard. - Use `hc.KubeAPIClient` instead of creating our own in jaeger check. - Move injection checks up before we run the readiness checks This change adds a new extension namespace exist check for jaeger. Also, Updates integration tests to run the check commands. Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2021-01-22 23:31:35 +05:30
Alejandro Pedraza	8ac5360041	Extract from public-api all the Prometheus dependencies, and moves things into a new viz component 'linkerd-metrics-api' (#5560 ) * Protobuf changes: - Moved `healthcheck.proto` back from viz to `proto/common` as it remains being used by the main `healthcheck.go` library (it was moved to viz by #5510). - Extracted from `viz.proto` the IP-related types and put them in `/controller/gen/common/net` to be used by both the public and the viz APIs. * Added chart templates for new viz linkerd-metrics-api pod * Spin-off viz healthcheck: - Created `viz/pkg/healthcheck/healthcheck.go` that wraps the original `pkg/healthcheck/healthcheck.go` while adding the `vizNamespace` and `vizAPIClient` fields which were removed from the core `healthcheck`. That way the core healthcheck doesn't have any dependencies on viz, and viz' healthcheck can now be used to retrieve viz api clients. - The core and viz healthcheck libs are now abstracted out via the new `healthcheck.Runner` interface. - Refactored the data plane checks so they don't rely on calling `ListPods` - The checks in `viz/cmd/check.go` have been moved to `viz/pkg/healthcheck/healthcheck.go` as well, so `check.go`'s sole responsibility is dealing with command business. This command also now retrieves its viz api client through viz' healthcheck. * Removed linkerd-controller dependency on Prometheus: - Removed the `global.prometheusUrl` config in the core values.yml. - Leave the Heartbeat's `-prometheus` flag hard-coded temporarily. TO-DO: have it automatically discover viz and pull Prometheus' endpoint (#5352). * Moved observability gRPC from linkerd-controller to viz: - Created a new gRPC server under `viz/metrics-api` moving prometheus-dependent functions out of the core gRPC server and into it (same thing for the accompaigning http server). - Did the same for the `PublicAPIClient` (now called just `Client`) interface. 
The `VizAPIClient` interface disappears as it's enough to just rely on the viz `ApiClient` protobuf type. - Moved the other files implementing the rest of the gRPC functions from `controller/api/public` to `viz/metrics-api` (`edge.go`, `stat_summary.go`, etc.). - Also simplified some type names to avoid stuttering. * Added linkerd-metrics-api bootstrap files. At the same time, we strip out of the public-api's `main.go` file the prometheus parameters and other no longer relevant bits. * linkerd-web updates: it requires connecting with both the public-api and the viz api, so both addresses (and the viz namespace) are now provided as parameters to the container. * CLI updates and other minor things: - Changes to command files under `cli/cmd`: - Updated `endpoints.go` according to new API interface name. - Updated `version.go`, `dashboard` and `uninstall.go` to pull the viz namespace dynamically. - Changes to command files under `viz/cmd`: - `edges.go`, `routes.go`, `stat.go` and `top.go`: point to dependencies that were moved from public-api to viz. - Other changes to have tests pass: - Added `metrics-api` to list of docker images to build in actions workflows. - In `bin/fmt` exclude protobuf generated files instead of entire directories because directories could contain both generated and non-generated code (case in point: `viz/metrics-api`). * Add retry to 'tap API service is running' check * mc check shouldn't err when viz is not available. Also properly set the log in multicluster/cmd/root.go so that it properly displays messages when --verbose is used	2021-01-21 18:26:38 -05:00
Yashvardhan Kukreja	b67bbe157b	add jaeger check: to confirm whether the jaeger injector pod is in running state or not (#5528 ) Currently, the linkerd jaeger check runs multiple checks but it doesn't have a check to confirm the state of the jaeger injector to be running. This commit adds that required check to confirm the running state of the jaeger injector pod. Fixes #5495 Signed-off-by: Yashvardhan Kukreja <yash.kukreja.98@gmail.com>	2021-01-19 08:35:16 +05:30
Tarun Pothulapati	0a2f1f3a26	viz: add check sub-command (#5496 ) * viz: add check sub-command This adds a new `viz check` cmd performing checks for the resources in linkerd-viz extension. Checks include resource checks and the health of resources, certs, etc Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2021-01-15 15:31:45 -05:00
Alejandro Pedraza	f3b1ebfa99	Separate observability API (#5510 ) * Separate observability API Closes #5312 This is a preliminary step towards moving all the observability API into `/viz`, by first moving its protobuf into `viz/metrics-api`. This should facilitate review as the go files are not moved yet, which will happen in a followup PR. There are no user-facing changes here. - Moved `proto/common/healthcheck.proto` to `viz/metrics-api/proto/healthcheck.prot` - Moved the contents of `proto/public.proto` to `viz/metrics-api/proto/viz.proto` except for the `Version` Stuff. - Merged `proto/controller/tap.proto` into `viz/metrics-api/proto/viz.proto` - `grpc_server.go` now temporarily exposes `PublicAPIServer` and `VizAPIServer` interfaces to separate both APIs. This will get properly split in a followup. - The web server provides handlers for both interfaces. - `cli/cmd/public_api.go` and `pkg/healthcheck/healthcheck.go` temporarily now have methods to access both APIs. - Most of the CLI commands will use the Viz API, except for `version`. The other changes in the go files are just changes in the imports to point to the new protobufs. Other minor changes: - Removed `git add controller/gen` from `bin/protoc-go.sh`	2021-01-13 14:34:54 -05:00
Tarun Pothulapati	ff841d54fc	viz: add a retry check for core control-plane pods before install (#5434 ) * viz: add a retry check for core control-plane pods before install This commit adds a new check so that `viz install` waits till the control-plane pods are up. For this to work, the `prometheus` sub-system check in control-plane self-check has been removed, as we re-use healthchecks to perform this. Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2021-01-07 23:52:09 +05:30
Tarun Pothulapati	68c02d82d1	healthcheck: simplify Checker construction with a builder (#5475 ) Currently, each new instance of the `Checker` type has to set all of its fields manually through `NewChecker()`, even though most use cases are fine with the defaults. This branch simplifies this by using the Builder pattern, so that users of `Checker` can override the defaults through specific field methods when needed, which simplifies the code. This also removes some of the methods that were specific to tests, and replaces them with the currently used ones. Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2021-01-06 14:32:39 -08:00
Naga Venkata Pradeep Namburi	df84a08ac8	Fix typo in healthcheck error message (#5445 ) Fixes #5438 Signed-off-by: pradeepnnv <pradeepnnv@gmail.com>	2021-01-06 09:44:07 +05:30
Lutz Behnke	8d50631727	remove check comparing ca.crt field in identity issuer secret and trust anchors in config (#5424 ) Currently the CA bundles in the config value `global.IdentityTrustAnchorsPEM` must not contain more than one certificate when the schema type is set to `kubernetes.io/tls`, or the command `linkerd check` will fail. This change removes the comparison between the trust anchors configured in the linkerd config map and the contents of the `ca.crt` field of the identity issuer K8s secret. This is an alternative to MR #5396, which I will close as a result of the discussion with @adleong. Fixes #5292 Signed-off-by: Lutz Behnke <lutz.behnke@finleap.com>	2020-12-23 11:14:02 -08:00
Tarun Pothulapati	2087c95dd8	viz: move some components into linkerd-viz (#5340 ) * viz: move some components into linkerd-viz This branch moves the grafana,prometheus,web, tap components into a new viz chart, following the same extension model that multi-cluster and jaeger follow. The components in viz are not injected during install time, and will go through the injector. The `viz install` does not have any cli flags to customize the install directly but instead follow the Helm way of customization by using flags such as `set`, `set-string`, `values`, `set-files`. Changes Include - Move `grafana`, `prometheus`, `web`, `tap` templates into viz extension. - Remove all add-on related charts, logic and tests w.r.t CLI & Helm. - Clean up `linkerd2/values.go` & `linkerd2/values.yaml` to not contain fields related to viz components. - Update `linkerd check` Healthchecks to not check for viz components. - Create a new top level `viz` directory with CLI logic and Helm charts. - Clean fields in the `viz/Values.yaml` to be in the `<component>.<property>` model. Ex: `prometheus.resources`, `dashboard.image.tag`, etc so that it is consistent everywhere. Testing ```bash # Install the Core Linkerd Installation ./bin/linkerd install \| k apply -f - # Wait for the proxy-injector to be ready # Install the Viz Extension ./bin/linkerd cli viz install \| k apply -f - # Customized Install ./bin/linkerd cli viz install --set prometheus.enabled=false \| k apply -f - ``` What is not included in this PR: - Move of Controller from core install into the viz extension. - Simplification and refactoring of the core chart i.e removing `.global`, etc. Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2020-12-23 20:17:31 +05:30
Kevin Leimkuhler	f6c8d27d83	Add multicluster check command (#5410 ) ## What This change moves the `linkerd check --multicluster` functionality under its own multicluster subcommand: `linkerd multicluster check`. There should be no functional changes as a result of this change. `linkerd check` no longer checks for anything multicluster related and the `--multicluster` flag has been removed. ## Why Closes #5208 The bulk of these changes are moving all the multicluster checks from `pkg/healthcheck` into the multicluster package. Doing this completely separates it from core Linkerd. It still uses `pkg/healthcheck` when possible, but anything that is used only by `multicluster check` has been moved. Note that the `kubernetes-api` and `linkerd-existence` checks are run. These checks are required for setting up the Linkerd health checker. They set the health checker's `kubeAPI`, `linkerdConfig`, and `apiClient` fields. These could be set manually so that the only check the user sees is `linkerd-multicluster`, but I chose not to do this. If any of the setting functions errors, it would just tell the user to run `linkerd check` and ensure the installation is correct. I find the user error handling to be better by including these required checks since they should be run in the first place. ## How to test Installing Linkerd and multicluster should result in a basic check output: ``` $ bin/linkerd install | kubectl apply -f - .. $ bin/linkerd check .. $ bin/linkerd multicluster install | kubectl apply -f - .. 
$ bin/linkerd multicluster check kubernetes-api -------------- √ can initialize the client √ can query the Kubernetes API linkerd-existence ----------------- √ 'linkerd-config' config map exists √ heartbeat ServiceAccount exist √ control plane replica sets are ready √ no unschedulable pods √ controller pod is running √ can initialize the client √ can query the control plane API linkerd-multicluster -------------------- √ Link CRD exists Status check results are √ ``` After linking a cluster: ``` $ bin/linkerd multicluster check kubernetes-api -------------- √ can initialize the client √ can query the Kubernetes API linkerd-existence ----------------- √ 'linkerd-config' config map exists √ heartbeat ServiceAccount exist √ control plane replica sets are ready √ no unschedulable pods √ controller pod is running √ can initialize the client √ can query the control plane API linkerd-multicluster -------------------- √ Link CRD exists √ Link resources are valid * k3d-y √ remote cluster access credentials are valid * k3d-y √ clusters share trust anchors * k3d-y √ service mirror controller has required permissions * k3d-y √ service mirror controllers are running * k3d-y × all gateway mirrors are healthy probe-gateway-k3d-y.linkerd-multicluster mirrored from cluster [k3d-y] has no endpoints see https://linkerd.io/checks/#l5d-multicluster-gateways-endpoints for hints Status check results are × ``` Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>	2020-12-21 15:50:17 -05:00
Alejandro Pedraza	d661054795	Fix CLI install/upgrade overriding settings in HA (#5399 ) Fixes #5385 ## The problems - `linkerd install --ha` isn't honoring flags - `linkerd upgrade --ha` is overridding existing configs silently or failing with an error - Upgrading HA instances from before 2.9 to version 2.9.1 results in configs being overridden silently, or the upgrade fails with an error ## The cause The change in #5358 attempted to fix `linkerd install --ha` that was only applying some of the `values-ha.yaml` defaults, by calling `charts.NewValues(true)` and merging that with the values built from `values.yaml` overriden by the flags. It turns out the `charts.NewValues()` implementation was by itself merging against `values.yaml` and as a result any flag was getting overridden by its default. This also happened when doing `linkerd upgrade --ha` on an existing instance, which could result in silently overriding settings, or it could also fail loudly like for example when upgrading set up that has an external issuer (in this case the issuer cert won't be able to be read during upgrade and an error would occur as described in #5385). Finally, when doing `linkerd upgrade` (no --ha flag) on an HA install from before 2.9 results in configs getting overridden as well (silently or with an error) because in order to generate the `linkerd-config-overrides` secret, the original install flags are retrieved from `linkerd-config` via the `loadStoredValuesLegacy()` function which then effectively ends up performing a `linkerd upgrade` with all the flags used for `linkerd install` and falls into the same trap as above. ## The fix In `values.go` the faulting merging logic is not used anymore, so now `NewValues()` only returns the default values from `values.yaml` and doesn't require an argument anymore. It calls `readDefaults()` which now only returns the appropriate values depending on whether we're on HA or not. 
There's a new function `MergeHAValues()` that merges `values-ha.yaml` into the current values (it doesn't look into `values.yaml` anymore), which is only used when processing the `--ha` flag in `options.go`. ## How to test To replicate the issue, try setting a custom setting and check that it's not applied: ```bash linkerd install --ha --controller-log-level debug | grep log.level - -log-level=info ``` ## Followup This wasn't caught because we don't have HA integration tests. Now that our test infra is based on k3d, it should be easy to make such a test using a cluster with multiple nodes. Either that or issuing `linkerd install --ha` with additional configs and comparing against a golden file.	2020-12-18 12:11:52 -05:00
Tarun Pothulapati	589f36c4c2	jaeger: add check sub command (#5295 ) * jaeger: add check sub command This adds a new `linkerd jaeger check` command to have checks w.r.t jaeger extension. This is similar to that of the `linkerd check` cmd. As jaeger is a separate package, It was a bit complex for this to work as not all types and fields from healthcheck pkg are public, Helper funcs were used to mitigate this. This has the following changes: - Adds a new `check.go` file under the jaeger extension pkg - Moves some commonly needed funcs and types from `cli/cmd/check.go` and `pkg/healthcheck/health.go` into `pkg/healthcheck/healthcheck_output.go`. Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2020-12-17 00:26:34 +05:30
Alex Leong	cdc57d1af0	Use linkerd-jaeger extension for control plane tracing (#5299 ) Now that tracing has been split out of the main control plane and into the linkerd-jaeger extension, we remove references to tracing from the main control plane including: * removing the tracing components from the main control plane chart * removing the tracing injection logic from the main proxy injector and inject CLI (these will be added back into the new injector in the linkerd-jaeger extension) * removing tracing related checks (these will be added back into `linkerd jaeger check`) * removing related tests We also update the `--control-plane-tracing` flag to configure the control plane components to send traces to the linkerd-jaeger extension. To make sure this works even when the linkerd-jaeger extension is installed in a non-default namespace, we also add a `--control-plane-tracing-namespace` flag which can be used to change the namespace that the control plane components send traces to. Note that for now, only the control plane components send traces; the proxies in the control plane do not. This is because the linkerd-jaeger injector is not yet available. However, this change adds the appropriate namespace annotations to the control plane namespace to configure the proxies to send traces to the linkerd-jaeger extension once the linkerd-jaeger injector is available. I tested this by doing the following: 1. bin/linkerd install \| kubectl apply -f - 1. bin/helm install jaeger jaeger/charts/jaeger 1. bin/linkerd upgrade --control-plane-tracing=true \| kubectl apply -f - 1. kubectl -n linkerd-jaeger port-forward svc/jaeger 16686 1. open http://localhost:16686 1. see traces from the linkerd control plane Signed-off-by: Alex Leong <alex@buoyant.io>	2020-12-08 14:34:26 -08:00
Alejandro Pedraza	9cbfb08a38	Bump proxy-init to v1.3.8 (#5283 )	2020-11-27 09:07:34 -05:00
hodbn	92eb174e06	Add safe accessor for Global in linkerd-config (#5269 ) CLI crashes if linkerd-config contains unexpected values. Add a safe accessor that initializes an empty Global on the first access. Refactor all accesses to use the newly introduced accessor using gopls. Add test for linkerd-config data without Global. Fixes #5215 Co-authored-by: Itai Schwartz <yitai27@gmail.com> Signed-off-by: Hod Bin Noon <bin.noon.hod@gmail.com>	2020-11-23 12:45:58 -08:00
Tarun Pothulapati	b389054d53	cli: Don't check for SAN in root and intermediate certs (#5237 ) As discussed in #5228, it is not correct for root and intermediate certs to have SAN. This PR updates the check to not verify the intermediate issuer cert with the identity dns name (which checks with SAN and not CN, as the `verify` func is used to verify leaf certs and not root and intermediate certs). This PR also avoids setting a SAN field when generating certs in the `install` command. Fixes #5228	2020-11-18 15:30:39 -08:00
Alejandro Pedraza	5a707323e6	Update proxy-init to v1.3.7 (#5221 ) This upgrades both the proxy-init image itself, and the go dependency on proxy-init as a library, which fixes CNI in k3s and any host using binaries coming from BusyBox, where `nsenter` has an issue parsing arguments (see rancher/k3s#1434).	2020-11-13 15:59:14 -05:00
Tarun Pothulapati	262d5e041c	charts: Do not store .component in linkerd-config (#5144 ) * charts: Do not store .component in linkerd-config This removes the `.component` fields from `Values.go` and also prevents them from being emitted into `linkerd-config` by attaching them into a temporary variable during injection. This also simplifies inbound and outbound Skip ports helm logic and adds quotes to them. Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2020-11-02 20:41:37 +05:30
Alex Leong	da194f5dc3	Warn when webhook certificates near expiry (#5155 ) Fixes #5149 Before: ``` linkerd-webhooks-and-apisvc-tls ------------------------------- × tap API server has valid cert certificate will expire on 2020-10-28T20:22:32Z see https://linkerd.io/checks/#l5d-tap-cert-valid for hints ``` After: ``` linkerd-webhooks-and-apisvc-tls ------------------------------- √ tap API server has valid cert ‼ tap API server cert is valid for at least 60 days certificate will expire on 2020-10-28T20:22:32Z see https://linkerd.io/checks/#l5d-webhook-cert-not-expiring-soon for hints √ proxy-injector webhook has valid cert ‼ proxy-injector cert is valid for at least 60 days certificate will expire on 2020-10-29T18:17:03Z see https://linkerd.io/checks/#l5d-webhook-cert-not-expiring-soon for hints √ sp-validator webhook has valid cert ‼ sp-validator cert is valid for at least 60 days certificate will expire on 2020-10-28T20:21:34Z see https://linkerd.io/checks/#l5d-webhook-cert-not-expiring-soon for hints ``` Signed-off-by: Alex Leong <alex@buoyant.io>	2020-10-30 11:48:51 -07:00
Tarun Pothulapati	4c106e9c08	cli: make check return SkipError when there is no prometheus configured (#5150 ) Fixes #5143 The availability of prometheus is useful for some calls in public-api that the check uses. This change updates the ListPods in public-api to still return the pods even when prometheus is not configured. For a test that exclusively checks for prometheus metrics, we have a gate which checks if a prometheus is configured and skips it otherwise. Signed-off-by: Tarun Pothulapati tarunpothulapati@outlook.com	2020-10-29 19:57:11 +05:30
Alejandro Pedraza	177669b377	Remove code refs to controllerImageVersion (#5119 ) Followup to #5100 We had both `controllerImageVersion` and `global.controllerImageVersion` configs, but only the latter was taken into account in the chart templates, so this change removes all of its references.	2020-10-21 13:40:25 -05:00
Oliver Gould	84b1a826bd	Replace global.proxy.destinationGetNetworks with global.clusterNetworks (#5110 ) There is no longer a proxy config `DESTINATION_GET_NETWORKS`. Instead of reflecting this implementation in our values.yaml, this changes this variable to the more general `clusterNetworks` to emphasize its similarity to `clusterDomain` for the purposes of discovery.	2020-10-20 19:05:31 -07:00
Alex Leong	9701f1944e	Stop rendering addon config (#5078 ) The linkerd-addon-config is no longer used and can be safely removed. Signed-off-by: Alex Leong <alex@buoyant.io>	2020-10-16 11:07:51 -07:00
Tarun Pothulapati	2a5e7dba62	Handle grafana add-on config repair (#5059 ) * Handle grafana add-on config repair Fixes #5014 In the Grafana add-on, default fields, i.e. `grafana.image.name` and `grafana.name`, have been removed from `linkerd-config-addons` after `2.8.1`. Only overridden values are stored in `linkerd-config-addons` as of now. Hence, `grafana.image.name` has to be removed from `linkerd-config-addons` unless it is overridden, so that updates to it can take place, especially the move from `gcr` to `ghcr`. This also removes the `grafana.name` field if it is set to the default, as it has been removed. This problem will not occur again even if we update default values, as default values are not stored in `linkerd-config-addons` anymore for all add-ons. Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2020-10-13 13:12:49 -07:00
Tarun Pothulapati	faf77798f0	Update check to use new linkerd-config.values (#5023 ) This branch updates the check functionality to read the new `linkerd-config.values` which contains the full Values struct showing the current state of the Linkerd installation. (being added in #5020 ) This is done by adding a new `FetchCurrentConfiguration` which first tries to get the latest and, if that fails, falls back to the older `linkerd-config` protobuf format. Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2020-10-01 11:19:25 -07:00
Lutz Behnke	de098cd52d	make api service secrets compatible to cert manager (#4737 ) Currently the secrets for the proxy-injector, sp-validator webhooks and tap API service are using the Opaque secret type and linkerd-specific field names. This makes it impossible to use cert-manager (https://github.com/jetstack/cert-manager) to provision and rotate the secrets for these services. This change converts the secrets defined in the linkerd2 helm charts and the controller to use the kubernetes.io/tls format instead. This format is used for secrets generated by cert-manager. Signed-off-by: Lutz Behnke <lutz.behnke@finleap.com>	2020-09-29 09:17:09 -05:00
Tarun Pothulapati	d0caaa86c4	Bump k8s client-go to v0.19.2 (#5002 ) Fixes #4191 #4993 This bumps Kubernetes client-go to the latest v0.19.2 (We had to switch directly to 1.19 because of this issue). Bumping to v0.19.2 required upgrading to smi-sdk-go v0.4.1. This also depends on linkerd/stern#5 This consists of the following changes: - Fix ./bin/update-codegen.sh by adding the template path to the gen commands, as it is needed after we moved to GOMOD. - Bump all k8s related dependencies to v0.19.2 - Generate CRD types, client code using the latest k8s.io/code-generator - Use context.Context as the first argument, in all code paths that touch the k8s client-go interface Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2020-09-28 12:45:18 -05:00
Tarun Pothulapati	ecce5b91f6	tests: Add Calico CNI deep integration tests (#4952 ) * tests: Add new CNI deep integration tests Fixes #3944 This PR adds a new test, called cni-calico-deep which installs the Linkerd CNI plugin on top of a cluster with Calico and performs the current integration tests on top, thus validating various Linkerd features when CNI is enabled. For Calico to work, special config is required for kind which is at `cni-calico.yaml` This is different from the CNI integration tests that we run in cloud integration which performs the CNI level integration tests. Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2020-09-23 19:58:28 +05:30
Tarun Pothulapati	f75b9fe374	tracing: Move default values into addon-chart (#4951 ) * tracing: Move default values into chart This branch updates the tracing add-on's values into their own chart's values.yaml (just like grafana and prometheus). This prevents them from being saved into `linkerd-config-addons` where only the overridden values are stored. Thus allowing us to change the defaults. This also - Updates the check command to fall back to default values, if there are no overridden name fields. - Updates jaeger to `1.19.2` Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2020-09-15 15:19:25 -05:00
Alejandro Pedraza	ccf027c051	Push docker images to ghcr.io instead of gcr.io (#4953 ) * Push docker images to ghcr.io instead of gcr.io The `cloud_integration.yml` and `release.yml` workflows were modified to log into ghcr.io, and remove the `Configure gcloud` step which is no longer necessary. Note that besides the changes to cloud_integration.yml and release.yml, there was a change to the upgrade-stable integration test so that we do linkerd upgrade --addon-overwrite to reset the addons settings because in stable-2.8.1 the Grafana image was pegged to gcr.io/linkerd-io/grafana in linkerd-config-addons. This will need to be mentioned in the 2.9 upgrade notes. Also the egress integration test has a debug container that now is pegged to the edge-20.9.2 tag. Besides that, the other changes are just a global search and replace (s/gcr.io\/linkerd-io/ghcr.io\/linkerd/).	2020-09-10 15:16:24 -05:00
Zahari Dichev	084bb678c7	Perform TLS checks on injector, sp validator and tap (#4924 ) * Check sp-validator,proxy-injector and tap certs Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>	2020-09-10 11:21:23 -05:00
Alex Leong	33ddd4e357	Use correct component name in multicluster checks (#4921 ) The multicluster checks make sure that the correct resources exist for each service mirror controller. When looking up these resources, it uses the `linkerd.io/control-plane-component=linkerd-service-mirror` label selector. However, these resources have the label `linkerd.io/control-plane-component=service-mirror`. This causes the resource lookup to fail to find the resource and the check spuriously fails. ``` × service mirror controller has required permissions missing ServiceAccounts: linkerd-service-mirror-self missing ClusterRoles: linkerd-service-mirror-access-local-resources-self missing ClusterRoleBindings: linkerd-service-mirror-access-local-resources-self missing Roles: linkerd-service-mirror-read-remote-creds-self missing RoleBindings: linkerd-service-mirror-read-remote-creds-self see https://linkerd.io/checks/#l5d-multicluster-source-rbac-correct for hints \| * no service mirror controller deployment for Link self ``` Instead, use the correct label selector when looking up these resources. Signed-off-by: Alex Leong <alex@buoyant.io>	2020-08-31 13:40:53 -07:00
Tarun Pothulapati	c9c5d97405	Remove SMI-Metrics charts and commands (#4843 ) Fixes #4790 This PR removes both the SMI-Metrics templates along with the experimental sub-commands. This also removes pkg `smi-metrics` as there is no direct use of it without the commands. Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2020-08-24 14:35:33 -07:00
Zahari Dichev	c25f0a3af5	Triger kube-system HA check based on webhook failure policy (#4861 ) This PR changes the HA check that verifies that the `config.linkerd.io/admission-webhooks=disabled` is present on kube-system to be enabled only when the failure policy for the proxy injector webhook is set to `Fail`. This allows users to skip this check in cases when the label is removed because the namespace is managed by the cloud provider like in the case described in #4754 Fix #4754 Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>	2020-08-17 13:56:03 +03:00
Josh Soref	72aadb540f	Spelling (#4872 ) This PR corrects misspellings identified by the [check-spelling action](https://github.com/marketplace/actions/check-spelling). The misspellings have been reported at `aaf440489e (commitcomment-41423663)` The action reports that the changes in this PR would make it happy: `5b82c6c5ca` Note: this PR does not include the action. If you're interested in running a spell check on every PR and push, that can be offered separately. Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>	2020-08-12 21:59:50 -07:00
Alejandro Pedraza	4876a94ed0	Update proxy-init version to v1.3.6 (#4850 ) Supersedes #4846 Bump proxy-init to v1.3.6, containing CNI fixes and support for multi-arch builds. #4846 included this in v1.3.5 but proxy.golang.org refused to update the modified SHA	2020-08-11 11:54:00 -05:00
Tarun Pothulapati	7e5804d1cf	grafana: move default values into values file (#4755 ) This PR moves default values into add-on specific values.yaml thus allowing us to update default values as they would not be present in linkerd-config-addons cm. Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2020-08-06 13:57:28 -07:00
Alex Leong	024a35a3d3	Move multicluster API connectivity checks earlier (#4819 ) Fixes #4774 When a service mirror controller is unable to connect to the target cluster's API, the service mirror controller crashes with the error that it has failed to sync caches. This error lacks the necessary detail to debug the situation. Unfortunately, client-go does not surface more useful information about why the caches failed to sync. To make this more debuggable we do a couple things: 1. When creating the target cluster api client, we eagerly issue a server version check to test the connection. If the connection fails, the service-mirror-controller logs now look like this: ``` time="2020-07-30T23:53:31Z" level=info msg="Got updated link broken: {Name:broken Namespace:linkerd-multicluster TargetClusterName:broken TargetClusterDomain:cluster.local TargetClusterLinkerdNamespace:linkerd ClusterCredentialsSecret:cluster-credentials-broken GatewayAddress:35.230.81.215 GatewayPort:4143 GatewayIdentity:linkerd-gateway.linkerd-multicluster.serviceaccount.identity.linkerd.cluster.local ProbeSpec:ProbeSpec: {path: /health, port: 4181, period: 3s} Selector:{MatchLabels:map[] MatchExpressions:[{Key:mirror.linkerd.io/exported Operator:Exists Values:[]}]}}" time="2020-07-30T23:54:01Z" level=error msg="Unable to create cluster watcher: cannot connect to api for target cluster remote: Get \"https://36.199.152.138/version?timeout=32s\": dial tcp 36.199.152.138:443: i/o timeout" ``` This error also no longer causes the service mirror controller to crash. Updating the Link resource will cause the service mirror controller to reload the credentials and try again. 2. We rearrange the checks in `linkerd check --multicluster` to perform the target API connectivity checks before the service mirror controller checks. This means that we can validate the target cluster API connection even if the service mirror controller is not healthy. 
We also add a server version check here to quickly determine if the connection is healthy. Sample check output: ``` linkerd-multicluster -------------------- √ Link CRD exists √ Link resources are valid * broken W0730 16:52:05.620806 36735 transport.go:243] Unable to cancel request for promhttp.RoundTripperFunc × remote cluster access credentials are valid * failed to connect to API for cluster: [broken]: Get "https://36.199.152.138/version?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) see https://linkerd.io/checks/#l5d-smc-target-clusters-access for hints W0730 16:52:35.645499 36735 transport.go:243] Unable to cancel request for promhttp.RoundTripperFunc × clusters share trust anchors Problematic clusters: * broken: unable to fetch anchors: Get "https://36.199.152.138/api/v1/namespaces/linkerd/configmaps/linkerd-config?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) see https://linkerd.io/checks/#l5d-multicluster-clusters-share-anchors for hints √ service mirror controller has required permissions * broken √ service mirror controllers are running * broken × all gateway mirrors are healthy wrong number of (0) gateway metrics entries for probe-gateway-broken.linkerd-multicluster see https://linkerd.io/checks/#l5d-multicluster-gateways-endpoints for hints √ all mirror services have endpoints ‼ all mirror services are part of a Link mirror service voting-svc-gke.emojivoto is not part of any Link see https://linkerd.io/checks/#l5d-multicluster-orphaned-services for hints ``` Some logs from the underlying go network libraries sneak into the output which is kinda gross but I don't think it interferes too much with being able to understand what's going on. Signed-off-by: Alex Leong <alex@buoyant.io>	2020-08-05 11:48:23 -07:00
cpretzer	670caaf8ff	Update to proxy-init v1.3.4 (#4815 ) Signed-off-by: Charles Pretzer <charles@buoyant.io>	2020-07-30 15:58:58 -05:00
Alex Leong	a1543b33e3	Add support for service-mirror selectors (#4795 ) * Add selector support Signed-off-by: Alex Leong <alex@buoyant.io> * Removed unused labels Signed-off-by: Alex Leong <alex@buoyant.io>	2020-07-30 10:07:14 -07:00
Alejandro Pedraza	2aea2221ed	Fixed `linkerd check` not finding Prometheus (#4797 ) * Fixed `linkerd check` not finding Prometheus ## The Problem `linkerd check` run right after install is failing because it can't find the Prometheus Pod. ## The Cause The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready. Since #4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries. ## The Fix The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind. This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same. It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.	2020-07-27 11:54:03 -05:00
Alex Leong	d540e16c8b	Make service mirror controller per target cluster (#4710 ) This PR moves the service mirror controller from `linkerd mc install` to `linkerd mc link`, as described in https://github.com/linkerd/rfc/pull/31. For fuller context, please see that RFC. Basic multicluster functionality works here including: * `linkerd mc install` installs the Link CRD but not any service mirror controllers * `linkerd mc link` creates a Link resource and installs a service mirror controller which uses that Link * The service mirror controller creates and manages mirror services, a gateway mirror, and their endpoints. * The `linkerd mc gateways` command lists all linked target clusters, their liveliness, and probe latencies. * The `linkerd check` multicluster checks have been updated for the new architecture. Several checks have been rendered obsolete by the new architecture and have been removed. The following are known issues requiring further work: * the service mirror controller uses the existing `mirror.linkerd.io/gateway-name` and `mirror.linkerd.io/gateway-ns` annotations to select which services to mirror. It does not yet support configuring a label selector. * an unlink command is needed for removing multicluster links: see https://github.com/linkerd/linkerd2/issues/4707 * an mc uninstall command is needed for uninstalling the multicluster addon: see https://github.com/linkerd/linkerd2/issues/4708 Signed-off-by: Alex Leong <alex@buoyant.io>	2020-07-23 14:32:50 -07:00
Tarun Pothulapati	986e0d4627	prometheus: add add-on checks (#4756 ) As linkerd-prometheus is optional now, the checks are also separated and should only work when the prometheus add-on is installed. This is done by re-using the add-on check code.	2020-07-23 18:03:24 +05:30
Tarun Pothulapati	b7e9507174	Remove/Relax prometheus related checks (#4724 ) * Removes/Relaxes prometheus related checks Now that prometheus is an add-on, there can be cases where prometheus is disabled, in which case the check should show a warning but not fail. This decouples the tight dependency. This changes the following checks: - Removes serviceAccount and pod checks in the CLI. - Relaxes `linkerd-api` checks to only check for prometheus access when the URL is not empty. This should work seamlessly with external prometheus as that URL will be passed and it performs the same check. Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>	2020-07-20 14:24:00 -07:00
Tarun Pothulapati	2a099cb496	Move Prometheus as an Add-On (#4362 ) This moves Prometheus to an add-on, thus making it optional but enabled by default. This also makes `linkerd-prometheus` more configurable, and allows it to have its own life-cycle for upgrades, configuration, etc. This work will be followed by documentation that helps users configure an existing Prometheus to work with Linkerd. Changes Include: - moving prometheus manifests into a separate chart at `charts/add-ons/prometheus`, and adding it as a dependency to `linkerd2` - implement the `addOn` interface to support the same with CLI. - include configuration in `linkerd-config-addons` User Facing Changes: The default install experience does not change much, but users who have already configured Prometheus differently would need to apply the same using the new configuration fields present in the chart README	2020-07-09 23:29:03 +05:30
Zahari Dichev	73010149ce	Do not treat evicted pods as failed in healthchecks (#4732 ) When a k8s pod is evicted its Phase is set to Failed and the reason is set to Evicted. Because in the ListPods method of the public API we only transmit the phase and treat it as Status, the healthchecks assume such evicted data plane pods to be failed. Since this check is retryable, the result is that linkerd check --proxy appears to hang when there are evicted pods. As @adleong correctly pointed out here, the presence of evicted pods is not something that should make the checks fail. This change modifies the public API to set the Pod.Status to "Evicted" for evicted pods. The healthchecks are also modified to not treat evicted pods as error cases. Fix #4690 Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>	2020-07-09 14:22:27 +03:00

203 Commits