Commit Graph

181 Commits

Author SHA1 Message Date
Alex Leong da194f5dc3
Warn when webhook certificates near expiry (#5155)
Fixes #5149 

Before:

```
linkerd-webhooks-and-apisvc-tls
-------------------------------
× tap API server has valid cert
    certificate will expire on 2020-10-28T20:22:32Z
    see https://linkerd.io/checks/#l5d-tap-cert-valid for hints
```

After:

```
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ tap API server has valid cert
‼ tap API server cert is valid for at least 60 days
    certificate will expire on 2020-10-28T20:22:32Z
    see https://linkerd.io/checks/#l5d-webhook-cert-not-expiring-soon for hints
√ proxy-injector webhook has valid cert
‼ proxy-injector cert is valid for at least 60 days
    certificate will expire on 2020-10-29T18:17:03Z
    see https://linkerd.io/checks/#l5d-webhook-cert-not-expiring-soon for hints
√ sp-validator webhook has valid cert
‼ sp-validator cert is valid for at least 60 days
    certificate will expire on 2020-10-28T20:21:34Z
    see https://linkerd.io/checks/#l5d-webhook-cert-not-expiring-soon for hints
```

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-10-30 11:48:51 -07:00
Tarun Pothulapati 4c106e9c08
cli: make check return SkipError when there is no prometheus configured (#5150)
Fixes #5143

The availability of prometheus is useful for some calls in public-api
that the check uses. This change updates the ListPods in public-api
to still return the pods even when prometheus is not configured.

For a test that exclusively checks for prometheus metrics, we have a gate
which checks if a prometheus is configured and skips it othervise.

Signed-off-by: Tarun Pothulapati tarunpothulapati@outlook.com
2020-10-29 19:57:11 +05:30
Alejandro Pedraza 177669b377
Remove code refs to controllerImageVersion (#5119)
Followup to #5100

We had both `controllerImageVersion` and `global.controllerImageVersion`
configs, but only the latter was taken into account in the chart
templates, so this change removes all of its references.
2020-10-21 13:40:25 -05:00
Oliver Gould 84b1a826bd
Replace global.proxy.destinationGetNetworks with global.clusterNetworks (#5110)
There is no longer a proxy config `DESTINATION_GET_NETWORKS`. Instead of
reflecting this implementation in our values.yaml, this changes this
variable to the more general `clusterNetworks` to emphasize its
similarity to `clusterDomain` for the purposes of discovery.
2020-10-20 19:05:31 -07:00
Alex Leong 9701f1944e
Stop rendering addon config (#5078)
The linkerd-addon-config is no longer used and can be safely removed.

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-10-16 11:07:51 -07:00
Tarun Pothulapati 2a5e7dba62
Handle grafana add-on config repair (#5059)
* Handle grafana add-on config repair

Fixes #5014

In Grafana Add-On, Default fields i.e `grafana.image.name`, `grafana.name`
have been removed from `linkerd-config-addons` after `2.8.1`. Only
overriden values are stored in `linkerd-config-addons` as of now.
Hence, `grafana.image.name` has to be removed from
`linkerd-config-addons` unless they are overriden so that updates
to it can take place especially the move from `gcr` to `ghcr`.

This also removes `grafana.name` field if they are set to default, as
its removed.

This problem will not occur again even if we update default values, as
default values are not stored in `linekrd-config-addons` anymore for all
add-ons.

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2020-10-13 13:12:49 -07:00
Tarun Pothulapati faf77798f0
Update check to use new linkerd-config.values (#5023)
This branch updates the check functionality to read
the new `linkerd-config.values` which contains the full
Values struct showing the current state of the Linkerd
installation. (being added in #5020 )

This is done by adding a new `FetchCurrentConfiguraiton`
which first tries to get the latest, if not falls back
to the older `linkerd-config` protobuf format.`

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2020-10-01 11:19:25 -07:00
Lutz Behnke de098cd52d
make api service secrets compatible to cert manager (#4737)
Currently the secrets for the proxy-injector, sp-validator webhooks and tap API service are using the Opaque secret type and linkerd-specific field names. This makes it impossible to use cert-manager (https://github.com/jetstack/cert-manager) to provisions and rotate the secrets for these services. This change converts the secrets defined in the linkerd2 helm charts and the controller use the kubernetes.io/tls format instead. This format is used for secrets containing the generated secrets by cert-manager.

Signed-off-by: Lutz Behnke <lutz.behnke@finleap.com>
2020-09-29 09:17:09 -05:00
Tarun Pothulapati d0caaa86c4
Bump k8s client-go to v0.19.2 (#5002)
Fixes #4191 #4993

This bumps Kubernetes client-go to the latest v0.19.2 (We had to switch directly to 1.19 because of this issue). Bumping to v0.19.2 required upgrading to smi-sdk-go v0.4.1. This also depends on linkerd/stern#5

This consists of the following changes:

- Fix ./bin/update-codegen.sh by adding the template path to the gen commands, as it is needed after we moved to GOMOD.
- Bump all k8s related dependencies to v0.19.2
- Generate CRD types, client code using the latest k8s.io/code-generator
- Use context.Context as the first argument, in all code paths that touch the k8s client-go interface

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2020-09-28 12:45:18 -05:00
Tarun Pothulapati ecce5b91f6
tests: Add Calico CNI deep integration tests (#4952)
* tests: Add new CNI deep integration tests

Fixes #3944

This PR adds a new test, called cni-calico-deep which installs the Linkerd CNI
plugin on top of a cluster with Calico and performs the current integration tests on top, thus
validating various Linkerd features when CNI is enabled. For Calico
to work, special config is required for kind which is at `cni-calico.yaml`

This is different from the CNI integration tests that we run in
cloud integration which performs the CNI level integration tests.

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2020-09-23 19:58:28 +05:30
Tarun Pothulapati f75b9fe374
tracing: Move default values into addon-chart (#4951)
* tracing: Move default values into chart

This branch updates the tracing add-on's values into their own chart's values.yaml
(just like grafana and prometheus). This prevents them from being saved into
`linkerd-config-addons` where only the overridden values are stored. Thus allowing
us to change the defaults.

This also
-  Updates the check command to fall back to default values, if there are no
overridden name fields.
- Updates jaeger to `1.19.2`

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2020-09-15 15:19:25 -05:00
Alejandro Pedraza ccf027c051
Push docker images to ghcr.io instead of gcr.io (#4953)
* Push docker images to ghcr.io instead of gcr.io

The `cloud_integration.yml` and `release.yml` workflows were modified to
log into ghcr.io, and remove the `Configure gcloud` step which is no
longer necessary.

Note that besides the changes to cloud_integration.yml and release.yml, there was a change to the upgrade-stable integration test so that we do linkerd upgrade --addon-overwrite to reset the addons settings because in stable-2.8.1 the Grafana image was pegged to gcr.io/linkerd-io/grafana in linkerd-config-addons. This will need to be mentioned in the 2.9 upgrade notes.

Also the egress integration test has a debug container that now is pegged to the edge-20.9.2 tag.

Besides that, the other changes are just a global search and replace (s/gcr.io\/linkerd-io/ghcr.io\/linkerd/).
2020-09-10 15:16:24 -05:00
Zahari Dichev 084bb678c7
Perform TLS checks on injector, sp validator and tap (#4924)
* Check sp-validator,proxy-injector and tap certs

Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
2020-09-10 11:21:23 -05:00
Alex Leong 33ddd4e357
Use correct component name in multicluster checks (#4921)
The multicluster checks make sure that the correct resources exist for each service mirror controller.  When looking up these resources, it uses the `linkerd.io/control-plane-component=linkerd-service-mirror` label selector.  However, these resources have the label `linkerd.io/control-plane-component=service-mirror`.  This causes the resource lookup to fail to find the resource and the check spuriously fails.

```
× service mirror controller has required permissions
    missing ServiceAccounts: linkerd-service-mirror-self
missing ClusterRoles: linkerd-service-mirror-access-local-resources-self
missing ClusterRoleBindings: linkerd-service-mirror-access-local-resources-self
missing Roles: linkerd-service-mirror-read-remote-creds-self
missing RoleBindings: linkerd-service-mirror-read-remote-creds-self
    see https://linkerd.io/checks/#l5d-multicluster-source-rbac-correct for hints
|         * no service mirror controller deployment for Link self
```

Instead, use the correct label selector when looking up these resources.

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-08-31 13:40:53 -07:00
Tarun Pothulapati c9c5d97405
Remove SMI-Metrics charts and commands (#4843)
Fixes #4790

This PR removes both the SMI-Metrics templates along with the
experimental sub-commands. This also removes pkg `smi-metrics`
as there is no direct use of it without the commands.

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2020-08-24 14:35:33 -07:00
Zahari Dichev c25f0a3af5
Triger kube-system HA check based on webhook failure policy (#4861)
This PR changes the HA check that verifies that the `config.linkerd.io/admission-webhooks=disabled` is present on kube-system to be enabled only when the failure policy for the proxy injector webhook is set to `Fail`. This allows users to skip this check in cases when the label is removed because the namespace is managed by the cloud provider like in the case described in #4754

Fix #4754

Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
2020-08-17 13:56:03 +03:00
Josh Soref 72aadb540f
Spelling (#4872)
This PR corrects misspellings identified by the [check-spelling action](https://github.com/marketplace/actions/check-spelling).

The misspellings have been reported at aaf440489e (commitcomment-41423663)

The action reports that the changes in this PR would make it happy: 5b82c6c5ca

Note: this PR does not include the action. If you're interested in running a spell check on every PR and push, that can be offered separately.

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>
2020-08-12 21:59:50 -07:00
Alejandro Pedraza 4876a94ed0
Update proxy-init version to v1.3.6 (#4850)
Supersedes #4846

Bump proxy-init to v1.3.6, containing CNI fixes and support for
multi-arch builds.
#4846 included this in v1.3.5 but proxy.golang.org refused to update the
modified SHA
2020-08-11 11:54:00 -05:00
Tarun Pothulapati 7e5804d1cf
grafana: move default values into values file (#4755)
This PR moves default values into add-on specific values.yaml thus
allowing us to update default values as they would not be present in
linkerd-config-addons cm.

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2020-08-06 13:57:28 -07:00
Alex Leong 024a35a3d3
Move multicluster API connectivity checks earlier (#4819)
Fixes #4774

When a service mirror controller is unable to connect to the target cluster's API, the service mirror controller crashes with the error that it has failed to sync caches.  This error lacks the necessary detail to debug the situation.  Unfortunately, client-go does not surface more useful information about why the caches failed to sync.

To make this more debuggable we do a couple things:

1. When creating the target cluster api client, we eagerly issue a server version check to test the connection.  If the connection fails, the service-mirror-controller logs now look like this:

```
time="2020-07-30T23:53:31Z" level=info msg="Got updated link broken: {Name:broken Namespace:linkerd-multicluster TargetClusterName:broken TargetClusterDomain:cluster.local TargetClusterLinkerdNamespace:linkerd ClusterCredentialsSecret:cluster-credentials-broken GatewayAddress:35.230.81.215 GatewayPort:4143 GatewayIdentity:linkerd-gateway.linkerd-multicluster.serviceaccount.identity.linkerd.cluster.local ProbeSpec:ProbeSpec: {path: /health, port: 4181, period: 3s} Selector:{MatchLabels:map[] MatchExpressions:[{Key:mirror.linkerd.io/exported Operator:Exists Values:[]}]}}"
time="2020-07-30T23:54:01Z" level=error msg="Unable to create cluster watcher: cannot connect to api for target cluster remote: Get \"https://36.199.152.138/version?timeout=32s\": dial tcp 36.199.152.138:443: i/o timeout"
```

This error also no longer causes the service mirror controller to crash.  Updating the Link resource will cause the service mirror controller to reload the credentials and try again.

2. We rearrange the checks in `linkerd check --multicluster` to perform the target API connectivity checks before the service mirror controller checks.  This means that we can validate the target cluster API connection even if the service mirror controller is not healthy.  We also add a server version check here to quickly determine if the connection is healthy.  Sample check output:

```
linkerd-multicluster
--------------------
√ Link CRD exists
√ Link resources are valid
	* broken
W0730 16:52:05.620806   36735 transport.go:243] Unable to cancel request for promhttp.RoundTripperFunc
× remote cluster access credentials are valid
            * failed to connect to API for cluster: [broken]: Get "https://36.199.152.138/version?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    see https://linkerd.io/checks/#l5d-smc-target-clusters-access for hints

W0730 16:52:35.645499   36735 transport.go:243] Unable to cancel request for promhttp.RoundTripperFunc
× clusters share trust anchors
    Problematic clusters:
    * broken: unable to fetch anchors: Get "https://36.199.152.138/api/v1/namespaces/linkerd/configmaps/linkerd-config?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    see https://linkerd.io/checks/#l5d-multicluster-clusters-share-anchors for hints
√ service mirror controller has required permissions
	* broken
√ service mirror controllers are running
	* broken
× all gateway mirrors are healthy
        wrong number of (0) gateway metrics entries for probe-gateway-broken.linkerd-multicluster
    see https://linkerd.io/checks/#l5d-multicluster-gateways-endpoints for hints
√ all mirror services have endpoints
‼ all mirror services are part of a Link
        mirror service voting-svc-gke.emojivoto is not part of any Link
    see https://linkerd.io/checks/#l5d-multicluster-orphaned-services for hints
```

Some logs from the underlying go network libraries sneak into the output which is kinda gross but I don't think it interferes too much with being able to understand what's going on.

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-08-05 11:48:23 -07:00
cpretzer 670caaf8ff
Update to proxy-init v1.3.4 (#4815)
Signed-off-by: Charles Pretzer <charles@buoyant.io>
2020-07-30 15:58:58 -05:00
Alex Leong a1543b33e3
Add support for service-mirror selectors (#4795)
* Add selector support

Signed-off-by: Alex Leong <alex@buoyant.io>

* Removed unused labels

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-07-30 10:07:14 -07:00
Alejandro Pedraza 2aea2221ed
Fixed `linkerd check` not finding Prometheus (#4797)
* Fixed `linkerd check` not finding Prometheus

## The Problem

`linkerd check` run right after install is failing because it can't find the Prometheus Pod.

## The Cause

The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready.

Since #4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries.

## The Fix

The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind.

This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same

It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.
2020-07-27 11:54:03 -05:00
Alex Leong d540e16c8b
Make service mirror controller per target cluster (#4710)
This PR removes the service mirror controller from `linkerd mc install` to `linkerd mc link`, as described in https://github.com/linkerd/rfc/pull/31.  For fuller context, please see that RFC.

Basic multicluster functionality works here including:
* `linkerd mc install` installs the Link CRD but not any service mirror controllers
* `linkerd mc link` creates a Link resource and installs a service mirror controller which uses that Link
* The service mirror controller creates and manages mirror services, a gateway mirror, and their endpoints.
* The `linkerd mc gateways` command lists all linked target clusters, their liveliness, and probe latences.
* The `linkerd check` multicluster checks have been updated for the new architecture.  Several checks have been rendered obsolete by the new architecture and have been removed.

The following are known issues requiring further work:
* the service mirror controller uses the existing `mirror.linkerd.io/gateway-name` and `mirror.linkerd.io/gateway-ns` annotations to select which services to mirror.  it does not yet support configuring a label selector.
* an unlink command is needed for removing multicluster links: see https://github.com/linkerd/linkerd2/issues/4707
* an mc uninstall command is needed for uninstalling the multicluster addon: see https://github.com/linkerd/linkerd2/issues/4708

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-07-23 14:32:50 -07:00
Tarun Pothulapati 986e0d4627
prometheus: add add-on checks (#4756)
As linkerd-prometheus is optional now, the checks are also separated
and should only work when the prometheus add-on is installed.

This is done by re-using the add-on check code.
2020-07-23 18:03:24 +05:30
Tarun Pothulapati b7e9507174
Remove/Relax prometheus related checks (#4724)
* Removes/Relaxes prometheus related checks

Now that prometheus is an add-on, There can be cases where prometheus is
disabled at which the check should show a warning but not fail. This
decouples the tight depedency.

This changes the following checks:

- Removes serviceAccount and pod checks in the CLI.
- Relaxes `linkerd-api` checks to only check for prometheus access when
the URL is not empty. This should work seamlessly with external
prometheus as that URL will be passed and it performs the same
check.

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2020-07-20 14:24:00 -07:00
Tarun Pothulapati 2a099cb496
Move Prometheus as an Add-On (#4362)
This moves Prometheus as a add-on, thus making it optional but enabled by default. The also make `linkerd-prometheus` more configurable, and allow it to have its own life-cycle for upgrades, configuration, etc.

This work will be followed by documentation that help users configure existing Prometheus to work with Linkerd.

**Changes Include:**
- moving prometheus manifests into a separate chart at `charts/add-ons/prometheus`, and adding it as a dependency to `linkerd2`
- implement the `addOn` interface to support the same with CLI.
- include configuration in `linkerd-config-addons`

**User Facing Changes:**
The default install experience does not change much but for users who have already configured Prometheus differently, would need to apply the same using the new configuration fields present in chart README
2020-07-09 23:29:03 +05:30
Zahari Dichev 73010149ce
Do not treat evicted pods as failed in healthchecks (#4732)
When a k8s pod is evicted its Phase is set to Failed and the reason is set to Evicted. Because in the ListPods method of the public APi we only transmit the phase and treat it as Status, the healthchecks assume such evicted data plane pods to be failed. Since this check is retryable, the results is that linkerd check --proxy appears to hang when there are evicted pods. As @adleong correctly pointed out here, the presence of evicted pod is not something that we should make the checks fail.

This change modifies the publci api to set the Pod.Status to "Evicted" for evicted pods. The healtcheks are also modified to not treat evicted pods as error cases.

Fix #4690

Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
2020-07-09 14:22:27 +03:00
Suraj Deshmukh d7dbe9cbff
Fix spelling mistakes using codespell (#4700)
Using following command the wrong spelling were found and later on
fixed:

```
codespell --skip CHANGES.md,.git,go.sum,\
    controller/cmd/service-mirror/events_formatting.go,\
    controller/cmd/service-mirror/cluster_watcher_test_util.go,\
    SECURITY_AUDIT.pdf,.gcp.json.enc,web/app/img/favicon.png \
    --ignore-words-list=aks,uint,ans,files\' --check-filenames \
    --check-hidden
```

Signed-off-by: Suraj Deshmukh <surajd.service@gmail.com>
2020-07-07 17:07:22 -05:00
Zahari Dichev 5a2f326bb5
Surface scheduling errors on retry (#4683)
Currently linkerd check appears to hang on HA installations where there are pods that are unscheduable. In reality it is just wating on a condition that might never become true without showing any useful information (i.e. which pods are not scheduled). This change adds sets the `surfaceErrorOnRetry: true` so the user gets feedback wrt to what conditions are not met yet instead of simply being shown waiting for check to complete.

Fix #4680

Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
2020-06-30 18:14:21 +03:00
Zahari Dichev 51c48694d4
Make uncheduble pods check warning only (#4675)
Currently commands that need access to the public api are executing the `LinkerdControlPlaneExistenceChecks` This set of checks includes one that specifically checks that there is no unscheduable pods. In fact in order to run commands like stat and edge we do not need to meet that requirement.

This change relaxes all this by makind the no unschedulable pods a warning only check. Fixes #3940

Signed-off-by: Zahari Dichev zaharidichev@gmail.com
2020-06-30 16:55:17 +03:00
Zahari Dichev 7c98e89bdc
Make `service mirror controller is running check` retry (#4650)
This PR makes the service mirror controller is running retry on failure. This brings the check in line with the rest of the checks that verify that certain Linkerd components are running. It is especially useful in integration tests when we want to wait for the service mirror component to be initialized for a certain amount of time before we simply fail the linkerd check command

Fix #4642

Signed-off-by: Zahari Dichev zaharidichev@gmail.com
2020-06-22 20:33:43 +03:00
Tarun Pothulapati 4219955bdb
multicluster: checks for misconfigured mirror services (#4552)
Fixes #4541 

This PR adds the following checks
-  if a mirrored service has endpoints. (This includes gateway mirrors too).
-  if an exported service is referencing a gateway that does not exist.

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
Signed-off-by: Alex Leong <alex@buoyant.io>

Co-authored-by: Alex Leong <alex@buoyant.io>
2020-06-08 15:29:34 -07:00
Kevin Leimkuhler d7f84e6c7b
Change help text to use source/target terminology in service-mirror and healthchecks (#4524)
Change terminology from local/remote to source/target in service-mirror and
healthchecks help text.

This does not change any variable, function, struct, or field names since
testing is still improving

Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
2020-06-02 15:21:52 -04:00
Alex Leong 33bd81692a
Add list of successful gateways in multicluster check (#4516)
Fixes #4478 

We add some additional output text when the "all remote cluster gateways are alive" check succeeds to list the gateways that have been detected as alive.  In order to do this, we have added an `VerboseSuccess` error type.  Even though this type implements the `error` interface, it represents a success which contains additional information to be printed.

Sample output when dead gateways are detected:

```
[...]
√ service mirror controller can access remote clusters
× all remote cluster gateways are alive
    Some gateways are not alive:
	* cluster: [gke], gateway: [linkerd-multicluster/linkerd-gateway]
    see https://linkerd.io/checks/#l5d-multicluster-remote-gateways-alive for hints
√ clusters share trust anchors
```

Sample output when all gateways are alive:

```
[...]
√ service mirror controller can access remote clusters
√ all remote cluster gateways are alive
	* cluster: [gke], gateway: [linkerd-multicluster/linkerd-gateway]
√ clusters share trust anchors
```

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-06-01 13:57:13 -07:00
Alex Leong 16d2d4bf81
Add multicluster daisy chain check (#4483)
A mirror-service is one that has been created by the mirror service controller and resolves to a gateway in another cluster.  If a mirror service is exported (and thus mirrored into another cluster) this creates a "daisy chain" where requests can come in to the cluster through the local gateway and be immediately sent out of the cluster to a remote gateway.  If the remote gateway is in the source cluster, this can create an infinite loop.

Similarly, if an exported service routes to a mirror service by a traffic split, the same daisy chain effect occurs.

One example where this can come up is with multicluster fail-over.  If both clusters simultaneously fail-over even a portion of their traffic, a loop is created.

We add a check that detects either of the above conditions and warns of the existence of a daisy chain.

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-06-01 12:10:59 -07:00
Tarun Pothulapati a8158dbeac
Add HealthChecks for Tracing Add-On (#4407)
Adds health-checks for tracing add-on, along with a refactor to have safe casts.

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2020-05-26 22:10:23 +05:30
Tarun Pothulapati 555fb14403
separate multi-cluster checks and run after add-ons (#4468) 2020-05-26 12:07:03 +05:30
Alex Leong acacf2e023
Add --close-wait-timeout inject flag (#4409)
Depends on https://github.com/linkerd/linkerd2-proxy-init/pull/10

Fixes #4276 

We add a `--close-wait-timeout` inject flag which configures the proxy-init container to run with `privileged: true` and to set `nf_conntrack_tcp_timeout_close_wait`. 

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-05-21 14:14:14 -07:00
Zahari Dichev 3a3e407848
Tweak check hint anchors (#4449)
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
2020-05-20 23:17:51 +03:00
Zahari Dichev 31e33d18d3
Enable service mirroring to work in private networks (#4440)
This change creates a gateway proxy for every gateway. This enables the probe worker to leverage the destination service functionality in order to discover the identity of the gateway.

Fix #4411

Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
2020-05-20 19:48:36 +03:00
Zahari Dichev 6574f124a7
Restrict Service mirror RBACs (#4426)
This PR introduces a few changes that were requested after a bit of service mirror reviewing.

- we restrict the RBACs so the service mirror controller cannot read secrets in all namespaces but only in the one that it is installed in
- we unify the namespace namings so all multicluster resources are installedi n `linkerd-multicluster` on both clusters
- fixed checks to account for changes

Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
2020-05-20 17:08:01 +03:00
Tarun Pothulapati e91dbda287
Add health checks for grafana add-on (#4321)
* Add health checks for grafana add-on

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* update testCheck command and fixes

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* fix checkContainersRunnning function

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* linting fix

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* update test golden files

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* use hc.ControlPlanePods instead of k8s API

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* use hc.controlPLanePods directly

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* remove unnecessary comments

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* proper comments

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* update pod checks to use retries

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* add values key check

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2020-05-14 23:18:43 +05:30
Tarun Pothulapati 45ccc24a89
Move grafana templates into a separate sub-chart as a add-on (#4320)
* adds grafana manifests as a sub-chart

- moves grafana templates into its own chart
- implement add-on interface Grafana struct
- also add relevant conditions for grafana

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* remove redundant grafana fields in Values

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* update golden files

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* fix values issue

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* remove extra grafanaImage value

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* add add-on upgrade tests

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* fix golden file tests

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* add grafana field to linkerd-config-addons

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* Don't apply nil configuration

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* update golden files

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* make checks relaxed for grafana

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* update test to not test on grafana

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* update TestServiceAccountsMatch to contain extra members

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* replace map[string]interface{} with Grafana for better readability

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* update golden files

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2020-05-11 22:22:14 +05:30
Zahari Dichev 3008f1f87f
Add check for validating that remote clusters share the same trust an… (#4311)
Add check for validating that remote clusters share the same trust anchors

Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
2020-05-11 09:59:15 +03:00
Zahari Dichev 4e82ba8878
Multicluster checks (#4279)
Multicluster checks

Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
2020-05-05 10:19:38 +03:00
Alejandro Pedraza 2cd48bc488
Go test failure message wrappers to create GH Annotations (#4292)
* Go test failure message wrappers to create GH Annotations

First part of #4176

## Problem

Failures in go tests need to be properly formatted as Github annotations
so that we can fetch them through Github's API for aggregation and
analysis.

## Solution

A wrapper for error messages has been created in `testutil/annotations.go`.
The idea is that instead of throwing test failures like this:

```go
t.Failf("error retrieving data;\nExpected: %#v\nActual: %#v", expected,
actual)
```

We'd throw them like this:
```go
testutil.AnnotationFatalf("error retrieving data", "error retrieving data;\nExpected: %#v\nActual: %#v", expected,
actual)
```

That will continue reporting the error as before (when using `go test`
or another test runner), but as a side-effect it will also send to
stdout something like:

```
::error file=pkg/inject_test.go,line=133::error retrieving data
```
Which becomes a GH annotation, visible in the CI run summary screen.

The fist string art is used to have the GH annotation be a generic error message
that can be aggregated and counted across multiple test runs. If `testutil.Fatalf(str, args...)`
is called instead, the original error message will be used.

Note that that the output will be produced only when the env var
`GH_ANNOTATION` is set (which will when tests are triggered from a
Github Actions workflow).

Besides `testutil/annotation.go` and its accompanying unit test file,
other changes were made in other tests as examples, the plan being that
in a further PR _all_ the tests will use these wrappers.
2020-05-01 16:16:06 -05:00
Alex Leong e962bf1968
Improve proxy version diagnostics (#4244)
It can be difficult to know which versions of the proxy are running in your cluster, especially when you have pods running at multiple different proxy versions.

We add two pieces of CLI functionality to assist with this:

The `linkerd check --proxy` command will now list all data plane pods which are not up-to-date rather than just printing the first one it encounters:

```
‼ data plane is up-to-date
    Some data plane pods are not running the current version:
	* default/books-84958fff5-95j75 (git-ca760bdd)
	* default/authors-57c6dc9b47-djldq (git-ca760bdd)
	* default/traffic-85f58ccb66-vxr49 (git-ca760bdd)
	* default/release-name-smi-metrics-899c68958-5ctpz (git-ca760bdd)
	* default/webapp-6975dc796f-2ngh4 (git-ca760bdd)
	* default/webapp-6975dc796f-z4bc4 (git-ca760bdd)
	* emojivoto/voting-54ffc5787d-wj6cp (git-ca760bdd)
	* emojivoto/vote-bot-7b54d6999b-57srw (git-ca760bdd)
	* emojivoto/emoji-5cb99f85d8-5bhvm (git-ca760bdd)
	* emojivoto/web-7988674b8b-zfvvm (git-ca760bdd)
	* default/webapp-6975dc796f-d2fbc (git-ca760bdd)
	* default/curl (git-7f6bbc73)
    see https://linkerd.io/checks/#l5d-data-plane-version for hints
```

The `linkerd version` command now supports a `--proxy` flag which will list all proxy versions running in the cluster and the number of pods running each version:

```
linkerd version --proxy
Client version: dev-7b9d475f-alex
Server version: edge-20.4.1
Proxy versions:
	edge-20.4.1 (10 pods)
	git-ca760bdd (11 pods)
	git-7f6bbc73 (1 pods)
```

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-04-16 11:28:19 -07:00
Alex Leong 7b9d475ffc
Gate SMI-Metrics behind an install flag (#4240)
This change adds a `--smi-metrics` install flag which controls if the SMI-metrics controller and associated RBAC and APIService resources are installed.  The flag defaults to false and is hidden.

We plan to remove this flag or default it to true if and when the SMI-Metrics integration graduates from experimental.

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-04-09 14:34:08 -07:00
Alejandro Pedraza 573060bacc
New test for checking SA lists are synced (#4201)
Followup to #4193

This is to verify that the list of SA installed, as well as the list of
SA in the linkerd-psp RoleBinding match the list of expected SA defined
in `healthcheck.go`.
2020-03-26 12:54:31 -05:00