Fixes#4541
This PR adds the following checks
- if a mirrored service has endpoints. (This includes gateway mirrors too).
- if an exported service is referencing a gateway that does not exist.
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
Signed-off-by: Alex Leong <alex@buoyant.io>
Co-authored-by: Alex Leong <alex@buoyant.io>
This change modifies the linkerd-gateway component to use the inbound
proxy, rather than nginx, for gateway. This allows us to detect loops and
propagate identity through the gateway.
This change also cleans up port naming to `mc-gateway` and `mc-probe`
to resolve conflicts with Kubernetes validation.
---
* proxy: v2.99.0
The proxy can now operate as gateway, routing requests from its inbound
proxy to the outbound proxy, without passing the requests to a local
application. This supports Linkerd's multicluster feature by adding a
`Forwarded` header to propagate the original client identity and assist
in loop detection.
---
* Add loop detection to inbound & TCP forwarding (linkerd/linkerd2-proxy#527)
* Test loop detection (linkerd/linkerd2-proxy#532)
* fallback: Unwrap errors recursively (linkerd/linkerd2-proxy#534)
* app: Split inbound/outbound constructors into components (linkerd/linkerd2-proxy#533)
* Introduce a gateway between inbound and outbound (linkerd/linkerd2-proxy#540)
* gateway: Add a Forwarded header (linkerd/linkerd2-proxy#544)
* gateway: Return errors instead of responses (linkerd/linkerd2-proxy#547)
* Fail requests that loop through the gateway (linkerd/linkerd2-proxy#545)
* inject: Support config.linkerd.io/enable-gateway
This change introduces a new annotation,
config.linkerd.io/enable-gateway, that, when set, enables the proxy to
act as a gateway, routing all traffic targetting the inbound listener
through the outbound proxy.
This also removes the nginx default listener and gateway port of 4180,
instead using 4143 (the inbound port).
* proxy: v2.100.0
This change modifies the inbound gateway caching so that requests may be
routed to multiple leaves of a traffic split.
---
* inbound: Do not cache gateway services (linkerd/linkerd2-proxy#549)
Change terminology from local/remote to source/target in service-mirror and
healthchecks help text.
This does not change any variable, function, struct, or field names since
testing is still improving
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
There are a few notable things happening in this PR:
- the probe manager has been decoupled from the cluster_watcher. Now its only responsibility is to watch for mirrored gateways beeing created and to probe them. This means that probes are initiated for all gateways no matter whether there are mirrored services being paired
- the number of paired services is derived from the existing services in the cluster rather than being published as a metric by the prober
- there are no events being exchanged between the cluster watcher and the probe manager
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
Fixes#4478
We add some additional output text when the "all remote cluster gateways are alive" check succeeds to list the gateways that have been detected as alive. In order to do this, we have added an `VerboseSuccess` error type. Even though this type implements the `error` interface, it represents a success which contains additional information to be printed.
Sample output when dead gateways are detected:
```
[...]
√ service mirror controller can access remote clusters
× all remote cluster gateways are alive
Some gateways are not alive:
* cluster: [gke], gateway: [linkerd-multicluster/linkerd-gateway]
see https://linkerd.io/checks/#l5d-multicluster-remote-gateways-alive for hints
√ clusters share trust anchors
```
Sample output when all gateways are alive:
```
[...]
√ service mirror controller can access remote clusters
√ all remote cluster gateways are alive
* cluster: [gke], gateway: [linkerd-multicluster/linkerd-gateway]
√ clusters share trust anchors
```
Signed-off-by: Alex Leong <alex@buoyant.io>
A mirror-service is one that has been created by the mirror service controller and resolves to a gateway in another cluster. If a mirror service is exported (and thus mirrored into another cluster) this creates a "daisy chain" where requests can come in to the cluster through the local gateway and be immediately sent out of the cluster to a remote gateway. If the remote gateway is in the source cluster, this can create an infinite loop.
Similarly, if an exported service routes to a mirror service by a traffic split, the same daisy chain effect occurs.
One example where this can come up is with multicluster fail-over. If both clusters simultaneously fail-over even a portion of their traffic, a loop is created.
We add a check that detects either of the above conditions and warns of the existence of a daisy chain.
Signed-off-by: Alex Leong <alex@buoyant.io>
This change adds a `allow` and `link` commands, effectivelly enabling a cluster to have more than one set of credentials that allow it to be mirrored.
Fx #4461
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
Co-authored-by: Alex Leong <alex@buoyant.io>
THis PR addresses two problems:
- when a resync happens (or the mirror controller is restarted) we incorrectly classify the remote gateway as a mirrored service that is not mirrored anymore and we delete it
- when updating services due to a gateway update, we need to select only the services for the particular cluster
The latter fixes#4451
Depends on https://github.com/linkerd/linkerd2-proxy-init/pull/10Fixes#4276
We add a `--close-wait-timeout` inject flag which configures the proxy-init container to run with `privileged: true` and to set `nf_conntrack_tcp_timeout_close_wait`.
Signed-off-by: Alex Leong <alex@buoyant.io>
This change creates a gateway proxy for every gateway. This enables the probe worker to leverage the destination service functionality in order to discover the identity of the gateway.
Fix#4411
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
This PR introduces a few changes that were requested after a bit of service mirror reviewing.
- we restrict the RBACs so the service mirror controller cannot read secrets in all namespaces but only in the one that it is installed in
- we unify the namespace namings so all multicluster resources are installedi n `linkerd-multicluster` on both clusters
- fixed checks to account for changes
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
The Linkerd control plane components' admin servers have an idle connection timeout of 10 seconds. This means that they will close connections which have been idle for 10 seconds. These components are also configured with a 10 second period for liveness checks. This introduces a race condition where connections will be idle for approximately 10 seconds between liveness checks and can idle out, potentially causing the next liveness check to fail.
We remove the idle timeout so that the connection stays alive.
Certain install flags are intended to help with Linkerd development and generally are not useful (and are potentially confusing) to users.
We hide these flags in release (edge or stable) builds of the CLI but show them in all other builds. The list of affected flags is:
* control-plane-version
* proxy-image
* proxy-version
* image-pull-policy
* init-image
* init-image-version
Signed-off-by: Alex Leong <alex@buoyant.io>
This allows end user flexibility for options such as log format. Rather than bubbling up such possible config options into helm values, extra arguments provides more flexibility.
Add prometheusAlertmanagers value allows configuring a list of statically targetted alertmanager instances.
Use rule configmaps for prometheus rules. They take a list of {name,subPath,configMap} values and mounts them accordingly. Provided that subpaths end with _rules.yml or _rules.yaml they should be loaded by prometheus as per prometheus.yml's rule_files content.
Signed-off-by: Naseem <naseem@transit.app>
* Go test failure message wrappers to create GH Annotations
First part of #4176
## Problem
Failures in go tests need to be properly formatted as Github annotations
so that we can fetch them through Github's API for aggregation and
analysis.
## Solution
A wrapper for error messages has been created in `testutil/annotations.go`.
The idea is that instead of throwing test failures like this:
```go
t.Failf("error retrieving data;\nExpected: %#v\nActual: %#v", expected,
actual)
```
We'd throw them like this:
```go
testutil.AnnotationFatalf("error retrieving data", "error retrieving data;\nExpected: %#v\nActual: %#v", expected,
actual)
```
That will continue reporting the error as before (when using `go test`
or another test runner), but as a side-effect it will also send to
stdout something like:
```
::error file=pkg/inject_test.go,line=133::error retrieving data
```
Which becomes a GH annotation, visible in the CI run summary screen.
The fist string art is used to have the GH annotation be a generic error message
that can be aggregated and counted across multiple test runs. If `testutil.Fatalf(str, args...)`
is called instead, the original error message will be used.
Note that that the output will be produced only when the env var
`GH_ANNOTATION` is set (which will when tests are triggered from a
Github Actions workflow).
Besides `testutil/annotation.go` and its accompanying unit test file,
other changes were made in other tests as examples, the plan being that
in a further PR _all_ the tests will use these wrappers.
* Support Multi-stage install with Add-Ons
* add upgrade tests for add-ons
* add multi stage upgrade unit tests
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
* use downward API to mount labels to the proxy container as a volume
* add namespace as a label to the pod
* add a trace inject test
* add downwardAPi for controlplaneTracing
* add controlPlaneTracing condition to volumeMounts
* update add-ons to have workload-ns
* add workload-ns label to control-plane components
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
Upgrade Linkerd's base docker image to use go 1.14.2 in order to stay modern.
The only code change required was to update a test which was checking the error message of a `crypto/x509.CertificateInvalidError`. The error message of this error changed between go versions. We update the test to not check for the specific error string so that this test passes regardless of go version.
Signed-off-by: Alex Leong <alex@buoyant.io>
It can be difficult to know which versions of the proxy are running in your cluster, especially when you have pods running at multiple different proxy versions.
We add two pieces of CLI functionality to assist with this:
The `linkerd check --proxy` command will now list all data plane pods which are not up-to-date rather than just printing the first one it encounters:
```
‼ data plane is up-to-date
Some data plane pods are not running the current version:
* default/books-84958fff5-95j75 (git-ca760bdd)
* default/authors-57c6dc9b47-djldq (git-ca760bdd)
* default/traffic-85f58ccb66-vxr49 (git-ca760bdd)
* default/release-name-smi-metrics-899c68958-5ctpz (git-ca760bdd)
* default/webapp-6975dc796f-2ngh4 (git-ca760bdd)
* default/webapp-6975dc796f-z4bc4 (git-ca760bdd)
* emojivoto/voting-54ffc5787d-wj6cp (git-ca760bdd)
* emojivoto/vote-bot-7b54d6999b-57srw (git-ca760bdd)
* emojivoto/emoji-5cb99f85d8-5bhvm (git-ca760bdd)
* emojivoto/web-7988674b8b-zfvvm (git-ca760bdd)
* default/webapp-6975dc796f-d2fbc (git-ca760bdd)
* default/curl (git-7f6bbc73)
see https://linkerd.io/checks/#l5d-data-plane-version for hints
```
The `linkerd version` command now supports a `--proxy` flag which will list all proxy versions running in the cluster and the number of pods running each version:
```
linkerd version --proxy
Client version: dev-7b9d475f-alex
Server version: edge-20.4.1
Proxy versions:
edge-20.4.1 (10 pods)
git-ca760bdd (11 pods)
git-7f6bbc73 (1 pods)
```
Signed-off-by: Alex Leong <alex@buoyant.io>
This change adds a `--smi-metrics` install flag which controls if the SMI-metrics controller and associated RBAC and APIService resources are installed. The flag defaults to false and is hidden.
We plan to remove this flag or default it to true if and when the SMI-Metrics integration graduates from experimental.
Signed-off-by: Alex Leong <alex@buoyant.io>
Here we upgrade our dependencies on client-go to 0.17.4 and smi-sdk-go to 0.3.0. Since smi-sdk-go uses client-go 0.17.4, these upgrades must be performed simultaneously.
This also requires simultaneously upgrading our dependency on linkerd/stern to a SHA which also uses client-go 0.17.4. This keeps all of our transitive dependencies synchronized on one version of client-go.
This ALSO requires updating our codegen scripts to use the 0.17.4 version of code-generator and running it to generate 0.17.4 compatible generated code. I took this opportunity to update our code generation script to properly use the version of code-generater from `go.mod` rather than a hardcoded SHA.
Signed-off-by: Alex Leong <alex@buoyant.io>
* Handle automountServiceAccountToken
Return error during inject if pod spec has `automountServiceAccountToken: false`
Signed-off-by: Mayank Shah <mayankshah1614@gmail.com>
Followup to #4193
This is to verify that the list of SA installed, as well as the list of
SA in the linkerd-psp RoleBinding match the list of expected SA defined
in `healthcheck.go`.
Fixes#3943
The Linkerd clock skew check requires that all nodes in the cluster have reported a heartbeat within (approximately) the last minute. However, in Kubernetes 1.17, the default heartbeat interval is 5 minutes. This means that the clock skew check will often fail in Kubernetes 1.17 clusters.
We relax the check to only require that heartbeats have been detected in the past 5 minutes, matching the default heartbeat interval in Kubernetes 1.17. We also switch this check to be a warning so that clusters which are configured with longer heartbeat intervals don't see this as a fatal error.
Signed-off-by: Alex Leong <alex@buoyant.io>
* Add missing SAs to linkerd check
This adds the service accounts `linkerd-destination` and
`linkerd-smi-metrics` that were missing from the "control plane
ServiceAccounts exist" check.
When injecting a Cronjob with no
`spec.jobTemplate.spec.template.metadata` we were getting the following
error:
```
Error transforming resources: jsonpatch add operation does not apply:
doc is missing path:
"/spec/jobTemplate/spec/template/metadata/annotations"
```
This only happens to Cronjobs because other workloads force having at
least a label there that is used in `spec.selector` (at least as of v1
workloads).
With this fix, if no metadata is detected, then we add it in the json patch when
injecting, prior to adding the injection annotation.
I've added a couple of new unit tests, one that verifies that this
doesn't remove metadata contents in Cronjobs that do have that metadata,
and another one that tests injection in Cronjobs that don't have
metadata (which I verified it failed prior to this fix).
This version contains an fix for a bug that was rejecting all requests on clusters configured with an empty list of allowed client names. Because smi-metrics is an apiservice, this was also preventing namespaces from terminating.
Signed-off-by: Alex Leong <alex@buoyant.io>
* Upgrade golangci-lint to v1.23.8
This should help with some timeouts we're seeing in CI.
I fixed some new warnings found in `inject.go` and `uninject.go`.
Also we now have to explicitly disable linting `/controller/gen`.
The linter was also complaining that in `/pkg/k8s/fake.go` the
`spClient.Interface` and `tsclient.Interface` returned in the function
`newFakeClientSetsFromManifests()` aren't used, but I opted to ignore
that to leave them available for future tests.
* Bump proxy-init to v1.3.2
Bumped `proxy-init` version to v1.3.2, fixing an issue with `go.mod`
(linkerd/linkerd2-proxy-init#9).
This is a non-user-facing fix.