The multicluster checks make sure that the correct resources exist for each service mirror controller. When looking up these resources, the check uses the `linkerd.io/control-plane-component=linkerd-service-mirror` label selector. However, these resources are actually labeled `linkerd.io/control-plane-component=service-mirror`, so the lookup finds nothing and the check spuriously fails:
```
× service mirror controller has required permissions
missing ServiceAccounts: linkerd-service-mirror-self
missing ClusterRoles: linkerd-service-mirror-access-local-resources-self
missing ClusterRoleBindings: linkerd-service-mirror-access-local-resources-self
missing Roles: linkerd-service-mirror-read-remote-creds-self
missing RoleBindings: linkerd-service-mirror-read-remote-creds-self
see https://linkerd.io/checks/#l5d-multicluster-source-rbac-correct for hints
| * no service mirror controller deployment for Link self
```
Instead, use the correct label selector when looking up these resources.
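For illustration, here is a minimal client-go sketch of looking up the service mirror controller's ServiceAccounts with the corrected selector, assuming a recent client-go; the function name and namespace are illustrative, not the actual check code:
```
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// listServiceMirrorServiceAccounts lists the service mirror controller's
// ServiceAccounts using the label that is actually set on those resources.
func listServiceMirrorServiceAccounts(ctx context.Context, client kubernetes.Interface, ns string) error {
	opts := metav1.ListOptions{
		// The resources carry "service-mirror", not "linkerd-service-mirror".
		LabelSelector: "linkerd.io/control-plane-component=service-mirror",
	}
	sas, err := client.CoreV1().ServiceAccounts(ns).List(ctx, opts)
	if err != nil {
		return err
	}
	for _, sa := range sas.Items {
		fmt.Println(sa.Name)
	}
	return nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	if err := listServiceMirrorServiceAccounts(context.Background(), client, "linkerd-multicluster"); err != nil {
		panic(err)
	}
}
```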
Signed-off-by: Alex Leong <alex@buoyant.io>
All of the code for the service mirror controller lives in the `linkerd/linkerd2/controller/cmd` package. It is typical for control plane components to only have a `main.go` entrypoint in the cmd package. This can sometimes make it hard to find the service mirror code since I wouldn't expect it to be in the cmd package.
We move the majority of the code to a dedicated controller package, leaving only main.go in the cmd package. This is purely organizational; no behavior change is expected.
Signed-off-by: Alex Leong <alex@buoyant.io>
## edge-20.8.4
* Fixed a problem causing the `enable-endpoint-slices` flag to not be persisted
when set via `linkerd upgrade` (thanks @Matei207!)
* Removed SMI-Metrics templates and experimental sub-commands
* Use `--frozen-lockfile` to avoid accidental update of dashboard JS
dependencies in CI (thanks @tharun208!)
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
Fixes #4790
This PR removes both the SMI-Metrics templates and the experimental sub-commands. It also removes the `smi-metrics` package, as there is no direct use of it without the commands.
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
## What/How
@adleong pointed out in #4780 that when enabling slices during an upgrade, the new value does not persist in the `linkerd-config` ConfigMap. I took a closer look and it seems that we were never overwriting the values in case they were different.
* To fix this, I added an if block when validating and building the upgrade options -- if the current flag value differs from what we have in the ConfigMap, then change the ConfigMap value (see the sketch after this list).
* When doing so, I made sure to check that if the cluster does not support `EndpointSlices` but the flag is set to true, we error out. This mirrors (largely copy & paste) what the install path does.
* Additionally, I noticed that the Helm ConfigMap template stored the flag value under the `enableEndpointSlices` field name. I assume this was not updated in the initial PR to reflect the changes made in the protocol buffer; the API (and thus the CLI) uses the field name `endpointSliceEnabled` instead. I have changed the config template so that Helm installations use the same field, which can then be used in the destination service or other components that may implement slice support in the future.
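A minimal Go sketch of the flag reconciliation described above; the type, field, and function names are illustrative stand-ins, not the actual upgrade code:
```
package upgrade

import "errors"

// GlobalConfig is an illustrative stand-in for the relevant part of the
// linkerd-config ConfigMap; the real type lives in the config protobufs.
type GlobalConfig struct {
	EndpointSliceEnabled bool
}

// reconcileEndpointSlices shows the shape of the fix: if the flag was set and
// differs from the stored value, overwrite it, erroring out when the cluster
// cannot support EndpointSlices.
func reconcileEndpointSlices(flagSet, flagValue, clusterSupportsSlices bool, cfg *GlobalConfig) error {
	if !flagSet {
		return nil // keep whatever is already persisted
	}
	if flagValue && !clusterSupportsSlices {
		return errors.New("--enable-endpoint-slices=true requires EndpointSlice support in the cluster")
	}
	if cfg.EndpointSliceEnabled != flagValue {
		cfg.EndpointSliceEnabled = flagValue // persist the new value
	}
	return nil
}
```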
Signed-off-by: Matei David <matei.david.35@gmail.com>
Unit tests in CI are run using `yarn install`, which can silently update dependencies. This is fixed by passing `--frozen-lockfile` in the CI workflow.
Fixes #3838
Signed-off-by: Tharun <rajendrantharun@live.com>
This edge release adds support for [topology-aware service routing][1]
to the Destination controller. When providing service discovery updates
to proxies, the Destination controller will now filter endpoints based
on the service's topology preferences. Additionally, this release
includes bug fixes for the `linkerd check` CLI command and web
dashboard.
* CLI
* `linkerd check` will no longer warn about a looser webhook failure
policy in HA mode
* Controller
* Added support for [topology-aware service routing][1] to the
Destination controller (thanks @Matei207)
* Changed the Destination controller to always return destination
overrides for service profiles when no traffic split is present
* Web UI
* Fixed Tap `Authority` dropdown not being populated (thanks to
@tharun208!)
[1]: https://kubernetes.io/docs/concepts/services-networking/service-topology/
The Tap component calls fetch metrics with `skip_stats`, and the authority resource type is not sent, so the Authority dropdown is not populated. Added a separate call to get metrics for authorities.
Fixes #4697
Signed-off-by: Tharun <rajendrantharun@live.com>
## Motivation
#4879
## Solution
When no traffic split exists for services, return a single destination override
with a weight of 100%.
Using the destination client on a new linkerd installation, this results in the
following output for `linkerd-identity` service:
```
❯ go run controller/script/destination-client/main.go -method getProfile -path linkerd-identity.linkerd.svc.cluster.local:8080
INFO[0000] retry_budget:{retry_ratio:0.2 min_retries_per_second:10 ttl:{seconds:10}} dst_overrides:{authority:"linkerd-identity.linkerd.svc.cluster.local.:8080" weight:100000}
INFO[0000]
```
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
[Link to RFC](https://github.com/linkerd/rfc/pull/23)
### What
---
* PR that puts together all past pieces of the puzzle to deliver topology-aware service routing, as specified in the [Kubernetes docs](https://kubernetes.io/docs/concepts/services-networking/service-topology/) but with a much better load balancing algorithm and all the coolness of linkerd :)
* The first piece of this PR is focused on adding topology metadata: topology preference for services and topology `<k,v>` pairs for endpoints.
* The second piece of this PR puts together the new context format and fetching the source node topology metadata in order to allow for endpoints filtering.
* The final part is doing the filtering -- passing all of the metadata to the listener and on every `Add` filtering endpoints based on the topology preference of the service, topology `<k,v>` pairs of endpoints and topology of the source (again `<k,v>` pairs).
### How
---
* **Collecting metadata**:
- Services do not have values for topology keys -- the topological keys defined in a service's spec only dictate locality preference for routing. As such, I decided to store them in an array, taken exactly as they are found in the service spec, which ensures we respect the preference order.
- For EndpointSlices, we are using a map -- an EndpointSlice has locality information in the form of a `<k,v>` pair, where the key is a topological key (similar to what's listed in the service) and the value is the locality information -- e.g. `hostname: minikube`. For each address we now have a map of topology values, which gets populated when we translate the endpoints to an address set. Because normal Endpoints do not have any topology information, we create each address with an empty map, which is subsequently populated ONLY for slices in the `endpointSliceToAddressSet` function.
* **Filtering endpoints**:
- This was a tricky part and filled me with doubts. I think there are a few ways to do this, but this is how I "envisioned" it. First, `endpoint_translator.go` should be the one to do the filtering; this means that on subscription, we need to feed all of the relevant metadata to the listener. To do this, I created a new function `AddTopologyFilter` as part of the listener interface.
- To complement the `AddTopologyFilter` function, I created a new `TopologyFilter` struct in `endpoints_watcher.go`. I then embedded this structure in all listeners that implement the interface. The structure holds the source topology (source node), a boolean to tell if slices are activated in case we need to double check (or write tests for the function) and the service preference. We create the filter on Subscription -- we have access to the k8s client here as well as the service, so it's the best point to collect all of this data together. Addresses all have their own topology added to them so they do not have to be collected by the filter.
- When we add a new set of addresses, we check to see if slices are enabled -- chances are that if slices are enabled, service topology might be too. This lets us skip this step if the latest version is not adopted. Prior to sending an `Add` we filter the endpoints -- if a preference is registered by the filter we strictly enforce it, otherwise nothing changes (a simplified sketch of this filtering follows below).
And that's pretty much it.
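To make the filtering step concrete, here is a minimal Go sketch under assumed names; `Address` and `filterByTopology` are illustrative stand-ins for the watcher types, and the fallback behavior is simplified rather than an exact copy of the implementation:
```
package watcher

// Address is an illustrative stand-in for the watcher's address type; the
// real one carries IP, port, identity, and more.
type Address struct {
	IP       string
	Topology map[string]string // e.g. {"kubernetes.io/hostname": "minikube"}
}

// filterByTopology keeps only the addresses matching the first topology
// preference that the source node can satisfy; "*" matches everything.
func filterByTopology(addrs []Address, preferences []string, sourceNode map[string]string) []Address {
	for _, key := range preferences {
		if key == "*" {
			return addrs
		}
		sourceValue, ok := sourceNode[key]
		if !ok {
			continue
		}
		var matched []Address
		for _, a := range addrs {
			if a.Topology[key] == sourceValue {
				matched = append(matched, a)
			}
		}
		if len(matched) > 0 {
			return matched // strictly enforce the first satisfiable preference
		}
	}
	// No preference could be satisfied; this sketch falls back to the
	// unfiltered set (the real implementation may behave differently here).
	return addrs
}
```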
Signed-off-by: Matei David <matei.david.35@gmail.com>
Add a static check that ensures the generated files from the proto definitions have not changed.
Fixes #4669
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
Removed usage of `GITCOOKIE_SH`, which was a script stored in a secret
to authenticate requests against googlesource.com, to avoid hitting
rate limits when pulling go dependencies from that source. Now that we
use go modules, deps are pulled from http://proxy.golang.org/ and this
is no longer needed.
This PR changes the HA check that verifies that the `config.linkerd.io/admission-webhooks=disabled` label is present on kube-system so that it is enabled only when the failure policy for the proxy injector webhook is set to `Fail`. This allows users to skip the check when the label has been removed because the namespace is managed by the cloud provider, as in the case described in #4754.
Fixes #4754
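A rough sketch of how such a gate can be expressed with client-go, assuming the v1beta1 admissionregistration API of the time and an illustrative webhook config name; this is not the actual healthcheck code:
```
package healthcheck

import (
	"context"

	admissionregistrationv1beta1 "k8s.io/api/admissionregistration/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// shouldCheckKubeSystemLabel returns true only when the proxy injector's
// failure policy is Fail, i.e. only when a missing
// config.linkerd.io/admission-webhooks=disabled label on kube-system could
// actually break the cluster. The webhook config name is an assumption.
func shouldCheckKubeSystemLabel(ctx context.Context, client kubernetes.Interface) (bool, error) {
	webhook, err := client.AdmissionregistrationV1beta1().
		MutatingWebhookConfigurations().
		Get(ctx, "linkerd-proxy-injector-webhook-config", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	for _, wh := range webhook.Webhooks {
		if wh.FailurePolicy != nil && *wh.FailurePolicy == admissionregistrationv1beta1.Fail {
			return true, nil
		}
	}
	return false, nil
}
```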
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
## Motivation
These changes came up when testing mock identity. I found it useful for the
destination client to print the identity of endpoints.
```
❯ go run controller/script/destination-client/main.go -method get -path h1.test.example.com:8080
INFO[0000] Add:
INFO[0000] labels: map[concrete:h1.test.example.com:8080]
INFO[0000] - 127.0.0.1:4143
INFO[0000] - labels: map[addr:127.0.0.1:4143 h2:false]
INFO[0000] - protocol hint: UNKNOWN
INFO[0000] - identity: dns_like_identity:{name:"foo.ns1.serviceaccount.identity.linkerd.cluster.local"}
INFO[0000]
```
I also fixed a log line in proxy-identity that used the wrong value for the CSR path.
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
This PR corrects misspellings identified by the [check-spelling action](https://github.com/marketplace/actions/check-spelling).
The misspellings have been reported at aaf440489e (commitcomment-41423663)
The action reports that the changes in this PR would make it happy: 5b82c6c5ca
Note: this PR does not include the action. If you're interested in running a spell check on every PR and push, that can be offered separately.
Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>
Add PriceKinetics, part of GVC Australia, to the adopters list.
Signed-off-by: Steve Gray <steve.gray@ladbrokes.com.au>
Co-authored-by: Steve Gray <steve.gray@ladbrokes.com.au>
Fixes #4708
Adds a `linkerd multicluster uninstall` command which outputs the manifests required to uninstall the multicluster components. This command first checks that no links exist and advises that any links must be removed with `linkerd multicluster unlink` before proceeding. Typical usage is:
```
linkerd multicluster uninstall | kubectl delete -f -
```
Signed-off-by: Alex Leong <alex@buoyant.io>
When the Link CRD does not exist, multicluster checks in `linkerd check` will be skipped. The `--multicluster` flag is intended to force these checks on, but was being ignored.
We update the options to force the multicluster checks on when the `--multicluster` flag is used, as intended.
Now when `linkerd check --multicluster` is run on a cluster without the multicluster support installed, it gives the following output:
```
linkerd-multicluster
--------------------
× Link CRD exists
multicluster.linkerd.io/Link CRD is missing: the server could not find the requested resource
see https://linkerd.io/checks/#l5d-multicluster-link-crd-exists for hints
Status check results are ×
```
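A minimal Go sketch of the intended flag behavior; the option type and category name below are illustrative stand-ins, not the actual check options:
```
package cmd

// checkOptions is an illustrative stand-in for the real check options; the
// point is simply that the flag forces the multicluster category on rather
// than relying on auto-detection of the Link CRD.
type checkOptions struct {
	multicluster bool
}

func (o *checkOptions) checkCategories(detected []string) []string {
	categories := append([]string{}, detected...)
	if o.multicluster && !contains(categories, "linkerd-multicluster") {
		categories = append(categories, "linkerd-multicluster")
	}
	return categories
}

func contains(xs []string, s string) bool {
	for _, x := range xs {
		if x == s {
			return true
		}
	}
	return false
}
```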
Signed-off-by: Alex Leong <alex@buoyant.io>
Fixes #4511
Add the `linkerd.io/control-plane-component: gateway` label to the multicluster gateway. Change the value of `linkerd.io/control-plane-component` from `linkerd-service-mirror` to `service-mirror` for the service mirror controller.
These changes are for consistency and should not result in any change in functionality.
Signed-off-by: Alex Leong <alex@buoyant.io>
This edge adds multi-arch support to Linkerd! Our docker images and CLI now
support the amd64, arm64, and arm architectures.
* Multicluster
* Added a multicluster unlink command for removing multicluster links
* Improved multicluster checks to be more informative when the remote API is
not reachable
* Proxy
* Enabled a multi-threaded runtime to substantially improve latency especially
when the proxy is serving requests for many concurrent connections
* Other
* Fixed an issue where the debug sidecar image was missing during upgrades
(thanks @javaducky!)
* Updated all control plane and proxy container images to be multi-arch
to support amd64, arm64, and arm (thanks @aliariff!)
* Fixed an issue where check was failing when DisableHeartBeat was set to true
(thanks @mvaal!)
Supersedes #4846
Bump proxy-init to v1.3.6, containing CNI fixes and support for
multi-arch builds.
#4846 included this in v1.3.5, but proxy.golang.org refused to update the modified SHA.
The upgrade tests were failing due to hardcoded certificates which had expired. Additionally, these tests contained large swaths of yaml that made it very difficult to understand the semantics of each test case and even more difficult to maintain.
We greatly improve the readability and maintainability of these tests by using a slightly different approach. Each test follows this basic structure:
* Render an install manifest
* Initialize a fake k8s client with the install manifest (and sometimes additional manifests)
* Render an upgrade manifest
* Parse the manifests as yaml tree structures
* Perform a structured diff on the yaml tree structures and look for expected and unexpected differences
The install manifests are generated dynamically using the regular install flow. This means that we no longer need large sections of hardcoded yaml in the tests themselves. Additionally, we now assess the output by doing a structured diff against the install manifest. This means that we no longer need golden files with explicit expected output.
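As a rough illustration of the structured-diff approach, here is a minimal Go sketch; `parseManifest` and `manifestsDiffer` are hypothetical helpers built on `sigs.k8s.io/yaml`, not the actual test code, which walks the trees and asserts on the expected differences:
```
package upgradetest

import (
	"reflect"
	"strings"

	"sigs.k8s.io/yaml"
)

// parseManifest splits a rendered manifest into per-resource yaml trees.
// This is a simplified sketch; the real tests parse more carefully.
func parseManifest(manifest string) ([]map[string]interface{}, error) {
	var resources []map[string]interface{}
	for _, doc := range strings.Split(manifest, "\n---\n") {
		if strings.TrimSpace(doc) == "" {
			continue
		}
		var tree map[string]interface{}
		if err := yaml.Unmarshal([]byte(doc), &tree); err != nil {
			return nil, err
		}
		resources = append(resources, tree)
	}
	return resources, nil
}

// manifestsDiffer reports whether the install and upgrade manifests differ as
// structured trees; the real tests go further and check that only the
// expected differences are present.
func manifestsDiffer(install, upgrade string) (bool, error) {
	a, err := parseManifest(install)
	if err != nil {
		return false, err
	}
	b, err := parseManifest(upgrade)
	if err != nil {
		return false, err
	}
	return !reflect.DeepEqual(a, b), nil
}
```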
All test cases were preserved except for the following:
* Any test cases related to multiphase install (config/control plane) were not replicated. This flow doesn't follow the same pattern as the tests above because the install and upgrade manifests are not expected to be the same or similar. I also felt that these tests were lower priority because the multiphase install/upgrade feature does not seem to be very popular and is a potential candidate for deprecation.
* Any tests involving upgrading from a very old config were not replicated. The code to generate these old style configs is no longer present in the codebase so in order to test this case, we would need to resort to hardcoded install manifests. These tests also seemed low priority to me because Linkerd versions that used the old config are now over 1 year old so it may no longer be critical that we support upgrading from them. We generally recommend that users upgrading from an old version of Linkerd do so by upgrading through each major version rather than directly to the latest.
Signed-off-by: Alex Leong <alex@buoyant.io>
* When releasing, build and upload the amd64, arm64 and arm architectures builds for the CLI
* Refactored `Dockerfile-bin` so it has separate stages for single and multi arch builds. The latter stage is only used for releases.
Signed-off-by: Ali Ariff <ali.ariff12@gmail.com>
This PR moves default values into add-on specific values.yaml files, allowing us to update default values later, since they are no longer persisted in the linkerd-config-addons ConfigMap.
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
This release enables a multi-threaded runtime. Previously, the proxy
would only ever use a single thread for data plane processing; now, when
the proxy is allocated more than 1 CPU share, the proxy allocates a
thread per available CPU. This has shown substantial latency
improvements in benchmarks, especially when the proxy is serving
requests for many concurrent connections.
---
* Add a `multicore` feature flag (linkerd/linkerd2-proxy#611)
* Add `multicore` to default features (linkerd/linkerd2-proxy#612)
* admin: add an endpoint to dump spawned Tokio tasks (linkerd/linkerd2-proxy#595)
* trace: roll `tracing` and `tracing-subscriber` dependencies (linkerd/linkerd2-proxy#615)
* stack: Add NewService::into_make_service (linkerd/linkerd2-proxy#618)
* trace: tweak tracing & test support for the multithreaded runtime (linkerd/linkerd2-proxy#616)
* Make FailFast cloneable (linkerd/linkerd2-proxy#617)
* Move HTTP detection & server into linkerd2_proxy_http (linkerd/linkerd2-proxy#619)
* Mark tap integration tests as flakey (linkerd/linkerd2-proxy#621)
* Introduce a SkipDetect layer to preempt detection (linkerd/linkerd2-proxy#620)
Fixes #4774
When a service mirror controller is unable to connect to the target cluster's API, the service mirror controller crashes with the error that it has failed to sync caches. This error lacks the necessary detail to debug the situation. Unfortunately, client-go does not surface more useful information about why the caches failed to sync.
To make this more debuggable we do a couple things:
1. When creating the target cluster api client, we eagerly issue a server version check to test the connection (a simplified sketch of this check appears after the sample output below). If the connection fails, the service-mirror-controller logs now look like this:
```
time="2020-07-30T23:53:31Z" level=info msg="Got updated link broken: {Name:broken Namespace:linkerd-multicluster TargetClusterName:broken TargetClusterDomain:cluster.local TargetClusterLinkerdNamespace:linkerd ClusterCredentialsSecret:cluster-credentials-broken GatewayAddress:35.230.81.215 GatewayPort:4143 GatewayIdentity:linkerd-gateway.linkerd-multicluster.serviceaccount.identity.linkerd.cluster.local ProbeSpec:ProbeSpec: {path: /health, port: 4181, period: 3s} Selector:{MatchLabels:map[] MatchExpressions:[{Key:mirror.linkerd.io/exported Operator:Exists Values:[]}]}}"
time="2020-07-30T23:54:01Z" level=error msg="Unable to create cluster watcher: cannot connect to api for target cluster remote: Get \"https://36.199.152.138/version?timeout=32s\": dial tcp 36.199.152.138:443: i/o timeout"
```
This error also no longer causes the service mirror controller to crash. Updating the Link resource will cause the service mirror controller to reload the credentials and try again.
2. We rearrange the checks in `linkerd check --multicluster` to perform the target API connectivity checks before the service mirror controller checks. This means that we can validate the target cluster API connection even if the service mirror controller is not healthy. We also add a server version check here to quickly determine if the connection is healthy. Sample check output:
```
linkerd-multicluster
--------------------
√ Link CRD exists
√ Link resources are valid
* broken
W0730 16:52:05.620806 36735 transport.go:243] Unable to cancel request for promhttp.RoundTripperFunc
× remote cluster access credentials are valid
* failed to connect to API for cluster: [broken]: Get "https://36.199.152.138/version?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
see https://linkerd.io/checks/#l5d-smc-target-clusters-access for hints
W0730 16:52:35.645499 36735 transport.go:243] Unable to cancel request for promhttp.RoundTripperFunc
× clusters share trust anchors
Problematic clusters:
* broken: unable to fetch anchors: Get "https://36.199.152.138/api/v1/namespaces/linkerd/configmaps/linkerd-config?timeout=30s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
see https://linkerd.io/checks/#l5d-multicluster-clusters-share-anchors for hints
√ service mirror controller has required permissions
* broken
√ service mirror controllers are running
* broken
× all gateway mirrors are healthy
wrong number of (0) gateway metrics entries for probe-gateway-broken.linkerd-multicluster
see https://linkerd.io/checks/#l5d-multicluster-gateways-endpoints for hints
√ all mirror services have endpoints
‼ all mirror services are part of a Link
mirror service voting-svc-gke.emojivoto is not part of any Link
see https://linkerd.io/checks/#l5d-multicluster-orphaned-services for hints
```
Some logs from the underlying go network libraries sneak into the output which is kinda gross but I don't think it interferes too much with being able to understand what's going on.
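A minimal Go sketch of the eager connectivity check described in step 1 above, using client-go's discovery client; the function and package names are assumptions, not the actual service mirror code:
```
package servicemirror

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newVerifiedClient builds a client for the target cluster and eagerly calls
// the version endpoint so that connectivity problems surface as a clear error
// instead of a generic "failed to sync caches" crash.
func newVerifiedClient(cfg *rest.Config, clusterName string) (kubernetes.Interface, error) {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	if _, err := client.Discovery().ServerVersion(); err != nil {
		return nil, fmt.Errorf("cannot connect to api for target cluster %s: %w", clusterName, err)
	}
	return client, nil
}
```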
Signed-off-by: Alex Leong <alex@buoyant.io>
Some installations upgrading from versions prior to 2.7.x may be missing the debug image name and version. This fix ensures that default values are in place for this scenario and additionally upgrades the debug image version along with the control plane version.
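A minimal sketch of the defaulting described above; the type, field names, and default image are illustrative assumptions, not the actual upgrade code:
```
package upgrade

// ProxyConfig is an illustrative stand-in for the relevant proxy config
// fields; the real type lives in the linkerd2 config protobufs.
type ProxyConfig struct {
	DebugImage        string
	DebugImageVersion string
}

// ensureDebugImageDefaults fills in missing debug-image settings for installs
// that pre-date 2.7.x and pins the debug image version to the control plane
// version being upgraded to. The default image name here is an assumption.
func ensureDebugImageDefaults(cfg *ProxyConfig, controlPlaneVersion string) {
	if cfg.DebugImage == "" {
		cfg.DebugImage = "gcr.io/linkerd-io/debug" // assumed default image
	}
	cfg.DebugImageVersion = controlPlaneVersion
}
```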
Signed-off-by: Paul Balogh <javaducky@gmail.com>
Build ARM docker images in the release workflow.
# Changes:
- Add new env keys `DOCKER_MULTIARCH` and `DOCKER_PUSH`. When set, the workflow will build multi-arch images and push them to the registry. See https://github.com/docker/buildx/issues/59 for why they must be pushed to the registry.
- Using `crazy-max/ghaction-docker-buildx` is necessary as it is already configured for cross-compilation (using QEMU), so we can just use it instead of setting it up manually.
- With `buildx`, the automatic platform arguments are now available in the global scope. (See: https://docs.docker.com/engine/reference/builder/#automatic-platform-args-in-the-global-scope)
# Follow-up:
- Releasing the CLI binary for the ARM architectures. The docker images resulting from these changes are already built for ARM; we still need further adjustments, such as retrieving those binaries and naming them correctly as part of the GitHub Release artifacts.
Signed-off-by: Ali Ariff <ali.ariff12@gmail.com>
The job started failing consistently today with:
```
##[error]devblackops/github-action-psscriptanalyzer/v2/action.yml (Line:
30, Col: 9): Unexpected value ''
##[error]devblackops/github-action-psscriptanalyzer/v2/action.yml (Line:
30, Col: 9): Unexpected value ''
##[error]System.ArgumentException: Unexpected type 'NullToken'
encountered while reading 'outputs'. The type 'MappingToken' was
expected.
```
It seems something changed in GitHub today that is clashing with the `devblackops/github-action-psscriptanalyzer` action.
I've raised devblackops/github-action-psscriptanalyzer#12
Fixes #4707
In order to remove a multicluster link, we add a `linkerd multicluster unlink` command which produces the yaml necessary to delete all of the resources associated with a `linkerd multicluster link`. These are:
* the link resource
* the service mirror controller deployment
* the service mirror controller's RBAC
* the probe gateway mirror for this link
* all mirror services for this link
This command follows the same pattern as the `linkerd uninstall` command in that its output is expected to be piped to `kubectl delete`. The typical usage of this command is:
```
linkerd --context=source multicluster unlink --cluster-name=foo | kubectl --context=source delete -f -
```
This change also fixes the shutdown lifecycle of the service mirror controller by properly having it listen for the shutdown signal and exit its main loop.
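A minimal sketch of the shutdown wiring described above, using the standard library's signal handling; the event channel is a placeholder, not the actual controller loop:
```
package main

import (
	"os"
	"os/signal"
	"syscall"
)

func main() {
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, os.Interrupt, syscall.SIGTERM)

	// Illustrative main loop: process events until a shutdown signal arrives,
	// then return so cleanup can run and the process exits cleanly.
	events := make(chan struct{})
	for {
		select {
		case <-events:
			// handle a cluster event (placeholder)
		case <-stop:
			// exit the main loop on SIGINT/SIGTERM
			return
		}
	}
}
```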
A few alternative designs were considered:
I investigated using owner references as suggested [here](https://github.com/linkerd/linkerd2/issues/4707#issuecomment-653494591) but it turns out that owner references must refer to resources in the same namespace (or to cluster scoped resources). This was not feasible here because a service mirror controller can create mirror services in many different namespaces.
I also considered having the service mirror controller delete the mirror services that it created during its own shutdown. However, this could lead to scenarios where the controller is killed before it finishes deleting the services that it created. It seemed more reliable to have all the deletions happen from `kubectl delete`. Since this is the case, we avoid having the service mirror controller delete mirror services, even when the link is deleted, to avoid the race condition where the controller and CLI both attempt to delete the same mirror services and one of them fails with a potentially alarming error message.
Signed-off-by: Alex Leong <alex@buoyant.io>