* Introduce multicluster gateway API handler in the web API server
* Added MetricsUtil for gateway metrics
* Added gateway API helper
* Added Gateway component
* Updated metricsTable component to support gateway metrics
* Added handler for gateway
Fixes #4601
Signed-off-by: Tharun <rajendrantharun@live.com>
* Add sidecar container support for linkerd-prometheus
Adds a new setting to the Prometheus Helm config, allowing arbitrary sidecar containers to run alongside the main Prometheus container.
The specific use case that inspired this was exporting data from Prometheus to external systems (e.g. CloudWatch, Stackdriver, Datadog) using a process that watches the Prometheus write-ahead log (WAL).
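Conceptually, the new setting just appends the configured containers to the Prometheus pod spec. Below is a minimal Go sketch of that idea; the sidecar image and container names are illustrative assumptions, not the chart's actual schema:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// The Prometheus pod spec rendered by the chart, reduced to its main container.
	podSpec := corev1.PodSpec{
		Containers: []corev1.Container{
			{Name: "prometheus", Image: "prom/prometheus:v2.19.3"},
		},
	}

	// Hypothetical sidecar: a process tailing the Prometheus WAL and shipping
	// samples to an external system. The image name is made up for this sketch.
	walExporter := corev1.Container{
		Name:  "wal-exporter",
		Image: "example/prometheus-wal-exporter:latest",
	}

	// The new Helm setting effectively does this: append any user-supplied
	// containers alongside the main prometheus container.
	podSpec.Containers = append(podSpec.Containers, walExporter)

	out, _ := yaml.Marshal(podSpec)
	fmt.Println(string(out))
}
```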
Signed-off-by: Nathan J. Mehl <n@oden.io>
* Fixed `linkerd check` not finding Prometheus
## The Problem
`linkerd check`, when run right after install, fails because it can't find the Prometheus pod.
## The Cause
The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready.
Since #4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries.
## The Fix
The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind.
This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same
It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.
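For illustration, here is a minimal sketch of the retry pattern described above; `withRetry`, its interval, and the deadline handling are assumptions for this example, not the actual `runCheckRPC()` implementation:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// checkRPC stands in for the gRPC self-check call; the real code lives in the
// healthcheck package and its signature differs.
type checkRPC func(ctx context.Context) error

// withRetry re-runs the check until it succeeds or the deadline expires,
// mirroring the retry behavior added around the self-check.
func withRetry(ctx context.Context, interval time.Duration, check checkRPC) error {
	var lastErr error
	for {
		if lastErr = check(ctx); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("check did not succeed before deadline: %w", lastErr)
		case <-time.After(interval):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	attempts := 0
	err := withRetry(ctx, time.Second, func(context.Context) error {
		attempts++
		if attempts < 3 {
			return errors.New("prometheus pod not ready yet")
		}
		return nil
	})
	fmt.Println(err, "after", attempts, "attempts")
}
```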
Add a new structure on the destination controller side to keep track of contextual information.
The token format has been changed from `ns:<namespace>` to a JSON format so that more variables can be
encoded in the token. As part of this PR, a new field `nodeName` has been added to help with service
topologies.
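As a rough sketch (the struct and field names here are assumptions, not the controller's actual types), the token might look like this:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// contextToken is a sketch of the JSON token described above; the real struct
// lives in the destination controller and its exact fields may differ.
type contextToken struct {
	Ns       string `json:"ns"`
	NodeName string `json:"nodeName"`
}

func main() {
	// Old format: "ns:<namespace>". The JSON format lets additional fields,
	// such as nodeName for service topologies, be encoded without breaking parsing.
	tok, _ := json.Marshal(contextToken{Ns: "emojivoto", NodeName: "node-1"})
	fmt.Println(string(tok)) // {"ns":"emojivoto","nodeName":"node-1"}
}
```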
Fixes #4498
Signed-off-by: Matei David <matei.david.35@gmail.com>
Removed the dependency on the base image; the needed packages are now installed directly in the Dockerfiles for debug and CNI.
Also removed some obsolete info from BUILD.md
Signed-off-by: Ali Ariff <ali.ariff12@gmail.com>
This PR moves the service mirror controller from `linkerd mc install` to `linkerd mc link`, as described in https://github.com/linkerd/rfc/pull/31. For fuller context, please see that RFC.
Basic multicluster functionality works here including:
* `linkerd mc install` installs the Link CRD but not any service mirror controllers
* `linkerd mc link` creates a Link resource and installs a service mirror controller which uses that Link
* The service mirror controller creates and manages mirror services, a gateway mirror, and their endpoints.
* The `linkerd mc gateways` command lists all linked target clusters, their liveness, and probe latencies.
* The `linkerd check` multicluster checks have been updated for the new architecture. Several checks have been rendered obsolete by the new architecture and have been removed.
The following are known issues requiring further work:
* The service mirror controller uses the existing `mirror.linkerd.io/gateway-name` and `mirror.linkerd.io/gateway-ns` annotations to select which services to mirror; it does not yet support configuring a label selector (a sketch of this annotation-based selection follows this list).
* An `unlink` command is needed for removing multicluster links: see https://github.com/linkerd/linkerd2/issues/4707
* An `mc uninstall` command is needed for uninstalling the multicluster add-on: see https://github.com/linkerd/linkerd2/issues/4708
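A minimal sketch of that annotation-based selection; `shouldMirror` is a hypothetical helper, and the real service mirror controller logic differs in detail:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	gatewayNameAnnotation = "mirror.linkerd.io/gateway-name"
	gatewayNsAnnotation   = "mirror.linkerd.io/gateway-ns"
)

// shouldMirror selects services carrying both gateway annotations; a label
// selector is not yet supported, as noted above.
func shouldMirror(svc corev1.Service) bool {
	_, hasName := svc.Annotations[gatewayNameAnnotation]
	_, hasNs := svc.Annotations[gatewayNsAnnotation]
	return hasName && hasNs
}

func main() {
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name: "web-svc",
			Annotations: map[string]string{
				gatewayNameAnnotation: "linkerd-gateway",
				gatewayNsAnnotation:   "linkerd-multicluster",
			},
		},
	}
	fmt.Println(shouldMirror(svc)) // true
}
```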
Signed-off-by: Alex Leong <alex@buoyant.io>
## edge-20.7.4
This edge release adds support for the new Kubernetes [EndpointSlice] resource
to the Destination controller. Using the EndpointSlice API is more efficient
for the Kubernetes control plane than using the Endpoints API. If the cluster
supports EndpointSlices (a beta feature in Kubernetes 1.17), Linkerd can be
installed with the `--enable-endpoint-slices` flag to use this resource rather
than the Endpoints API.
* Added fish shell completions to the `linkerd` command (thanks @WLun001!)
* Enabled support for EndpointSlices (thanks @Matei207!)
* Separated Prometheus checks and made them runnable only when the add-on
is enabled
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
As linkerd-prometheus is now optional, the checks are also separated
and only run when the Prometheus add-on is installed.
This is done by re-using the add-on check code.
* Migrate CI to docker buildx and other improvements
## Motivation
- Improve build times in forks, especially when rerunning builds because of a flaky test.
- Start using `docker buildx` to pave the way for multiplatform builds.
## Performance improvements
These timings were taken for the `kind_integration.yml` workflow when we merged and reran the lodash bump PR (#4762).
Before these improvements:
- when merging: `24:18`
- when rerunning after merge (docker cache warm): `19:00`
- when running the same changes in a fork (no docker cache): `32:15`
After these improvements:
- when merging: `25:38`
- when rerunning after merge (docker cache warm): `19:25`
- when running the same changes in a fork (docker cache warm): `19:25`
As explained below, non-forks and forks now use the same cache, so the important takeaway is that forks will always start with a warm cache and we'll no longer see long build times like the `32:15` above.
The downside is a slight increase in the build times for non-forks (up to a little more than a minute, depending on the case).
## Build containers in parallel
The `docker_build` job in the `kind_integration.yml`, `cloud_integration.yml` and `release.yml` workflows relied on running `bin/docker-build` which builds all the containers in sequence. Now each container is built in parallel using a matrix strategy.
## New caching strategy
CI now uses `docker buildx` for building the container images, which allows using an external cache source for builds, a location in the filesystem in this case. That location gets cached using actions/cache, using the key `{{ runner.os }}-buildx-${{ matrix.target }}-${{ env.TAG }}` and the restore key `${{ runner.os }}-buildx-${{ matrix.target }}-`.
For example when building the `web` container, its image and all the intermediary layers get cached under the key `Linux-buildx-web-git-abc0123`. When that has been cached in the `main` branch, that cache will be available to all the child branches, including forks. If a new branch in a fork asks for a key like `Linux-buildx-web-git-def456`, the key won't be found during the first CI run, but the system falls back to the key `Linux-buildx-web-git-abc0123` from `main` and so the build will start with a warm cache (more info about how keys are matched in the [actions/cache docs](https://docs.github.com/en/actions/configuring-and-managing-workflows/caching-dependencies-to-speed-up-workflows#matching-a-cache-key)).
## Packet host no longer needed
To benefit from the warm caches in both non-forks and forks as just explained, we had to stop doing the builds in Packet; now everything runs in the GitHub runner VMs.
As a result, there's no longer separate logic for non-forks and forks in the workflow files; `kind_integration.yml` was greatly simplified, but `cloud_integration.yml` and `release.yml` got a little bigger in order to use the Actions artifacts as a repository for the built images. This bloat will be fixed when support for [composite actions](https://github.com/actions/runner/blob/users/ethanchewy/compositeADR/docs/adrs/0549-composite-run-steps.md) lands in GitHub.
## Local builds
You are still able to run `bin/docker-build` or any of the `docker-build.*` scripts. To make use of buildx, run those same scripts after having set the env var `DOCKER_BUILDKIT=1`. Using buildx requires that you have it installed, as instructed [here](https://github.com/docker/buildx).
## Other
- A new script `bin/docker-cache-prune` is used to remove unused images from the cache. Without it the cache grows constantly and we can rapidly hit the 5GB limit (when the limit is reached the oldest entries get evicted).
- The `go-deps` Dockerfile base image was changed from `golang:1.14.2` (Debian based) to `golang:1.14.2-alpine`, also to conserve cache space.
## Addressed separately in #4875
Got rid of the `go-deps` image and instead added a similar first stage on top of all the Dockerfiles dealing with `go`. That stage continues to serve as a way to pre-populate Go's build cache, which speeds up the builds in the subsequent stages. In theory it should be rebuilt automatically only when `go.mod` or `go.sum` change, and we no longer require running `bin/update-go-deps-shas`. That script was removed along with all the logic elsewhere that used it, including the `go_dependencies` job in the `static_checks.yml` GitHub workflow.
The list of modules preinstalled was moved from `Dockerfile-go-deps` to a new script `bin/install-deps`. I couldn't find a way to generate that list dynamically, so whenever a slow-to-compile dependency is found, we have to make sure it's included in that list.
Although this simplifies the dev workflow, note that the real motivation behind this was a limitation in buildx's `docker-container` driver that forbids us from depending on images that haven't been pushed to a registry, so we have to resort to building the dependencies as a first stage in the Dockerfiles.
The GitHub Actions build status badge was referencing an old workflow
named `CI`, which always shows red and is no longer used.
Fix the build status badge to reference the `Release` workflow.
Also slightly reformat the matrix build yaml, as the list has grown a
bit.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* Small PR that uncomments the `EndpointSliceAcess` method and cleans up leftover TODOs in the destination service.
* Based on the past three PRs related to `EndpointSlices` (#4663, #4696, #4740); they should now be functional (albeit prone to bugs) and ready to use.
Signed-off-by: Matei David <matei.david.35@gmail.com>
* Removes/relaxes Prometheus-related checks
Now that Prometheus is an add-on, there can be cases where Prometheus is
disabled, in which case the check should show a warning but not fail. This
decouples the tight dependency.
This changes the following checks:
- Removes serviceAccount and pod checks in the CLI.
- Relaxes the `linkerd-api` checks to only check for Prometheus access when
the URL is not empty. This should work seamlessly with external
Prometheus, as that URL will be passed and the same check is
performed.
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
## edge-20.7.3
This edge release introduces an install flag for EndpointSlices. With this flag,
endpoint slices can be used as a resource in the destination service instead of
the endpoints resource.
* Introduce CLI and Helm install flag for EndpointSlices (thanks @Matei207!)
* Internal improvements to the CI process for testing Helm installations
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
EndpointSlices have been made opt-in due to their experimental nature. This PR
introduces a new install flag `enableEndpointSlices` that allows adopters to
specify in their CLI install or Helm install step whether they would like to
use EndpointSlices as a resource in the destination service, instead of the
Endpoints k8s resource.
Signed-off-by: Matei David <matei.david.35@gmail.com>
https://github.com/linkerd/linkerd2-proxy/pull/593 changed the proxy
release process to produce platform-specific binaries.
This change modifies the bin/fetch-proxy script to fetch amd64-specific
binaries. The proxy version has been updated to v1.104.1, which includes
no code changes since v1.104.0.
Signed-off-by: Ali Ariff <ali.ariff12@gmail.com>
The `tests` variable wasn't being properly initialized, which resulted
in the `helm-deep` tests being repeated; without cleanup in between,
attempting to create resources that already existed caused an error.
## Motivation
Closes #3916
This adds the ability to get profiles for services by IP address.
### Change in behavior
When the destination server receives a `GetProfile` request with an IP address,
it now tries to map that IP address to a service.
If the IP address maps to an existing service, then the destination server
returns the profile stream and subscribes for updates to the _service_; this is the
existing behavior. If the IP address later maps to a new service, the stream will still
send updates for the first service the IP address corresponded to, since that is
what it is subscribed to.
If the IP address does not map to an existing service, then the destination
server returns the profile stream but does not subscribe for updates. The stream
will receive one update, the default profile.
### Solution
This change uses the `IPWatcher` within the destination server to check which
service an IP address corresponds to. By adding a new method `GetSvc` to
`IPWatcher`, the server now calls this method when `GetProfile` receives a request
with an IP address.
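The following is a simplified, self-contained sketch of that branching; `getSvc` and the static ClusterIP index stand in for the real `IPWatcher.GetSvc` and its informer-backed state:

```go
package main

import (
	"fmt"
	"net"
)

// clusterIPIndex stands in for the IPWatcher's view of ClusterIP -> service;
// the real code uses informers rather than a static map.
var clusterIPIndex = map[string]string{
	"10.104.57.90": "linkerd-tap.linkerd.svc.cluster.local",
}

func getSvc(ip string) (string, bool) {
	name, ok := clusterIPIndex[ip]
	return name, ok
}

// resolveProfileTarget mirrors the GetProfile branching described above: an IP
// that maps to a service is treated as that service; an unknown IP falls back
// to the default (unsubscribed) profile.
func resolveProfileTarget(path string) string {
	host, _, err := net.SplitHostPort(path)
	if err != nil {
		host = path
	}
	if net.ParseIP(host) == nil {
		return "named service: " + host
	}
	if svc, ok := getSvc(host); ok {
		return "service for IP: " + svc
	}
	return "default profile (no subscription)"
}

func main() {
	fmt.Println(resolveProfileTarget("10.104.57.90:8088"))
	fmt.Println(resolveProfileTarget("10.255.0.1:8088"))
}
```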
## Testing
Install linkerd on a cluster and get the cluster IP of any service:
```bash
❯ kubectl get -n linkerd svc/linkerd-tap -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
linkerd-tap ClusterIP 10.104.57.90 <none> 8088/TCP,443/TCP 16h linkerd.io/control-plane-component=tap
```
Run the destination server:
```bash
❯ go run controller/cmd/main.go destination -kubeconfig ~/.kube/config
```
Get the profile for the tap service by IP address:
```bash
❯ go run controller/script/destination-client/main.go -method getProfile -path 10.104.57.90:8088
INFO[0000] retry_budget:{retry_ratio:0.2 min_retries_per_second:10 ttl:{seconds:10}}
INFO[0000]
```
Get the profile for an IP address that does not correspond to a service:
```bash
❯ go run controller/script/destination-client/main.go -method getProfile -path 10.256.0.1:8088
INFO[0000] retry_budget:{retry_ratio:0.2 min_retries_per_second:10 ttl:{seconds:10}}
INFO[0000]
```
You can add and remove settings for the service profile for tap and get updates.
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
This creates a new integration test target that launches the deep suite,
using a linkerd instance installed through Helm.
I've added a `global.proxyInit.ignoreInboundPorts=1234,5678` override
during install and enhanced the injection test to catch problems like
what we saw in #4679.
The deep integration tests started failing on GKE.
Originally, this was thought to be a cleanup issue, but we have not cleaned up
deep integration tests in the past. We install Linkerd once, and then run all
the tests serially.
Since it's been a while since we've run the full deep tests on GKE, we may
just need more resources to run them now.
This increases the node count of the GKE cluster that we run on from 1 to 2.
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
An inappropriate variable reuse resulted in the failure of the test for
upgrading using manifests. This only happened when the upgrade was
retried a second time (when there's a discrepancy in the heartbeat cron
schedule, which is benign).
This edge release moves Linkerd's bundled Prometheus into an add-on.
This makes the Linkerd Prometheus more configurable, gives it a separate
upgrade lifecycle from the rest of the control plane, and allows users
to disable the bundled Prometheus instance. In addition, this release
includes fixes for several issues, including a regression where the
proxy would fail to report OpenCensus spans.
* Prometheus is now an optional add-on, enabled by default
* Custom tolerations can now be specified for control plane resources
when installing with Helm (thanks @DesmondH0!)
* Evicted data plane pods are no longer considered to be failed by
`linkerd check --proxy`, fixing an issue where the check would be
retried indefinitely as long as evicted pods are present
* Fixed a regression where proxy spans were not reported to OpenCensus
* Fixed a bug where the proxy injector would fail to render skipped port
lists when installed with Helm
* Internal improvements to the proxy for lower latencies under high
concurrency
* Thanks to @Hellcatlk and @surajssd for adding new unit tests and
spelling fixes!
This fixes the deep integration test, which currently only calls `run_test` for
the `edges` integration test.
This occurs because `run_test "${tests[@]}"` will pass an entire array of
filenames when `run_test` only expects *one* filename.
The solution is to loop through `tests` and call `run_test` for each file.
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
This moves Prometheus to an add-on, thus making it optional but enabled by default. This also makes `linkerd-prometheus` more configurable and allows it to have its own life-cycle for upgrades, configuration, etc.
This work will be followed by documentation that helps users configure an existing Prometheus to work with Linkerd.
**Changes Include:**
- moving the Prometheus manifests into a separate chart at `charts/add-ons/prometheus`, and adding it as a dependency of `linkerd2`
- implementing the `addOn` interface to support the same with the CLI (a loose sketch follows below)
- including the configuration in `linkerd-config-addons`
**User Facing Changes:**
The default install experience does not change much, but users who have already configured Prometheus differently will need to apply the same settings using the new configuration fields described in the chart README.
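As a loose, heavily simplified sketch of the add-on pattern (the real `addOn` interface in the Linkerd CLI has a different and richer shape):

```go
package main

import "fmt"

// addOn is a hypothetical rendering of the pattern: each add-on knows its
// name, whether it is enabled, and the default values it contributes.
type addOn interface {
	Name() string
	Enabled() bool
	DefaultValues() map[string]interface{}
}

type prometheusAddOn struct {
	enabled bool
}

func (p prometheusAddOn) Name() string  { return "prometheus" }
func (p prometheusAddOn) Enabled() bool { return p.enabled }
func (p prometheusAddOn) DefaultValues() map[string]interface{} {
	return map[string]interface{}{"image": "prom/prometheus:v2.19.3"}
}

func main() {
	// Enabled by default; users can turn it off and point Linkerd at their own Prometheus.
	addOns := []addOn{prometheusAddOn{enabled: true}}
	for _, a := range addOns {
		if a.Enabled() {
			fmt.Printf("rendering chart for add-on %q with defaults %v\n", a.Name(), a.DefaultValues())
		}
	}
}
```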
The `splitStringListToPorts` Helm function currently formats a list of ports incorrectly, as an array of Port objects that look like `{"port": 555}`. The config map protobuf representation, however, expects the `ignoreOutboundPorts` and `ignoreInboundPorts` fields to be lists of PortRange objects (`{"portRange": 555}`).
This was causing the injector to return an empty string when trying to parse a PortRange object, resulting in the ports not being set correctly when injecting workloads. Note that this happens only with Helm installations, as that is when we actually use a Helm template to output the config map.
To fix that, the `splitStringListToPorts` Helm function is changed to format the objects as the JSON representation of PortRange and is renamed to `splitStringListToPortRanges`.
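To illustrate the target shape, here is a Go rendering of the conversion; the real helper is a Helm template function, and the `portRange` value is shown as a string so that ranges like `4222-4223` also fit:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// portRange mirrors the PortRange shape expected by the config protobuf,
// i.e. {"portRange": ...}, per the description above.
type portRange struct {
	PortRange string `json:"portRange"`
}

// splitStringListToPortRanges converts a comma-separated list of ports into
// PortRange objects, analogous to what the renamed Helm helper produces.
func splitStringListToPortRanges(list string) []portRange {
	var out []portRange
	for _, p := range strings.Split(list, ",") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, portRange{PortRange: p})
		}
	}
	return out
}

func main() {
	b, _ := json.Marshal(splitStringListToPortRanges("25,443,587,3306"))
	fmt.Println(string(b)) // [{"portRange":"25"},{"portRange":"443"},...]
}
```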
Fix: #4679
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
When a k8s pod is evicted, its Phase is set to Failed and the reason is set to Evicted. Because the ListPods method of the public API only transmits the phase and treats it as Status, the health checks assume such evicted data plane pods have failed. Since this check is retryable, the result is that `linkerd check --proxy` appears to hang when there are evicted pods. As @adleong correctly pointed out, the presence of evicted pods is not something that should make the checks fail.
This change modifies the public API to set `Pod.Status` to "Evicted" for evicted pods. The health checks are also modified to not treat evicted pods as error cases.
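A minimal sketch of the mapping and check behavior described above; the function names are illustrative, not the actual public API or healthcheck code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podStatus maps an evicted pod to the "Evicted" status instead of the bare
// Failed phase, as the public API change does.
func podStatus(pod corev1.Pod) string {
	if pod.Status.Phase == corev1.PodFailed && pod.Status.Reason == "Evicted" {
		return "Evicted"
	}
	return string(pod.Status.Phase)
}

// isPodFailure treats only genuinely failed pods as failures; "Evicted" no
// longer counts, so the check stops retrying on evicted pods.
func isPodFailure(status string) bool {
	return status == string(corev1.PodFailed)
}

func main() {
	evicted := corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodFailed, Reason: "Evicted"}}
	fmt.Println(podStatus(evicted), isPodFailure(podStatus(evicted))) // Evicted false
}
```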
Fix #4690
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
* Update Helm render tests to read child charts' values.yaml
By default, Helm installation considers the values.yaml of dependent charts
and uses them in rendering. This is used for add-ons to
keep the default template values, allowing further overrides from the
parent chart's (i.e. linkerd2) values.yaml or --addon-config through the CLI.
This PR updates the Helm tests to reflect the same, i.e. consider the
values.yaml of chart dependencies if present.
This does not have any UX changes but helps with the follow-up
add-on related work.
Removed `controller/proxy-injector/webhook_ops.go` and `controller/sp-validator/webhook_ops.go`, which we used when we first introduced webhooks to dynamically create their configs; we ended up doing that upfront at install time instead.
The following command was used to find the misspellings, which were then
fixed:
```
codespell --skip CHANGES.md,.git,go.sum,\
controller/cmd/service-mirror/events_formatting.go,\
controller/cmd/service-mirror/cluster_watcher_test_util.go,\
SECURITY_AUDIT.pdf,.gcp.json.enc,web/app/img/favicon.png \
--ignore-words-list=aks,uint,ans,files\' --check-filenames \
--check-hidden
```
Signed-off-by: Suraj Deshmukh <surajd.service@gmail.com>
The function triggering the test for the k8s custom cluster domain was
misnamed, and thus the test wasn't being run.
This also adds some extra error handling to catch this and other
potential issues.
Introduce support for the EndpointSlice k8s resource (k8s v1.16+) in the destination service.
Through this PR, the EndpointsWatcher gains a dedicated informer for EndpointSlice;
this informer cannot run at the same time as the Endpoints resource informer. The main difference
is that EndpointSlices have a one-to-many relationship with a service; they provide better performance,
dual-stack addresses, and more. EndpointSlice support also enables service topology and other related k8s features.
Validated and tested manually, as well as with dedicated unit tests.
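For reference, a minimal, standalone sketch of a dedicated EndpointSlice informer using client-go (assuming a cluster with the `discovery.k8s.io/v1beta1` API and a local kubeconfig); the real EndpointsWatcher wires this differently:

```go
package main

import (
	"fmt"
	"time"

	discovery "k8s.io/api/discovery/v1beta1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	esInformer := factory.Discovery().V1beta1().EndpointSlices().Informer()
	esInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			slice := obj.(*discovery.EndpointSlice)
			// One service maps to many slices; the owning service is recorded
			// in the kubernetes.io/service-name label.
			fmt.Println("slice", slice.Name, "for service", slice.Labels[discovery.LabelServiceName])
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {}
}
```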
Closes #4501
Signed-off-by: Matei David <matei.david.35@gmail.com>
Based on the [EndpointSlice PR](https://github.com/linkerd/linkerd2/pull/4663), this is just the k8s/api support for endpointslices to shorten the first PR.
* Adds CRD
* Adds functions that check whether the cluster has EndpointSlice access
* Adds discovery & endpointslice informers to api.
Signed-off-by: Matei David <matei.david.35@gmail.com>
Added Sue BV to the adopters list
Added Youmail to list of adopters (#4694)
Signed-off-by: Freddy Andersen <fandersen@youmail.com>
Co-authored-by: Freddy Andersen <53147+heimdull@users.noreply.github.com>
Co-authored-by: Alex Leong <alex@buoyant.io>
This release increases the default buffer size to match the proxy's
in-flight request limit. This reduces contention in overload situations
(especially under high concurrency), substantially reducing tail latency.
---
* update test-support clients and servers to be natively async (linkerd/linkerd2-proxy#580)
* Print build diagnostics in docker (linkerd/linkerd2-proxy#583)
* update test controllers to std::future/Tonic; remove threads (linkerd/linkerd2-proxy#585)
* buffer: Box the inner service's response future (linkerd/linkerd2-proxy#586)
* Eliminate Bind & Listen traits (linkerd/linkerd2-proxy#584)
* cache: replace Lock with Buffer (linkerd/linkerd2-proxy#587)
This PR adds a new CLI test to check that installation YAMLs are correctly
generated even on Windows. This is important because of the file path
differences between Windows and Linux: any code that uses the wrong
format might cause the chart generation commands to fail on Windows.
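A small sketch of the kind of path handling such a test guards against: `filepath.Join` uses the OS-specific separator, while `path.Join` or a hard-coded `/` always produce forward slashes:

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
)

func main() {
	// On Windows this prints charts\linkerd2\values.yaml; on Linux it uses "/".
	fmt.Println(filepath.Join("charts", "linkerd2", "values.yaml"))
	// path.Join always uses "/", regardless of the OS.
	fmt.Println(path.Join("charts", "linkerd2", "values.yaml"))
}
```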
This creates a separate workflow for both release and integration.
Also, all the existing integration tests are moved into
/tests/integration to separate them from /test/cli, as this test does not fall
under the integration tests category.