Commit Graph

118 Commits

Author SHA1 Message Date
Alejandro Pedraza 59eca5bb82
Use full-length actions SHAs in CI (#5668)
GitHub now requires that actions pinned to a SHA use the full-length SHA,
otherwise CI throws a warning like:
```
actions/checkout@722adc6 looks like the shortened version of a commit
SHA. Referencing actions by the short SHA will be disabled soon. Please
see
https://docs.github.com/en/actions/learn-github-actions/security-hardening-for-github-actions#using-third-party-actions.
```
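
A quick way to spot the affected references is to scan the workflow files for `uses:` lines whose ref is all hex but shorter than 40 characters. The sketch below is illustrative only and is not part of this change; the function name and the sample workflow text are made up, and a tag or branch whose name happens to be all hex would be flagged too.

```python
import re

# Matches `uses: owner/repo[/path]@ref` lines in a workflow file.
USES_RE = re.compile(
    r"^\s*(?:-\s*)?uses:\s*(?P<action>[^@\s]+)@(?P<ref>\S+)", re.MULTILINE
)
# A ref that is all hex but shorter than 40 characters looks like a shortened SHA.
SHORT_SHA_RE = re.compile(r"[0-9a-f]{7,39}")


def find_short_sha_pins(workflow_text):
    """Return (action, ref) pairs whose ref looks like an abbreviated commit SHA."""
    return [
        (m.group("action"), m.group("ref"))
        for m in USES_RE.finditer(workflow_text)
        if SHORT_SHA_RE.fullmatch(m.group("ref"))
    ]


# Hypothetical workflow snippet: one short SHA, one tag, one full-length SHA.
example = """
steps:
  - uses: actions/checkout@722adc6
  - uses: actions/cache@v2
  - uses: actions/setup-go@0123456789abcdef0123456789abcdef01234567
"""
print(find_short_sha_pins(example))
```

Only the first `uses:` line is reported; the tag and the 40-character SHA pass.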
2021-02-04 16:51:53 -05:00
Alejandro Pedraza 8ac5360041
Extract all the Prometheus dependencies from public-api and move them into a new viz component 'linkerd-metrics-api' (#5560)
* Protobuf changes:
- Moved `healthcheck.proto` back from viz to `proto/common` as it is still used by the main `healthcheck.go` library (it was moved to viz by #5510).
- Extracted from `viz.proto` the IP-related types and put them in `/controller/gen/common/net` to be used by both the public and the viz APIs.

* Added chart templates for new viz linkerd-metrics-api pod

* Spin-off viz healthcheck:
- Created `viz/pkg/healthcheck/healthcheck.go` that wraps the original `pkg/healthcheck/healthcheck.go` while adding the `vizNamespace` and `vizAPIClient` fields which were removed from the core `healthcheck`. That way the core healthcheck doesn't have any dependencies on viz, and viz' healthcheck can now be used to retrieve viz api clients.
- The core and viz healthcheck libs are now abstracted out via the new `healthcheck.Runner` interface.
- Refactored the data plane checks so they don't rely on calling `ListPods`
- The checks in `viz/cmd/check.go` have been moved to `viz/pkg/healthcheck/healthcheck.go` as well, so `check.go`'s sole responsibility is dealing with command business. This command also now retrieves its viz api client through viz' healthcheck.

* Removed linkerd-controller dependency on Prometheus:
- Removed the `global.prometheusUrl` config in the core values.yml.
- Leave the Heartbeat's `-prometheus` flag hard-coded temporarily. TO-DO: have it automatically discover viz and pull Prometheus' endpoint (#5352).

* Moved observability gRPC from linkerd-controller to viz:
- Created a new gRPC server under `viz/metrics-api`, moving Prometheus-dependent functions out of the core gRPC server and into it (same thing for the accompanying HTTP server).
- Did the same for the `PublicAPIClient` (now called just `Client`) interface. The `VizAPIClient` interface disappears as it's enough to just rely on the viz `ApiClient` protobuf type.
- Moved the other files implementing the rest of the gRPC functions from `controller/api/public` to `viz/metrics-api` (`edge.go`, `stat_summary.go`, etc.).
- Also simplified some type names to avoid stuttering.

* Added linkerd-metrics-api bootstrap files. At the same time, we strip out of the public-api's `main.go` file the prometheus parameters and other no longer relevant bits.

* linkerd-web updates: it requires connecting with both the public-api and the viz api, so both addresses (and the viz namespace) are now provided as parameters to the container.

* CLI updates and other minor things:
- Changes to command files under `cli/cmd`:
  - Updated `endpoints.go` according to new API interface name.
  - Updated `version.go`, `dashboard` and `uninstall.go` to pull the viz namespace dynamically.
- Changes to command files under `viz/cmd`:
  - `edges.go`, `routes.go`, `stat.go` and `top.go`: point to dependencies that were moved from public-api to viz.
- Other changes to have tests pass:
  - Added `metrics-api` to list of docker images to build in actions workflows.
  - In `bin/fmt` exclude protobuf generated files instead of entire directories because directories could contain both generated and non-generated code (case in point: `viz/metrics-api`).

* Add retry to 'tap API service is running' check

* mc check shouldn't err when viz is not available. Also properly set the log in multicluster/cmd/root.go so that it properly displays messages when --verbose is used
2021-01-21 18:26:38 -05:00
Matei David c63fbdf0e4
Introduce OpenAPIV3 validation for CRDs (#5573)
* Introduce OpenAPIV3 validation for CRDs

* Add validation to link crd
* Add validation to sp using kube-gen
* Add openapiv3 under schema fields in specific versions
* Modify fields to rid spec of yaml errors
* Add top level validation for all three CRDs

Signed-off-by: Matei David <matei.david.35@gmail.com>
2021-01-21 11:56:28 -05:00
Kevin Leimkuhler 308a1f3ff3
Use linkerd path in test-cleanup (#5498)
## What this fixes

When clusters are cleaned up after tests in CI, the `bin/test-cleanup` script is
responsible for clearing the cluster of all testing resources.

Right now this does not work as expected because the script uses the `linkerd`
binary instead of the Linkerd path that is passed in to the `tests` script.

There are cases where different binaries have different uninstall behavior and
the script can complete with an incomplete uninstallation.

## How it fixes

`test-cleanup` now takes a linkerd path argument. This is used to specify the
Linkerd binary that should be used when running in the `uninstall` commands.

This value is passed through from the `tests` invocation which means that in CI,
the same binary is used for running tests as well as cleaning up the cluster.

Additionally, specifying the k8s context has now moved from an argument to the
`--context` flag. This is similar to how `tests` script works because it's not
always required.

## How to use

Shown here:

```
$ bin/test-cleanup -h
Cleanup Linkerd integration tests.

Usage:
    test-cleanup [--context k8s_context] /path/to/linkerd

Examples:
    # Cleanup tests in non-default context
    test-cleanup --context k8s_context /path/to/linkerd

Available Commands:
    --context: use a non-default k8s context
```
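
For readers who want to see the shape of the new interface outside of the shell script, here is a rough Python sketch of the same argument layout: an optional `--context` flag and a required path to the Linkerd binary. It only illustrates the interface described above; the real `bin/test-cleanup` is a shell script, and `build_parser` is a made-up name.

```python
import argparse


def build_parser():
    """Mirror the test-cleanup usage: [--context k8s_context] /path/to/linkerd."""
    parser = argparse.ArgumentParser(
        prog="test-cleanup",
        description="Cleanup Linkerd integration tests.",
    )
    # Optional: only needed when not using the default k8s context.
    parser.add_argument(
        "--context",
        metavar="k8s_context",
        default=None,
        help="use a non-default k8s context",
    )
    # Required: the Linkerd binary used for the uninstall commands.
    parser.add_argument("linkerd_path", metavar="/path/to/linkerd")
    return parser


args = build_parser().parse_args(["--context", "k8s_context", "/path/to/linkerd"])
print(args.context, args.linkerd_path)
```

Leaving out `--context` keeps `args.context` as `None`, which corresponds to using the default context.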

## edge-21.1.1

This edge release introduces a new "opaque transport" feature that allows the
proxy to securely transport server-speaks-first and otherwise opaque TCP
traffic. Using the `config.linkerd.io/opaque-ports` annotation on pods and
namespaces, users can configure ports that should skip the proxy's protocol
detection.

Additionally, a new `linkerd-viz` extension has been introduced that separates
the installation of the Grafana, Prometheus, web, and tap components. This
extension closely follows the Jaeger and multicluster extensions; users can
`install` and `uninstall` with the `linkerd viz ..` command as well as configure
for HA with the `--ha` flag.

The `linkerd viz install` command does not have any cli flags to customize the
install directly, but instead follows the Helm way of customization by using
flags such as `set`, `set-string`, `values`, `set-files`.

Finally, a new `/shutdown` admin endpoint that may only be accessed over the
loopback network has been added. This allows batch jobs to gracefully terminate
the proxy on completion. The `linkerd-await` utility can be used to automate
this.
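
As a rough illustration of how a batch job could call the endpoint itself as
its last step, here is a minimal Python sketch. It assumes the proxy's admin
server listens on port 4191 and that `/shutdown` expects a `POST`; neither
detail is stated in these notes, so check them against your install. The
function names are hypothetical.

```python
import urllib.request

# Assumption: 4191 is the proxy's admin port; adjust if your install differs.
ADMIN_PORT = 4191


def build_shutdown_request(port=ADMIN_PORT):
    """Build (but do not send) a request to the proxy's /shutdown endpoint."""
    # The endpoint is only reachable over loopback, so the host is localhost.
    # Assumption: the endpoint expects a POST with an empty body.
    return urllib.request.Request(
        f"http://localhost:{port}/shutdown", data=b"", method="POST"
    )


def shutdown_proxy(port=ADMIN_PORT, timeout=5.0):
    """Send the shutdown request; intended as the last step of a batch job."""
    with urllib.request.urlopen(build_shutdown_request(port), timeout=timeout) as resp:
        return resp.status
```

Calling `shutdown_proxy()` only makes sense from inside the injected pod, since
the endpoint is not reachable from other hosts.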

* Added a new `linkerd multicluster check` command to validate that the
  `linkerd-multicluster` extension is working correctly
* Fixed description in the `linkerd edges` command (thanks @jsoref!)
* Moved the Grafana, Prometheus, web, and tap components into a new Viz chart,
  following the same extension model that multicluster and Jaeger follow
* Introduced a new "opaque transport" feature that allows the proxy to securely
  transport server-speaks-first and otherwise opaque TCP traffic
* Removed the check comparing the `ca.crt` field in the identity issuer secret
  and the trust anchors in the Linkerd config; these values being different is
  not a failure case for the `linkerd check` command (thanks @cypherfox!)
* Removed the Prometheus check from the `linkerd check` command since it now
  depends on a component that is installed with the Viz extension
* Fixed error messages thrown by the cert checks in `linkerd check` (thanks
  @pradeepnnv!)
* Added PodDisruptionBudgets to the control plane components so that they cannot
  be all terminated at the same time during disruptions (thanks @tustvold!)
* Fixed an issue that displayed the wrong `linkerd.io/proxy-version` when it is
  overridden by annotations (thanks @mateiidavid!)
* Added support for custom registries in the `linkerd-viz` helm chart (thanks
  @jimil749!)
* Renamed `proxy-mutator` to `jaeger-injector` in the `linkerd-jaeger` extension
* Added a new `/shutdown` admin endpoint that may only be accessed over the
  loopback network allowing batch jobs to gracefully terminate the proxy on
  completion
* Introduced the `linkerd identity` command, used to fetch the TLS certificates
  for injected pods (thanks @jimil749)
* Fixed an issue with the CNI plugin where it was incorrectly terminating and
  emitting error events (thanks @mhulscher!)
* Re-added support for non-LoadBalancer service types in the
  `linkerd-multicluster` extension

Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
2021-01-08 15:24:14 -05:00
Kevin Leimkuhler dd837be375
Build jaeger-webhook in release CI (#5381)
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
2020-12-14 11:42:51 -05:00
Joakim Roubert 377b38f0bf
bin/protoc-diff: Don't assume Debian and don't install unzip (#5347)
- Do unzip check but don't install; leave installation to user
- Move unzip check to bin/protoc that actually uses unzip
- Make sure the protoc scripts can be called from any directory

Fixes #5337

Signed-off-by: Joakim Roubert <joakim.roubert@axis.com>
2020-12-09 09:12:38 -05:00
Alex Leong cdc57d1af0
Use linkerd-jaeger extension for control plane tracing (#5299)
Now that tracing has been split out of the main control plane and into the linkerd-jaeger extension, we remove references to tracing from the main control plane including:

* removing the tracing components from the main control plane chart
* removing the tracing injection logic from the main proxy injector and inject CLI (these will be added back into the new injector in the linkerd-jaeger extension)
* removing tracing related checks (these will be added back into `linkerd jaeger check`)
* removing related tests

We also update the `--control-plane-tracing` flag to configure the control plane components to send traces to the linkerd-jaeger extension.  To make sure this works even when the linkerd-jaeger extension is installed in a non-default namespace, we also add a `--control-plane-tracing-namespace` flag which can be used to change the namespace that the control plane components send traces to.

Note that for now, only the control plane components send traces; the proxies in the control plane do not.  This is because the linkerd-jaeger injector is not yet available.  However, this change adds the appropriate namespace annotations to the control plane namespace to configure the proxies to send traces to the linkerd-jaeger extension once the linkerd-jaeger injector is available.

I tested this by doing the following:

1. bin/linkerd install | kubectl apply -f -
1. bin/helm install jaeger jaeger/charts/jaeger
1. bin/linkerd upgrade --control-plane-tracing=true | kubectl apply -f -
1. kubectl -n linkerd-jaeger port-forward svc/jaeger 16686
1. open http://localhost:16686
1. see traces from the linkerd control plane

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-12-08 14:34:26 -08:00
Alejandro Pedraza 94574d4003
Add automatic readme generation for charts (#5316)
* Add automatic readme generation for charts

The current readme for each chart is generated
manually and doesn't contain all the information available.

Utilize helm-docs to automatically fill out readme.mds
for the helm charts by pulling metadata from values.yml.

Fixes #4156

Co-authored-by: GMarkfjard <gabma047@student.liu.se>
2020-12-02 14:37:45 -05:00
Alejandro Pedraza deca7ede08
Consolidate integration tests under k3d (#5245)
* Consolidate integration tests under k3d

Fixes #5007

Simplified integration tests by moving all to k3d. Previously things were running in Kind, except for the multicluster tests, which implied some extra complexity in the supporting scripts.

Removed the KinD config files under `test/integration/configs`, as config is now passed as flags into the `k3d` command.

Also renamed `kind_integration.yml` to `integration_tests.yml`

Test skipping logic under ARM was also simplified.
2020-11-18 14:33:16 -05:00
Alex Leong 1a91f6b0df
Increase ARM integration test timeout (#5222)
The ARM integration tests take a very long time to run for some reason.  For example, in the stable-2.9.0 release, they took 38 minutes.  Thus, this test needs a longer timeout.

Increase the ARM integration test timeout from 30 minutes to 60 minutes.

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-11-12 13:57:28 -08:00
Oliver Gould 440a201997
actions: Limit job runtime to <= 30 minutes (#5216)
The default job timeout is 6 hours! This allows runaway builds to
consume our actions resources unnecessarily.

This change limits integration test jobs to 30 minutes. Static checks
are limited to 10 minutes.
2020-11-12 08:30:02 -08:00
Alex Leong 15bd95ee1d
Trigger ARM int tests for edge releases as well (#5073) (#5120)
Used to be triggered only for stable releases, but now that 2.9 stable
approaches let's turn it on for the upcoming RCs.

Signed-off-by: Alex Leong <alex@buoyant.io>

Co-authored-by: Alejandro Pedraza <alejandro@buoyant.io>
2020-10-21 14:29:13 -07:00
Alex Leong 827646a3e1
Revert "Trigger ARM int tests for edge releases as well" (#5087)
This reverts commit 85cbcb4a85.

We disable the ARM integration tests for now until we have more confidence in them.

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-10-15 09:28:48 -07:00
Alex Leong fbc405d5b4
Fix incorrect usage of --skip-kind-create flag (#5084)
The release workflow uses the `-skip-kind-create` flag when the flag is actually called `-skip-cluster-create`.  This causes the workflow to fail.

We correct the flag name.

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-10-14 11:30:27 -07:00
Alejandro Pedraza 865e140be9
Trigger ARM int tests for edge releases as well (#5073)
Used to be triggered only for stable releases, but now that 2.9 stable
approaches let's turn it on for the upcoming RCs.
2020-10-14 10:54:07 -05:00
Alejandro Pedraza 3af25fa886
Fix how env vars are set in CI (#5054)
Replaced `set-env` directives with environment files, as explained
[here](https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/)

This gets rid of warnings of the sort:
```
The `set-env` command is deprecated and will be disabled soon. Please
upgrade to using Environment Files. For more information see:
https://github.blog/changelog/2020-10-01-github-actions-deprecating-set-env-and-add-path-commands/
```
2020-10-09 19:24:41 -07:00
Alejandro Pedraza e1772ae183
Fixed releases.yaml by pulling images directly from ghcr.io (#5035)
Previously, `releases.yaml` was trying to load images into the kind
clusters but that failed because those images were already in `ghcr.io`
and not in the local docker cache, but that failure was masked.
Unmasking that failure revealed some flaws that this change addresses:

- In `bin/_test_helpers` (used by `bin/tests`), modified the `images`
arg to accept `docker(default)|archive|skip`, for determining how to
load the images into the cluster (if loading them at all)
- In `bin/image-load`, changed arg `images` to `archive` which is more
descriptive.
- Have `kind_integration.yml` call `bin/tests --images archive`.
- Have `release.yml` call `bin/tests --images skip`.
2020-10-02 08:05:17 -05:00
Alejandro Pedraza b50ae6290d
Add support for k3d in integration tests (#4994)
* Add support for k3d in integration tests

KinD doesn't support setting LoadBalancer services out of the box. It can be added with some additional work, but it seems the solutions are not cross-platform.

K3d on the other hand facilitates this, so we'll be using k3d clusters for the multicluster integration test.

The current change sets the ground by generalizing some of the integration tests operations that were hard-coded to KinD.

- Added `bin/k3d` to wrap the setup and running of a pinned version of `k3d`.
- Refactored `bin/_test-helpers.sh` to account for tests to be run in either KinD or k3d.
- Renamed `bin/kind-load` to `bin/image-load` and make it more generic to load images for both KinD (default) and k3d. Also got rid of the no longer used `--images-host` option.
- Added a placeholder for the new `multicluster` test in the lists in `bin/_test-helpers.sh`. It starts by setting up two k3d clusters.

* Refactor handling of the `--multicluster` flag in integration tests (#4995)

Followup to #4994, based off of that branch (`alpeb/k3d-tests`).
This is more preliminary work previous to the more complete multicluster integration test.

- Removed the `--multicluster` flag from all the tests we had in `bin/_test-helpers.sh`, so only the new "multicluster" integration test will make use of that. Also got rid of the `TestUninstallMulticluster()` test in `install_test.go` to keep the multicluster stuff around, needed for the more complete multicluster test that will be implemented in a followup PR.
- Added "multicluster" to the list of tests in the `kind_integration.yml` workflow.
- For now, this new "multicluster" test in `run_multicluster_test()` is just running the install tests (`test/integration/install_test.go`) with the `--multicluster` flag.

Co-authored-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
2020-09-25 16:33:17 -05:00
Tarun Pothulapati ecce5b91f6
tests: Add Calico CNI deep integration tests (#4952)
* tests: Add new CNI deep integration tests

Fixes #3944

This PR adds a new test, called cni-calico-deep which installs the Linkerd CNI
plugin on top of a cluster with Calico and performs the current integration tests on top, thus
validating various Linkerd features when CNI is enabled. For Calico
to work, special config is required for kind which is at `cni-calico.yaml`

This is different from the CNI integration tests that we run in
cloud integration which performs the CNI level integration tests.

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2020-09-23 19:58:28 +05:30
Alejandro Pedraza d6bcd1e906
Only run the ARM integration tests for stable releases (#4986)
2020-09-21 09:12:00 -05:00
Alejandro Pedraza 68582c5f5b
Do not run cloud integration tests in CI (#4969)
* Do not run cloud integration tests in CI

Closes #4963

Removed the `./.github/workflows/cloud_integration.yml` workflow, and
removed the `cloud_integration_tests` job from the `./.github/workflows/release.yml` workflow.
2020-09-16 09:36:04 -05:00
Alejandro Pedraza ccf027c051
Push docker images to ghcr.io instead of gcr.io (#4953)
* Push docker images to ghcr.io instead of gcr.io

The `cloud_integration.yml` and `release.yml` workflows were modified to
log into ghcr.io, and remove the `Configure gcloud` step which is no
longer necessary.

Note that besides the changes to cloud_integration.yml and release.yml, there was a change to the upgrade-stable integration test so that we do linkerd upgrade --addon-overwrite to reset the addons settings because in stable-2.8.1 the Grafana image was pegged to gcr.io/linkerd-io/grafana in linkerd-config-addons. This will need to be mentioned in the 2.9 upgrade notes.

Also the egress integration test has a debug container that now is pegged to the edge-20.9.2 tag.

Besides that, the other changes are just a global search and replace (s/gcr.io\/linkerd-io/ghcr.io\/linkerd/).
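
For reference, the same substitution can be sketched in Python. This only mirrors the `s/.../.../` expression above and is not how the change was actually applied; `rewrite_registry` is a made-up name.

```python
import re

# Same pattern as the sed expression above, with the dots escaped.
OLD_REGISTRY = re.compile(r"gcr\.io/linkerd-io")


def rewrite_registry(text):
    """Replace every gcr.io/linkerd-io reference with ghcr.io/linkerd."""
    return OLD_REGISTRY.sub("ghcr.io/linkerd", text)


print(rewrite_registry("image: gcr.io/linkerd-io/grafana:stable-2.8.1"))
```

The rewrite is idempotent: `ghcr.io/linkerd` no longer matches the pattern, so running it twice changes nothing.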
2020-09-10 15:16:24 -05:00
Tarun Pothulapati c4f8ba270d
Generate Identity certs with alternate domain names (#4920)
Upgrading only the Go version to 1.15 makes upgrades from older Linkerd versions fail,
because the identity certs lack alternate domain names and Go 1.15 expects them.
This PR updates the cert generation code to set that field,
allowing us to move to Go 1.15 in later versions of Linkerd.
2020-09-03 22:33:10 +05:30
Alejandro Pedraza 85b71ad786
Revert "Temporarily disable job `psscript-analyzer` in static checks (#4837)" (#4937)
This reverts #4837 which disabled the psscript-analyzer job that had an
issue. This upgrades it to version 2.3.0, which fixes the issue.
2020-09-03 11:54:22 -05:00
Ali Ariff 5186383c81
Add ARM64 Integration Test (#4897)
* Add ARM64 Integration Test

Signed-off-by: Ali Ariff <ali.ariff12@gmail.com>
2020-08-28 10:38:40 -07:00
Tharun Rajendran 8a2cb12656
Fix js unit test dependency on ci (#4896)
Unit tests in CI are run using yarn install,
so dependencies may get updated along the way.
This is fixed by passing --frozen-lockfile in the CI workflow.
Fixes #3838

Signed-off-by: Tharun <rajendrantharun@live.com>
2020-08-24 14:21:41 -07:00
Zahari Dichev 2e7c00aa37
Diff generated code from proto files (#4863)
Add a static check that ensures the generated files from the proto definitions have not changed. 

Fix #4669

Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
2020-08-18 11:44:33 +03:00
Alejandro Pedraza ac2bfb387b
Remove no longer needed GITCOOKIE_SH in CI (#4881)
Removed usage of `GITCOOKIE_SH`, which was a script stored in a secret
to authenticate requests against googlesource.com, to avoid hitting
rate limits when pulling go dependencies from that source. Now that we
use go modules, deps are pulled from http://proxy.golang.org/ and this
is no longer needed.
2020-08-17 12:36:07 -07:00
Ali Ariff ae8bb0e26e
Release ARM CLI artifacts (#4841)
* When releasing, build and upload the amd64, arm64 and arm architectures builds for the CLI
* Refactored `Dockerfile-bin` so it has separate stages for single and multi arch builds. The latter stage is only used for releases.

Signed-off-by: Ali Ariff <ali.ariff12@gmail.com>
2020-08-11 09:25:58 -05:00
Ali Ariff 61d7dedd98
Build ARM docker images (#4794)
Build ARM docker images in the release workflow.

# Changes:
- Add new env keys `DOCKER_MULTIARCH` and `DOCKER_PUSH`. When set, multi-arch images are built and pushed to the registry. See https://github.com/docker/buildx/issues/59 for why they must be pushed to the registry.
- Using `crazy-max/ghaction-docker-buildx` is necessary because it is already configured to perform cross-compilation (using QEMU), so we can use it instead of setting that up manually.
- Using `buildx` now makes the default global arguments available. (See: https://docs.docker.com/engine/reference/builder/#automatic-platform-args-in-the-global-scope)

# Follow-up:
- Releasing the CLI binary for the ARM architecture. The docker images resulting from these changes are already built for ARM. Still, we need further adjustments, such as how to retrieve those binaries and name them correctly as part of the GitHub Release artifacts.

Signed-off-by: Ali Ariff <ali.ariff12@gmail.com>
2020-08-05 11:14:01 -07:00
Alejandro Pedraza f38bdf8ecc
Temporarily disable job `psscript-analyzer` in static checks (#4837)
The job started failing consistently today with:
```
##[error]devblackops/github-action-psscriptanalyzer/v2/action.yml (Line:
30, Col: 9): Unexpected value ''
##[error]devblackops/github-action-psscriptanalyzer/v2/action.yml (Line:
30, Col: 9): Unexpected value ''
##[error]System.ArgumentException: Unexpected type 'NullToken'
encountered while reading 'outputs'. The type 'MappingToken' was
expected.
```

It seems it's something in Github that changed today that is clashing
with the `devblackops/github-action-psscriptanalyzer` action.

I've raised devblackops/github-action-psscriptanalyzer#12
2020-08-04 22:01:30 -07:00
Alejandro Pedraza a1be60aea1
Reenable `upgrade-edge` integration test (#4821)
Followup to #4797

That test was temporarily disabled until the prometheus check in
`linkerd check` got fixed in #4797 and made it into edge-20.7.5
2020-07-31 12:11:32 -05:00
Alejandro Pedraza 2aea2221ed
Fixed `linkerd check` not finding Prometheus (#4797)
* Fixed `linkerd check` not finding Prometheus

## The Problem

`linkerd check` run right after install is failing because it can't find the Prometheus Pod.

## The Cause

The "control plane pods are ready" check used to verify the existence of all the control plane pods, blocking until all the pods were ready.

Since #4724, Prometheus is no longer included in that check because it's checked separately as an add-on. An unintended consequence is that when the ensuing "control plane self-check" is triggered, Prometheus might not be ready yet and the check fails because it doesn't do retries.

## The Fix

The "control plane self-check" uses a gRPC call (it's the only check that does that) and those weren't designed with retries in mind.

This PR adds retry functionality to the `runCheckRPC()` function, making sure the final output remains the same

It also temporarily disables the `upgrade-edge` integration test because after installing edge-20.7.4 `linkerd check` will fail because of this.
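
The retry idea can be sketched as follows. This is a simplified Python illustration under assumed names (`CheckError`, `run_check_with_retries`), not the Go code in `runCheckRPC()`: the check is retried a bounded number of times, and the last error is surfaced only if every attempt fails.

```python
import time


class CheckError(Exception):
    """Raised by a check that has not passed (yet)."""


def run_check_with_retries(check, max_attempts=5, delay=1.0, sleep=time.sleep):
    """Call check() until it succeeds or max_attempts is exhausted."""
    last_err = None
    for attempt in range(1, max_attempts + 1):
        try:
            return check()
        except CheckError as err:
            last_err = err
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts:
                sleep(delay)
    raise last_err


# Hypothetical check that only passes once a dependency has become ready.
attempts = []


def flaky_check():
    attempts.append(1)
    if len(attempts) < 3:
        raise CheckError("not ready yet")
    return "ok"


print(run_check_with_retries(flaky_check, sleep=lambda _: None))
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays.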
2020-07-27 11:54:03 -05:00
Alejandro Pedraza 5e789ba152
Migrate CI to docker buildx and other improvements (#4765)
* Migrate CI to docker buildx and other improvements

## Motivation
- Improve build times in forks, especially when rerunning builds because of a flaky test.
- Start using `docker buildx` to pave the way for multiplatform builds.

## Performance improvements
These timings were taken for the `kind_integration.yml` workflow when we merged and reran the lodash bump PR (#4762)

Before these improvements:
- when merging: `24:18`
- when rerunning after merge (docker cache warm): `19:00`
- when running the same changes in a fork (no docker cache): `32:15`

After these improvements:
- when merging: `25:38`
- when rerunning after merge (docker cache warm): `19:25`
- when running the same changes in a fork (docker cache warm): `19:25`

As explained below, non-forks and forks now use the same cache, so the important take is that forks will always start with a warm cache and we'll no longer see long build times like the `32:15` above.
The downside is a slight increase in the build times for non-forks (up to a little more than a minute, depending on the case).

## Build containers in parallel
The `docker_build` job in the `kind_integration.yml`, `cloud_integration.yml` and `release.yml` workflows relied on running `bin/docker-build` which builds all the containers in sequence. Now each container is built in parallel using a matrix strategy.

## New caching strategy
CI now uses `docker buildx` for building the container images, which allows using an external cache source for builds, a location in the filesystem in this case. That location gets cached using actions/cache, using the key `${{ runner.os }}-buildx-${{ matrix.target }}-${{ env.TAG }}` and the restore key `${{ runner.os }}-buildx-${{ matrix.target }}-`.

For example when building the `web` container, its image and all the intermediary layers get cached under the key `Linux-buildx-web-git-abc0123`. When that has been cached in the `main` branch, that cache will be available to all the child branches, including forks. If a new branch in a fork asks for a key like `Linux-buildx-web-git-def456`, the key won't be found during the first CI run, but the system falls back to the key `Linux-buildx-web-git-abc0123` from `main` and so the build will start with a warm cache (more info about how keys are matched in the [actions/cache docs](https://docs.github.com/en/actions/configuring-and-managing-workflows/caching-dependencies-to-speed-up-workflows#matching-a-cache-key)).
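
The fallback described above can be modelled with a small Python sketch. It is a simplification of how actions/cache matches keys (it ignores branch scoping, for instance), and `resolve_cache` is a made-up helper rather than any real API: an exact key match wins, otherwise the most recent entry matching a restore-key prefix is used.

```python
def resolve_cache(key, restore_keys, entries):
    """Pick a cache entry for `key`.

    `entries` lists the existing cache keys, ordered oldest to newest.
    Returns the matched key, or None on a complete cache miss.
    """
    # An exact hit always wins.
    if key in entries:
        return key
    # Otherwise fall back to the newest entry sharing a restore-key prefix.
    for prefix in restore_keys:
        matches = [entry for entry in entries if entry.startswith(prefix)]
        if matches:
            return matches[-1]
    return None


# The only existing entry was created by a build on `main`.
entries = ["Linux-buildx-web-git-abc0123"]
print(resolve_cache("Linux-buildx-web-git-def456", ["Linux-buildx-web-"], entries))
```

Here the new branch's key is not found, so the `main` entry is restored through the prefix and the build starts warm.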

## Packet host no longer needed
To benefit from the warm caches both in non-forks and forks, as just explained, we have to stop doing the builds in Packet; everything now runs in the GitHub runner VMs.
As a result there's no longer separate logic for non-forks and forks in the workflow files; `kind_integration.yml` was greatly simplified but `cloud_integration.yml` and `release.yml` got a little bigger in order to use the actions artifacts as a repository for the images built. This bloat will be fixed when support for [composite actions](https://github.com/actions/runner/blob/users/ethanchewy/compositeADR/docs/adrs/0549-composite-run-steps.md) lands in github.

## Local builds
You still are able to run `bin/docker-build` or any of the `docker-build.*` scripts. And to make use of buildx, run those same scripts after having set the env var `DOCKER_BUILDKIT=1`. Using buildx supposes you have installed it, as instructed [here](https://github.com/docker/buildx).

## Other
- A new script `bin/docker-cache-prune` is used to remove unused images from the cache. Without that the cache grows constantly and we can rapidly hit the 5GB limit (when the limit is attained the oldest entries get evicted).
- The `go-deps` dockerfile base image was changed from `golang:1.14.2` (ubuntu based) to `golang:1.14.2-alpine` also to conserve cache space.

# Addressed separately in #4875:

Got rid of the `go-deps` image and instead added something similar on top of all the Dockerfiles dealing with `go`, as a first stage for those Dockerfiles. That continues to serve as a way to pre-populate go's build cache, which speeds up the builds in the subsequent stages. That build should in theory be rebuilt automatically only when `go.mod` or `go.sum` change, and now we don't require running `bin/update-go-deps-shas`. That script was removed along with all the logic elsewhere that used it, including the `go_dependencies` job in the `static_checks.yml` github workflow.

The list of modules preinstalled was moved from `Dockerfile-go-deps` to a new script `bin/install-deps`. I couldn't find a way to generate that list dynamically, so whenever a slow-to-compile dependency is found, we have to make sure it's included in that list.

Although this simplifies the dev workflow, note that the real motivation behind this was a limitation in buildx's `docker-container` driver that forbids us from depending on images that haven't been pushed to a registry, so we have to resort to building the dependencies as a first stage in the Dockerfiles.
2020-07-22 14:27:45 -05:00
Andrew Seigner 8773416496
Fix build status badge in README (#4769)
The GitHub Actions build status badge was referencing an old workflow
named `CI`, which always shows red, and is no longer used.

Fix the build status badge to reference the `Release` workflow.

Also slightly reformat the matrix build yaml, as the list has grown a
bit.

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2020-07-20 17:18:12 -07:00
Alejandro Pedraza 873bd61324
Helm integration deep tests (#4728)
This creates a new integration test target that launches the deep suite,
using a linkerd instance installed through Helm.

I've added a `global.proxyInit.ignoreInboundPorts=1234,5678` override
during install and enhanced the injection test to catch problems like
what we saw in #4679.
2020-07-10 14:48:49 -05:00
Alejandro Pedraza a30213b709
Increased node count for the GKE tests in release.yml (#4748)
Followup to #4746. Tests were running out of resources.
2020-07-10 11:05:12 -05:00
Kevin Leimkuhler e482ed4410
Use 2 nodes in cloud integration tests (#4746)
The deep integration tests started failing on GKE.

Originally, this was thought to be a cleanup issue, but we have not cleaned up
deep integration tests in the past. We install Linkerd once, and then run all
the tests serially.

Given that it's been a while since we've run the full deep tests on GKE, we may
just need more resources when running them now.

This increases the node count of the GKE cluster that we run on from 1 to 2.

Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
2020-07-09 18:23:45 -07:00
Alejandro Pedraza 9908b2b8b2
Re-enable custom domain integration test (#4722)
The function triggering the test for k8s custom cluster domain was
misnamed, and thus the test wasn't being run.

This also adds some extra error handling to catch this and other
potential issues.
2020-07-07 16:27:46 -05:00
Tarun Pothulapati cf34a14985
Add a Windows Linkerd cli Test (#4653)
This PR adds a new CLI test that checks whether the installation YAMLs are
generated correctly on Windows. This matters because file paths differ between
Windows and Linux, and any code that uses the wrong path format could cause the
chart generation commands to fail on Windows.

This creates a separate workflow for both release and integration.

Also, all the existing integration tests are moved into
/tests/integration to separate them from /test/cli, as this test does not fall
under the integration tests category.
2020-07-02 23:13:57 +05:30
Kevin Leimkuhler 29bcb57de4
Update release workflow kind integration tests (#4668)
## Description

As discussed [here](https://github.com/linkerd/linkerd2/pull/4653#discussion_r445543061), the `kind_integration` job of the release workflow was not kept in sync with the changes made in #4593.

Until GitHub Actions can reuse YAML across separate workflows, these sections have to be kept in sync manually.

This would be an issue if we had tried doing a release since #4593 merged, but that has not happened yet.

## Changes

This updates the release workflow `kind_integration` job to use the new test interface, mainly removing cluster creation and image loading as necessary prerequisites.

Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
2020-06-25 13:01:04 -04:00
Kevin Leimkuhler 4372ed56dd
Isolate tests by cluster and make run interface simpler (#4593)
## Summary

Change the default behavior of integration tests to be isolated by cluster.
Additionally, make running one or all tests easier than the current process.

These changes are explained more in the [Testing
RFC](https://github.com/linkerd/rfc/blob/master/design/0004-isolated-integration-tests.md)

## Changes

This is a script used only by Linkerd developers, but there are many useful
usage examples and explanations in the `bin/tests --help` output:

```
Run Linkerd integration tests.

Optionally specify one of the following tests: [upgrade helm helm-upgrade uninstall deep external-issuer]

Usage:
    tests [--images] [--images-host ssh://linkerd-docker] [--name test-name] [--skip-kind-create] /path/to/linkerd

Examples:
    # Run all tests in isolated clusters
    tests /path/to/linkerd

    # Run single test in isolated clusters
    tests --name test-name /path/to/linkerd

    # Skip KinD cluster creation and run all tests in default cluster context
    tests --skip-kind-create /path/to/linkerd

    # Load images from tar files located under the 'image-archives' directory
    # Note: This is primarly for CI
    tests --images /path/to/linkerd

    # Retrieve images from a remote docker instance and then load them into KinD
    # Note: This is primarly for CI
    tests --images --images-host ssh://linkerd-docker /path/to/linkerd

Available Commands:
    --name: the argument to this option is the specific test to run
    --skip-kind-create: skip KinD cluster creation step and run tests in an existing cluster.
    --images: (Primarily for CI) use 'kind load image-archive' to load the images from local .tar files in the current directory.
    --images-host: (Primarily for CI) the argument to this option is used as the remote docker instance from which images are first retrieved (using 'docker save') to be then loaded into KinD. This command requires --images.
```

### Run all tests

Old:

```bash
bin/test-run $PWD/bin/linkerd
```

New:

```bash
bin/tests $PWD/bin/linkerd
```

### Run single test (upgrade for example):

Current:

```bash
. bin/_test-run.sh
init_test_run $PWD/bin/linkerd
upgrade_integration_tests
```

New:

```bash
bin/tests --name upgrade $PWD/bin/linkerd
```

### Run tests in isolated KinD clusters

Current: Not possible without running single tests in newly created clusters
manually

New:

```bash
bin/tests $PWD/bin/linkerd
```

### Run tests in isolated namespaces on an existing cluster

Old:

```bash
bin/test-run $PWD/bin/linkerd
```

New:

```bash
bin/tests --skip-kind-create $PWD/bin/linkerd
```

## CI

`kind_integration` has been updated so that it does not create a KinD cluster as
part of its test setup.

`cloud_integration` passes the `--skip-kind-create` flag so that the tests are
run serially in a non-KinD cluster.


Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
2020-06-24 17:06:29 -04:00
Alejandro Pedraza d842a97cb2
Update CI and docs to reference `main` branch (#4662)
Files changed:

```
.github/PULL_REQUEST_TEMPLATE.md
.github/workflows/cloud_integration.yml
.github/workflows/kind_integration.yml
.github/workflows/release.yml
.github/workflows/static_checks.yml
.github/workflows/unit_tests.yml
BUILD.md
CONTRIBUTING.md
bin/test-scale
bin/win/linkerd.nuspec
```
2020-06-24 12:39:22 -07:00
Alejandro Pedraza ba420f2fac
Fix release workflow - avoid downloading choco package in edges (#4638)
Added guard against trying to download choco package when not doing a
stable release.
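
A minimal sketch of such a guard, assuming release tags are prefixed with `stable-` or `edge-`. The function names and the example tags are illustrative only and are not the workflow's actual steps.

```bash
# Assumption: stable release tags start with "stable-"; anything else is not stable.
is_stable_tag() {
  case "$1" in
    stable-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical step: only touch the Chocolatey package for stable tags.
maybe_fetch_choco() {
  if is_stable_tag "$1"; then
    echo "fetch choco package for $1"
  else
    echo "skip choco package for $1"
  fi
}

maybe_fetch_choco "stable-2.8.0"
maybe_fetch_choco "edge-20.6.3"
```

Keeping the check in a small function makes it easy to reuse in every step that depends on the package being present.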
2020-06-18 16:02:53 -05:00
Alejandro Pedraza 2696ea94dd
Fix release workflow dependencies on choco_pack (#4635)
The `choco_pack` job only runs for stable tags. So that jobs depending on it
can still run on non-stable tags, we need to move this tag check from the
`choco_pack` job level down into its steps.
2020-06-18 13:50:19 -05:00
Kevin Leimkuhler b0765c4361
Add integration test for upgrading from edge (#4557)
This adds an integration test for upgrading from the latest edge to the current
build.

Closes #4471

Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
2020-06-16 09:18:52 -07:00
Alejandro Pedraza d10ed2aa5e
CI steps for Chocolatey package - take 2 (#4536)
* CI steps for Chocolatey package - take 2

Followup to #4205, supersedes #4205

This adds:

- A new `psscript-analyzer` job in the `static_checks.yml`
workflow for linting the Chocolatey PowerShell script.
- A new `choco_pack` job in the `release.yml` workflow for
updating the Chocolatey spec file and generating the
package. This is only triggered for stable releases. It requires
a windows runner in order to run the choco tooling (in theory
it should have worked on a linux runner but in practice it
didn't).
- The `Create release` step was updated to upload the generated package,
if present.
- The source file path in `bin/win/linkerd.nuspec` was updated
to make this work.

* Name the nupkg file consistently with the other release assets
2020-06-15 16:42:50 -05:00
Kevin Leimkuhler 8f5ff8d973
Wait for KinD nodes to be ready in CI (#4488)
* Wait for all nodes to be ready in CI
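
One way to express this wait, sketched with `kubectl wait`. The `KUBECTL` override and the 120s default timeout are assumptions made for illustration and are not taken from the CI scripts.

```bash
# Block until every node reports the Ready condition, or fail after the timeout.
# Assumption: KUBECTL may point at a different kubectl binary; defaults to "kubectl".
wait_for_nodes() {
  "${KUBECTL:-kubectl}" wait --for=condition=Ready node --all --timeout="${1:-120s}"
}

# Example (requires a running cluster):
#   wait_for_nodes 300s
```

Calling this right after cluster creation keeps later steps from starting while nodes are still coming up.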

Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
2020-05-28 13:56:09 -07:00
Alex Leong 8b04a657e0
Fix typo in release workflow (#4475)
This should fix the warning in the release action: https://github.com/linkerd/linkerd2/actions/runs/111938670

Signed-off-by: Alex Leong <alex@buoyant.io>
2020-05-26 09:27:25 -07:00
Kevin Leimkuhler 2e1eb9e2ec
Use bin/kind in CI scripts (#4464)
Create kind clusters using bin script instead of GitHub action

Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
2020-05-21 16:22:23 -07:00