Using port `80` opens services up to all sorts of unwanted internet
traffic and, furthermore, we don't even want to serve HTTP on this port
since we always employ Linkerd's mTLS.
This changes the gateway's `incomingPort` to 4180 and the `probePort` to
4181 to match Linkerd's other ports, which live in the 41XX range.
shellcheck will not accept the bare string DO, since it cannot tell whether it is a misspelled `do` keyword or a literal string. Explicitly quoting it resolves the warning.
Signed-off-by: Joakim Roubert <joakimr@axis.com>
The SC1090 "Can't follow non-constant source" issue is addressed as suggested in shellcheck's documentation: the source paths are pointed out in shellcheck comments. By passing the bin directory via shellcheck's -P CLI parameter, we avoid having to state it in each and every script file.
Signed-off-by: Joakim Roubert <joakimr@axis.com>
Remove superfluous echo commands in assignments.
Add quotes.
Simplify the for loops that shellcheck didn't like.
Signed-off-by: Joakim Roubert <joakimr@axis.com>
Follow-up to #4341
Replaced all the `t.Error`/`t.Fatal` calls in the integration tests with the
new functions defined in `testutil/annotations.go`, as described in #4292,
so that the errors produce GitHub annotations.
This piece takes care of the CNI integration test suite.
This also enables the annotations for these and the general integration
tests by setting the `GH_ANNOTATIONS` environment variable in the
workflows whose flakiness we're interested in catching: Kind
integration, Cloud integration and Release.
Re #4176
Upgraded to Helm v3.2.1 from v2.16.1, getting rid of Tiller and making
other simplifications.
Note that the version placeholder in the `values.yaml` files had to be
changed from `{version}` to `linkerdVersionValue` because the former
confuses Helm v3.
#4217 suggests a retries integration test, but retries are already exercised
as part of the ServiceProfiles test.
To close the issue, an extra check has been added to the assertion on the
`ActualSuccess` value: it now asserts the value is both greater than 0 and
less than 100, i.e. some requests fail at first and are then successfully
retried.
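A hedged sketch of that check (the helper and parameter names are illustrative, not the exact test code):

```go
package testutil

import "testing"

// assertRetriesObserved is a hypothetical helper: an actual success rate
// strictly between 0 and 100 means some requests failed and were retried,
// while retries keep the effective success rate high.
func assertRetriesObserved(t *testing.T, actualSuccess float64) {
	if actualSuccess <= 0 || actualSuccess >= 100 {
		t.Fatalf("expected ActualSuccess strictly between 0 and 100, got %f", actualSuccess)
	}
}
```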
Closes #4217
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
This change adds labels to endpoints that target remote services. It also adds a Grafana dashboard that can be used to monitor multicluster traffic.
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
This release adds special handling for I/O errors in HTTP responses so
that an `errno` label describing the underlying error is included
in the proxy's metrics.
---
* Add an `i/o` error label to http metrics (linkerd/linkerd2-proxy#512)
* CLI
* Added a section to `linkerd check` that validates that all
  clusters that are part of a multicluster setup have compatible trust anchors
* Modified the `linkerd cluster export-service` command to work by
  transforming YAML instead of modifying cluster state
* Added functionality that allows the `linkerd cluster export-service`
command to operate on lists of services
* Controller
* Changed the multicluster gateway to always require TLS on connections
originating from outside the cluster
* Removed admin server timeouts from control plane components, thereby
fixing a bug that can cause liveness checks to fail
* Helm
* Moved Grafana templates into a separate add-on chart
* Proxy
* Improved latency under high-concurrency use cases
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
This release reduces latency and CPU consumption, especially for high-
concurrency use cases.
---
* Add middleware that rejects connections with no identity (linkerd/linkerd2-proxy#507)
* Buffer requests while the service is pending (linkerd/linkerd2-proxy#511)
The Linkerd control plane components' admin servers have an idle connection timeout of 10 seconds, meaning they close connections that have been idle for 10 seconds. These components are also configured with a 10-second period for liveness checks. This introduces a race condition: a probe connection can sit idle for approximately 10 seconds between liveness checks and idle out, potentially causing the next liveness check to fail.
We remove the idle timeout so that the connection stays alive.
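A minimal sketch of the fix, using Go's standard `net/http` server (the field names come from the standard library; the function name, port, and actual admin-server wiring are assumptions):

```go
package admin

import (
	"net/http"
	"time"
)

// newAdminServer illustrates the change: the 10s IdleTimeout is gone,
// so a probe connection that sits idle between 10s-period liveness
// checks is no longer closed just as the next check arrives.
func newAdminServer(addr string) *http.Server {
	return &http.Server{
		Addr: addr,
		// Previously this struct also set IdleTimeout: 10 * time.Second,
		// racing the 10s liveness-probe period.
		ReadHeaderTimeout: 15 * time.Second, // unrelated timeouts may remain
	}
}
```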
* Refactor integration tests to use annotation functions
First part of #4176
Replaced all the `t.Error`/`t.Fatal` calls in the integration tests with the
new functions defined in `testutil/annotations.go`, as described in #4292,
so that the errors produce GitHub annotations.
Most of these calls now take two strings: one containing a generic error
message and another with a more specific message. The former is what
will be aggregated and seen in the CI reports at
[linkerd2-ci-metrics](https://github.com/linkerd/linkerd2-ci-metrics).
Other changes:
- Improved the annotation generator in `annotations.go` so that the
message includes the name of the test.
- When a failure from `RetryFor` occurs, log the original timeout so
we can consider increasing it when the failure persists (see the sketch
below).
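For the `RetryFor` change, a hedged sketch (assuming a `RetryFor(timeout, fn)` helper like the one in `testutil`; the exact signature and retry interval may differ):

```go
package testutil

import (
	"fmt"
	"time"
)

// RetryFor retries fn until it succeeds or timeout elapses. On failure,
// the returned error now records the original timeout, so persistent
// failures make it obvious which limit to consider increasing.
func RetryFor(timeout time.Duration, fn func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out after %s: %w", timeout, err)
		}
		time.Sleep(time.Second)
	}
}
```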
## edge-20.5.1
* CLI
* Fixed all commands to use kubeconfig's default namespace if specified
(thanks @Matei207!)
* Added multicluster checks to the `linkerd check` command
* Hid development flags in the `linkerd install` command for release builds
* Controller
* Added ability to configure Prometheus Alertmanager as well as recording
and alerting rules on the Linkerd Prometheus (thanks @naseemkullah!)
* Added ability to pass more command-line flags to the Prometheus command
(thanks @naseemkullah!)
* Web UI
* Fixed TrafficSplit detail page not loading
* Added Jaeger links to the dashboard when the tracing addon is enabled
* Proxy
* Modified internal buffering to avoid idling out services as a request
arrives, fixing failures for requests that are sent exactly once per
minute, such as Prometheus scrapes
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
Sometimes, for no clear reason, pods take their time to become
available. The `kubectl wait --for=condition=available` command in
`inject_test.go` fails sporadically because of this, e.g. in
https://github.com/linkerd/linkerd2/runs/652159504?check_suite_focus=true#step:14:56
I could reproduce this, and even though I couldn't see any errors in the
logs or events, I could confirm how long it took for the pod to come up:
```
$ k -n l5d-integration-inject-test describe po inject-test-terminus-enabled
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m12s default-scheduler Successfully assigned l5d-integration-inject-test/inject-test-terminus-enabled-96fd5f5dc-5qlpb to gke-alpeb-dev-default-pool-b94ca25c-h84p
Normal Pulled 6m55s kubelet, gke-alpeb-dev-default-pool-b94ca25c-h84p Container image "gcr.io/linkerd-io/proxy-init:v1.3.2" already present on machine
Normal Created 6m54s kubelet, gke-alpeb-dev-default-pool-b94ca25c-h84p Created container linkerd-init
Normal Started 6m47s kubelet, gke-alpeb-dev-default-pool-b94ca25c-h84p Started container linkerd-init
Normal Pulled 6m28s kubelet, gke-alpeb-dev-default-pool-b94ca25c-h84p Container image "buoyantio/bb:v0.0.5" already present on machine
Normal Created 6m27s kubelet, gke-alpeb-dev-default-pool-b94ca25c-h84p Created container bb-terminus
Normal Started 6m27s kubelet, gke-alpeb-dev-default-pool-b94ca25c-h84p Started container bb-terminus
Normal Pulled 6m27s kubelet, gke-alpeb-dev-default-pool-b94ca25c-h84p Container image "gcr.io/linkerd-io/proxy:git-2a95d373" already present on machine
Normal Created 6m27s kubelet, gke-alpeb-dev-default-pool-b94ca25c-h84p Created container linkerd-proxy
Normal Started 6m27s kubelet, gke-alpeb-dev-default-pool-b94ca25c-h84p Started container linkerd-proxy
```
Here the pod took 45s to start!
Updated the rule in the list of ignored k8s warning events, making it more
generic so it accounts for this failure:
```
error killing pod: failed to "KillPodSandbox" for
"756c8333-1d4d-4f42-bc2d-bd99eb8b4c94" with KillPodSandboxError: "rpc
error: code = Unknown desc = networkPlugin cni failed to teardown pod
\"_\" network: operation Delete is not supported on
WorkloadEndpoint(default/gke--testing--git--2d2fd3f1--default--pool--b9cfce6d--tgcn-cni-bd3ca37ee6fc3a05bafa26ce71faa05279ce08de02462040300786cb7e046b38-eth0)"
```
That happened here:
https://github.com/linkerd/linkerd2/runs/653622248?check_suite_focus=true#step:6:27
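A hedged sketch of how such an ignore rule can be generalized (the pattern and variable name are illustrative; the real list lives in the test utilities):

```go
package testutil

import "regexp"

// knownKillPodFailure matches "error killing pod" warnings regardless of
// which pod sandbox or workload endpoint produced them, so transient CNI
// teardown errors like the one above no longer fail the test run.
var knownKillPodFailure = regexp.MustCompile(
	`error killing pod: failed to "KillPodSandbox" for .* with KillPodSandboxError`)
```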
When the proxy has an IP watch on a pod and the destination controller gets a pod update event, the destination controller sends a NoEndpoints message to all listeners followed by an Add with the new pod state. This can leave the proxy's load balancer briefly empty, and requests can fail during that window.
Since consecutive Add events with the same address override each other, we can simply send the Adds without first clearing the previous state with a NoEndpoints message.
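A minimal sketch of the change, with simplified stand-ins for the destination watcher's types (the real interfaces carry more state):

```go
package watcher

// Address and Listener are simplified stand-ins for the destination
// watcher's types.
type Address struct{ IP string }

type Listener interface {
	Add(Address)
	NoEndpoints(exists bool)
}

// Before: clearing state first left the balancer briefly empty.
func onPodUpdateBefore(l Listener, updated Address) {
	l.NoEndpoints(true) // balancer is empty here; requests can fail
	l.Add(updated)
}

// After: consecutive Adds for the same address override each other,
// so the updated state is sent directly and the balancer never empties.
func onPodUpdateAfter(l Listener, updated Address) {
	l.Add(updated)
}
```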
This release modifies Linkerd's internal buffering to avoid idling out
services as a request arrives. Previously, this could cause failures for
requests that are sent exactly once per minute, such as Prometheus scrapes.
---
* Set a grpc-status of UNAVAILABLE only on io errors (linkerd/linkerd2-proxy#498)
* inbound: Remove unnecessary buffer (linkerd/linkerd2-proxy#501)
* buffer: Move idle timeouts into the buffer (linkerd/linkerd2-proxy#502)
* make: Support CARGO_TARGET for multi-arch builds (linkerd/linkerd2-proxy#497)
* release: Use arch-specific paths (linkerd/linkerd2-proxy#508)
Use [gotestsum](https://github.com/gotestyourself/gotestsum) for running
unit tests in CI, so we get a summary at the end instead of having to
scroll up to find failures.
This doesn't apply to integration tests, as only failures are shown there,
and they're easily visible.
Certain install flags are intended to help with Linkerd development and are generally not useful (and are potentially confusing) to users.
We hide these flags in release (edge or stable) builds of the CLI but show them in all other builds (a sketch of the mechanism follows the list). The list of affected flags is:
* control-plane-version
* proxy-image
* proxy-version
* image-pull-policy
* init-image
* init-image-version
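A hedged sketch of how the hiding can be done with cobra/pflag (the release-build check and function name are assumptions, not Linkerd's exact code):

```go
package cmd

import "github.com/spf13/cobra"

// hideDevFlags hides development-only flags from --help output in
// release builds; hidden flags still work when passed explicitly.
func hideDevFlags(cmd *cobra.Command, isRelease bool) {
	if !isRelease {
		return
	}
	for _, name := range []string{
		"control-plane-version", "proxy-image", "proxy-version",
		"image-pull-policy", "init-image", "init-image-version",
	} {
		// MarkHidden errors only if the named flag doesn't exist.
		if err := cmd.Flags().MarkHidden(name); err != nil {
			panic(err)
		}
	}
}
```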
Signed-off-by: Alex Leong <alex@buoyant.io>
When using CLI commands that work on namespaced resources in the cluster, the default namespace used by the CLI was hardcoded to the default Kubernetes namespace (i.e. 'default'). This update allows CLI commands that operate on namespaced resources to automatically infer the default namespace by taking the relevant value from the currently used kubeconfig context. In short, this allows omitting the -n flag in commands such as `linkerd metrics` when working with resources that belong to the namespace set as default in the currently active context.
Validation was done manually by setting the default namespace of the currently used context, as well as through two integration tests that target the `tap` and `get` commands, respectively.
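A minimal sketch of reading the active context's namespace with client-go's `clientcmd` (illustrative; the CLI's actual wiring differs):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/tools/clientcmd"
)

// defaultNamespace returns the namespace of the currently active
// kubeconfig context, falling back to "default" when none is set,
// so -n can be omitted for resources in that namespace.
func defaultNamespace() (string, error) {
	rules := clientcmd.NewDefaultClientConfigLoadingRules()
	cfg := clientcmd.NewNonInteractiveDeferredLoadingClientConfig(
		rules, &clientcmd.ConfigOverrides{})
	ns, _, err := cfg.Namespace()
	return ns, err
}

func main() {
	ns, err := defaultNamespace()
	if err != nil {
		panic(err)
	}
	fmt.Println(ns)
}
```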
Signed-off-by: Matei David <matei.david.35@gmail.com>
This allows end users flexibility for options such as log format. Rather than bubbling every such config option up into Helm values, extra arguments provide more flexibility.
The added `prometheusAlertmanagers` value allows configuring a list of statically targeted Alertmanager instances.
Rule configmaps are used for Prometheus rules. They take a list of {name, subPath, configMap} values and are mounted accordingly. Provided the subPaths end with `_rules.yml` or `_rules.yaml`, they are loaded by Prometheus per `prometheus.yml`'s `rule_files` entries.
Signed-off-by: Naseem <naseem@transit.app>
* Go test failure message wrappers to create GH Annotations
First part of #4176
## Problem
Failures in Go tests need to be properly formatted as GitHub annotations
so that we can fetch them through GitHub's API for aggregation and
analysis.
## Solution
A wrapper for error messages has been created in `testutil/annotations.go`.
The idea is that instead of throwing test failures like this:
```go
t.Failf("error retrieving data;\nExpected: %#v\nActual: %#v", expected,
actual)
```
We'd throw them like this:
```go
testutil.AnnotationFatalf("error retrieving data", "error retrieving data;\nExpected: %#v\nActual: %#v", expected, actual)
```
That will continue reporting the error as before (when using `go test`
or another test runner), but as a side effect it will also print to
stdout something like:
```
::error file=pkg/inject_test.go,line=133::error retrieving data
```
This becomes a GH annotation, visible in the CI run summary screen.
The first string argument is used so that the GH annotation carries a generic
error message that can be aggregated and counted across multiple test runs.
If `testutil.Fatalf(str, args...)` is called instead, the original error
message will be used.
Note that the output is produced only when the env var
`GH_ANNOTATION` is set (which will be the case when tests are triggered
from a GitHub Actions workflow).
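A hedged sketch of how such a wrapper can emit the annotation (the signature here takes `*testing.T` explicitly, and the exact format is illustrative; the real implementation lives in `testutil/annotations.go`):

```go
package testutil

import (
	"fmt"
	"os"
	"runtime"
	"testing"
)

// AnnotationFatalf fails the test as t.Fatalf would and, when the
// GH_ANNOTATION env var is set, also prints a GitHub workflow command
// so the generic message surfaces as an annotation in the CI summary.
func AnnotationFatalf(t *testing.T, generic, format string, args ...interface{}) {
	t.Helper()
	if os.Getenv("GH_ANNOTATION") != "" {
		// Locate the calling test so the annotation points at its file.
		_, file, line, _ := runtime.Caller(1)
		fmt.Printf("::error file=%s,line=%d::%s\n", file, line, generic)
	}
	t.Fatalf(format, args...)
}
```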
Besides `testutil/annotations.go` and its accompanying unit test file,
changes were made in other tests as examples, the plan being that
in a further PR _all_ the tests will use these wrappers.
* Increase timeout for Helm cleanup in integration tests
Tests were failing sporadically while waiting for the Helm namespace to get
cleaned up. I verified that it does get cleaned up, but it sometimes takes
more time.
* update changelog for edge-20.4.5
This edge release includes several new CLI commands for use with
multi-cluster gateways, and adds liveness checks and metrics for
gateways. Additionally, it makes the proxy's gRPC error-handling
behavior more consistent with other implementations, and includes a fix
for a bug in the web UI.
* CLI
* Added `linkerd cluster setup-remote` command for setting up a
multi-cluster gateway
* Added `linkerd cluster gateways` command to display stats for
multi-cluster gateways
* Changed `linkerd cluster export-service` to modify a provided YAML
file and output it, rather than mutating the cluster
* Controller
* Added liveness checks and Prometheus metrics for multi-cluster
gateways
* Changed the proxy injector to configure proxies to do destination
lookups for IPs in the private IP range
* Web UI
* Fixed errors when viewing resource detail pages
* Internal
* Created script and config to build a Linkerd CLI Chocolatey package
for Windows users, which will be published with stable releases
(thanks to @drholmie!)
* Proxy
* Changed the proxy to set a `grpc-status: UNAVAILABLE` trailer when a
gRPC response stream is interrupted by a transport error
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
* review feedback
Signed-off-by: Eliza Weisman <eliza@buoyant.io>