* Schedule heartbeat 10 mins after install
... for the Helm installation method, thus aligning it with the CLI
installation method, to reduce the midnight peak on the receiving end.
The logic added into the chart is now reused by the CLI as well.
Also, set `concurrencyPolicy=Replace` so that when a job fails and it's
retried, the retries get canceled when the next scheduled job is triggered.
Finally, the Go client only failed when the connection failed;
successful connections with a non-200 response status were considered
successful and thus the job wasn't retried. Fixed that as well.
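As a rough illustration of the status check described above, here is a minimal sketch assuming a plain `net/http` client. The names `checkStatus` and `send` are hypothetical and are not the heartbeat client's actual functions.

```go
package main

import (
	"fmt"
	"net/http"
)

// checkStatus treats any status other than 200 OK as a failure, so the caller
// can return an error and let the job be retried.
func checkStatus(code int) error {
	if code != http.StatusOK {
		return fmt.Errorf("unexpected response status: %d %s", code, http.StatusText(code))
	}
	return nil
}

// send is a hypothetical helper: it fails on connection errors and on
// non-200 responses alike.
func send(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return fmt.Errorf("request failed: %w", err)
	}
	defer resp.Body.Close()
	return checkStatus(resp.StatusCode)
}

func main() {
	fmt.Println(checkStatus(http.StatusOK))
	fmt.Println(checkStatus(http.StatusServiceUnavailable))
}
```

With a check like this, a 503 from the receiving end surfaces as an error instead of being counted as a successful run.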
### What
When a namespace has the opaque ports annotation, pods and services should
inherit it if they do not have one themselves. Currently, services do this but
pods do not. This can lead to surprising behavior where services are correctly
marked as opaque, but pods are not.
This changes the proxy-injector so that it now passes down the opaque ports
annotation to pods from their namespace if they do not have their own annotation
set. Closes #5736.
### How
The proxy-injector webhook receives admission requests for pods and services.
Regardless of the resource kind, it now checks if the resource should inherit
the opaque ports annotation from its namespace. It should inherit it if the
namespace has the annotation but the resource does not.
If the resource should inherit the annotation, the webhook creates an annotation
patch which is only responsible for adding the opaque ports annotation.
After generating the annotation patch, it checks if the resource is injectable.
From here there are a few scenarios:
1. If no annotation patch was created and the resource is not injectable, then
admit the request with no changes. Examples of this are services with no OP
annotation and inject-disabled pods with no OP annotation.
2. If the resource is a pod and it is injectable, create a patch that includes
the proxy and proxy-init containers—as well as any other annotations and
labels.
3. The above two scenarios lead to a patch being generated at this point, so no
matter the resource the patch is returned.
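The inheritance rule from the start of this section can be sketched roughly as follows. This is an illustration only: the function name is hypothetical, the annotation key is assumed to be `config.linkerd.io/opaque-ports`, and annotations are modeled as plain maps rather than Kubernetes objects.

```go
package main

import "fmt"

// opaquePortsAnnotation is assumed here to be the opaque ports annotation key.
const opaquePortsAnnotation = "config.linkerd.io/opaque-ports"

// inheritedOpaquePorts reports whether a resource should inherit the opaque
// ports annotation from its namespace, and if so, with which value. A resource
// inherits only when the namespace has the annotation and the resource does not.
func inheritedOpaquePorts(nsAnnotations, resAnnotations map[string]string) (string, bool) {
	if _, ok := resAnnotations[opaquePortsAnnotation]; ok {
		// The resource has its own annotation, so nothing is inherited.
		return "", false
	}
	value, ok := nsAnnotations[opaquePortsAnnotation]
	return value, ok
}

func main() {
	ns := map[string]string{opaquePortsAnnotation: "3306"}
	value, inherit := inheritedOpaquePorts(ns, map[string]string{})
	fmt.Println(value, inherit)
}
```

When the second return value is true, the webhook would build the annotation patch; otherwise it proceeds straight to the injectability check.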
### UI changes
Resources are now reported as "injected", "skipped", or "annotated".
The first pass at this PR worked around the fact that injection reports consider
services and namespaces injectable. This is not accurate because they don't have
pod templates that could be injected; they can however be annotated.
To fix this, an injection report now considers resources "annotatable" and uses
this to clean up some logic in the `inject` command, as well as avoid a more
complex proxy-injector webhook.
What's cool about this is it fixes some `inject` command output that would label
resources as "injected" when they were not even mutated. For example, namespaces
were always reported as being injected even if annotations were not added. Now,
it will properly report that a namespace has been "annotated" or "skipped".
### Tests
For testing, unit tests and integration tests have been added. Manual testing
can be done by installing linkerd with `debug` controller log levels, and
tailing the proxy-injector's app container when creating pods or services.
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
Fixes #5939
Some CNIs reassign the IP of a terminating pod to a new pod, which
leads to duplicate IPs in the cluster.
This eventually triggers #5939.
This commit makes the IPWatcher, when given an IP, filter out terminating pods
(pods that have a `deletionTimestamp` set).
The issue is hard to reproduce because we are not able to assign a
particular IP to a pod manually.
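A minimal sketch of the filter, using a hypothetical trimmed-down `pod` type rather than the real Kubernetes API types. The function name is illustrative and is not the IPWatcher's actual method.

```go
package main

import (
	"fmt"
	"time"
)

// pod is a hypothetical, minimal stand-in for a Kubernetes Pod.
type pod struct {
	Name              string
	IP                string
	DeletionTimestamp *time.Time // non-nil once the pod is terminating
}

// activePodsWithIP returns the pods that have the given IP and are not
// terminating.
func activePodsWithIP(pods []pod, ip string) []pod {
	var out []pod
	for _, p := range pods {
		if p.IP != ip || p.DeletionTimestamp != nil {
			continue
		}
		out = append(out, p)
	}
	return out
}

// examplePods models a CNI that handed a terminating pod's IP to a new pod.
func examplePods() []pod {
	now := time.Now()
	return []pod{
		{Name: "old", IP: "10.0.0.5", DeletionTimestamp: &now},
		{Name: "new", IP: "10.0.0.5"},
	}
}

func main() {
	for _, p := range activePodsWithIP(examplePods(), "10.0.0.5") {
		fmt.Println(p.Name)
	}
}
```

Only the non-terminating pod is left for the IP, so the lookup no longer sees a duplicate.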
Signed-off-by: Bruce <wenliang.chen@personio.de>
Co-authored-by: Bruce <wenliang.chen@personio.de>
This fixes an issue where pod lookups by host IP and host port fail even though
the cluster has a matching pod.
Usually these manifested as `FailedPrecondition` errors, but the messages were
too long and resulted in http/2 errors. This change depends on #5893 which fixes
that separate issue.
This changes how often those `FailedPrecondition` errors actually occur. The
destination service now considers pod host IPs and should reduce the frequency
of those errors.
Closes #5881
---
Lookups like this happen when a pod is created with a host IP and host port set
in its spec. It still has a pod IP when running, but requests to
`hostIP:hostPort` will also be redirected to the pod. Combinations of host IP
and host port are unique to the cluster and enforced by Kubernetes.
Currently, the destination service fails to find pods in this scenario because
we only keep an index of pods and their pod IPs, not pods and their host IPs.
To fix this, we now also keep an index of pods and their host IPs—if and only if
they have the host IP set.
Now when doing a pod lookup, we consider both the IP and the port. We perform
the following steps:
1. Do a lookup by IP in the pod podIP index
- If only one pod is found then return it
2. 0 or more than 1 pods have the same pod IP
3. Do a lookup by IP in the pod hostIP index
- If any number of pods were found, we know that IP maps to a node IP.
Therefore, we search for a pod with a matching host Port. If one exists then
return it; if not then there is no pod that matches `hostIP:port`
4. The IP does not map to a host IP
5. If multiple pods were found in `1`, then we know there are pods with
conflicting podIPs and an error is returned
6. If no pods were found in `1` then there is no pod that matches `IP:port`
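The lookup steps above can be sketched as follows. This is a simplified illustration, not the destination service's code: the `pod` and `podIndex` types, the in-memory maps, and the error value are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical, minimal stand-in for a Kubernetes Pod.
type pod struct {
	Name     string
	PodIP    string
	HostIP   string // empty unless a host IP is set for the pod
	HostPort uint32 // only meaningful when HostIP is set
}

// podIndex keeps the two indexes: pods by pod IP and pods by host IP.
type podIndex struct {
	byPodIP  map[string][]pod
	byHostIP map[string][]pod
}

func newPodIndex(pods []pod) podIndex {
	idx := podIndex{byPodIP: map[string][]pod{}, byHostIP: map[string][]pod{}}
	for _, p := range pods {
		idx.byPodIP[p.PodIP] = append(idx.byPodIP[p.PodIP], p)
		if p.HostIP != "" {
			idx.byHostIP[p.HostIP] = append(idx.byHostIP[p.HostIP], p)
		}
	}
	return idx
}

var errPodIPConflict = errors.New("pod IP address conflict")

// lookup returns the pod for ip:port, or (nil, nil) when nothing matches.
func (idx podIndex) lookup(ip string, port uint32) (*pod, error) {
	byPodIP := idx.byPodIP[ip]
	if len(byPodIP) == 1 {
		return &byPodIP[0], nil // step 1: exactly one pod has this pod IP
	}
	if hostPods := idx.byHostIP[ip]; len(hostPods) > 0 {
		// Step 3: the IP maps to a node, so match on the host port.
		for i := range hostPods {
			if hostPods[i].HostPort == port {
				return &hostPods[i], nil
			}
		}
		return nil, nil
	}
	if len(byPodIP) > 1 {
		return nil, errPodIPConflict // step 5: conflicting pod IPs
	}
	return nil, nil // step 6: no pod matches
}

// examplePods mimics a kind cluster where kube-system pods share the node IP.
func examplePods() []pod {
	return []pod{
		{Name: "kindnet", PodIP: "192.168.0.2"},
		{Name: "kube-scheduler", PodIP: "192.168.0.2"},
		{Name: "tcp-echo", PodIP: "10.42.0.8", HostIP: "192.168.0.2", HostPort: 4445},
		{Name: "web", PodIP: "10.42.0.7"},
		{Name: "dup-a", PodIP: "192.168.0.3"},
		{Name: "dup-b", PodIP: "192.168.0.3"},
	}
}

func main() {
	idx := newPodIndex(examplePods())
	p, err := idx.lookup("192.168.0.2", 4445)
	fmt.Println(p.Name, err)
}
```

In the example data, the node IP is shared by two `kube-system` pods, yet a lookup for the node IP and port `4445` still resolves to the pod that declares that host port.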
---
Aside from the additional IP watcher test being added, this can be tested with
the following steps:
1. Create a kind cluster. kind is required because its pods in `kube-system`
have the same pod IPs; this is not the case with k3d: `bin/kind create cluster`
2. Install Linkerd with `4445` marked as opaque: `linkerd install --set
proxy.opaquePorts="4445" | kubectl apply -f -`
3. Get the node IP: `kubectl get -o wide nodes`
4. Pull my fork of `tcp-echo`:
```
$ git clone https://github.com/kleimkuhler/tcp-echo
...
$ git checkout --track kleimkuhler/host-pod-repro
```
5. `helm package .`
6. Install `tcp-echo` with the server not injected and correct host IP: `helm
install tcp-echo tcp-echo-0.1.0.tgz --set server.linkerdInject="false" --set
hostIP="..."`
7. Looking at the client's proxy logs, you should not observe any errors or
protocol detection timeouts.
8. Looking at the server logs, you should see all the requests coming through
correctly.
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
# Problem
During a rollout, it is common that not all pods are ready on the same set of
ports, leading the Kubernetes Endpoints API to return multiple subsets,
each covering a different set of ports, with the end result that the
same address gets repeated across subsets.
The old code for endpointsToAddresses would loop through all subsets, and the
later occurrences of an address would overwrite previous ones, with the
last one prevailing.
If the last subset happened to be for an irrelevant port, and the port to
be resolved is named, resolveTargetPort would resolve to port 0, which would
return port 0 to clients, ultimately leading linkerd-proxy to forward
connections to port 0.
This only happens if the pods selected by a service expose > 1 port, the
service maps to > 1 of these ports, and at least one of these ports is named.
# Solution
Never write an address to the set of addresses if the resolved port is 0, which
indicates that named port resolution failed.
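A simplified sketch of the fix follows. The function names come from this description, but the signatures and the `subset` and `port` types are assumptions made for illustration and do not mirror the real code.

```go
package main

import "fmt"

// port and subset are hypothetical, minimal stand-ins for the Endpoints API types.
type port struct {
	Name string
	Port uint32
}

type subset struct {
	Addresses []string
	Ports     []port
}

// resolveTargetPort returns the number of the named port within a subset, or 0
// when the subset does not expose a port with that name.
func resolveTargetPort(name string, s subset) uint32 {
	for _, p := range s.Ports {
		if p.Name == name {
			return p.Port
		}
	}
	return 0
}

// endpointsToAddresses maps each address to its resolved port and skips
// subsets where resolution failed, so a later irrelevant subset can no longer
// overwrite a good entry with port 0.
func endpointsToAddresses(subsets []subset, portName string) map[string]uint32 {
	addrs := make(map[string]uint32)
	for _, s := range subsets {
		resolved := resolveTargetPort(portName, s)
		if resolved == 0 {
			continue
		}
		for _, a := range s.Addresses {
			addrs[a] = resolved
		}
	}
	return addrs
}

// exampleSubsets repeats 10.0.0.1 in a last subset that lacks the "http" port.
func exampleSubsets() []subset {
	return []subset{
		{Addresses: []string{"10.0.0.1"}, Ports: []port{{Name: "http", Port: 8080}, {Name: "admin", Port: 9090}}},
		{Addresses: []string{"10.0.0.1", "10.0.0.2"}, Ports: []port{{Name: "admin", Port: 9090}}},
	}
}

func main() {
	fmt.Println(endpointsToAddresses(exampleSubsets(), "http"))
}
```

Without the `continue`, the second subset would overwrite `10.0.0.1` with port 0, which is the behavior described in the problem statement.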
# Validation
Added a test case.
Signed-off-by: Riccardo Freixo <riccardofreixo@gmail.com>
This reduces the possible HTTP response size from the destination service when
it encounters an error during a profile lookup.
If multiple objects on a cluster share the same IP (such as pods in
`kube-system`), the destination service will return an error with the two
conflicting pod yamls.
In certain cases, these pod yamls can be too large for the HTTP response and the
destination pod's proxy will indicate that with the following error:
```
hyper::proto::h2::server: send response error: user error: header too big
```
From the app pod's proxy, this results in the following error:
```
poll_profile: linkerd_service_profiles::client: Could not fetch profile error=status: Unknown, message: "http2 error: protocol error: unexpected internal error encountered"
```
We now return only the names of the conflicting pods (or services). This reduces
the size of the returned error and stops these warnings from occurring.
Example response error:
```
poll_profile: linkerd_service_profiles::client: Could not fetch profile error=status: FailedPrecondition, message: "Pod IP address conflict: kube-system/kindnet-wsflq, kube-system/kube-scheduler-kind-control-plane", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Fri, 12 Mar 2021 19:54:09 GMT"} }
```
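A message of this shape could be built roughly as follows. This is only a sketch: `objectRef` and `conflictMessage` are hypothetical names, and parameterizing the resource kind is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// objectRef is a hypothetical reference to a namespaced object.
type objectRef struct {
	Namespace string
	Name      string
}

// conflictMessage lists only namespace/name pairs instead of full object YAML,
// which keeps the error small enough to fit in the response headers.
func conflictMessage(kind string, objs []objectRef) string {
	names := make([]string, 0, len(objs))
	for _, o := range objs {
		names = append(names, o.Namespace+"/"+o.Name)
	}
	return fmt.Sprintf("%s IP address conflict: %s", kind, strings.Join(names, ", "))
}

func main() {
	fmt.Println(conflictMessage("Pod", []objectRef{
		{Namespace: "kube-system", Name: "kindnet-wsflq"},
		{Namespace: "kube-system", Name: "kube-scheduler-kind-control-plane"},
	}))
}
```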
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
* update go.mod and docker images to go 1.16.1
Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
* update test error messages for ParseDuration
* update go version to 1.16.2
This change fixes an issue where the `linkerd-sp-validator` does not set
the `request.UID` for an `admissionResponse`. This causes an issue that
prevents service profiles from being added or updated.
Fixes #5862
Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
* destination: pass opaque-ports through cmd flag
Fixes #5817
Currently, default opaque ports are stored in two places: `Values.yaml`
and `opaqueports/defaults.go`. As these ports are used only in the
destination component, we can instead pass these values from `Values.yaml`
as a cmd flag to the destination component and remove defaultPorts
from `defaults.go`.
This means that if users override the `opaquePorts` field in `Values.yaml`,
that change is propagated to both injection and discovery, as
expected.
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
When introducing the `linkerd-await` helper, we provided a default value
for `TARGETARCH`. This appears to interfere with multi-arch image
builds, causing ARM builds to fetch amd64 binaries.
Unsetting this default appears to fix this issue.
When a container starts up, we generally want to wait for the proxy to
initialize before starting the controller (which may initiate outbound
connections, especially to the Kubernetes API). This is true for all
pods except the identity controller, which must start before its proxy.
This change adds the linkerd-await helper to all of our container
images. Its use is explicitly disabled in the identity controller, due
to startup ordering constraints, and the heartbeat controller, because
it does not run a proxy currently.
Fixes #5819
* Remove linkerd prefix from extension resources
This change removes the `linkerd-` prefix on all non-cluster resources
in the jaeger and viz linkerd extensions. Removing the prefix makes all
linkerd extensions consistent in their naming.
Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
This change removes the default ignored inbound and outbound ports from the
proxy init configuration.
These ports have been moved to the `proxy.opaquePorts` configuration so that
by default, installations will proxy all traffic on these ports opaquely.
Closes #5571
Closes #5595
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
This changes the destination service to always use a default set of opaque ports
for pods and services. This is so that after Linkerd is installed onto a
cluster, users can benefit from common opaque ports without having to annotate
the workloads that serve the applications.
After #5810 merges, the proxy containers will have the default opaque ports
`25,443,587,3306,5432,11211`. This value on the proxy container does not affect
traffic though; it only configures the proxy.
In order for clients and servers to detect opaque protocols and determine opaque
transports, the pods and services need to have these annotations.
The ports `25,443,587,3306,5432,11211` are now handled opaquely when a pod or
service does not have the opaque ports annotation. If the annotation is present
with a different value, this is used instead of the default. If the annotation
is present but is an empty string, there are no opaque ports for the workload.
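The three cases above (annotation absent, present with a value, present but empty) can be sketched like this. The annotation key `config.linkerd.io/opaque-ports`, the function name, and the comma-separated parsing are assumptions for illustration; the real parsing may accept more than plain port numbers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// opaquePortsAnnotation is assumed here to be the opaque ports annotation key.
const opaquePortsAnnotation = "config.linkerd.io/opaque-ports"

// defaultOpaquePorts are the defaults listed in this description.
var defaultOpaquePorts = []uint32{25, 443, 587, 3306, 5432, 11211}

// opaquePortsFor returns the defaults when the annotation is absent, the
// parsed ports when it has a value, and no ports when it is an empty string.
func opaquePortsFor(annotations map[string]string) ([]uint32, error) {
	value, ok := annotations[opaquePortsAnnotation]
	if !ok {
		return defaultOpaquePorts, nil
	}
	var ports []uint32
	for _, field := range strings.Split(value, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		n, err := strconv.ParseUint(field, 10, 16)
		if err != nil {
			return nil, fmt.Errorf("invalid port %q: %w", field, err)
		}
		ports = append(ports, uint32(n))
	}
	return ports, nil
}

func main() {
	fmt.Println(opaquePortsFor(nil))
	fmt.Println(opaquePortsFor(map[string]string{opaquePortsAnnotation: "4445"}))
	fmt.Println(opaquePortsFor(map[string]string{opaquePortsAnnotation: ""}))
}
```

Distinguishing "absent" from "empty" is what lets a workload opt out of the defaults entirely by setting the annotation to an empty string.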
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
This reverts commit f9ab867cbc which renamed the
multicluster label name from `mirror.linkerd.io` to `multicluster.linkerd.io`.
While this change was made to follow similar namings in other extensions, it
complicates the multicluster upgrade process due to the secret creation.
`mirror.linkerd.io` is not that important of a label to change and this will
allow a smoother upgrade process for `stable-2.10.x`
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
This change introduces an opaque ports annotation watcher that will send
destination profile updates when a service has its opaque ports annotation
change.
The user facing change introduced by this is that the opaque ports annotation is
now required on services when using the multicluster extension. This is because
the service mirror will create mirrored services in the source cluster, and
destination lookups in the source cluster need to discover that the workloads in
the target cluster are opaque protocols.
### Why
Closes #5650
### How
The destination server now has a new opaque ports annotation watcher. When a
client subscribes to updates for a service name or cluster IP, the `GetProfile`
method creates a profile translator stack that passes updates through resource
adaptors such as: traffic split adaptor, service profile adaptor, and now opaque
ports adaptor.
When the annotation on a service changes, the update is passed through to the
client where the `opaque_protocol` field will either be set to true or false.
A few scenarios to consider are:
- If the annotation is removed from the service, the client should receive
an update with no opaque ports set.
- If the service is deleted, the stream stays open so the client should
receive an update with no opaque ports set.
- If the service has the annotation added, the client should receive that
update.
### Testing
Unit tests have been added to the watcher as well as the destination server.
An integration test has been added that tests the opaque port annotation on a
service.
For manual testing, using the destination server scripts is easiest:
```
# install Linkerd
# start the destination server
$ go run controller/cmd/main.go destination -kubeconfig ~/.kube/config
# Create a service or namespace with the annotation and inject it
# get the destination profile for that service and observe the opaque protocol field
$ go run controller/script/destination-client/main.go -method getProfile -path test-svc.default.svc.cluster.local:8080
INFO[0000] fully_qualified_name:"terminus-svc.default.svc.cluster.local" opaque_protocol:true retry_budget:{retry_ratio:0.2 min_retries_per_second:10 ttl:{seconds:10}} dst_overrides:{authority:"terminus-svc.default.svc.cluster.local.:8080" weight:10000}
INFO[0000]
INFO[0000] fully_qualified_name:"terminus-svc.default.svc.cluster.local" opaque_protocol:true retry_budget:{retry_ratio:0.2 min_retries_per_second:10 ttl:{seconds:10}} dst_overrides:{authority:"terminus-svc.default.svc.cluster.local.:8080" weight:10000}
INFO[0000]
```
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
Currently the identity controller is the only component that receives the CA certificate / trust anchors as option `-identity-trust-anchors-pem` instead of an env var.
This stops one from letting it read the trust anchors from a Secret that is managed by e.g. cert-manager.
This PR uses an env var instead of the option to provide the trust anchors. For most helm chart users this doesn't change anything. However using kustomize the helm output manifest can now be adjusted (again) so that the certificate is loaded from a ConfigMap or Secret like in [this example](https://github.com/mgoltzsche/khelm/tree/master/example/kpt/linkerd) which aims to produce a static manifest to make the installation/update more declarative and support GitOps workflows.
This PR does not provide chart options/values to specify Secrets upfront - it would introduce dependencies to other operators.
Relates to #3843, see https://github.com/linkerd/linkerd2/issues/3843#issuecomment-775516217
Fixes #3321
Signed-off-by: Max Goltzsche <max.goltzsche@gmail.com>
This change counts the number of service profiles installed in a cluster
and adds that info to the heartbeat HTTP request.
Fixes #5474
Signed-off-by: Dennis Adjei-Baah <dennis@buoyant.io>
This renames the multicluster annotation prefix from `mirror.linkerd.io` to
`multicluster.linkerd.io` in order to reflect other extension naming patterns.
Additionally, it moves labels only used in the Multicluster extension into their
own labels file—again to reflect other extensions.
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
This adds namespace inheritance of the opaque ports annotation to services.
This means that the proxy injector now watches services creation in a cluster.
When a new service is created, the webhook receives an admission request for
that service and determines whether a patch needs to be created.
A patch is created if the service does not have the annotation, but the
namespace does. This means the service inherits the annotation from the
namespace.
A patch is not created if the service and the namespace do not have the
annotation, or the service has the annotation. In the case of the service having
the annotation, we don't even need to check the namespace since it would not
inherit it anyways.
If a namespace has the annotation value changed, this will not be reflected on
the service. The service would need to be recreated so that it goes through
another admission request.
None of this applies to the `inject` command which still skips service
injection. We rely on being able to check the namespace annotations, and this is
only possible in the proxy injector webhook when we can query the k8s API.
Closes #5737
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
We've created a custom domain, `cr.l5d.io`, that redirects to `ghcr.io`
(using `scarf.sh`). This custom domain allows us to swap the underlying
container registry without impacting users. It also provides us with
important metrics about container usage, without collecting PII like IP
addresses.
This change updates our Helm charts and CLIs to reference this custom
domain. The integration test workflow now refers to the new domain,
while the release workflow continues to use the `ghcr.io/linkerd` registry
for the purpose of publishing images.
Fixes #5755
Follow-up to #5750 and #5751
- Unifies the Go version across Docker and CI to be 1.14.15;
- Updates the GitHub Actions base image from ubuntu-18.04 to ubuntu-20.04; and
- Updates the runtime base image from debian:buster-20201117-slim to debian:buster-20210208-slim.
The Go-1.14 release branch includes a number of important updates. This
change updates our containers' base image to the latest release, 1.14.15.
See linkerd/linkerd2-proxy-init#32
Fixes #5655
This change removes the namespace inheritance of the opaque ports annotation.
Now when setting opaque port related fields in destination profile responses, we
only look at the pod annotations.
This prepares for #5736 where the proxy-injector will add the annotation from
the namespace if the pod does not have it already.
Closes #5735
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
Getting information about node topology queries the k8s api directly.
In an environment with high traffic and high number of pods, the
k8s api server can become overwhelmed or start throttling requests.
This MR introduces a node informer to resolve the bottleneck and
fetch node information asynchronously.
Fixes #5684
Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
* values: removal of .global field
Fixes #5425
With the new extension model, we no longer need the `Global` field
as we don't rely on chart dependencies anymore. This helps us
further clean up Values and makes configuration simpler.
To make upgrades and the use of the new CLI with older configs work,
we add a new method called `config.RemoveGlobalFieldIfPresent` that
is used in the upgrade and `FetchCurrentConfiguration` paths to remove the
global field and attach its child nodes if global is present. This is verified
by the `TestFetchCurrentConfiguration`'s older test that has the global
field.
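The idea behind removing the global field can be illustrated on an already-decoded values map. This is a sketch under assumptions: `liftGlobal` is a hypothetical name, it does not reproduce the signature of `config.RemoveGlobalFieldIfPresent`, and both attaching the children at the top level and letting existing top-level keys win on a clash are assumptions.

```go
package main

import "fmt"

// liftGlobal returns a copy of values in which the children of the "global"
// key, if present, are attached at the top level and "global" itself is gone.
// Assumption: on a key clash, the existing top-level value is kept.
func liftGlobal(values map[string]interface{}) map[string]interface{} {
	global, ok := values["global"].(map[string]interface{})
	if !ok {
		// No global field (or not a map): nothing to do.
		return values
	}
	out := make(map[string]interface{}, len(values)+len(global))
	for k, v := range global {
		out[k] = v
	}
	for k, v := range values {
		if k != "global" {
			out[k] = v
		}
	}
	return out
}

func main() {
	old := map[string]interface{}{
		"global":             map[string]interface{}{"namespace": "linkerd"},
		"controllerReplicas": 1,
	}
	fmt.Println(liftGlobal(old))
}
```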
We also don't yet remove .global in some helm stable-upgrade tests for
the initial install to work.
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
Pods with unusual DNS configurations may not be able to resolve the
control plane's domain names. We can avoid search path shenanigans by
adding a trailing dot to these names.
Closes #5545.
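A small sketch of the trailing-dot idea: a name that ends in a dot is treated as fully qualified, so the resolver does not expand it through the search path. `fullyQualify` is a hypothetical helper and the hostname in `main` is only an example.

```go
package main

import (
	"fmt"
	"strings"
)

// fullyQualify appends a trailing dot unless the name already has one.
func fullyQualify(name string) string {
	if strings.HasSuffix(name, ".") {
		return name
	}
	return name + "."
}

func main() {
	// Example name only; the actual control plane names may differ.
	fmt.Println(fullyQualify("linkerd-dst.linkerd.svc.cluster.local"))
}
```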
This change moves all tap and tap-injector code into the viz directory.
The tap and tap-injector components now also use a new tap image—separating
these components from the controller image that they are currently part of. This
means the controller image has removed all its build dependencies related to
tap.
Finally, the tap Protobuf has been separated from the metrics-api and moved into
its own `.proto` file and gen directory. This introduces a clear split between
metrics-api and tap Protobuf.
There is no change in behavior for the `viz tap` command.
### Reviewing
#### Docker images
All the bin directory scripts should be updated to build and load the tap image.
All the CI workflows should be updated to build and push the tap image.
#### Controller and pkg directories
This is primarily deletions. Most of the deleted code in this directory is now
in the tap directory of the Viz extension.
#### viz/tap
This is the location that all the tap related code now lives in. New files are
mostly moved from the controller and pkg directories. Imports have all been
updated to point at the right locations and Protobuf.
The Protobuf here is taken from metrics-api and contains all tap-related
Protobuf.
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
Fixes #5575
Now that only viz makes use of the `SelfCheck` api, merged the `healthcheck.proto` into `viz.proto`.
Also removed the "checkRPC" functionality that was used for handling multiple API responses and was only used by `SelfCheck`, because the extra complexity was not warranted. Reverted to using the plain vanilla "check" by just concatenating error responses.
## Success Output
```bash
$ bin/linkerd viz check
...
linkerd-viz
-----------
...
√ viz extension self-check
```
## Failure Examples
Failure when viz fails to connect to the k8s api:
```bash
$ bin/linkerd viz check
...
linkerd-viz
-----------
...
× viz extension self-check
Error calling the Kubernetes API: someerror
see https://linkerd.io/checks/#l5d-api-control-api for hints
Status check results are ×
```
Failure when viz fails to connect to Prometheus:
```bash
$ bin/linkerd viz check
...
linkerd-viz
-----------
...
× viz extension self-check
Error calling Prometheus from the control plane: someerror
see https://linkerd.io/checks/#l5d-api-control-api for hints
Status check results are ×
```
Failure when viz fails to connect to both the k8s api and Prometheus:
```bash
$ bin/linkerd viz check
...
linkerd-viz
-----------
...
× viz extension self-check
Error calling the Kubernetes API: someerror
Error calling Prometheus from the control plane: someerror
see https://linkerd.io/checks/#l5d-api-control-api for hints
Status check results are ×
```
This change adds the `jaeger.linkerd.io/tracing-enabled` annotation which is
automatically added by the Jaeger extension's `jaeger-injector`.
All pods that receive this annotation have also had the required environment
variables and volume/volume mounts added by the injector.
The purpose of this annotation is that it will allow `jaeger check` to check for
the presence of this annotation instead of needing to look at the proxy
containers directly. If this annotation is not present on pods, `jaeger check`
can warn users that tracing is not configured for those pods. This is similar to
`viz check` warning users that tap is not configured, which was recently added in #5602.
Closes #5632
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
* viz: add data-plane and prometheus healthchecks
Fixes #5325
This branch adds the remaining healthchecks for the viz extension,
i.e.
- Data-plane metrics check in Prometheus
- `--proxy` mode which also checks for tap injections based
on annotations.
For this, the following changes were needed:
- `Category.ID` is made public so that toggling `--proxy` can be
supported
- Made the tap env key a field so that it can be reused for
checks
- Simplified `viz.NewHealthChecker` by removing the need to
pass categoryIDs and instead using
`hc.appendCategories` directly at the caller to add the
required categories. This is possible by dividing the vizCategories
into separate functions
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
## What this changes
This allows the tap controller to inform `tap` users when pods either have tap
disabled or tap is not enabled yet.
## Why
When a user taps a resource that has not been admitted by the Viz extension's
`tap-injector`, tap is not explicitly disabled but it is also not enabled.
Therefore, the `tap` command hangs and provides no feedback to the user.
Closes #5544
## How
A new `viz.linkerd.io/tap-enabled` annotation is introduced which is
automatically added by the Viz extension's `tap-injector`. This annotation is
added to a pod when it is able to be tapped; this means that the pod and the
pod's namespace do not have the `config.linkerd.io/disable-tap` annotation
added.
When a user attempts to tap a resource, the tap controller now looks for this
new annotation; if the annotation is present on the pod then that pod is
tappable.
If the annotation is not present or tap is explicitly disabled, an error is
returned.
## UI changes
Multiple errors can now occur when trying to tap a resource:
1. There are no pods for the resource.
2. There are pods for the resource, but tap is disabled via pod or namespace
annotation.
3. There are pods for the resource, but tap is not yet enabled because the
`tap-injector` did not admit the resource.
Errors are now handled as shown below:
Tap is disabled:
```
❯ bin/linkerd viz tap deploy/test
Error: no pods to tap for deployment/test
pods found with tap disabled via the config.linkerd.io/disable-tap annotation
```
Tap is not enabled:
```
❯ bin/linkerd viz tap deploy/test
Error: no pods to tap for deployment/test
pods found with tap not enabled; try restarting resource so that it can be injected
```
There is a mix of pods with tap disabled and tap not enabled:
```
❯ bin/linkerd viz tap deploy/test
Error: no pods to tap for deployment/test
pods found with tap disabled via the config.linkerd.io/disable-tap annotation
pods found with tap not enabled; try restarting resource so that it can be injected
```
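The classification behind these errors can be sketched as follows. This is illustrative only: the `pod` type and `tappablePods` function are hypothetical, and treating a `"true"` value of the disable annotation as "tap disabled" is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const (
	tapEnabledAnnotation  = "viz.linkerd.io/tap-enabled"
	tapDisabledAnnotation = "config.linkerd.io/disable-tap"
)

// pod is a hypothetical, minimal stand-in for a Kubernetes Pod.
type pod struct {
	Name        string
	Annotations map[string]string
}

// tappablePods returns the pods that can be tapped, or an error explaining
// why none of the resource's pods qualify.
func tappablePods(resource string, pods []pod, nsAnnotations map[string]string) ([]pod, error) {
	var tappable []pod
	var disabled, notEnabled bool
	for _, p := range pods {
		_, enabled := p.Annotations[tapEnabledAnnotation]
		switch {
		case p.Annotations[tapDisabledAnnotation] == "true" || nsAnnotations[tapDisabledAnnotation] == "true":
			disabled = true
		case enabled:
			tappable = append(tappable, p)
		default:
			notEnabled = true
		}
	}
	if len(tappable) > 0 {
		return tappable, nil
	}
	lines := []string{"no pods to tap for " + resource}
	if disabled {
		lines = append(lines, "pods found with tap disabled via the "+tapDisabledAnnotation+" annotation")
	}
	if notEnabled {
		lines = append(lines, "pods found with tap not enabled; try restarting resource so that it can be injected")
	}
	return nil, errors.New(strings.Join(lines, "\n"))
}

func main() {
	pods := []pod{
		{Name: "a", Annotations: map[string]string{tapDisabledAnnotation: "true"}},
		{Name: "b"},
	}
	_, err := tappablePods("deployment/test", pods, nil)
	fmt.Println(err)
}
```

Collecting both reasons before returning is what allows the mixed case to report the two hints together.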
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
I ran `bin/update-codegen.sh` to update the generated code to include the opaque ports in the generated deepcopy function for service profiles.
Signed-off-by: Alex Leong <alex@buoyant.io>
* Protobuf changes:
- Moved `healthcheck.proto` back from viz to `proto/common` as it is still used by the main `healthcheck.go` library (it was moved to viz by #5510).
- Extracted from `viz.proto` the IP-related types and put them in `/controller/gen/common/net` to be used by both the public and the viz APIs.
* Added chart templates for new viz linkerd-metrics-api pod
* Spin-off viz healthcheck:
- Created `viz/pkg/healthcheck/healthcheck.go` that wraps the original `pkg/healthcheck/healthcheck.go` while adding the `vizNamespace` and `vizAPIClient` fields which were removed from the core `healthcheck`. That way the core healthcheck doesn't have any dependencies on viz, and viz' healthcheck can now be used to retrieve viz api clients.
- The core and viz healthcheck libs are now abstracted out via the new `healthcheck.Runner` interface.
- Refactored the data plane checks so they don't rely on calling `ListPods`
- The checks in `viz/cmd/check.go` have been moved to `viz/pkg/healthcheck/healthcheck.go` as well, so `check.go`'s sole responsibility is dealing with command business. This command also now retrieves its viz api client through viz' healthcheck.
* Removed linkerd-controller dependency on Prometheus:
- Removed the `global.prometheusUrl` config in the core values.yml.
- Leave the Heartbeat's `-prometheus` flag hard-coded temporarily. TO-DO: have it automatically discover viz and pull Prometheus' endpoint (#5352).
* Moved observability gRPC from linkerd-controller to viz:
- Created a new gRPC server under `viz/metrics-api` moving prometheus-dependent functions out of the core gRPC server and into it (same thing for the accompanying http server).
- Did the same for the `PublicAPIClient` (now called just `Client`) interface. The `VizAPIClient` interface disappears as it's enough to just rely on the viz `ApiClient` protobuf type.
- Moved the other files implementing the rest of the gRPC functions from `controller/api/public` to `viz/metrics-api` (`edge.go`, `stat_summary.go`, etc.).
- Also simplified some type names to avoid stuttering.
* Added linkerd-metrics-api bootstrap files. At the same time, we strip out of the public-api's `main.go` file the prometheus parameters and other no longer relevant bits.
* linkerd-web updates: it requires connecting with both the public-api and the viz api, so both addresses (and the viz namespace) are now provided as parameters to the container.
* CLI updates and other minor things:
- Changes to command files under `cli/cmd`:
- Updated `endpoints.go` according to new API interface name.
- Updated `version.go`, `dashboard` and `uninstall.go` to pull the viz namespace dynamically.
- Changes to command files under `viz/cmd`:
- `edges.go`, `routes.go`, `stat.go` and `top.go`: point to dependencies that were moved from public-api to viz.
- Other changes to have tests pass:
- Added `metrics-api` to list of docker images to build in actions workflows.
- In `bin/fmt` exclude protobuf generated files instead of entire directories because directories could contain both generated and non-generated code (case in point: `viz/metrics-api`).
* Add retry to 'tap API service is running' check
* mc check shouldn't err when viz is not available. Also properly set the log in multicluster/cmd/root.go so that it properly displays messages when --verbose is used
## What this changes
This adds a tap-injector component to the `linkerd-viz` extension which is
responsible for adding the tap service name environment variable to the Linkerd
proxy container.
If a pod does not have a Linkerd proxy, no action is taken. If tap is disabled
via annotation on the pod or the namespace, no action is taken.
This also removes the environment variable for explicitly disabling tap. Tap
status for a proxy is now determined only by the presence or absence of the tap
service name environment variable.
Closes #5326
## How it changes
### tap-injector
The tap-injector component determines if `LINKERD2_PROXY_TAP_SVC_NAME` should be
added to a pod's Linkerd proxy container environment. If the pod satisfies the
following, it is added:
- The pod has a Linkerd proxy container
- The pod has not already been mutated
- Tap is not disabled via annotation on the pod or the pod's namespace
### LINKERD2_PROXY_TAP_DISABLED
Now that tap is an extension of Linkerd and not a core component, it no longer
made sense to explicitly enable or disable tap through this Linkerd proxy
environment variable. The status of tap is now determined only by whether the
tap-injector adds the `LINKERD2_PROXY_TAP_SVC_NAME` environment
variable.
### controller image
The tap-injector has been added as one of the controller image's startup
commands, which determine what the container will do in the cluster.
As a follow-up, I think splitting out the `tap` and `tap-injector` commands from
the controller image into a linkerd-viz image (or something like that) makes
sense.
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
The Destination controller can panic due to a nil-deref when
the EndpointSlices API is enabled.
This change updates the controller to properly initialize values
to avoid this segmentation fault.
Fixes #5521
Signed-off-by: Oleg Ozimok <oleg.ozimok@corp.kismia.com>
* Separate observability API
Closes #5312
This is a preliminary step towards moving all the observability API into `/viz`, by first moving its protobuf into `viz/metrics-api`. This should facilitate review as the go files are not moved yet, which will happen in a followup PR. There are no user-facing changes here.
- Moved `proto/common/healthcheck.proto` to `viz/metrics-api/proto/healthcheck.proto`
- Moved the contents of `proto/public.proto` to `viz/metrics-api/proto/viz.proto` except for the `Version` stuff.
- Merged `proto/controller/tap.proto` into `viz/metrics-api/proto/viz.proto`
- `grpc_server.go` now temporarily exposes `PublicAPIServer` and `VizAPIServer` interfaces to separate both APIs. This will get properly split in a followup.
- The web server provides handlers for both interfaces.
- `cli/cmd/public_api.go` and `pkg/healthcheck/healthcheck.go` temporarily now have methods to access both APIs.
- Most of the CLI commands will use the Viz API, except for `version`.
The other changes in the go files are just changes in the imports to point to the new protobufs.
Other minor changes:
- Removed `git add controller/gen` from `bin/protoc-go.sh`
Ignore pods with status.phase=Succeeded when watching IP addresses
When a pod terminates successfully, some CNIs will assign its IP address
to newly created pods. This can lead to duplicate pod IPs in the same
Kubernetes cluster.
Filter out pods which are in a Succeeded phase since they are not
routable anymore.
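A minimal sketch of the filter described above, assuming a simplified `pod` type in place of the Kubernetes one; `routablePodsByIP` is a hypothetical helper name, not a function from the codebase.

```go
package main

import "fmt"

type podPhase string

const (
	phaseRunning   podPhase = "Running"
	phaseSucceeded podPhase = "Succeeded"
)

// pod is a simplified stand-in for the Kubernetes Pod type.
type pod struct {
	Name  string
	IP    string
	Phase podPhase
}

// routablePodsByIP returns the pods that own the given IP, skipping pods that
// have terminated successfully, since their IP may have been reassigned.
func routablePodsByIP(pods []pod, ip string) []pod {
	var out []pod
	for _, p := range pods {
		if p.IP == ip && p.Phase != phaseSucceeded {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []pod{
		{Name: "finished-job", IP: "10.0.0.5", Phase: phaseSucceeded},
		{Name: "web", IP: "10.0.0.5", Phase: phaseRunning},
	}
	for _, p := range routablePodsByIP(pods, "10.0.0.5") {
		fmt.Println(p.Name)
	}
}
```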
Fixes #5394
Signed-off-by: fpetkovski <filip.petkovsky@gmail.com>
Currently, public-api is part of the core control plane, where
the Prometheus check fails when run before the viz extension is installed.
This change comments out that check. Once the metrics API is moved into
viz, this check could become part of it instead, or directly part of
`linkerd viz check`.
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
Co-authored-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
The destination service now returns an `OpaqueTransport` hint when the
annotation matches the resolved target port. This is different from the current
behavior, which always sets the hint when a proxy is present.
Closes #5421
This happens by changing the endpoint watcher to set a pod's opaque port
annotation in certain cases. If the pod already has an annotation, then its
value is used. If the pod has no annotation, then it checks the namespace that
the endpoint belongs to; if it finds an annotation on the namespace then it
overrides the pod's annotation value with that.
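The lookup order described above can be sketched as follows. This is an illustration rather than the endpoint watcher's code: annotations are modelled as plain maps, the helper names are hypothetical, and the port list is assumed to be a simple comma-separated list of numbers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const opaquePortsAnnotation = "config.linkerd.io/opaque-ports"

// effectiveOpaquePorts returns the pod's own annotation value when it has one,
// and otherwise falls back to the value set on the pod's namespace.
func effectiveOpaquePorts(podAnn, nsAnn map[string]string) (string, bool) {
	if v, ok := podAnn[opaquePortsAnnotation]; ok {
		return v, true
	}
	v, ok := nsAnn[opaquePortsAnnotation]
	return v, ok
}

// isOpaque reports whether port is listed in the annotation value.
// Assumption: the value is a comma-separated list of port numbers.
func isOpaque(annotation string, port uint32) bool {
	for _, field := range strings.Split(annotation, ",") {
		n, err := strconv.ParseUint(strings.TrimSpace(field), 10, 32)
		if err == nil && uint32(n) == port {
			return true
		}
	}
	return false
}

func main() {
	ns := map[string]string{opaquePortsAnnotation: "3306,5432"}
	ports, ok := effectiveOpaquePorts(map[string]string{}, ns)
	fmt.Println(ok, isOpaque(ports, 5432), isOpaque(ports, 80))
}
```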
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
## What
When the destination service returns a destination profile for an endpoint,
indicate if the endpoint can receive opaque traffic.
## Why
Closes #5400
## How
When translating a pod address to a destination profile, the destination service
checks if the pod is controlled by any linkerd control plane. If it is, it can
set a protocol hint where we indicate that it supports H2 and opaque traffic.
If the pod supports opaque traffic, we need to get the port that it expects
inbound traffic on. We do this by getting the proxy container and reading its
`LINKERD2_PROXY_INBOUND_LISTEN_ADDR` environment variable. If we successfully
parse that into a port, we can set the opaque transport field in the destination
profile.
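A sketch of the port extraction, assuming the variable holds a `host:port` listen address such as `0.0.0.0:4143` (an example value, not taken from this change). The container environment is modelled as a map and `inboundPort` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

const inboundListenAddrEnv = "LINKERD2_PROXY_INBOUND_LISTEN_ADDR"

// inboundPort parses the proxy's inbound port out of its listen address.
func inboundPort(env map[string]string) (uint32, error) {
	addr, ok := env[inboundListenAddrEnv]
	if !ok {
		return 0, fmt.Errorf("%s is not set", inboundListenAddrEnv)
	}
	_, portStr, err := net.SplitHostPort(addr)
	if err != nil {
		return 0, fmt.Errorf("parsing %q: %w", addr, err)
	}
	// Ports are 16-bit, so reject anything that does not fit.
	port, err := strconv.ParseUint(portStr, 10, 16)
	if err != nil {
		return 0, fmt.Errorf("parsing port %q: %w", portStr, err)
	}
	return uint32(port), nil
}

func main() {
	port, err := inboundPort(map[string]string{inboundListenAddrEnv: "0.0.0.0:4143"})
	fmt.Println(port, err)
}
```

If parsing fails, the caller can simply leave the opaque transport field unset, which matches the "if we successfully parse" wording above.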
## Testing
A test has been added to the destination server where a pod has a
`linkerd-proxy` container. We can expect the `OpaqueTransport` field to be set
in the returned destination profile's protocol hint.
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
## Summary
This changes the destination service to start indicating whether a profile is an
opaque protocol or not.
Currently, profiles returned by the destination service are built by chaining
together updates coming from watching Profile and Traffic Split updates.
With this change, we now also watch updates to Opaque Port annotations on pods
and namespaces; if an update occurs this is now included in building a profile
update and is sent to the client.
## Details
Watching updates to Profiles and Traffic Splits is straightforward--we watch
those resources and if an update occurs on one associated with a service we care
about then the update is passed through.
For Opaque Ports this is a little different because it is an annotation on pods
or namespaces. To account for this, we watch the endpoints that we should care
about.
### When host is a Pod IP
When getting the profile for a Pod IP, we check for the opaque ports annotation
on the pod and the pod's namespace. If one is found and the requested port is
listed in it, we indicate that the profile is an opaque protocol.
We do not subscribe for updates to this pod IP. The only update we really care
about is if the pod is deleted and this is already handled by the proxy.
### When host is a Service
When getting the profile for a Service, we subscribe for updates to the
endpoints of that service. For any ports set in the opaque ports annotation on
any of the pods, we check if the requested port is present.
Since the endpoints for a service can be added and removed, we do subscribe for
updates to the endpoints of the service.
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
Fixes #5385
## The problems
- `linkerd install --ha` isn't honoring flags
- `linkerd upgrade --ha` is overriding existing configs silently or failing with an error
- *Upgrading HA instances from before 2.9 to version 2.9.1 results in configs being overridden silently, or the upgrade fails with an error*
## The cause
The change in #5358 attempted to fix `linkerd install --ha`, which was only applying some of the `values-ha.yaml` defaults, by calling `charts.NewValues(true)` and merging that with the values built from `values.yaml` overridden by the flags. It turns out the `charts.NewValues()` implementation was itself merging against `values.yaml`, and as a result any flag was getting overridden by its default.
This also happened when doing `linkerd upgrade --ha` on an existing instance, which could silently override settings, or fail loudly, for example when upgrading a setup that has an external issuer (in this case the issuer cert can't be read during upgrade and an error occurs as described in #5385).
Finally, doing `linkerd upgrade` (no `--ha` flag) on an HA install from before 2.9 results in configs getting overridden as well (silently or with an error). In order to generate the `linkerd-config-overrides` secret, the original install flags are retrieved from `linkerd-config` via the `loadStoredValuesLegacy()` function, which effectively ends up performing a `linkerd upgrade` with all the flags used for `linkerd install` and falls into the same trap as above.
## The fix
In `values.go` the faulty merging logic is not used anymore, so now `NewValues()` only returns the default values from `values.yaml` and no longer requires an argument. It calls `readDefaults()`, which now only returns the appropriate values depending on whether we're on HA or not.
There's a new function `MergeHAValues()` that merges `values-ha.yaml` into the current values (it doesn't look into `values.yaml` anymore), which is only used when processing the `--ha` flag in `options.go`.
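The effect of the merge order can be illustrated with flat maps. This is a simplified model, not the code in `values.go`: the keys, the `overlay` helper and the two merge functions are hypothetical, and real chart values are nested structures rather than string maps.

```go
package main

import "fmt"

// overlay returns a copy of base with every key of top applied on top of it.
func overlay(base, top map[string]string) map[string]string {
	out := make(map[string]string, len(base)+len(top))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range top {
		out[k] = v
	}
	return out
}

// buggyMerge models the old behavior: the HA values had already been merged
// with the defaults, so applying them re-applies every default as well.
func buggyMerge(defaults, ha, flags map[string]string) map[string]string {
	current := overlay(defaults, flags)
	haWithDefaults := overlay(defaults, ha)
	return overlay(current, haWithDefaults)
}

// fixedMerge models the new behavior: only the HA-specific values are merged
// into the current values, so settings coming from flags survive.
func fixedMerge(defaults, ha, flags map[string]string) map[string]string {
	current := overlay(defaults, flags)
	return overlay(current, ha)
}

func main() {
	defaults := map[string]string{"controllerLogLevel": "info", "controllerReplicas": "1"}
	ha := map[string]string{"controllerReplicas": "3"}
	flags := map[string]string{"controllerLogLevel": "debug"}

	fmt.Println("buggy:", buggyMerge(defaults, ha, flags))
	fmt.Println("fixed:", fixedMerge(defaults, ha, flags))
}
```

In the buggy variant the log level set through the flag is reset to its default, which is the symptom described above.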
## How to test
To replicate the issue try setting a custom setting and check it's not applied:
```bash
linkerd install --ha --controller-log-level debug | grep log.level
- -log-level=info
```
## Followup
This wasn't caught because we don't have HA integration tests. Now that our test infra is based on k3d, it should be easy to make such a test using a cluster with multiple nodes. Either that, or issuing `linkerd install --ha` with additional configs and comparing against a golden file.
Followup to #5282, fixes #5272 in its totality.
This follows the same pattern as the injector/sp-validator webhooks, leveraging `FsCredsWatcher` to watch for changes in the cert files.
To reuse code from the webhooks, we moved `updateCert()` to `creds_watcher.go`, and `run()` as well (which now is called `ProcessEvents()`).
The `TestNewAPIServer` test in `apiserver_test.go` was removed as it really was just testing two things: (1) that `apiServerAuth` doesn't error which is already covered in the following test, and (2) that the golib call `net.Listen("tcp", addr)` doesn't error, which we're not interested in testing here.
## How to test
To test that the injector/sp-validator functionality is still correct, you can refer to #5282
The steps below are similar, but focused towards the tap component:
```bash
# Create some root cert
$ step certificate create linkerd-tap.linkerd.svc ca.crt ca.key --profile root-ca --no-password --insecure
# configure tap's caBundle to be that root cert
$ cat > linkerd-overrides.yml << EOF
tap:
externalSecret: true
caBundle: |
< ca.crt contents>
EOF
# Install linkerd
$ bin/linkerd install --config linkerd-overrides.yml | k apply -f -
# Generate an intermediate cert with short lifespan
$ step certificate create linkerd-tap.linkerd.svc ca-int.crt ca-int.key --ca ca.crt --ca-key ca.key --profile intermediate-ca --not-after 4m --no-password --insecure --san linkerd-tap.linkerd.svc
# Create the secret using that intermediate cert
$ kubectl create secret tls \
linkerd-tap-k8s-tls \
--cert=ca-int.crt \
--key=ca-int.key \
--namespace=linkerd
# Rollout the tap pod for it to pick the new secret
$ k -n linkerd rollout restart deploy/linkerd-tap
# Tap should work
$ bin/linkerd tap -n linkerd deploy/linkerd-web
req id=0:0 proxy=in src=10.42.0.15:33040 dst=10.42.0.11:9994 tls=true :method=GET :authority=10.42.0.11:9994 :path=/metrics
rsp id=0:0 proxy=in src=10.42.0.15:33040 dst=10.42.0.11:9994 tls=true :status=200 latency=1779µs
end id=0:0 proxy=in src=10.42.0.15:33040 dst=10.42.0.11:9994 tls=true duration=65µs response-length=1709B
# Wait 5 minutes and rollout tap again
$ k -n linkerd rollout restart deploy/linkerd-tap
# You'll see in the logs that the cert expired:
$ k -n linkerd logs -f deploy/linkerd-tap tap
2020/12/15 16:03:41 http: TLS handshake error from 127.0.0.1:45866: remote error: tls: bad certificate
2020/12/15 16:03:41 http: TLS handshake error from 127.0.0.1:45870: remote error: tls: bad certificate
# Recreate the secret
$ step certificate create linkerd-tap.linkerd.svc ca-int.crt ca-int.key --ca ca.crt --ca-key ca.key --profile intermediate-ca --not-after 4m --no-password --insecure --san linkerd-tap.linkerd.svc
$ k -n linkerd delete secret linkerd-tap-k8s-tls
$ kubectl create secret tls \
linkerd-tap-k8s-tls \
--cert=ca-int.crt \
--key=ca-int.key \
--namespace=linkerd
# Wait a few moments and you'll see the certs got reloaded and tap is working again
time="2020-12-15T16:03:42Z" level=info msg="Updated certificate" addr=":8089" component=apiserver
```
Now that tracing has been split out of the main control plane and into the linkerd-jaeger extension, we remove references to tracing from the main control plane including:
* removing the tracing components from the main control plane chart
* removing the tracing injection logic from the main proxy injector and inject CLI (these will be added back into the new injector in the linkerd-jaeger extension)
* removing tracing related checks (these will be added back into `linkerd jaeger check`)
* removing related tests
We also update the `--control-plane-tracing` flag to configure the control plane components to send traces to the linkerd-jaeger extension. To make sure this works even when the linkerd-jaeger extension is installed in a non-default namespace, we also add a `--control-plane-tracing-namespace` flag which can be used to change the namespace that the control plane components send traces to.
Note that for now, only the control plane components send traces; the proxies in the control plane do not. This is because the linkerd-jaeger injector is not yet available. However, this change adds the appropriate namespace annotations to the control plane namespace to configure the proxies to send traces to the linkerd-jaeger extension once the linkerd-jaeger injector is available.
I tested this by doing the following:
1. bin/linkerd install | kubectl apply -f -
1. bin/helm install jaeger jaeger/charts/jaeger
1. bin/linkerd upgrade --control-plane-tracing=true | kubectl apply -f -
1. kubectl -n linkerd-jaeger port-forward svc/jaeger 16686
1. open http://localhost:16686
1. see traces from the linkerd control plane
Signed-off-by: Alex Leong <alex@buoyant.io>
Fixes #5257
This branch moves the mc charts and CLI-level code to a new
top-level directory. None of the logic is changed.
It also moves some common types into `/pkg` so that they
are accessible to both the main CLI and extensions.
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
* Have webhooks refresh their certs automatically
Partially fixes #5272
In 2.9 we introduced the ability to provide the certs for `proxy-injector` and `sp-validator` through some external means like cert-manager, through the new helm setting `externalSecret`.
We forgot, however, to have those services watch for changes in their secrets, so whenever they were rotated they would fail with a cert error, with the only workaround being to restart those pods to pick up the new secrets.
This addresses that by first abstracting out `FsCredsWatcher` from the identity controller, which now lives under `pkg/tls`.
The webhook's logic in `launcher.go` no longer reads the certs before starting the https server, moving that instead into `server.go` which in a similar way as identity will receive events from `FsCredsWatcher` and update `Server.cert`. We're leveraging `http.Server.TLSConfig.GetCertificate` which allows us to provide a function that will return the current cert for every incoming request.
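The `GetCertificate` hook can be sketched like this. It is a minimal illustration, not the code in `server.go`: `certStore` and `newServer` are hypothetical names, and the watcher that would call `Set` after each file change is left out. Note that Go invokes the callback once per TLS handshake.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net/http"
	"sync"
)

// certStore holds the certificate currently being served and lets a watcher
// swap it while the server is running.
type certStore struct {
	mu   sync.RWMutex
	cert *tls.Certificate
}

// Set replaces the served certificate, for example after a file change event.
func (s *certStore) Set(c *tls.Certificate) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cert = c
}

// GetCertificate has the signature expected by tls.Config.GetCertificate.
func (s *certStore) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.cert == nil {
		return nil, errors.New("no certificate loaded yet")
	}
	return s.cert, nil
}

// newServer builds an HTTPS server that asks the store for its certificate.
func newServer(addr string, s *certStore) *http.Server {
	return &http.Server{
		Addr:      addr,
		TLSConfig: &tls.Config{GetCertificate: s.GetCertificate},
	}
}

func main() {
	store := &certStore{}
	srv := newServer(":8443", store)

	_, err := srv.TLSConfig.GetCertificate(nil)
	fmt.Println("before load:", err)

	// A real watcher would load the key pair from disk here.
	store.Set(&tls.Certificate{})
	cert, err := srv.TLSConfig.GetCertificate(nil)
	fmt.Println("after load:", cert != nil, err)
}
```

Because the server never caches the certificate itself, rotating the secret only requires calling `Set` with the new key pair; no restart is needed.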
### How to test
```bash
# Create some root cert
$ step certificate create linkerd-proxy-injector.linkerd.svc ca.crt ca.key \
--profile root-ca --no-password --insecure --san linkerd-proxy-injector.linkerd.svc
# configure injector's caBundle to be that root cert
$ cat > linkerd-overrides.yaml << EOF
proxyInjector:
  externalSecret: true
  caBundle: |
    < ca.crt contents>
EOF
# Install linkerd. The injector won't start until we create the secret below
$ bin/linkerd install --controller-log-level debug --config linkerd-overrides.yaml | k apply -f -
# Generate an intermediate cert with short lifespan
$ step certificate create linkerd-proxy-injector.linkerd.svc ca-int.crt ca-int.key --ca ca.crt --ca-key ca.key --profile intermediate-ca --not-after 4m --no-password --insecure --san linkerd-proxy-injector.linkerd.svc
# Create the secret using that intermediate cert
$ kubectl create secret tls \
linkerd-proxy-injector-k8s-tls \
--cert=ca-int.crt \
--key=ca-int.key \
--namespace=linkerd
# start following the injector log
$ k -n linkerd logs -f -l linkerd.io/control-plane-component=proxy-injector -c proxy-injector
# Inject emojivoto. The pods should be injected normally
$ bin/linkerd inject https://run.linkerd.io/emojivoto.yml | kubectl apply -f -
# Wait about 5 minutes and delete a pod
$ k -n emojivoto delete po -l app=emoji-svc
# You'll see it won't be injected, and something like "remote error: tls: bad certificate" will appear in the injector logs.
# Regenerate the intermediate cert
$ step certificate create linkerd-proxy-injector.linkerd.svc ca-int.crt ca-int.key --ca ca.crt --ca-key ca.key --profile intermediate-ca --not-after 4m --no-password --insecure --san linkerd-proxy-injector.linkerd.svc
# Delete the secret and recreate it
$ k -n linkerd delete secret linkerd-proxy-injector-k8s-tls
$ kubectl create secret tls \
linkerd-proxy-injector-k8s-tls \
--cert=ca-int.crt \
--key=ca-int.key \
--namespace=linkerd
# Wait a couple of minutes and you'll see some filesystem events in the injector log along with a "Certificate has been updated" entry
# Then delete the pod again and you'll see it gets injected this time
$ k -n emojivoto delete po -l app=emoji-svc
```
CLI crashes if linkerd-config contains unexpected values.
Add a safe accessor that initializes an empty Global on the first
access. Refactor all accesses to use the newly introduced accessor using
gopls.
Add test for linkerd-config data without Global.
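The accessor pattern can be sketched as follows. The `All` and `Global` types and the `LinkerdNamespace` field are simplified, hypothetical stand-ins for the real config types; only the idea of lazily initializing an empty `Global` comes from the description above.

```go
package main

import "fmt"

// Global and All are simplified stand-ins for the real config types.
type Global struct {
	LinkerdNamespace string
}

type All struct {
	Global *Global
}

// GetGlobal returns the Global section, creating an empty one on first access
// so that callers never dereference a nil pointer.
func (a *All) GetGlobal() *Global {
	if a.Global == nil {
		a.Global = &Global{}
	}
	return a.Global
}

func main() {
	cfg := &All{} // config data without a Global section
	fmt.Printf("namespace=%q\n", cfg.GetGlobal().LinkerdNamespace)
}
```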
Fixes#5215
Co-authored-by: Itai Schwartz <yitai27@gmail.com>
Signed-off-by: Hod Bin Noon <bin.noon.hod@gmail.com>