The ExternalWorkload resource we introduced has a minor naming
inconsistency; `Tls` in `meshTls` is not capitalised. Other resources
that we have (e.g. authentication resources) capitalise TLS (and so does
Go, it follows a similar naming convention).
We fix this in the workload resource by changing the field's name and
bumping the version to `v1beta1`.
Upgrading the control plane version will continue to work without
downtime. However, if an existing resource exists, the policy controller
will not completely initialise. It will not enter a crashloop backoff,
but it will also not become ready until the resource is edited or
deleted.
Signed-off-by: Matei David <matei@buoyant.io>
Adds a metric that measures the number of items that have been discarded from the work queue in the external workloads controller due to the retries limit being exceeded.
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
Fixes#12032
The Destination controller server tests test the destination server with `enableEndpointSlices=false`. The default for this value is true, meaning that these tests do not test the default configuration.
We update the tests to test with `enableEndpointSlices=true` and update the corresponding mock kubernetes Endpoints resources to be EndpointSlices instead. We also fix an instance where the workload watcher was using Endpoints even when in EndpointSlices mode.
Signed-off-by: Alex Leong <alex@buoyant.io>
In certain cases (e.g. high CPU load) kubelets can be slow to read readiness
and liveness responses. Linkerd is configured with a default time out of `1s`
for its probes. To prevent injected pod restarts under high load, this
change makes probe timeouts configurable.
---------
Signed-off-by: Matei David <matei@buoyant.io>
Co-authored-by: Matei David <matei@buoyant.io>
Co-authored-by: Alejandro Pedraza <alejandro@buoyant.io>
A Server may only select workloads in its own namespace. Therefore, when the
destination controller receives an update for a Server, it only needs to
potentially send updates to watches on workloads in that same namespace. Taking
this into account allows us avoid all opaqueness computations for workloads in
other namespaces.
Signed-off-by: Alex Leong <alex@buoyant.io>
Fixes#12010
## Problem
We're observing crashes in the destination controller in some scenarios, due to data race as described in #12010.
## Cause
The problem is the same instance of the `AddressSet.Addresses` map is getting mutated in the endpoints watcher Server [informer handler](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/watcher/endpoints_watcher.go#L1309), and iterated over in the endpoint translator [queue loop](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/endpoint_translator.go#L197-L211), which run in different goroutines and the map is not guarded. I believe this doesn't result in Destination returning stale data; it's more of a correctness issue.
## Solution
Make a shallow copy of `pp.addresses` in the endpoints watcher and only pass that to the listeners. It's a shallow copy because we avoid making copies of the pod reference in there, knowing it won't get mutated.
## Repro
Install linkerd core and injected emojivoto and patch the endpoint translator to include a sleep call that will help surfacing the race (don't install the patch in the cluster; we'll only use it locally below):
<details>
<summary>endpoint_translator.go diff</summary>
```diff
diff --git a/controller/api/destination/endpoint_translator.go b/controller/api/destination/endpoint_translator.go
index d1018d5f9..7d5abd638 100644
--- a/controller/api/destination/endpoint_translator.go
+++ b/controller/api/destination/endpoint_translator.go
@@ -5,6 +5,7 @@ import (
"reflect"
"strconv"
"strings"
+ "time"
pb "github.com/linkerd/linkerd2-proxy-api/go/destination"
"github.com/linkerd/linkerd2-proxy-api/go/net"
@@ -195,7 +196,9 @@ func (et *endpointTranslator) processUpdate(update interface{}) {
}
func (et *endpointTranslator) add(set watcher.AddressSet) {
for id, address := range set.Addresses {
+ time.Sleep(1 * time.Second)
et.availableEndpoints.Addresses[id] = address
}
```
</details>
Then create these two Server manifests:
<details>
<summary>emoji-web-server.yml</summary>
```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
namespace: emojivoto
name: web-http
labels:
app.kubernetes.io/part-of: emojivoto
app.kubernetes.io/name: web
app.kubernetes.io/version: v11
spec:
podSelector:
matchLabels:
app: web-svc
port: http
proxyProtocol: HTTP/1
```
</details>
<details>
<summary>emoji-web-server-opaque.yml</summary>
```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
namespace: emojivoto
name: web-http
labels:
app.kubernetes.io/part-of: emojivoto
app.kubernetes.io/name: web
app.kubernetes.io/version: v11
spec:
podSelector:
matchLabels:
app: web-svc
port: http
proxyProtocol: opaque
```
</details>
In separate consoles run the patched destination service and a destination client:
```bash
HOSTNAME=foobar go run -race ./controller/cmd/main.go destination -enable-h2-upgrade=true -enable-endpoint-slices=true -cluster-domain=cluster.local -identity-trust-domain=cluster.local -default-opaque-ports=25,587,3306,4444,5432,6379,9300,11211
```
```bash
go run ./controller/script/destination-client -path web-svc.emojivoto.svc.cluster.local:80
```
And run this to continuously switch the `proxyProtocol` field:
```bash
while true; do kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server.yml; kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server-opaque.yml ; done
```
You'll see the following data race report in the Destination controller logs:
<details>
<summary>destination logs</summary>
```console
==================
WARNING: DATA RACE
Write at 0x00c0006d30e0 by goroutine 178:
github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).updateServer()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:1310 +0x772
github.com/linkerd/linkerd2/controller/api/destination/watcher.(*servicePublisher).updateServer()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:711 +0x150
github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).addServer()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:514 +0x173
github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:528 +0x26f
github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer-fm()
<autogenerated>:1 +0x64
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate()
/home/alpeb/go/pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/controller.go:246 +0x81
k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnUpdate()
<autogenerated>:1 +0x1f
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/home/alpeb/go/pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/shared_informer.go:970 +0x1f4
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
/home/alpeb/go/pkg/mod/k8s.io/apimachinery@v0.29.1/pkg/util/wait/backoff.go:226 +0x41
k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
/home/alpeb/go/pkg/mod/k8s.io/apimachinery@v0.29.1/pkg/util/wait/backoff.go:227 +0xbe
k8s.io/apimachinery/pkg/util/wait.JitterUntil()
/home/alpeb/go/pkg/mod/k8s.io/apimachinery@v0.29.1/pkg/util/wait/backoff.go:204 +0x10a
k8s.io/apimachinery/pkg/util/wait.Until()
/home/alpeb/go/pkg/mod/k8s.io/apimachinery@v0.29.1/pkg/util/wait/backoff.go:161 +0x9b
k8s.io/client-go/tools/cache.(*processorListener).run()
/home/alpeb/go/pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/shared_informer.go:966 +0x38
k8s.io/client-go/tools/cache.(*processorListener).run-fm()
<autogenerated>:1 +0x33
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
/home/alpeb/go/pkg/mod/k8s.io/apimachinery@v0.29.1/pkg/util/wait/wait.go:72 +0x86
Previous read at 0x00c0006d30e0 by goroutine 360:
github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).add()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:200 +0x1ab
github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).processUpdate()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:190 +0x166
github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).Start.func1()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:174 +0x45
```
</details>
## Extras
This also removes the unused method `func (as *AddressSet) WithPort(port Port) AddressSet` in endpoints_watcher.go
Fixes#11995
If a Server is marking a Pod's port as opaque and then the Server's podSelector is updated to no longer select that Pod, then the Pod's port should no longer be marked as opaque. However, this update does not result in any updates from the destination API's Get stream and the port remains marked as opaque.
We fix this by updating the endpoint watcher's handling of Server updates to consider both the old and the new Server.
Signed-off-by: Alex Leong <alex@buoyant.io>
Revise leader election logic for endpoints controller
Our leader election logic can result in updates being missed under certain
conditions. Leases expire after their duration is up, even if their current
holder has been terminated. During this dead time, any changes in the
system will be observed by other controllers, but will not be written to
the API Server.
For example, during a rollout, a controller that has come up will not be
able to acquire the lease for a maximum time of 30 seconds (lease
duration). Within this time frame, any changes to the system (e.g. modified
workloads, services, deleted endpointslices) will be observed but not acted
on by the newly created controller. Once the controller gets into a bad
state, it can only recover after 10 minutes (via service resyncs) or if any
resources are modified.
To address this, we change our leader election mechanism. Instead of
pushing leader election to the edge (i.e. when performing writes) we only
allow events to be observed when a controller is leading (i.e. by
registering callbacks). When a controller stops leading, all of its
callbacks will be de-registered.
NOTE:
* controllers will have a grace period during which they can renew their
lease. Their callbacks will be de-registered only if this fails. We will
not register and de-register callbacks that often for a single
controller.
* we do not lose out on any state. Other informers will continue to run
(e.g. destination readers). When callbacks are registered, we pass all of
the cached objects through them. In other words, we do not issue API
requests on registration, we process the state of the cluster as observed
from the cache.
* we make another change that's slightly orthogonal. Before we shutdown,
we ensure to drain the queue. This should not be a race since we will
first block until the queue is drained, then signal to the leader elector
loop that we are done. This gives us some confidence that all events have
been processed as soon as they were observed.
Signed-off-by: Matei David <matei@buoyant.io>
We the destination controller's workload watcher receives an update for any Server resource, it recomputes opaqueness for every workload. This is because the Server update may have changed opaqueness for that workload. However, this is very CPU intensive for the destination controller, especially during resyncs when we get Server updates for every Server resource in the cluster.
Instead, we only need to recompute opaqueness for workloads that are selected by the old version of the Server or by the new version of the Server. If a workload is not selected by either the new or old version of the Server, then the Server update cannot have changed the workload's opaqueness.
Signed-off-by: Alex Leong <alex@buoyant.io>
Any slices generated for a group of external workloads follow a similar
convention: `linkerd-external-<svc-name>-<hash>`. Currently the hash is
appended directly to the service name making it less readable. We add a
`-` to the generate name value so that random hashes are not part of the
service name. This is similar to the upstream implementation.
Signed-off-by: Matei David <matei@buoyant.io>
When the destination controller receives a GetProfile request for an ExternalName service, it should return gRPC status code INVALID_ARGUMENT to signal to the proxy that ExternalName services do not have endpoints and service discovery should not be used. However, the destination controller is returning gRPC status code UNKNOWN instead. This causes the proxy to retry these requests, resulting in a flurry of warning logs in the destination controller.
We fix the destination controller to properly return INVALID_ARGUMENT instead of UNKNOWN for ExternalName services.
Signed-off-by: Alex Leong <alex@buoyant.io>
If the readiness of an external workload endpoint changes while traffic
is being sent to it, the update will not be propagated to clients. This
can lead to issues where an endpoint that is marked as `notReady`
continues to figure out as being `ready` by the endpoints watcher.
The issue stems from how endpoint slices are diffed. A utility function
responsible for processing addresses does not consider endpoints whose
targetRef is an external workload. We fix the problem and add two module
tests to validate readiness is propagated to clients correctly.
---------
Signed-off-by: Matei David <matei@buoyant.io>
We introduced an endpoints controller that will be responsible for
managing EndpointSlices for services that select external workloads. We
introduce as a follow-up the reconciler component of the controller that
will be responsible for doing the writes and diffing.
Additionally, the controller is wired-up in the destination service's
main routine and will start if endpoint slice support is enabled.
---------
Signed-off-by: Matei David <matei@buoyant.io>
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
Co-authored-by: Zahari Dichev <zaharidichev@gmail.com>
ExternalWorkload resources require that status condition has almost all of its
fields set (with the exception of a date field). The original inspiration for
this design was the HTTPRoute object.
When using the resource, it is more practical to handle many of the fields as
optional; it is cumbersome to fill out the fields when creating an
ExternalWorkload. We change the settings to be in-line with a [Pod] object
instead.
[Pod]:
7d1a2f7a73/core/v1/types.go (L3063-L3084)
---------
Signed-off-by: Matei David <matei@buoyant.io>
When a meshed client attempts to establish a connection directly to the workload IP of an ExternalWorkload, the destination controller should return an endpoint profile for that ExternalWorkload with a single endpoint and the metadata associated with that ExternalWorkload including:
* mesh TLS identity
* workload metric labels
* opaque / protocol hints
Signed-off-by: Alex Leong <alex@buoyant.io>
This PR adds metrics to the work queue that is used in the external workload endpoints controller.
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
For mesh expansion, we need to register an ExternalWorkload's service
membership. Service memberships describe which Service objects an
ExternalWorkload is part of (i.e. which service can be used to route
traffic to an external endpoint).
Service membership will allow the control plane to discover
configuration associated with an external endpoint when performing
discovery on a service target.
To build these memberships, we introduce a new controller to the
destination service, responsible for watching Service and
ExternalWorkload objects, and for writing out EndpointSlice objects for
each Service that selects one or more external endpoints.
As a first step, we add a new externalworkload module and a new controller in the
that watches services and workloads. In a follow-up change,
the ExternalEndpointManager will additionally perform
the necessary reconciliation by writing EndpointSlice objects.
Since Linkerd's control plane may run in HA, we also add a lease object
that will be used by the manager. When a lease is claimed, a flag is
turned on in the manager to let it know it may perform writes.
A more compact list of changes:
* Add a new externalworkload module
* Add an EndpointsController in the module along with necessary mechanisms to watch resources.
* Add RBAC rules to the destination service:
* Allow policy and destination to read ExternalWorkload objects
* Allow destination to create / update / read Lease objects
---------
Signed-off-by: Matei David <matei@buoyant.io>
Currently, the value that is put in the `LINKERD2_PROXY_POLICY_WORKLOAD` env var has the format of `pod_ns:pod_name`. This PR changes the format of the policy token into a json struct, so it can encode the type of workload and not only its location. For now, we add an additional `external_workload` type.
Zahari Dichev <zaharidichev@gmail.com>
We introduced an ExternalWorkload CRD along with bindings for mesh
expansion. Currently, the CRD allows users to create ExternalWorkload
resources without adding a meshTls strategy.
This change adds some more validation restrictions to the CRD definition
(i.e. server side validation). When a meshTls strategy is used, we
require both identity and serverName to be present. We also mark meshTls
as the only required field in the spec. Every ExternalWorkload regardless
of the direction of its traffic must have it set.
WorkloadIPs and ports now become optional to allow resources to be
created only to configure outbound discovery (VM to workload)
and inbound policy discovery (VM).
---------
Signed-off-by: Matei David <matei@buoyant.io>
Whenever the destination controller's informer receives an update of a Server resource, it checks every portPublisher in the endpointsWatcher to see if the Server selects any pods in that servicePort and updates those pods' opaque protocol field. Regardless of if any pods were matched or if the opaque protocol changed, an update is sent to each listener. This results in an update to every endpointTranslator each time a Server is updated. During a resync, we get an update for every Server in the cluster which results in N updates to each endpointTranslator where N is the number of Servers in the cluster.
If N is greater than 100, it becomes possible that these N updates could overflow the endpointTranslator update queue if the queue is not being drained fast enough.
We change this to only send the update for a Server if at least one of the servicePort addresses was selected by that server AND it's opaque protocol field changed.
Signed-off-by: Alex Leong <alex@buoyant.io>
We introduced an ExternalWorkload CRD for mesh expansion. This change
follows up by adding bindings for Rust and Go code.
For Go code:
* We add a new schema and ExternalWorkload types
* We also update the code-gen script to generate informers
* We add a new informer type to our abstractions built on-top of
client-go, including a function to check if a client has access to the
resource.
For Rust code:
* We add ExternalWorkload bindings to the policy controller.
---------
Signed-off-by: Matei David <matei@buoyant.io>
Unregister prom gauges when recycling cluster watcher
Fixes#11839
When in `restartClusterWatcher` we fail to connect to the target cluster
for whatever reason, the function gets called again 10s later, and tries
to register the same prometheus metrics without unregistering them
first, which generates warnings.
The problem lies in `NewRemoteClusterServiceWatcher`, which instantiates
the remote kube-api client and registers the metrics, returning a nil
object if the client can't connect. `cleanupWorkers` at the beginning of
`restartClusterWatcher` won't unregister those metrics because of that
nil object.
To fix this, gauges are unregistered on error.
linkerd/linkerd2-proxy#2587 adds configuration parameters that bound the
lifetime and idle times of control plane streams. This change helps to
mitigate imbalanced control plane replica usage and to generally prevent
scenarios where a stream becomes "stuck," as has been observed when a
control plane replica is unhealthy.
This change adds helm values to control this behavior. Default values
are provided.
In 71635cb and 357a1d3 we updated the endpoint and profile translators
to prevent backpressure from stalling the informer tasks. This change
updates the endpoint profile translator with the same fix, so that
updates are buffered and can detect when when a gRPC stream is stalled.
Furthermore, the update method is updated to use a protobuf-aware
comparison method instead of using serialization and string comparison.
A test is added for the endpoint profile translator, since none existed
previously.
When we do a `GetProfile` lookup for an unmeshed pod, we set the `weightedAddr.ProtocolHint` to an empty value `&pb.ProtocolHint{}` to indicate that the address is unmeshed and has no protocol hint. However, when the looked up port is in the default opaque list, we erroneously check if `weightedAddr.ProtocolHint != nil` to determine if we should attempt to get the inbound listen port for that pod. Since `&pb.ProtocolHint{} != nil`, we attempt to get the inbound listen port for the unmeshed pod. This results in an error, preventing any valid `GetProfile` responses from being returned.
We update the initialization logic for `weightedAddr.ProtocolHint` to only create a struct when a protocol hint is present and to leave it as `nil` if the pod is unmeshed.
We add a simple unit test for this behavior as well.
Signed-off-by: Alex Leong <alex@buoyant.io>
This change adds a runtime flag to the destination controller,
--experimental-endpoint-zone-weights=true, that causes endpoints in the
local zone to receive higher weights. This feature is disabled by
default, since the weight value is not honored by proxies. No helm
configuration is exposed yet, either.
This weighting is instrumented in the endpoint translator. Tests are
added to confirm that the behavior is feature-gated.
Additionally, this PR adds the "zone" metric label to endpoint metadata
responses.
Recent versions of the code-generator package have replaced the `generate-groups.sh` script that we use to generate client-go bindings for custom resource types with a new script called `kube_codegen.sh`. This PR updates our `update-codgen.sh` script to use `kube_codegen.sh` instead of `generate-groups.sh`.
Signed-off-by: Alex Leong <alex@buoyant.io>
* Add ability to configure client-go's `QPS` and `Burst` settings
## Problem and Symptoms
When having a very large number of proxies request identity in a short period of time (e.g. during large node scaling events), the identity controller will attempt to validate the tokens sent by the proxies at a rate surpassing client-go's the default request rate threshold, triggering client-side throttling, which will delay the proxies initialization, and even failing their startup (after a 2m timeout). The identity controller will surface this through log entries like this:
```
time="2023-11-08T19:50:45Z" level=error msg="error validating token for web.emojivoto.serviceaccount.identity.linkerd.cluster.local: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
```
## Solution
Client-go's default `QPS` is 5 and `Burst` is 10. This PR exposes those settings as entries in `values.yaml` with defaults of 100 and 200 respectively. Note this only applies to the identity controller, as it's the only controller performing direct requests to the `kube-apiserver` in a hot path. The other controllers mostly rely in informers, and direct calls are sporadic.
## Observability
The `QPS` and `Burst` settings used are exposed both as a log entry as soon as the controller starts, and as in the new metric gauges `http_client_qps` and `http_client_burst`
## Testing
You can use the following K6 script, which simulates 6k calls to the `Certify` service during one minute from emojivoto's web pod. Before running this you need to:
- Put the identity.proto and [all the other proto files](https://github.com/linkerd/linkerd2-proxy-api/tree/v0.11.0/proto) in the same directory.
- Edit the [checkRequest](https://github.com/linkerd/linkerd2/blob/edge-23.11.3/pkg/identity/service.go#L266) function and add logging statements to figure the `token` and `csr` entries you can use here, that will be shown as soon as a web pod starts.
```javascript
import { Client, Stream } from 'k6/experimental/grpc';
import { sleep } from 'k6';
const client = new Client();
client.load(['.'], 'identity.proto');
// This always holds:
// req_num = (1 / req_duration ) * duration * VUs
// Given req_duration (0.5s) test duration (1m) and the target req_num (6k), we
// can solve for the required VUs:
// VUs = req_num * req_duration / duration
// VUs = 6000 * 0.5 / 60 = 50
export const options = {
scenarios: {
identity: {
executor: 'constant-vus',
vus: 50,
duration: '1m',
},
},
};
export default () => {
client.connect('localhost:8080', {
plaintext: true,
});
const stream = new Stream(client, 'io.linkerd.proxy.identity.Identity/Certify');
// Replace with your own token
let token = "ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNkluQjBaV1pUZWtaNWQyVm5OMmxmTTBkV2VUTlhWSFpqTmxwSmJYRmtNMWRSVEhwNVNHWllhUzFaZDNNaWZRLmV5SmhkV1FpT2xzaWFXUmxiblJwZEhrdWJEVmtMbWx2SWwwc0ltVjRjQ0k2TVRjd01EWTRPVFk1TUN3aWFXRjBJam94TnpBd05qQXpNamt3TENKcGMzTWlPaUpvZEhSd2N6b3ZMMnQxWW1WeWJtVjBaWE11WkdWbVlYVnNkQzV6ZG1NdVkyeDFjM1JsY2k1c2IyTmhiQ0lzSW10MVltVnlibVYwWlhNdWFXOGlPbnNpYm1GdFpYTndZV05sSWpvaVpXMXZhbWwyYjNSdklpd2ljRzlrSWpwN0ltNWhiV1VpT2lKM1pXSXRPRFUxT1dJNU4yWTNZeTEwYldJNU5TSXNJblZwWkNJNklqaGlZbUV5WWpsbExXTXdOVGN0TkRnMk1TMWhNalZsTFRjelpEY3dOV1EzWmpoaU1TSjlMQ0p6WlhKMmFXTmxZV05qYjNWdWRDSTZleUp1WVcxbElqb2lkMlZpSWl3aWRXbGtJam9pWm1JelpUQXlNRE10TmpZMU55MDBOMk0xTFRoa09EUXRORGt6WXpBM1lXUTJaak0zSW4xOUxDSnVZbVlpT2pFM01EQTJNRE15T1RBc0luTjFZaUk2SW5ONWMzUmxiVHB6WlhKMmFXTmxZV05qYjNWdWREcGxiVzlxYVhadmRHODZkMlZpSW4wLnlwMzAzZVZkeHhpamxBOG1wVjFObGZKUDB3SC03RmpUQl9PcWJ3NTNPeGU1cnNTcDNNNk96VWR6OFdhYS1hcjNkVVhQR2x2QXRDRVU2RjJUN1lKUFoxVmxxOUFZZTNvV2YwOXUzOWRodUU1ZDhEX21JUl9rWDUxY193am9UcVlORHA5ZzZ4ZFJNcW9reGg3NE9GNXFjaEFhRGtENUJNZVZ6a25kUWZtVVZwME5BdTdDMTZ3UFZWSlFmNlVXRGtnYkI1SW9UQXpxSmcyWlpyNXBBY3F5enJ0WE1rRkhSWmdvYUxVam5sN1FwX0ljWm8yYzJWWk03T2QzRjIwcFZaVzJvejlOdGt3THZoSEhSMkc5WlNJQ3RHRjdhTkYwNVR5ZC1UeU1BVnZMYnM0ZFl1clRYaHNORjhQMVk4RmFuNjE4d0x6ZUVMOUkzS1BJLUctUXRUNHhWdw==";
// Replace with your own CSR
let csr = "MIIBWjCCAQECAQAwRjFEMEIGA1UEAxM7d2ViLmVtb2ppdm90by5zZXJ2aWNlYWNjb3VudC5pZGVudGl0eS5saW5rZXJkLmNsdXN0ZXIubG9jYWwwWTATBgcqhkjOPQIBBggqhkjOPQMBBwNCAATKjgVXu6F+WCda3Bbq2ue6m3z6OTMfQ4Vnmekmvirip/XGyi2HbzRzjARnIzGlG8wo4EfeYBtd2MBCb50kP8F8oFkwVwYJKoZIhvcNAQkOMUowSDBGBgNVHREEPzA9gjt3ZWIuZW1vaml2b3RvLnNlcnZpY2VhY2NvdW50LmlkZW50aXR5LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAKBggqhkjOPQQDAgNHADBEAiAM7aXY8MRs/EOhtPo4+PRHuiNOV+nsmNDv5lvtJt8T+QIgFP5JAq0iq7M6ShRNkRG99ZquJ3L3TtLWMNVTPvqvvUE=";
const data = {
identity: "web.emojivoto.serviceaccount.identity.linkerd.cluster.local",
token: token,
certificate_signing_request: csr,
};
stream.write(data);
// This request takes around 2ms, so this sleep will mostly determine its final duration
sleep(0.5);
};
```
This results in the following report:
```
scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
* identity: 50 looping VUs for 1m0s (gracefulStop: 30s)
data_received................: 6.3 MB 104 kB/s
data_sent....................: 9.4 MB 156 kB/s
grpc_req_duration............: avg=2.14ms min=873.93µs med=1.9ms max=12.89ms p(90)=3.13ms p(95)=3.86ms
grpc_streams.................: 6000 99.355331/s
grpc_streams_msgs_received...: 6000 99.355331/s
grpc_streams_msgs_sent.......: 6000 99.355331/s
iteration_duration...........: avg=503.16ms min=500.8ms med=502.64ms max=532.36ms p(90)=504.05ms p(95)=505.72ms
iterations...................: 6000 99.355331/s
vus..........................: 50 min=50 max=50
vus_max......................: 50 min=50 max=50
running (1m00.4s), 00/50 VUs, 6000 complete and 0 interrupted iterations
```
With the old defaults (QPS=5 and Burst=10), the latencies would be much higher and number of complete requests much lower.
* Add native sidecar support
Kubernetes will be providing beta support for native sidecar containers in version 1.29. This feature improves network proxy sidecar compatibility for jobs and initContainers.
Introduce a new annotation config.alpha.linkerd.io/proxy-enable-native-sidecar and configuration option Proxy.NativeSidecar that causes the proxy container to run as an init-container.
Fixes: #11461
Signed-off-by: TJ Miller <millert@us.ibm.com>
This removes `endpointProfileTranslator`'s dependency on the k8sAPI and
metadataAPI, which are not used. This was introduced in one of #11334's
refactorings, but ended up being not required. No functional changes
here.
https://github.com/linkerd/linkerd2/pull/11491 changed the EndpointTranslator to use a queue to avoid calling `Send` on a gRPC stream directly from an informer callback goroutine. This change updates the ProfileTranslator in the same way, adding a queue to ensure we do not block the informer thread.
Signed-off-by: Alex Leong <alex@buoyant.io>
In order to detect if the destination controller's k8s informers have fallen behind, we add a histogram for each resource type. These histograms track the delta between when an update to a resource occurs and when the destination controller processes that update. We do this by looking at the timestamps on the managed fields of the resource and looking for the most recent update and comparing that to the current time.
The histogram metrics are of the form `{kind}_informer_lag_ms_bucket_*`.
* We record a value only for updates, not for adds or deletes. This is because when the controller starts up, it will populate its cache with an add for each resource in the cluster and the delta between the last updated time of that resource and the current time may be large. This does not represent informer lag and should not be counted as such.
* When the informer performs resyncs, we get updates where the updated time of the old version is equal to the updated time of the new version. This does not represent an actual update of the resource itself and so we do not record a value.
* Since we are comparing timestamps set on the manged fields of resources to the current time from the destination controller's system clock, the accuracy of these metrics depends on clock drift being minimal across the cluster.
* We use histogram buckets which range from 500ms to about 17 minutes. In my testing, an informer lag of 500ms-1000ms is typical. However, we wish to have enough buckets to identify cases where the informer is lagged significantly behind.
Signed-off-by: Alex Leong <alex@buoyant.io>
* Update dev to v42
* Update Go to 1.21.3
* Update Rust to 1.73.0
* Update the Cargo workspace to use the v2 package resolver
* Update debian from bullseye to bookworm
* Update golangci-lint to 1.55.1
* Disable deprecated linters (deadcode, varcheck)
* Disable goconst linter -- pointless and noisy
* Disable depguard linter -- it requires that all of our Go dependencies be added to allowlists;
* Update K3d to v5.6.0
* Update CI from k3s 1.26 to 1.28
* Update markdownlint-cli2 to 0.10.0
When we do a GetProfile lookup for an opaque port on an unmeshed pod,
we attempt to look up the inbound listen port of that pod's proxy. Since
that pod has no proxy, this fails and we return an error to the GetProfile
API call. This causes the proxy to fail to be able to resolve the profile and
be unable to route the traffic.
We revert to the previous behavior of only logging when we cannot look
up the inbound listen port instead of returning an error.
Signed-off-by: Alex Leong <alex@buoyant.io>
Fixes#11065
When an inbound proxy receives a request with a canonical name of the form `hostname.service.namespace.svc.cluster.domain`, we assume that `hostname` is the hostname of the pod as described in https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-hostname-and-subdomain-fields. However, pods are also addressable with `pod-ip.service.namespace.svc.cluster.domain`. When the destination controller gets a profile request of this form, we attempt to find a pod with hostname of `pod-ip` and return an error with gRPC status `Unknown` since this will not exist.
It is expected that this profile lookup will fail since we cannot have service profiles for individual pods. However, returning a gRPC status `Unknown` for these requests brings the reported success rate of the destination controller down. Instead we should return these as gRPC status `NotFound` so that these responses don't get reported as server errors.
Signed-off-by: Alex Leong <alex@buoyant.io>
The destination controller uses the metadata API to retrieve Job metadata, but does not properly initialize the metadata API to sync jobs. This results in error messages in the log and a fallback to using direct API calls instead of the informer cache.
We properly initialize the jobs informer so that this works as expected.
Signed-off-by: Alex Leong <alex@buoyant.io>
When the destination controller logs about receiving or sending messages to a data plane proxy, there is no information in the log about which data plane pod it is communicating with. This can make it difficult to diagnose issues which span the data plane and control plane.
We add a `pod` field to the context token that proxies include in requests to the destination controller. We add this pod name to the logging context so that it shows up in log messages. In order to accomplish this, we had to plumb through logging context in a few places where it previously had not been. This gives us a more complete logging context and more information in each log message.
An example log message with this fuller logging context is:
```
time="2023-10-24T00:14:09Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183762990} port:8080} weight:10000 metric_labels:{key:\"control_plane_ns\" value:\"linkerd\"} metric_labels:{key:\"deployment\" value:\"voting\"} metric_labels:{key:\"pod\" value:\"voting-7475cb974c-2crt5\"} metric_labels:{key:\"pod_template_hash\" value:\"7475cb974c\"} metric_labels:{key:\"serviceaccount\" value:\"voting\"} tls_identity:{dns_like_identity:{name:\"voting.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}} protocol_hint:{h2:{}}} metric_labels:{key:\"namespace\" value:\"emojivoto\"} metric_labels:{key:\"service\" value:\"voting-svc\"}}" addr=":8086" component=endpoint-translator context-ns=emojivoto context-pod=web-767f4484fd-wmpvf remote="10.244.0.65:52786" service="voting-svc.emojivoto.svc.cluster.local:8080"
```
Note the `context-pod` field.
Additionally, we have tested this when no pod field is included in the context token (e.g. when handling requests from a pod which does not yet add this field) and confirmed that the `context-pod` log field is empty, but no errors occur.
Signed-off-by: Alex Leong <alex@buoyant.io>
The destination controller indexes Pods by their primary PodIP and their
primary HostIP (when applicable); and it indexes Services by their
primary ClusterIP.
In preparation for dual-stack (IPv6) support, this change updates the
destination controller indexers to consume all IPs for these resources.
Tests are updated to demonstrate a dual-stack setup (where these
resources have a primary IPv4 address and a secondary IPv6 address).
While exercising these tests, I had to replace the brittle host:port
parsing logic we use in the server, in favor of Go's standard
`net.SplitHostPort` utility.
When a grpc client of the destination.Get API initiates a request but then doesn't read off of that stream, the HTTP2 stream flow control window will fill up and eventually exert backpressure on the destination controller. This manifests as calls to `Send` on the stream blocking. Since `Send` is called synchronously from the client-go informer callback (by way of the endpoint translator), this blocks the informer callback and prevents all further informer calllbacks from firing. This causes the destination controller to stop sending updates to any of its clients.
We add a queue in the endpoint translator so that when it gets an update from the informer callback, that update is queued and we avoid potentially blocking the informer callback. Each endpoint translator spawns a goroutine to process this queue and call `Send`. If there is not capacity in this queue (e.g. because a client has stopped reading and we are experiencing backpressure) then we terminate the stream.
Signed-off-by: Alex Leong <alex@buoyant.io>
Followup to https://github.com/linkerd/linkerd2/pull/11334#issuecomment-1736093592
This extends the test introduced in #11334 to excercise upgrading a
Server associated to a pod's HostPort, and observing how the stream
updates the OpaqueProtocol field.
Helper functions were refactored a bit to allow retrieving the
l5dCRDClientSet used when building the fake API.
Followup to #11328
Implements a new pod watcher, instantiated along the other ones in the Destination server. It also watches on Servers and carries all the logic from ServerWatcher, which has now been decommissioned.
The `CreateAddress()` function has been moved into a function of the PodWatcher, because now we're calling it on every update given the pod associated to an ip:port might change and we need to regenerate the Address object. That function also takes care of capturing opaque protocol info from associated Servers, which is not new and had some logic that was duped in the now defunct ServerWatcher. `getAnnotatedOpaquePorts()` got also moved for similar reasons.
Other things to note about PodWatcher:
- It publishes a new pair of metrics `ip_port_subscribers` and `ip_port_updates` leveraging the framework in `prometheus.go`.
- The complexity in `updatePod()` is due to only send stream updates when there are changes in the pod's readiness, to avoid sending duped messages on every pod lifecycle event.
-
Finally, endpointProfileTranslator's `endpoint` (*pb.WeightedAddr) not being a static object anymore, the `Update()` function now receives an Address that allows it to rebuild the endpoint on the fly (and so `createEndpoint()` was converted into a method of endpointProfileTranslator).
* Bump CNI plugin to v1.2.1
* Bump proxy-init to v2.2.2
Both dependencies include a fix for CVE-2023-2603. Since alpine is used
as the runtime image, there is a security vulnerability detected in the
produced images (due to an issue with libcap). The alpine images have
been bumped to address the CVE.
Signed-off-by: Matei David <matei@buoyant.io>
* stopgap fix for hostport staleness
## Problem
When there's a pod with a `hostPort` entry, `GetProfile` requests
targetting the host's IP and that `hostPort` return an endpoint profile
with that pod's IP and `containerPort`. If that pod vanishes and another
one in that same host with that same `hostPort` comes up, the existing
`GetProfile` streams won't get updated with the new pod information
(metadata, identity, protocol).
That breaks the connectivity of the client proxy relying on that stream.
## Partial Solution
It should be less surprising for those `GetProfile` requests to return
an endpoint profile with the same host IP and port requested, and leave
to the cluster's CNI to peform the translation to the corresponding pod
IP and `containerPort`.
This PR performs that change, but continuing returning the corresponding
pod's information alongside.
If the pod associated to that host IP and port changes, the client proxy
won't loose connectivity, but the pod's information won't get updated
(that'll be fixed in a separate PR).
A new unit test validating this has been added, which will be expanded
to validate the changed pod information when that gets implemented.
## Details of Change
- We no longer do the HostPort->ContainerPort conversion, so the
`getPortForPod` function was dropped.
- The `getPodByIp` function will now be split in two: `getPodByPodIP`
and `getPodByHostIP`, the latter being called only if the former
doesn't return anything.
- The `createAddress` function is now simplified in that it just uses
the passed IP to build the address. The passed IP will depend on which
of the two functions just mentioned returned the pod (host IP or pod
IP)
When a service has it's internal traffic policy set to "local", we will perform filtering to only return local endpoints, as-per the ForZone hints in the endpoints. However, ForZone calculations do not take resources from remote clusters into account, therefore this type of filtering is not appropriate for remote discovery services.
We explicitly ignore any internal traffic policy when doing remote discovery.
Signed-off-by: Alex Leong <alex@buoyant.io>