The destination controller's cluster store registers a gauge in its constructor. When this constructor is called multiple times (e.g. in tests), this can lead to a panic.
To avoid this panic, this change updates NewClusterStoreWithDecoder to accept a prometheus registry. The NewClusterStore constructor (used by the application's main) continues to use the default registry, but tests now construct their own temporary registries to avoid duplicate registration errors.
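A minimal sketch of the injection pattern (constructor parameters elided, names simplified): the production constructor keeps the default registerer, while tests pass a fresh registry so repeated construction never double-registers.

```go
import "github.com/prometheus/client_golang/prometheus"

type ClusterStore struct{ size prometheus.Gauge }

// production path: register against the global default registerer
func NewClusterStore() (*ClusterStore, error) {
	return NewClusterStoreWithDecoder(prometheus.DefaultRegisterer)
}

// test path: callers pass prometheus.NewRegistry() for isolation
func NewClusterStoreWithDecoder(reg prometheus.Registerer) (*ClusterStore, error) {
	size := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "cluster_store_size",
		Help: "Number of entries in the cluster store.",
	})
	if err := reg.Register(size); err != nil {
		return nil, err // duplicate registration surfaces as an error, not a panic
	}
	return &ClusterStore{size: size}, nil
}
```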
* fix(destination): GetProfile requests targeting pods directly should return endpoint data for running (not necessarily ready) pods
Requiring Pods to pass readiness checks before allowing Pod-to-Pod communication disrupts communication in e.g. clustered systems whose Pods must communicate with each other prior to establishing a ready state and accepting inbound traffic.
Relaxed the requirement and modified the workload watcher to only require that a Pod exists and is in the Running phase.
Reproduced the issue with a test setup described in #13247.
Fixes #13247.
---------
Signed-off-by: Tuomo <tjorri@gmail.com>
Co-authored-by: Alejandro Pedraza <alejandro@buoyant.io>
We received a report of a panic:
```text
runtime error: invalid memory address or nil pointer dereference
panic({0x1edb860?, 0x37a6050?}
    /usr/local/go/src/runtime/panic.go:785 +0x132
github.com/linkerd/linkerd2/controller/api/destination/watcher.latestUpdated({0xc0006b2d80?, 0xc00051a540?, 0xc0008fa008?})
    /linkerd-build/vendor/github.com/linkerd/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:1612 +0x125
github.com/linkerd/linkerd2/controller/api/destination/watcher.(*OpaquePortsWatcher).updateService(0xc0007d5480, {0x21fd160?, 0xc000d71688?}, {0x21fd160, 0xc000d71688})
    /linkerd-build/vendor/github.com/linkerd/linkerd2/controller/api/destination/watcher/opaque_ports_watcher.go:141 +0x68
```
The `latestUpdated` function does not properly handle the case where a time is
omitted from a `ManagedFieldsEntry`.
```go
type ManagedFieldsEntry struct {
	// Time is the timestamp of when the ManagedFields entry was added. The
	// timestamp will also be updated if a field is added, the manager
	// changes any of the owned fields value or removes a field. The
	// timestamp does not update when a field is removed from the entry
	// because another manager took it over.
	// +optional
	Time *Time `json:"time,omitempty" protobuf:"bytes,4,opt,name=time"`
}
```
This change adds a check to avoid the nil dereference.
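A minimal sketch of the guard, assuming `latestUpdated` scans managed fields for the newest update operation:

```go
import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func latestUpdated(managed []metav1.ManagedFieldsEntry) time.Time {
	var latest time.Time
	for _, field := range managed {
		if field.Operation == metav1.ManagedFieldsOperationUpdate {
			// Time is +optional and may be nil; skip such entries
			// instead of dereferencing a nil pointer
			if field.Time == nil {
				continue
			}
			if latest.IsZero() || field.Time.After(latest) {
				latest = field.Time.Time
			}
		}
	}
	return latest
}
```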
Adds tests for the federated service watcher that exercise having remote and local clusters join and leave a federated service and ensuring that the correct proxy API updates are emitted.
Signed-off-by: Alex Leong <alex@buoyant.io>
We add support for federated services to the destination controller by adding a new FederatedServiceWatcher. When the destination controller receives a `Get` request for a Service with the `multicluster.linkerd.io/remote-discovery` and/or the `multicluster.linkerd.io/local-discovery` annotations, it subscribes to the FederatedServiceWatcher instead of subscribing to the EndpointsWatcher directly. The FederatedServiceWatcher watches the federated service for any changes to these annotations, and maintains the appropriate watches on the local EndpointsWatcher and/or remote EndpointsWatchers fetched through the ClusterStore.
This means that we will often have multiple EndpointTranslators writing to the same `Get` response stream. In order for a `NoEndpoints` message sent to one EndpointTranslator to not clobber the whole stream, we make a change where `NoEndpoints` messages are no longer sent to the response stream, but are replaced by a `Remove` message containing all of the addresses from that EndpointTranslator. This allows multiple EndpointTranslators to coexist on the same stream.
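A hedged sketch of that translation, using the proxy API types (the helper name is ours):

```go
import (
	pb "github.com/linkerd/linkerd2-proxy-api/go/destination"
	proxynet "github.com/linkerd/linkerd2-proxy-api/go/net"
)

// removeUpdate builds a Remove covering only this translator's addresses,
// so endpoints contributed by sibling translators on the same stream
// are left untouched.
func removeUpdate(addrs []*proxynet.TcpAddress) *pb.Update {
	return &pb.Update{
		Update: &pb.Update_Remove{
			Remove: &pb.AddrSet{Addrs: addrs},
		},
	}
}
```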
Signed-off-by: Alex Leong <alex@buoyant.io>
Followup to #12844
This new field defines the default policy for Servers, i.e. if a request doesn't match the policy associated with a Server then this policy applies. The values are the same as for `proxy.defaultInboundPolicy` and the `config.linkerd.io/default-inbound-policy` annotation (all-unauthenticated, all-authenticated, cluster-authenticated, cluster-unauthenticated, deny), plus a new value "audit". The default is "deny", thus remaining backwards-compatible.
This field is also exposed as an additional printer column.
Fixes #12686
When an endpoint in an EndpointSlice resource does not contain a hostname field, the destination controller can panic while looking for an endpoint with a certain hostname. This happens when doing a lookup with a pod dns name.
We add a nil check to avoid the panic.
We add such an endpoint to our test fixture to exercise this case.
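A minimal sketch of the guarded lookup, assuming we scan `discovery/v1` EndpointSlice endpoints (the helper name is ours):

```go
import discoveryv1 "k8s.io/api/discovery/v1"

func findByHostname(endpoints []discoveryv1.Endpoint, hostname string) *discoveryv1.Endpoint {
	for i := range endpoints {
		// Hostname is optional; the nil check avoids the panic when an
		// endpoint omits it
		if endpoints[i].Hostname != nil && *endpoints[i].Hostname == hostname {
			return &endpoints[i]
		}
	}
	return nil
}
```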
Signed-off-by: Alex Leong <alex@buoyant.io>
This is a second take on #12427, which avoided a theoretical/correctness
issue around overwriting new ES addresses with stale data.
We had to revert that in #12589 because the change introduced a bug: by
returning early when the ES had no addresses, it failed to properly
initialize `addresses` for the portPublisher.
This just removes the early return.
This reverts commit 4fccf3e9ec.
The early return was causing `pp.addresses = newAddressSet` to not be run when the list of addresses is empty; but setting that is still necessary so that labels are tracked correctly.
This was caught by the tap (viz) integration test run in the release workflow.
When updating the portPublisher's address set after a new EndpointSlice creation event is received, its addresses were getting overwritten with stale data whenever their IDs already existed in the current pp's address set.
There can be pathological cases in single-stack setups where this is a problem. For example, when an ES gets recycled but the deletion event is not caught for some reason, the Address data from the subsequent addition event will be overwritten by the old stale entry.
## Other Changes
- Remove overriding `newAddressSet.LocalTrafficPolicy`, as that is already taken care of inside `pp.endpointSliceToAddresses(slice)`.
- When there are no Add events to send, return early without updating state nor updating metrics.
Services in dual-stack mode result in the creation of two EndpointSlices, one for each IP family. Before this change, the Get Destination API would nondeterministically return the address for any of those ES, depending on which one was processed last by the controller because they would overwrite each other.
As part of the ongoing effort to support IPv6/dual-stack networks, this change fixes that behavior giving preference to IPv6 addresses whenever a service exposes both families.
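A sketch of the family preference, assuming the service's addresses are available as strings (the helper name is ours):

```go
import "net/netip"

// preferredAddress returns the first genuine IPv6 address when both
// families are present, falling back to the first address otherwise.
func preferredAddress(ips []string) string {
	for _, raw := range ips {
		if addr, err := netip.ParseAddr(raw); err == nil && addr.Is6() && !addr.Is4In6() {
			return raw
		}
	}
	if len(ips) > 0 {
		return ips[0]
	}
	return ""
}
```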
There is a new set of unit tests in server_ipv6_test.go, and in the TestEndpointTranslatorForPods tests there are a couple of new cases to test the interaction with zone filtering.
Also the server unit tests were updated to segregate the tests and resources dealing with the IPv4/IPv6/dual-stack cases.
The proxy-injector package has a `ResourceConfig` type that is
responsible for parsing resources, applying overrides, and serialising a
series of configuration values to a Kubernetes patch. The functionality
is very concrete in its assumptions: it always relies on a pod spec and
it mutates inner state when deciding which overrides to apply.
This is not a flexible way to handle injection and configuration
overriding for other types of resources. We change this by turning
methods previously defined on `ResourceConfig` into free-standing
functions. These functions can be applied for any type of resources in
order to compute a set of configuration values based on annotation
overrides. Through the change, the functions can be used to compute
static configuration for non-Pod types or can be used in tests.
Signed-off-by: Matei David <matei@buoyant.io>
When the destination controller receives an update for a Server resource, we recompute opaqueness for all pods. This results in a large number of updates to all endpoint profile watches, even if the opaqueness doesn't change. In cases where there are many Server resources, this can result in a large number of updates being sent to the endpoint profile translator and overflowing the endpoint profile translator update queue. This is especially likely to happen during an informer resync, since this will result in an informer callback for every Server in the cluster.
We refactor the workload watcher to not send these updates if the opaqueness has not changed.
This seemingly simple change in behavior requires a large code change because:
* the current opaqueness state is not stored on workload publishers and must be added so that we can determine if the opaqueness has changed
* storing the opaqueness in addition to the other state we're storing (pod, ip, port, etc.) means that we are now storing all of the data represented by the Address struct
* workload watcher uses a `createAddress` func to dynamically create an Address from the state it stores
* now that we are storing the Address as state, creating Addresses dynamically is no longer necessary and we can operate on the Address state directly
* this makes the workload watcher more similar to other watchers and follow a common pattern
* it also fixes some minor correctness issues:
* pods that did not have the ready status condition were being considered when they should not have been
* updates to ExternalWorkload labels were not being considered
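A simplified sketch of the gating described above (the real watcher stores a full Address; types here are illustrative):

```go
type workloadState struct {
	opaqueProtocol bool
	// ... pod, ip, port, etc.
}

// setOpaqueness recomputes opaqueness on a Server event but only fans
// out to listeners when the value actually changed.
func (w *workloadState) setOpaqueness(opaque bool, notify func()) {
	if w.opaqueProtocol == opaque {
		return // unchanged: don't flood the profile translators
	}
	w.opaqueProtocol = opaque
	notify()
}
```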
Signed-off-by: Alex Leong <alex@buoyant.io>
The ExternalWorkload resource we introduced has a minor naming
inconsistency; `Tls` in `meshTls` is not capitalised. Other resources
that we have (e.g. authentication resources) capitalise TLS (and so does
Go, it follows a similar naming convention).
We fix this in the workload resource by changing the field's name and
bumping the version to `v1beta1`.
Upgrading the control plane version will continue to work without
downtime. However, if an existing resource exists, the policy controller
will not completely initialise. It will not enter a crashloop backoff,
but it will also not become ready until the resource is edited or
deleted.
Signed-off-by: Matei David <matei@buoyant.io>
Fixes #12032
The Destination controller server tests test the destination server with `enableEndpointSlices=false`. The default for this value is true, meaning that these tests do not test the default configuration.
We update the tests to test with `enableEndpointSlices=true` and update the corresponding mock kubernetes Endpoints resources to be EndpointSlices instead. We also fix an instance where the workload watcher was using Endpoints even when in EndpointSlices mode.
Signed-off-by: Alex Leong <alex@buoyant.io>
A Server may only select workloads in its own namespace. Therefore, when the
destination controller receives an update for a Server, it only needs to
potentially send updates to watches on workloads in that same namespace. Taking
this into account allows us to avoid all opaqueness computations for workloads in
other namespaces.
Signed-off-by: Alex Leong <alex@buoyant.io>
Fixes #12010
## Problem
We're observing crashes in the destination controller in some scenarios, due to a data race as described in #12010.
## Cause
The problem is the same instance of the `AddressSet.Addresses` map is getting mutated in the endpoints watcher Server [informer handler](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/watcher/endpoints_watcher.go#L1309), and iterated over in the endpoint translator [queue loop](https://github.com/linkerd/linkerd2/blob/edge-24.1.3/controller/api/destination/endpoint_translator.go#L197-L211), which run in different goroutines and the map is not guarded. I believe this doesn't result in Destination returning stale data; it's more of a correctness issue.
## Solution
Make a shallow copy of `pp.addresses` in the endpoints watcher and only pass that to the listeners. It's a shallow copy because we avoid making copies of the pod reference in there, knowing it won't get mutated.
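A sketch of the copy, assuming the watcher package's `AddressSet` shape:

```go
import "github.com/linkerd/linkerd2/controller/api/destination/watcher"

func shallowCopy(set watcher.AddressSet) watcher.AddressSet {
	out := set
	out.Addresses = make(map[watcher.ID]watcher.Address, len(set.Addresses))
	for id, addr := range set.Addresses {
		// shallow on purpose: the pod reference inside Address is not
		// mutated elsewhere, so sharing it avoids a deep copy
		out.Addresses[id] = addr
	}
	return out
}
```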
## Repro
Install linkerd core and injected emojivoto, and patch the endpoint translator to include a sleep call that will help surface the race (don't install the patch in the cluster; we'll only use it locally below):
<details>
<summary>endpoint_translator.go diff</summary>
```diff
diff --git a/controller/api/destination/endpoint_translator.go b/controller/api/destination/endpoint_translator.go
index d1018d5f9..7d5abd638 100644
--- a/controller/api/destination/endpoint_translator.go
+++ b/controller/api/destination/endpoint_translator.go
@@ -5,6 +5,7 @@ import (
"reflect"
"strconv"
"strings"
+ "time"
pb "github.com/linkerd/linkerd2-proxy-api/go/destination"
"github.com/linkerd/linkerd2-proxy-api/go/net"
@@ -195,7 +196,9 @@ func (et *endpointTranslator) processUpdate(update interface{}) {
}
func (et *endpointTranslator) add(set watcher.AddressSet) {
for id, address := range set.Addresses {
+ time.Sleep(1 * time.Second)
et.availableEndpoints.Addresses[id] = address
}
```
</details>
Then create these two Server manifests:
<details>
<summary>emoji-web-server.yml</summary>
```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
namespace: emojivoto
name: web-http
labels:
app.kubernetes.io/part-of: emojivoto
app.kubernetes.io/name: web
app.kubernetes.io/version: v11
spec:
podSelector:
matchLabels:
app: web-svc
port: http
proxyProtocol: HTTP/1
```
</details>
<details>
<summary>emoji-web-server-opaque.yml</summary>
```yaml
apiVersion: policy.linkerd.io/v1beta2
kind: Server
metadata:
namespace: emojivoto
name: web-http
labels:
app.kubernetes.io/part-of: emojivoto
app.kubernetes.io/name: web
app.kubernetes.io/version: v11
spec:
podSelector:
matchLabels:
app: web-svc
port: http
proxyProtocol: opaque
```
</details>
In separate consoles run the patched destination service and a destination client:
```bash
HOSTNAME=foobar go run -race ./controller/cmd/main.go destination -enable-h2-upgrade=true -enable-endpoint-slices=true -cluster-domain=cluster.local -identity-trust-domain=cluster.local -default-opaque-ports=25,587,3306,4444,5432,6379,9300,11211
```
```bash
go run ./controller/script/destination-client -path web-svc.emojivoto.svc.cluster.local:80
```
And run this to continuously switch the `proxyProtocol` field:
```bash
while true; do kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server.yml; kubectl apply -f ~/src/k8s/sample_yamls/emoji-web-server-opaque.yml ; done
```
You'll see the following data race report in the Destination controller logs:
<details>
<summary>destination logs</summary>
```console
==================
WARNING: DATA RACE
Write at 0x00c0006d30e0 by goroutine 178:
github.com/linkerd/linkerd2/controller/api/destination/watcher.(*portPublisher).updateServer()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:1310 +0x772
github.com/linkerd/linkerd2/controller/api/destination/watcher.(*servicePublisher).updateServer()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:711 +0x150
github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).addServer()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:514 +0x173
github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:528 +0x26f
github.com/linkerd/linkerd2/controller/api/destination/watcher.(*EndpointsWatcher).updateServer-fm()
<autogenerated>:1 +0x64
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnUpdate()
/home/alpeb/go/pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/controller.go:246 +0x81
k8s.io/client-go/tools/cache.(*ResourceEventHandlerFuncs).OnUpdate()
<autogenerated>:1 +0x1f
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
/home/alpeb/go/pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/shared_informer.go:970 +0x1f4
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1()
/home/alpeb/go/pkg/mod/k8s.io/apimachinery@v0.29.1/pkg/util/wait/backoff.go:226 +0x41
k8s.io/apimachinery/pkg/util/wait.BackoffUntil()
/home/alpeb/go/pkg/mod/k8s.io/apimachinery@v0.29.1/pkg/util/wait/backoff.go:227 +0xbe
k8s.io/apimachinery/pkg/util/wait.JitterUntil()
/home/alpeb/go/pkg/mod/k8s.io/apimachinery@v0.29.1/pkg/util/wait/backoff.go:204 +0x10a
k8s.io/apimachinery/pkg/util/wait.Until()
/home/alpeb/go/pkg/mod/k8s.io/apimachinery@v0.29.1/pkg/util/wait/backoff.go:161 +0x9b
k8s.io/client-go/tools/cache.(*processorListener).run()
/home/alpeb/go/pkg/mod/k8s.io/client-go@v0.29.1/tools/cache/shared_informer.go:966 +0x38
k8s.io/client-go/tools/cache.(*processorListener).run-fm()
<autogenerated>:1 +0x33
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
/home/alpeb/go/pkg/mod/k8s.io/apimachinery@v0.29.1/pkg/util/wait/wait.go:72 +0x86
Previous read at 0x00c0006d30e0 by goroutine 360:
github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).add()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:200 +0x1ab
github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).processUpdate()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:190 +0x166
github.com/linkerd/linkerd2/controller/api/destination.(*endpointTranslator).Start.func1()
/home/alpeb/pr/destination-race/linkerd2/controller/api/destination/endpoint_translator.go:174 +0x45
```
</details>
## Extras
This also removes the unused method `func (as *AddressSet) WithPort(port Port) AddressSet` in endpoints_watcher.go
Fixes #11995
If a Server is marking a Pod's port as opaque and then the Server's podSelector is updated to no longer select that Pod, then the Pod's port should no longer be marked as opaque. However, this update does not result in any updates from the destination API's Get stream and the port remains marked as opaque.
We fix this by updating the endpoint watcher's handling of Server updates to consider both the old and the new Server.
Signed-off-by: Alex Leong <alex@buoyant.io>
When the destination controller's workload watcher receives an update for any Server resource, it recomputes opaqueness for every workload. This is because the Server update may have changed opaqueness for that workload. However, this is very CPU intensive for the destination controller, especially during resyncs when we get Server updates for every Server resource in the cluster.
Instead, we only need to recompute opaqueness for workloads that are selected by the old version of the Server or by the new version of the Server. If a workload is not selected by either the new or old version of the Server, then the Server update cannot have changed the workload's opaqueness.
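A sketch of the relevance check (the helper name is ours), using the standard selector conversion:

```go
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// serverUpdateRelevant reports whether either the old or new podSelector
// matches the workload; if neither does, its opaqueness cannot have changed.
func serverUpdateRelevant(oldSel, newSel *metav1.LabelSelector, workload labels.Set) (bool, error) {
	for _, ls := range []*metav1.LabelSelector{oldSel, newSel} {
		if ls == nil {
			continue
		}
		sel, err := metav1.LabelSelectorAsSelector(ls)
		if err != nil {
			return false, err
		}
		if sel.Matches(workload) {
			return true, nil
		}
	}
	return false, nil
}
```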
Signed-off-by: Alex Leong <alex@buoyant.io>
If the readiness of an external workload endpoint changes while traffic
is being sent to it, the update will not be propagated to clients. This
can lead to issues where an endpoint that is marked as `notReady`
continues to be reported as `ready` by the endpoints watcher.
The issue stems from how endpoint slices are diffed. A utility function
responsible for processing addresses does not consider endpoints whose
targetRef is an external workload. We fix the problem and add two module
tests to validate readiness is propagated to clients correctly.
---------
Signed-off-by: Matei David <matei@buoyant.io>
We introduced an endpoints controller that will be responsible for
managing EndpointSlices for services that select external workloads. We
introduce as a follow-up the reconciler component of the controller that
will be responsible for doing the writes and diffing.
Additionally, the controller is wired-up in the destination service's
main routine and will start if endpoint slice support is enabled.
---------
Signed-off-by: Matei David <matei@buoyant.io>
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
Co-authored-by: Zahari Dichev <zaharidichev@gmail.com>
When a meshed client attempts to establish a connection directly to the workload IP of an ExternalWorkload, the destination controller should return an endpoint profile for that ExternalWorkload with a single endpoint and the metadata associated with that ExternalWorkload including:
* mesh TLS identity
* workload metric labels
* opaque / protocol hints
Signed-off-by: Alex Leong <alex@buoyant.io>
Whenever the destination controller's informer receives an update of a Server resource, it checks every portPublisher in the endpointsWatcher to see if the Server selects any pods in that servicePort and updates those pods' opaque protocol field. Regardless of whether any pods were matched or whether the opaque protocol changed, an update is sent to each listener. This results in an update to every endpointTranslator each time a Server is updated. During a resync, we get an update for every Server in the cluster, which results in N updates to each endpointTranslator where N is the number of Servers in the cluster.
If N is greater than 100, it becomes possible that these N updates could overflow the endpointTranslator update queue if the queue is not being drained fast enough.
We change this to only send the update for a Server if at least one of the servicePort addresses was selected by that Server AND its opaque protocol field changed.
Signed-off-by: Alex Leong <alex@buoyant.io>
In 71635cb and 357a1d3 we updated the endpoint and profile translators
to prevent backpressure from stalling the informer tasks. This change
updates the endpoint profile translator with the same fix, so that
updates are buffered and we can detect when a gRPC stream is stalled.
Furthermore, the update method is updated to use a protobuf-aware
comparison method instead of using serialization and string comparison.
A test is added for the endpoint profile translator, since none existed
previously.
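A simplified sketch of the buffering pattern (field names assumed):

```go
import (
	"context"

	pb "github.com/linkerd/linkerd2-proxy-api/go/destination"
)

type endpointProfileTranslator struct {
	updates chan *pb.DestinationProfile // bounded buffer
	cancel  context.CancelFunc          // tears down the stalled stream
}

func (t *endpointProfileTranslator) Update(profile *pb.DestinationProfile) {
	select {
	case t.updates <- profile:
		// queued; a dedicated goroutine drains this onto the gRPC stream
	default:
		// a full buffer means the client has stopped reading: cancel the
		// stream rather than block the shared informer goroutine
		t.cancel()
	}
}
```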
This change adds a runtime flag to the destination controller,
--experimental-endpoint-zone-weights=true, that causes endpoints in the
local zone to receive higher weights. This feature is disabled by
default, since the weight value is not honored by proxies. No helm
configuration is exposed yet, either.
This weighting is instrumented in the endpoint translator. Tests are
added to confirm that the behavior is feature-gated.
Additionally, this PR adds the "zone" metric label to endpoint metadata
responses.
In order to detect if the destination controller's k8s informers have fallen behind, we add a histogram for each resource type. These histograms track the delta between when an update to a resource occurs and when the destination controller processes that update. We do this by looking at the timestamps on the managed fields of the resource and looking for the most recent update and comparing that to the current time.
The histogram metrics are of the form `{kind}_informer_lag_ms_bucket_*`.
* We record a value only for updates, not for adds or deletes. This is because when the controller starts up, it will populate its cache with an add for each resource in the cluster and the delta between the last updated time of that resource and the current time may be large. This does not represent informer lag and should not be counted as such.
* When the informer performs resyncs, we get updates where the updated time of the old version is equal to the updated time of the new version. This does not represent an actual update of the resource itself and so we do not record a value.
* Since we are comparing timestamps set on the managed fields of resources to the current time from the destination controller's system clock, the accuracy of these metrics depends on clock drift being minimal across the cluster.
* We use histogram buckets which range from 500ms to about 17 minutes. In my testing, an informer lag of 500ms-1000ms is typical. However, we wish to have enough buckets to identify cases where the informer is lagged significantly behind.
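A sketch of the recording logic, reusing the `latestUpdated` helper shown earlier (the bucket layout below matches the 500ms-to-~17min range described above; names assumed):

```go
import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// one histogram per kind; buckets double from 500ms up to ~17 minutes
var serviceInformerLag = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "service_informer_lag_ms",
	Help:    "Delta between a Service's last update and its processing time.",
	Buckets: prometheus.ExponentialBuckets(500, 2, 12),
})

func recordLag(hist prometheus.Histogram, oldFields, newFields []metav1.ManagedFieldsEntry) {
	oldTime, newTime := latestUpdated(oldFields), latestUpdated(newFields)
	// resyncs carry identical timestamps; they are not real updates,
	// and adds are skipped entirely by the caller
	if newTime.IsZero() || !newTime.After(oldTime) {
		return
	}
	hist.Observe(float64(time.Since(newTime).Milliseconds()))
}
```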
Signed-off-by: Alex Leong <alex@buoyant.io>
Fixes #11065
When an inbound proxy receives a request with a canonical name of the form `hostname.service.namespace.svc.cluster.domain`, we assume that `hostname` is the hostname of the pod as described in https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-hostname-and-subdomain-fields. However, pods are also addressable with `pod-ip.service.namespace.svc.cluster.domain`. When the destination controller gets a profile request of this form, we attempt to find a pod with hostname of `pod-ip` and return an error with gRPC status `Unknown` since this will not exist.
It is expected that this profile lookup will fail since we cannot have service profiles for individual pods. However, returning a gRPC status `Unknown` for these requests brings the reported success rate of the destination controller down. Instead we should return these as gRPC status `NotFound` so that these responses don't get reported as server errors.
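A minimal sketch of the status change (the message text is illustrative):

```go
import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// a failed pod-hostname lookup is an expected outcome, not a server error
func hostnameNotFound(hostname, service string) error {
	return status.Errorf(codes.NotFound, "no pod found with hostname %q for service %q", hostname, service)
}
```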
Signed-off-by: Alex Leong <alex@buoyant.io>
The destination controller indexes Pods by their primary PodIP and their
primary HostIP (when applicable); and it indexes Services by their
primary ClusterIP.
In preparation for dual-stack (IPv6) support, this change updates the
destination controller indexers to consume all IPs for these resources.
Tests are updated to demonstrate a dual-stack setup (where these
resources have a primary IPv4 address and a secondary IPv6 address).
While exercising these tests, I had to replace the brittle host:port
parsing logic we use in the server, in favor of Go's standard
`net.SplitHostPort` utility.
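For illustration, the standard-library approach looks roughly like this (the helper name is ours):

```go
import (
	"net"
	"strconv"
)

func parseHostPort(hostPort string) (string, int, error) {
	// net.SplitHostPort handles bracketed IPv6 literals such as
	// "[2001:db8::1]:8080", unlike naive string splitting
	host, portStr, err := net.SplitHostPort(hostPort)
	if err != nil {
		return "", 0, err
	}
	port, err := strconv.Atoi(portStr)
	if err != nil {
		return "", 0, err
	}
	return host, port, nil
}
```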
Followup to #11328
Implements a new pod watcher, instantiated alongside the other ones in the Destination server. It also watches Servers and carries all the logic from ServerWatcher, which has now been decommissioned.
The `CreateAddress()` function has been moved into a function of the PodWatcher, because now we're calling it on every update, given that the pod associated with an ip:port might change and we need to regenerate the Address object. That function also takes care of capturing opaque protocol info from associated Servers, which is not new and had some logic that was duped in the now defunct ServerWatcher. `getAnnotatedOpaquePorts()` also got moved for similar reasons.
Other things to note about PodWatcher:
- It publishes a new pair of metrics `ip_port_subscribers` and `ip_port_updates` leveraging the framework in `prometheus.go`.
- The complexity in `updatePod()` is due to only sending stream updates when there are changes in the pod's readiness, to avoid sending duped messages on every pod lifecycle event.
Finally, since endpointProfileTranslator's `endpoint` (*pb.WeightedAddr) is no longer a static object, the `Update()` function now receives an Address that allows it to rebuild the endpoint on the fly (and so `createEndpoint()` was converted into a method of endpointProfileTranslator).
Similar to #11246, the destination controller was complaining about
trying to register dupe metrics for the api cache gauges, when a given
target cluster got re-linked. This change unregisters the gauges for the
target cluster when said cluster is removed.
Supersedes #11252
We add a `cluster_store_size` gauge to the destination controller to track the number of entries in the remote discovery cluster store. If this is ever different from the number of cluster credentials secrets in the linkerd namespace, this indicates that there is a problem with a link that needs to be investigated further.
Signed-off-by: Alex Leong <alex@buoyant.io>
Adds support for remote discovery to the destination controller.
When the destination controller gets a `Get` request for a Service with the `multicluster.linkerd.io/remote-discovery` label, this is an indication that the destination controller should discover the endpoints for this service from a remote cluster. The destination controller will look for a remote cluster which has been linked to it (using the `linkerd multicluster link` command) with that name. It will look at the `multicluster.linkerd.io/remote-discovery` label for the service name to look up in that cluster. It then streams back the endpoint data for that remote service.
Since we now have multiple client-go informers for the same resource types (one for the local cluster and one for each linked remote cluster) we add a `cluster` label onto the prometheus metrics for the informers and EndpointWatchers to ensure that each of these components' metrics are correctly tracked and don't overwrite each other.
---------
Signed-off-by: Alex Leong <alex@buoyant.io>
There were a few improvements we could have made to a recent change that
added a ClusterStore concept to the destination service. This PR
introduces the small improvements:
* Sync dynamically created clients in separate goroutines
* Refactor metadata API creation
* Remove redundant check in cluster_store_test
* Create a new runtime schema every time a fake metadata API client is
created to avoid racey behaviour.
Signed-off-by: Matei David <matei@buoyant.io>
Fixes #10764
Removed the `server_port_subscribers` gauge, as it wasn't distinguishing
amongst different pods, and the numbers of subscribers for different pods were
conflicting with one another when updating the metric (see more details
[here](https://github.com/linkerd/linkerd2/issues/10764#issuecomment-1650835823)).
Besides carrying an invalid value, this was generating the warning
`unable to delete server_port_subscribers metric with labels`
The metric was replaced with the `server_port_subscribes` and
`server_port_unsubscribes` counters, which track the overall number of
subscribes and unsubscribes to the particular pod port.
🌮 to @adleong for the diagnosis and the fix!
Currently, the control plane does not support indexing and discovering resources across cluster boundaries. In a multicluster set-up, it is desirable to have access to endpoint data (and by extension, any configuration associated with that endpoint) to support pod-to-pod communication. Linkerd's destination service should be extended to support reading discovery information from multiple sources (i.e. API Servers).
This change introduces an `EndpointsWatcher` cache. On creation, the cache registers a pair of event handlers:
* One handler to read `multicluster` secrets that embed kubeconfig files. For each such secret, the cache creates a corresponding `EndpointsWatcher` to read remote discovery information.
* Another handler to evict entries from the cache when a `multicluster` secret is deleted.
Additionally, a set of tests have been added that assert the behaviour of the cache when secrets are created and/or deleted. The cache will be used by the destination service to do look-ups for services that belong to another cluster, and that are running in a "remote discovery" mode.
---------
Signed-off-by: Matei David <matei@buoyant.io>
Fixes #11163
The `servicePublisher.updateServer` function will iterate through all registered listeners and update them. However, a nil listener may temporarily be in the list of listeners if an unsubscribe is in progress. This results in a nil pointer dereference.
All functions which result in updating the listeners must therefore be protected by the mutex so that we don't try to act on the list of listeners while it is being modified.
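A simplified sketch of the locking discipline (types and names assumed):

```go
import "sync"

type listener interface{ Update() }

type servicePublisher struct {
	mu        sync.Mutex // guards listeners
	listeners []listener
}

func (sp *servicePublisher) unsubscribe(target listener) {
	sp.mu.Lock()
	defer sp.mu.Unlock()
	for i, l := range sp.listeners {
		if l == target {
			sp.listeners = append(sp.listeners[:i], sp.listeners[i+1:]...)
			return
		}
	}
}

func (sp *servicePublisher) updateServer() {
	sp.mu.Lock()
	defer sp.mu.Unlock()
	// safe: unsubscribe cannot splice the slice mid-iteration
	for _, l := range sp.listeners {
		l.Update()
	}
}
```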
Signed-off-by: Alex Leong <alex@buoyant.io>
Add support for enabling and disabling topology aware routing when hints are added/removed.
The testing setup is very involved because it has so many moving parts:
1) Set up a service which is layered over several availability zones.
1a) The best way to do this is one Service object, with 3 replicasets explicitly forced to use a specific AZ each.
2) Add the `service.kubernetes.io/topology-aware-hints: Auto` annotation to the Service object.
3) Use a load tester like k6 to send meaningful traffic to your service, but only in one AZ.
4) Scale up your replicasets until k8s adds Hints to your endpointslices.
5) Observe that traffic shifts to only hit pods in one AZ.
6) Scale down the replicaset counts until K8s removes the hints from your endpointslices.
7) Observe that traffic shifts back to all pods across all AZs.
Fixes #10764
`GetProfile` streams create a `server_port_subscribers` gauge that tracks the number of listeners interested in a given Server. Because of an oversight, the gauge was only being registered once a second listener was added; for just one listener the gauge was absent. But whenever the `GetProfile` stream ended, the gauge was deleted, which resulted in this error if it wasn't registered to begin with:
```
level=warning msg="unable to delete server_port_subscribers metric with labels map[name:voting namespace:emojivoto port:4191]" addr=":8086" component=server
```
One can check that the gauge wasn't being created by installing viz and emojivoto, and checking the following returns empty:
```bash
$ linkerd diagnostics controller-metrics | grep server_port_subscribers
```
After this fix, one can see the metric getting populated:
```bash
$ linkerd diagnostics controller-metrics | grep server_port_subscribers
# HELP server_port_subscribers Number of subscribers to Server changes associated with a pod's port.
# TYPE server_port_subscribers gauge
server_port_subscribers{name="emoji",namespace="emojivoto",port="4191"} 1
server_port_subscribers{name="linkerd",namespace="linkerd",port="4191"} 1
server_port_subscribers{name="linkerd",namespace="linkerd",port="9990"} 1
server_port_subscribers{name="linkerd",namespace="linkerd",port="9995"} 1
server_port_subscribers{name="linkerd",namespace="linkerd",port="9996"} 1
server_port_subscribers{name="linkerd",namespace="linkerd",port="9997"} 1
server_port_subscribers{name="metrics",namespace="linkerd-viz",port="4191"} 1
server_port_subscribers{name="metrics",namespace="linkerd-viz",port="9995"} 1
server_port_subscribers{name="tap",namespace="linkerd-viz",port="4191"} 1
server_port_subscribers{name="tap",namespace="linkerd-viz",port="9995"} 1
server_port_subscribers{name="tap",namespace="linkerd-viz",port="9998"} 1
server_port_subscribers{name="vote",namespace="emojivoto",port="4191"} 1
server_port_subscribers{name="voting",namespace="emojivoto",port="4191"} 1
server_port_subscribers{name="web",namespace="emojivoto",port="4191"} 1
server_port_subscribers{name="web",namespace="linkerd-viz",port="4191"} 1
server_port_subscribers{name="web",namespace="linkerd-viz",port="9994"} 1
```
And when scaling down the voting deployment, one can see how the metric with `name="voting"` is removed.
The GetProfile API endpoint does not behave as expected: when a profile
watch is established, the API server starts two separate profile
watches--a primary watch with the client's namespace and a fallback watch
ignoring the client's namespace. These watches race to send data back to
the client. If the backup watch updates first, its result may be sent to
clients before being corrected by a subsequent update. If the primary
watch updates with an empty value, the default profile may be served
before being corrected by an update to the backup watch.
From the proxy's perspective, we'd much prefer that the API provide a
single authoritative response when possible. This avoids distributing
needless corrective work across the system on every watch initiation.
To fix this, we modify the fallbackProfileListener to behave
predictably: it only emits updates once both its primary and fallback
listeners have been updated. This avoids emitting updates based on a
partial understanding of the cluster state.
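A simplified sketch of that gating (shapes assumed; `sp` is the ServiceProfile API package):

```go
import sp "github.com/linkerd/linkerd2/controller/gen/apis/serviceprofile/v1alpha2"

type fallbackProfileListener struct {
	primary, backup         *sp.ServiceProfile
	primarySeen, backupSeen bool
	emit                    func(*sp.ServiceProfile)
}

func (l *fallbackProfileListener) updatePrimary(p *sp.ServiceProfile) {
	l.primary, l.primarySeen = p, true
	l.maybeEmit()
}

func (l *fallbackProfileListener) updateBackup(p *sp.ServiceProfile) {
	l.backup, l.backupSeen = p, true
	l.maybeEmit()
}

func (l *fallbackProfileListener) maybeEmit() {
	// stay quiet until both watches have reported at least once, so we
	// never publish from a partial view of cluster state
	if !l.primarySeen || !l.backupSeen {
		return
	}
	if l.primary != nil {
		l.emit(l.primary)
		return
	}
	l.emit(l.backup) // may be nil; the default-profile logic fills that in
}
```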
Furthermore, the opaquePortsAdaptor is updated to avoid synthesizing a
default serviceprofile (surprising behavior) and, instead, this
defaulting logic is moved into a dedicated defaultProfileListener
helper. A dedupProfileListener is added to squelch obviously redundant
updates.
Finally, this newfound predictability allows us to simplify the API's
tests. Many of the API tests are not clear in what they test and
sometimes make assertions about the "incorrect" profile updates.
Before changing any GetProfile behavior, this change splits the API
handler into some smaller scopes. This helps to clarify control flow and
reduce nested contexts. This change also adds relevant fields to log
contexts to improve diagnostics.
When processing a `delete` event for an EndpointSlice, regardless of the outcome (whether there are still addresses/endpoints alive or whether we have no endpoints left) we make a call to `noEndpoints` on the subscriber. This will send a `Destination.Get` update to the listeners to advertise a `NoEndpoints` event.
If there are still addresses left, `NoEndpoints { exists: true }` will be sent (to signal the service still exists). Until an update is registered (i.e. a new add) there will be no available endpoints -- this is incorrect, since other EndpointSlices may exist. This change fixes the problem by handling `noEndpoints` in a more specialized way for EndpointSlices.
Signed-off-by: Yannick Utard <yannickutard@gmail.com>
Fixes #9986
After reviewing the k8s API calls in Destination, it was concluded we
could only swap out the calls to the Node and RS resources to use the
metadata API, as all the other resources (Endpoints, EndpointSlices,
Services, Pod, ServiceProfiles, Server) required fields other than those
found in their metadata section.
This also required completing the `NewFakeAPI` implementation by adding
the missing annotations and labels entries.
## Testing Memory Consumption
The gains here aren't as big as in #9650. In order to test this we need
to push hard and create 4000 RS:
``` bash
for i in {0..4000}; do kubectl create deployment test-pod-$i --image=nginx; done
```
In edge-23.2.1 the destination pod's memory consumption goes from 40Mi
to 160Mi after all the RS were created. With this change, it went from
37Mi to 140Mi.
Fixes #10205
The server watcher uses `*corev1.Pod` in the key of its subscriber map, which is potentially brittle since it's not obvious what exactly should count as equality for full pod resources.
More robust is to use a unique per-pod identifier in the key: the pod's name and namespace. We therefore move the pod resource into the value type of the map, since we still need it when we receive a Server update so that we can iterate through all subscriptions and find pods which match the Server's pod selector.
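A sketch of the reshaped map (types illustrative):

```go
import corev1 "k8s.io/api/core/v1"

type subscriber interface{ Update() }

// key by stable identity rather than by *corev1.Pod pointer equality
type podPortKey struct {
	namespace, name string
	port            uint32
}

type podSubscription struct {
	pod         *corev1.Pod // kept in the value: still needed to match Server pod selectors
	subscribers []subscriber
}

var subscriptions = make(map[podPortKey]*podSubscription)
```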
Signed-off-by: Alex Leong <alex@buoyant.io>
* Add metrics to server and service watchers
Closes #10202 and completes #2204
As a followup to #10201, I'm adding the following metric in `server_watcher.go`:
- `server_port_subscribers`: This tracks the number of subscribers to changes to Servers associated with a port in a pod. The metric's labels identify the namespace and name of the pod, and its targeted port.
Additionally, `opaque_ports.go` was missing metrics as well. I added `service_subscribers` which tracks the number of subscribers to a given Service, labeled by the Service's namespace and name.
`opaque_ports.go` was also leaking the subscriber's map key, so that got fixed as well.
Closes #10130
https://kubernetes.io/docs/concepts/services-networking/service-traffic-policy/
1. Update endpoints watcher to include additional field `localTrafficPolicy`.
Set to true when `.spec.internalTrafficPolicy` is set to `Local`
2. Update endpoints translator to filter by node when `localTrafficPolicy` is
set to true. Topology Aware Hints are not used when
`service.spec.internalTrafficPolicy` is set to `Local`.
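A simplified sketch of the node filter described above (types illustrative):

```go
import corev1 "k8s.io/api/core/v1"

type endpointAddr struct {
	ip  string
	pod *corev1.Pod
}

// when the Service sets internalTrafficPolicy: Local, keep only the
// endpoints whose pod runs on the client's node
func filterLocal(addrs []endpointAddr, localTrafficPolicy bool, clientNode string) []endpointAddr {
	if !localTrafficPolicy {
		return addrs
	}
	var out []endpointAddr
	for _, a := range addrs {
		if a.pod != nil && a.pod.Spec.NodeName == clientNode {
			out = append(out, a)
		}
	}
	return out
}
```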
Signed-off-by: Yu Cao <yc185050@ncr.com>
Fixes #8270
When a listener unsubscribes from port updates in Servers, we were
removing the listener from the `ServerWatcher.subscriptions` map, leaving
the map's key (`podPort`, which holds the pod object and port) with an
empty value. In clusters where there's a lot of pod churn, those keys
with empty values were getting accumulated, so this change cleans that
up.
The repro (basically constantly rolling emojivoto) is described in
#9947.
A followup will be up shortly adding metrics to track these subscriptions,
along with similar missing metrics from other parts of Destination.
* build(deps): bump sigs.k8s.io/gateway-api from 0.5.1 to 0.6.0
Bumps [sigs.k8s.io/gateway-api](https://github.com/kubernetes-sigs/gateway-api) from 0.5.1 to 0.6.0.
- [Release notes](https://github.com/kubernetes-sigs/gateway-api/releases)
- [Changelog](https://github.com/kubernetes-sigs/gateway-api/blob/main/CHANGELOG.md)
- [Commits](https://github.com/kubernetes-sigs/gateway-api/compare/v0.5.1...v0.6.0)
---
updated-dependencies:
- dependency-name: sigs.k8s.io/gateway-api
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
* Account for possible errors returned from `AddEventHandler`
In v0.26.0 client-go's `AddEventHandler` method for informers started
returning a registration handle (that we ignore) and an error that we
now surface up.
* client-go v0.26.0 removed the openstack plugin
* Temporary changes to trigger tests in k8s 1.21
- Adds an innocuous change to integration.yml so that all tests get
triggered
- Hard-code k8s version in `k3d cluster create` invocation to v1.21
* Revert "Temporary changes to trigger tests in k8s 1.21"
This reverts commit 3e1fdd0e5e.
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alejandro Pedraza <alejandro@buoyant.io>