Our gRPC servers use the default gRPC server configuration, which
limits the number of concurrent streams per connection to 100. Because the
controllers run with proxies, this imposes a hard scaling limit on the number
of watches an application can have.
This change updates our gRPC server configuration to clear the default
concurrency limit, allowing the server to handle as many streams as
possible.
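A minimal sketch of the kind of change involved, assuming grpc-go's `MaxConcurrentStreams` server option; this is illustrative, not the exact linkerd2 code:

```go
// Illustrative sketch only: remove the per-connection stream cap when
// constructing a gRPC server. Passing math.MaxUint32 effectively disables
// the limit, so the server accepts as many streams as possible.
package main

import (
	"math"

	"google.golang.org/grpc"
)

func newServer(opts ...grpc.ServerOption) *grpc.Server {
	opts = append(opts,
		// The previous configuration capped concurrent streams at 100;
		// clearing the cap removes the hard scaling limit on watches.
		grpc.MaxConcurrentStreams(math.MaxUint32),
	)
	return grpc.NewServer(opts...)
}
```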
Signed-off-by: Matei David <matei@buoyant.io>
Co-authored-by: Oliver Gould <ver@buoyant.io>
We add an `http_client_errors_total` counter metric to control plane components to measure when requests to the k8s API fail without receiving a response.
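As a rough sketch (not the actual implementation), such a counter can be wired in with an `http.RoundTripper` wrapper that increments only when the request fails before any response is received; the label set here is illustrative:

```go
// Rough sketch, not the linkerd2 implementation: a RoundTripper wrapper that
// counts requests to the Kubernetes API that fail without any response.
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

var httpClientErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_client_errors_total",
		Help: "Requests that failed without receiving an HTTP response.",
	},
	[]string{"client"}, // illustrative label
)

type errorCountingTransport struct {
	client string
	next   http.RoundTripper
}

func (t *errorCountingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.next.RoundTrip(req)
	if err != nil {
		// err != nil means no response came back at all (e.g. connection
		// refused or timeout), which is exactly what we want to count.
		httpClientErrors.With(prometheus.Labels{"client": t.client}).Inc()
	}
	return resp, err
}

func init() {
	prometheus.MustRegister(httpClientErrors)
}
```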
Signed-off-by: Alex Leong <alex@buoyant.io>
* Add ability to configure client-go's `QPS` and `Burst` settings
## Problem and Symptoms
When a very large number of proxies request identity in a short period of time (e.g. during large node scaling events), the identity controller attempts to validate the tokens sent by the proxies at a rate that exceeds client-go's default request rate threshold, triggering client-side throttling. This delays the proxies' initialization and can even cause their startup to fail (after a 2m timeout). The identity controller surfaces this through log entries like this:
```
time="2023-11-08T19:50:45Z" level=error msg="error validating token for web.emojivoto.serviceaccount.identity.linkerd.cluster.local: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
```
## Solution
Client-go's default `QPS` is 5 and `Burst` is 10. This PR exposes those settings as entries in `values.yaml`, with defaults of 100 and 200 respectively. Note this only applies to the identity controller, as it's the only controller making direct requests to the `kube-apiserver` in a hot path; the other controllers mostly rely on informers, and their direct calls are sporadic.
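For reference, a minimal sketch of how these settings are applied with client-go (the function and the plumbing of the `values.yaml` entries are illustrative, not the linkerd2 code):

```go
// Minimal sketch, assuming the values from values.yaml are plumbed in as
// arguments; client-go rate limiting is configured on rest.Config before the
// clientset is built.
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func newClientset(kubeconfig string, qps float32, burst int) (*kubernetes.Clientset, error) {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	// client-go defaults are QPS=5 and Burst=10; raising them avoids
	// client-side throttling under bursts of Certify requests.
	config.QPS = qps     // e.g. 100
	config.Burst = burst // e.g. 200
	return kubernetes.NewForConfig(config)
}
```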
## Observability
The `QPS` and `Burst` settings in use are exposed both in a log entry emitted as soon as the controller starts, and in the new gauge metrics `http_client_qps` and `http_client_burst`.
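A small illustrative sketch of what that could look like (only the metric names come from this change; the helper is hypothetical):

```go
// Illustrative only: record the configured values as gauges and log them at
// startup.
package metrics

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	qpsGauge = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "http_client_qps",
		Help: "Configured client-go QPS.",
	})
	burstGauge = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "http_client_burst",
		Help: "Configured client-go Burst.",
	})
)

// RecordClientSettings is a hypothetical helper that publishes the settings.
func RecordClientSettings(qps float32, burst int) {
	prometheus.MustRegister(qpsGauge, burstGauge)
	qpsGauge.Set(float64(qps))
	burstGauge.Set(float64(burst))
	log.Printf("Kubernetes API client settings: QPS=%v Burst=%d", qps, burst)
}
```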
## Testing
You can use the following k6 script, which simulates 6k calls to the `Certify` service over one minute, as if issued from emojivoto's web pod. Before running it you need to:
- Put the identity.proto and [all the other proto files](https://github.com/linkerd/linkerd2-proxy-api/tree/v0.11.0/proto) in the same directory.
- Edit the [checkRequest](https://github.com/linkerd/linkerd2/blob/edge-23.11.3/pkg/identity/service.go#L266) function and add logging statements to figure out the `token` and `csr` values to use here; they will be logged as soon as a web pod starts.
```javascript
import { Client, Stream } from 'k6/experimental/grpc';
import { sleep } from 'k6';
const client = new Client();
client.load(['.'], 'identity.proto');
// This always holds:
// req_num = (1 / req_duration ) * duration * VUs
// Given req_duration (0.5s) test duration (1m) and the target req_num (6k), we
// can solve for the required VUs:
// VUs = req_num * req_duration / duration
// VUs = 6000 * 0.5 / 60 = 50
export const options = {
  scenarios: {
    identity: {
      executor: 'constant-vus',
      vus: 50,
      duration: '1m',
    },
  },
};
export default () => {
  client.connect('localhost:8080', {
    plaintext: true,
  });
  const stream = new Stream(client, 'io.linkerd.proxy.identity.Identity/Certify');
// Replace with your own token
let token = "ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNkluQjBaV1pUZWtaNWQyVm5OMmxmTTBkV2VUTlhWSFpqTmxwSmJYRmtNMWRSVEhwNVNHWllhUzFaZDNNaWZRLmV5SmhkV1FpT2xzaWFXUmxiblJwZEhrdWJEVmtMbWx2SWwwc0ltVjRjQ0k2TVRjd01EWTRPVFk1TUN3aWFXRjBJam94TnpBd05qQXpNamt3TENKcGMzTWlPaUpvZEhSd2N6b3ZMMnQxWW1WeWJtVjBaWE11WkdWbVlYVnNkQzV6ZG1NdVkyeDFjM1JsY2k1c2IyTmhiQ0lzSW10MVltVnlibVYwWlhNdWFXOGlPbnNpYm1GdFpYTndZV05sSWpvaVpXMXZhbWwyYjNSdklpd2ljRzlrSWpwN0ltNWhiV1VpT2lKM1pXSXRPRFUxT1dJNU4yWTNZeTEwYldJNU5TSXNJblZwWkNJNklqaGlZbUV5WWpsbExXTXdOVGN0TkRnMk1TMWhNalZsTFRjelpEY3dOV1EzWmpoaU1TSjlMQ0p6WlhKMmFXTmxZV05qYjNWdWRDSTZleUp1WVcxbElqb2lkMlZpSWl3aWRXbGtJam9pWm1JelpUQXlNRE10TmpZMU55MDBOMk0xTFRoa09EUXRORGt6WXpBM1lXUTJaak0zSW4xOUxDSnVZbVlpT2pFM01EQTJNRE15T1RBc0luTjFZaUk2SW5ONWMzUmxiVHB6WlhKMmFXTmxZV05qYjNWdWREcGxiVzlxYVhadmRHODZkMlZpSW4wLnlwMzAzZVZkeHhpamxBOG1wVjFObGZKUDB3SC03RmpUQl9PcWJ3NTNPeGU1cnNTcDNNNk96VWR6OFdhYS1hcjNkVVhQR2x2QXRDRVU2RjJUN1lKUFoxVmxxOUFZZTNvV2YwOXUzOWRodUU1ZDhEX21JUl9rWDUxY193am9UcVlORHA5ZzZ4ZFJNcW9reGg3NE9GNXFjaEFhRGtENUJNZVZ6a25kUWZtVVZwME5BdTdDMTZ3UFZWSlFmNlVXRGtnYkI1SW9UQXpxSmcyWlpyNXBBY3F5enJ0WE1rRkhSWmdvYUxVam5sN1FwX0ljWm8yYzJWWk03T2QzRjIwcFZaVzJvejlOdGt3THZoSEhSMkc5WlNJQ3RHRjdhTkYwNVR5ZC1UeU1BVnZMYnM0ZFl1clRYaHNORjhQMVk4RmFuNjE4d0x6ZUVMOUkzS1BJLUctUXRUNHhWdw==";
// Replace with your own CSR
let csr = "MIIBWjCCAQECAQAwRjFEMEIGA1UEAxM7d2ViLmVtb2ppdm90by5zZXJ2aWNlYWNjb3VudC5pZGVudGl0eS5saW5rZXJkLmNsdXN0ZXIubG9jYWwwWTATBgcqhkjOPQIBBggqhkjOPQMBBwNCAATKjgVXu6F+WCda3Bbq2ue6m3z6OTMfQ4Vnmekmvirip/XGyi2HbzRzjARnIzGlG8wo4EfeYBtd2MBCb50kP8F8oFkwVwYJKoZIhvcNAQkOMUowSDBGBgNVHREEPzA9gjt3ZWIuZW1vaml2b3RvLnNlcnZpY2VhY2NvdW50LmlkZW50aXR5LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAKBggqhkjOPQQDAgNHADBEAiAM7aXY8MRs/EOhtPo4+PRHuiNOV+nsmNDv5lvtJt8T+QIgFP5JAq0iq7M6ShRNkRG99ZquJ3L3TtLWMNVTPvqvvUE=";
  const data = {
    identity: "web.emojivoto.serviceaccount.identity.linkerd.cluster.local",
    token: token,
    certificate_signing_request: csr,
  };
  stream.write(data);
  // This request takes around 2ms, so this sleep will mostly determine its final duration
  sleep(0.5);
};
```
This results in the following report:
```
scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
* identity: 50 looping VUs for 1m0s (gracefulStop: 30s)
data_received................: 6.3 MB 104 kB/s
data_sent....................: 9.4 MB 156 kB/s
grpc_req_duration............: avg=2.14ms min=873.93µs med=1.9ms max=12.89ms p(90)=3.13ms p(95)=3.86ms
grpc_streams.................: 6000 99.355331/s
grpc_streams_msgs_received...: 6000 99.355331/s
grpc_streams_msgs_sent.......: 6000 99.355331/s
iteration_duration...........: avg=503.16ms min=500.8ms med=502.64ms max=532.36ms p(90)=504.05ms p(95)=505.72ms
iterations...................: 6000 99.355331/s
vus..........................: 50 min=50 max=50
vus_max......................: 50 min=50 max=50
running (1m00.4s), 00/50 VUs, 6000 complete and 0 interrupted iterations
```
With the old defaults (QPS=5 and Burst=10), the latencies would be much higher and the number of completed requests much lower.
[gocritic][gc] helps enforce consistency and check for potential
errors. This change applies the resulting lint fixes and enables gocritic via
golangci-lint.
[gc]: https://github.com/go-critic/go-critic
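For context, gocritic's checkers include `ifElseChain` and `assignOp`; a small illustrative example (not from this diff) of the style they enforce:

```go
// Illustrative example of the rewrites gocritic suggests; ifElseChain and
// assignOp are real checkers, but this snippet is not from the linkerd2 change.
package main

import "fmt"

// classify uses a switch rather than an if-else-if chain, as the ifElseChain
// checker suggests.
func classify(n int) string {
	switch {
	case n < 0:
		return "negative"
	case n == 0:
		return "zero"
	default:
		return "positive"
	}
}

func main() {
	total := 0
	for i := 0; i < 3; i++ {
		// assignOp: prefer `total += i` over `total = total + i`.
		total += i
	}
	fmt.Println(classify(total))
}
```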
Signed-off-by: Oliver Gould <ver@buoyant.io>
This reverts commit edd3b1f6d4.
This is a temporary revert of #3461 while we sort out some details of how this should be configured and how it should interact with configuring a trace collector on the Linkerd proxy. We will reintroduce this change once the configuration plan is straightened out.
Signed-off-by: Alex Leong <alex@buoyant.io>
The repo depended on an old version of client-go. It also depended on
stern, which itself depended on an old version of client-go, making a
client-go upgrade non-trivial.
Update the repo to client-go v12.0.0, and also replace stern with a
fork.
This fork of stern includes the following changes:
- updated to use Go Modules
- updated to use client-go v12.0.0
- fixed log line interleaving:
  - https://github.com/wercker/stern/issues/96
- based on:
  - 8723308e46

Fixes #3382
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The control-plane's clients, specifically the Kubernetes clients, did
not provide telemetry information.
Introduce a `prometheus.ClientWithTelemetry` wrapper to instrument
arbitrary clients. Apply this wrapper to Kubernetes clients.
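A sketch of what a `ClientWithTelemetry`-style wrapper might look like using `promhttp` round-tripper instrumentation; the metric names and wiring here are assumptions, and the actual linkerd2 helper may differ:

```go
// Sketch only: instrument an *http.Client so that request counts and latencies
// are recorded, labeled with the given client name.
package prometheus

import (
	"net/http"

	prom "github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func ClientWithTelemetry(name string, client *http.Client) *http.Client {
	requests := prom.NewCounterVec(
		prom.CounterOpts{
			Name:        "http_client_requests_total",
			Help:        "Total HTTP requests issued by the client.",
			ConstLabels: prom.Labels{"client": name},
		},
		[]string{"code", "method"},
	)
	latency := prom.NewHistogramVec(
		prom.HistogramOpts{
			Name:        "http_client_request_latency_seconds",
			Help:        "HTTP client request latency in seconds.",
			ConstLabels: prom.Labels{"client": name},
		},
		[]string{"code", "method"},
	)
	prom.MustRegister(requests, latency)

	// Wrap whatever transport the client already uses.
	next := client.Transport
	if next == nil {
		next = http.DefaultTransport
	}
	client.Transport = promhttp.InstrumentRoundTripperCounter(requests,
		promhttp.InstrumentRoundTripperDuration(latency, next))
	return client
}
```

For Kubernetes clients specifically, the same idea can be applied by wrapping the transport on the `rest.Config` (via its `WrapTransport` hook) before the clientset is built.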
Fixes #2183
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* Add controller admin servers and readiness probes
* Tweak readiness probes to be more sane
* Refactor based on review feedback
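For illustration, a minimal admin server with liveness and readiness endpoints might look like the sketch below; the port, paths, and wiring are assumptions, not the actual linkerd2 code.

```go
// Minimal sketch of a controller admin server exposing a readiness probe
// target alongside a liveness endpoint.
package main

import (
	"net/http"
	"sync/atomic"
)

type adminServer struct {
	ready atomic.Bool
}

// SetReady flips the readiness flag once initialization (e.g. informer cache
// sync) has completed.
func (a *adminServer) SetReady(ready bool) { a.ready.Store(ready) }

func (a *adminServer) Serve(addr string) error {
	mux := http.NewServeMux()
	mux.HandleFunc("/ping", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK) // liveness: always OK while the process runs
	})
	mux.HandleFunc("/ready", func(w http.ResponseWriter, _ *http.Request) {
		// Readiness probe target: report 503 until the controller is ready.
		if a.ready.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	return http.ListenAndServe(addr, mux)
}

func main() {
	admin := &adminServer{}
	admin.SetReady(true) // in a real controller this happens after caches sync
	_ = admin.Serve(":9990") // illustrative admin port
}
```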
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>