Commit Graph

12 Commits

Author SHA1 Message Date
Matei David 407df01ec3
chore(controller): Remove stream concurrency limits (#12598)
Our gRPC servers use the default gRPC server configuration, which
limits the number of concurrent streams to 100. Since the controllers
run with proxies, this provides a hard scaling limit for the number of
watches an application can have.

This change updates our gRPC server configuration to clear the default
concurrency limit, allowing the server to handle as many streams as
possible.

Signed-off-by: Matei David <matei@buoyant.io>
Co-authored-by: Oliver Gould <ver@buoyant.io>
2024-05-15 18:07:15 +01:00
Alex Leong e59ae0ff3b
Add error counter for k8s client (#11774)
We add a `http_client_errors_total` counter metric to control plane components to measure when requests to the k8s API fail without a response.

Signed-off-by: Alex Leong <alex@buoyant.io>
2023-12-20 09:34:31 -08:00
Alejandro Pedraza 2d716299a1
Add ability to configure client-go's `QPS` and `Burst` settings (#11644)
* Add ability to configure client-go's `QPS` and `Burst` settings

## Problem and Symptoms

When having a very large number of proxies request identity in a short period of time (e.g. during large node scaling events), the identity controller will attempt to validate the tokens sent by the proxies at a rate surpassing client-go's the default request rate threshold, triggering client-side throttling, which will delay the proxies initialization, and even failing their startup (after a 2m timeout). The identity controller will surface this through log entries like this:

```
time="2023-11-08T19:50:45Z" level=error msg="error validating token for web.emojivoto.serviceaccount.identity.linkerd.cluster.local: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
```

## Solution

Client-go's default `QPS` is 5 and `Burst` is 10. This PR exposes those settings as entries in `values.yaml` with defaults of 100 and 200 respectively. Note this only applies to the identity controller, as it's the only controller performing direct requests to the `kube-apiserver` in a hot path. The other controllers mostly rely in informers, and direct calls are sporadic.

## Observability

The `QPS` and `Burst` settings used are exposed both as a log entry as soon as the controller starts, and as in the new metric gauges `http_client_qps` and `http_client_burst`

## Testing

You can use the following K6 script, which simulates 6k calls to the `Certify` service during one minute from emojivoto's web pod. Before running this you need to:

- Put the identity.proto and [all the other proto files](https://github.com/linkerd/linkerd2-proxy-api/tree/v0.11.0/proto) in the same directory.
- Edit the [checkRequest](https://github.com/linkerd/linkerd2/blob/edge-23.11.3/pkg/identity/service.go#L266) function and add logging statements to figure the `token` and `csr` entries you can use here, that will be shown as soon as a web pod starts.

```javascript
import { Client, Stream } from 'k6/experimental/grpc';
import { sleep } from 'k6';

const client = new Client();
client.load(['.'], 'identity.proto');

// This always holds:
// req_num = (1 / req_duration ) * duration * VUs
// Given req_duration (0.5s) test duration (1m) and the target req_num (6k), we
// can solve for the required VUs:
// VUs = req_num * req_duration / duration
// VUs = 6000 * 0.5 / 60 = 50
export const options = {
  scenarios: {
    identity: {
      executor: 'constant-vus',
      vus: 50,
      duration: '1m',
    },
  },
};

export default () => {
  client.connect('localhost:8080', {
    plaintext: true,
  });

  const stream = new Stream(client, 'io.linkerd.proxy.identity.Identity/Certify');

  // Replace with your own token
  let token = "ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNkluQjBaV1pUZWtaNWQyVm5OMmxmTTBkV2VUTlhWSFpqTmxwSmJYRmtNMWRSVEhwNVNHWllhUzFaZDNNaWZRLmV5SmhkV1FpT2xzaWFXUmxiblJwZEhrdWJEVmtMbWx2SWwwc0ltVjRjQ0k2TVRjd01EWTRPVFk1TUN3aWFXRjBJam94TnpBd05qQXpNamt3TENKcGMzTWlPaUpvZEhSd2N6b3ZMMnQxWW1WeWJtVjBaWE11WkdWbVlYVnNkQzV6ZG1NdVkyeDFjM1JsY2k1c2IyTmhiQ0lzSW10MVltVnlibVYwWlhNdWFXOGlPbnNpYm1GdFpYTndZV05sSWpvaVpXMXZhbWwyYjNSdklpd2ljRzlrSWpwN0ltNWhiV1VpT2lKM1pXSXRPRFUxT1dJNU4yWTNZeTEwYldJNU5TSXNJblZwWkNJNklqaGlZbUV5WWpsbExXTXdOVGN0TkRnMk1TMWhNalZsTFRjelpEY3dOV1EzWmpoaU1TSjlMQ0p6WlhKMmFXTmxZV05qYjNWdWRDSTZleUp1WVcxbElqb2lkMlZpSWl3aWRXbGtJam9pWm1JelpUQXlNRE10TmpZMU55MDBOMk0xTFRoa09EUXRORGt6WXpBM1lXUTJaak0zSW4xOUxDSnVZbVlpT2pFM01EQTJNRE15T1RBc0luTjFZaUk2SW5ONWMzUmxiVHB6WlhKMmFXTmxZV05qYjNWdWREcGxiVzlxYVhadmRHODZkMlZpSW4wLnlwMzAzZVZkeHhpamxBOG1wVjFObGZKUDB3SC03RmpUQl9PcWJ3NTNPeGU1cnNTcDNNNk96VWR6OFdhYS1hcjNkVVhQR2x2QXRDRVU2RjJUN1lKUFoxVmxxOUFZZTNvV2YwOXUzOWRodUU1ZDhEX21JUl9rWDUxY193am9UcVlORHA5ZzZ4ZFJNcW9reGg3NE9GNXFjaEFhRGtENUJNZVZ6a25kUWZtVVZwME5BdTdDMTZ3UFZWSlFmNlVXRGtnYkI1SW9UQXpxSmcyWlpyNXBBY3F5enJ0WE1rRkhSWmdvYUxVam5sN1FwX0ljWm8yYzJWWk03T2QzRjIwcFZaVzJvejlOdGt3THZoSEhSMkc5WlNJQ3RHRjdhTkYwNVR5ZC1UeU1BVnZMYnM0ZFl1clRYaHNORjhQMVk4RmFuNjE4d0x6ZUVMOUkzS1BJLUctUXRUNHhWdw==";
  // Replace with your own CSR
  let csr = "MIIBWjCCAQECAQAwRjFEMEIGA1UEAxM7d2ViLmVtb2ppdm90by5zZXJ2aWNlYWNjb3VudC5pZGVudGl0eS5saW5rZXJkLmNsdXN0ZXIubG9jYWwwWTATBgcqhkjOPQIBBggqhkjOPQMBBwNCAATKjgVXu6F+WCda3Bbq2ue6m3z6OTMfQ4Vnmekmvirip/XGyi2HbzRzjARnIzGlG8wo4EfeYBtd2MBCb50kP8F8oFkwVwYJKoZIhvcNAQkOMUowSDBGBgNVHREEPzA9gjt3ZWIuZW1vaml2b3RvLnNlcnZpY2VhY2NvdW50LmlkZW50aXR5LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAKBggqhkjOPQQDAgNHADBEAiAM7aXY8MRs/EOhtPo4+PRHuiNOV+nsmNDv5lvtJt8T+QIgFP5JAq0iq7M6ShRNkRG99ZquJ3L3TtLWMNVTPvqvvUE=";

  const data = {
		identity:                     "web.emojivoto.serviceaccount.identity.linkerd.cluster.local",
    token:                        token,
		certificate_signing_request:  csr,
  };
  stream.write(data);

  // This request takes around 2ms, so this sleep will mostly determine its final duration
  sleep(0.5);
};
```

This results in the following report:

```
scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
           * identity: 50 looping VUs for 1m0s (gracefulStop: 30s)

     data_received................: 6.3 MB 104 kB/s
     data_sent....................: 9.4 MB 156 kB/s
     grpc_req_duration............: avg=2.14ms   min=873.93µs med=1.9ms    max=12.89ms  p(90)=3.13ms   p(95)=3.86ms
     grpc_streams.................: 6000   99.355331/s
     grpc_streams_msgs_received...: 6000   99.355331/s
     grpc_streams_msgs_sent.......: 6000   99.355331/s
     iteration_duration...........: avg=503.16ms min=500.8ms  med=502.64ms max=532.36ms p(90)=504.05ms p(95)=505.72ms
     iterations...................: 6000   99.355331/s
     vus..........................: 50     min=50      max=50
     vus_max......................: 50     min=50      max=50

running (1m00.4s), 00/50 VUs, 6000 complete and 0 interrupted iterations
```

With the old defaults (QPS=5 and Burst=10), the latencies would be much higher and number of complete requests much lower.
2023-11-28 15:25:05 -05:00
Oliver Gould 425a43def5
Enable gocritic linting (#7906)
[gocritic][gc] helps to enforce some consistency and check for potential
errors. This change applies linting changes and enables gocritic via
golangci-lint.

[gc]: https://github.com/go-critic/go-critic

Signed-off-by: Oliver Gould <ver@buoyant.io>
2022-02-17 22:45:25 +00:00
Tarun Pothulapati f3deee01b6 Trace Control plane Components with OC (#3495)
* add trace flags and initialisation
* add ocgrpc handler to newgrpc
* add ochttp handler to linkerd web
* add flags to linkerd web
* add ochttp handler to prometheus handler initialisation
* add ochttp clients for components
* add span for prometheus query
* update godep sha
* fix reviews
* better commenting
* add err checking
* remove sampling
* add check in main
* move to pkg/trace

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2019-10-18 12:19:13 -07:00
Alex Leong 4799baa8e2
Revert "Trace Control Plane components using OC (#3461)" (#3484)
This reverts commit edd3b1f6d4.

This is a temporary revert of #3461 while we sort out some details of how this should configured and how it should interact with configuring a trace collector on the Linkerd proxy.  We will reintroduce this change once the config plan is straightened out.

Signed-off-by: Alex Leong <alex@buoyant.io>
2019-09-26 11:56:44 -07:00
Tarun Pothulapati edd3b1f6d4 Trace Control Plane components using OC (#3461)
* add exporter config for all components

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* add cmd flags wrt tracing

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* add ochttp tracing to web server

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* add flags to the tap deployment

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* add trace flags to install and upgrade command

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* add linkerd prefix to svc names

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* add ochttp trasport to API Internal Client

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* fix goimport linting errors

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* add ochttp handler to tap http server

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* review and fix tests

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* update test values

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* use common template

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* update tests

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* use Initialize

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* fix sample flag

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>

* add verbose info reg flags

Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
2019-09-26 08:11:48 -07:00
Andrew Seigner c5a85e587c
Update to client-go v12.0.0, forked stern (#3387)
The repo depended on an old version of client-go. It also depended on
stern, which itself depended on an old version of client-go, making
client-go upgrade non-trivial.

Update the repo to client-go v12.0.0, and also replace stern with a
fork.

This fork of stern includes the following changes:
- updated to use Go Modules
- updated to use client-go v12.0.0
- fixed log line interleaving:
  - https://github.com/wercker/stern/issues/96
  - based on:
    - 8723308e46

Fixes #3382

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2019-09-10 11:04:29 -07:00
Andrew Seigner 1df1683b6a
Instrument k8s clients (#2243)
The control-plane's clients, specifically the Kubernetes clients, did
not provide telemetry information.

Introduce a `prometheus.ClientWithTelemetry` wrapper to instrument
arbitrary clients. Apply this wrapper to Kubernetes clients.

Fixes #2183

Signed-off-by: Andrew Seigner <siggy@buoyant.io>
2019-02-18 09:10:02 -08:00
Radu M 07cbfe2725 Fix most golint issues that are not comment related (#1982)
Signed-off-by: Radu Matei <radu@radu-matei.com>
2018-12-20 10:37:47 -08:00
Kevin Lingerfelt 682b0274b5
Add controller admin servers and readiness probes (#1168)
* Add controller admin servers and readiness probes
* Tweak readiness probes to be more sane
* Refactor based on review feedback

Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-06-20 17:32:44 -07:00
Kevin Lingerfelt 9f1df963e9
Move controller/util and web/util packages to pkg (#1109)
Signed-off-by: Kevin Lingerfelt <kl@buoyant.io>
2018-06-13 11:25:56 -07:00