As a side-effect, #13783 moved the service mirror's permissions on Links from a Role to a ClusterRole; this change reverts that by refactoring the Links API to allow consuming a namespace-scoped API more easily.
- We introduce in our `k8s.API` type a field `L5dClient` alongside the broad `Client` one; `L5dClient` is constructed via the new function `NewL5dNamespacedAPI()` (a sketch of the namespace-scoped pattern follows the list).
- In the service-mirror `main.go` we use that constructor to acquire `linksAPI`, which is used to configure the informer for handling Link events in this file.
- `linksAPI` is also passed down to instantiations of `RemoteClusterServiceWatcher`, where it's used for the direct kube-apiserver calls and for retrieving a Lister.
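For illustration, here is a minimal sketch of the namespace-scoped informer pattern this refactor enables, written against plain client-go types; the function name and wiring below are illustrative and are not the actual `NewL5dNamespacedAPI()` implementation, which wraps the generated Link clientset.

```go
package api

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newNamespacedFactory builds a shared informer factory whose informers are
// restricted to a single namespace, so a Role (rather than a ClusterRole) is
// enough for the watches it creates.
func newNamespacedFactory(cfg *rest.Config, ns string) (informers.SharedInformerFactory, error) {
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	// WithNamespace scopes every informer created by this factory to ns.
	return informers.NewSharedInformerFactoryWithOptions(
		cs, 10*time.Minute, informers.WithNamespace(ns),
	), nil
}
```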
We add a new v1alpha3 resource version to the Link custom resource. This version adds `excludedAnnotations` and `excludedLabels` fields to the spec which will be used to exclude labels and annotations from being copied onto mirror and federated services.
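A rough sketch of what the new spec fields could look like as Go types; the field types (string lists) and the omission of the rest of the spec are assumptions for illustration, not the generated v1alpha3 code.

```go
package link

// LinkSpec shows only the new v1alpha3 fields; the real spec carries many
// more fields (target cluster credentials, selectors, etc.).
type LinkSpec struct {
	// ExcludedAnnotations lists annotations that must not be copied from
	// member services onto mirror and federated services.
	ExcludedAnnotations []string `json:"excludedAnnotations,omitempty"`
	// ExcludedLabels lists labels that must not be copied from member
	// services onto mirror and federated services.
	ExcludedLabels []string `json:"excludedLabels,omitempty"`
}
```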
Signed-off-by: Alex Leong <alex@buoyant.io>
When the service mirror controller detects the first member of a federated service, it will create the federated service itself and will copy the port definitions from the founding member service. However, from that point on, the port definition in the federated service will never be updated, even if new members that define different ports join the federated service or if the ports in the founding member service are updated.
This leads to a scenario where the federated service's port definitions become out of date if the member services' port definitions change. The same applies to its labels and annotations.
We update the service mirror controller to look at the createdAt timestamp of each member service's Link resource and to use the member with the oldest Link as the authoritative source of truth for the federated service metadata: labels, annotations, and ports.
In the case where the service with the oldest Link leaves the federated service, its metadata will continue to be present on the federated service until a resync occurs from the member with the next oldest Link (at most 10 minutes).
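A minimal sketch of the selection rule, assuming we already have each member's Link creation timestamp at hand; the helper name and signature are hypothetical, not the controller's actual code.

```go
package servicemirror

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// authoritativeMember returns the member (keyed by cluster name) whose Link
// was created first; that member's labels, annotations, and ports become the
// source of truth for the federated service.
func authoritativeMember(linkCreatedAt map[string]metav1.Time) string {
	var oldest string
	var oldestTime metav1.Time
	for cluster, t := range linkCreatedAt {
		if oldest == "" || t.Before(&oldestTime) {
			oldest = cluster
			oldestTime = t
		}
	}
	return oldest
}
```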
Signed-off-by: Alex Leong <alex@buoyant.io>
* fix(service-mirror): don't restart cluster watch upon Link status updates
Every time there's an update to a Link resource, the service mirror restarts the cluster watch after cleaning up any existing worker. We recently introduced a status stanza in Link that gets updated upon every mirroring of a service, which was unnecessarily triggering a cluster watcher restart. For a sufficiently high number of services getting mirrored at once, this was causing severe contention on the controller, slowing mirroring down to a halt.
This change fixes the situation by only considering changes in the Link Spec for restarting the cluster watch.
* Lower log level
* Extract the resource event handler functions into a separate file, and add unit tests making sure the add/update/delete functions are called, and that in particular the update function is _not_ called when only a Link's status is updated (a sketch of the spec-only guard follows).
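A sketch of what the spec-only update guard can look like; the `Link` stand-in type and the `restart`/`stop` callbacks are placeholders for the real service-mirror types, not the actual handlers.

```go
package servicemirror

import (
	"reflect"

	"k8s.io/client-go/tools/cache"
)

// Link is a stand-in for the generated Link type; only the fields relevant
// to the example are shown.
type Link struct {
	Spec   map[string]string
	Status map[string]string
}

// newLinkHandlers returns handlers in which only Spec changes trigger a
// cluster-watch restart; status-only updates are ignored.
func newLinkHandlers(restart, stop func(*Link)) cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if l, ok := obj.(*Link); ok {
				restart(l)
			}
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldLink, okOld := oldObj.(*Link)
			newLink, okNew := newObj.(*Link)
			if !okOld || !okNew {
				return
			}
			// The status stanza is written on every mirrored service; only
			// a Spec change warrants tearing down the cluster watcher.
			if !reflect.DeepEqual(oldLink.Spec, newLink.Spec) {
				restart(newLink)
			}
		},
		DeleteFunc: func(obj interface{}) {
			if l, ok := obj.(*Link); ok {
				stop(l)
			}
		},
	}
}
```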
Followup to #12844
This new field defines the default policy for Servers, i.e. if a request doesn't match the policy associated with a Server, then this default applies. The values are the same as for `proxy.defaultInboundPolicy` and the `config.linkerd.io/default-inbound-policy` annotation (all-unauthenticated, all-authenticated, cluster-authenticated, cluster-unauthenticated, deny), plus a new value "audit". The default is "deny", thus remaining backwards-compatible.
This field is also exposed as an additional printer column.
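As a hedged illustration of how a consumer of this field might interpret the new value, here is a sketch of the fallback decision for requests that matched no authorization; the type, function names, and the exact semantics of the authenticated/in-cluster checks are assumptions, not the policy controller's actual evaluation code.

```go
package policy

// Decision is what the data plane does with a request that matched no
// authorization policy on the Server.
type Decision int

const (
	Allow Decision = iota
	Deny
	AuditThenAllow // let the request through, but record it as unauthorized
)

// defaultDecision maps the Server-level default policy value to a decision,
// mirroring the existing default-inbound-policy value set plus "audit".
func defaultDecision(policy string, clientAuthenticated, clientInCluster bool) Decision {
	switch policy {
	case "all-unauthenticated":
		return Allow
	case "all-authenticated":
		if clientAuthenticated {
			return Allow
		}
		return Deny
	case "cluster-unauthenticated":
		if clientInCluster {
			return Allow
		}
		return Deny
	case "cluster-authenticated":
		if clientAuthenticated && clientInCluster {
			return Allow
		}
		return Deny
	case "audit":
		// New value: nothing is blocked, but unauthorized traffic is
		// surfaced so operators can tighten policy safely.
		return AuditThenAllow
	default: // "deny" and anything unrecognised
		return Deny
	}
}
```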
The ExternalWorkload resource we introduced has a minor naming
inconsistency; `Tls` in `meshTls` is not capitalised. Other resources
that we have (e.g. authentication resources) capitalise TLS (and so does
Go, which follows a similar naming convention).
We fix this in the workload resource by changing the field's name and
bumping the version to `v1beta1`.
Upgrading the control plane version will continue to work without
downtime. However, if a resource created under the previous version already
exists, the policy controller will not completely initialise. It will not
enter a crash loop backoff, but it will also not become ready until the
resource is edited or deleted.
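Roughly, the change amounts to the serialized field name moving from `meshTls` to `meshTLS`; the struct shapes below are an assumed sketch, not the generated types.

```go
package externalworkload

// MeshTLS is a stand-in for the workload's mesh identity settings (assumed
// fields, for illustration only).
type MeshTLS struct {
	Identity   string `json:"identity"`
	ServerName string `json:"serverName"`
}

// Before (v1alpha1): the serialized key used a lower-cased "ls".
type SpecV1Alpha1 struct {
	MeshTLS MeshTLS `json:"meshTls"`
}

// After (v1beta1): the serialized key capitalises the TLS initialism, in
// line with our other resources and Go naming conventions.
type SpecV1Beta1 struct {
	MeshTLS MeshTLS `json:"meshTLS"`
}
```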
Signed-off-by: Matei David <matei@buoyant.io>
We introduced an ExternalWorkload CRD for mesh expansion. This change
follows up by adding bindings for Rust and Go code.
For Go code:
* We add a new schema and ExternalWorkload types
* We also update the code-gen script to generate informers
* We add a new informer type to our abstractions built on top of
client-go, including a function to check if a client has access to the
resource (a sketch of such a check follows below).
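As an illustration of that access check, here is a sketch using client-go's `SelfSubjectAccessReview`; the group and resource strings are assumptions and should be read as illustrative.

```go
package externalworkload

import (
	"context"

	authorizationv1 "k8s.io/api/authorization/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// canWatchExternalWorkloads asks the kube-apiserver whether the current
// client identity is allowed to watch the resource.
func canWatchExternalWorkloads(ctx context.Context, cs kubernetes.Interface) (bool, error) {
	ssar := &authorizationv1.SelfSubjectAccessReview{
		Spec: authorizationv1.SelfSubjectAccessReviewSpec{
			ResourceAttributes: &authorizationv1.ResourceAttributes{
				Group:    "workload.linkerd.io", // assumed CRD group
				Resource: "externalworkloads",
				Verb:     "watch",
			},
		},
	}
	res, err := cs.AuthorizationV1().SelfSubjectAccessReviews().Create(ctx, ssar, metav1.CreateOptions{})
	if err != nil {
		return false, err
	}
	return res.Status.Allowed, nil
}
```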
For Rust code:
* We add ExternalWorkload bindings to the policy controller.
---------
Signed-off-by: Matei David <matei@buoyant.io>
Unregister prom gauges when recycling cluster watcher
Fixes #11839
When in `restartClusterWatcher` we fail to connect to the target cluster
for whatever reason, the function gets called again 10s later, and tries
to register the same prometheus metrics without unregistering them
first, which generates warnings.
The problem lies in `NewRemoteClusterServiceWatcher`, which instantiates
the remote kube-api client and registers the metrics, returning a nil
object if the client can't connect. `cleanupWorkers` at the beginning of
`restartClusterWatcher` won't unregister those metrics because of that
nil object.
To fix this, gauges are unregistered on error.
* Add ability to configure client-go's `QPS` and `Burst` settings
## Problem and Symptoms
When a very large number of proxies request identity in a short period of time (e.g. during large node scaling events), the identity controller will attempt to validate the tokens sent by the proxies at a rate surpassing client-go's default request rate threshold, triggering client-side throttling. This delays the proxies' initialization and can even make their startup fail (after a 2m timeout). The identity controller will surface this through log entries like this:
```
time="2023-11-08T19:50:45Z" level=error msg="error validating token for web.emojivoto.serviceaccount.identity.linkerd.cluster.local: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
```
## Solution
Client-go's default `QPS` is 5 and `Burst` is 10. This PR exposes those settings as entries in `values.yaml`, with defaults of 100 and 200 respectively. Note this only applies to the identity controller, as it's the only controller performing direct requests to the `kube-apiserver` in a hot path. The other controllers mostly rely on informers, and their direct calls are sporadic.
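For reference, a minimal sketch of how these settings are applied on a client-go `rest.Config` before building the clientset; the function and parameter names are illustrative, not the controller's actual wiring.

```go
package identity

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newRateLimitedClient applies the QPS/Burst values surfaced through
// values.yaml to the client-go config before building the clientset.
func newRateLimitedClient(cfg *rest.Config, qps float32, burst int) (kubernetes.Interface, error) {
	// client-go defaults are QPS=5, Burst=10; the identity controller now
	// overrides them (100/200 by default) to avoid client-side throttling.
	cfg.QPS = qps
	cfg.Burst = burst
	return kubernetes.NewForConfig(cfg)
}
```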
## Observability
The `QPS` and `Burst` settings used are exposed both in a log entry emitted as soon as the controller starts and in the new metric gauges `http_client_qps` and `http_client_burst`.
## Testing
You can use the following K6 script, which simulates 6k calls to the `Certify` service during one minute from emojivoto's web pod. Before running this you need to:
- Put the identity.proto and [all the other proto files](https://github.com/linkerd/linkerd2-proxy-api/tree/v0.11.0/proto) in the same directory.
- Edit the [checkRequest](https://github.com/linkerd/linkerd2/blob/edge-23.11.3/pkg/identity/service.go#L266) function and add logging statements to figure out the `token` and `csr` entries you can use here; they will be shown as soon as a web pod starts.
```javascript
import { Client, Stream } from 'k6/experimental/grpc';
import { sleep } from 'k6';
const client = new Client();
client.load(['.'], 'identity.proto');
// This always holds:
// req_num = (1 / req_duration ) * duration * VUs
// Given req_duration (0.5s) test duration (1m) and the target req_num (6k), we
// can solve for the required VUs:
// VUs = req_num * req_duration / duration
// VUs = 6000 * 0.5 / 60 = 50
export const options = {
  scenarios: {
    identity: {
      executor: 'constant-vus',
      vus: 50,
      duration: '1m',
    },
  },
};
export default () => {
client.connect('localhost:8080', {
plaintext: true,
});
const stream = new Stream(client, 'io.linkerd.proxy.identity.Identity/Certify');
// Replace with your own token
let token = "ZXlKaGJHY2lPaUpTVXpJMU5pSXNJbXRwWkNJNkluQjBaV1pUZWtaNWQyVm5OMmxmTTBkV2VUTlhWSFpqTmxwSmJYRmtNMWRSVEhwNVNHWllhUzFaZDNNaWZRLmV5SmhkV1FpT2xzaWFXUmxiblJwZEhrdWJEVmtMbWx2SWwwc0ltVjRjQ0k2TVRjd01EWTRPVFk1TUN3aWFXRjBJam94TnpBd05qQXpNamt3TENKcGMzTWlPaUpvZEhSd2N6b3ZMMnQxWW1WeWJtVjBaWE11WkdWbVlYVnNkQzV6ZG1NdVkyeDFjM1JsY2k1c2IyTmhiQ0lzSW10MVltVnlibVYwWlhNdWFXOGlPbnNpYm1GdFpYTndZV05sSWpvaVpXMXZhbWwyYjNSdklpd2ljRzlrSWpwN0ltNWhiV1VpT2lKM1pXSXRPRFUxT1dJNU4yWTNZeTEwYldJNU5TSXNJblZwWkNJNklqaGlZbUV5WWpsbExXTXdOVGN0TkRnMk1TMWhNalZsTFRjelpEY3dOV1EzWmpoaU1TSjlMQ0p6WlhKMmFXTmxZV05qYjNWdWRDSTZleUp1WVcxbElqb2lkMlZpSWl3aWRXbGtJam9pWm1JelpUQXlNRE10TmpZMU55MDBOMk0xTFRoa09EUXRORGt6WXpBM1lXUTJaak0zSW4xOUxDSnVZbVlpT2pFM01EQTJNRE15T1RBc0luTjFZaUk2SW5ONWMzUmxiVHB6WlhKMmFXTmxZV05qYjNWdWREcGxiVzlxYVhadmRHODZkMlZpSW4wLnlwMzAzZVZkeHhpamxBOG1wVjFObGZKUDB3SC03RmpUQl9PcWJ3NTNPeGU1cnNTcDNNNk96VWR6OFdhYS1hcjNkVVhQR2x2QXRDRVU2RjJUN1lKUFoxVmxxOUFZZTNvV2YwOXUzOWRodUU1ZDhEX21JUl9rWDUxY193am9UcVlORHA5ZzZ4ZFJNcW9reGg3NE9GNXFjaEFhRGtENUJNZVZ6a25kUWZtVVZwME5BdTdDMTZ3UFZWSlFmNlVXRGtnYkI1SW9UQXpxSmcyWlpyNXBBY3F5enJ0WE1rRkhSWmdvYUxVam5sN1FwX0ljWm8yYzJWWk03T2QzRjIwcFZaVzJvejlOdGt3THZoSEhSMkc5WlNJQ3RHRjdhTkYwNVR5ZC1UeU1BVnZMYnM0ZFl1clRYaHNORjhQMVk4RmFuNjE4d0x6ZUVMOUkzS1BJLUctUXRUNHhWdw==";
// Replace with your own CSR
let csr = "MIIBWjCCAQECAQAwRjFEMEIGA1UEAxM7d2ViLmVtb2ppdm90by5zZXJ2aWNlYWNjb3VudC5pZGVudGl0eS5saW5rZXJkLmNsdXN0ZXIubG9jYWwwWTATBgcqhkjOPQIBBggqhkjOPQMBBwNCAATKjgVXu6F+WCda3Bbq2ue6m3z6OTMfQ4Vnmekmvirip/XGyi2HbzRzjARnIzGlG8wo4EfeYBtd2MBCb50kP8F8oFkwVwYJKoZIhvcNAQkOMUowSDBGBgNVHREEPzA9gjt3ZWIuZW1vaml2b3RvLnNlcnZpY2VhY2NvdW50LmlkZW50aXR5LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAKBggqhkjOPQQDAgNHADBEAiAM7aXY8MRs/EOhtPo4+PRHuiNOV+nsmNDv5lvtJt8T+QIgFP5JAq0iq7M6ShRNkRG99ZquJ3L3TtLWMNVTPvqvvUE=";
const data = {
identity: "web.emojivoto.serviceaccount.identity.linkerd.cluster.local",
token: token,
certificate_signing_request: csr,
};
stream.write(data);
// This request takes around 2ms, so this sleep will mostly determine its final duration
sleep(0.5);
};
```
This results in the following report:
```
scenarios: (100.00%) 1 scenario, 50 max VUs, 1m30s max duration (incl. graceful stop):
* identity: 50 looping VUs for 1m0s (gracefulStop: 30s)
data_received................: 6.3 MB 104 kB/s
data_sent....................: 9.4 MB 156 kB/s
grpc_req_duration............: avg=2.14ms min=873.93µs med=1.9ms max=12.89ms p(90)=3.13ms p(95)=3.86ms
grpc_streams.................: 6000 99.355331/s
grpc_streams_msgs_received...: 6000 99.355331/s
grpc_streams_msgs_sent.......: 6000 99.355331/s
iteration_duration...........: avg=503.16ms min=500.8ms med=502.64ms max=532.36ms p(90)=504.05ms p(95)=505.72ms
iterations...................: 6000 99.355331/s
vus..........................: 50 min=50 max=50
vus_max......................: 50 min=50 max=50
running (1m00.4s), 00/50 VUs, 6000 complete and 0 interrupted iterations
```
With the old defaults (QPS=5 and Burst=10), the latencies would be much higher and the number of completed requests much lower.
Whenever the service mirror's main loop was triggered again, the following warnings were generated:
```
time="2023-08-14T20:16:29Z" level=warning msg="failed to register Prometheus gauge Desc{fqName: \"service_cache_size\", help: \"Number of items in the client-go service cache\", constLabels: {cluster=\"remote\"}, variableLabels: []}: duplicate metrics collector registration attempted"
time="2023-08-14T20:16:29Z" level=warning msg="failed to register Prometheus gauge Desc{fqName: \"endpoints_cache_size\", help: \"Number of items in the client-go endpoints cache\", constLabels: {cluster=\"remote\"}, variableLabels: []}: duplicate metrics collector registration attempted"
```
To fix this, the cluster watcher's `Stop()` method now unregisters the Prometheus cache metrics associated with the cluster's client API.
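A sketch of the idea, with assumed gauge names; `prometheus.Unregister` on the default registerer is safe to call even if a collector was never registered.

```go
package servicemirror

import "github.com/prometheus/client_golang/prometheus"

// clusterMetrics holds the per-cluster gauges the watcher registers; Stop()
// must unregister them so a recycled watcher for the same cluster can
// register fresh collectors without duplicate-registration warnings.
type clusterMetrics struct {
	serviceCacheSize   prometheus.Gauge
	endpointsCacheSize prometheus.Gauge
}

func (m *clusterMetrics) unregister() {
	// Unregister returns false if the collector was never registered,
	// which is harmless here.
	prometheus.Unregister(m.serviceCacheSize)
	prometheus.Unregister(m.endpointsCacheSize)
}
```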
Adds support for remote discovery to the destination controller.
When the destination controller gets a `Get` request for a Service with the `multicluster.linkerd.io/remote-discovery` label, this is an indication that the destination controller should discover the endpoints for this service from a remote cluster. The destination controller will look for a remote cluster which has been linked to it (using the `linkerd multicluster link` command) with that name. It will look at the `multicluster.linkerd.io/remote-discovery` label for the service name to look up in that cluster. It then streams back the endpoint data for that remote service.
Since we now have multiple client-go informers for the same resource types (one for the local cluster and one for each linked remote cluster) we add a `cluster` label onto the prometheus metrics for the informers and EndpointWatchers to ensure that each of these components' metrics are correctly tracked and don't overwrite each other.
---------
Signed-off-by: Alex Leong <alex@buoyant.io>
* Use metadata API in the proxy and tap injectors
Part of #9485
This adds a new `MetadataAPI` similar to the current `k8s.API` hosting informers, but backed by k8s' `metadatainformer` shared informers, which retrieve only the objects' metadata, resulting in less memory consumption by their clients. Currently this is only implemented for the proxy and tap injectors. Usage by the destination controller will be implemented as a follow-up.
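A minimal sketch of the metadata-informer pattern with client-go; the wiring is illustrative, not the actual `MetadataAPI` code.

```go
package metadataapi

import (
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/metadata/metadatainformer"
	"k8s.io/client-go/rest"
)

// newMetadataInformer builds a metadata-only informer factory: the cache
// stores *metav1.PartialObjectMetadata instead of full objects, which is
// what keeps memory usage down.
func newMetadataInformer(cfg *rest.Config) (metadatainformer.SharedInformerFactory, error) {
	client, err := metadata.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	factory := metadatainformer.NewSharedInformerFactory(client, 10*time.Minute)
	// Informers are created per GroupVersionResource, e.g. ReplicaSets for
	// the owner lookups the injectors need.
	rsGVR := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "replicasets"}
	_ = factory.ForResource(rsGVR)
	return factory, nil
}
```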
## Existing API enhancements
Shared objects and logic required by API and MetadataAPI have been moved to the new `k8s.go`, `api_resource.go` and `prometheus.go` files. That includes the `isValidRSParent()` function whose arg is now more generic.
## Unit tests
`/controller/k8s/api_test.go` now also instantiates a MetadataAPI, used in the augmented `TestGetObjects()` and `TestGetOwnerKindAndName()` tests. The `resources` struct was introduced to capture the common fields among tests and simplify `newMockAPI()`'s signature.
## Other Changes
The injector no longer watches Pods. It only requires watching the workloads that own pods (and also namespaces), so the Pod informer is not needed.
## Testing Memory Consumption
Install linkerd, inject emojivoto and check the injector memory consumption with `kubectl -n linkerd top pod linkerd-proxy-injector-xxx`. It'll start consuming about 16Mi. Then ramp up emojivoto's `voting` deployment replicas to 2000. After 5 minutes memory will stabilize around 32Mi using the current branch. Using the latest edge, it'll stabilize around 110Mi.
The controller's k8s client was using `admissionregistration/v1beta1`
for its MWC shared informer. `v1beta1` was removed in k8s 1.22, and `v1`
was introduced in k8s 1.16:
https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-22
Modify the controller's k8s client to use `admissionregistration/v1` for
its MWC shared informer.
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
* Allow initializing a k8s namespace-scoped API
This allows reusing the `k8s.API` informers by other projects that don't
necessarily have cluster-wide permissions.
Fixes #8592
Increase the minimum supported kubernetes version from 1.20 to 1.21. This allows us to drop support for batch/v1beta1/CronJob and discovery/v1beta1/EndpointSlices, instead using only v1 of those resources. This fixes the deprecation warnings about these resources printed by the CLI.
Signed-off-by: Alex Leong <alex@buoyant.io>
Fixes #8624
When the proxy-injector encounters a resource with an owner ref, it calls `api.GetObjects` to fetch the owner. If the owner is a kind which is not supported by the proxy-injector, we will panic.
We add a condition so that we only attempt to fetch the owner resource if it is a kind we support.
Signed-off-by: Alex Leong <alex@buoyant.io>
Closes #7826
This adds the `gosec` and `errcheck` lints to the `golangci` configuration. Most significant lints have been fixed by individual changes, but this enables them by default so that all future changes are caught ahead of time.
A significant number of these lints have been excluded by the various `exclude-rules` added to `.golangci.yml`. These include operations on files that generally do not fail, such as `Copy`, `Flush`, or `Write`. We also choose to ignore most errors when cleaning up functions via the `defer` keyword.
Aside from those, there are several other rules added that all have comments explaining why it's okay to ignore the errors that they cover.
Finally, several smaller fixes in the code have been made where it seems necessary to catch errors or at least log them.
Signed-off-by: Kevin Leimkuhler <kleimkuhler@icloud.com>
[gocritic][gc] helps to enforce some consistency and check for potential
errors. This change applies linting changes and enables gocritic via
golangci-lint.
[gc]: https://github.com/go-critic/go-critic
Signed-off-by: Oliver Gould <ver@buoyant.io>
Since Go 1.13, errors may "wrap" other errors. [`errorlint`][el] checks
that error formatting and inspection is wrapping-aware.
This change enables `errorlint` in golangci-lint and updates all error
handling code to pass the lint. Some comparisons in tests have been left
unchanged (using `//nolint:errorlint` comments).
[el]: https://github.com/polyfloyd/go-errorlint
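A small before/after illustration of the kind of code `errorlint` pushes us towards; the file paths and helper names are made up for the example.

```go
package example

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// openConfig wraps the underlying error with %w so callers can still inspect
// the chain.
func openConfig(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return fmt.Errorf("opening config %s: %w", path, err)
	}
	return f.Close()
}

// isMissing shows the wrapping-aware comparison errorlint enforces:
// `err == fs.ErrNotExist` would be flagged because it misses wrapped errors.
func isMissing(err error) bool {
	return errors.Is(err, fs.ErrNotExist)
}
```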
Signed-off-by: Oliver Gould <ver@buoyant.io>
Now that SMI functionality is fully being moved into the
[linkerd-smi](www.github.com/linkerd/linkerd-smi) extension, we can
stop supporting it by default.
This means that the `destination` component will stop reacting
to `TrafficSplit` objects. When `linkerd-smi` is installed,
it converts `TrafficSplit` objects into `ServiceProfiles`
that the destination component can understand and react to accordingly.
Also, whenever a `ServiceProfile` with traffic splitting is associated
with a service, the same information (i.e. splits and weights) is also
surfaced through the UI (in the new `services` tab) and the `viz` CLI.
So we are not really losing any UI functionality here.
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
This change ensures that if a Server exists with `proxyProtocol: opaque` that selects an endpoint backed by a pod, that destination requests for that pod reflect the fact that it handles opaque traffic.
Currently, the only way that opaque traffic is honored in the destination service is if the pod has the `config.linkerd.io/opaque-ports` annotation. With the introduction of Servers though, users can set `server.Spec.ProxyProtocol: opaque` to indicate that if a Server selects a pod, then traffic to that pod's `server.Spec.Port` should be opaque. Currently, the destination service does not take this into account.
There is an existing change up that _also_ adds this functionality; it takes a different approach by creating a policy server client for each endpoint that a destination has. For `Get` requests on a service, the number of clients scales with the number of endpoints that back that service.
This change fixes that issue by instead creating a Server watch in the endpoint watcher and sending updates through to the endpoint translator.
The two primary scenarios to consider are:
### A `Get` request for some service is streaming when a Server is created/updated/deleted
When a Server is created or updated, the endpoint watcher iterates through its endpoint watches (`servicePublisher` -> `portPublisher`) and if it selects any of those endpoints, the port publisher sends an update if the Server has marked that port as opaque.
When a Server is deleted, the endpoint watcher once again iterates through its endpoint watches and deletes the address set's `OpaquePodPorts` field—ensuring that updates have been cleared of Server overrides.
### A `Get` request for some service happens after a Server is created
When a `Get` request occurs (or new endpoints are added—they both take the same path), we must check if any of those endpoints are selected by some existing Server. If so, we have to take that into account when creating the address set.
This part of the change gives me a little concern as we first must get all the Servers on the cluster and then create a set of _all_ the pod-backed endpoints that they select in order to determine if any of these _new_ endpoints are selected.
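A sketch of the selection check itself, using the standard label-selector matching from apimachinery; the helper name is hypothetical, and the real code also matches the Server's port against the endpoint's target port.

```go
package watcher

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// serverSelectsPod converts a Server's podSelector into a labels.Selector
// and matches it against the pod's labels.
func serverSelectsPod(podSelector *metav1.LabelSelector, pod *corev1.Pod) (bool, error) {
	sel, err := metav1.LabelSelectorAsSelector(podSelector)
	if err != nil {
		return false, err
	}
	return sel.Matches(labels.Set(pod.Labels)), nil
}
```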
## Testing
Right now this can be tested by starting up the destination service locally and running `Get` requests on a service that has endpoints selected by a Server:
**app.yaml**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod
  labels:
    app: pod
spec:
  containers:
    - name: app
      image: nginx
      ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: svc
spec:
  selector:
    app: pod
  ports:
    - name: http
      port: 80
---
apiVersion: policy.linkerd.io/v1alpha1
kind: Server
metadata:
  name: srv
  labels:
    policy: srv
spec:
  podSelector:
    matchLabels:
      app: pod
  port: 80
  proxyProtocol: HTTP/1
```
```bash
$ go run controller/script/destination-client/main.go -path svc.default.svc.cluster.local:80
```
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>
Fixes #6733
As policy resources provide a grouping, statistics summaries should
also be allowed on these groupings, which is useful to the user. Their
being port-specific provides a great way to break these metrics down
further.
This PR adds support for the policy resources, i.e. `server` and `serverauthorization`,
in the `stat` command.
## Changes
This adds a new path in the `stat_summary.go` file to handle policy
objects. I tried to see if we could re-use some of the other paths,
but some of the labels seem to differ, hence a different path
had to be created. We can try to refactor and merge them later, though.
We support both request and TCP metrics for the `server` resource,
but only the former for `serverauthorization` resources,
as that is how the metrics are generated.
This also adds these policy objects into the `k8s` package to
make them known resources.
For both policy resources, `--from` doesn't work, as these
metrics are not exposed on the outbound side and there is no way to
query for the client workload from the inbound metrics. `--to`
is supported to get metrics specifically for a destination workload
(just like for a service).
## Testing
```bash
> curl -sL https://run.linkerd.io/emojivoto.yml | linkerd inject --proxy-log-level debug - | kubectl apply -f -
> kubectl apply -f 897de1a8d5/emojivoto-policy.yml
# Initial values
on ⛵ kind-kind linkerd2 on 🌱 taru [📦📝🤷] via 🐼 v1.16.7 via via ❄️ impure (shell)
➜ ./bin/go-run cli viz stat srv -A -owide ~/work/linkerd2
NAMESPACE NAME UNAUTHORIZED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN READ_BYTES/SEC WRITE_BYTES/SEC
emojivoto emoji-grpc 0.0rps 100.00% 1.8rps 1ms 1ms 3ms 1 188.6B/s 2072.9B/s
emojivoto prom 0.0rps - - - - - - - -
emojivoto voting-grpc 0.0rps 80.70% 0.9rps 1ms 2ms 3ms 1 91.4B/s 52.7B/s
emojivoto web-http 0.0rps 90.68% 2.0rps 2ms 10ms 28ms 1 153.7B/s 4509.4B/s
# After changing the `emoji-grpc` authz
on ⛵ kind-kind linkerd2 on 🌱 taru [📦📝🤷] via 🐼 v1.16.7 via via ❄️ impure (shell) took 2s
➜ ./bin/go-run cli viz stat srv -A -owide ~/work/linkerd2
NAMESPACE NAME UNAUTHORIZED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN READ_BYTES/SEC WRITE_BYTES/SEC
emojivoto emoji-grpc 0.3rps 100.00% 1.1rps 0ms 0ms 0ms 1 156.5B/s 1282.4B/s
emojivoto prom 0.0rps - - - - - - - -
emojivoto voting-grpc 0.0rps 87.88% 0.6rps 0ms 0ms 0ms 1 53.5B/s 31.5B/s
emojivoto web-http 0.0rps 61.18% 1.4rps 1ms 2ms 2ms 1 110.2B/s 2195.7B/s
# after changing the `web-http` authz
on ⛵ kind-kind linkerd2 on 🌱 taru [📦📝🤷] via 🐼 v1.16.7 via via ❄️ impure (shell)
➜ ./bin/go-run cli viz stat srv -A -owide ~/work/linkerd2
NAMESPACE NAME UNAUTHORIZED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN READ_BYTES/SEC WRITE_BYTES/SEC
emojivoto emoji-grpc 0.0rps - - - - - - - -
emojivoto prom 0.0rps - - - - - - - -
emojivoto voting-grpc 0.0rps - - - - - - - -
emojivoto web-http 1.0rps - - - - - - - -
> linkerd viz stat srv/emoji-grpc -n emojivoto -owide
NAME SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN READ_BYTES/SEC WRITE_BYTES/SEC
emoji-grpc 100.00% 2.0rps 1ms 1ms 1ms 1 199.9B/s 2208.0B/s
> linkerd viz stat srv/web-http -n emojivoto -owide
NAME SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN READ_BYTES/SEC WRITE_BYTES/SEC
web-http 94.02% 1.9rps 4ms 9ms 10ms 1 152.7B/s 4505.9B/s
> linkerd viz stat srv -n emojivoto -o wide
NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN READ_BYTES/SEC WRITE_BYTES/SEC
emoji-grpc - 100.00% 2.0rps 1ms 1ms 1ms 1 201.6B/s 2209.8B/s
prom - - - - - - - - -
voting-grpc - 86.21% 1.0rps 1ms 1ms 1ms 1 98.3B/s 55.9B/s
web-http - 91.67% 2.0rps 3ms 8ms 10ms 1 157.7B/s 4600.3B/s
> linkerd viz stat serverauthorization/web-public -n emojivoto
NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
web-http - 89.83% 2.0rps 3ms 9ms 10ms
> linkerd viz stat saz -n emojivoto
NAME AUTHORIZATION MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
emoji-grpc emoji-grpc - 100.00% 2.0rps 1ms 1ms 1ms
prom prom-prometheus - - - - - -
voting-grpc voting-grpc - 89.83% 1.0rps 1ms 1ms 1ms
web-http web-public - 94.96% 2.0rps 1ms 5ms 9ms
> linkerd viz stat saz/web-public -n emojivoto
NAME AUTHORIZATION MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
web-http web-public - 90.00% 2.0rps 1ms 5ms 9ms
```
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
The problem is that for parent objects that are not supported by Linkerd, we
cannot get any metrics. For example, using a Rollout will not report any
metrics above the pod level.
To fix this, add validation for ReplicaSet owners: if the owner is a valid
parent, use the parent's Kind and Name, otherwise use the ReplicaSet.
Tested using CLI/UI
Interim solution for #6429
Signed-off-by: Matei David <matei@buoyant.io>
Fixes #6354
We add Prometheus gauges which track the client-go cache size for each resource type (a sketch of how such a gauge can be wired to an informer follows the sample output). For example, the following metrics are added to the destination controller:
```
# HELP endpoint_cache_size Number of items in the client-go endpoint cache
# TYPE endpoint_cache_size gauge
endpoint_cache_size 21
# HELP job_cache_size Number of items in the client-go job cache
# TYPE job_cache_size gauge
job_cache_size 0
# HELP namespace_cache_size Number of items in the client-go namespace cache
# TYPE namespace_cache_size gauge
namespace_cache_size 8
# HELP node_cache_size Number of items in the client-go node cache
# TYPE node_cache_size gauge
node_cache_size 1
# HELP pod_cache_size Number of items in the client-go pod cache
# TYPE pod_cache_size gauge
pod_cache_size 23
# HELP replica_set_cache_size Number of items in the client-go replica_set cache
# TYPE replica_set_cache_size gauge
replica_set_cache_size 40
# HELP service_cache_size Number of items in the client-go service cache
# TYPE service_cache_size gauge
service_cache_size 18
# HELP service_profile_cache_size Number of items in the client-go service_profile cache
# TYPE service_profile_cache_size gauge
service_profile_cache_size 4
# HELP traffic_split_cache_size Number of items in the client-go traffic_split cache
# TYPE traffic_split_cache_size gauge
traffic_split_cache_size 0
```
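A sketch of how such a gauge can be backed directly by an informer's store so it reports the current cache size on every scrape; the names and wiring are illustrative.

```go
package k8smetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"k8s.io/client-go/tools/cache"
)

// newCacheSizeGauge registers a GaugeFunc that reads the informer's store
// size lazily, so no separate update loop is needed.
func newCacheSizeGauge(resource string, informer cache.SharedIndexInformer) prometheus.GaugeFunc {
	return promauto.NewGaugeFunc(prometheus.GaugeOpts{
		Name: resource + "_cache_size",
		Help: "Number of items in the client-go " + resource + " cache",
	}, func() float64 {
		return float64(len(informer.GetStore().ListKeys()))
	})
}
```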
Signed-off-by: Alex Leong <alex@buoyant.io>
Fixes #6272
The opaque-ports check is prone to failing with `context deadline exceeded`,
as there are numerous k8s API requests being performed.
This PR updates the pre-fetching logic to instead use
`controller/k8s`, which provides a wrapper around `pkg/k8s` with
caching by using shared informers underneath.
This commit includes the following changes:
- Update `checkMisconfiguredOpaquePortAnnotations` to use
`controllerk8s.KubeAPI` instead of `hc.kubeAPI`
- The `kubeAPI.Sync` fn also had to be updated, as it failed to check
whether the sp and ts shared informers are nil, which can be the
case here where they are not needed.
We had to use `controllerk8s.NewAPI` for the initialization
instead of `controllerk8s.InitializeAPI`, so it can take in
`hc.kubeAPI` and thus support unit testing, as `hc.kubeAPI`
is how we pass the fake resources in unit tests.
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
* spSharedInformers and tsSharedInformers may be nil; calling the method will cause a panic
Signed-off-by: Cookie Wang <wangchl01@inspur.com>
Fixes #4191 #4993
This bumps Kubernetes client-go to the latest v0.19.2 (We had to switch directly to 1.19 because of this issue). Bumping to v0.19.2 required upgrading to smi-sdk-go v0.4.1. This also depends on linkerd/stern#5
This consists of the following changes:
- Fix ./bin/update-codegen.sh by adding the template path to the gen commands, as it is needed after we moved to GOMOD.
- Bump all k8s related dependencies to v0.19.2
- Generate CRD types, client code using the latest k8s.io/code-generator
- Use context.Context as the first argument, in all code paths that touch the k8s client-go interface
Signed-off-by: Tarun Pothulapati <tarunpothulapati@outlook.com>
Based on the [EndpointSlice PR](https://github.com/linkerd/linkerd2/pull/4663), this is just the k8s/api support for endpointslices to shorten the first PR.
* Adds CRD
* Adds functions that check whether the cluster has EndpointSlice access
* Adds discovery & endpointslice informers to api.
Signed-off-by: Matei David <matei.david.35@gmail.com>
When viewing the output of `linkerd stat` for services which do not have a selector (such as services created by the service-mirror, for example) the meshed count column shows the total number which exist, even though the service actually selects no pods at all.
We update the StatSummary implementation to account for services which have no selector.
Additionally, we update the logic of the `--unmeshed` flag. When the `--unmeshed` flag is not set, we typically skip rows for unmeshed resources because those resources would have no stats. This is not appropriate to do when the `--from` flag is also set because in this case, metrics are not collected on the target resource but are instead collected on the client-side. This means that stats can be present, even for unmeshed resources and these resources should still be displayed, even if the `--unmeshed` flag is not set.
Signed-off-by: Alex Leong <alex@buoyant.io>
Here we upgrade our dependencies on client-go to 0.17.4 and smi-sdk-go to 0.3.0. Since smi-sdk-go uses client-go 0.17.4, these upgrades must be performed simultaneously.
This also requires simultaneously upgrading our dependency on linkerd/stern to a SHA which also uses client-go 0.17.4. This keeps all of our transitive dependencies synchronized on one version of client-go.
This ALSO requires updating our codegen scripts to use the 0.17.4 version of code-generator and running it to generate 0.17.4-compatible generated code. I took this opportunity to update our code generation script to properly use the version of code-generator from `go.mod` rather than a hardcoded SHA.
Signed-off-by: Alex Leong <alex@buoyant.io>
This PR introduces a service mirroring component that is responsible for watching remote clusters and mirroring their services locally.
Signed-off-by: Zahari Dichev <zaharidichev@gmail.com>
This PR adds support for CronJobs and ReplicaSets to `linkerd inject`, the web
dashboard and CLI. It adds a new Grafana dashboard for each kind of resource.
Closes #3614 Closes #3630 Closes #3584 Closes #3585
Signed-off-by: Sergio Castaño Arteaga <tegioz@icloud.com>
Signed-off-by: Cintia Sanchez Garcia <cynthiasg@icloud.com>
* If tap source IP matches many running pods then only show the IP
When an unmeshed source IP matched more than one running pod, tap was
showing the names of all those pods, even though they didn't necessarily
originate the connection. This could be reproduced when using a pod
network add-on such as Calico.
With this change, if a node matches, we return it; otherwise we proceed to look for a matching pod. If exactly one running pod matches, we return it. Otherwise we return just the IP.
Fixes #3103
* Fix injector timeout under high load
Fixes #3358
When retrieving a pod owner, we were hitting the k8s API directly because
at injection time the informer might not have been informed about the
existence of the parent object.
Under a large number of injection requests this resulted in the k8s API requests
being throttled, the proxy-injector getting blocked and the webhook requests
timing out.
Now we'll hit the shared informer first, and hit the k8s API only when
the informer doesn't return anything. After a few injection requests for
the same owner, the informer should have been updated.
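A sketch of the lookup order, using ReplicaSets as the example owner type; the helper is hypothetical and only illustrates the informer-first, API-fallback pattern.

```go
package injector

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	appslisters "k8s.io/client-go/listers/apps/v1"
)

// getReplicaSet tries the shared informer's lister first, and only falls
// back to a direct API GET when the cache has not seen the object yet
// (e.g. a just-created ReplicaSet).
func getReplicaSet(ctx context.Context, lister appslisters.ReplicaSetLister, cs kubernetes.Interface, ns, name string) (*appsv1.ReplicaSet, error) {
	if rs, err := lister.ReplicaSets(ns).Get(name); err == nil {
		return rs, nil
	}
	// Cache miss: hit the kube-apiserver directly. Later injections of pods
	// with the same owner should be served from the cache.
	return cs.AppsV1().ReplicaSets(ns).Get(ctx, name, metav1.GetOptions{})
}
```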
Testing:
Scaling an emoji deployment to 1000 replicas, and after waiting for a
couple of minutes:
Before:
```bash
# a portion of the pods doesn't get injected
$ kubectl -n emojivoto get po | grep ./1 | wc -l
109
kubectl -n kube-system logs -f kube-apiserver-minikube | grep
failing.*timeout
.... (lots of errors)
```
After:
```bash
# all the pods get injected
$ kubectl -n emojivoto get po | grep ./1 | wc -l
0
kubectl -n kube-system logs -f kube-apiserver-minikube | grep
failing.*timeout
```
Fixes #3356
1.16 removes some api groups that were already deprecated. From k8s blog
post (https://kubernetes.io/blog/2019/07/18/api-deprecations-in-1-16/):
```
- PodSecurityPolicy: will no longer be served from extensions/v1beta1 in
v1.16.
Migrate to the policy/v1beta1 API, available since v1.10. Existing
persisted data can be retrieved/updated via the policy/v1beta1 API.
- DaemonSet, Deployment, StatefulSet, and ReplicaSet: will no longer be
served from extensions/v1beta1, apps/v1beta1, or apps/v1beta2 in v1.16.
Migrate to the apps/v1 API, available since v1.9. Existing persisted
data can be retrieved/updated via the apps/v1 API.
```
Previous PRs had already made this change at the Helm templates level,
but we still needed to do it at the API calls and tests.
The integration tests ran fine for k8s 1.12 and 1.15. They fail on 1.16
because the upgrade integration test tries to install linkerd 2.5 which is not
compatible with 1.16.
Signed-off-by: Alejandro Pedraza <alejandro@buoyant.io>
* Set custom cluster domain in GetServiceProfileFor
* Set custom cluster domain in tap server
Move fetching cluster domain for tap server to cmd main
* Handle fetching cluster domain errors separately
* Use custom cluster domain for traffic split adaptor
Signed-off-by: Armin Buerkle <armin.buerkle@alfatraining.de>
* Have the proxy-injector emit events upon injection/skipping injection
Fixes #3253
Have the proxy-injector emit an event whenever an injection happens, or
when injection is skipped for some reason (that reason is also added to
the proxy-injector logs). The event is associated with the parent workload
(it can't be associated with the pod because at this point the pod hasn't
been persisted).
The event recorder was setup at the `webhook/server.go` level and passed
to the proxy-injector's `Inject` function. The sp-validator thus also
has access to the event recorder, but for now it's not using it.
Related changes:
- Refactored `api.GetOwnerKindAndName()` to have it return a more
generic object.
- Refactored `report.Injectable()` to also have it return the reason why
a workload is not injectable.
Signed-off-by: Alejandro Pedraza <alejandro@buoyant.io>
Similar to `kubectl --as`, this adds a global flag across all linkerd subcommands
which sets an `ImpersonationConfig` in the Kubernetes API config.
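A minimal sketch of what the flag does to the client configuration; the helper name is illustrative.

```go
package cli

import "k8s.io/client-go/rest"

// withImpersonation sets client-go's ImpersonationConfig on the API config
// before any clients are built, mirroring `kubectl --as`.
func withImpersonation(cfg *rest.Config, asUser string) *rest.Config {
	if asUser != "" {
		cfg.Impersonate = rest.ImpersonationConfig{UserName: asUser}
	}
	return cfg
}
```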
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The openAPIV3Schema validation in the ServiceProfiles CRD is very limited in what it can validate and is obviated by more sophisticated validation done by the validating admission controller. Therefore, we would like to remove the openAPIV3Schema validation to reduce the size and complexity of the CRD object.
To do so, we must also bump the version of the ServiceProfile custom resource from v1alpha1 to v1alpha2. This ensures that when the controller is upgraded, it will attempt to watch the v1alpha2 resource. If it cannot (because, for example, the controller pod started before the ServiceProfile CRD was updated and therefore the v1alpha2 version does not exist) then it will go into a crash loop backoff until it can. This essentially means that the controller will wait for the CRD to be upgraded to include v1alpha2 before it will start.
Bumping the version is necessary because if we did not, it would be possible for the controller to start before the CRD is updated (removing the validation). In this case, when the CRD is edited, the controller will lose its list watch on ServiceProfiles and will stop getting updates.
Signed-off-by: Alex Leong <alex@buoyant.io>
The `linkerd routes` command gets the list of routes for a resource by checking which services that resource is a member of. If a traffic split exists, it is possible for a resource to get traffic via a service that it is not a member of. Specifically, a resource which is a member of a leaf service can get traffic to the apex service. This means that even though the resource is serving routes associated with the apex service, these will not be displayed in the `linkerd routes` command.
We update `linkerd routes` to be traffic-split aware. This means that when a traffic split exists, we consider resources which are members of a leaf service with non-zero weight to be members of the apex service for the purpose of determining which routes to display.
Signed-off-by: Alex Leong <alex@buoyant.io>
When getting pods for specific Kubernetes resources, using just
labels as a selector generates wrong outputs. Since two resources can use
the same label selector and manage distinct pods, a new mechanism to check
pods for a given resource is needed. More details in #2932.
This commit introduces a verification through the pod owner references'
`UID`s, comparing them with the given resource's (sketched below). Additional
logic is needed when handling `Deployments`, since a Deployment creates a
`ReplicaSet` and the latter is the actual pod owner. No verification is done
in the case of `Services`.
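A minimal sketch of the UID check; the helper is hypothetical.

```go
package util

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// ownedBy reports whether one of the pod's ownerReferences carries the
// queried resource's UID (for Deployments the UID compared is the
// intermediate ReplicaSet's).
func ownedBy(pod *corev1.Pod, ownerUID types.UID) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.UID == ownerUID {
			return true
		}
	}
	return false
}
```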
Signed-off-by: Jonathan Juares Beber <jonathanbeber@gmail.com>
* Have `GetOwnerKindAndName` be able to skip the cache
Refactored `GetOwnerKindAndName` so it can optionally skip the
shared informer cache and instead hit the k8s API directly.
Useful for the proxy injector, when the pod's ReplicaSet was just
created and might not be in the cache yet.
Fixes #2738
Signed-off-by: Alejandro Pedraza <alejandro@buoyant.io>
Add support for querying TrafficSplit resources through the common API layer. This is done by depending on the TrafficSplit client bindings from smi-sdk-go.
Signed-off-by: Alex Leong <alex@buoyant.io>
Numerous codepaths have emerged that create k8s configs, k8s clients,
and make k8s api requests.
This branch consolidates k8s client creation and APIs. The primary
change migrates most codepaths to call `k8s.NewAPI` to instantiate a
`KubernetesAPI` struct from `pkg`. `KubernetesAPI` implements the
`kubernetes.Interface` (clientset) interface, and also persists a
`client-go` `rest.Config`.
Specific list of changes:
- removes manual GET requests from `k8s.KubernetesAPI`, in favor of
clientsets
- replaces most calls to `k8s.GetConfig`+`kubernetes.NewForConfig` with
a single `k8s.NewAPI`
- introduces a `timeout` param to `k8s.NewAPI`, currently only used by
healthchecks
- removes `NewClientSet` in `controller/k8s/clientset.go` in favor of
`k8s.NewAPI`
- removes `httpClient` and `clientset` from `HealthChecker`, use
`KubernetesAPI` instead
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
The proxy-injector retrieves owner information when injecting pods. For
pods created via deployments, this requires a Pod -> ReplicaSet ->
Deployment lookup. There is a race condition where the injection happens
before the k8s informer client has indexed the new ReplicaSet.
If a ReplicaSet informer lookup initially fails, retry one time via a
get request. Also introduce logging to record the failure/retry, and
tests to validate `GetOwnerKindAndName` works with and without informer
indexing.
Fixes #2731
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
linkerd/linkerd2#1721 introduced a `--single-namespace` install flag,
enabling the control-plane to function within a single namespace. With
the introduction of ServiceProfiles, and upcoming identity changes, this
single namespace mode of operation is becoming less viable.
This change removes the `--single-namespace` install flag, and all
underlying support. The control-plane must have cluster-wide access to
operate.
A few related changes:
- Remove `--single-namespace` from `linkerd check`, this motivates
combining some check categories, as we can always assume cluster-wide
requirements.
- Simplify the `k8s.ResourceAuthz` API, as callers no longer need to
make a decision based on cluster-wide vs. namespace-wide access.
Components either have access, or they error out.
- Modify the web dashboard to always assume ServiceProfiles are enabled.
Reverts #1721
Part of #2337
Signed-off-by: Andrew Seigner <siggy@buoyant.io>