This helps ensure a minimum level of security. The two places this affects is our controller webhook and linkerd-viz tap API.
The controller requires that kube-api supports TLSv1.3, which it does as of 1.19 (our minimum is currently 1.22). The linkerd-viz tap API is mostly used internally, and is deprecated. It may be worth revisiting if we want to keep it around at all.
Signed-off-by: Scott Fleener <scott@buoyant.io>
* build(deps): bump google.golang.org/grpc from 1.63.2 to 1.64.0
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.63.2 to 1.64.0.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.63.2...v1.64.0)
---
updated-dependencies:
- dependency-name: google.golang.org/grpc
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
I've replaced all the `grpc.Dial` calls with `grpc.NewClient`. There was one call `grpc.DialContext(ctx,...)` in `viz/tap/api/grpc_server.go` that also got replaced with `grpc.NewClient`, which loses the ability to pass `ctxt` but that doesn't matter; as we're not using `WithBlock(true)` that context wasn't being accounted for when we were using `DialContext()` anyways.
https://github.com/grpc/grpc-go/blob/v1.64.0/clientconn.go#L216-L242
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Alejandro Pedraza <alejandro@buoyant.io>
Our gRPC servers use the default gRPC server configuration, which
limits the number of concurrent streams to 100. Since the controllers
run with proxies, this provides a hard scaling limit for the number of
watches an application can have.
This change updates our gRPC server configuration to clear the default
concurrency limit, allowing the server to handle as many streams as
possible.
Signed-off-by: Matei David <matei@buoyant.io>
Co-authored-by: Oliver Gould <ver@buoyant.io>
The main change here is the refactoring of the address functions in `addr.go` that support the Destination controller and Viz's Tap controller. Some of those functions only worked for IPv4, so this change refactored them to make them IP family agnostic.
This enabled adding (and fixing) IPv6 unit tests as detailed in the following sections.
Other changes:
- The `ProxyAddressesToString()` function was no longer used, so it got removed.
- The `ProxyIPToString()` function was only used by the destination-client script, so that got stripped out.
## `addr_test.go`
We added IPv6 cases to each test, that would have failed previously.
## `endpoint_translator_test.go`
One of the test pods (pod3) was changed to have an IPv6. Without the other changes in this PR those tests would still have passed, but just because when comparing actual IPs with expected ones we weren't checking if they were both zero. So here we added checks against that.
## `server_test.go`
As above, we added checks against empty IPs. And in the mocked resources in `test_util.go` we added an IPv6 EndpointSlice.
The Tap API resource shortnames were colliding with existing Kubernetes
resources (e.g. `po`, `deploy`, etc), causing warnings from kubectl
v1.29.0+.
Remove the shortnames from the Tap APIService handlers.
To validate:
```bash
bin/k3d cluster create
# install latest edge
curl https://run.linkerd.io/install-edge | sh
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f -
linkerd check
linkerd viz install | kubectl apply -f -
linkerd check
# observe shortnames
kubectl api-resources --api-group=tap.linkerd.io
# with kubectl v1.29.0+, observe "Warning: short name..."
kubectl get po
# replace tap image
TAP_IMAGE=$(bin/docker-build-tap)
bin/k3d image load $TAP_IMAGE
kubectl -n linkerd-viz set image deploy/tap tap=$TAP_IMAGE
# verify shortnames are no longer present
kubectl api-resources --api-group=tap.linkerd.io
# with kubectl v1.29.0+, observe no warning
kubectl get po
```
Fixeslinkerd/linkerd2#11784
Signed-off-by: Andrew Seigner <siggy@buoyant.io>
Adds support for remote discovery to the destination controller.
When the destination controller gets a `Get` request for a Service with the `multicluster.linkerd.io/remote-discovery` label, this is an indication that the destination controller should discover the endpoints for this service from a remote cluster. The destination controller will look for a remote cluster which has been linked to it (using the `linkerd multicluster link` command) with that name. It will look at the `multicluster.linkerd.io/remote-discovery` label for the service name to look up in that cluster. It then streams back the endpoint data for that remote service.
Since we now have multiple client-go informers for the same resource types (one for the local cluster and one for each linked remote cluster) we add a `cluster` label onto the prometheus metrics for the informers and EndpointWatchers to ensure that each of these components' metrics are correctly tracked and don't overwrite each other.
---------
Signed-off-by: Alex Leong <alex@buoyant.io>
Currently, Linkerd Tap displays all headers by default, which can contain irrelevant or sensitive information that users may want to exclude from being displayed. Therefore, there is a need for a feature that allows users to set their own ignore headers list, so that they can choose which headers to exclude from the output of Linkerd Tap on the dashboard.
This commit adds a new helm value, `tap.ignoredHeaders`, to the Linkerd-viz chart. This value allows users to specify a comma-separated list of headers to be ignored by Linkerd Tap. The default list includes commonly irrelevant and sensitive headers such as authorization tokens and cookies.
Once merged, every comma-seperated value in `tap.ignoredHeaders` will be removed from the headers in the Tap Dashboard when expanding a request headers result.
Fixes#10389
Signed-off-by: Ryan Hristovski <ryan.hristovski@docker.com>
Fixes https://github.com/linkerd/linkerd2/issues/10036
The Linkerd control plane components written in go serve liveness and readiness probes endpoint on their admin server. However, the admin server is not started until k8s informer caches are synced, which can take a long time on large clusters. This means that liveness checks can time out causing the controller to be restarted.
We start the admin server before attempting to sync caches so that we can respond to liveness checks immediately. We fail readiness probes until the caches are synced.
Signed-off-by: Alex Leong <alex@buoyant.io>
* Removed dupe imports
My IDE (vim-gopls) has been complaining for a while, so I decided to take
care of it. Found via
[staticcheck](https://github.com/dominikh/go-tools)
* Add stylecheck to go-lint checks
04a66ba added a `ReadHeaderTimeout` to our HTTP servers (at gosec's
insistence). We chose a fairly arbitrary timeout of 10s. This
configuration causes any connection that has been idle for 10s to be
torn down by the server. Unfortunately, this timeout value matches the
default Kubernetes probe interval and the default linkerd-viz scrape
interval. This can cause probes to race the timeout so that the
connection is healthy from the proxy's point of view and a request is
sent on the connection exactly as the server drops the connection.
These request failures cause controller success rate to appear degraded.
To correct this, this change raises the timeout to 15s so that the
timeout no longer matches the default probe interval.
The proxy's HTTP client is supposed to [retry] requests that encounter
this type of error. We should follow up by doing more research into why
that is not occurring in this situation.
[retry]: https://docs.rs/hyper/0.14.20/hyper/client/struct.Builder.html#method.retry_canceled_requests
Signed-off-by: Oliver Gould <ver@buoyant.io>
Newer versions of golangci-lint flag `http.Server` instances that do not
set a `ReadHeaderTimeout` as being vulnerable to "slowloris" attacks,
wherein clients initiate requests that hold connections open
indefinitely.
This change sets a `ReadHeaderTimeout` of 10s. This timeout is fairly
conservative so that clients can eagerly create connections, but is
still constrained enough that these connections won't remain open
indefinitely.
This change also updates kubert to v0.9.1, which instruments a header
read timeout on the policy admission server.
Signed-off-by: Oliver Gould <ver@buoyant.io>
Introduce change to generate dual-stack compatible URL addresses. This change refactors
some of the code to use Go's standard net library to generate IPv4 and IPv6 compatible
addresses.
The main components affected by this change:
* multicluster probe worker
* port-forwarding package
* viz tap api
Signed-off-by: zhlsunshine <huailong.zhang@intel.com>
We frequently compare data structures--sometimes very large data
structures--that are difficult to compare visually. This change replaces
uses of `reflect.DeepEqual` with `deep.Equal`. `go-test`'s `deep.Equal`
returns a diff of values that are not equal.
Signed-off-by: Oliver Gould <ver@buoyant.io>
Follow-up to #8087 that allows pprof to be enabled via the `--set
enablePprof=true` flag.
Each control plane components spawns its own admin server, so each of these
received it's own `enable-pprof` flag. When `enablePprof=true`, it is passed
through to each component so that when it launches its admin server, its pprof
endpoints are enabled.
A note on the templating: `-enable-pprof={{.Values.enablePprof | default
false}}`. `false` values are not rendered by Helm so without the `... | default
false}}`, it tries to pass the flag as `-enable-pprof=""` which results in an
error. Inlining this felt better than conditionally passing the flag with
```yaml {{ if .Values.enablePprof -}} -enable-pprof={{.Values.enablePprof}} {{
end -}} ```
Signed-off-by: Kevin Leimkuhler <kleimkuhler@icloud.com>
When the Tap controller logs errors, it can potentially log error
messages that include data that originates from the network. This makes
these logs succeptible to log forgery ([CWE-117]).
This change updates log formatting to use `%q` so that newlines, etc are
escaped.
[CWE-117]: https://cwe.mitre.org/data/definitions/117.html
Signed-off-by: Oliver Gould <ver@buoyant.io>
Closes#7881
This makes the rest of the necessary fixes to satisfy the `staticcheck` lint.
The only class of lints that are being skipped are those related to deprecated tap code. There was some discussion on the original change started by @adleong about if this _actually_ deprecated [here](https://github.com/linkerd/linkerd2/pull/3240#discussion_r313634584); it doesn't look like we every came back around to fully removing it but I don't think it should be a blocker for enabling the lint right now.
Signed-off-by: Kevin Leimkuhler <kleimkuhler@icloud.com>
Closes#7826
This adds the `gosec` and `errcheck` lints to the `golangci` configuration. Most significant lints have been fixed my individual changes, but this enables them by default so that all future changes are caught ahead of time.
A significant amount of these lints are been exluced by the various `exclude-rules` rules added to `.golangci.yml`. These include operations are files that generally do not fail such as `Copy`, `Flush`, or `Write`. We also choose to ignore most errors when cleaning up functions via the `defer` keyword.
Aside from those, there are several other rules added that all have comments explaining why it's okay to ignore the errors that they cover.
Finally, several smaller fixes in the code have been made where it seems necessary to catch errors or at least log them.
Signed-off-by: Kevin Leimkuhler <kleimkuhler@icloud.com>
Since Go 1.13, errors may "wrap" other errors. [`errorlint`][el] checks
that error formatting and inspection is wrapping-aware.
This change enables `errorlint` in golangci-lint and updates all error
handling code to pass the lint. Some comparisons in tests have been left
unchanged (using `//nolint:errorlint` comments).
[el]: https://github.com/polyfloyd/go-errorlint
Signed-off-by: Oliver Gould <ver@buoyant.io>
In several places where we build TLS servers (usually in our webhooks),
we use the default TLS configuration, which enables legacy versions of
TLS.
This change updates these servers to specify a minimum TLS version of
v1.2.
Signed-off-by: Oliver Gould <ver@buoyant.io>
Our `SubjectAccessReview` debug logging can include newlines in
user-provided data. This change ensures these values are quoted/escaped.
Signed-off-by: Oliver Gould <ver@buoyant.io>
CodeQL has caught several instances where we may be susceptible to [log
forgery][cql].
This change ensures that we strip newlines from log messages that
include potentially user-supplied strings. Several redundant error logs
are removed--we should generally not log an error when returning an
error. Errors should be logged where they are handled.
This change also properly escapes URL paths when constructing them from
protobuf messages.
Note that CodeQL continued to mark some of these uses as issues, but
we've marked them as false-positive. See github/codeql-go#635 and
github/codeql-go#650.
[cql]: https://codeql.github.com/codeql-query-help/go/go-log-injection/
Signed-off-by: Oliver Gould <ver@buoyant.io>
Fixes#5589
The core control plane has a dependency on the viz package in order to use the `BuildResource` function. This "backwards" dependency means that the viz source code needs to be included in core docker-builds and is bad for code hygiene.
We move the `BuildResource` function into the viz package. In `cli/cmd/metrics.go` we replace a call to `BuildResource` with a call directly to `CanonicalResourceNameFromFriendlyName`.
Signed-off-by: Alex Leong <alex@buoyant.io>
CodeQL flags our debug-logging of user and group header values. While
this information isn't strictly sensitive, it isn't really necessary,
either--we've haven't used any of this logged information for practical
diagnostics in years, if then. And because these header names are read
as configuration and not hardcoded, there's no way for us to be really
_sure_ that these headers are safe to log. So, I propose that we just
stop logging them and instead log the header names instead :).
While we're here, we should use `Header.Values()` instead of direct
slice access. `Values()` ensures that headers are encoded in the
canonical MIME format ("Train-Case") to ensure case-insenstive
comparison.
[1]: https://codeql.github.com/codeql-query-help/go/go-clear-text-logging/
Fix various lint issues:
- Remove unnecessary calls to fmt.Sprint
- Fix check for empty string
- Fix unnecessary calls to Printf
- Combine multiple `append`s into a single call
Signed-off-by: shubhendra <withshubh@gmail.com>
Fixes https://github.com/linkerd/linkerd2/discussions/5777
When a user runs `linkerd viz check --proxy`, it will print a warning if there are any proxies which cannot be tapped. This is a normal state of affairs after freshly installing the linkerd-viz extensions because any existing pods will need to be restarted before they can be tapped. The check warning may lead users to falsely believe that something has gone wrong with their installation.
We remove this specific check from `linkerd viz check --proxy`. To replace it, we improve the error output when attempting to tap a resource which is not tappable. This gives the user actionable feedback when the tap command fails.
Old:
```console
> linkerd viz tap -n emojivoto deploy/vote-bot
no pods to tap for deployment/vote-bot
```
New:
```console
> linkerd viz tap -n emojivoto deploy/vote-bot
no pods to tap for deployment/vote-bot
1 pods found with tap not enabled:
* vote-bot-64dd87cb87-7mcv4
restart these pods to enable tap and make them valid tap targets
```
Signed-off-by: Alex Leong <alex@buoyant.io>
Closes#5545.
This change moves all tap and tap-injector code into the viz directory.
The tap and tap-injector components now also use a new tap image—separating
these components from the controller image that they are currently part of. This
means the controller image has removed all its build dependencies related to
tap.
Finally, the tap Protobuf has been separated from the metrics-api and moved into
it's own `.proto` file and gen directory. This introduces a clear split between
metrics-api and tap Protobuf.
There is no change in behavior for the `viz tap` command.
### Reviewing
#### Docker images
All the bin directory scripts should be updated to build and load the tap image.
All the CI workflows should be updated to build and push the tap image.
#### Controller and pkg directories
This is primarily deletions. Most of the deleted code in this directory is now
in the tap directory of the Viz extension.
#### viz/tap
This is the location that all the tap related code now lives in. New files are
mostly moved from the controller and pkg directories. Imports have all been
updated to point at the right locations and Protobuf.
The Protobuf here is taken from metrics-api and contains all tap-related
Protobuf.
Signed-off-by: Kevin Leimkuhler <kevin@kleimkuhler.com>