A purpose-built proxy for the Linkerd service mesh. Written in Rust.

# fix: remove ambiguous metrics registry keys (#3987)
### 🖼️ background

for historical reasons, the linkerd2 proxy implements, registers, and exports Prometheus metrics using a variety of systems. new metrics broadly rely upon the official [`prometheus-client`](https://github.com/prometheus/client_rust/) library, whose interfaces are reexported for internal consumption in the [`linkerd_metrics::prom`](https://github.com/linkerd/linkerd2-proxy/blob/main/linkerd/metrics/src/lib.rs#L30-L60) namespace.

other metrics predate this library, however, and rely on the legacy metrics registry implemented in the workspace's [`linkerd-metrics`](https://github.com/linkerd/linkerd2-proxy/tree/main/linkerd/metrics) crate.
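
for orientation, here is a minimal, self-contained sketch of the upstream `prometheus-client` flow that those reexports surface. (this uses the upstream crate directly, and the metric name is illustrative, not one of the proxy's actual metrics.)

```rust
use prometheus_client::{
    encoding::text::encode, metrics::counter::Counter, registry::Registry,
};

fn main() {
    // register a counter with the registry, following the upstream pattern.
    let mut registry = Registry::default();
    let requests: Counter = Counter::default();
    registry.register("example_requests", "Number of requests handled", requests.clone());
    requests.inc();

    // render the registry into the Prometheus text exposition format.
    let mut out = String::new();
    encode(&mut out, &registry).expect("encoding to a String cannot fail");
    print!("{out}");
}
```

notably, the derive-based `prometheus-client` interfaces generate a label type's `Hash` and its encoded form from the same set of fields, which makes the mismatch described below much harder to introduce.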

### 🐛 bug report

* https://github.com/linkerd/linkerd2/issues/13821

linkerd/linkerd2#13821 reported a bug in which duplicate metrics could be observed and subsequently dropped by Prometheus when upgrading the control plane via helm with an existing workload running.

### 🦋 reproduction example

for posterity, i'll note the reproduction steps here.

i used these steps to identify the `2025.3.2` edge release as the affected release. upgrading from `2025.2.3` to `2025.3.1` did not exhibit this behavior. see below for more discussion about the cause.

first, generate certificates by following <https://linkerd.io/2.18/tasks/generate-certificates/>.

using these two deployments, courtesy of @GTRekter:

<details>
<summary>**💾 click to expand: app deployment**</summary>

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: simple-app
  annotations:
    linkerd.io/inject: enabled
---
apiVersion: v1
kind: Service
metadata:
  name: simple-app-v1
  namespace: simple-app
spec:
  selector:
    app: simple-app-v1
    version: v1
  ports:
    - port: 80
      targetPort: 5678
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simple-app-v1
  namespace: simple-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-app-v1
      version: v1
  template:
    metadata:
      labels:
        app: simple-app-v1
        version: v1
    spec:
      containers:
        - name: http-app
          image: hashicorp/http-echo:latest
          args:
            - "-text=Simple App v1"
          ports:
            - containerPort: 5678
```
</details>

<details>
<summary>**🤠 click to expand: client deployment**</summary>

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: traffic
  namespace: simple-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: traffic
  template:
    metadata:
      labels:
        app: traffic
    spec:
      containers:
      - name: traffic
        image: curlimages/curl:latest
        command:
          - /bin/sh
          - -c
          - |
            while true; do
              TIMESTAMP_SEND=$(date '+%Y-%m-%d %H:%M:%S')
              PAYLOAD="{\"timestamp\":\"$TIMESTAMP_SEND\",\"test_id\":\"sniff_me\",\"message\":\"hello-world\"}"
              echo "$TIMESTAMP_SEND - Sending payload: $PAYLOAD"
              RESPONSE=$(curl -s -X POST \
                -H "Content-Type: application/json" \
                -d "$PAYLOAD" \
                http://simple-app-v1.simple-app.svc.cluster.local:80)
              TIMESTAMP_RESPONSE=$(date '+%Y-%m-%d %H:%M:%S')
              echo "$TIMESTAMP_RESPONSE - RESPONSE: $RESPONSE"
              sleep 1
            done
```
</details>

and this prometheus configuration:

<details>
<summary>**🔥 click to expand: prometheus configuration**</summary>

```yaml
global:
  scrape_interval: 10s

scrape_configs:
  - job_name: 'pod'
    scrape_interval: 10s
    static_configs:
    - targets: ['localhost:4191']
      labels:
        group: 'traffic'
```
</details>

we will perform the following steps:

```sh
# install the initial edge release

# specify the versions we'll migrate between.
export FROM="2025.3.1"
export TO="2025.3.2"

# create a cluster, and add the edge helm chart repository.
kind create cluster
helm repo add linkerd-edge https://helm.linkerd.io/edge

# install linkerd's CRDs and control plane.
helm install linkerd-crds linkerd-edge/linkerd-crds \
  -n linkerd --create-namespace --version $FROM

helm install linkerd-control-plane \
  -n linkerd \
  --set-file identityTrustAnchorsPEM=cert/ca.crt \
  --set-file identity.issuer.tls.crtPEM=cert/issuer.crt \
  --set-file identity.issuer.tls.keyPEM=cert/issuer.key \
  --version $FROM \
  linkerd-edge/linkerd-control-plane

# install a simple app and a client to drive traffic.
kubectl apply -f duplicate-metrics-simple-app.yml
kubectl apply -f duplicate-metrics-traffic.yml

# bind the traffic pod's metrics port to the host.
kubectl port-forward -n simple-app deploy/traffic 4191

# start prometheus, begin scraping metrics
prometheus --config.file=prometheus.yml
```

now, open a browser and query `irate(request_total[1m])`.

next, upgrade the control plane:

```sh
helm upgrade linkerd-crds linkerd-edge/linkerd-crds \
  -n linkerd --create-namespace --version $TO
helm upgrade linkerd-control-plane \
  -n linkerd \
  --set-file identityTrustAnchorsPEM=cert/ca.crt \
  --set-file identity.issuer.tls.crtPEM=cert/issuer.crt \
  --set-file identity.issuer.tls.keyPEM=cert/issuer.key \
  --version $TO \
  linkerd-edge/linkerd-control-plane
```

prometheus will begin emitting warnings about 34 time series being dropped.

in your browser, querying `irate(request_total[1m])` once more will show that
the rate of requests has stopped, because the new time series are being dropped.

next, restart the workloads...

```sh
kubectl rollout restart deployment -n simple-app simple-app-v1 traffic
```

prometheus warnings will go away, as reported in linkerd/linkerd2#13821.

### 🔍 related changes

* https://github.com/linkerd/linkerd2/pull/13699
* https://github.com/linkerd/linkerd2/pull/13715

in linkerd/linkerd2#13715 and linkerd/linkerd2#13699, we made some changes to the destination controller. from the "Cautions" section of the `2025.3.2` edge release notes:

> Additionally, this release changes the default for `outbound-transport-mode`
> to `transport-header`, which will result in all traffic between meshed
> proxies flowing on port 4143, rather than using the original destination
> port.

linkerd/linkerd2#13699 (_included in `edge-25.3.1`_) introduced this outbound transport-protocol configuration surface, but maintained the default behavior, while linkerd/linkerd2#13715 (_included in `edge-25.3.2`_) altered the default behavior to route meshed traffic via port 4143.

this is a visible change in behavior that can be observed when upgrading the mesh from a version that preceded the change. this means that when upgrading across `edge-25.3.2`, such as from the `2025.2.1` to `2025.3.2` versions of the helm charts, or from the `2025.2.3` to the `2025.3.4` versions (_reported upstream in linkerd/linkerd2#13821_), the freshly upgraded destination controller pods will begin routing meshed traffic differently.

i'll state explicitly: _that_ is not a bug! it is, however, an important clue to bear in mind. data plane pods that were started under the previous control plane version, and that continue running after the control plane upgrade, will have seen both routing patterns. the fact that such pods report duplicate time series for the affected metrics indicates a collision in our metrics system: two distinct registry keys are being rendered as identical label sets.

### 🐛 the bug(s)

we define a collection of structures to model the labels of inbound and outbound endpoints' metrics:

```rust
// linkerd/app/core/src/metrics.rs

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum EndpointLabels {
    Inbound(InboundEndpointLabels),
    Outbound(OutboundEndpointLabels),
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct InboundEndpointLabels {
    pub tls: tls::ConditionalServerTls,
    pub authority: Option<http::uri::Authority>,
    pub target_addr: SocketAddr,
    pub policy: RouteAuthzLabels,
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct OutboundEndpointLabels {
    pub server_id: tls::ConditionalClientTls,
    pub authority: Option<http::uri::Authority>,
    pub labels: Option<String>,
    pub zone_locality: OutboundZoneLocality,
    pub target_addr: SocketAddr,
}
```

\- <https://github.com/linkerd/linkerd2-proxy/blob/main/linkerd/app/core/src/metrics.rs>

pay particular attention to the derived `Hash` implementations. note the `tls::ConditionalClientTls` and `tls::ConditionalServerTls` types used in each of these labels. these are used by some of our types, like `TlsConnect`, to emit prometheus labels using our legacy system's `FmtLabels` trait:

```rust
// linkerd/app/core/src/transport/labels.rs

impl FmtLabels for TlsConnect<'_> {
    fn fmt_labels(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.0 {
            Conditional::None(tls::NoClientTls::Disabled) => {
                write!(f, "tls=\"disabled\"")
            }
            Conditional::None(why) => {
                write!(f, "tls=\"no_identity\",no_tls_reason=\"{}\"", why)
            }
            Conditional::Some(tls::ClientTls { server_id, .. }) => {
                write!(f, "tls=\"true\",server_id=\"{}\"", server_id)
            }
        }
    }
}
```

\- <https://github.com/linkerd/linkerd2-proxy/blob/99316f7898/linkerd/app/core/src/transport/labels.rs#L151-L165>

note the `ClientTls` case, which ignores some fields of the client TLS information:

```rust
// linkerd/tls/src/client.rs

/// A stack parameter that configures a `Client` to establish a TLS connection.
#[derive(Clone, Debug, Eq, PartialEq, Hash)]
pub struct ClientTls {
    pub server_name: ServerName,
    pub server_id: ServerId,
    pub alpn: Option<AlpnProtocols>,
}
```

\- <https://github.com/linkerd/linkerd2-proxy/blob/99316f7898/linkerd/tls/src/client.rs#L20-L26>

this means that two `ClientTls` structures with distinct server names or ALPN protocols hash to distinct registry keys, yet emit an identical set of labels. for brevity, i'll elide the equivalent issue with `ServerTls` and its corresponding `TlsAccept<'_>` label implementation, though it exhibits the same problem.
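
to make the failure mode concrete, here is a toy, self-contained model of the mismatch. (the types and values here are illustrative stand-ins, not the proxy's actual definitions.)

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// a stand-in for `ClientTls`: the derived `Hash` covers *all* fields...
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct Key {
    server_id: String,
    alpn: Option<Vec<u8>>,
}

impl Key {
    // ...but, mimicking `TlsConnect`'s `FmtLabels` impl, formatting elides `alpn`.
    fn fmt_labels(&self) -> String {
        format!("tls=\"true\",server_id=\"{}\"", self.server_id)
    }
}

fn hash_of<T: Hash>(t: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    t.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let before = Key { server_id: "web.ns".into(), alpn: None };
    // an illustrative ALPN value, standing in for a post-upgrade connection.
    let after = Key { server_id: "web.ns".into(), alpn: Some(b"h2".to_vec()) };

    // the registry sees two distinct keys, and so keeps two time series...
    assert_ne!(hash_of(&before), hash_of(&after));
    // ...but both keys render identical labels: Prometheus sees a duplicate.
    assert_eq!(before.fmt_labels(), after.fmt_labels());
}
```

this mirrors the upgrade scenario above, in which a long-running data plane pod observes connections under both routing behaviors.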

### 🔨 the fix

this pull request introduces two new types: `ClientTlsLabels` and `ServerTlsLabels`. these continue to implement `Hash`, for use as a key in our metrics registry, and for use in formatting labels.

`ClientTlsLabels` and `ServerTlsLabels` resemble `ClientTls` and `ServerTls`, respectively, but contain no fields that are elided in label formatting, which prevents duplicate metrics from being emitted.
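
as a sketch of that shape (field names assumed for illustration; the actual definitions in this change may differ), the new key types keep only what label formatting actually emits:

```rust
// a sketch, not the actual definition: a parallel key type derived from
// `ClientTls` (shown above) carrying only the fields that appear in labels.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ClientTlsLabels {
    pub server_id: ServerId,
}

impl From<&ClientTls> for ClientTlsLabels {
    fn from(tls: &ClientTls) -> Self {
        // `server_name` and `alpn` never appear in the emitted labels, so
        // they are deliberately dropped from the registry key.
        Self {
            server_id: tls.server_id.clone(),
        }
    }
}
```

with this shape, two connections that differ only in never-formatted fields collapse to a single registry key, and thus a single time series.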

relatedly, #3988 audits our existing `FmtLabels` implementations and makes use of exhaustive bindings, to prevent this category of problem in the near term. ideally, we might eventually replace the metrics interfaces in `linkerd-metrics` outright, but that is strictly out-of-scope for this particular fix.
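
for reference, the exhaustive-binding style looks roughly like this (a hypothetical impl, for illustration only): destructuring every field by name means that a field added to the label type later breaks this impl at compile time, rather than being silently omitted from the emitted labels.

```rust
// hypothetical, for illustration: `let Self { server_id } = self;` is an
// exhaustive binding, so adding a field to `ClientTlsLabels` forces this
// impl to be revisited instead of quietly diverging from the hash key.
impl FmtLabels for ClientTlsLabels {
    fn fmt_labels(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let Self { server_id } = self;
        write!(f, "tls=\"true\",server_id=\"{server_id}\"")
    }
}
```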

---

* fix: do not key transport metrics registry on `ClientTls`

Signed-off-by: katelyn martin <kate@buoyant.io>

* fix: do not key transport metrics registry on `ServerTls`

Signed-off-by: katelyn martin <kate@buoyant.io>

---------

Signed-off-by: katelyn martin <kate@buoyant.io>

---

# The Linkerd Proxy


This repo contains the transparent proxy component of Linkerd2. While the Linkerd2 proxy is heavily influenced by the Linkerd 1.X proxy, it comprises an entirely new codebase implemented in the Rust programming language.

This proxy's features include:

* Transparent, zero-config proxying for HTTP, HTTP/2, and arbitrary TCP protocols.
* Automatic Prometheus metrics export for HTTP and TCP traffic.
* Transparent, zero-config WebSocket proxying.
* Automatic, latency-aware, layer-7 load balancing.
* Automatic layer-4 load balancing for non-HTTP traffic.
* Automatic mutual TLS.
* An on-demand diagnostic tap API.

This proxy is primarily intended to run on Linux in containerized environments like Kubernetes, though it may also work on other Unix-like systems (like macOS).

The proxy supports service discovery via DNS and the linkerd2 Destination gRPC API.

The Linkerd project is hosted by the Cloud Native Computing Foundation (CNCF).

## Building the project

We use `just-cargo`, which provides a thin wrapper around `just` and `cargo`.

We recommend that you use the included Dev Container to avoid setting up the complex development environment by hand.

### Just

A `justfile` is provided to automate most build tasks. It provides the following recipes:

* `just build` -- Compiles the proxy on your local system using `cargo`.
* `just test` -- Runs unit and integration tests on your local system using `cargo`.
* `just docker` -- Builds a Docker container image that can be used for testing.

### Cargo

Usually, Cargo, Rust's package manager, is used to build and test this project. If you don't have Cargo installed, we suggest getting it via https://rustup.rs/.

### Devcontainer

A Devcontainer is provided for use with Visual Studio Code. It includes all of the tooling needed to build and test the proxy.

## Repository Structure

This project is broken into many small libraries, or crates, so that components may be compiled & tested independently.

## Code of conduct

This project is for everyone. We ask that our users and contributors take a few minutes to review our code of conduct.

## Security

We test our code by way of fuzzing; this is described in FUZZING.md.

A third-party security audit focused on fuzzing Linkerd2-proxy was performed by Ada Logics in 2021. The full report can be found in the `docs/reports/` directory.

## License

linkerd2-proxy is copyright 2018 the linkerd2-proxy authors. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.