* docker.io/library/golang from 1.22 to 1.23
* gotestsum from 0.4.2 to 1.12.0
* protoc-gen-go from 1.28.1 to 1.35.2
* protoc-gen-go-grpc from 1.2 to 1.5.1
* docker.io/library/rust from 1.76.0 to 1.83.0
* cargo-deny from 0.14.11 to 0.16.3
* cargo-nextest from 0.9.67 to 0.9.85
* cargo-tarpaulin from 0.27.3 to 0.31.3
* just from 1.24.0 to 1.37.0
* yq from 4.33.3 to 4.44.5
* markdownlint-cli2 from 0.10.0 to 0.15.0
* shellcheck from 0.9.0 to 0.10.0
* actionlint from 1.6.26 to 1.7.4
* protoc from 3.20.3 to 29.0
* step from 0.25.2 to 0.28.2
* kubectl from 1.29.2 to 1.31.3
* k3d from 5.6.0 to 5.7.5
* k3s image shas
* helm from 3.14.1 to 3.16.3
* helm-docs from 1.12.0 to 1.14.2
The policy-controller main() has grown to hundreds of lines of business logic.
It's preferable to extract this into a library so that the business logic is
separated from the process/runtime setup.
The multicluster integration tests do not retry image import failures when
loading the pause container. This change adds retries to the `_pause-load`
target in `justfile`.
assert_status_accepted was not panicking when the status was not accepted.
This change updates the e2e_egress_network test to properly wait on networks
that are not accepted.
The e2e_egress_network test is failing for two reasons:
1. httpbin.org is currently down and returning 503 errors.
2. The default tests change the default policies and relaunch a container after
the policy is changed. In under-resourced environments, this update may be
slow or unreliable.
This change fixes this by:
1. Using postman-echo.com instead of httpbin.org.
2. Splitting the tests into distinct allow and deny tests to avoid mutating the
default policy.
This change introduces a timeout into the Kubernetes lease logic so that patches
cannot get stuck indefinitely.
This change also modifies our Cargo.tomls so that kubert and its related
dependencies (kube and k8s-openapi) are defined at the workspace-level.
When the policy controller patches a status, it sets the field manager to be
that of the Kind of resource being managed. Per the Kubernetes documentation,
this field should describe the controller that is making the change:
> Managers identify distinct workflows that are modifying the object (especially
> useful on conflicts!), and can be specified through the fieldManager query
> parameter as part of a modifying request. When you Apply to a resource, the
> fieldManager parameter is required. For other updates, the API server infers a
> field manager identity from the "User-Agent:" HTTP header (if present).
>
> When you use the kubectl tool to perform a Server-Side Apply operation,
> kubectl sets the manager identity to "kubectl" by default.
This commit sets the field manager to "linkerd.io/policy-controller", as is used
in status values.
The status controller updates its leadership state every time the lease changes.
But if the lease expires and does not update for some reason, the controller
incorrectly continues to act as the leader.
This change makes the controller check the lease on each iteration so that it
correctly honors the lease contract.
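The per-iteration check described above can be sketched as follows. This is a minimal illustration in Go, not the policy controller's code (the controller is written in Rust). The `lease` type, its field names, and the `isLeader` helper are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// lease is a hypothetical, simplified view of a coordination lease. The field
// names are assumptions and do not mirror the policy controller's types.
type lease struct {
	Holder        string        // identity of the current claimant
	RenewTime     time.Time     // when the holder last renewed the lease
	LeaseDuration time.Duration // how long a renewal remains valid
}

// isLeader reports whether self may act as leader at time now: it must be the
// recorded holder and the lease must not have expired.
func isLeader(l lease, self string, now time.Time) bool {
	if l.Holder != self {
		return false
	}
	return now.Before(l.RenewTime.Add(l.LeaseDuration))
}

func main() {
	renewed := time.Date(2024, 12, 1, 12, 0, 0, 0, time.UTC)
	l := lease{Holder: "pod-a", RenewTime: renewed, LeaseDuration: 30 * time.Second}

	// Evaluate leadership on every iteration instead of caching the answer
	// from the last lease update.
	for _, elapsed := range []time.Duration{10 * time.Second, 45 * time.Second} {
		fmt.Println(elapsed, isLeader(l, "pod-a", renewed.Add(elapsed)))
	}
}
```

Because the expiry is recomputed from the lease on each call, a holder that stops renewing loses leadership once the lease duration elapses, even if no further lease update is ever observed.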
The status reconciliation loop runs differently than seems to be intended (based
on the original DEBUG logging): reconciliation is triggered every time that the
lease updates, even when there is no change in leadership. The code seems to
assume that the lease only updates when leadership changes, but this is not the
case.
This commit updates the status reconciliation loop to use a timer. The timer is
reset when leadership is acquired so that reconciliation is triggered at a fixed
interval.
There are a few things about the policy controller logging that can be cleaned
up for consistency and clarity:
* We frequently log ERROR messages when processing resources with unexpected
values. These messages are more appropriately emitted at WARN--we want to
surface these situations, but they are not really exceptional.
* The leadership status of the status controller is not logged at INFO level, so
it's not possible to know about status changes without DEBUG logging.
* We generally use sentence-cased log messages when emitting user-facing
messages. There are a few situations where we are not consistent.
* The status controller reconciliation logging is somewhat noisy and misleading.
* The status controller does not log any messages when patching resources.
```
DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder has changed
DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder reconciling cluster index.name=linkerd-destination-74d7fdc45d-xfb8l
DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder reconciling cluster index.name=linkerd-destination-74d7fdc45d-xfb8l
DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder reconciling cluster index.name=linkerd-destination-74d7fdc45d-xfb8l
DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder has changed
DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder reconciling cluster index.name=linkerd-destination-74d7fdc45d-xfb8l
DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder reconciling cluster index.name=linkerd-destination-74d7fdc45d-xfb8l
DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder reconciling cluster index.name=linkerd-destination-74d7fdc45d-xfb8l
DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder has changed
DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder reconciling cluster index.name=linkerd-destination-74d7fdc45d-xfb8l
DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder reconciling cluster index.name=linkerd-destination-74d7fdc45d-xfb8l
DEBUG status::Index: linkerd_policy_controller_k8s_status::index: Lease holder reconciling cluster index.name=linkerd-destination-74d7fdc45d-xfb8l
```
The "Lease holder has changed" message actually indicates that the _lease_ has
changed, though the holder may be unchanged.
To improve logging clarity, this change does the following:
* Adds an INFO level log when the leadership status of the controller changes.
* Adds an INFO level log when the status controller patches resources.
* Adds DEBUG level logs when the status controller patches resources.
* Moves reconciliation housekeeping logging to TRACE level.
* Consistently uses sentence capitalization in user-facing log messages.
* Reduces ERROR messages to WARN when handling invalid user-provided data
(including cluster resources). This ensures that ERRORs are reserved for
exceptional policy controller states.
The policy container is configured differently than all other controllers: other
controllers configure an `initialDelaySeconds` on their `livenessProbe` but not
on their `readinessProbe`. The policy container, however, includes this
configuration on its `readinessProbe` but not on its `livenessProbe`.
This commit fixes the policy container to match the other controllers.
This reduces pod readiness time from 20s to 4s.
We received a report of a panic:
```
runtime error: invalid memory address or nil pointer dereference
panic({0x1edb860?, 0x37a6050?})
	/usr/local/go/src/runtime/panic.go:785 +0x132
github.com/linkerd/linkerd2/controller/api/destination/watcher.latestUpdated({0xc0006b2d80?, 0xc00051a540?, 0xc0008fa008?})
	/linkerd-build/vendor/github.com/linkerd/linkerd2/controller/api/destination/watcher/endpoints_watcher.go:1612 +0x125
github.com/linkerd/linkerd2/controller/api/destination/watcher.(*OpaquePortsWatcher).updateService(0xc0007d5480, {0x21fd160?, 0xc000d71688?}, {0x21fd160, 0xc000d71688})
	/linkerd-build/vendor/github.com/linkerd/linkerd2/controller/api/destination/watcher/opaque_ports_watcher.go:141 +0x68
```
The `latestUpdated` function does not properly handle the case where a time is
omitted from a `ManagedFieldsEntry`.
```go
type ManagedFieldsEntry struct {
	// ...

	// Time is the timestamp of when the ManagedFields entry was added. The
	// timestamp will also be updated if a field is added, the manager
	// changes any of the owned fields value or removes a field. The
	// timestamp does not update when a field is removed from the entry
	// because another manager took it over.
	// +optional
	Time *Time `json:"time,omitempty" protobuf:"bytes,4,opt,name=time"`

	// ...
}
```
This change adds a check to avoid the nil dereference.
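A minimal sketch of this kind of guard follows. It is not the code from `endpoints_watcher.go`: the `managedFieldsEntry` type is a reduced stand-in for the Kubernetes type, and the signature of `latestUpdated` and the `at` helper are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// managedFieldsEntry is a stand-in for the Kubernetes ManagedFieldsEntry type,
// reduced to the fields this sketch needs. Time is optional and may be nil.
type managedFieldsEntry struct {
	Manager string
	Time    *time.Time
}

// latestUpdated returns the most recent timestamp among the entries, skipping
// entries whose Time is nil. It returns the zero time if no entry has one.
func latestUpdated(entries []managedFieldsEntry) time.Time {
	var latest time.Time
	for _, e := range entries {
		if e.Time == nil {
			continue // the guard: never dereference an omitted timestamp
		}
		if e.Time.After(latest) {
			latest = *e.Time
		}
	}
	return latest
}

// at is a small helper that builds a timestamp pointer from Unix seconds.
func at(sec int64) *time.Time {
	t := time.Unix(sec, 0)
	return &t
}

func main() {
	entries := []managedFieldsEntry{
		{Manager: "a", Time: at(100)},
		{Manager: "b"}, // timestamp omitted
		{Manager: "c", Time: at(300)},
	}
	fmt.Println(latestUpdated(entries).Unix())
}
```

Skipping entries without a timestamp treats them as carrying no information about recency. Whether the real function skips them or handles them some other way is not stated here.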
The http_local_rate_limit_policy test creates a resource with a status already
hydrated, but setting the status is the controller's job.
This change updates the test to create a resource without a status and then to
wait for the status to be set properly.
This will hopefully help us to avoid race conditions in this test whereby the
API lookup can occur before the controller observes the resource creation.
We add a linkerd.io/created-by annotation to Link resources that specifies the version of the CLI used to create the Link. This annotation is already used in this way by control plane components. This allows us to easily see which version of Linkerd was used to generate a Link. We add a check that inspects this value and warns if any Links don't match the current version of the CLI.
Additionally, we fix an issue with the orphaned services check where it was incorrectly warning that federated services were orphaned because they don't have a specific target cluster.
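The version check on Links could look roughly like the sketch below. The `link` type and the `mismatchedLinks` helper are hypothetical, and the annotation value format in `main` is an assumption. Treating a missing annotation as a mismatch is also an assumption.

```go
package main

import "fmt"

// createdByAnnotation is the annotation key named in the text above.
const createdByAnnotation = "linkerd.io/created-by"

// link is a hypothetical, minimal stand-in for a Link resource.
type link struct {
	Name        string
	Annotations map[string]string
}

// mismatchedLinks returns the names of Links whose created-by annotation is
// missing or differs from expected. The caller supplies the expected value,
// so no particular value format is baked in here.
func mismatchedLinks(links []link, expected string) []string {
	var names []string
	for _, l := range links {
		// Reading from a nil map yields "", which never equals a
		// non-empty expected value.
		if l.Annotations[createdByAnnotation] != expected {
			names = append(names, l.Name)
		}
	}
	return names
}

func main() {
	expected := "linkerd/cli example-version" // assumed value format
	links := []link{
		{Name: "east", Annotations: map[string]string{createdByAnnotation: expected}},
		{Name: "west", Annotations: map[string]string{createdByAnnotation: "linkerd/cli older-version"}},
	}
	for _, name := range mismatchedLinks(links, expected) {
		fmt.Printf("warning: Link %q was not created by this CLI version\n", name)
	}
}
```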
Signed-off-by: Alex Leong <alex@buoyant.io>