The controller package now promotes those buckets prior to starting reconciliation and leader election.
The unopposed elector now seeds the universal bucket.
This helps avoid races in unit tests where the universal
bucket isn't promoted in time when reconciling a resource.
* Avoid double-resyncs without leader election.
tl;dr: Without leader election enabled, we've been suffering from double resyncs; this fixes that.
Essentially when the informers are started, prior to starting the controllers, they fill the controller's workqueue with ~the world. Once we start the controllers, the leader election code calls `Promote` on a `Bucket` so that it starts processing them as their leader, and then requeues every key that falls into that `Bucket`.
When leader election is disabled, we use the `UniversalBucket`, which always owns everything. This means that the first pass through the workqueue already hits every key, so re-enqueuing every key results in superfluous processing.
This change makes the `LeaderAwareFuncs` elide the call to `PromoteFunc` when the `enq` function passed is `nil`, and makes the `unopposedElector` have a `nil` `enq` field when we explicitly pass it the `UniversalBucket` because leader election is disabled.
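For illustration, here is a minimal sketch of that elision; the types and names below are simplified stand-ins, not the actual knative.dev/pkg definitions:
```
package main

import "fmt"

// Simplified stand-ins for the reconciler types in this sketch.
type NamespacedName struct{ Namespace, Name string }

type Bucket interface {
	Has(nn NamespacedName) bool
}

type leaderAware struct {
	// promoteFunc requeues every key owned by the promoted bucket.
	promoteFunc func(b Bucket, enq func(Bucket, NamespacedName)) error
}

// Promote elides the callback when enq is nil: with the UniversalBucket
// the initial informer pass already enqueued every key, so requeuing
// them all again is pure waste.
func (la *leaderAware) Promote(b Bucket, enq func(Bucket, NamespacedName)) error {
	if la.promoteFunc == nil || enq == nil {
		return nil
	}
	return la.promoteFunc(b, enq)
}

func main() {
	la := &leaderAware{promoteFunc: func(Bucket, func(Bucket, NamespacedName)) error {
		fmt.Println("requeuing keys for new leader")
		return nil
	}}
	// Leader election disabled: the unopposedElector passes a nil enq.
	fmt.Println(la.Promote(nil, nil)) // <nil>, and no requeue storm
}
```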
* Add more unit test coverage
This is modelled after some of the semantics available in controller-runtime's `Result` type, which allows reconcilers to indicate a desire to reprocess a key either immediately or after a delay.
I worked around our lack of this when I reworked Tekton's timeout handling logic by exposing a "snooze" function to the Reconciler that wrapped `EnqueueAfter`. This feels like a cleaner API for folks to use in general, and it is consistent with our `NewPermanentError` and `NewSkipKey` functions in influencing the queuing behavior of our reconcilers via wrapped errors.
You can see some discussion of this here: https://knative.slack.com/archives/CA4DNJ9A4/p1627524921161100
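A minimal sketch of the wrapped-error idea follows; `requeueError` and `newRequeueAfter` are illustrative names, not the exact knative.dev/pkg API:
```
package main

import (
	"errors"
	"fmt"
	"time"
)

// requeueError carries the desired requeue delay inside an error, in
// the same spirit as NewPermanentError/NewSkipKey.
type requeueError struct{ after time.Duration }

func (e requeueError) Error() string { return fmt.Sprintf("requeue after %v", e.after) }

func newRequeueAfter(d time.Duration) error { return requeueError{after: d} }

func main() {
	// A reconciler would return newRequeueAfter(...) to ask for a
	// delayed reprocess; the processing loop unwraps the error and
	// schedules the key accordingly (e.g. via EnqueueKeyAfter).
	err := newRequeueAfter(10 * time.Second)
	var re requeueError
	if errors.As(err, &re) {
		fmt.Println("would enqueue key after", re.after)
	}
}
```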
* TestEnqueue: remove unnecessary calls to Sleep
The rate limiter applies only when multiple items are put onto the
workqueue, which is not the case in those tests.
Execution: ~7.6s -> ~2.1s
* TestEnqueueAfter: remove assumptions on execution times
Instead of sleeping for a conservative amount of time, keep watching the
state of the workqueue in a goroutine, and notify the test logic as soon
as the item is observed.
Execution: ~1s -> ~0.05s
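The watch-instead-of-sleep pattern looks roughly like this (a sketch using client-go's workqueue and apimachinery's `wait.PollUntil`; the real test wiring differs):
```
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.NewDelayingQueue()
	stopCh := make(chan struct{})
	defer close(stopCh)

	// Schedule the item with a delay, as EnqueueAfter would.
	q.AddAfter("foo/bar", 10*time.Millisecond)

	// Instead of sleeping a conservative amount of time, poll the
	// queue and report success as soon as the item becomes visible.
	observed := make(chan struct{})
	go func() {
		wait.PollUntil(time.Millisecond, func() (bool, error) {
			return q.Len() > 0, nil
		}, stopCh)
		close(observed)
	}()

	select {
	case <-observed:
		fmt.Println("item observed on the workqueue")
	case <-time.After(time.Second):
		fmt.Println("timed out waiting for the item")
	}
}
```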
* TestEnqueueKeyAfter: remove assumptions on execution times
Instead of sleeping for a conservative amount of time, keep watching the
state of the workqueue in a goroutine, and notify the test logic as soon
as the item is observed.
Execution: ~1s -> ~0.05s
* TestStartAndShutdownWithErroringWork: remove sleep
Instead of sleeping for a conservative amount of time, keep watching the
number of requeues in a goroutine, and notify the test logic as soon as
the expected threshold is reached.
Logs, for an idea of timings
----------------------------
```
Started workers
Processing from queue bar (depth: 0)
Reconcile error {"error": "I always error"}
Requeuing key bar due to non-permanent error (depth: 0)
Reconcile failed. Time taken: 104µs {"knative.dev/key": "bar"}
Processing from queue bar (depth: 0)
Reconcile error {"error": "I always error"}
Requeuing key bar due to non-permanent error (depth: 0)
Reconcile failed. Time taken: 48.2µs {"knative.dev/key": "bar"}
```
Execution: ~1s -> ~0.01s
* TestStart*/TestRun*: reduce sleep time
There is no need to sleep for that long. If an error was returned, it
would activate the second select case immediately.
Execution: ~1s -> ~0.05s
* TestImplGlobalResync: reduce sleep time
We know the fast lane is empty in this test, so we can safely assume
immediate enqueuing of all items on the slow lane.
Logs, for an idea of timings
----------------------------
```
Started workers
Processing from queue foo/bar (depth: 0)
Reconcile succeeded. Time taken: 11.5µs {"knative.dev/key": "foo/bar"}
Processing from queue bar/foo (depth: 1)
Processing from queue fizz/buzz (depth: 0)
Reconcile succeeded. Time taken: 9.7µs {"knative.dev/key": "fizz/buzz"}
Reconcile succeeded. Time taken: 115µs {"knative.dev/key": "bar/foo"}
Shutting down workers
```
Execution: ~4s -> ~0.05s
* review: Replace for/select with PollUntil
* review: Remove redundant duration multiplier
* review: Replace defer with t.Cleanup
* Allow creating controller with custom RateLimiter.
This was previously possible only via direct field modification.
Not switching to a builder pattern mostly for speed of resolution.
Happy to consider alternatives.
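For instance, a caller might build a custom limiter from client-go's primitives and hand it over at construction time (a sketch; the exact option name in the final API may differ):
```
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// A custom per-item exponential backoff built from client-go
	// primitives; previously this could only be swapped in by mutating
	// the controller's queue field after construction.
	rl := workqueue.NewItemExponentialFailureRateLimiter(10*time.Millisecond, 5*time.Second)

	// The limiter backs off per item as failures accumulate.
	fmt.Println(rl.When("foo/bar")) // ~10ms on first failure
	fmt.Println(rl.When("foo/bar")) // ~20ms on second failure
}
```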
* Add tests for new functionality.
Specifically, these tests verify that the `Wait()` function is notified
about the item, and that the `RateLimiter` is passed through to the queue.
* Add Options. Gophers love Options.
* Even moar controller GenericOptions.
* Attempt to appease lint, don't create struct for typecheck.
* GenericOptions -> ControllerOptions
* Public struct fields.
* Enable HA by default.
This consolidates the core of sharedmain around the new leaderelection logic, which will now be **enabled by default**.
This can now be disabled with `--disable-ha` or by passing `sharedmain.WithHADisabled(ctx)` to `sharedmain.MainWithConfig`.
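Usage sketch, assuming the standard sharedmain entry point (the constructor list is elided, and `"my-component"` is a placeholder):
```
package main

import (
	"knative.dev/pkg/injection/sharedmain"
	"knative.dev/pkg/signals"
)

func main() {
	ctx := signals.NewContext()
	// Opt this process out of the new HA default; equivalent to
	// passing --disable-ha on the command line.
	ctx = sharedmain.WithHADisabled(ctx)
	// The controller constructors for this binary would follow.
	sharedmain.MainWithContext(ctx, "my-component" /*, ctors... */)
}
```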
* vagababov comments, build failure
* Open an issue for enabledComponents removal.
* Move the configmap watcher startup.
This race was uncovered by the chaos duck on knative/serving! When we have enabled a feature flag, e.g. multi-container, and the webhook pods are restarted, there is a brief window where the webhook is up and healthy before the configmaps have synchronized and the new webhook pod realizes the feature is enabled.
* Drop the import alias
* Change StartAll to take context.
This has bugged me since we started using `ctx`, which contains a `stopCh` of sorts as `Done()`. This is somewhat for consistency, but by using `ctx` explicitly we enable ourselves to take advantage of more contextual information.
I did a quick scan of call sites, and the good news is that the `sharedmain` change should be the place through which the vast majority of calls occur. The one outlier here is the KPA, which calls this manually; I will stage a PR to manually import pkg into serving to fix this once this lands.
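A sketch of the new shape (the real `Impl` and its worker plumbing are simplified here):
```
package main

import (
	"context"
	"sync"
)

// Impl is a stub standing in for controller.Impl in this sketch.
type Impl struct {
	RunContext func(ctx context.Context, threadiness int) error
}

// StartAll now accepts a context; cancellation flows through
// ctx.Done() instead of a dedicated stopCh.
func StartAll(ctx context.Context, controllers ...*Impl) {
	var wg sync.WaitGroup
	for _, c := range controllers {
		wg.Add(1)
		go func(c *Impl) {
			defer wg.Done()
			c.RunContext(ctx, 2) // worker count chosen arbitrarily here
		}(c)
	}
	wg.Wait()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // already cancelled so the sketch exits immediately
	StartAll(ctx, &Impl{RunContext: func(ctx context.Context, _ int) error {
		<-ctx.Done() // a real controller would process its workqueue here
		return nil
	}})
}
```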
* Add a Run shim for back-compat
* allow filtering on schema.GroupKind
In addition, this deprecates usage of `Filter` with the introduction of
`FilterGroupVersionKind`.
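The predicate amounts to matching the controlling OwnerReference's GroupKind while ignoring version. A simplified re-implementation for illustration (the real helper lives in knative.dev/pkg/controller):
```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// filterGroupKind admits an object only when its controlling
// OwnerReference matches the given GroupKind, regardless of version.
func filterGroupKind(gk schema.GroupKind) func(obj interface{}) bool {
	return func(obj interface{}) bool {
		object, ok := obj.(metav1.Object)
		if !ok {
			return false
		}
		owner := metav1.GetControllerOf(object)
		if owner == nil {
			return false
		}
		gv, err := schema.ParseGroupVersion(owner.APIVersion)
		return err == nil && gv.Group == gk.Group && owner.Kind == gk.Kind
	}
}

func main() {
	ctrl := true
	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
		OwnerReferences: []metav1.OwnerReference{{
			APIVersion: "apps/v1", Kind: "ReplicaSet", Name: "rs", Controller: &ctrl,
		}},
	}}
	match := filterGroupKind(schema.GroupKind{Group: "apps", Kind: "ReplicaSet"})
	fmt.Println(match(pod)) // true: the version ("v1") is ignored
}
```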
* reduce nesting & simplify boolean logic
Add `RunInformers`, which is similar to the `StartInformers` function but
additionally allows users to wait for informers to finish running.
This function will be mainly used in tests to fix the race described in
knative/serving#5351
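A sketch of the run-and-wait behavior (simplified relative to the real signature):
```
package main

import (
	"fmt"
	"sync"
)

type informer interface{ Run(stopCh <-chan struct{}) }

// runInformers starts each informer in a goroutine and returns a wait
// function the caller (typically a test) can use to block until they
// have all finished running.
func runInformers(stopCh <-chan struct{}, informers ...informer) func() {
	wg := sync.WaitGroup{}
	for _, inf := range informers {
		wg.Add(1)
		go func(inf informer) {
			defer wg.Done()
			inf.Run(stopCh)
		}(inf)
	}
	return wg.Wait
}

type fakeInformer struct{}

func (fakeInformer) Run(stopCh <-chan struct{}) { <-stopCh }

func main() {
	stopCh := make(chan struct{})
	wait := runInformers(stopCh, fakeInformer{})
	close(stopCh)
	wait() // returns once the informer goroutines exit
	fmt.Println("informers finished")
}
```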
* Rework how we GlobalResync.
This consolidates the logic for how we do GlobalResyncs in a few ways:
1. Remove SendGlobalUpdates and replace it with FilteredGlobalResync.
2. Have GlobalResync go through the FilteredGlobalResync with a tautological predicate.
3. Change the way we enqueue things during a global resync to avoid flooding the system
by using EnqueueAfter and wait.Jitter to stagger how things are queued.
The background for this is that when I started to benchmark reconciliation of a large
number of resources we'd see random large stalls like [this](https://mako.dev/run?run_key=4882560844824576&~dl=1&~cl=1&~rl=1&~rvl=1&~il=1&~sksl=1&~pal=1)
fairly consistently. Looking at the logs, these stalls coincide with incredibly deep
work queues due to global resyncs triggered by configmap notifications. When I disabled
global resyncs the spikes [went away](https://mako.dev/run?run_key=4975897060835328&~dl=1&~cl=1&~rl=1&~rvl=1&~il=1&~sksl=1&~pal=1).
To mitigate the huge pile-up due to the global resync we use `wait.Jitter` to stagger how
things are enqueued with a `maxFactor` of the length of the store. This also seems to
[keep things flowing](https://mako.dev/run?run_key=5701802099998720&~dl=1&~cl=1&~rl=1&~rvl=1&~il=1&~sksl=1&~pal=1),
although we will possibly need to tune things further.
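A sketch of the staggering, using apimachinery's `wait.Jitter` (the enqueue function and the base duration are placeholders, not the exact knative.dev/pkg code):
```
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// filteredGlobalResync enqueues every key matching the predicate, with
// a jittered delay so a configmap-triggered resync cannot flood the
// workqueue all at once.
func filteredGlobalResync(keys []string, match func(string) bool,
	enqueueKeyAfter func(key string, delay time.Duration)) {
	// Scale the jitter with the size of the store, per the text above.
	maxFactor := float64(len(keys))
	for _, k := range keys {
		if match(k) {
			// wait.Jitter returns a duration in [d, d+maxFactor*d).
			enqueueKeyAfter(k, wait.Jitter(time.Second, maxFactor))
		}
	}
}

func main() {
	keys := []string{"default/a", "default/b", "default/c"}
	filteredGlobalResync(keys,
		func(string) bool { return true }, // tautological predicate
		func(k string, d time.Duration) { fmt.Printf("enqueue %s after %v\n", k, d) })
}
```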
* Update comment to mention the delay.
A while back we added a "StatsReporter" argument to `controller.NewImpl`,
but in serving every callsite of this is passing:
```
controller.NewImpl(r, logger, "my-string", MustNewStatsReporter("my-string", logger))
```
Where `MustNewStatsReporter` is just a form of knative/pkg's `controller.NewStatsReporter`
that logs fatally when an error is returned. It is notable that Serving's logic has been
duplicated to both Build and Eventing.
There are a handful of changes here:
1. Move `MustNewStatsReporter` into knative/pkg
2. Expose the current interface as `NewImplWithStats`
3. Drop the `StatsReporter` from `NewImpl` and default to `MustNewStatsReporter()`
This is a breaking change for downstream repositories, but should make their callsites universally simpler.
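With the default in place, the call site above should reduce to something like:
```
controller.NewImpl(r, logger, "my-string")
```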
In serving we have `reconciler.Handler` that wraps a handler function
(e.g. `Enqueue`) in the `cache.ResourceEventHandlerFuncs`. This pattern
was becoming pervasive, so this simpler handler dramatically reduced our
boilerplate.
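The wrapper is essentially this (a simplified rendition of the pattern):
```
package reconciler

import "k8s.io/client-go/tools/cache"

// Handler wraps a single handler function (e.g. impl.Enqueue) as the
// Add/Update/Delete callbacks of a ResourceEventHandlerFuncs, removing
// the per-controller boilerplate described above.
func Handler(h func(interface{})) cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		AddFunc:    h,
		UpdateFunc: func(oldObj, newObj interface{}) { h(newObj) },
		DeleteFunc: h,
	}
}
```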
This is needed for the positive handoff, so that we can enqueue the KPA
after we receive the positive answer from the activator, while still
allowing buffer time to make sure the network changes propagate
everywhere.
* Add a version of GlobalResync that calls an event handler.
This allows impl to apply filtering and queuing logic of their own,
instead of calling Enqueue() for everything.
* Fix typo in doc comment.
I was hoping to switch from a simple string key to a struct to include additional metadata in our workqueue items (in particular, the time at which each was queued), but the great unit tests here reminded me that this would not properly deduplicate keys (they would have different queue times!). So I'm settling for simplifying the loop in a few ways:
1. Keys can only be strings, so remove some error checking and a test.
1. We were wrapping a block in a `func` because it contained `defer`, but the outer function had nothing else in it.
I'd still like to try and find a way to track queue times for keys, so that we can surface two interesting metrics:
1. How long are things spending queued?
1. How long between events being observed and us responding? (above + reconcile latency)
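To make the deduplication point concrete, here is a small demonstration with client-go's workqueue (the `timedKey` struct is hypothetical; it's the design being rejected):
```
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()

	// String keys coalesce: adding the same key twice queues it once.
	q.Add("foo/bar")
	q.Add("foo/bar")
	fmt.Println(q.Len()) // 1

	// A struct key carrying its enqueue time would not coalesce,
	// because the two values differ; this is the problem the unit
	// tests caught.
	type timedKey struct {
		key      string
		enqueued time.Time
	}
	t0 := time.Now()
	q.Add(timedKey{"foo/bar", t0})
	q.Add(timedKey{"foo/bar", t0.Add(time.Millisecond)})
	fmt.Println(q.Len()) // 3
}
```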