The test assumed the threads would be scheduled in a particular way, but they aren't.
What we really care about checking is that the proper RateLimiter is threaded through and that it works.
We don't need to check that the underlying queue implementation works; that's covered by its own tests.
So just verify these two things.
* Allow creating controller with custom RateLimiter.
Which was possible before via field modification.
Not switching to a builder pattern mostly for speed of resolution.
Happy to consider alternatives.
* Add tests for new functionality.
Specifically, these test that the Wait() function is notified about
the item, and that the RateLimiter is passed through to the queue.
* Add Options. Gophers love Options.
* Even moar controller GenericOptions.
* Attempt to appease lint, don't create struct for typecheck.
* GenericOptions -> ControllerOptions
* Public struct fields.
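For illustration, a minimal sketch of what this enables, assuming a `NewImplFull` constructor and these field names (they may differ from what finally landed):
```go
import (
	"time"

	"go.uber.org/zap"
	"k8s.io/client-go/util/workqueue"
	"knative.dev/pkg/controller"
)

func newController(r controller.Reconciler, logger *zap.SugaredLogger) *controller.Impl {
	return controller.NewImplFull(r, controller.ControllerOptions{
		WorkQueueName: "my-controller",
		Logger:        logger,
		// Custom backoff: retry failing keys from 100ms up to 10s,
		// instead of modifying the field after construction.
		RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
			100*time.Millisecond, 10*time.Second),
	})
}
```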
* Use two lane queue instead of the regular workqueue
- we need to poll for queue length in the webhook tests because propagation is now async, and a check at the wrong time would be incorrect (see the sketch after this list).
- otherwise it's just a drop-in replacement.
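A test-side sketch of the polling this requires; `impl.WorkQueue()` as the accessor is an assumption:
```go
import (
	"testing"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"knative.dev/pkg/controller"
)

func waitForQueueToDrain(t *testing.T, impl *controller.Impl) {
	t.Helper()
	if err := wait.PollImmediate(10*time.Millisecond, 5*time.Second, func() (bool, error) {
		// Drained means both lanes have been consumed.
		return impl.WorkQueue().Len() == 0, nil
	}); err != nil {
		t.Fatalf("queue never drained: %v", err)
	}
}
```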
* update test
* comment
* tests hardened
* Remove key in tags to reduce metrics count
Issue: https://github.com/knative/serving/issues/8609
Signed-off-by: Lance Liu <xuliuxl@cn.ibm.com>
* remove tag key for OpenCensus
Signed-off-by: Lance Liu <xuliuxl@cn.ibm.com>
* Enable HA by default.
This consolidates the core of sharedmain around the new leaderelection logic, which will now be **enabled by default**.
This can now be disabled with `--disable-ha` or by passing `sharedmain.WithHADisabled(ctx)` to `sharedmain.MainWithConfig`.
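A hedged sketch of opting out, using the hooks named above (`newMyController` is a hypothetical controller constructor):
```go
import (
	"knative.dev/pkg/injection/sharedmain"
	"knative.dev/pkg/signals"
)

func main() {
	// Explicitly opt out of the new HA default for this process.
	ctx := sharedmain.WithHADisabled(signals.NewContext())
	sharedmain.MainWithConfig(ctx, "my-component",
		sharedmain.ParseAndGetConfigOrDie(), newMyController)
}
```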
* Address vagababov's review comments; fix build failure.
* Open an issue for enabledComponents removal.
* Move the configmap watcher startup.
This race was uncovered by the chaos duck on knative/serving! When we have enabled a feature flag, e.g. multi-container, and the webhook pods are restarted, there is a brief window where the webhook is up and healthy before the configmaps have synchronized and the new webhook pod realizes the feature is enabled.
* Drop the import alias
* Add an option to skip automated status updates in a reconciler.
This option is necessary to be able to create reconcilers like Serving's labeler, which purely adds labels to resources. If that fails, the new automated observed-generation handling changes the status, and that currently gets written to the API, which is not desired.
* Flip the bool.
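A hypothetical wiring sketch for a labeler-style reconciler; the option name follows the description above, but the generated-reconciler types are assumptions:
```go
// myresourcereconciler is a hypothetical generated reconciler package.
impl := myresourcereconciler.NewImpl(ctx, r, func(*controller.Impl) reconciler.Options {
	// The labeler only adds labels; don't write status back to the API.
	return reconciler.Options{SkipStatusUpdates: true}
})
```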
* do not record for empty metric config
* Revert "do not record for empty metric config"
This reverts commit 539a5e4dbb.
* add a comment
* fix typo
* fix tests
* revert
* revert tests
* revert
* fix conflicts
* one more test file
- use more performant functions (mostly removing formatters where possible)
- move zap.Error() to the call site, rather than creating a new sugared logger with a key attached (not cheap)
- fix *w usages where the key was not provided.
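For illustration, the before/after shape of such a call site (names generic):
```go
import "go.uber.org/zap"

func logSyncError(logger *zap.SugaredLogger, key string, err error) {
	// Before: builds a throwaway sugared logger and runs a formatter:
	//   logger.With(zap.Error(err)).Errorf("failed to sync %q", key)
	// After: structured fields at the call site, no formatting, and the
	// "key" value is paired with an explicit key.
	logger.Errorw("failed to sync", zap.Error(err), "key", key)
}
```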
* do not record for empty metric config
* Revert "do not record for empty metric config"
This reverts commit 539a5e4dbb.
* use metricstest package for test
* Add the ability to override controller agent name
- the controller agent name can be set via the context; if nothing is set, the
controller falls back to the default controller agent name
* Check that the context value is non-nil and can be cast to a string before returning it.
* Create EventRecorder after we determine the controller agent name
* Address comments
Signed-off-by: Andrew Su <asu@pivotal.io>
* Remove check for if recorder is nil
Co-authored-by: Andrew Su <asu@pivotal.io>
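A sketch of the fallback logic described above; the context key and default are assumed names:
```go
import "context"

type agentNameKey struct{}

const defaultAgentName = "controller" // assumed package default

// agentName returns the context override only if it is non-nil and actually
// a string; otherwise it falls back to the default, per the review above.
func agentName(ctx context.Context) string {
	if v := ctx.Value(agentNameKey{}); v != nil {
		if name, ok := v.(string); ok {
			return name
		}
	}
	return defaultAgentName
}
```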
* Explicitly name controller filters.
Every time I read the generic "Filter" or "FilterGroupVersionKind" my brain needs to do an extra roundtrip to realize that this actually filters on the **controller** having that GVK/GK. This adds new functions that explicitly state that, to avoid the roundtrip.
Old functions are just deprecated so this can be rolled out without a downstream break.
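A hedged usage sketch; the new explicit names and the GVK source package here are illustrative:
```go
import (
	"k8s.io/client-go/tools/cache"
	"knative.dev/pkg/controller"
	v1 "knative.dev/serving/pkg/apis/serving/v1" // illustrative GVK source
)

func setupHandlers(impl *controller.Impl, inf cache.SharedIndexInformer) {
	inf.AddEventHandler(cache.FilteringResourceEventHandler{
		// Reads as: only objects whose *controller* owner is a v1 Service.
		FilterFunc: controller.FilterControllerGVK(v1.SchemeGroupVersion.WithKind("Service")),
		Handler:    controller.HandleAll(impl.EnqueueControllerOf),
	})
}
```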
* Rename after review.
* Remove unused code.
* Use raw strings to avoid escaping.
* Remove unneeded type conversions.
* Preallocate slices where possible.
* Use semantic equality in psbinding reconciler.
* Change StartAll to take context.
This has bugged me since we started using `ctx`, which contains a `stopCh` of sorts as `Done()`. This is somewhat for consistency, but by using `ctx` explicitly we enable ourselves to take advantage of more contextual information.
I did a quick scan of call sites and the good news is that the `sharedmain` change should be the place through which the vast majority of calls occur, however, the one outlier here is the KPA which calls this manually. I will stage a PR to manually import pkg into serving to fix this once this lands.
* Add a Run shim for back-compat
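A minimal sketch of the new shape:
```go
import (
	"knative.dev/pkg/controller"
	"knative.dev/pkg/signals"
)

func run(controllers ...*controller.Impl) {
	ctx := signals.NewContext() // ctx.Done() plays the old stopCh role
	controller.StartAll(ctx, controllers...)
}
```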
* allow filtering on schema.GroupKind
In addition, this deprecates usage of Filter with the introduction of
FilterGroupVersionKind.
* reduce nesting & simplify boolean logic
Add RunInformers, which is similar to the StartInformers function but allows
users to wait for informers to finish running.
This function will be mainly used in tests to fix the race described in
knative/serving#5351
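A test-side sketch of the intended use, assuming RunInformers returns a wait function alongside an error:
```go
stopCh := make(chan struct{})
waitInformers, err := controller.RunInformers(stopCh, informer1, informer2)
if err != nil {
	t.Fatal("failed to start informers:", err)
}
defer func() {
	close(stopCh)
	waitInformers() // block until informers have fully stopped
}()
```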
This adds logic to hook into two other metric systems:
1. `cache.SetReflectorMetricsProvider`, which doesn't seem hooked up in Kubernetes yet, but would theoretically give us metrics about the mechanisms underpinning informers.
2. `metrics.Register`, which hooks us into the rest client infrastructure to give us metrics about low-level API server calls.
Fixes: https://github.com/knative/pkg/issues/679
Fixes: https://github.com/knative/pkg/issues/680
* Do not enqueue (and log) if we're shutting down
* Wait for queue to empty before terminate and other restrictions
We need to stop enqueueing items when the queue is shutting down, since this causes panics in our tests.
So:
1. wait for queue to empty before shutting down
2. do not enqueue stuff for global reconciliation if the queue is shutting down (this exhibits races in global resync tests)
3. do not enqueue on error, since while the erroneous object was being processed the queue might have terminated
/assign mattmoor
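A sketch of the guard in points 2 and 3, using the `ShuttingDown()` check from client-go's workqueue:
```go
import "k8s.io/client-go/util/workqueue"

// safeEnqueue drops late keys instead of adding to a terminated queue.
func safeEnqueue(q workqueue.RateLimitingInterface, key string) {
	if q.ShuttingDown() {
		return // queue is draining; adding now would race shutdown
	}
	q.Add(key)
}
```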
* Rework how we GlobalResync.
This consolidates the logic for how we do GlobalResyncs in a few ways:
1. Remove SendGlobalUpdates and replace it with FilteredGlobalResync.
2. Have GlobalResync go through the FilteredGlobalResync with a tautological predicate.
3. Change the way we enqueue things during a global resync to avoid flooding the system
by using EnqueueAfter and wait.Jitter to stagger how things are queued.
The background for this is that when I started to benchmark reconciliation of a large
number of resources we'd see random large stalls like [this](https://mako.dev/run?run_key=4882560844824576&~dl=1&~cl=1&~rl=1&~rvl=1&~il=1&~sksl=1&~pal=1)
fairly consistently. Looking at the logs, these stalls coincide with incredibly deep
work queues due to global resyncs triggered by configmap notifications. When I disabled
global resyncs the spikes [went away](https://mako.dev/run?run_key=4975897060835328&~dl=1&~cl=1&~rl=1&~rvl=1&~il=1&~sksl=1&~pal=1).
To mitigate the huge pile-up due to the global resync we use `wait.Jitter` to stagger how
things are enqueued with a `maxFactor` of the length of the store. This also seems to
[keep things flowing](https://mako.dev/run?run_key=5701802099998720&~dl=1&~cl=1&~rl=1&~rvl=1&~il=1&~sksl=1&~pal=1),
although we will possibly need to tune things further.
* Update comment to mention the delay.
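A sketch of the staggering described above; `EnqueueKeyAfter` is assumed as the key-based sibling of the `EnqueueAfter` mentioned earlier:
```go
import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/cache"
	"knative.dev/pkg/controller"
)

// staggeredResync jitters each enqueue; maxFactor is the store length,
// so deeper stores spread the resync over a wider window.
func staggeredResync(impl *controller.Impl, store cache.Store) {
	keys := store.ListKeys()
	for _, key := range keys {
		impl.EnqueueKeyAfter(key, wait.Jitter(time.Second, float64(len(keys))))
	}
}
```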
A while back we added a "StatsReporter" argument to `controller.NewImpl`,
but in serving every callsite of this is passing:
```
controller.NewImpl(r, logger, "my-string", MustNewStatsReporter("my-string", logger))
```
Where `MustNewStatsReporter` is just a form of knative/pkg's `controller.NewStatsReporter`
that logs fatally when an error is returned. It is notable that Serving's logic has been
duplicated to both Build and Eventing.
There are a handful of changes here:
1. Move MustNewStatsReporter into knative/pkg
2. Expose the current interface as NewImplWithStats
3. Drop the StatsReporter from NewImpl and default to `MustNewStatsReporter()`
This is a breaking change for downstream repositories, but should make their callsites universally simpler.
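Callsites then collapse to something like (hedged sketch; a reporter is only needed on the renamed entry point):
```go
// Stats reporter defaulted internally via MustNewStatsReporter:
impl := controller.NewImpl(r, logger, "my-string")

// Callers needing a custom reporter use the renamed entry point:
impl = controller.NewImplWithStats(r, logger, "my-string", reporter)
```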
In serving we have `reconciler.Handler` that wraps a handler function
(e.g. `Enqueue`) in the `cache.ResourceEventHandlerFuncs`. This pattern
was becoming pervasive, so this simpler handler dramatically reduced our
boilerplate.
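For illustration, roughly what the wrapper buys (`HandleAll` as the pkg-side name is an assumption):
```go
import (
	"k8s.io/client-go/tools/cache"
	"knative.dev/pkg/controller"
)

func watch(inf cache.SharedIndexInformer, impl *controller.Impl) {
	// With the wrapper:
	inf.AddEventHandler(controller.HandleAll(impl.Enqueue))

	// ...instead of the boilerplate it replaces:
	inf.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    impl.Enqueue,
		UpdateFunc: func(oldObj, newObj interface{}) { impl.Enqueue(newObj) },
		DeleteFunc: impl.Enqueue,
	})
}
```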
This is needed for the positive handoff, so that we can enqueue
the KPA after we receive the positive answer from the activator,
while still keeping buffer time to make sure the network
changes propagate everywhere.
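A hedged usage sketch (method name and delay assumed):
```go
// Requeue the KPA only after a propagation buffer has elapsed.
impl.EnqueueAfter(kpa, 10*time.Second)
```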
* Add a version of GlobalResync that calls an event handler.
This allows impl to apply filtering and queuing logic of their own,
instead of calling Enqueue() for everything.
* Fix typo in doc comment.
* split and update
* bug fixed
* change the copyright year
* use fake metadata in tests
* exclude test coverage for gcp_metadata
* add record
* add tests
* revert 2019 and add a new var
* change documents
* address comment
* change line
* Use different reporting periods based on the metrics backend. The original implementation set this value to 1 min to accommodate the minimum allowed value for Stackdriver, but that is suboptimal behavior for Prometheus. The change also allows operators to override the default values.
* Fix the units of reconciler latency from nanoseconds to milliseconds.
* Address PR comments.
* Add better logging for reconciliation.
* Remove unnecessary lock in the UT.
* Address PR comments.
I was hoping to switch from a simple string key to a struct to include additional metadata into our workqueue item (in particular the time at which it was queued), but the great unit tests here reminded me that this would not properly deduplicate keys (they would have different queue times!), so I'm settling for simplifying the loop in a few ways:
1. Keys can only be strings, so remove some error checking and a test.
1. We were wrapping a block in a `func` because it contained `defer`, but the outer function had nothing else in it.
I'd still like to try and find a way to track queue times for keys, so that we can surface two interesting metrics:
1. How long are things spending queued?
1. How long between events being observed and us responding? (above + reconcile latency)
This promotes the private deletion-handling helper wrapping `meta.Accessor` into `kmeta`. The wordy name is based on the wordier `cache.DeletionHandlingMetaNamespaceKeyFunc`, since it serves a similar purpose.
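A hedged usage sketch of the promoted helper in a delete handler (`DeletionHandlingAccessor` following the naming pattern described above):
```go
import "knative.dev/pkg/kmeta"

func enqueueDeleted(impl *controller.Impl, untyped interface{}) {
	// Unwraps cache.DeletedFinalStateUnknown tombstones before meta.Accessor.
	obj, err := kmeta.DeletionHandlingAccessor(untyped)
	if err != nil {
		return // neither an object nor a tombstone wrapping one
	}
	impl.EnqueueKey(obj.GetNamespace() + "/" + obj.GetName())
}
```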
* Prepare for global resync on ConfigMap changes
- Add GlobalResync(cache.SharedInformer) method to *controller.Impl
- Move UntypedStore from serving/pkg/config to pkg/configmap
- Add onAfterStore callbacks to UntypedStore
- Add TypeFilter to enable composable construction of such callbacks
* Address review
- Unit test GlobalResync()
- Reimpl GlobalResync() using ListKeys()
* Update comments
* Add test for onAfterStore
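A hedged sketch of wiring these together; the `NewUntypedStore` signature is recalled, not guaranteed verbatim, and `NewFooFromConfigMap` is hypothetical:
```go
import (
	"k8s.io/client-go/tools/cache"
	"knative.dev/pkg/configmap"
	"knative.dev/pkg/controller"
)

func newStore(logger configmap.Logger, impl *controller.Impl, inf cache.SharedInformer) *configmap.UntypedStore {
	return configmap.NewUntypedStore("mystore", logger,
		configmap.Constructors{
			"config-foo": NewFooFromConfigMap, // hypothetical parser
		},
		// onAfterStore: kick off a global resync once the new value lands.
		func(name string, value interface{}) {
			impl.GlobalResync(inf)
		},
	)
}
```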
* Update godoc comment for Logger
The rate limiting queue makes the sleep necessary. Locally I see 0-2 failures when this is run with -count=10000. This bumps the sleep slightly to reduce this further.
Fixes: https://github.com/knative/pkg/issues/56
This adds a logkey.Key for attaching the key being reconciled to a logger we embed into the context passed to Reconcile. The point of this is to enable us to eliminate [this](928d580756/pkg/reconciler/v1alpha1/service/service.go (L101-L104)) boilerplate from each of the Reconcilers.
These aren't identical; this has the form:
knative.dev/key: foo/bar
What it's replacing has the form:
knative.dev/namespace: foo
serving.knative.dev/service: bar
However, the type information present in the latter is redundant with the controller type that [should already be embedded](928d580756/pkg/reconciler/reconciler.go (L80)) in the logger.
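A sketch of how the embedding might look at the top of the work loop (helper names assumed):
```go
import (
	"context"

	"go.uber.org/zap"
	"knative.dev/pkg/logging"
	"knative.dev/pkg/logging/logkey"
)

// withReconcileKey attaches the key once here, instead of in every Reconciler.
func withReconcileKey(ctx context.Context, key string) context.Context {
	logger := logging.FromContext(ctx).With(zap.String(logkey.Key, key))
	return logging.WithLogger(ctx, logger)
}
```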
The `controller.go` is from: https://github.com/knative/serving/pull/1770, however, this adds 100% coverage of `controller.go`, which we have been missing in `knative/serving`.
This also adds a `context.Context` to the `Reconcile` signature, to enable better sharing of logger setup across controllers.
Fixes: https://github.com/knative/pkg/issues/8
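The resulting interface shape, per the description above:
```go
import "context"

type Reconciler interface {
	// Reconcile receives a ctx carrying the shared logger setup and the
	// namespace/name key of the resource to reconcile.
	Reconcile(ctx context.Context, key string) error
}
```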