This change updates the set-inventory logic to retain objects which
failed to reconcile. This ensures that if you run the applier/destroyer
multiple times, an object that is failing to reconcile will be retained
in the inventory. Before this change, an object failing to reconcile
could be lost after multiple attempts (e.g. multiple destroys).
Prior to this change, the inventory was always deleted at the end of a
Destroy event. This would occur even in the case of a pruning
failure, resulting in the objects being removed from the inventory
without being deleted. This change makes it so that the inventory is
only deleted if all objects have been pruned.
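A minimal sketch of the retention rule described above, assuming a simplified model of the inventory. `ObjID`, `remainingInventory`, and `shouldDeleteInventory` are hypothetical names for illustration, not the library's API.

```go
package main

import "fmt"

// ObjID is a hypothetical stand-in for an inventory object identifier.
type ObjID string

// remainingInventory returns the objects that must stay in the inventory
// after a destroy: every object whose prune/delete failed.
func remainingInventory(inv []ObjID, failed map[ObjID]bool) []ObjID {
	var keep []ObjID
	for _, id := range inv {
		if failed[id] {
			keep = append(keep, id)
		}
	}
	return keep
}

// shouldDeleteInventory reports whether the inventory object itself may be
// deleted. That is only safe when no objects remain to be tracked.
func shouldDeleteInventory(remaining []ObjID) bool {
	return len(remaining) == 0
}

func main() {
	inv := []ObjID{"deploy/a", "deploy/b"}
	failed := map[ObjID]bool{"deploy/b": true} // deploy/b failed to prune
	remaining := remainingInventory(inv, failed)
	fmt.Println("retained:", remaining)
	fmt.Println("delete inventory:", shouldDeleteInventory(remaining))
}
```

With this shape, a second destroy attempt still sees the failed object in the inventory and can retry it.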
The stress tests create 1000 copies of this deployment on the kind
cluster, which means that these deployments are in contention for
resources with the kind control plane and test suite. Reducing the size
of these test deployments should help the stress test run with fewer
resources available, and will hopefully get the presubmits passing.
This commit introduces 2 new Storage interface methods to enable clients
to implement their own logic for applying inventory objects to the live
cluster.
- Duplicate the 1,000 Deployment test, but use a 1m reconcile timeout in
a retry loop.
- This verifies that the applier and destroyer are re-entrant at scale.
- Add DefaultStatusWatcher that wraps DynamicClient and manages
informers for a set of resource objects.
- Supports two modes: root-scoped & namespace-scoped.
- Root-scoped mode uses root-scoped informers for efficiency and
performance.
- Namespace-scoped mode uses namespace-scoped informers to
minimize the permissions needed to run and the size of the
in-memory object cache.
- Automatic mode selects which mode to use based on whether the
objects being watched are in one or multiple namespaces.
This is the default mode, optimizing for performance.
- If CRDs are being watched, the creation/deletion of CRDs can
cause informers for those custom resources to be created/deleted.
- In namespace-scope mode, if namespaces are being watched, the
creation/deletion of namespaces can also trigger informers to
be created/deleted.
- All creates/updates/deletes to CRDs also cause RESTMapper reset.
- Allow pods to be unschedulable for 15s before reporting the
status as Failed. Any update resets the timer.
- Add BlindStatusWatcher for testing and disabling for dry-run.
- Add DynamicClusterReader that wraps DynamicClient.
This is now used to look up generated resources
(ex: Deployment > ReplicaSets > Pods).
- Add DefaultStatusReader which uses a DelegatingStatusReader to
wrap a list of conventional and specific StatusReaders.
This should make it easier to extend the list of StatusReaders.
- Move some pending WaitEvents to be optional in tests, now that
StatusWatcher can resolve their status before the WaitTask starts.
- Add a new Thousand Deployments stress test (10x kind nodes)
- Add some new logs for easier debugging
- Add internal SyncEvent so that apply/delete tasks don't start
until the StatusWatcher has finished initial synchronization.
This helps avoid missing events from actions that happen while
synchronization is incomplete.
- Filter optional pending WaitEvents when testing.
BREAKING CHANGE: Replace StatusPoller w/ StatusWatcher
BREAKING CHANGE: Remove PollInterval (obsolete with watcher)
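The automatic mode selection described for the watcher could look roughly like the sketch below. `Scope` and `selectScope` are hypothetical names, and treating cluster-scoped objects (empty namespace) as forcing root scope is an assumption rather than something the message states.

```go
package main

import "fmt"

// Scope is a hypothetical enum for the informer scope.
type Scope int

const (
	RootScoped Scope = iota
	NamespaceScoped
)

// selectScope picks namespace-scoped informers when every watched object
// lives in the same non-empty namespace, and root-scoped informers
// otherwise (multiple namespaces, or any cluster-scoped object).
func selectScope(namespaces []string) Scope {
	if len(namespaces) == 0 || namespaces[0] == "" {
		return RootScoped
	}
	first := namespaces[0]
	for _, ns := range namespaces[1:] {
		if ns != first {
			return RootScoped
		}
	}
	return NamespaceScoped
}

func main() {
	fmt.Println(selectScope([]string{"prod", "prod"}) == NamespaceScoped)
	fmt.Println(selectScope([]string{"prod", "dev"}) == RootScoped)
}
```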
The previous sorting method was not stable, and only worked coincidentally
for the two use cases that were using it. This new method works on
more event types and only sorts contiguous events. This should make
the sort usable when we add parallel apply and watch instead of poll.
Event Changes:
- Renamed ActionGroupEvent.Type -> Status
- Renamed Event.Operation -> Status
- Renamed Status fields to use consistent prefixes and suffixes
- Combined Applied, Changed, Unchanged, and ServersideApplied into
ApplySuccessful
- Added Failed status for apply, prune, and delete events
- Replaced Unspecified with Pending
- Made enum String output more consistent
Printer Changes:
- Added FormatSummary to print summary stats at the end of the
apply/destroy, instead of after the last of each type of action
group.
- Modified printer output to match the new, more consistent events.
- Updated JSON printer docs with latest schema details.
BREAKING CHANGE: Event "operations" and "type" are now "status"
BREAKING CHANGE: JSON printer schema changed to match events
BREAKING CHANGE: Event status enums renamed/refactored
- Ex: make test-e2e-focus FOCUS=ApplyDestroy
- Rename e2e tests to be easier to copy/paste/focus without spaces.
- Reduce stress test verbosity to reduce log spam.
- Wait for kind controllers to be ready before running tests.
- Rewrite actuation filters to return an error with the reason for
skipping.
- Add explicit error types for most skip errors, to make it easier to catch
and handle them.
- Add Is method to explicit error types to allow use of errors.Is for
recursive unwrapped matching.
- Rename InventoryPolicyFilter to InventoryPolicyPruneFilter for
consistency with InventoryPolicyApplyFilter
- Update deletion prevention inventory-id removal to use errors.As instead
of matching the filter name.
- Convert error structs to use pointers to allow nil errors and avoid
copying contents.
- Update printers to handle skip errors
BREAKING CHANGE: Skipped actuation events now include an error.
BREAKING CHANGE: DeleteEvent.Reason replaced with an error.
BREAKING CHANGE: Unused InventoryNamespaceInSet error removed.
BREAKING CHANGE: InventoryOverlapError replaced with PolicyPreventedActuationError.
BREAKING CHANGE: NeedAdoptionError replaced with PolicyPreventedActuationError.
BREAKING CHANGE: NoInventoryObjError & MultipleInventoryObjError now use pointers.
- Stress test tests 1,000 Namespaces, ConfigMaps, & CronTabs (CR)
- Stress test is a new test suite with its own make entrypoint
- Refactor shared test code so the e2e and stress tests can both use it
- Update test client QPS to 20 (from 5)
- StatusPolicyNone disables inventory status updates.
- StatusPolicyAll fully enables inventory status updates.
- This allows an opt-out feature for working around the problem
that adding status can make the inventory larger than the max
etcd object size, causing the applier to exit without applying
or pruning anything. With StatusPolicyNone, the user can still
safely prune objects to make their inventory smaller, and then
re-enable the status with StatusPolicyAll.
- Note: the default ConfigMap does not currently support status,
so this only affects custom inventory impls.
- Pass TaskContext into TaskBuilder.Build
- Combine dependency graph for apply and prune objects.
This is required to catch dependencies that would have been deleted.
- Replace graph.SortObjs with DependencyGraph + Sort + HydrateSetList
- Replace graph.ReverseSortObjs with ReverseSetList to perform on the
combined (apply + prune) set list.
- Add planned pending applies and prunes to the InventoryManager
before executing the task queue.
This allows the DependencyFilter to validate against the planned
actuation strategy of objects that haven't been applied/pruned yet.
- Add the dependency graph to the TaskContext, for the
DependencyFilter to use.
This can be removed in the future if the filters are managed by the
solver.
- Make Graph.Sort non-destructive, so the graph can be re-used by the
DependencyFilter.
- Add Graph.EdgesFrom and EdgesTo for the DependencyFilter to use.
This requires storing the reverse edge list.
- Add an e2e test for the DependencyFilter
- Add an e2e test for the LocalNamespaceFilter
Fixes https://github.com/kubernetes-sigs/cli-utils/issues/526
Fixes https://github.com/kubernetes-sigs/cli-utils/issues/528
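The graph changes in this commit (a stored reverse edge list, EdgesFrom/EdgesTo, and a Sort that leaves the graph intact) can be illustrated with the sketch below. It is a simplified stand-in keyed by strings, not the library's Graph type, and the layered topological sort is one possible approach.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// Graph is a hypothetical dependency graph. An edge from -> to means
// "from depends on to".
type Graph struct {
	edges        map[string]map[string]bool // forward edge list
	reverseEdges map[string]map[string]bool // reverse edge list
}

func NewGraph() *Graph {
	return &Graph{
		edges:        map[string]map[string]bool{},
		reverseEdges: map[string]map[string]bool{},
	}
}

func (g *Graph) AddVertex(v string) {
	if _, ok := g.edges[v]; !ok {
		g.edges[v] = map[string]bool{}
		g.reverseEdges[v] = map[string]bool{}
	}
}

func (g *Graph) AddEdge(from, to string) {
	g.AddVertex(from)
	g.AddVertex(to)
	g.edges[from][to] = true
	g.reverseEdges[to][from] = true
}

func sortedKeys(m map[string]bool) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// EdgesFrom lists the dependencies of v; EdgesTo lists its dependents.
func (g *Graph) EdgesFrom(v string) []string { return sortedKeys(g.edges[v]) }
func (g *Graph) EdgesTo(v string) []string   { return sortedKeys(g.reverseEdges[v]) }

// Sort returns sets of vertices in apply order (dependencies first).
// It works on a copy of the dependency counts, so the graph is unchanged.
func (g *Graph) Sort() ([][]string, error) {
	remaining := make(map[string]int, len(g.edges))
	for v, deps := range g.edges {
		remaining[v] = len(deps)
	}
	var sets [][]string
	for len(remaining) > 0 {
		var ready []string
		for v, n := range remaining {
			if n == 0 {
				ready = append(ready, v)
			}
		}
		if len(ready) == 0 {
			return nil, errors.New("cyclic dependency detected")
		}
		sort.Strings(ready)
		for _, v := range ready {
			delete(remaining, v)
			for dependent := range g.reverseEdges[v] {
				remaining[dependent]--
			}
		}
		sets = append(sets, ready)
	}
	return sets, nil
}

func main() {
	g := NewGraph()
	g.AddEdge("deployment", "namespace")
	g.AddEdge("configmap", "namespace")
	sets, err := g.Sort()
	fmt.Println(sets, err)
	fmt.Println(g.EdgesTo("namespace"))
}
```

Reversing the returned sets gives a prune order, and because Sort does not consume the graph, a filter can still query EdgesFrom and EdgesTo afterwards.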
If the inventory object supports a status field, it is updated
after setting the inventory list at the end of an apply process.
BREAKING CHANGE: Update the inventory client and inventory interfaces
to pass the apply/reconcile status.
- Add a new Inventory KRM object for storing the spec and status
of the inventory objects in memory.
- Improve reconcile, apply, & delete status tracking in the
TaskContext/Inventory to cover all possible statuses
- Move most of the convenience methods from the TaskContext into a
new inventory.Manager.
- Fix a minor bug where object UID might have drifted (delete &
recreate) between GET and DELETE.
- Add ValidationPolicy:
ExitEarly (default) - exit before apply/prune if any objects are invalid
SkipInvalid - apply/prune valid objects & skip invalid ones
- Add ValidationEvent to be sent once for every invalid object or set of
objects, when the SkipInvalid policy is selected. For ExitEarly,
return an error for backward compatibility.
- Add validation.Collector to simplify aggregating validation errors
from multiple sources and extracting invalid object IDs.
- Add invalid objects to the TaskContext so they can be retained in
the inventory (only if already present). This primarily applies to
invalid annotations and dependencies. Objects without name or kind
should never be added to the inventory.
- Update Solver to use validation.Collector and filter invalid objects.
- Add e2e test for invalid objects.
- Update Printers to handle ValidationEvent
- Add ExternalDependencyError & InvalidAnnotationError to make it easier
to handle and introspect validation errors.
- Move Validator to pkg/object/validation
- Replace ValidationError with validation.Error
- Replace MultiValidationError with generic MultiError
- Update Validator & SortObjs to use MultiError
- Add ResourceReferenceFromObjMetadata
- Rename NewResourceReference -> ResourceReferenceFromUnstructured
- Delete duplicate ResourceReference.ObjMetadata()
- Modify some error messages for consistency and clarity
- Use templating to generate some test artifacts
BREAKING CHANGE: apply-time-mutation namespace required for namespace-scoped resources
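The difference between the two validation policies, and the role of a collector that aggregates errors and remembers invalid objects, can be sketched as below. All types and functions here (`Policy`, `Object`, `Collector`, `validate`, `filter`) are simplified, hypothetical stand-ins, and the toy validator only checks kind and name.

```go
package main

import (
	"errors"
	"fmt"
)

// Policy is a hypothetical stand-in for the validation policy.
type Policy int

const (
	ExitEarly   Policy = iota // fail before actuation if anything is invalid
	SkipInvalid               // actuate valid objects, skip invalid ones
)

// Object is a minimal, comparable object identity.
type Object struct {
	Kind, Name string
}

// validate is a toy validator that requires kind and name.
func validate(o Object) error {
	if o.Kind == "" {
		return errors.New("kind is required")
	}
	if o.Name == "" {
		return errors.New("name is required")
	}
	return nil
}

// Collector aggregates validation errors and tracks invalid objects.
type Collector struct {
	Errors  []error
	Invalid map[Object]bool
}

// Collect records err against o. Nil errors are ignored.
func (c *Collector) Collect(o Object, err error) {
	if err == nil {
		return
	}
	if c.Invalid == nil {
		c.Invalid = map[Object]bool{}
	}
	c.Errors = append(c.Errors, fmt.Errorf("invalid object %q/%q: %w", o.Kind, o.Name, err))
	c.Invalid[o] = true
}

// filter applies the policy: ExitEarly returns an error if any object is
// invalid, SkipInvalid returns only the valid objects.
func filter(objs []Object, policy Policy) ([]Object, error) {
	c := &Collector{}
	for _, o := range objs {
		c.Collect(o, validate(o))
	}
	if len(c.Errors) == 0 {
		return objs, nil
	}
	if policy == ExitEarly {
		return nil, fmt.Errorf("%d invalid object(s), first: %w", len(c.Errors), c.Errors[0])
	}
	var valid []Object
	for _, o := range objs {
		if !c.Invalid[o] {
			valid = append(valid, o)
		}
	}
	return valid, nil
}

func main() {
	objs := []Object{{Kind: "ConfigMap", Name: "a"}, {Kind: "", Name: "b"}}
	fmt.Println(filter(objs, ExitEarly))
	fmt.Println(filter(objs, SkipInvalid))
}
```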
- Using field.Error allows more errors to be wrapped as a
ValidationError, instead of interrupting validation and exiting
early.
- Add explicit validation for kind (clearer error).
- Move NestedField from testutil to object, so it can be used at
runtime.