The new wrapper types should behave like the direct schedulerframework
types for most purposes, so most of the migration is just changing
the imported package.
Constructors look a bit different, so they have to be adapted -
mostly in test code. Accesses to the Pods field have to be changed
to a method call.
After this, the schedulerframework types are only used in the new
wrappers, and in the parts of simulator/ that directly interact with
the scheduler framework. The rest of CA codebase operates on the new
wrapper types.
In order to simplify the deleteNodesWithErrors code, return nodeGroupID
as well as nodes with create errors. That way we avoid the additional
node group matching code.
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: update scale down status after every scale up
- move scaledown delay status to cluster state/registry
- enable scale down if `ScaleDownDelayTypeLocal` is enabled
- add new funcs on cluster state to get and update scale down delay status
- use timestamp instead of booleans to track scale down delay status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: use existing fields on clusterstate
- uses `scaleUpRequests`, `scaleDownRequests` and `scaleUpFailures` instead of `ScaleUpDelayStatus`
- changed the above existing fields a little to make them more convenient for use
- moved initializing scale down delay processor to static autoscaler (because clusterstate is not available in main.go)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: remove note saying only `scale-down-after-add` is supported
- because we are supporting all the flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: evaluate `scaleDownInCooldown` the old way only if `ScaleDownDelayTypeLocal` is set to `false`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: remove line saying `--scale-down-delay-type-local` is only supported for `--scale-down-delay-after-add`
- because it is not true anymore
- we are supporting all `--scale-down-delay-after-*` flags per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: fix clusterstate tests failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: move back initializing processors logic to from static autoscaler to main
- we don't want to initialize processors in static autoscaler because anyone implementing an alternative to static_autoscaler has to initialize the processors
- and initializing specific processors is making static autoscaler aware of an implementation detail which might not be the best practice
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: revert changes related to `clusterstate`
- since I am going with observer pattern
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: add observer interface for state of scaling
- to implement observer pattern for tracking state of scale up/downs (as opposed to using clusterstate to do the same)
- refactor `ScaleDownCandidatesDelayProcessor` to use fields from the new observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: remove params passed to `clearScaleUpFailures`
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: revert clusterstate tests
- approach has changed
- I am not making any changes in clusterstate now
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: add accidentally deleted lines for clusterstate test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: implement `Add` fn for scale state observer
- to easily add new observers
- re-word comments
- remove redundant params from `NewDefaultScaleDownCandidatesProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: CI complaining because no comments on fn definitions
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: initialize parent `ScaleDownCandidatesProcessor`
- instead of `ScaleDownCandidatesSortingProcessor` and `ScaleDownCandidatesDelayProcessor` separately
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: add scale state notifier to list of default processors
- initialize processors for `NewDefaultScaleDownCandidatesProcessor` outside and pass them to the fn
- this allows more flexibility
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: add observer interface
- create a separate observer directory
- implement `RegisterScaleUp` function in the clusterstate
- TODO: resolve syntax errors
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: use `scaleStateNotifier` in place of `clusterstate`
- delete leftover `scale_stateA_observer.go` (new one is already present in `observers` directory)
- register `clustertstate` with `scaleStateNotifier`
- use `Register` instead of `Add` function in `scaleStateNotifier`
- fix `go build`
- wip: fixing tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: fix syntax errors
- add utils package `pointers` for converting `time` to pointer (without having to initialize a new variable)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: wip track scale down failures along with scale up failures
- I was tracking scale up failures but not scale down failures
- fix copyright year 2017 -> 2023 for the new `pointers` package
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: register failed scale down with scale state notifier
- wip writing tests for `scale_down_candidates_delay_processor`
- fix CI lint errors
- remove test file for `scale_down_candidates_processor` (there is not much to test as of now)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: wip tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: add unit tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: don't track scale up failures in `ScaleDownCandidatesDelayProcessor`
- not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: better doc comments for `TestGetScaleDownCandidates`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: don't ignore error in `NGChangeObserver`
- return it instead and let the caller decide what to do with it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: change pointers to values in `NGChangeObserver` interface
- easier to work with
- remove `expectedAddTime` param from `RegisterScaleUp` (not needed for now)
- add tests for clusterstate's `RegisterScaleUp`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: conditions in `GetScaleDownCandidates`
- set scale down in cool down if the number of scale down candidates is 0
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: use `ng1` instead of `ng2` in existing test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: wip static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: assign directly instead of using `sdProcessor` variable
- variable is not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: first working test for static autoscaler
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: continue working on static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: wip second static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: remove `Println` used for debugging
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: add static_autoscaler tests for scale down delay per nodegroup flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: rebase off the latest `master`
- change scale state observer interface's `RegisterFailedScaleup` to reflect latest changes around clusterstate's `RegisterFailedScaleup` in `master`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: fix clusterstate test failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: fix failing orchestrator test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename `defaultScaleDownCandidatesProcessor` -> `combinedScaleDownCandidatesProcessor`
- describes the processor better
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: replace `NGChangeObserver` -> `NodeGroupChangeObserver`
- makes it easier to understand for someone not familiar with the codebase
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: reword code comment `after` -> `for which`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: don't return error from `RegisterScaleDown`
- not needed as of now (no implementer function returns a non-nil error for this function)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: address review comments around ng change observer interface
- change dir structure of nodegroup change observer package
- stop returning errors wherever it is not needed in the nodegroup change observer interface
- rename `NGChangeObserver` -> `NodeGroupChangeObserver` interface (makes it easier to understand)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: make nodegroupchange observer thread-safe
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: add TODO to consider using multiple mutexes in nodegroupchange observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: use `time.Now()` directly instead of assigning a variable to it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: share code for checking if there was a recent scale-up/down/failure
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: convert `ScaleDownCandidatesDelayProcessor` into table tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: change scale state notifier's `Register()` -> `RegisterForNotifications()`
- makes it easier to understand what the function does
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: replace scale state notifier `Register` -> `RegisterForNotifications` in test
- to fix syntax errors since it is already renamed in the actual code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: remove `clusterStateRegistry` from `delete_in_batch` tests
- not needed anymore since we have `scaleStateNotifier`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: address PR review comments
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: add empty `RegisterFailedScaleDown` for clusterstate
- fix syntax error in static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
Due to the dependency of the MaxNodeProvisionTimeProvider on the context
the provider was extracted to a dedicated package and injected to the
ClusterStateRegistry after context creation.
Without this, with aggressive settings, scale-down could be removing
registered upcoming nodes before they have a chance to become ready
(the duration of which should be unrelated to the scale-down settings).
This does make us call len() in a bunch of places within CSR, but allows
for greater flexibility - it's possible to act on the sets of nodes determined
by Readiness.
* Adding isNodeDeleted method to CloudProvider interface. Supports detecting whether nodes are fully deleted or are not-autoscaled. Updated cloud providers to provide initial implementation of new method that will return an ErrNotImplemented to maintain existing taint-based deletion clusterstate calculation.
Adding check to backfill loop to confirm cloud provider node no longer exists before flagging the node as deleted. Modifying some comments to be more accurate. Replacing erroneous line deletion.
This change simplifies debugging GPU issues: without it, all nodes can
be Ready as far as Kubernetes API is concerned, but CA will still report
some of them as unready if are missing GPU resource. Explicitly calling
them out in the status ConfigMap will point into the right direction.