Commit Graph

255 Commits

Author SHA1 Message Date
Yaroslava Serdiuk dffff4f557
Add ProvisioningRequest injector (#6529)
* Add ProvisioningRequests injector

* Add test case for Accepted conditions and add supported provreq classes list

* Use Passive clock
2024-02-28 02:14:49 -08:00
Mahmoud Atwa e7ff1cd90f Introduce LocalSSDSizeProvider interface for GCE 2024-02-16 09:10:27 +00:00
Yaroslava Serdiuk 5286b3f770
Add ProvisioningRequestProcessor (#6488) 2024-02-14 05:14:46 -08:00
Yaroslava Serdiuk ed6ebbe8ba
ScaleUp for check-capacity ProvisioningRequestClass (#6451)
* ScaleUp for check-capacity ProvisioningRequestClass

* update condition logic

* Update tests

* Naming update

* Update cluster-autoscaler/core/scaleup/orchestrator/wrapper_orchestrator_test.go

Co-authored-by: Bartek Wróblewski <bwroblewski@google.com>

---------

Co-authored-by: Bartek Wróblewski <bwroblewski@google.com>
2024-01-30 02:36:59 -08:00
vadasambar 5de49a11fb feat: support `--scale-down-delay-after-*` per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: update scale down status after every scale up
- move scaledown delay status to cluster state/registry
- enable scale down if  `ScaleDownDelayTypeLocal` is enabled
- add new funcs on cluster state to get and update scale down delay status
- use timestamp instead of booleans to track scale down delay status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use existing fields on clusterstate
- uses `scaleUpRequests`, `scaleDownRequests` and `scaleUpFailures` instead of `ScaleUpDelayStatus`
- changed the above existing fields a little to make them more convenient for use
- moved initializing scale down delay processor to static autoscaler (because clusterstate is not available in main.go)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove note saying only `scale-down-after-add` is supported
- because we are supporting all the flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: evaluate `scaleDownInCooldown` the old way only if `ScaleDownDelayTypeLocal` is set to `false`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove line saying `--scale-down-delay-type-local` is only supported for `--scale-down-delay-after-add`
- because it is not true anymore
- we are supporting all `--scale-down-delay-after-*` flags per nodegroup
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix clusterstate tests failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: move back initializing processors logic to from static autoscaler to main
- we don't want to initialize processors in static autoscaler because anyone implementing an alternative to static_autoscaler has to initialize the processors
- and initializing specific processors is making static autoscaler aware of an implementation detail which might not be the best practice
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: revert changes related to `clusterstate`
- since I am going with observer pattern
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add observer interface for state of scaling
- to implement observer pattern for tracking state of scale up/downs (as opposed to using clusterstate to do the same)
- refactor `ScaleDownCandidatesDelayProcessor` to use fields from the new observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove params passed to `clearScaleUpFailures`
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: revert clusterstate tests
- approach has changed
- I am not making any changes in clusterstate now
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add accidentally deleted lines for clusterstate test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: implement `Add` fn for scale state observer
- to easily add new observers
- re-word comments
- remove redundant params from `NewDefaultScaleDownCandidatesProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: CI complaining because no comments on fn definitions
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: initialize parent `ScaleDownCandidatesProcessor`
- instead  of `ScaleDownCandidatesSortingProcessor` and `ScaleDownCandidatesDelayProcessor` separately
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add scale state notifier to list of default processors
- initialize processors for `NewDefaultScaleDownCandidatesProcessor` outside and pass them to the fn
- this allows more flexibility
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add observer interface
- create a separate observer directory
- implement `RegisterScaleUp` function in the clusterstate
- TODO: resolve syntax errors
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: use `scaleStateNotifier` in place of `clusterstate`
- delete leftover `scale_stateA_observer.go` (new one is already present in `observers` directory)
- register `clustertstate` with `scaleStateNotifier`
- use `Register` instead of `Add` function in `scaleStateNotifier`
- fix `go build`
- wip: fixing tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix syntax errors
- add utils package `pointers` for converting `time` to pointer (without having to initialize a new variable)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip track scale down failures along with scale up failures
- I was tracking scale up failures but not scale down failures
- fix copyright year 2017 -> 2023 for the new `pointers` package
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: register failed scale down with scale state notifier
- wip writing tests for `scale_down_candidates_delay_processor`
- fix CI lint errors
- remove test file for `scale_down_candidates_processor` (there is not much to test as of now)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add unit tests for `ScaleDownCandidatesDelayProcessor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't track scale up failures in `ScaleDownCandidatesDelayProcessor`
- not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: better doc comments for `TestGetScaleDownCandidates`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't ignore error in `NGChangeObserver`
- return it instead and let the caller decide what to do with it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change pointers to values in `NGChangeObserver` interface
- easier to work with
- remove `expectedAddTime` param from `RegisterScaleUp` (not needed for now)
- add tests for clusterstate's `RegisterScaleUp`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: conditions in `GetScaleDownCandidates`
- set scale down in cool down if the number of scale down candidates is 0
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: use `ng1` instead of `ng2` in existing test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: assign directly instead of using `sdProcessor` variable
- variable is not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: first working test for static autoscaler
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: continue working on static autoscaler tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip second static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `Println` used for debugging
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add static_autoscaler tests for scale down delay per nodegroup flags
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: rebase off the latest `master`
- change scale state observer interface's `RegisterFailedScaleup` to reflect latest changes around clusterstate's `RegisterFailedScaleup` in `master`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix clusterstate test failing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix failing orchestrator test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `defaultScaleDownCandidatesProcessor` -> `combinedScaleDownCandidatesProcessor`
- describes the processor better
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: replace `NGChangeObserver` -> `NodeGroupChangeObserver`
- makes it easier to understand for someone not familiar with the codebase
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: reword code comment `after` -> `for which`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't return error from `RegisterScaleDown`
- not needed as of now (no implementer function returns a non-nil error for this function)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address review comments around ng change observer interface
- change dir structure of nodegroup change observer package
- stop returning errors wherever it is not needed in the nodegroup change observer interface
- rename `NGChangeObserver` -> `NodeGroupChangeObserver` interface (makes it easier to understand)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: make nodegroupchange observer thread-safe
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add TODO to consider using multiple mutexes in nodegroupchange observer
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `time.Now()` directly instead of assigning a variable to it
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: share code for checking if there was a recent scale-up/down/failure
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: convert `ScaleDownCandidatesDelayProcessor` into table tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: change scale state notifier's `Register()` -> `RegisterForNotifications()`
- makes it easier to understand what the function does
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: replace scale state notifier `Register` -> `RegisterForNotifications` in test
- to fix syntax errors since it is already renamed in the actual code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `clusterStateRegistry` from `delete_in_batch` tests
- not needed anymore since we have `scaleStateNotifier`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address PR review comments
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: add empty `RegisterFailedScaleDown` for clusterstate
- fix syntax error in static autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2024-01-11 21:46:42 +05:30
Kubernetes Prow Robot 0024f2540e
Merge pull request #6290 from faan11/master
Clarify Scale down rule in the documentation (FAQ.md)
2024-01-08 13:08:50 +01:00
faan11 6ec725638c docs: clarifies scale down operation by CA in FAQ.md and main.go
This commit clarifies the condition when a node can be scaled down by the Cluster Autoscaler (CA).
The changes updates the section and flag description in the FAQ.md and main.go files.
2024-01-07 23:36:53 +01:00
Yaroslava Serdiuk d29ffd03b9
Add ProvisioningRequestPodsFilter processor (#6386)
* Introduce ProvisioningRequestPodsFilter processor

* Review
2024-01-03 11:49:36 +01:00
Joachim Bartosik a5e540d5da Restore flags for setting QPS limit in CA
Partially undo #6274. I noticed that with this change CA get rate limited and
slows down significantly (especially during large scale downs).
2023-12-29 13:28:08 +00:00
Kubernetes Prow Robot fc48d5c052
Merge pull request #6139 from damikag/priority-evictor
Implement priority based evictor
2023-12-21 18:18:53 +01:00
damikag 9ffbea4408 implement priority based evictor and refactor drain logic 2023-12-21 16:57:05 +00:00
Kubernetes Prow Robot b95adf1fa9
Merge pull request #6246 from prashantrewar/deprecate-unused-flag
Deprecate unused node-autoprovisioning-enabled and max-autoprovisioned-node-group-count flags
2023-12-14 17:26:40 +01:00
Prashant Rewar cbebabb452 deprecate unused node-autoprovisioning-enabled and max-autoprovisioned-node-group-count flags
Signed-off-by: Prashant Rewar <108176843+prashantrewar@users.noreply.github.com>
2023-12-09 13:49:05 +05:30
Kubernetes Prow Robot 0a1d74f352
Merge pull request #6294 from vadasambar/refactor/kube-client
refactor(*): move getKubeClient to utils/kubernetes
2023-11-29 19:28:44 +01:00
qianlei.qianl ae18f05a61 refactor(*): move getKubeClient to utils/kubernetes
(cherry picked from commit b9f636d2ef)

Signed-off-by: qianlei.qianl <qianlei.qianl@bytedance.com>

refactor: move logic to create client to utils/kubernetes pkg
- expose `CreateKubeClient` as public function
- make `GetKubeConfig` into a private `getKubeConfig` function (can be exposed as a public function in the future if needed)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: CI failing because cloudproviders were not updated to use new autoscaling option fields
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: define errors as constants
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: pass kube client options by value
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2023-11-29 00:43:59 +05:30
Kubernetes Prow Robot e23e63a192
Merge pull request #5820 from vadasambar/kwok-poc
feat: implement `kwok` cloud provider
2023-11-27 15:15:57 +01:00
vadasambar cfbee9a4d6 feat: implement kwok cloudprovider
feat: wip implement `CloudProvider` interface boilerplate for `kwok` provider
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add builder for `kwok`
- add logic to scale up and scale down nodes in `kwok` provider
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip parse node templates from file
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add short README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: implement remaining things
- to get the provider in a somewhat working state
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add in-cluster `kwok` as pre-requisite in the README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: templates file not correctly marshalling into node list
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: `invalid leading UTF-8 octet` error during template parsing
- remove encoding using `gob`
- not required
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: use lister to get and list
- instead of uncached kube client
- add lister as a field on the provider and nodegroup struct
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: `did not find nodegroup annotation` error
- CA was thinking the annotation is not present even though it is
- fix a bug with parsing annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: CA node recognizing fake nodegroups
- add provider ID to nodes in the format `kwok:<node-name>`
- fix invalid `KwokManagedAnnotation`
- sanitize template nodes (remove `resourceVersion` etc.,)
- not sanitizing the node leads to error during creation of new nodes
- abstract code to get NG name into a separate function `getNGNameFromAnnotation`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: node not getting deleted
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add empty test file
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: add OWNERS file
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip kwok provider config
- add samples for static and dynamic template nodes
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip implement pulling node templates from cluster
- add status field to kwok provider config
- this is to capture how the nodes would be grouped by (can be annotation or label)
- use kwok provider config status to get ng name from the node template
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: syntax error in calling `loadNodeTemplatesFromCluster`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: first draft of dynamic node templates
- this allows node templates to be pulled from the cluster
- instead of having to specify static templates manually
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: syntax error
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: abstract out related code into separate files
- use named constants instead of hardcoded values
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: cleanup kwok nodes when CA is exiting
- so that the user doesn't have to cleanup the fake nodes themselves
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: return `nil` instead of err for `HasInstance`
- because there is no underlying cloud provider (hence no reason to return `cloudprovider.ErrNotImplemented`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: start working on tests for kwok provider config
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add `gpuLabelKey` under `nodes` field in kwok provider config
- fix validation for kwok provider config
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add motivation doc
- update README with more details
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: update kwok provider config example to support pulling gpu labels and types from existing providers
- still needs to be implemented in the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip update kwok provider config to get gpu label and available types
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: wip read gpu label and available types from specified provider
- add available gpu types in kwok provider config status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add validation for gpu fields in kwok provider config
- load gpu related fields in kwok provider config status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: implement `GetAvailableGPUTypes`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add support to install and uninstall kwok
- add option to disable installation
- add option to manually specify kwok release tag
- add future scope in readme
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add future scope 'evaluate adding support to check if kwok controller already exists'
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: vendor conflict and cyclic import
- remove support to get gpu config from the specified provider (can't be used because leads to cyclic import)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add a TODO 'get gpu config from other providers'
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `file` -> `configmap`
- load config and templates from configmap instead of file
- move `nodes` and `nodegroups` config to top level
- add helper to encode configmap data into `[]bytes`
- add helper to get current pod namespace
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add new options to the kwok provider config
- auto install kwok only if the version is >= v0.4.0
- add test for `GPULabel()`
- use `kubectl apply` way of installing kwok instead of kustomize
- add test for kwok helpers
- add test for kwok config
- inject service account name in CA deployment
- add example configmap for node templates and kwok provider config in CA helm chart
- add permission to create `clusterrolebinding` (so that kwok provider can create a clusterrolebinding with `cluster-admin` role and create/delete upstream manifests)
- update kwok provider sample configs
- update `README`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: update go.mod to use v1.28 packages
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: `go mod tidy` and `go mod vendor` (again)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: kwok installation code
- add functions to create and delete clusterrolebinding to create kwok resources
- refactor kwok install and uninstall fns
- delete manifests in the opposite order of install ]
- add cleaning up left-over kwok installation to future scope
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: nil ptr error
- add `TODO` in README for adding docs around kwok config fields
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove code to automatically install and uninstall `kwok`
- installing/uninstalling requires strong permissions to be granted to `kwok`
- granting strong permissions to `kwok` means granting strong permissions to the entire CA codebase
- this can pose a security risk
- I have removed the code related to install and uninstall for now
- will proceed after discussion with the community
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: run `go mod tidy` and `go mod vendor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: add permission to create nodes
- to fix permissions error for kwok provider
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add more unit tests
- add tests for kwok helpers
- fix and update kwok config tests
- fix a bug where gpu label was getting assigned to `kwokConfig.status.key`
- expose `loadConfigFile` -> `LoadConfigFile`
- throw error if templates configmap does not have `templates` key (value of which is node templates)
- finish test for `GPULabel()`
- add tests for `NodeGroupForNode()`
- expose `loadNodeTemplatesFromConfigMap` -> `LoadNodeTemplatesFromConfigMap`
- fix `KwokCloudProvider`'s kwok config was empty (this caused `GPULabel()` to return empty)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: abstract provider ID code into `getProviderID` fn
- fix provider name in test `kwok` -> `kwok:kind-worker-xxx`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: run `go mod vendor` and `go mod tidy
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs(cloudprovider/kwok): update info on creating nodegroups based on `hostname/label`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor(charts): replace fromLabelKey value `"kubernetes.io/hostname"` -> `"kwok-nodegroup"`
- `"kubernetes.io/hostname"` leads to infinite scale-up
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: support running CA with kwok provider locally
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use global informer factory
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `fromNodeLabelKey: "kwok-nodegroup"` in test templates
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: `Cleanup()` logic
- clean up only nodes managed by the kwok provider
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix/refactor: nodegroup creation logic
- fix issue where fake node was getting created which caused fatal error
- use ng annotation to keep track of nodegroups
- (when creating nodegroups) don't process nodes which don't have the right ng nabel
- suffix ng name with unix timestamp
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor/test(cloudprovider/kwok): write tests for `BuildKwokProvider` and `Cleanup`
- pass only the required node lister to cloud provider instead of the entire informer factory
- pass the required configmap name to `LoadNodeTemplatesFromConfigMap` instead of passing the entire kwok provider config
- implement fake node lister for testing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add test case for dynamic templates in `TestNodeGroupForNode`
- remove non-required fields from template node
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add tests for `NodeGroups()`
- add extra node template without ng selector label to add more variability in the test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: write tests for `GetNodeGpuConfig()`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add test for `GetAvailableGPUTypes`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add test for `GetResourceLimiter()`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): add tests for nodegroup's `IncreaseSize()`
- abstract error msgs into variables to use them in tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): add test for ng `DeleteNodes()` fn
- add check for deleting too many nodes
- rename err msg var names to make them consistent
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): add tests for ng `DecreaseTargetSize()`
- abstract error msgs into variables (for easy use in tests)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): add test for ng `Nodes()`
- add extra test case for `DecreaseTargetSize()` to check lister error
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): add test for ng `TemplateNodeInfo`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): improve tests for `BuildKwokProvider()`
- add more test cases
- refactor lister for `TestBuildKwokProvider()` and `TestCleanUp()`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): add test for ng `GetOptions`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(cloudprovider/kwok): unset `KWOK_CONFIG_MAP_NAME` at the end of the test
- not doing so leads to failure in other tests
- remove `kwokRelease` field from kwok config (not used anymore) - this was causing the tests to fail
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: bump CA chart version
- this is because of changes made related to kwok
- fix type `everwhere` -> `everywhere`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: fix linting checks
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: address CI lint errors
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: generate helm docs for `kwokConfigMapName`
- remove `KWOK_CONFIG_MAP_KEY` (not being used in the code)
- bump helm chart version
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: revise the outline for README
- add AEP link to the motivation doc
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: wip create an outline for the README
- remove `kwok` field from examples (not needed right now)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add outline for ascii gifs
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename env variable `KWOK_CONFIG_MAP_NAME` -> `KWOK_PROVIDER_CONFIGMAP`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: update README with info around installation and benefits of using kwok provider
- add `Kwok` as a provider in main CA README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: run `go mod vendor`
- remove TODOs that are not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: finish first draft of README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: env variable in chart `KWOK_CONFIG_MAP_NAME` -> `KWOK_PROVIDER_CONFIGMAP`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove redundant/deprecated code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: bump chart version `9.30.1` -> `9.30.2`
- because of kwok provider related changes
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: fix typo `offical` -> `official`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: remove debug log msg
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add links for getting help
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: fix type in log `external cluster` -> `cluster`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

chore: add newline in chart.yaml to fix CI lint
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: fix mistake `sig-kwok` -> `sig-scheduling`
- kwok is a part if sig-scheduling (there is no sig-kwok)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: fix type `release"` -> `"release"`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: pass informer instead of lister to cloud provider builder fn
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2023-11-25 00:22:47 +05:30
Kubernetes Prow Robot 39245a5613
Merge pull request #6235 from atwamahmoud/ignore-scheduler-processing
Ignore scheduler processing
2023-11-22 13:54:30 +01:00
Mahmoud Atwa a1ae4d3b57 Update flags, Improve tests readability & use Bypass instead of ignore in naming 2023-11-22 11:18:55 +00:00
Mahmoud Atwa 4635a6dc04 Allow users to specify which schedulers to ignore 2023-11-22 11:18:44 +00:00
Mahmoud Atwa 86ab017967 Fix multiple comments and update flags 2023-11-22 11:17:48 +00:00
Mahmoud Atwa a1ab7b9e20 Add new pod list processors for clearing TPU requests & filtering out
expendable pods

Treat non-processed pods yet as unschedulable
2023-11-22 11:16:33 +00:00
Artur Żyliński e836e47c1e Remove gce-expander-ephemeral-storage-support flag
Always enable the feature
2023-11-15 10:59:06 +01:00
Artur Żyliński 747d0b9af4 Cleanup: Remove separate client for k8s events
Remove RateLimiting options - replay on APF for apiserver protection.
Details: https://github.com/kubernetes/kubernetes/issues/111880
2023-11-14 11:20:36 +01:00
Artem Minyaylov ab4c5cb8c7 Initialize default drainability rules 2023-10-19 15:55:34 +00:00
Artem Minyaylov 324a33ede8 Pass DeleteOptions once during default rule creation 2023-10-10 20:35:49 +00:00
Artem Minyaylov a68b748fd7 Refactor NodeDeleteOptions for use in drainability rules 2023-09-29 17:55:19 +00:00
Kubernetes Prow Robot e461782e27
Merge pull request #6162 from pohly/log-init-fix
fix log initialization
2023-09-29 06:32:42 -07:00
Patrick Ohly b9c0d91da7 fix log initialization
InitLogs unconditionally disables contextual logging, while ValidateAndApply
checks the feature gate for that. InitLogs must come first, otherwise

    --feature-gates=ContextualLogging=true

doesn't work.
2023-09-29 14:36:02 +02:00
Eric Lin 74d1f7f349 Trim managedFields in shared informer factory
Signed-off-by: Eric Lin <exlin@google.com>
2023-09-28 12:45:47 +00:00
Mahmoud Atwa 1bbbbd6036 fix typo 2023-09-27 09:00:32 +00:00
Mahmoud Atwa 267476fd06 Rename variables & methods to StartupTaint... instead of IgnoreTaint 2023-09-26 13:54:52 +00:00
Mahmoud Atwa f9d3185f16 Rename IgnoreTaints to StartupTaints & deprecate --ignore-taints flag 2023-09-26 09:12:01 +00:00
Mahmoud Atwa 79a55bbb05 Add startup taint flag, prefix & add status taint prefix 2023-09-22 21:08:07 +00:00
Mahmoud Atwa d2fe118db9 Add startup taint flag, prefix & add status taint prefix 2023-09-22 20:51:34 +00:00
Patrick Ohly ade5e0814e fix incomplete startup of informers
Previously, SharedInformerFactory.Start was called before core.NewAutoscaler.
That had the effect that any new informer created as part of
core.NewAutoscaler, in particular in
kubernetes.NewListerRegistryWithDefaultListers, never got started.

One of them was the DaemonSet informer. This had the effect that the DaemonSet
lister had an empty cache and scale down failed with:

    I0920 11:06:36.046889   31805 cluster.go:164] node gke-cluster-pohly-default-pool-c9f60a43-5rvz cannot be removed: daemonset for kube-system/pdcsi-node-7hnmc is not present, err: daemonset.apps "pdcsi-node" not found

This was on a GKE cluster with cluster-autoscaler running outside of the
cluster on a development machine.
2023-09-20 11:20:35 +02:00
Kubernetes Prow Robot f9a7c7f73f
Merge pull request #6114 from linxiulei/cmd
Allow setting content-type in command
2023-09-19 06:23:07 -07:00
Eric Lin 61d784a662 Allow setting content-type in command
Signed-off-by: Eric Lin <exlin@google.com>
2023-09-19 08:58:59 +00:00
Youn Jae Kim 6cf41290b6 add elect-leader flag to the pflag 2023-09-15 14:35:56 -07:00
Eric Lin 0f02235e98 Use informer factory to reuse listers
Signed-off-by: Eric Lin <exlin@google.com>
2023-09-15 13:57:15 +00:00
Damika Gamlath aef5a2bffc Disable dynamic-node-delete-delay-after-taint-enabled by default 2023-09-11 13:20:08 +00:00
Kubernetes Prow Robot 9622a50f55
Merge pull request #6019 from damikag/fix-rc-ca-sched
Implement dynamically adjustment of NodeDeleteDelayAfterTaint based on round trip time between CA and api-server
2023-09-08 07:04:15 -07:00
Damika Gamlath 42691d3443 fix race condition between ca and scheduler
Implement dynamically adjustment of NodeDeleteDelayAfterTaint based on round trip time between ca and apiserver
2023-09-08 12:14:23 +00:00
Marc-Andre Dufresne 55c92e1025 add json logging support 2023-08-28 10:32:39 -04:00
Bartłomiej Wróblewski e39d1b028d Clean up NodeGroupConfigProcessor interface 2023-08-04 16:00:50 +00:00
Daniel Kłobuszewski 990cd6581c Enable parallel drain by default. 2023-07-21 17:57:19 +02:00
vadasambar 23f03e112e feat: support custom scheduler config for in-tree schedulr plugins (without extenders)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `--scheduler-config` -> `--scheduler-config-file` to avoid confusion
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: `goto` causing infinite loop
- abstract out running extenders in a separate function
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove code around extenders
- we decided not to use scheduler extenders for checking if a pod would fit on a node
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: move scheduler config to a `utils/scheduler` package`
- use default config as a fallback
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix static_autoscaler test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: `GetSchedulerConfiguration` fn
- remove falling back
- add mechanism to detect if the scheduler config file flag was set
-
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: wip add tests for `GetSchedulerConfig`
- tests are failing now
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add tests for `GetSchedulerConfig`
- abstract error messages so that we can use them in the tests
- set api version explicitly (this is what upstream does as well)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: do a round of cleanup to make PR ready for review
- make import names consistent
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: use `pflag` to check if the `--scheduler-config-file` flag was set
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add comments for exported error constants
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: don't export error messages
- exporting is not needed
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: add underscore in test file name
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix test failing because of no comment on exported `SchedulerConfigFileFlag`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refacotr: change name of flag variable `schedulerConfig` -> `schedulerConfigFile`
- avoids confusion
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add extra test cases for predicate checker
- where the predicate checker uses custom scheduler config
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: remove `setFlags` variable
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: abstract custom scheduler configs into `conifg` package
- make them constants
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: fix linting error
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: introduce a new custom test predicate checker
- instead of adding a param to the current one
- this is so that we don't have to pass `nil` to the existing test predicate checker in many places
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `NewCustomPredicateChecker` -> `NewTestPredicateCheckerWithCustomConfig`
- latter narrows down meaning of the function better than former
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `GetSchedulerConfig` -> `ConfigFromPath`
- `scheduler.ConfigFromPath` is shorter and feels less vague than `scheduler.GetSchedulerConfig`
- move test config to a new package `test` under `config` package
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add `TODO` for replacing code to parse scheduler config
- with upstream function
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2023-07-13 09:51:33 +05:30
Kubernetes Prow Robot da96d89b17
Merge pull request #5890 from Bryce-Soghigian/bsoghigian/respecting-bulk-delete
fix: setting maxEmptyBulkDelete, and maxScaleDownParallelism to be the same value
2023-07-12 09:45:12 -07:00
Kubernetes Prow Robot c6893e9e28
Merge pull request #5672 from vadasambar/feat/5399/ignore-daemonsets-utilization-per-nodegroup
feat: set `IgnoreDaemonSetsUtilization` per nodegroup for AWS
2023-07-12 07:43:12 -07:00
Damika Gamlath 0f8502c623 Refactor autoscaler.go and static_autoscalar.go to move declaration of the NodeDeletion option to main.go 2023-07-10 08:49:49 +00:00