222 Commits

Author SHA1 Message Date
Bartłomiej Wróblewski 2c7d8dc378 Rewrite TestCloudProvider to use builder pattern 2025-05-23 12:42:15 +00:00
Norbert Cyran 9a5e3d9f3d Allow using scheduled pods as samples in proactive scale up 2025-03-19 12:33:39 +01:00
Daniel Kłobuszewski bac35046fb Fix incorrect usage of klog Warningf function
The .*f variants should only ever be called with arguments to format.
This should've really been a part of
https://github.com/kubernetes/autoscaler/pull/7917
2025-03-13 13:50:39 +01:00
Kubernetes Prow Robot 5e7a559aa8
Merge pull request #7841 from pmendelski/force-ds-fix2
Force system-node-critical daemon-sets in node group templates
2025-03-06 10:49:45 -08:00
Maciej Skoczeń 90eabc6a4d Differentiate provisioning requests using Parameters field. Keep prefixing as a non-recommended approach 2025-03-04 11:41:51 +00:00
Maciej Skoczeń 7115527077 Allow prefixing provisioningClassName to filter provisioning requests 2025-03-04 10:48:21 +00:00
mendelski 4f58055eeb
Skip to-be-deleted nodes from template candidates 2025-02-17 18:10:36 +00:00
Justyna Betkier b8db30c2fb Improve events when max total nodes of the cluster is reached.
- log cluster wide event - previous event would never get fired because
  the estimators would already cap the options they generate and
  additionally it would fire once and events are kept only for some time
- log per pod event explaining why the scale up is not triggered
  (previously it would either get no scale up because no matching group
  or it would not get an event at all)

This required adding a list of pods that were unschedulable to the
status in case when the max total nodes were reached.
2025-02-12 13:24:51 +01:00
Justyna Betkier 86ee2b723a Improve logging when the cluster reaches max nodes total.
- add autoscaling status to reflect that
- change the log severity to warning as this means that autoscaler will
  not be fully functional (in particular scaling up will not work)
- fix the scale up enforcer logic not to skip the max nodes reached
  logging point
2025-01-29 13:37:48 +01:00
Maciej Skoczeń d7c325abf7 Enforce provisioning requests processing even if all pods are new 2025-01-10 13:08:56 +00:00
Maciej Skoczeń 39882551f7 Parallelize cluster snapshot creation 2025-01-03 10:35:11 +00:00
Kubernetes Prow Robot 7b648361c3
Merge pull request #7613 from walidghallab/err
Refactor NewAutoscalerError function.
2024-12-17 13:48:53 +01:00
Walid Ghallab 720f5946fd Refactor NewAutoscalerError function.
We will have two functions instead of one:
1. One that doesn't do formatting, like klog.Error
2. One that accepts formatting, like klog.Errorf

The main reason behind this is to avoid go vet errors and to have clear
interfaces, so that accidental bugs are caught by go vet (or by go test
in Go 1.24, where those findings are treated as errors).
2024-12-16 17:46:40 +00:00
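
The split described in this commit message can be sketched as follows. This is a minimal illustration, not the project's code: the type `autoscalerError`, its `kind` field, and both constructor names are hypothetical. The point is that the plain variant stores its message verbatim, while only the `f` variant interprets formatting verbs.

```go
package main

import "fmt"

// autoscalerError is a hypothetical stand-in for the project's error type.
type autoscalerError struct {
	kind string
	msg  string
}

// Error implements the error interface.
func (e autoscalerError) Error() string { return e.kind + ": " + e.msg }

// newAutoscalerError stores msg verbatim, like klog.Error.
// A '%' in msg has no special meaning here.
func newAutoscalerError(kind, msg string) autoscalerError {
	return autoscalerError{kind: kind, msg: msg}
}

// newAutoscalerErrorf formats its arguments with fmt.Sprintf, like klog.Errorf.
// Because it forwards format and args to fmt.Sprintf, go vet can check
// its call sites the same way it checks fmt.Sprintf itself.
func newAutoscalerErrorf(kind, format string, args ...any) autoscalerError {
	return autoscalerError{kind: kind, msg: fmt.Sprintf(format, args...)}
}
```

With two entry points, a message that merely happens to contain a percent sign goes through `newAutoscalerError` and is never treated as a format string, while call sites that do format go through `newAutoscalerErrorf`.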
Maciej Skoczeń 2426d7f836 Don't accept ProvisioningRequest twice when checkCapacityBatchProcessing enabled 2024-12-16 09:57:18 +00:00
Kubernetes Prow Robot 37b3da4e79
Merge pull request #7529 from towca/jtuznik/dra-prep
CA: prepare for DRA integration
2024-12-09 17:14:03 +00:00
Kuba Tużnik 466f94b780 CA: extend ClusterSnapshotStore to allow storing, retrieving and modifying DRA objects
A new DRA Snapshot type is introduced, for now with just dummy methods
to be implemented in later commits. The new type is intended to hold all
DRA objects in the cluster.

ClusterSnapshotStore.SetClusterState() is extended to take the new DRA Snapshot in
addition to the existing parameters.

ClusterSnapshotStore.DraSnapshot() is added to retrieve the DRA snapshot set by
SetClusterState() back. This will be used by PredicateSnapshot to implement DRA
logic later.

This should be a no-op, as DraSnapshot() is never called, and no DRA
snapshot is passed to SetClusterState() yet.
2024-12-09 17:14:45 +01:00
Kubernetes Prow Robot 52dd6d7488
Merge pull request #7561 from gabesaba/check_capacity_parallel
[CheckCapacity] Update Conditions in Parallel
2024-12-05 14:00:00 +00:00
Gabe 5877f9670f [CheckCapacity] Set Provisioned/Accepted in parallel 2024-12-05 12:58:54 +00:00
Kuba Tużnik 6876289228 CA: remove PredicateChecker, use the new ClusterSnapshot methods instead 2024-12-04 14:33:51 +01:00
Kuba Tużnik 0ace148d3d CA: rename BasicClusterSnapshot and DeltaClusterSnapshot to reflect the ClusterSnapshotStore change 2024-12-04 14:33:51 +01:00
Kuba Tużnik 67773a5509 CA: move BasicClusterSnapshot and DeltaClusterSnapshot to a dedicated subpkg 2024-12-04 14:33:51 +01:00
Kuba Tużnik 540725286f CA: migrate the codebase to use PredicateSnapshot 2024-12-04 14:33:51 +01:00
Kuba Tużnik a35f830f1d CA: extract a Handle to schedulerframework.Framework out of PredicateChecker
This decouples PredicateChecker from the Framework initialization logic,
and allows creating multiple PredicateChecker instances while only
initializing the framework once.

This commit also fixes how CA integrates with Framework metrics. Instead
of Registering them they're only Initialized so that CA doesn't expose
scheduler metrics. And the initialization is moved from multiple
different places to the Handle constructor.
2024-12-03 16:47:54 +01:00
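
The decoupling described in this commit message can be pictured with a small sketch. All names here (`frameworkHandle`, `predicateChecker`, `initCount`, the plugin names) are hypothetical placeholders, not the real Cluster Autoscaler or scheduler types. The expensive initialization lives in the handle constructor, and any number of checkers can share one handle.

```go
package main

// frameworkHandle stands in for an expensive-to-initialize framework.
type frameworkHandle struct {
	plugins []string
}

// initCount records how many times the framework was initialized,
// purely so the sketch can show that it happens once.
var initCount int

// newFrameworkHandle performs the one-time initialization.
func newFrameworkHandle() *frameworkHandle {
	initCount++
	return &frameworkHandle{plugins: []string{"plugin-a", "plugin-b"}}
}

// predicateChecker only borrows the handle; it never initializes
// the framework itself.
type predicateChecker struct {
	handle *frameworkHandle
}

// newPredicateChecker is cheap, so many checkers can be created
// from a single handle.
func newPredicateChecker(h *frameworkHandle) *predicateChecker {
	return &predicateChecker{handle: h}
}
```

Creating the handle once and passing it to every `newPredicateChecker` call keeps initialization in a single place, which matches the intent stated in the message.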
Kuba Tużnik eb26816ce9 CA: refactor utils related to NodeInfos
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate,
and scheduler_utils.DeepCopyTemplateNode all had very similar logic
for sanitizing and copying NodeInfos. They're all consolidated to
one file in simulator, sharing common logic.

DeepCopyNodeInfo is changed to be a framework.NodeInfo method.

MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to
correlate Nodes to scheduled pods, instead of using a live Pod lister.
This means that the snapshot now has to be properly initialized in a
bunch of tests.
2024-11-27 12:51:30 +01:00
Kuba Tużnik a81aa5c616 CA: remove AddNode from ClusterSnapshot
AddNodeInfo already provides the same functionality, and has to be used
in production code in order to propagate DRA objects correctly.

Uses in production are replaced with SetClusterState(), which will later
take DRA objects into account. Uses in the test code are replaced with
AddNodeInfo().
2024-11-19 15:28:16 +01:00
Kuba Tużnik 879c6a84a4 DRA: migrate all of CA to use the new internal NodeInfo/PodInfo
The new wrapper types should behave like the direct schedulerframework
types for most purposes, so most of the migration is just changing
the imported package.

Constructors look a bit different, so they have to be adapted -
mostly in test code. Accesses to the Pods field have to be changed
to a method call.

After this, the schedulerframework types are only used in the new
wrappers, and in the parts of simulator/ that directly interact with
the scheduler framework. The rest of CA codebase operates on the new
wrapper types.
2024-11-05 16:43:43 +01:00
Omran f945fc4add
Modify scale down set processor to add reasons to unremovable nodes 2024-10-29 10:28:37 +00:00
Devansh Das d73bdb1902 Implement unit tests for batch processing of check capacity class 2024-10-24 21:14:55 +00:00
Bartłomiej Wróblewski 068ce78272 Register scheduler metrics 2024-10-23 16:47:34 +00:00
Devansh Das 1ce64e93d4 Add support for frequent loops when provisioningrequest is encountered in last iteration 2024-10-20 17:55:05 +00:00
Devansh Das 0946d851e7
Revert "Add support for frequent loops when provisioningrequest is encountered in last iteration" 2024-10-18 12:21:04 +02:00
Kubernetes Prow Robot 64a64322d4
Merge pull request #7376 from damikag/cleanup-remove-or-update-logs
Remove/update spamming logs
2024-10-16 13:37:03 +01:00
mendelski 4ef901cdbb
Synchronize access to scale-ups in AsyncNodeGroupInitializer 2024-10-16 11:32:35 +00:00
Devansh Das 0a64fb0c27 Add support for frequent loops when provisioningrequest is encountered in last iteration 2024-10-15 09:37:54 +00:00
Kubernetes Prow Robot 9a2e450164
Merge pull request #7310 from kawych/htn
Remove an assumption that node initialization can be performed with a single 'targetSize' number input
2024-10-11 15:14:20 +01:00
Damika Gamlath e20e5e600b Remove spamming logs in compare_nodegroups.go and filter_out_daemon_sets.go
Change the log level and type of spamming logs in clusterstate.go and pre_filtering_processor.go
2024-10-10 08:48:24 +00:00
olagacek 44dcaa8cf3
Revert "CAS: cloudprovider-specific nodegroupset" 2024-10-04 12:54:22 +02:00
Mahmoud Atwa b185b14ea1 Report only injected pods after enforcing pod limit 2024-10-03 16:32:00 +00:00
Mahmoud Atwa 16688dcdbb Adds injection metrics for fake pod injection 2024-10-03 12:19:03 +00:00
Karol Wychowaniec 95ea94cf4e Remove an assumption that node initialization can be performed with a single 'targetSize' number input 2024-10-02 11:55:22 +00:00
Yaroslava Serdiuk 04b1402ddc
Add backoff mechanism for ProvReq retry (#7182)
* Add backoff mechanism for ProvReq retry

* Add flags for initial and max backoff time, and cache size

* Review remarks

* Add LRU cache

* Review remark
2024-09-23 09:16:00 +01:00
Omran 38ce500d5f
Fix scale up status processor overriding default one with proactive scaleup enabled 2024-09-12 18:26:56 +00:00
Yaroslava Serdiuk 93897d8d1b Delete old ProvReqs 2024-09-11 12:59:18 +00:00
Jack Francis 4ff4079041 cloudprovider-specific nodegroupset
Signed-off-by: Jack Francis <jackfrancis@gmail.com>
2024-09-06 10:09:40 -07:00
Devansh Das 6d5bfeb67c Add unit test to ensure unschedulable pods slice is not overwritten by injector 2024-09-06 14:10:17 +00:00
Devansh Das 835b79bfce Subdivide provision method 2024-09-06 10:26:38 +00:00
Walid Ghallab a91c771f37 Fix nil pointer check in nodegroup_manager.go 2024-09-02 13:01:49 +00:00
Kubernetes Prow Robot 9226cf6bb2
Merge pull request #7145 from abdelrahman882/proactive-scaleup
Add proactive scaleup
2024-08-26 16:40:14 +01:00
Omran 01e943304a
Add proactive scaleup 2024-08-23 21:59:15 +00:00
Kubernetes Prow Robot 70f0bcbca9
Merge pull request #7195 from aleksandra-malinowska/prov-req-api-v1-5
ProvisioningRequest v1 client
2024-08-23 17:07:53 +01:00