The logic is very basic and will likely need to be revised, but it's
something for initial testing. Utilization of a given Pool is calculated
as the number of allocated devices in the pool divided by the number of
all devices in the pool. For scale-down purposes, the max utilization
of all Node-local Pools is used.
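For illustration, a minimal sketch of that calculation in Go (the poolStats type and the function names are hypothetical stand-ins, not the actual CA code):

    package drautil

    import "fmt"

    // poolStats is an illustrative stand-in for the real pool data.
    type poolStats struct {
        allocatedDevices int
        totalDevices     int
    }

    // poolUtilization is the number of allocated devices divided by the
    // number of all devices in the pool.
    func poolUtilization(p poolStats) (float64, error) {
        if p.totalDevices == 0 {
            return 0, fmt.Errorf("pool has no devices")
        }
        return float64(p.allocatedDevices) / float64(p.totalDevices), nil
    }

    // nodeUtilizationForScaleDown is the max utilization over all
    // Node-local pools.
    func nodeUtilizationForScaleDown(pools []poolStats) (float64, error) {
        highest := 0.0
        for _, p := range pools {
            u, err := poolUtilization(p)
            if err != nil {
                return 0, err
            }
            if u > highest {
                highest = u
            }
        }
        return highest, nil
    }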
The new logic is mostly behind the DRA flag guard, so this should be a no-op
if the flag is disabled. The only difference should be that FilterOutUnremovable
marks a Node as unremovable if calculating utilization fails. I'm not sure
why this wasn't the case before, but I think we need it for DRA - if CA sees an
incomplete picture of a resource pool, we probably don't want to scale
the Node down.
The Snapshot can hold all DRA objects in the cluster, and expose them
to the scheduler framework via the SharedDRAManager interface.
The state of the objects can be modified during autoscaling simulations
using the provided methods.
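A purely conceptual sketch of that idea - the names below are made up rather than the real Snapshot/SharedDRAManager APIs; the point is just that the snapshot owns the DRA object state and simulations mutate it through dedicated methods:

    package drasketch

    // claimState is a stand-in for the tracked ResourceClaim status.
    type claimState struct {
        allocated   bool
        reservedFor []string // pod UIDs holding a reservation
    }

    // draSnapshot holds all DRA objects known to the autoscaler, keyed by
    // namespace/name (only claims shown here).
    type draSnapshot struct {
        claims map[string]*claimState
    }

    // reserveClaim records a simulated allocation/reservation for a pod.
    func (s *draSnapshot) reserveClaim(key, podUID string) {
        c, ok := s.claims[key]
        if !ok {
            return
        }
        c.allocated = true
        c.reservedFor = append(c.reservedFor, podUID)
    }

    // unreserveClaim reverts the reservation when the simulated pod is removed.
    func (s *draSnapshot) unreserveClaim(key, podUID string) {
        c, ok := s.claims[key]
        if !ok {
            return
        }
        for i, uid := range c.reservedFor {
            if uid == podUID {
                c.reservedFor = append(c.reservedFor[:i], c.reservedFor[i+1:]...)
                break
            }
        }
        if len(c.reservedFor) == 0 {
            c.allocated = false
        }
    }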
If a hinted Node is no longer in the cluster snapshot (e.g. it was
a fake upcoming Node and the real one appeared), fall back to the
non-hinting logic instead of erroring out.
This was introduced in the recent PredicateChecker->PredicateSnapshot
refactor. Previously, PredicateChecker.CheckPredicates() would return
an error if the hinted Node was gone, and HintingSimulator treated
this error the same as failing predicates - it would move on to the
non-hinting logic. After the refactor, HintingSimulator explicitly
errors out if it can't retrieve the hinted Node from the snapshot,
so the behavior changed.
I checked other CheckPredicates()/SchedulePod() callsites, and this is
the only one where ignoring the missing Node makes sense. For the others,
the Node is added to the snapshot just before the call, so it being
missing should cause an error.
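Schematically, the restored behavior looks like this (the types, names, and error sentinels are hypothetical, not the real HintingSimulator code):

    package hintsketch

    import "errors"

    // Hypothetical sentinels standing in for the real error kinds.
    var (
        errNodeMissing      = errors.New("hinted node not in snapshot")
        errFailedPredicates = errors.New("predicates failed")
    )

    type snapshot interface {
        SchedulePod(pod, nodeName string) error
    }

    // scheduleWithHint tries the hinted Node first; if that Node is missing
    // from the snapshot or predicates fail, it falls back to the non-hinting
    // logic. Any other error is propagated.
    func scheduleWithHint(s snapshot, pod, hintedNode string, nonHinting func(pod string) (string, error)) (string, error) {
        if hintedNode != "" {
            switch err := s.SchedulePod(pod, hintedNode); {
            case err == nil:
                return hintedNode, nil // the hint is still valid
            case errors.Is(err, errNodeMissing), errors.Is(err, errFailedPredicates):
                // Fall through to the non-hinting logic below.
            default:
                return "", err // unexpected error
            }
        }
        return nonHinting(pod)
    }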
RunReserveOnNode runs the Reserve phase of schedulerframework,
which is necessary to obtain ResourceClaim allocations computed
by the DRA scheduler plugin.
RunReserveOnNode isn't used anywhere yet, so this should be a no-op.
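A hedged sketch of what the method could boil down to - the framework handle and cycle state plumbing is elided, exact scheduler framework import paths and signatures may differ between versions, and the wrapper itself is CA-specific:

    package reservesketch

    import (
        "context"
        "fmt"

        v1 "k8s.io/api/core/v1"
        schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
    )

    // runReserveOnNode runs the Reserve plugins for pod on nodeName. For DRA,
    // this is where the dynamicresources plugin records the ResourceClaim
    // allocations it computed, so the autoscaler can read them back.
    func runReserveOnNode(ctx context.Context, fwk schedulerframework.Framework, state *schedulerframework.CycleState, pod *v1.Pod, nodeName string) error {
        status := fwk.RunReservePluginsReserve(ctx, state, pod, nodeName)
        if !status.IsSuccess() {
            // Roll back anything the plugins may have reserved so far.
            fwk.RunReservePluginsUnreserve(ctx, state, pod, nodeName)
            return fmt.Errorf("couldn't reserve pod %s/%s on node %s: %v", pod.Namespace, pod.Name, nodeName, status.Message())
        }
        return nil
    }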
All the NodeInfo methods have to take DRA into account, and the logic
for that will be the same for different ClusterSnapshotStore implementations.
Instead of duplicating the new logic in Basic and Delta, the methods
are moved to ClusterSnapshot and the logic will be implemented once in
PredicateSnapshot.
PredicateSnapshot will use the DRA Snapshot exposed by its ClusterSnapshotStore
to implement these methods. The DRA Snapshot has to be stored in the
ClusterSnapshotStore layer, as we need to be able to fork/commit/revert it.
Lower-level methods for adding/removing just the schedulerframework.NodeInfo
parts are added to ClusterSnapshotStore. PredicateSnapshot utilizes these methods
to implement AddNodeInfo and RemoveNodeInfo.
This should be a no-op; it's just a refactor.
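The layering, sketched with hypothetical method names (only AddNodeInfo/RemoveNodeInfo correspond to real ClusterSnapshot methods):

    package layersketch

    // nodeInfo is a stand-in; the real type also carries pods and,
    // later, DRA objects.
    type nodeInfo struct {
        name string
    }

    // clusterSnapshotStore only knows how to store and drop bare scheduler
    // NodeInfos (plus the DRA snapshot, not shown here).
    type clusterSnapshotStore interface {
        AddSchedulerNodeInfo(ni *nodeInfo) error
        RemoveSchedulerNodeInfo(nodeName string) error
    }

    // predicateSnapshot builds the higher-level methods on top of the store,
    // wrapping the low-level calls with DRA handling.
    type predicateSnapshot struct {
        store clusterSnapshotStore
    }

    func (p *predicateSnapshot) AddNodeInfo(ni *nodeInfo) error {
        // The real code would first add the node's DRA objects (e.g. its
        // ResourceSlices) to the DRA snapshot, then store the NodeInfo.
        return p.store.AddSchedulerNodeInfo(ni)
    }

    func (p *predicateSnapshot) RemoveNodeInfo(nodeName string) error {
        // The real code would also drop the node's DRA objects from
        // the DRA snapshot.
        return p.store.RemoveSchedulerNodeInfo(nodeName)
    }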
Store the DRA snapshot inside the current internal data in
SetClusterState().
Retrieve the DRA snapshot from the current internal data in
DraSnapshot().
Clone the DRA snapshot whenever the internal data is cloned
during Fork(). This matches the forking logic that BasicSnapshotStore
uses, ensuring that the DRA object state is correctly
forked/committed/reverted during the corresponding ClusterSnapshot
operations.
This should be a no-op, as DraSnapshot() isn't called anywhere yet,
and no DRA snapshot is passed to SetClusterState() yet.
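Schematically (field and method names are made up), the forking works like this - because the DRA snapshot lives inside the per-fork data and is cloned with it, Commit/Revert handle it with no extra work:

    package forksketch

    // draSnapshot is a stand-in for the real DRA object maps.
    type draSnapshot struct {
        claims map[string]string
    }

    func (d draSnapshot) clone() draSnapshot {
        out := draSnapshot{claims: make(map[string]string, len(d.claims))}
        for k, v := range d.claims {
            out.claims[k] = v
        }
        return out
    }

    // internalData is one fork's worth of snapshot state.
    type internalData struct {
        // node/pod state elided
        draSnapshot draSnapshot
    }

    // snapshotStore keeps a stack of forks; the last element is the current
    // one. Guards for the base element are omitted for brevity.
    type snapshotStore struct {
        data []*internalData
    }

    func (s *snapshotStore) Fork() {
        cur := s.data[len(s.data)-1]
        s.data = append(s.data, &internalData{draSnapshot: cur.draSnapshot.clone()})
    }

    func (s *snapshotStore) Revert() {
        s.data = s.data[:len(s.data)-1]
    }

    func (s *snapshotStore) Commit() {
        // Replace the parent with the current fork.
        s.data = append(s.data[:len(s.data)-2], s.data[len(s.data)-1])
    }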
A new DRA Snapshot type is introduced, for now with just dummy methods
to be implemented in later commits. The new type is intended to hold all
DRA objects in the cluster.
ClusterSnapshotStore.SetClusterState() is extended to take the new DRA Snapshot in
addition to the existing parameters.
ClusterSnapshotStore.DraSnapshot() is added to retrieve the DRA snapshot
previously set by SetClusterState(). This will be used by PredicateSnapshot
to implement DRA logic later.
This should be a no-op, as DraSnapshot() is never called, and no DRA
snapshot is passed to SetClusterState() yet.
Make SharedDRAManager a part of the ClusterSnapshotStore interface, and
implement dummy methods to satisfy the interface. Actual implementation
will come in later commits.
This is needed so that ClusterSnapshot can feed DRA objects to the DRA
scheduler plugin, and obtain ResourceClaim modifications back from it.
The integration is behind the DRA flag guard, so this should be a no-op
if the flag is disabled.
Multiple tests can call NewHandle() concurrently, because of
t.Parallel(). NewHandle calls schedulermetrics.InitMetrics()
which modifies global variables, so there's a race.
Wrapped the schedulermetrics.InitMetrics() call in a sync.Once.Do()
so that it's only done once, in a thread-safe manner.
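In essence, the change looks like this (the NewHandle signature and the rest of its body are elided, and the package layout is simplified):

    package framework

    import (
        "sync"

        schedulermetrics "k8s.io/kubernetes/pkg/scheduler/metrics"
    )

    var initMetricsOnce sync.Once

    func NewHandle( /* ... */ ) {
        // InitMetrics mutates global state, so it must run at most once even
        // when tests construct handles concurrently under t.Parallel().
        initMetricsOnce.Do(func() {
            schedulermetrics.InitMetrics()
        })
        // ... rest of the handle construction ...
    }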
For DRA, this component will have to call the Reserve phase in addition
to just checking predicates/filters.
The new version also makes more sense in the context of
PredicateSnapshot, which is now its only user.
While refactoring, I noticed that CheckPredicates for some reason
doesn't check the provided Node against the eligible Nodes returned
from PreFilter (while FitsAnyNodeMatching does). This seems like
a bug, so the check is added.
The checks in FitsAnyNodeMatching are also reordered so that the
cheapest ones run first.
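The added check is conceptually simple (hypothetical types/names; the real code works with the scheduler framework's PreFilter output):

    package precheck

    // preFilterResult is a stand-in for the PreFilter output: either "all
    // nodes are eligible" or an explicit set of eligible node names.
    type preFilterResult struct {
        allNodes bool
        eligible map[string]bool
    }

    // nodePassesPreFilter is the cheap check CheckPredicates was missing: the
    // candidate Node must be in the eligible set before we pay for running
    // the Filter plugins. In FitsAnyNodeMatching it's now one of the checks
    // done first, before the more expensive ones.
    func nodePassesPreFilter(res preFilterResult, nodeName string) bool {
        return res.allNodes || res.eligible[nodeName]
    }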
PredicateSnapshot implements the ClusterSnapshot methods that need
to run predicates on top of a ClusterSnapshotStore.
testsnapshot pkg is introduced, providing functions that abstract away
snapshot creation for tests.
ClusterSnapshot tests are moved near PredicateSnapshot, as it'll be
the only "full" implementation.
To handle DRA properly, scheduling predicates will need to be run
whenever Pods are scheduled in the snapshot.
PredicateChecker always needs a ClusterSnapshot to work, and ClusterSnapshot
scheduling methods need to run the predicates first. So it makes most
sense to have PredicateChecker be a dependency for ClusterSnapshot
implementations, and move the PredicateChecker methods to
ClusterSnapshot.
This commit mirrors PredicateChecker methods in ClusterSnapshot (with
the exception of FitsAnyNode which isn't used anywhere and is trivial to
do via FitsAnyNodeMatching). Further commits will remove the
PredicateChecker interface and move the implementation under
clustersnapshot.
Dummy methods are added to current ClusterSnapshot implementations to
get the tests to pass. Further commits will actually implement them.
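The rough shape of ClusterSnapshot after this commit (signatures simplified, not the exact CA ones):

    package snapshotsketch

    import v1 "k8s.io/api/core/v1"

    type ClusterSnapshot interface {
        // Existing state methods (Fork/Commit/Revert, node/pod accessors)
        // are elided.

        // Mirrored from PredicateChecker:
        CheckPredicates(pod *v1.Pod, nodeName string) error
        FitsAnyNodeMatching(pod *v1.Pod, nodeMatches func(nodeName string) bool) (string, error)
    }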
PredicateError is refactored into a broader SchedulingError so that the
ClusterSnapshot methods can return a single error that the callers can
use to distinguish between a failing predicate and other, unexpected
errors.
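A hypothetical sketch of the error shape - the real SchedulingError may differ, but the point is that a single error type lets callers tell a failing predicate apart from an unexpected internal error:

    package schederr

    import "fmt"

    type schedulingErrorType int

    const (
        // FailingPredicateError: the pod legitimately doesn't fit the node.
        FailingPredicateError schedulingErrorType = iota
        // InternalError: something unexpected went wrong during the simulation.
        InternalError
    )

    type SchedulingError struct {
        errType schedulingErrorType
        reason  string
    }

    func (e *SchedulingError) Error() string {
        return fmt.Sprintf("scheduling error: %s", e.reason)
    }

    // Type lets callers branch on the error kind instead of string-matching:
    //
    //	var schedErr *SchedulingError
    //	if errors.As(err, &schedErr) && schedErr.Type() == FailingPredicateError {
    //	    // the pod doesn't fit - try another node
    //	}
    func (e *SchedulingError) Type() schedulingErrorType { return e.errType }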
This decouples PredicateChecker from the Framework initialization logic,
and allows creating multiple PredicateChecker instances while only
initializing the framework once.
This commit also fixes how CA integrates with Framework metrics. Instead
of being Registered, the metrics are now only Initialized, so that CA
doesn't expose scheduler metrics. The initialization is also moved from
multiple different places to the Handle constructor.
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate,
and scheduler_utils.DeepCopyTemplateNode all had very similar logic
for sanitizing and copying NodeInfos. They're all consolidated to
one file in simulator, sharing common logic.
DeepCopyNodeInfo is changed to be a framework.NodeInfo method.
MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to
correlate Nodes to scheduled pods, instead of using a live Pod lister.
This means that the snapshot now has to be properly initialized in a
bunch of tests.
RemoveNode is renamed to RemoveNodeInfo for consistency with other
NodeInfo methods.
For DRA, the snapshot will have to potentially allocate ResourceClaims
when adding a Pod to a Node, and deallocate them when removing a Pod
from a Node. This will happen in new methods added to ClusterSnapshot
in later commits - SchedulePod and UnschedulePod. These new methods
should be the "default" way of moving pods around the snapshot going
forward.
However, we'll still need to be able to add and remove pods from the
snapshot "forcefully" to handle some corner cases (e.g. expendable pods).
AddPod is renamed to ForceAddPod, and RemovePod to ForceRemovePod to
highlight that these are no longer the "default" methods of moving pods
around the snapshot, and are bypassing something important.
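The intended split, with simplified signatures:

    package podsketch

    import v1 "k8s.io/api/core/v1"

    type clusterSnapshot interface {
        // Default path: runs predicates first and, with DRA, also allocates
        // the pod's ResourceClaims against the node's pools (and deallocates
        // them on unscheduling).
        SchedulePod(pod *v1.Pod, nodeName string) error
        UnschedulePod(namespace, name, nodeName string) error

        // Escape hatches for the corner cases (e.g. expendable pods): no
        // predicates, no claim handling - the Force prefix makes the bypass
        // explicit.
        ForceAddPod(pod *v1.Pod, nodeName string) error
        ForceRemovePod(namespace, name, nodeName string) error
    }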
AddNodeInfo already provides the same functionality, and has to be used
in production code in order to propagate DRA objects correctly.
Uses in production are replaced with SetClusterState(), which will later
take DRA objects into account. Uses in the test code are replaced with
AddNodeInfo().
AddNodes() is redundant - it was intended for batch adding nodes,
probably with batch-specific optimizations in mind. However, it
has always been implemented as just iterating over AddNode(), and
is only used in test code.
Most of the uses in the test code were initializing the cluster state.
They are replaced with SetClusterState(), which will later be needed for
handling DRA anyway (we'll have to start tracking things that aren't
node- or pod-scoped). The other uses are replaced with inline loops over
AddNode().
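Schematically, each such replacement in the test code boils down to a plain loop (shown here as a helper; the names are illustrative):

    package addnodesketch

    import (
        "testing"

        v1 "k8s.io/api/core/v1"
    )

    type nodeAdder interface {
        AddNode(node *v1.Node) error
    }

    // addNodes is what an inline replacement of AddNodes() amounts to:
    // iterate and call AddNode() for each node, failing the test on error.
    func addNodes(t *testing.T, snapshot nodeAdder, nodes []*v1.Node) {
        t.Helper()
        for _, node := range nodes {
            if err := snapshot.AddNode(node); err != nil {
                t.Fatalf("couldn't add node %s: %v", node.Name, err)
            }
        }
    }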