Commit Graph

339 Commits

Author SHA1 Message Date
Bartłomiej Wróblewski 2c7d8dc378 Rewrite TestCloudProvider to use builder pattern 2025-05-23 12:42:15 +00:00
Norbert Cyran 6ab7e2eb78 Prevent nil dereference of preFilterStatus 2025-05-07 10:38:20 +02:00
Piotr Betkier ac1c7b5463 use k8s.io/component-helpers/resource for pod request calculations 2025-04-22 17:36:17 +02:00
Norbert Cyran 9a5e3d9f3d Allow using scheduled pods as samples in proactive scale up 2025-03-19 12:33:39 +01:00
Kubernetes Prow Robot 5e7a559aa8
Merge pull request #7841 from pmendelski/force-ds-fix2
Force system-node-critical daemon-sets in node group templates
2025-03-06 10:49:45 -08:00
Norbert Cyran a946d3d7c9 Make AddTaints and CleanTaints thread safe 2025-02-26 16:17:09 +01:00
mendelski 68c7d1a84e
Force preempting system-node-critical daemon sets 2025-02-17 18:10:27 +00:00
Norbert Cyran 1ffc13fc0d Ensure added taints are present on the node in the snapshot 2025-02-17 11:36:14 +01:00
Karol Wychowaniec 5245a5b934 Minor refactor to scale-up orchestrator for more re-usability 2025-01-21 14:19:59 +00:00
Kubernetes Prow Robot 50c65906fd
Merge pull request #7530 from towca/jtuznik/dra-actual
CA: DRA integration MVP
2024-12-20 16:30:08 +01:00
Kuba Tużnik 377639a8dc CA: implement dynamicresources.Snapshot for storing and modifying the state of DRA objects
The Snapshot can hold all DRA objects in the cluster, and expose them
to the scheduler framework via the SharedDRAManager interface.

The state of the objects can be modified during autoscaling simulations
using the provided methods.
2024-12-20 13:30:10 +01:00
Kuba Tużnik 66d0aeb3cb CA: implement utils for interacting with ResourceClaims
These utils will be used by various parts of the DRA logic in the
following commits.
2024-12-19 15:55:49 +01:00
Walid Ghallab 720f5946fd Refactor NewAutoscalerError function.
We will have two functions instead of one:
1. One that doesn't do formatting, like klog.Error
2. One that accepts formatting, like klog.Errorf

The main reason is to avoid go vet errors and to keep the interfaces
clear, so that go vet (or go test in Go 1.24, which treats these
findings as errors) catches accidental bugs.
2024-12-16 17:46:40 +00:00
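
The split described in the commit above can be illustrated with a minimal sketch. All names here (`ErrorType`, `sketchError`, `newError`, `newErrorf`) are hypothetical and are not taken from the cluster-autoscaler source; the point is only that a non-formatting constructor stores its message verbatim, so a literal `%` in the message is safe, while the formatting constructor is the only place where a format string is interpreted.

```go
package main

import "fmt"

// ErrorType is a hypothetical error category used only in this sketch.
type ErrorType string

// InternalError is a hypothetical category value.
const InternalError ErrorType = "internalError"

// sketchError is a hypothetical stand-in for an autoscaler error.
type sketchError struct {
	errorType ErrorType
	msg       string
}

// Error returns the stored message unchanged.
func (e sketchError) Error() string { return e.msg }

// newError stores msg verbatim, in the style of klog.Error.
// No formatting happens, so "%" characters in msg are harmless.
func newError(t ErrorType, msg string) sketchError {
	return sketchError{errorType: t, msg: msg}
}

// newErrorf formats its arguments, in the style of klog.Errorf.
func newErrorf(t ErrorType, format string, args ...any) sketchError {
	return sketchError{errorType: t, msg: fmt.Sprintf(format, args...)}
}

func main() {
	fmt.Println(newError(InternalError, "disk 100% full"))
	fmt.Println(newErrorf(InternalError, "node %s not found", "n1"))
}
```

With a single printf-style constructor, passing a non-constant message as the format string is the pattern that go vet's printf check reports; keeping the two entry points separate avoids that.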
Kuba Tużnik eb26816ce9 CA: refactor utils related to NodeInfos
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate,
and scheduler_utils.DeepCopyTemplateNode all had very similar logic
for sanitizing and copying NodeInfos. They're all consolidated to
one file in simulator, sharing common logic.

DeepCopyNodeInfo is changed to be a framework.NodeInfo method.

MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to
correlate Nodes to scheduled pods, instead of using a live Pod lister.
This means that the snapshot now has to be properly initialized in a
bunch of tests.
2024-11-27 12:51:30 +01:00
Kuba Tużnik bc16a6f55b CA: wrap the provided errors in ToAutoscalerError() and AddPrefix(), implement Unwrap()
This allows using errors.Is() to check if an AutoscalerError wraps
a sentinel error (e.g. cloudprovider.ErrNotImplemented) when a prefix is
added to it.
2024-11-27 12:44:59 +01:00
Kuba Tużnik 879c6a84a4 DRA: migrate all of CA to use the new internal NodeInfo/PodInfo
The new wrapper types should behave like the direct schedulerframework
types for most purposes, so most of the migration is just changing
the imported package.

Constructors look a bit different, so they have to be adapted -
mostly in test code. Accesses to the Pods field have to be changed
to a method call.

After this, the schedulerframework types are only used in the new
wrappers, and in the parts of simulator/ that directly interact with
the scheduler framework. The rest of CA codebase operates on the new
wrapper types.
2024-11-05 16:43:43 +01:00
Kuba Tużnik 358f8c0d21 DRA: remove utils/test dependency on cloudprovider
utils/test is supposed to be usable in any CA package. Having a
dependency on cloudprovider makes it unusable in any package
that cloudprovider depends on because of import cycles.

The cloudprovider import is only needed by GetGpuConfigFromNode,
which is only used in info_test.go. This commit just moves
GetGpuConfigFromNode there as an unexported function.
2024-11-05 16:42:26 +01:00
Kubernetes Prow Robot c75c3254b2
Merge pull request #7152 from walidghallab/sanitize
Sanitize ENV from podSpec to improve scheduling simulation performance.
2024-08-29 09:06:30 +01:00
Walid Ghallab e6ee407001 Sanitize ENV from podSpec.
This is because it doesn't affect scheduling, but it affects the
performance of scheduling simulation.

It triggered an alert from the 'overflowing_controllers_count' metric.
2024-08-28 14:50:28 +00:00
Devansh Das 1dba38e850 Run gofmt 2024-08-14 14:24:29 +00:00
Devansh Das ed38cbbffa Add unique error message for when in-cluster config is not found 2024-08-12 17:59:59 +00:00
Devansh Das ee6d4f9b5d Add in cluster kubeconfig config 2024-08-12 17:59:43 +00:00
Kubernetes Prow Robot 442c35391c
Merge pull request #6782 from jbartosik/nice-format-blocking-pod-reason
Format BlockingPodReason in a human-readable way with %v
2024-05-07 04:25:48 -07:00
Joachim Bartosik 4e42fe78b6 Format BlockingPodReason in a human-readable way with %v
Without this the enum is output as its int value, which is not readable
2024-05-02 10:44:54 +02:00
Joachim Bartosik d8e1e6dd0e Test enum formatting 2024-05-02 10:40:46 +02:00
Kubernetes Prow Robot 3fd892a37b
Merge pull request #6725 from ZihanJiang96/filter-out-evicted-pod-on-drained-nodes-1
filter out pods that have deletion timestamp set in currentlyDrainedPods
2024-04-30 02:33:09 -07:00
Zihan Jiang e6ab3438fa refine log and test case 2024-04-26 20:10:48 -07:00
Zihan Jiang 15ca53f4a6 add unit test 2024-04-25 14:26:59 -07:00
Yaroslava Serdiuk 5f94f2c429
Add provreqOrchestrator that handles ProvReq classes (#6627)
* Add provreqOrchestrator that handles ProvReq classes

* Review remarks

* Review remarks
2024-04-17 09:37:54 -07:00
Daniel Gutowski 5aa6b2cb07 Introduce binpacking optimization for similar pods.
The optimization uses the fact that pods which are equivalent do not
need to be checked multiple times against already filled nodes.
This changes the time complexity from O(pods*nodes) to O(pods).
2024-04-04 10:15:47 -07:00
Walid Ghallab aada657452 Add BuildTestNodeWithAllocatable test utility method. 2024-02-27 19:26:48 +00:00
Daniel Kłobuszewski a842d4f108 Reduce log spam in AtomicResizeFilteringProcessor
Also, introduce default per-node logging quotas. For now, identical to
the per-pod ones.
2024-02-07 12:01:05 +01:00
Yaroslava Serdiuk ed6ebbe8ba
ScaleUp for check-capacity ProvisioningRequestClass (#6451)
* ScaleUp for check-capacity ProvisioningRequestClass

* update condition logic

* Update tests

* Naming update

* Update cluster-autoscaler/core/scaleup/orchestrator/wrapper_orchestrator_test.go

Co-authored-by: Bartek Wróblewski <bwroblewski@google.com>

---------

Co-authored-by: Bartek Wróblewski <bwroblewski@google.com>
2024-01-30 02:36:59 -08:00
Kubernetes Prow Robot 838ba52860
Merge pull request #6378 from BigDarkClown/multitaint
Taint utils taking multiple taints
2024-01-12 14:07:47 +01:00
Bartłomiej Wróblewski c4063ef8d6 Taint utils taking multiple taints 2024-01-11 16:45:07 +00:00
Joachim Bartosik a5e540d5da Restore flags for setting QPS limit in CA
Partially undo #6274. I noticed that with this change CA gets rate limited and
slows down significantly (especially during large scale downs).
2023-12-29 13:28:08 +00:00
Kubernetes Prow Robot fc48d5c052
Merge pull request #6139 from damikag/priority-evictor
Implement priority based evictor
2023-12-21 18:18:53 +01:00
damikag 9ffbea4408 implement priority based evictor and refactor drain logic 2023-12-21 16:57:05 +00:00
Kubernetes Prow Robot 2afb9683dd
Merge pull request #6376 from x13n/master
[GCE] Support paginated instance listing
2023-12-15 12:36:18 +01:00
Daniel Kłobuszewski e3d3303f89 [GCE] Support paginated instance listing 2023-12-15 09:16:09 +01:00
Walid Ghallab f89427ad9f Make backoff.Status.ErrorInfo non-pointer.
Change-Id: I1f812d4d6f42db97670ef7304fc0e895c837a13b
2023-12-14 15:28:27 +00:00
Walid Ghallab cf6176f80d Add error details to autoscaling backoff.
Change-Id: I3b5c62ba13c2e048ce2d7170016af07182c11eee
2023-12-14 13:45:55 +00:00
Kubernetes Prow Robot 0a1d74f352
Merge pull request #6294 from vadasambar/refactor/kube-client
refactor(*): move getKubeClient to utils/kubernetes
2023-11-29 19:28:44 +01:00
qianlei.qianl ae18f05a61 refactor(*): move getKubeClient to utils/kubernetes
(cherry picked from commit b9f636d2ef)

Signed-off-by: qianlei.qianl <qianlei.qianl@bytedance.com>

refactor: move logic to create client to utils/kubernetes pkg
- expose `CreateKubeClient` as public function
- make `GetKubeConfig` into a private `getKubeConfig` function (can be exposed as a public function in the future if needed)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: CI failing because cloudproviders were not updated to use new autoscaling option fields
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: define errors as constants
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: pass kube client options by value
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2023-11-29 00:43:59 +05:30
Kubernetes Prow Robot 39245a5613
Merge pull request #6235 from atwamahmoud/ignore-scheduler-processing
Ignore scheduler processing
2023-11-22 13:54:30 +01:00
Mahmoud Atwa a1ae4d3b57 Update flags, Improve tests readability & use Bypass instead of ignore in naming 2023-11-22 11:18:55 +00:00
Mahmoud Atwa 4635a6dc04 Allow users to specify which schedulers to ignore 2023-11-22 11:18:44 +00:00
Mahmoud Atwa 86ab017967 Fix multiple comments and update flags 2023-11-22 11:17:48 +00:00
Mahmoud Atwa a1ab7b9e20 Add new pod list processors for clearing TPU requests & filtering out
expendable pods

Treat not-yet-processed pods as unschedulable
2023-11-22 11:16:33 +00:00
piotrwrotniak f2b8272949 Remove maps.Copy usage. 2023-11-09 09:46:48 +00:00