8592 Commits

Author SHA1 Message Date
Kuba Tużnik 55388f1136 CA: plumb the DRA provider to SetClusterState callsites, grab and pass DRA snapshot
The new logic is flag-guarded, so it should be a no-op if DRA is disabled.
2024-12-20 13:30:36 +01:00
Kuba Tużnik c5cb8a077d CA: add DRA object handling logic to PredicateSnapshot
All added logic is behind the DRA flag guard, so this should be a no-op
if the flag is disabled.
2024-12-20 13:30:36 +01:00
Kuba Tużnik 714ab661ca CA: implement calculating utilization for DRA resources
The logic is very basic and will likely need to be revised, but it's
something for initial testing. Utilization of a given Pool is calculated
as the number of allocated devices in the pool divided by the number of
all devices in the pool. For scale-down purposes, the max utilization
of all Node-local Pools is used.

The new logic is mostly behind the DRA flag guard, so this should be a no-op
if the flag is disabled. The only difference should be that FilterOutUnremovable
marks a Node as unremovable if calculating utilization fails. Not sure
why this wasn't the case before, but I think we need it for DRA - if CA sees an
incomplete picture of a resource pool, we probably don't want to scale
the Node down.
2024-12-20 13:30:36 +01:00
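
The utilization rule described in 714ab661ca can be sketched as follows. This is a minimal illustration, not the code from the commit: the `Device` and `Pool` types and both function names are hypothetical, and treating an empty pool as an error is an assumption made here to show the "utilization failed, so treat the Node as unremovable" path.

```go
package main

import "fmt"

// Device is a hypothetical, simplified stand-in for a DRA device in a pool.
type Device struct {
	Name      string
	Allocated bool
}

// Pool is a hypothetical stand-in for a Node-local resource pool.
type Pool struct {
	Name    string
	Devices []Device
}

// poolUtilization returns (allocated devices) / (all devices) for one pool.
// Returning an error for an empty pool is an assumption of this sketch.
func poolUtilization(p Pool) (float64, error) {
	if len(p.Devices) == 0 {
		return 0, fmt.Errorf("pool %q has no devices", p.Name)
	}
	allocated := 0
	for _, d := range p.Devices {
		if d.Allocated {
			allocated++
		}
	}
	return float64(allocated) / float64(len(p.Devices)), nil
}

// nodeUtilization returns the highest utilization across all Node-local
// pools, which is the value the commit message says is used for scale-down.
func nodeUtilization(pools []Pool) (float64, error) {
	highest := 0.0
	for _, p := range pools {
		u, err := poolUtilization(p)
		if err != nil {
			return 0, err
		}
		if u > highest {
			highest = u
		}
	}
	return highest, nil
}

func main() {
	pools := []Pool{
		{Name: "a", Devices: []Device{{"a0", true}, {"a1", false}, {"a2", false}, {"a3", false}}},
		{Name: "b", Devices: []Device{{"b0", true}, {"b1", false}}},
	}
	u, err := nodeUtilization(pools)
	if err != nil {
		// Mirrors the behavior described above: a failed calculation
		// means the Node should not be considered for scale-down.
		fmt.Println("unremovable:", err)
		return
	}
	fmt.Printf("node DRA utilization: %.2f\n", u)
}
```

Taking the maximum rather than an average is the conservative choice: a Node with one heavily used pool stays above the scale-down threshold even if its other pools are idle.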
Kuba Tużnik 4e68a0c6ef CA: sanitize and propagate DRA objects through NodeInfos in node_info utils 2024-12-20 13:30:36 +01:00
Kuba Tużnik 479d7ce3d6 CA: implement a Provider for dynamicresources.Snapshot
The Provider uses DRA object listers to create a Snapshot of the
DRA objects.
2024-12-20 13:30:36 +01:00
Kuba Tużnik 377639a8dc CA: implement dynamicresources.Snapshot for storing and modifying the state of DRA objects
The Snapshot can hold all DRA objects in the cluster, and expose them
to the scheduler framework via the SharedDRAManager interface.

The state of the objects can be modified during autoscaling simulations
using the provided methods.
2024-12-20 13:30:10 +01:00
Kuba Tużnik 66d0aeb3cb CA: implement utils for interacting with ResourceClaims
These utils will be used by various parts of the DRA logic in the
following commits.
2024-12-19 15:55:49 +01:00
Kubernetes Prow Robot 756db6aa66
Merge pull request #7591 from davidspek/fix/vpa-nonroot
feat(vpa): run containers as nonroot user by default
2024-12-11 17:00:03 +00:00
Kubernetes Prow Robot 723a3bb0d8
Merge pull request #7587 from adrianmoisey/fix_link
VPA: Fix typo in documentation
2024-12-11 12:40:04 +00:00
David van der Spek 8dd3dcdbc0 fix(vpa): run containers as nonroot user by default
Signed-off-by: David van der Spek <david.vanderspek@flyrlabs.com>
2024-12-11 13:36:28 +01:00
Kubernetes Prow Robot 562059be93
Merge pull request #7527 from adrianmoisey/logging
Pass the whole VPA into cappingRecommendationProcessor.Apply()
2024-12-11 10:00:03 +00:00
Adrian Moisey 59236c93a2
Ensure that recommendation is not nil 2024-12-11 11:42:49 +02:00
Kubernetes Prow Robot 0b1cddd850
Merge pull request #7586 from adrianmoisey/clarify_history
Clarify that the VPA can be run without Prometheus
2024-12-11 09:04:08 +00:00
Kubernetes Prow Robot 2f27a1238e
Merge pull request #7588 from WalkerMills/patch-1
Fix typo in custom memory bump-up documentation
2024-12-11 09:02:04 +00:00
Walker Mills 1730af52d7
Fix typo in custom memory bump-up documentation
Move max function to the start of the expression to match the behavior of the recommender:
c44d4f0033/vertical-pod-autoscaler/pkg/recommender/model/container.go (L194-L196)
2024-12-10 22:31:13 -08:00
Adrian Moisey d3f0619e33
Fix typo in link 2024-12-10 19:56:36 +02:00
Adrian Moisey 3e707e94a0
Clarify that the VPA can be run without Prometheus 2024-12-10 19:54:54 +02:00
Kubernetes Prow Robot c44d4f0033
Merge pull request #7577 from zendesk/grosser/docs3
document scale-down-gpu-utilization-threshold
2024-12-10 12:08:02 +00:00
Kubernetes Prow Robot ef19cf8f00
Merge pull request #7546 from omerap12/issue-7528
VPA(feat): separate estimator object for CPU and memory
2024-12-10 08:02:03 +00:00
Omer Aplatony 4310a161b6 added comment
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
2024-12-10 09:45:43 +02:00
Kubernetes Prow Robot 444e3295df
Merge pull request #7507 from Shubham82/CA_Reviewer
Add Shubham82 to the reviewers
2024-12-09 23:36:01 +00:00
Kubernetes Prow Robot 37b3da4e79
Merge pull request #7529 from towca/jtuznik/dra-prep
CA: prepare for DRA integration
2024-12-09 17:14:03 +00:00
Kuba Tużnik 0691512d27 CA: extend SchedulerPluginRunner with RunReserveOnNode
RunReserveOnNode runs the Reserve phase of schedulerframework,
which is necessary to obtain ResourceClaim allocations computed
by the DRA scheduler plugin.

RunReserveOnNode isn't used anywhere yet, so this should be a no-op.
2024-12-09 17:38:13 +01:00
Kuba Tużnik 307002eb42 CA: move NodeInfo methods from ClusterSnapshotStore to ClusterSnapshot
All the NodeInfo methods have to take DRA into account, and the logic
for that will be the same for different ClusterSnapshotStore implementations.
Instead of duplicating the new logic in Basic and Delta, the methods
are moved to ClusterSnapshot and the logic will be implemented once in
PredicateSnapshot.

PredicateSnapshot will use the DRA Snapshot exposed by its ClusterSnapshotStore
to implement these methods. The DRA Snapshot has to be stored in the
ClusterSnapshotStore layer, as we need to be able to fork/commit/revert it.

Lower-level methods for adding/removing just the schedulerframework.NodeInfo
parts are added to ClusterSnapshotStore. PredicateSnapshot utilizes these methods
to implement AddNodeInfo and RemoveNodeInfo.

This should be a no-op, it's just a refactor.
2024-12-09 17:38:04 +01:00
Kuba Tużnik eba5e08f6d CA: integrate BasicSnapshotStore with drasnapshot.Snapshot
Store the DRA snapshot inside the current internal data in
SetClusterState().

Retrieve the DRA snapshot from the current internal data in
DraSnapshot().

Clone the DRA snapshot whenever the internal data is cloned
during Fork(). This matches the forking logic that BasicSnapshotStore
uses, ensuring that the DRA object state is correctly
forked/committed/reverted during the corresponding ClusterSnapshot
operations.

This should be a no-op, as DraSnapshot() isn't called anywhere yet,
and no DRA snapshot is passed to SetClusterState() yet.
2024-12-09 17:14:45 +01:00
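
The fork/commit/revert behavior described in eba5e08f6d can be illustrated with a small layered store. This is a sketch under stated assumptions, not the BasicSnapshotStore code: `draSnapshot`, `internalData`, `store`, and their fields are hypothetical, and the DRA state is reduced to a map from claim name to node name.

```go
package main

import "fmt"

// draSnapshot is a hypothetical, simplified DRA snapshot:
// it maps a claim name to the node it is allocated on.
type draSnapshot struct {
	claims map[string]string
}

// clone returns a deep copy so that changes in a fork never leak into the base.
func (s draSnapshot) clone() draSnapshot {
	c := draSnapshot{claims: make(map[string]string, len(s.claims))}
	for k, v := range s.claims {
		c.claims[k] = v
	}
	return c
}

// internalData is a hypothetical stand-in for the store's internal state.
// The DRA snapshot lives inside it, so it is cloned along with everything else.
type internalData struct {
	nodes []string
	dra   draSnapshot
}

func (d *internalData) clone() *internalData {
	return &internalData{
		nodes: append([]string(nil), d.nodes...),
		dra:   d.dra.clone(),
	}
}

// store keeps a stack of internalData layers; the last one is current.
type store struct {
	layers []*internalData
}

func newStore() *store {
	return &store{layers: []*internalData{{dra: draSnapshot{claims: map[string]string{}}}}}
}

func (s *store) current() *internalData { return s.layers[len(s.layers)-1] }

// Fork pushes a deep copy of the current layer, DRA snapshot included.
func (s *store) Fork() { s.layers = append(s.layers, s.current().clone()) }

// Revert drops the most recent fork, discarding its DRA changes.
func (s *store) Revert() {
	if len(s.layers) > 1 {
		s.layers = s.layers[:len(s.layers)-1]
	}
}

// Commit replaces the parent layer with the most recent fork.
func (s *store) Commit() {
	if len(s.layers) > 1 {
		top := s.current()
		s.layers = append(s.layers[:len(s.layers)-2], top)
	}
}

func main() {
	s := newStore()
	s.current().dra.claims["claim-a"] = "node-1"

	s.Fork()
	s.current().dra.claims["claim-b"] = "node-2"
	s.Revert()
	fmt.Println("claims after revert:", len(s.current().dra.claims))

	s.Fork()
	s.current().dra.claims["claim-c"] = "node-3"
	s.Commit()
	fmt.Println("claims after commit:", len(s.current().dra.claims))
}
```

Because the DRA snapshot is a field of the cloned data, it needs no separate fork logic of its own, which is the point the commit message makes.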
Kuba Tużnik 466f94b780 CA: extend ClusterSnapshotStore to allow storing, retrieving and modifying DRA objects
A new DRA Snapshot type is introduced, for now with just dummy methods
to be implemented in later commits. The new type is intended to hold all
DRA objects in the cluster.

ClusterSnapshotStore.SetClusterState() is extended to take the new DRA Snapshot in
addition to the existing parameters.

ClusterSnapshotStore.DraSnapshot() is added to retrieve the DRA snapshot set by
SetClusterState() back. This will be used by PredicateSnapshot to implement DRA
logic later.

This should be a no-op, as DraSnapshot() is never called, and no DRA
snapshot is passed to SetClusterState() yet.
2024-12-09 17:14:45 +01:00
Kuba Tużnik 1e560274d5 CA: extend WrapSchedulerNodeInfo to allow passing DRA objects
This should be a no-op, as no DRA objects are passed yet.
2024-12-09 17:14:45 +01:00
Kuba Tużnik d0338fa301 CA: integrate simulator with schedulerframework.SharedDRAManager
Make SharedDRAManager a part of the ClusterSnapshotStore interface, and
implement dummy methods to satisfy the interface. Actual implementation
will come in later commits.

This is needed so that ClusterSnapshot can feed DRA objects to the DRA
scheduler plugin, and obtain ResourceClaim modifications back from it.

The integration is behind the DRA flag guard, so this should be a no-op
if the flag is disabled.
2024-12-09 17:14:34 +01:00
Kubernetes Prow Robot f6c990e25d
Merge pull request #7585 from adrianmoisey/add-adrianmoisey-as-reviewer
Add adrianmoisey as VPA reviewer
2024-12-09 15:28:02 +00:00
Kubernetes Prow Robot 6fd8188a11
Merge pull request #7576 from adrianmoisey/local-e2e-tests-actuation
Configure e2e local scripts for the actuation suite
2024-12-09 15:18:03 +00:00
Kubernetes Prow Robot dda0dc8e81
Merge pull request #7572 from davidspek/feat/vpa-remove-go-mod
cleanup(vpa): remove go vendoring
2024-12-09 10:54:02 +00:00
Adrian Moisey 475db5b120
Add adrianmoisey as VPA reviewer 2024-12-08 06:28:18 +02:00
Kubernetes Prow Robot ccbae98104
Merge pull request #7579 from Azure/tallaxes/spot-refresh
fix: correctly set the default refresh period for VMSS size (used for Spot instances)
2024-12-08 03:44:04 +00:00
Alex Leites 61c8cdeff7 fix: corresponding test 2024-12-08 02:22:02 +00:00
Alex Leites 5e7ceee507 fix: setting getVmssSizeRefreshPeriod 2024-12-08 01:23:04 +00:00
Kubernetes Prow Robot bd7156e837
Merge pull request #7557 from gvnc/handle-ooh-capacity-nodes
Avoid making delete api calls for nodes that don't have an instance id
2024-12-06 22:48:01 +00:00
Michael Grosser d65ac6445f
document scale-down-gpu-utilization-threshold
Signed-off-by: Michael Grosser <michael@grosser.it>
2024-12-06 12:27:06 -08:00
Adrian Moisey f0af7c1b78
Configure e2e local scripts for the actuation suite
When looking at the e2e tests that run on every PR, I noticed that the
actuation suite is run there, but it isn't available for local e2e testing.
2024-12-06 20:42:07 +02:00
David van der Spek 82908da8e3 fix: undo change to nonroot distroless
Signed-off-by: David van der Spek <david.vanderspek@flyrlabs.com>
2024-12-06 15:16:39 +01:00
David van der Spek 99e2f7ff3c fix(vpa): update dockerfiles to work without vendor
Signed-off-by: David van der Spek <david.vanderspek@flyrlabs.com>
2024-12-06 15:16:39 +01:00
David van der Spek f3a50a76d2 cleanup: tidy go mod
Signed-off-by: David van der Spek <david.vanderspek@flyrlabs.com>
2024-12-06 15:16:39 +01:00
David van der Spek 41c6eeef49 cleanup(vpa): add gitignore for vendor
Signed-off-by: David van der Spek <david.vanderspek@flyrlabs.com>
2024-12-06 15:16:39 +01:00
David van der Spek 0b6e7be086 cleanup(vpa): remove vendor directory
Signed-off-by: David van der Spek <david.vanderspek@flyrlabs.com>
2024-12-06 15:16:13 +01:00
Kubernetes Prow Robot 5b57c9fdcb
Merge pull request #7573 from davidspek/feat/vpa-e2e-remove-go-mod
cleanup(vpa): remove go vendoring for e2e
2024-12-06 14:12:01 +00:00
Kubernetes Prow Robot a994301b5b
Merge pull request #7575 from omerap12/logging-patch
fix: remove newlines from VPA recommendations
2024-12-06 13:40:00 +00:00
Kuba Tużnik 8c7f3fadc6 CA: plumb the DRA flag guard to PredicateSnapshot 2024-12-06 13:40:47 +01:00
Omer Aplatony 55b2b1e792 fmt
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
2024-12-06 14:01:22 +02:00
Omer Aplatony 4bb7fb04dc fix: remove newlines from VPA recommendations
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
2024-12-06 13:54:05 +02:00
Kubernetes Prow Robot c88f72f149
Merge pull request #7567 from towca/jtuznik/handle-fix
CA: Fix a data race in framework.NewHandle
2024-12-06 09:22:02 +00:00
Kubernetes Prow Robot 4d092e5f0a
Merge pull request #7563 from ialidzhikov/fix/vpa-updater-event-logs
vpa-updater: Fix logging of events
2024-12-06 03:54:00 +00:00