The logic is very basic and will likely need to be revised, but it's
something for initial testing. Utilization of a given Pool is calculated
as the number of allocated devices in the pool divided by the number of
all devices in the pool. For scale-down purposes, the max utilization
of all Node-local Pools is used.
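For illustration, a minimal sketch of that calculation in Go (the poolStats type and the function names are hypothetical stand-ins, not the actual CA code):

    package drautil

    import "fmt"

    // poolStats is an illustrative stand-in for the real pool data.
    type poolStats struct {
        allocatedDevices int
        totalDevices     int
    }

    // poolUtilization is the number of allocated devices divided by the
    // number of all devices in the pool.
    func poolUtilization(p poolStats) (float64, error) {
        if p.totalDevices == 0 {
            return 0, fmt.Errorf("pool has no devices")
        }
        return float64(p.allocatedDevices) / float64(p.totalDevices), nil
    }

    // nodeUtilizationForScaleDown is the max utilization over all
    // Node-local pools.
    func nodeUtilizationForScaleDown(pools []poolStats) (float64, error) {
        highest := 0.0
        for _, p := range pools {
            u, err := poolUtilization(p)
            if err != nil {
                return 0, err
            }
            if u > highest {
                highest = u
            }
        }
        return highest, nil
    }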
The new logic is mostly behind the DRA flag guard, so this should be a no-op
if the flag is disabled. The only difference should be that FilterOutUnremovable
marks a Node as unremovable if calculating utilization fails. I'm not sure
why this wasn't the case before, but I think we need it for DRA - if CA sees an
incomplete picture of a resource pool, we probably don't want to scale
the Node down.
The Snapshot can hold all DRA objects in the cluster, and expose them
to the scheduler framework via the SharedDRAManager interface.
The state of the objects can be modified during autoscaling simulations
using the provided methods.
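A purely conceptual sketch of that idea - the names below are made up rather than the real Snapshot/SharedDRAManager APIs; the point is just that the snapshot owns the DRA object state and simulations mutate it through dedicated methods:

    package drasketch

    // claimState is a stand-in for the tracked ResourceClaim status.
    type claimState struct {
        allocated   bool
        reservedFor []string // pod UIDs holding a reservation
    }

    // draSnapshot holds all DRA objects known to the autoscaler, keyed by
    // namespace/name (only claims shown here).
    type draSnapshot struct {
        claims map[string]*claimState
    }

    // reserveClaim records a simulated allocation/reservation for a pod.
    func (s *draSnapshot) reserveClaim(key, podUID string) {
        c, ok := s.claims[key]
        if !ok {
            return
        }
        c.allocated = true
        c.reservedFor = append(c.reservedFor, podUID)
    }

    // unreserveClaim reverts the reservation when the simulated pod is removed.
    func (s *draSnapshot) unreserveClaim(key, podUID string) {
        c, ok := s.claims[key]
        if !ok {
            return
        }
        for i, uid := range c.reservedFor {
            if uid == podUID {
                c.reservedFor = append(c.reservedFor[:i], c.reservedFor[i+1:]...)
                break
            }
        }
        if len(c.reservedFor) == 0 {
            c.allocated = false
        }
    }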
If a hinted Node is no longer in the cluster snapshot (e.g. it was
a fake upcoming Node and the real one appeared), fall back to the
non-hinting logic instead of erroring out.
This was introduced in the recent PredicateChecker->PredicateSnapshot
refactor. Previously, PredicateChecker.CheckPredicates() would return
an error if the hinted Node was gone, and HintingSimulator treated
this error the same as failing predicates - it would move on to the
non-hinting logic. After the refactor, HintingSimulator explicitly
errors out if it can't retrieve the hinted Node from the snapshot,
so the behavior changed.
I checked other CheckPredicates()/SchedulePod() callsites, and this is
the only one where ignoring the missing Node makes sense. For the others,
the Node is added to the snapshot just before the call, so it being
missing should cause an error.
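Schematically, the restored behavior looks like this (the types, names, and error sentinels are hypothetical, not the real HintingSimulator code):

    package hintsketch

    import "errors"

    // Hypothetical sentinels standing in for the real error kinds.
    var (
        errNodeMissing      = errors.New("hinted node not in snapshot")
        errFailedPredicates = errors.New("predicates failed")
    )

    type snapshot interface {
        SchedulePod(pod, nodeName string) error
    }

    // scheduleWithHint tries the hinted Node first; if that Node is missing
    // from the snapshot or predicates fail, it falls back to the non-hinting
    // logic. Any other error is propagated.
    func scheduleWithHint(s snapshot, pod, hintedNode string, nonHinting func(pod string) (string, error)) (string, error) {
        if hintedNode != "" {
            switch err := s.SchedulePod(pod, hintedNode); {
            case err == nil:
                return hintedNode, nil // the hint is still valid
            case errors.Is(err, errNodeMissing), errors.Is(err, errFailedPredicates):
                // Fall through to the non-hinting logic below.
            default:
                return "", err // unexpected error
            }
        }
        return nonHinting(pod)
    }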
RunReserveOnNode runs the Reserve phase of schedulerframework,
which is necessary to obtain ResourceClaim allocations computed
by the DRA scheduler plugin.
RunReserveOnNode isn't used anywhere yet, so this should be a no-op.
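A hedged sketch of what the method could boil down to - the framework handle and cycle state plumbing is elided, exact scheduler framework import paths and signatures may differ between versions, and the wrapper itself is CA-specific:

    package reservesketch

    import (
        "context"
        "fmt"

        v1 "k8s.io/api/core/v1"
        schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
    )

    // runReserveOnNode runs the Reserve plugins for pod on nodeName. For DRA,
    // this is where the dynamicresources plugin records the ResourceClaim
    // allocations it computed, so the autoscaler can read them back.
    func runReserveOnNode(ctx context.Context, fwk schedulerframework.Framework, state *schedulerframework.CycleState, pod *v1.Pod, nodeName string) error {
        status := fwk.RunReservePluginsReserve(ctx, state, pod, nodeName)
        if !status.IsSuccess() {
            // Roll back anything the plugins may have reserved so far.
            fwk.RunReservePluginsUnreserve(ctx, state, pod, nodeName)
            return fmt.Errorf("couldn't reserve pod %s/%s on node %s: %v", pod.Namespace, pod.Name, nodeName, status.Message())
        }
        return nil
    }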
All the NodeInfo methods have to take DRA into account, and the logic
for that will be the same for different ClusterSnapshotStore implementations.
Instead of duplicating the new logic in Basic and Delta, the methods
are moved to ClusterSnapshot and the logic will be implemented once in
PredicateSnapshot.
PredicateSnapshot will use the DRA Snapshot exposed by its ClusterSnapshotStore
to implement these methods. The DRA Snapshot has to be stored in the
ClusterSnapshotStore layer, as we need to be able to fork/commit/revert it.
Lower-level methods for adding/removing just the schedulerframework.NodeInfo
parts are added to ClusterSnapshotStore. PredicateSnapshot utilizes these methods
to implement AddNodeInfo and RemoveNodeInfo.
This should be a no-op; it's just a refactor.
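The layering, sketched with hypothetical method names (only AddNodeInfo/RemoveNodeInfo correspond to real ClusterSnapshot methods):

    package layersketch

    // nodeInfo is a stand-in; the real type also carries pods and,
    // later, DRA objects.
    type nodeInfo struct {
        name string
    }

    // clusterSnapshotStore only knows how to store and drop bare scheduler
    // NodeInfos (plus the DRA snapshot, not shown here).
    type clusterSnapshotStore interface {
        AddSchedulerNodeInfo(ni *nodeInfo) error
        RemoveSchedulerNodeInfo(nodeName string) error
    }

    // predicateSnapshot builds the higher-level methods on top of the store,
    // wrapping the low-level calls with DRA handling.
    type predicateSnapshot struct {
        store clusterSnapshotStore
    }

    func (p *predicateSnapshot) AddNodeInfo(ni *nodeInfo) error {
        // The real code would first add the node's DRA objects (e.g. its
        // ResourceSlices) to the DRA snapshot, then store the NodeInfo.
        return p.store.AddSchedulerNodeInfo(ni)
    }

    func (p *predicateSnapshot) RemoveNodeInfo(nodeName string) error {
        // The real code would also drop the node's DRA objects from
        // the DRA snapshot.
        return p.store.RemoveSchedulerNodeInfo(nodeName)
    }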
Store the DRA snapshot inside the current internal data in
SetClusterState().
Retrieve the DRA snapshot from the current internal data in
DraSnapshot().
Clone the DRA snapshot whenever the internal data is cloned
during Fork(). This matches the forking logic that BasicSnapshotStore
uses, ensuring that the DRA object state is correctly
forked/committed/reverted during the corresponding ClusterSnapshot
operations.
This should be a no-op, as DraSnapshot() isn't called anywhere yet,
and no DRA snapshot is passed to SetClusterState() yet.
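Schematically (field and method names are made up), the forking works like this - because the DRA snapshot lives inside the per-fork data and is cloned with it, Commit/Revert handle it with no extra work:

    package forksketch

    // draSnapshot is a stand-in for the real DRA object maps.
    type draSnapshot struct {
        claims map[string]string
    }

    func (d draSnapshot) clone() draSnapshot {
        out := draSnapshot{claims: make(map[string]string, len(d.claims))}
        for k, v := range d.claims {
            out.claims[k] = v
        }
        return out
    }

    // internalData is one fork's worth of snapshot state.
    type internalData struct {
        // node/pod state elided
        draSnapshot draSnapshot
    }

    // snapshotStore keeps a stack of forks; the last element is the current
    // one. Guards for the base element are omitted for brevity.
    type snapshotStore struct {
        data []*internalData
    }

    func (s *snapshotStore) Fork() {
        cur := s.data[len(s.data)-1]
        s.data = append(s.data, &internalData{draSnapshot: cur.draSnapshot.clone()})
    }

    func (s *snapshotStore) Revert() {
        s.data = s.data[:len(s.data)-1]
    }

    func (s *snapshotStore) Commit() {
        // Replace the parent with the current fork.
        s.data = append(s.data[:len(s.data)-2], s.data[len(s.data)-1])
    }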
A new DRA Snapshot type is introduced, for now with just dummy methods
to be implemented in later commits. The new type is intended to hold all
DRA objects in the cluster.
ClusterSnapshotStore.SetClusterState() is extended to take the new DRA Snapshot in
addition to the existing parameters.
ClusterSnapshotStore.DraSnapshot() is added to retrieve the DRA snapshot
previously set by SetClusterState(). This will be used by PredicateSnapshot
to implement DRA logic later.
This should be a no-op, as DraSnapshot() is never called, and no DRA
snapshot is passed to SetClusterState() yet.
Make SharedDRAManager a part of the ClusterSnapshotStore interface, and
implement dummy methods to satisfy the interface. Actual implementation
will come in later commits.
This is needed so that ClusterSnapshot can feed DRA objects to the DRA
scheduler plugin, and obtain ResourceClaim modifications back from it.
The integration is behind the DRA flag guard, so this should be a no-op
if the flag is disabled.
Multiple tests can call NewHandle() concurrently, because of
t.Parallel(). NewHandle calls schedulermetrics.InitMetrics()
which modifies global variables, so there's a race.
Wrapped the schedulermetrics.InitMetrics() call in a sync.Once.Do()
so that it's only done once, in a thread-safe manner.
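In essence, the change looks like this (the NewHandle signature and the rest of its body are elided, and the package layout is simplified):

    package framework

    import (
        "sync"

        schedulermetrics "k8s.io/kubernetes/pkg/scheduler/metrics"
    )

    var initMetricsOnce sync.Once

    func NewHandle( /* ... */ ) {
        // InitMetrics mutates global state, so it must run at most once even
        // when tests construct handles concurrently under t.Parallel().
        initMetricsOnce.Do(func() {
            schedulermetrics.InitMetrics()
        })
        // ... rest of the handle construction ...
    }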
For DRA, this component will have to call the Reserve phase in addition
to just checking predicates/filters.
The new version also makes more sense in the context of
PredicateSnapshot, which is now its only user.
While refactoring, I noticed that CheckPredicates for some reason
doesn't check the provided Node against the eligible Nodes returned
from PreFilter (while FitsAnyNodeMatching does). This seems like
a bug, so the check is added.
The checks in FitsAnyNodeMatching are also reordered so that the
cheapest ones run first.
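The added check is conceptually simple (hypothetical types/names; the real code works with the scheduler framework's PreFilter output):

    package precheck

    // preFilterResult is a stand-in for the PreFilter output: either "all
    // nodes are eligible" or an explicit set of eligible node names.
    type preFilterResult struct {
        allNodes bool
        eligible map[string]bool
    }

    // nodePassesPreFilter is the cheap check CheckPredicates was missing: the
    // candidate Node must be in the eligible set before we pay for running
    // the Filter plugins. In FitsAnyNodeMatching it's now one of the checks
    // done first, before the more expensive ones.
    func nodePassesPreFilter(res preFilterResult, nodeName string) bool {
        return res.allNodes || res.eligible[nodeName]
    }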
PredicateSnapshot implements the ClusterSnapshot methods that need
to run predicates on top of a ClusterSnapshotStore.
testsnapshot pkg is introduced, providing functions that abstract away
snapshot creation for tests.
ClusterSnapshot tests are moved near PredicateSnapshot, as it'll be
the only "full" implementation.
To handle DRA properly, scheduling predicates will need to be run
whenever Pods are scheduled in the snapshot.
PredicateChecker always needs a ClusterSnapshot to work, and ClusterSnapshot
scheduling methods need to run the predicates first. So it makes most
sense to have PredicateChecker be a dependency for ClusterSnapshot
implementations, and move the PredicateChecker methods to
ClusterSnapshot.
This commit mirrors PredicateChecker methods in ClusterSnapshot (with
the exception of FitsAnyNode which isn't used anywhere and is trivial to
do via FitsAnyNodeMatching). Further commits will remove the
PredicateChecker interface and move the implementation under
clustersnapshot.
Dummy methods are added to current ClusterSnapshot implementations to
get the tests to pass. Further commits will actually implement them.
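The rough shape of ClusterSnapshot after this commit (signatures simplified, not the exact CA ones):

    package snapshotsketch

    import v1 "k8s.io/api/core/v1"

    type ClusterSnapshot interface {
        // Existing state methods (Fork/Commit/Revert, node/pod accessors)
        // are elided.

        // Mirrored from PredicateChecker:
        CheckPredicates(pod *v1.Pod, nodeName string) error
        FitsAnyNodeMatching(pod *v1.Pod, nodeMatches func(nodeName string) bool) (string, error)
    }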
PredicateError is refactored into a broader SchedulingError so that the
ClusterSnapshot methods can return a single error that the callers can
use to distinguish between a failing predicate and other, unexpected
errors.
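A hypothetical sketch of the error shape - the real SchedulingError may differ, but the point is that a single error type lets callers tell a failing predicate apart from an unexpected internal error:

    package schederr

    import "fmt"

    type schedulingErrorType int

    const (
        // FailingPredicateError: the pod legitimately doesn't fit the node.
        FailingPredicateError schedulingErrorType = iota
        // InternalError: something unexpected went wrong during the simulation.
        InternalError
    )

    type SchedulingError struct {
        errType schedulingErrorType
        reason  string
    }

    func (e *SchedulingError) Error() string {
        return fmt.Sprintf("scheduling error: %s", e.reason)
    }

    // Type lets callers branch on the error kind instead of string-matching:
    //
    //	var schedErr *SchedulingError
    //	if errors.As(err, &schedErr) && schedErr.Type() == FailingPredicateError {
    //	    // the pod doesn't fit - try another node
    //	}
    func (e *SchedulingError) Type() schedulingErrorType { return e.errType }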
This decouples PredicateChecker from the Framework initialization logic,
and allows creating multiple PredicateChecker instances while only
initializing the framework once.
This commit also fixes how CA integrates with Framework metrics. Instead
of being Registered, the metrics are now only Initialized, so that CA
doesn't expose scheduler metrics. The initialization is also moved from
multiple different places to the Handle constructor.
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate,
and scheduler_utils.DeepCopyTemplateNode all had very similar logic
for sanitizing and copying NodeInfos. They're all consolidated to
one file in simulator, sharing common logic.
DeepCopyNodeInfo is changed to be a framework.NodeInfo method.
MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to
correlate Nodes to scheduled pods, instead of using a live Pod lister.
This means that the snapshot now has to be properly initialized in a
bunch of tests.
RemoveNode is renamed to RemoveNodeInfo for consistency with other
NodeInfo methods.
For DRA, the snapshot will have to potentially allocate ResourceClaims
when adding a Pod to a Node, and deallocate them when removing a Pod
from a Node. This will happen in new methods added to ClusterSnapshot
in later commits - SchedulePod and UnschedulePod. These new methods
should be the "default" way of moving pods around the snapshot going
forward.
However, we'll still need to be able to add and remove pods from the
snapshot "forcefully" to handle some corner cases (e.g. expendable pods).
AddPod is renamed to ForceAddPod, and RemovePod to ForceRemovePod to
highlight that these are no longer the "default" methods of moving pods
around the snapshot, and are bypassing something important.
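The intended split, with simplified signatures:

    package podsketch

    import v1 "k8s.io/api/core/v1"

    type clusterSnapshot interface {
        // Default path: runs predicates first and, with DRA, also allocates
        // the pod's ResourceClaims against the node's pools (and deallocates
        // them on unscheduling).
        SchedulePod(pod *v1.Pod, nodeName string) error
        UnschedulePod(namespace, name, nodeName string) error

        // Escape hatches for the corner cases (e.g. expendable pods): no
        // predicates, no claim handling - the Force prefix makes the bypass
        // explicit.
        ForceAddPod(pod *v1.Pod, nodeName string) error
        ForceRemovePod(namespace, name, nodeName string) error
    }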
AddNodeInfo already provides the same functionality, and has to be used
in production code in order to propagate DRA objects correctly.
Uses in production are replaced with SetClusterState(), which will later
take DRA objects into account. Uses in the test code are replaced with
AddNodeInfo().
AddNodes() is redundant - it was intended for batch adding nodes,
probably with batch-specific optimizations in mind. However, it
has always been implemented as just iterating over AddNode(), and
is only used in test code.
Most of the uses in the test code were initializing the cluster state.
They are replaced with SetClusterState(), which will later be needed for
handling DRA anyway (we'll have to start tracking things that aren't
node- or pod-scoped). The other uses are replaced with inline loops over
AddNode().
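Schematically, each such replacement in the test code boils down to a plain loop (shown here as a helper; the names are illustrative):

    package addnodesketch

    import (
        "testing"

        v1 "k8s.io/api/core/v1"
    )

    type nodeAdder interface {
        AddNode(node *v1.Node) error
    }

    // addNodes is what an inline replacement of AddNodes() amounts to:
    // iterate and call AddNode() for each node, failing the test on error.
    func addNodes(t *testing.T, snapshot nodeAdder, nodes []*v1.Node) {
        t.Helper()
        for _, node := range nodes {
            if err := snapshot.AddNode(node); err != nil {
                t.Fatalf("couldn't add node %s: %v", node.Name, err)
            }
        }
    }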