The Snapshot can hold all DRA objects in the cluster, and expose them
to the scheduler framework via the SharedDRAManager interface.
The state of the objects can be modified during autoscaling simulations
using the provided methods.
We will have two functions instead of one:
1. One that doesn't do formatting, like klog.Error
2. One that accepts formating, like klog.Errorf
The main reason behind this is to avoid go vet errors and have clear
interfaces to catch accidental bugs and rely on go vet to catch those
accidental bugs (or go test in go 1.24, as those are treated as errors).
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate,
and scheduler_utils.DeepCopyTemplateNode all had very similar logic
for sanitizing and copying NodeInfos. They're all consolidated to
one file in simulator, sharing common logic.
DeepCopyNodeInfo is changed to be a framework.NodeInfo method.
MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to
correlate Nodes to scheduled pods, instead of using a live Pod lister.
This means that the snapshot now has to be properly initialized in a
bunch of tests.
This allows using errors.Is() to check if an AutoscalerError wraps
a sentinel error (e.g. cloudprovider.ErrNotImplemented) when a prefix is
added to it.
The new wrapper types should behave like the direct schedulerframework
types for most purposes, so most of the migration is just changing
the imported package.
Constructors look a bit different, so they have to be adapted -
mostly in test code. Accesses to the Pods field have to be changed
to a method call.
After this, the schedulerframework types are only used in the new
wrappers, and in the parts of simulator/ that directly interact with
the scheduler framework. The rest of CA codebase operates on the new
wrapper types.
utils/test is supposed to be usable in any CA package. Having a
dependency on cloudprovider makes it unusuable in any package
that cloudprovider depends on because of import cycles.
The cloudprovider import is only needed by GetGpuConfigFromNode,
which is only used in info_test.go. This commit just moves
GetGpuConfigFromNode there as an unexported function.
This is because it doesn't affect scheduling, but it affect performance
of scheduling simulation.
It triggered alert from 'overflowing_controllers_count' metric.
The optimization uses the fact that pods which are equivalent do not
need to be check multiple times against already filled nodes.
This changes the time complexity from O(pods*nodes) to O(pods).
(cherry picked from commit b9f636d2ef)
Signed-off-by: qianlei.qianl <qianlei.qianl@bytedance.com>
refactor: move logic to create client to utils/kubernetes pkg
- expose `CreateKubeClient` as public function
- make `GetKubeConfig` into a private `getKubeConfig` function (can be exposed as a public function in the future if needed)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: CI failing because cloudproviders were not updated to use new autoscaling option fields
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: define errors as constants
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: pass kube client options by value
Signed-off-by: vadasambar <surajrbanakar@gmail.com>