- log a cluster-wide event; the previous event would never fire because
the estimators already cap the options they generate, and additionally
it would fire only once, while events are kept only for a limited time
- log a per-pod event explaining why the scale-up is not triggered
(previously the pod would either get a "no scale-up because no matching
group" event or no event at all)
This required adding the list of unschedulable pods to the status
returned when the max total nodes limit is reached.
- add autoscaling status to reflect that
- change the log severity to warning, as this means that the autoscaler
will not be fully functional (in particular, scaling up will not work)
- fix the scale up enforcer logic not to skip the max nodes reached
logging point
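
A minimal sketch of the idea behind the bullets above, not the actual autoscaler code. The names scaleUpStatus, maxNodesStatus, and the "MaxTotalNodesReached" string are hypothetical, and treating a non-positive limit as "no limit" is an assumption.

```go
package main

import "fmt"

// scaleUpStatus is a hypothetical, simplified status object.
type scaleUpStatus struct {
	result                  string
	podsRemainUnschedulable []string
}

// maxNodesStatus returns a status carrying the unschedulable pods when the
// node limit is reached, and nil otherwise.
// Assumption: a limit <= 0 means "no limit".
func maxNodesStatus(currentNodes, maxNodesTotal int, unschedulablePods []string) *scaleUpStatus {
	if maxNodesTotal <= 0 || currentNodes < maxNodesTotal {
		return nil
	}
	return &scaleUpStatus{
		result:                  "MaxTotalNodesReached",
		podsRemainUnschedulable: append([]string(nil), unschedulablePods...),
	}
}

func main() {
	st := maxNodesStatus(10, 10, []string{"default/web-1", "default/web-2"})
	if st != nil {
		// One cluster-wide warning, then one event per affected pod.
		fmt.Printf("Warning: %s\n", st.result)
		for _, p := range st.podsRemainUnschedulable {
			fmt.Printf("event for pod %s: scale-up not triggered\n", p)
		}
	}
}
```

Carrying the pod list in the status is what lets the caller emit one event per pod instead of only a single cluster-wide message.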
We will have two functions instead of one:
1. One that doesn't do formatting, like klog.Error
2. One that accepts formatting, like klog.Errorf
The main reason behind this is to avoid go vet errors and to have clear
interfaces, so that go vet can catch accidental bugs (or go test in
Go 1.24, where those are treated as errors).
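
The split described above can be sketched as follows. The eventLog type is a hypothetical stand-in for the real logger; only the Error/Errorf naming convention mirrors klog.

```go
package main

import (
	"fmt"
	"strings"
)

// eventLog is a hypothetical in-memory sink used only for illustration.
type eventLog struct {
	lines []string
}

// Error records its arguments verbatim, without interpreting format verbs
// (same shape as klog.Error).
func (l *eventLog) Error(args ...any) {
	l.lines = append(l.lines, fmt.Sprint(args...))
}

// Errorf interprets the first argument as a format string
// (same shape as klog.Errorf).
func (l *eventLog) Errorf(format string, args ...any) {
	l.lines = append(l.lines, fmt.Sprintf(format, args...))
}

func main() {
	log := &eventLog{}
	msg := "disk 100% full" // contains '%', so it must not be used as a format

	log.Error(msg)                       // safe: no formatting applied
	log.Errorf("node %s: %s", "n1", msg) // safe: msg is passed as an argument

	fmt.Println(strings.Join(log.lines, "\n"))
}
```

With two distinct signatures, passing a non-constant string as the format argument stands out in review, and the vet printf check has an unambiguous function to analyze.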
A new DRA Snapshot type is introduced, for now with just dummy methods
to be implemented in later commits. The new type is intended to hold all
DRA objects in the cluster.
ClusterSnapshotStore.SetClusterState() is extended to take the new DRA Snapshot in
addition to the existing parameters.
ClusterSnapshotStore.DraSnapshot() is added to retrieve the DRA snapshot previously
set by SetClusterState(). This will be used by PredicateSnapshot to implement DRA
logic later.
This should be a no-op, as DraSnapshot() is never called, and no DRA
snapshot is passed to SetClusterState() yet.
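
A simplified sketch of the extended store interface. The draSnapshot and store types are hypothetical, and nodes and pods are reduced to plain strings; the real signatures use Kubernetes API types.

```go
package main

import "fmt"

// draSnapshot is a hypothetical placeholder for the new DRA Snapshot type.
type draSnapshot struct {
	claims []string // illustrative content only
}

// store is a hypothetical, simplified ClusterSnapshotStore.
type store struct {
	nodes []string
	pods  []string
	dra   *draSnapshot
}

// SetClusterState replaces the stored state, now including the DRA snapshot.
func (s *store) SetClusterState(nodes, pods []string, dra *draSnapshot) {
	s.nodes = nodes
	s.pods = pods
	s.dra = dra
}

// DraSnapshot returns whatever was last passed to SetClusterState.
func (s *store) DraSnapshot() *draSnapshot {
	return s.dra
}

func main() {
	s := &store{}
	// Callers that pass no DRA snapshot see no behavior change.
	s.SetClusterState([]string{"node-1"}, []string{"pod-1"}, nil)
	fmt.Println(s.DraSnapshot() == nil)
}
```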
This decouples PredicateChecker from the Framework initialization logic,
and allows creating multiple PredicateChecker instances while only
initializing the framework once.
This commit also fixes how CA integrates with Framework metrics. Instead
of being registered, they are only initialized, so that CA doesn't expose
scheduler metrics. The initialization is also moved from multiple
different places to the Handle constructor.
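
The decoupling can be illustrated with a small sketch. All names here (handle, newHandle, predicateChecker) are hypothetical and only mimic the shape of the change: the expensive setup lives in one constructor, and checkers receive the result.

```go
package main

import "fmt"

// handle is a hypothetical stand-in for the framework Handle.
type handle struct {
	metricsInitialized bool
}

// newHandle performs the one-time framework setup, including metrics
// initialization, so no other code path has to do it.
func newHandle() *handle {
	return &handle{metricsInitialized: true}
}

// predicateChecker is a hypothetical checker that only borrows the handle.
type predicateChecker struct {
	fwHandle *handle
}

// newPredicateChecker does no framework initialization of its own.
func newPredicateChecker(h *handle) *predicateChecker {
	return &predicateChecker{fwHandle: h}
}

func main() {
	h := newHandle() // framework initialized exactly once
	a := newPredicateChecker(h)
	b := newPredicateChecker(h)
	fmt.Println(a.fwHandle == b.fwHandle, h.metricsInitialized)
}
```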
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate,
and scheduler_utils.DeepCopyTemplateNode all had very similar logic
for sanitizing and copying NodeInfos. They're all consolidated to
one file in simulator, sharing common logic.
DeepCopyNodeInfo is changed to be a framework.NodeInfo method.
MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to
correlate Nodes to scheduled pods, instead of using a live Pod lister.
This means that the snapshot now has to be properly initialized in a
bunch of tests.
AddNodeInfo already provides the same functionality, and has to be used
in production code in order to propagate DRA objects correctly.
Uses in production are replaced with SetClusterState(), which will later
take DRA objects into account. Uses in the test code are replaced with
AddNodeInfo().
The new wrapper types should behave like the direct schedulerframework
types for most purposes, so most of the migration is just changing
the imported package.
Constructors look a bit different, so they have to be adapted -
mostly in test code. Accesses to the Pods field have to be changed
to a method call.
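
A rough sketch of what the migration looks like at a call site. The types below are hypothetical simplifications: schedNodeInfo stands in for the scheduler framework type with a public Pods field, and NodeInfo for the new wrapper.

```go
package main

import "fmt"

// schedNodeInfo is a hypothetical stand-in for the scheduler framework type,
// which exposes pods as a field.
type schedNodeInfo struct {
	Node string
	Pods []string
}

// NodeInfo is a hypothetical wrapper that hides the inner type.
type NodeInfo struct {
	inner *schedNodeInfo
}

// NewNodeInfo shows that the wrapper constructor has its own signature.
func NewNodeInfo(node string, pods ...string) *NodeInfo {
	return &NodeInfo{inner: &schedNodeInfo{Node: node, Pods: pods}}
}

// Pods replaces direct access to the Pods field.
func (n *NodeInfo) Pods() []string {
	return n.inner.Pods
}

func main() {
	ni := NewNodeInfo("node-1", "pod-a", "pod-b")
	// Before the migration: len(nodeInfo.Pods)
	// After the migration:  len(nodeInfo.Pods())
	fmt.Println(len(ni.Pods()))
}
```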
After this, the schedulerframework types are only used in the new
wrappers, and in the parts of simulator/ that directly interact with
the scheduler framework. The rest of CA codebase operates on the new
wrapper types.
* Add backoff mechanism for ProvReq retry
* Add flags for initial and max backoff time, and cache size
* Review remarks
* Add LRU cache
* Review remark