This decouples PredicateChecker from the Framework initialization logic,
and allows creating multiple PredicateChecker instances while only
initializing the framework once.
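A minimal sketch of the resulting shape, assuming hypothetical names (NewFrameworkHandle and NewPredicateChecker below are illustrative, not the exact CA API): the framework is initialized once, and the resulting handle is shared by any number of checkers.

```go
// Illustrative only: names and signatures are hypothetical, not the exact
// cluster-autoscaler API.
package simulator

// FrameworkHandle stands in for the single, shared scheduler framework
// instance; constructing it is the only place the framework is initialized.
type FrameworkHandle struct {
	// ...scheduler framework state would live here...
}

func NewFrameworkHandle() (*FrameworkHandle, error) {
	// Initialize the scheduler framework exactly once.
	return &FrameworkHandle{}, nil
}

// PredicateChecker only borrows the shared handle; creating another checker
// no longer re-runs framework initialization.
type PredicateChecker struct {
	fwHandle *FrameworkHandle
}

func NewPredicateChecker(fwHandle *FrameworkHandle) *PredicateChecker {
	return &PredicateChecker{fwHandle: fwHandle}
}
```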
This commit also fixes how CA integrates with Framework metrics. Instead
of registering them, they are only initialized, so that CA doesn't expose
scheduler metrics. The initialization is also moved from multiple
different places to the Handle constructor.
simulator.BuildNodeInfoForNode, core_utils.GetNodeInfoFromTemplate,
and scheduler_utils.DeepCopyTemplateNode all had very similar logic
for sanitizing and copying NodeInfos. They're all consolidated into
one file in simulator, sharing common logic.
DeepCopyNodeInfo is changed to be a framework.NodeInfo method.
MixedTemplateNodeInfoProvider now correctly uses ClusterSnapshot to
correlate Nodes to scheduled pods, instead of using a live Pod lister.
This means that the snapshot now has to be properly initialized in a
bunch of tests.
RemoveNode is renamed to RemoveNodeInfo for consistency with other
NodeInfo methods.
For DRA, the snapshot will have to potentially allocate ResourceClaims
when adding a Pod to a Node, and deallocate them when removing a Pod
from a Node. This will happen in new methods added to ClusterSnapshot
in later commits - SchedulePod and UnschedulePod. These new methods
should be the "default" way of moving pods around the snapshot going
forward.
However, we'll still need to be able to add and remove pods from the
snapshot "forcefully" to handle some corner cases (e.g. expendable pods).
AddPod is renamed to ForceAddPod, and RemovePod to ForceRemovePod to
highlight that these are no longer the "default" methods of moving pods
around the snapshot, and are bypassing something important.
AddNodeInfo already provides the same functionality, and has to be used
in production code in order to propagate DRA objects correctly.
Uses in production are replaced with SetClusterState(), which will later
take DRA objects into account. Uses in the test code are replaced with
AddNodeInfo().
The new wrapper types should behave like the direct schedulerframework
types for most purposes, so most of the migration is just changing
the imported package.
Constructors look a bit different, so they have to be adapted -
mostly in test code. Accesses to the Pods field have to be changed
to a method call.
After this, the schedulerframework types are only used in the new
wrappers, and in the parts of simulator/ that directly interact with
the scheduler framework. The rest of CA codebase operates on the new
wrapper types.
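A rough sketch of what such a wrapper can look like, assuming simplified types (the real framework.NodeInfo and PodInfo in CA carry more state, and the constructor shown is illustrative):

```go
// Illustrative wrapper shape; the real framework.NodeInfo in cluster-autoscaler
// carries more state (e.g. DRA objects in later commits).
package framework

import (
	apiv1 "k8s.io/api/core/v1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// PodInfo wraps the pod representation used throughout CA.
type PodInfo struct {
	*apiv1.Pod
}

// NodeInfo embeds the scheduler type, so most call sites only change imports.
type NodeInfo struct {
	*schedulerframework.NodeInfo
	pods []*PodInfo
}

// Pods replaces direct accesses to the old Pods field with a method call.
func (n *NodeInfo) Pods() []*PodInfo {
	return n.pods
}

// NewNodeInfo is the adapted constructor, used mostly in test code.
func NewNodeInfo(node *apiv1.Node, pods ...*PodInfo) *NodeInfo {
	ni := &NodeInfo{NodeInfo: schedulerframework.NewNodeInfo()}
	ni.NodeInfo.SetNode(node)
	for _, p := range pods {
		ni.pods = append(ni.pods, p)
		ni.NodeInfo.AddPod(p.Pod)
	}
	return ni
}
```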
The optimization uses the fact that equivalent pods do not need to be
checked multiple times against already filled nodes.
This changes the time complexity from O(pods*nodes) to O(pods).
Pods are grouped by schedulability equivalence, which makes it possible
to introduce such optimizations to the binpacking.
Introduce a benchmark that estimates the capacity needed for 51k pods,
which can be grouped into two equivalence groups of 50k and 1k pods.
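A simplified sketch of the grouping idea; equivalenceKey below is a hypothetical stand-in for the real schedulability-equivalence check (which compares the scheduling-relevant parts of the pod spec, not just the owner).

```go
// Simplified sketch of pod equivalence grouping for binpacking; the key
// function below is a placeholder for the real schedulability-equivalence check.
package estimator

import apiv1 "k8s.io/api/core/v1"

type equivalenceGroup struct {
	exemplar *apiv1.Pod   // one representative pod, checked against nodes
	pods     []*apiv1.Pod // all pods that schedule identically to the exemplar
}

// equivalenceKey is a stand-in: two pods with the same key are assumed to
// either both fit or both not fit on any given node.
func equivalenceKey(pod *apiv1.Pod) string {
	if len(pod.OwnerReferences) > 0 {
		return string(pod.OwnerReferences[0].UID)
	}
	return string(pod.UID)
}

// groupPodsByEquivalence lets binpacking run scheduler predicates once per
// group instead of once per pod against already filled nodes.
func groupPodsByEquivalence(pods []*apiv1.Pod) []*equivalenceGroup {
	index := map[string]*equivalenceGroup{}
	var groups []*equivalenceGroup
	for _, pod := range pods {
		key := equivalenceKey(pod)
		if g, ok := index[key]; ok {
			g.pods = append(g.pods, pod)
			continue
		}
		g := &equivalenceGroup{exemplar: pod, pods: []*apiv1.Pod{pod}}
		index[key] = g
		groups = append(groups, g)
	}
	return groups
}
```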
When PermissionToAddNode gets called without actually adding a new node, the
node counter in the thresholdBasedEstimationLimiter gets out of sync with the
actual number of new nodes. This can happen when the "is the last node empty"
check triggers.
The solution used here is to rearrange the checks so that PermissionToAddNode
is always followed by adding a new node. Alternatively, it might also be
possible to pass the current number of nodes as a parameter.
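A schematic sketch of the chosen fix, with hypothetical helper names (only PermissionToAddNode comes from the text above): permission is requested only at the point where a node is definitely about to be added, so the limiter's counter matches the number of nodes actually added.

```go
package estimator

// Limiter mirrors the relevant part of the estimation limiter.
type Limiter interface {
	PermissionToAddNode() bool
}

// addNodesWithLimiter sketches the reordered control flow: any "should we
// stop anyway?" checks (such as "is the last node empty") happen before
// asking for permission, so a granted permission is always followed by an
// added node and the limiter's node counter cannot drift.
func addNodesWithLimiter(limiter Limiter, nodesNeeded int, addNode func()) int {
	added := 0
	for added < nodesNeeded {
		if !limiter.PermissionToAddNode() {
			break
		}
		addNode() // a granted permission always results in a new node
		added++
	}
	return added
}
```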
For the SNG (similar node groups) threshold, include the capacity of the currently estimated node group (as it is not part of the SNG itself)
Replaced direct calls with getters in the cluster capacity threshold
Renamed getters, removing the Get verb prefix
Replace EstimationContext struct with interface
Add support for negative threshold value in estimation limiter
Add EstimationContext to take into account runtime state of the autoscaling for estimations
Implement static threshold
Implement cluster capacity threshold for Estimation Limiter
Implement similar node groups capacity threshold for Estimation Limiter
Set default estimation thresholds
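A hedged sketch of the shape these commits converge on; the interface and method names below are illustrative assumptions rather than the exact estimator API, and the negative-threshold semantics shown are only this sketch's convention.

```go
// Hypothetical shape of the estimation limiting pieces described above;
// names are illustrative rather than the exact cluster-autoscaler API.
package estimator

import (
	"time"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// EstimationContext exposes the runtime autoscaling state that thresholds
// may need, replacing the earlier plain struct.
type EstimationContext interface {
	SimilarNodeGroups() []cloudprovider.NodeGroup
	CurrentNodeCount() int
}

// Threshold limits a single estimation. By convention in this sketch, a zero
// limit means "no limit" and a negative limit means "stop immediately".
type Threshold interface {
	NodeLimit(nodeGroup cloudprovider.NodeGroup, ctx EstimationContext) int
	DurationLimit(nodeGroup cloudprovider.NodeGroup, ctx EstimationContext) time.Duration
}

// staticThreshold is the simplest implementation: fixed limits.
type staticThreshold struct {
	maxNodes    int
	maxDuration time.Duration
}

func (t *staticThreshold) NodeLimit(cloudprovider.NodeGroup, EstimationContext) int {
	return t.maxNodes
}

func (t *staticThreshold) DurationLimit(cloudprovider.NodeGroup, EstimationContext) time.Duration {
	return t.maxDuration
}
```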
This was meant to be used as a signal that the estimation considered
all of the pods, but x13n pointed out that other k8s primitives may
also limit it, so the best option is to compare the list of estimated
pods.
Refactor the binpacking estimator to make pod ordering extensible by moving the sorting logic into decreasing_pod_orderer.go
Clean up the sorting logic/structs in binpacking_estimator.go
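An illustrative sketch of the extracted ordering abstraction; the interface name and the CPU-request sort key are assumptions, and the real decreasing_pod_orderer.go differs in detail.

```go
// Illustrative sketch of extracting pod ordering out of the binpacking
// estimator; the real decreasing_pod_orderer.go differs in detail.
package estimator

import (
	"sort"

	apiv1 "k8s.io/api/core/v1"
)

// PodOrderer lets the binpacking estimator be configured with different
// orderings without changing the packing loop itself.
type PodOrderer interface {
	Order(pods []*apiv1.Pod) []*apiv1.Pod
}

// decreasingPodOrderer sorts pods by decreasing CPU request, so that the
// largest pods are packed first.
type decreasingPodOrderer struct{}

func (decreasingPodOrderer) Order(pods []*apiv1.Pod) []*apiv1.Pod {
	sorted := append([]*apiv1.Pod(nil), pods...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return cpuRequest(sorted[i]) > cpuRequest(sorted[j])
	})
	return sorted
}

func cpuRequest(pod *apiv1.Pod) int64 {
	var total int64
	for _, c := range pod.Spec.Containers {
		if req, ok := c.Resources.Requests[apiv1.ResourceCPU]; ok {
			total += req.MilliValue()
		}
	}
	return total
}
```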
Previously we just assumed a pod will always fit on a newly added node
during binpacking, because we had already checked that the pod fits on an
empty template node earlier in the scale-up logic.
This assumption is incorrect, as it doesn't take into account the potential
impact of other scheduling decisions made during binpacking. For pods using
zonal Filters (such as PodTopologySpread with a zonal topology key), the
pod may no longer be able to schedule even on an empty node as a result
of earlier decisions made in binpacking.
The binpacking algorithm is O(#pending_pods * #new_nodes), and
calculating a very large scale-up can get stuck for minutes or even
hours, leading to CA failing its healthcheck and going down.
The new limiting prevents this scenario by stopping binpacking after
reaching a specified threshold. Any pods that remain pending as a result
of the shortened binpacking will be processed in the next autoscaler loop.
The thresholds used can be controlled with newly introduced flags:
--max-nodes-per-scaleup and --max-nodegroup-binpacking-duration. The
limiting can be disabled by setting both flags to 0 (not recommended,
especially for --max-nodegroup-binpacking-duration).
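A schematic sketch of how such limiting can work; the helper names and the callback shape are hypothetical, and only the two limits correspond to the flags above.

```go
// Schematic sketch of limited binpacking: stop either after adding maxNodes
// new nodes or after maxDuration passes; remaining pods are left pending for
// the next autoscaler loop. A zero value disables the corresponding limit.
package estimator

import "time"

func packWithLimits(pods []string, maxNodes int, maxDuration time.Duration,
	tryPack func(pod string, mayAddNode bool) (packedOK bool, addedNode bool)) (packed, leftOver []string) {

	deadline := time.Now().Add(maxDuration)
	newNodes := 0
	for i, pod := range pods {
		if maxDuration > 0 && time.Now().After(deadline) {
			// Duration limit hit: everything not packed yet stays pending.
			return packed, append(leftOver, pods[i:]...)
		}
		mayAddNode := maxNodes <= 0 || newNodes < maxNodes
		ok, addedNode := tryPack(pod, mayAddNode)
		if addedNode {
			newNodes++
		}
		if !ok {
			leftOver = append(leftOver, pod)
			continue
		}
		packed = append(packed, pod)
	}
	return packed, leftOver
}
```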
Both upscale's `getUpcomingNodeInfos` and the binpacking estimator now use
the same shared DeepCopyTemplateNode function and inherit its naming
pattern, which is great, as that fixes a long-standing bug.
Due to that, `getUpcomingNodeInfos` will enrich the cluster snapshot with
generated NodeInfos and nodes having predictable names (template name
+ an incremental ordinal starting at 0) for upcoming nodes.
Later, when looking for nodes fitting unschedulable pods (because the
upcoming nodes don't satisfy them, e.g. FitsAnyNodeMatching failing due to
node capacity or pod anti-affinity), the binpacking estimator will also
build virtual nodes and place them in a snapshot fork to evaluate scheduler
predicates.
Those temporary virtual nodes are built using the same pattern (template name
and an index ordinal also starting at 0) as the one previously used by
`getUpcomingNodeInfos`, which means it will generate the same NodeInfo/node
names for node groups having upcoming nodes.
But adding nodes with the same name to an existing cluster snapshot isn't
allowed, so the evaluation attempt fails.
In practice this blocks re-upscales for node groups having upcoming nodes,
which can cause significant delays.
The function copying the template node to use for upcoming nodes was
not changing the hostname label, meaning that features relying on
this label (e.g. pod anti-affinity with hostname topology) would
treat all upcoming nodes as a single node.
This resulted in triggering too many scale-ups for pods
using such features. The analogous function in binpacking didn't
have the same bug (but it didn't set unique UIDs or pod names).
I extracted the functionality into a util function used in both
places to avoid the two functions getting out of sync again.
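A hedged sketch of the core of such a shared helper; the function name and suffix scheme are illustrative, but the kubernetes.io/hostname label is the one that was previously left unchanged.

```go
// Sketch of the shared template-copy helper's key step: every generated
// upcoming node gets a unique name, UID and kubernetes.io/hostname label,
// so hostname-topology features see them as distinct nodes.
package utils

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

func sanitizedTemplateNode(template *apiv1.Node, suffix string) *apiv1.Node {
	node := template.DeepCopy()
	node.Name = fmt.Sprintf("%s-%s", template.Name, suffix)
	node.UID = types.UID(fmt.Sprintf("%s-uid", node.Name))
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	// Without this, all upcoming nodes would share the template's hostname
	// label and be treated as a single node by e.g. pod anti-affinity.
	node.Labels["kubernetes.io/hostname"] = node.Name
	return node
}
```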
This means that PreFilters are run once per pod in binpacking
instead of #pods*#nodes times. This makes a huge performance
difference in very large clusters.
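A schematic sketch of the resulting flow, using simplified stand-ins for the scheduler framework types (not its real API).

```go
// Schematic sketch: PreFilter plugins are run once per pod and their
// CycleState is reused for every node's Filter run, instead of redoing
// PreFilter for each (pod, node) pair. These interfaces are simplified
// stand-ins, not the real scheduler framework API.
package simulator

type cycleState struct{} // stands in for the scheduler's framework.CycleState

type predicateRunner interface {
	runPreFilters(pod string) *cycleState           // once per pod
	runFilters(state *cycleState, node string) bool // once per (pod, node)
}

// fitsAnyNode checks one pod against all candidate nodes with a single
// PreFilter pass, so the PreFilter cost is paid once per pod rather than
// once per (pod, node) pair.
func fitsAnyNode(r predicateRunner, pod string, nodes []string) (string, bool) {
	state := r.runPreFilters(pod) // run PreFilters once
	for _, node := range nodes {
		if r.runFilters(state, node) { // reuse the precomputed state per node
			return node, true
		}
	}
	return "", false
}
```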
The following things changed in the scheduler and needed to be fixed:
* NodeInfo was moved to schedulerframework
* Some fields on NodeInfo are now exposed directly instead of via getters
* NodeInfo.Pods is now a list of *schedulerframework.PodInfo, not *apiv1.Pod (see the sketch after this list)
* SharedLister and NodeInfoLister were moved to schedulerframework
* PodLister was removed
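For instance, the NodeInfo.Pods change means call sites now go through the PodInfo wrapper to reach the pod; a minimal illustration:

```go
// Minimal illustration of the NodeInfo.Pods type change: entries are
// *schedulerframework.PodInfo wrappers, so the *apiv1.Pod is reached via .Pod.
package simulator

import (
	apiv1 "k8s.io/api/core/v1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

func podsOnNode(nodeInfo *schedulerframework.NodeInfo) []*apiv1.Pod {
	pods := make([]*apiv1.Pod, 0, len(nodeInfo.Pods))
	for _, podInfo := range nodeInfo.Pods {
		pods = append(pods, podInfo.Pod) // previously the slice held *apiv1.Pod directly
	}
	return pods
}
```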