Merge pull request #93 from MaciekPytel/node_group_sets_proposal
Proposal for balancing node groups for multizone
# Balance similar node groups between zones

##### Author: MaciekPytel

## Introduction

We have multiple requests from people who want to use node groups with the same
instance type, but located in multiple zones for redundancy / HA. Currently
Cluster Autoscaler randomly adds and deletes nodes in those node groups, which
results in uneven node distribution across different zones. The goal of this
proposal is to introduce a mechanism to balance the number of nodes in similar
node groups.

## Constraints

We want this feature to work reasonably well with any currently supported CA
configuration. In particular we want to support both homogeneous and
heterogeneous clusters, allowing the user to easily choose or implement a
strategy for defining what kind of node should be added (e.g. one large
instance vs. several small instances, etc.).

Those goals imply a few more specific constraints that we want to keep:
* We want to avoid balancing the size of node groups composed of different
  instance types (there is no point in trying to balance the number of 2-CPU
  and 32-CPU machines).
* We want to respect the size limits of each node group. In particular we
  shouldn't expect that the limits will be the same for corresponding node
  groups.
* Some pods may be required to run on a specific node group because they use a
  labelSelector on the zone label, or antiaffinity. In such cases we must scale
  the correct node group.
* Preferably we should avoid implementing this using the existing
  expansion.Strategy interface, as that would likely require a non-backward
  compatible refactor of the interface, and we want to allow using this feature
  with different strategies.
* The user must be able to disable this logic using a flag (a minimal sketch of
  such a flag is shown below).

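As a minimal sketch of the last constraint above, the balancing logic could be
guarded by a single boolean flag; the flag name and default below are
illustrative only and not fixed by this proposal:

```go
package main

import (
	"flag"
	"fmt"
)

// Hypothetical flag; the proposal only requires that the balancing logic can
// be disabled, not a specific name or default value.
var balanceSimilarNodeGroups = flag.Bool("balance-similar-node-groups", false,
	"Detect similar node groups and balance the number of nodes between them during scale-up")

func main() {
	flag.Parse()
	fmt.Println("balance similar node groups:", *balanceSimilarNodeGroups)
}
```
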
## General idea

The general idea behind this proposal is to introduce the concept of a "Node
Group Set", consisting of one or more "similar" node groups (the definition of
"similar" is provided in a separate section). When scaling up we would split
the new nodes between node groups in the same set to make their sizes as
similar as possible. For example, assume a node group set made of node groups
A (currently 1 node), B (3 nodes), and C (6 nodes). If we needed to add a new
node to the cluster it would go to group A. If we needed to add 4 nodes, 3 of
them would go to group A and 1 to group B.

Note that this does not guarantee that node groups will always have the same
size. Cluster Autoscaler will add exactly as many nodes as are required for
pending pods, which may not be divisible by the number of node groups in the
node group set. Additionally, we scale down underutilized nodes, which may
happen to be in the same node group. Including the relative sizes of similar
node groups in the scale-down logic will be covered by a different proposal
later on.

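A minimal sketch of the splitting rule described above, assuming a hypothetical
`splitNewNodes` helper and a plain struct for group sizes (neither exists in
the current code):

```go
package main

import (
	"fmt"
	"sort"
)

// groupSize is an illustrative summary of one node group in a node group set.
type groupSize struct {
	name    string
	current int
	maxSize int
}

// splitNewNodes distributes k new nodes between similar node groups so that
// their resulting sizes are as equal as possible, while respecting each
// group's maximum size. It returns the number of nodes to add per group.
func splitNewNodes(groups []groupSize, k int) map[string]int {
	result := map[string]int{}
	for k > 0 {
		// Always grow the smallest group that still has room.
		sort.Slice(groups, func(i, j int) bool { return groups[i].current < groups[j].current })
		grew := false
		for i := range groups {
			if groups[i].current < groups[i].maxSize {
				groups[i].current++
				result[groups[i].name]++
				k--
				grew = true
				break
			}
		}
		if !grew {
			break // every group is already at its maximum size
		}
	}
	return result
}

func main() {
	groups := []groupSize{{"A", 1, 10}, {"B", 3, 10}, {"C", 6, 10}}
	// Adding 4 nodes: 3 go to A and 1 to B, as in the example above.
	fmt.Println(splitNewNodes(groups, 4))
}
```
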
## Implementation proposal

There will be no change to how expansion options are generated in the ScaleUp
function. Instead, the balancing will be executed after an expansion option is
chosen by expansion.Strategy and before the node group is resized. The
high-level algorithm will be as follows:
1. During the loop generating expansion options, create a map {node group ->
   set of pods that pass predicates for this node group}. We already calculate
   that, we just need to store it.
2. Take the expansion option chosen by expansion.Strategy and call it E. Let NG
   be the node group chosen by the strategy and K be the number of nodes that
   need to be added. Let P be the set of all pods that will be scheduled thanks
   to E.
3. Find all node groups "similar" to NG and call them NGS.
4. Check if every pod in P passes scheduler predicates for a sample node in
   every node group in NGS (by checking if P is a subset of the set we stored
   in step 1). Remove from NGS any node group on which at least one pod from P
   can't be scheduled.
5. Add NG to NGS (steps 3-5 are sketched in the code example after this
   section).
6. Get the current and maximum size of all node groups in NGS. Split K between
   the node groups as described in the example above.
7. Resize all groups in NGS as per the result of step 6.

If the user sets the corresponding flag to 'false' we skip step 3, resulting in
a single element in NGS (this makes step 4 a no-op and step 6 trivial).

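The filtering in steps 3-5 can be sketched as follows; `filterSchedulableGroups`
and the `nodeGroup`/`pod` types are illustrative names, not existing Cluster
Autoscaler identifiers, and the map argument corresponds to the one built in
step 1:

```go
package main

import "fmt"

// Illustrative placeholder types; in Cluster Autoscaler these would be
// cloudprovider.NodeGroup and *apiv1.Pod.
type nodeGroup string
type pod string

// filterSchedulableGroups implements steps 3-5: given the node group ng chosen
// by the strategy, the groups considered similar to it, the set of pods p that
// the chosen expansion option would schedule, and the step 1 map (node group
// -> pods that pass predicates on that group), it returns the node groups the
// scale-up can be split between.
func filterSchedulableGroups(ng nodeGroup, similar []nodeGroup, p []pod, schedulable map[nodeGroup]map[pod]bool) []nodeGroup {
	result := []nodeGroup{ng} // step 5: NG itself always stays in the set
	for _, g := range similar {
		ok := true
		for _, po := range p {
			// Step 4: every pod in P must pass predicates on group g.
			if !schedulable[g][po] {
				ok = false
				break
			}
		}
		if ok {
			result = append(result, g)
		}
	}
	return result
}

func main() {
	schedulable := map[nodeGroup]map[pod]bool{
		"group-b": {"pod-1": true, "pod-2": true},
		"group-c": {"pod-1": true}, // pod-2 uses a zone selector that excludes group-c
	}
	groups := filterSchedulableGroups("group-a", []nodeGroup{"group-b", "group-c"}, []pod{"pod-1", "pod-2"}, schedulable)
	fmt.Println(groups) // [group-a group-b]
}
```
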
## Similar node groups

We will balance the size of similar node groups. We want similar groups to
consist of machines with the same instance type and with the same set of custom
labels. In particular we define "similar" node groups as having (a rough sketch
of this comparison is given after the list):
* The same Capacity for every resource.
* Allocatable for each resource within 5% of each other (this number can depend
  on a few different factors and so it's good to have some minor slack).
* "Free" resources (defined as Allocatable minus resources used by daemonsets
  and kube-proxy) for each resource within 5% of each other (this number can
  depend on a few different factors and so it's good to have some minor slack).
* The same set of labels, except for zone and hostname labels (defined in
  https://github.com/kubernetes/kube-state-metrics/blob/master/vendor/k8s.io/client-go/pkg/api/unversioned/well_known_labels.go)

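A simplified sketch of the similarity check above; `nodeInfo`, the flat integer
resource maps, and the hard-coded label keys are illustrative assumptions,
while a real implementation would compare node objects and resource.Quantity
values:

```go
package main

import (
	"fmt"
	"reflect"
)

// nodeInfo is an illustrative summary of a sample node from a node group.
type nodeInfo struct {
	capacity    map[string]int64 // e.g. "cpu" in millicores, "memory" in bytes
	allocatable map[string]int64
	free        map[string]int64 // allocatable minus daemonsets and kube-proxy
	labels      map[string]string
}

// ignoredLabels are the labels allowed to differ between similar node groups.
var ignoredLabels = map[string]bool{
	"failure-domain.beta.kubernetes.io/zone": true,
	"kubernetes.io/hostname":                 true,
}

// within5Percent checks that every resource quantity differs by at most 5%.
func within5Percent(a, b map[string]int64) bool {
	if len(a) != len(b) {
		return false
	}
	for res, qa := range a {
		qb, found := b[res]
		if !found {
			return false
		}
		larger, smaller := qa, qb
		if smaller > larger {
			larger, smaller = smaller, larger
		}
		if float64(larger-smaller) > 0.05*float64(larger) {
			return false
		}
	}
	return true
}

// stripIgnored drops zone/hostname labels before comparing label sets.
func stripIgnored(labels map[string]string) map[string]string {
	out := map[string]string{}
	for k, v := range labels {
		if !ignoredLabels[k] {
			out[k] = v
		}
	}
	return out
}

// similar applies the four conditions from the list above.
func similar(a, b nodeInfo) bool {
	return reflect.DeepEqual(a.capacity, b.capacity) &&
		within5Percent(a.allocatable, b.allocatable) &&
		within5Percent(a.free, b.free) &&
		reflect.DeepEqual(stripIgnored(a.labels), stripIgnored(b.labels))
}

func main() {
	a := nodeInfo{
		capacity:    map[string]int64{"cpu": 2000, "memory": 8 << 30},
		allocatable: map[string]int64{"cpu": 1900, "memory": 7 << 30},
		free:        map[string]int64{"cpu": 1800, "memory": 7 << 30},
		labels:      map[string]string{"failure-domain.beta.kubernetes.io/zone": "us-central1-a"},
	}
	b := a
	b.labels = map[string]string{"failure-domain.beta.kubernetes.io/zone": "us-central1-b"}
	fmt.Println(similar(a, b)) // true: only the zone label differs
}
```
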
---

## Other possible solutions and discussion

There are other ways to implement the general idea than the proposed solution.
This section lists other options that were considered and discusses the pros
and cons of each one. Feel free to skip it.

#### [S1] Split selected expansion option

This is the solution described in the "Implementation proposal" section.

Pros:
* Simplest solution.

Cons:
* If at least a single pending pod uses zone-based scheduling features, the
  whole scale-up will likely go to a single node group.
* We add slightly different nodes than those chosen by expansion.Strategy. In
  particular, any zone-based choices made by expansion.Strategy will be
  discarded.
* We operate on single node groups when creating expansion options, so the
  maximum scale-up size is limited by the maximum size of a single group.

#### [S2] Update expansion options before choosing them

This idea is somewhat similar to [S1], but the new method would be called on a
set of expansion options before expansion.Strategy chooses one. The new method
could modify each option to contain a set of scale-ups on similar node groups.

Pros:
* Addresses some issues of [S1], but not the issues related to pods using
  zone-based scheduling.

Cons:
* To fix the issues of [S1], the function processing expansion options needs to
  be more complex.
* Need to update expansion.Option to be based on multiple NodeGroups, not just
  one. This will make expansion.Strategy considerably more complex and
  difficult to implement.

#### [S3] Make a wrapper for cloudprovider.CloudProvider

This solution would work by implementing a NodeGroupSet wrapper implementing
the cloudprovider.NodeGroup interface. It would consist of one or more
NodeGroups and internally load balance their sizes.

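A minimal sketch of this wrapper idea, using a simplified stand-in interface
instead of the real cloudprovider.NodeGroup (the method set and signatures here
are illustrative only):

```go
package main

import "fmt"

// sizedGroup is a simplified stand-in for cloudprovider.NodeGroup, exposing
// only the methods this sketch needs.
type sizedGroup interface {
	Id() string
	TargetSize() int
	IncreaseSize(delta int)
}

// NodeGroupSet wraps several similar node groups and load balances their
// sizes: each requested node goes to the currently smallest member.
type NodeGroupSet struct {
	groups []sizedGroup
}

func (s *NodeGroupSet) IncreaseSize(delta int) {
	for i := 0; i < delta; i++ {
		smallest := s.groups[0]
		for _, g := range s.groups[1:] {
			if g.TargetSize() < smallest.TargetSize() {
				smallest = g
			}
		}
		smallest.IncreaseSize(1)
	}
}

// fakeGroup is a trivial in-memory implementation used for illustration.
type fakeGroup struct {
	id   string
	size int
}

func (f *fakeGroup) Id() string             { return f.id }
func (f *fakeGroup) TargetSize() int        { return f.size }
func (f *fakeGroup) IncreaseSize(delta int) { f.size += delta }

func main() {
	set := &NodeGroupSet{groups: []sizedGroup{
		&fakeGroup{"a", 1}, &fakeGroup{"b", 3}, &fakeGroup{"c", 6},
	}}
	set.IncreaseSize(4)
	for _, g := range set.groups {
		fmt.Printf("%s=%d ", g.Id(), g.TargetSize())
	}
	fmt.Println() // prints: a=4 b=4 c=6
}
```
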
Pros:
* We could use an aggregated maximum size for all node groups when creating
  expansion options.
* We could add additional methods to NodeGroupSet to split pending pods between
  the underlying NodeGroups in a smart way, allowing us to deal with pod
  antiaffinity and zone-based label selectors without completely skipping size
  balancing.

Cons:
* The NodeGroup contract assumes all nodes in a NodeGroup are identical.
  However, we want to allow at least different zone labels between node groups
  in a node group set.
* Actually doing the smart stuff with labelSelectors, etc. will be complex and
  hard to implement.

#### [S4] Refactor scale-up logic to include balancing similar node groups

This solution would change how expansion options are generated in
core/scale_up.go. The main ScaleUp function could be largely rewritten to take
balancing node groups into account.

Pros:
* Most benefits of [S3].
* Avoids problems caused by breaking the NodeGroup interface contract.

Cons:
* Complex.
* Large changes to critical parts of the code, most likely to cause
  regressions.

#### Discussion

A lot of the difficulty of the problem comes from the fact that we can have
pods that can only be scheduled on some of the node groups in a given node
group set. Such pods require specific configuration by the user (a zone-based
labelSelector or antiaffinity) and are likely not very common in most clusters.
Additionally, one can argue that having a majority of pods explicitly specify
the zone they want to run in defeats the purpose of automatically balancing the
size of node groups between zones in the first place.

If we treat those pods as an edge case, options [S3] and [S4] don't seem very
attractive. The main benefit of options [S3] and [S4] is allowing us to deal
with such edge cases, at the cost of significantly increased complexity.

That leaves options [S1] and [S2]. Once again this is a decision between better
handling of difficult cases versus complexity. This time the tradeoff applies
mostly to the expansion.Strategy interface. So far there are no implementations
of this interface that make zone-based decisions, and making expansion options
more complex (by consisting of a set of NodeGroups) would make all existing
strategies more complex as well, for no benefit. So it seems that [S1] is the
best available option by virtue of its relative simplicity.