Merge pull request #93 from MaciekPytel/node_group_sets_proposal

Proposal for balancing node groups for multizone
# Balance similar node groups between zones
##### Author: MaciekPytel
## Introduction
We have multiple requests from people who want to use node groups with the same
instance type, but located in multiple zones for redundancy / HA. Currently
Cluster Autoscaler adds and deletes nodes in those node groups at random,
which results in an uneven node distribution across zones. The goal of this
proposal is to introduce a mechanism that balances the number of nodes in
similar node groups.
## Constraints
We want this feature to work reasonably well with any currently
supported CA configuration. In particular we want to support both homogeneous and
heterogeneous clusters, allowing the user to easily choose or implement a strategy
for deciding what kind of node should be added (e.g. one large instance vs. several
small instances). Those goals imply a few more specific constraints that we want to keep:
* We want to avoid balancing the sizes of node groups composed of different instance
types (there is no point in trying to balance the number of 2-CPU and 32-CPU
machines).
* We want to respect size limits of each node group. In particular we shouldn't
expect that the limits will be the same for corresponding node groups.
* Some pods may be required to run on a specific node group, for example because
they use a labelSelector on the zone label or anti-affinity. In such cases we must
scale up the correct node group.
* Preferably we should avoid implementing this using the existing
expansion.Strategy interface, as that would likely require a non-backward-compatible
refactor of the interface, and we want to allow using this feature with different
strategies.
* User must be able to disable this logic using a flag.
## General idea
The general idea behind this proposal is to introduce a concept of a "Node Group
Set", consisting of one or more "similar" node groups (the definition of
"similar" is provided in a separate section). When scaling up we would split the
new nodes between node groups in the same set to make their sizes as similar as
possible. For example, assume a node group set made of node groups A (currently
1 node), B (3 nodes), and C (6 nodes). If we needed to add a single node to the
cluster it would go to group A. If we needed to add 4 nodes, 3 of them would go
to group A and 1 to group B.
Note that this does not guarantee that node groups will always have the same
size. Cluster Autoscaler will add exactly as many nodes as are required for the
pending pods, which may not be divisible by the number of node groups in the node
group set. Additionally we scale down underutilized nodes, which may happen to be
in the same node group. Taking the relative sizes of similar node groups into
account in scale-down logic will be covered by a separate proposal later on.
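To make the splitting behaviour concrete, below is a minimal Go sketch of the
"always add to the smallest group" logic described above. The function name and the
plain-integer representation of group sizes are illustrative only and are not taken
from the Cluster Autoscaler code base; group size limits are ignored here and are
discussed in the implementation proposal below.

```go
package main

import "fmt"

// balanceScaleUp distributes newNodes among the given group sizes so that the
// resulting sizes end up as similar as possible. It returns how many nodes to
// add to each group, keyed by the group's index in the input slice.
// Group size limits are ignored in this sketch.
func balanceScaleUp(sizes []int, newNodes int) map[int]int {
	current := append([]int(nil), sizes...)
	added := map[int]int{}
	for n := 0; n < newNodes; n++ {
		// Add each new node to the currently smallest group.
		smallest := 0
		for i, size := range current {
			if size < current[smallest] {
				smallest = i
			}
		}
		current[smallest]++
		added[smallest]++
	}
	return added
}

func main() {
	// Groups A, B, C with 1, 3 and 6 nodes; adding 4 nodes yields
	// 3 new nodes in A (index 0) and 1 in B (index 1), as in the example above.
	fmt.Println(balanceScaleUp([]int{1, 3, 6}, 4))
}
```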
## Implementation proposal
There will be no change to how expansion options are generated in the ScaleUp
function. Instead, the balancing will be executed after an expansion option is
chosen by expansion.Strategy and before the node group is resized. The high-level
algorithm will be as follows:
1. During the loop generating expansion options, create a map {node group -> set of
pods that pass predicates for this node group}. We already calculate that, we
just need to store it.
2. Take the expansion option chosen by expansion.Strategy and call it E. Let NG be the node group
chosen by the strategy and K be the number of nodes that need to be added. Let P
be the set of all pods that will be scheduled thanks to E.
3. Find all node groups "similar" to NG and call them NGS.
4. Check if every pod in P passes scheduler predicates for a sample node in every
node group in NGS (by checking if P is a subset of the set we stored in step 1).
Remove from NGS any node group on which at least one pod from P can't be
scheduled.
5. Add NG to NGS.
6. Get the current and maximum size of all node groups in NGS. Split K between the
node groups as described in the example above.
7. Resize all groups in NGS as per the result of step 6.
If the user sets the corresponding flag to 'false' we skip step 3,
resulting in a single element in NGS (this makes step 4 a no-op and step 6 trivial).
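As a rough illustration of steps 3-5, the sketch below filters the similar node
groups down to those on which every pod in P can still be scheduled, using the map
built in step 1. All identifiers here are hypothetical simplifications; the real
implementation would operate on cloudprovider.NodeGroup objects and actual
scheduler predicate checks.

```go
package main

import "fmt"

// Hypothetical, simplified types standing in for node groups and pods.
type NodeGroup string
type Pod string

// schedulablePods is the map built in step 1: for each node group, the set of
// pending pods that pass scheduler predicates on its sample node.
type schedulablePods map[NodeGroup]map[Pod]bool

// groupsForExpansion returns NGS from steps 3-5: the chosen node group plus
// every similar group on which all pods from P can still be scheduled.
func groupsForExpansion(chosen NodeGroup, similar []NodeGroup, pods []Pod, fits schedulablePods) []NodeGroup {
	result := []NodeGroup{chosen}
	for _, group := range similar {
		allFit := true
		for _, pod := range pods {
			if !fits[group][pod] {
				allFit = false
				break
			}
		}
		if allFit {
			result = append(result, group)
		}
	}
	return result
}

func main() {
	fits := schedulablePods{
		"group-b": {"pod-1": true, "pod-2": true},
		"group-c": {"pod-1": true}, // pod-2 uses a zone selector that excludes group-c
	}
	// Only group-b is kept alongside the chosen group-a.
	fmt.Println(groupsForExpansion("group-a", []NodeGroup{"group-b", "group-c"}, []Pod{"pod-1", "pod-2"}, fits))
}
```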
## Similar node groups
We will balance the sizes of similar node groups. We want similar groups to consist
of machines with the same instance type and the same set of custom labels.
In particular we define "similar" node groups as having:
* The same Capacity for every resource.
* Allocatable for each resource within 5% of each other (this number can depend on a
few different factors and so it's good to have some minor slack).
* "Free" resources (defined as Allocatable minus resources used by daemonsets and
kube-proxy) for each resource within 5% of each other (this number can depend on a
few different factors and so it's good to have some minor slack).
* The same set of labels, except for the zone and hostname labels (defined in
https://github.com/kubernetes/kube-state-metrics/blob/master/vendor/k8s.io/client-go/pkg/api/unversioned/well_known_labels.go).
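A minimal sketch of such a comparison is below. Resource quantities are simplified
to plain integers and all identifiers are illustrative assumptions; the actual
implementation would compare resource quantities taken from real node objects.

```go
package main

import "fmt"

// tolerance is the 5% slack allowed for Allocatable and "free" resources.
const tolerance = 0.05

// NodeInfo is a simplified, hypothetical stand-in for the data compared when
// deciding whether two node groups are "similar".
type NodeInfo struct {
	Capacity    map[string]int64 // must match exactly
	Allocatable map[string]int64 // must be within 5% of each other
	Free        map[string]int64 // Allocatable minus daemonset and kube-proxy usage
	Labels      map[string]string
}

// ignoredLabels are allowed to differ between similar node groups.
var ignoredLabels = map[string]bool{
	"failure-domain.beta.kubernetes.io/zone": true,
	"kubernetes.io/hostname":                 true,
}

func withinTolerance(a, b int64) bool {
	larger, smaller := a, b
	if smaller > larger {
		larger, smaller = smaller, larger
	}
	return float64(larger-smaller) <= tolerance*float64(larger)
}

// similar assumes both nodes expose the same set of resource names.
func similar(a, b NodeInfo) bool {
	for res, capA := range a.Capacity {
		if b.Capacity[res] != capA {
			return false
		}
	}
	for res, allocA := range a.Allocatable {
		if !withinTolerance(allocA, b.Allocatable[res]) {
			return false
		}
	}
	for res, freeA := range a.Free {
		if !withinTolerance(freeA, b.Free[res]) {
			return false
		}
	}
	// Labels must match exactly in both directions, except zone and hostname.
	for key, val := range a.Labels {
		if !ignoredLabels[key] && b.Labels[key] != val {
			return false
		}
	}
	for key, val := range b.Labels {
		if !ignoredLabels[key] && a.Labels[key] != val {
			return false
		}
	}
	return true
}

func main() {
	a := NodeInfo{
		Capacity:    map[string]int64{"cpu": 4000},
		Allocatable: map[string]int64{"cpu": 3900},
		Free:        map[string]int64{"cpu": 3700},
		Labels:      map[string]string{"failure-domain.beta.kubernetes.io/zone": "us-central1-a", "env": "prod"},
	}
	b := a
	b.Labels = map[string]string{"failure-domain.beta.kubernetes.io/zone": "us-central1-b", "env": "prod"}
	fmt.Println(similar(a, b)) // true: only the zone label differs
}
```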
---
## Other possible solutions and discussion
There are other ways to implement the general idea than the proposed solution.
This section lists other options that were considered and discusses pros and
cons of each one. Feel free to skip it.
#### [S1] Split selected expansion option
This is the solution described in "Implementation proposal" section.
Pros:
* Simplest solution.
Cons:
* If even a single pending pod uses zone-based scheduling features, the
whole scale-up will likely go to a single node group.
* We add slightly different nodes than those chosen by expansion.Strategy. In
particular any zone-based choices made by expansion.Strategy will be
discarded.
* We operate on single node groups when creating expansion options, so the maximum
scale-up size is limited by the maximum size of a single group.
#### [S2] Update expansion options before choosing them
This idea is somewhat similar to [S1], but the new method would be called
on the set of expansion options before expansion.Strategy chooses one. The new
method would modify each option to contain a set of scale-ups on similar
node groups.
Pros:
* Addresses some issues of [S1], but not the issues related to pods using
zone-based scheduling.
Cons:
* To fix the issues of [S1] the function processing expansion options needs to be
more complex.
* Need to update expansion.Option to be based on multiple NodeGroups, not just
one. This will make expansion.Strategy considerably more complex and difficult to implement.
#### [S3] Make a wrapper for cloudprovider.CloudProvider
This solution would work by introducing a NodeGroupSet wrapper that implements the
cloudprovider.NodeGroup interface. It would consist of one or more
NodeGroups and internally balance their sizes.
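For illustration only, a rough sketch of the wrapper idea under heavy
simplification: the NodeGroup interface below is a hypothetical stand-in for a
small subset of cloudprovider.NodeGroup, and NodeGroupSet is not an existing type.

```go
package nodegroupset

import "fmt"

// NodeGroup is a hypothetical stand-in for a small subset of the
// cloudprovider.NodeGroup interface.
type NodeGroup interface {
	TargetSize() (int, error)
	MaxSize() int
	IncreaseSize(delta int) error
}

// NodeGroupSet wraps several similar node groups and balances their sizes
// internally when the set as a whole is scaled up.
type NodeGroupSet struct {
	groups []NodeGroup
}

// MaxSize exposes the aggregated limit of all underlying groups.
func (s *NodeGroupSet) MaxSize() int {
	total := 0
	for _, g := range s.groups {
		total += g.MaxSize()
	}
	return total
}

// IncreaseSize adds delta nodes one at a time, always to the smallest
// underlying group that still has spare capacity.
func (s *NodeGroupSet) IncreaseSize(delta int) error {
	for i := 0; i < delta; i++ {
		var best NodeGroup
		bestSize := 0
		for _, g := range s.groups {
			size, err := g.TargetSize()
			if err != nil {
				return err
			}
			if size < g.MaxSize() && (best == nil || size < bestSize) {
				best, bestSize = g, size
			}
		}
		if best == nil {
			return fmt.Errorf("all node groups in the set are at maximum size")
		}
		if err := best.IncreaseSize(1); err != nil {
			return err
		}
	}
	return nil
}
```

Note that this sketch glosses over the main problem listed in the cons below: the
nodes in the wrapped groups are not identical (at minimum their zone labels differ).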
Pros:
* We could use an aggregated maximum size for all node groups when creating
expansion options.
* We could add additional methods to NodeGroupSet to split pending pods between the
underlying NodeGroups in a smart way, allowing us to deal with pod anti-affinity
and zone-based label selectors without completely skipping size balancing.
Cons:
* The NodeGroup contract assumes all nodes in a NodeGroup are identical. However we
want to allow at least different zone labels between node groups in a node
group set.
* Actually doing the smart stuff with labelSelectors, etc. will be complex and
hard to implement.
#### [S4] Refactor scale-up logic to include balancing similar node groups
This solution would change how expansion options are generated in
core/scale_up.go. The main ScaleUp function could be largely rewritten to take
balancing node groups into account.
Pros:
* Most benefits of [S3].
* Avoids problems caused by breaking NodeGroup interface contract.
Cons:
* Complex.
* Large changes to critical parts of the code, most likely to cause regressions.
#### Discussion
A lot of the difficulty of this problem comes from the fact that we can have pods
that can only be scheduled on some of the node groups in a given node group set.
Such pods require specific configuration by the user (a zone-based labelSelector or
anti-affinity) and are likely not very common in most
clusters. Additionally, one can argue that having a majority of pods explicitly
specify the zone they want to run in defeats the purpose of
automatically balancing the size of node groups between zones in the first place.
If we treat those pods as an edge case, options [S3] and [S4] don't seem very
attractive. The main benefit of options [S3] and [S4] is allowing us to deal with
such edge cases, at the cost of significantly increased complexity.
That leaves options [S1] and [S2]. Once again this is a decision between better
handling of difficult cases and complexity. This time the tradeoff applies
mostly to the expansion.Strategy interface. So far there are no implementations of
this interface that make zone-based decisions, and making expansion options more
complex (by having them consist of a set of NodeGroups) would make all existing
strategies more complex as well, for no benefit. So it seems that [S1] is the best
available option by virtue of its relative simplicity.