Merge pull request #93 from MaciekPytel/node_group_sets_proposal
Proposal for balancing node groups for multizone
# Balance similar node groups between zones

##### Author: MaciekPytel

## Introduction

We have multiple requests from people who want to use node groups with the
same instance type, but located in multiple zones for redundancy / HA.
Currently Cluster Autoscaler adds and deletes nodes in those node groups at
random, which results in an uneven node distribution across zones. The goal
of this proposal is to introduce a mechanism to balance the number of nodes
in similar node groups.

## Constraints

We want this feature to work reasonably well with any currently supported CA
configuration. In particular, we want to support both homogeneous and
heterogeneous clusters, allowing the user to easily choose or implement a
strategy for deciding what kind of node should be added (e.g. one large
instance vs. several small instances).

Those goals imply a few more specific constraints that we want to keep:

* We want to avoid balancing the size of node groups composed of different
  instance types (there is no point in trying to balance the number of 2-CPU
  and 32-CPU machines).
* We want to respect the size limits of each node group. In particular, we
  shouldn't expect the limits to be the same for corresponding node groups.
* Some pods may need to run on a specific node group, due to using a
  labelSelector on the zone label or due to anti-affinity. In such cases we
  must scale up the correct node group.
* Preferably we should avoid implementing this using the existing
  expansion.Strategy interface, as that would likely require a
  non-backward-compatible refactor of the interface, and we want to allow
  using this feature with different strategies.
* The user must be able to disable this logic using a flag.

## General idea

The general idea behind this proposal is to introduce the concept of a "Node
Group Set", consisting of one or more "similar" node groups (the definition of
"similar" is provided in a separate section). When scaling up, we would split
the new nodes between node groups in the same set to make their sizes as equal
as possible. For example, assume a node group set made of node groups A
(currently 1 node), B (3 nodes), and C (6 nodes). If we needed to add a single
node to the cluster, it would go to group A. If we needed to add 4 nodes, 3 of
them would go to group A and 1 to group B, as illustrated by the sketch below.
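
To make the splitting rule concrete, here is a minimal Go sketch. The `group`
type, the `splitNodes` function, and the size limits are illustrative, not
Cluster Autoscaler's actual API; it simply grows the smallest group that still
has spare capacity, one node at a time:

```go
package main

import "fmt"

// group is a hypothetical stand-in for a node group's current and
// maximum size; the real types live in the cloudprovider package.
type group struct {
	name    string
	size    int
	maxSize int
}

// splitNodes distributes k new nodes across the groups, always growing
// the currently smallest group that still has spare capacity, and
// returns the number of nodes to add to each group.
func splitNodes(groups []group, k int) map[string]int {
	result := map[string]int{}
	for ; k > 0; k-- {
		smallest := -1
		for i, g := range groups {
			if g.size >= g.maxSize {
				continue // this group is already at its size limit
			}
			if smallest == -1 || g.size < groups[smallest].size {
				smallest = i
			}
		}
		if smallest == -1 {
			break // every group is at its limit; k nodes remain unplaced
		}
		groups[smallest].size++
		result[groups[smallest].name]++
	}
	return result
}

func main() {
	groups := []group{{"A", 1, 10}, {"B", 3, 10}, {"C", 6, 10}}
	// The example from the text: 3 nodes go to A and 1 goes to B.
	fmt.Println(splitNodes(groups, 4)) // map[A:3 B:1]
}
```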

Note that this does not guarantee that node groups will always have the same
size. Cluster Autoscaler will add exactly as many nodes as are required for
the pending pods, and that number may not be divisible by the number of node
groups in the node group set. Additionally, we scale down underutilized
nodes, which may happen to be in the same node group. Taking the relative
sizes of similar node groups into account in scale-down logic will be covered
by a separate proposal later on.

## Implementation proposal

There will be no change to how expansion options are generated in the ScaleUp
function. Instead, the balancing will be executed after an expansion option is
chosen by expansion.Strategy and before the node group is resized. The
high-level algorithm will be as follows:

1. During the loop generating expansion options, create a map {node group ->
   set of pods that pass predicates for this node group}. We already calculate
   this; we just need to store it.
2. Take the expansion option chosen by expansion.Strategy and call it E. Let
   NG be the node group chosen by the strategy and K be the number of nodes
   that need to be added. Let P be the set of all pods that will be scheduled
   thanks to E.
3. Find all node groups "similar" to NG and call them NGS.
4. Check if every pod in P passes scheduler predicates for a sample node in
   every node group in NGS (by checking if P is a subset of the set we stored
   in step 1). Remove from NGS any node group on which at least one pod from P
   can't be scheduled (a sketch of this filtering step follows the list).
5. Add NG to NGS.
6. Get the current and maximum size of all node groups in NGS. Split K between
   the node groups as described in the example above.
7. Resize all groups in NGS as per the result of step 6.
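
A minimal sketch of the filtering in step 4, assuming hypothetical NodeGroup
and Pod types and the step-1 map represented as nested Go maps; the real
implementation would operate on cloudprovider.NodeGroup and pod objects:

```go
package balancer

// NodeGroup and Pod are hypothetical stand-ins for the real
// cloudprovider.NodeGroup and pod types used by Cluster Autoscaler.
type NodeGroup string
type Pod string

// filterSchedulable implements step 4: it removes from ngs every node
// group on which at least one pod from p cannot be scheduled, using the
// {node group -> schedulable pods} map built in step 1.
func filterSchedulable(
	ngs []NodeGroup,
	p []Pod,
	schedulable map[NodeGroup]map[Pod]bool,
) []NodeGroup {
	result := make([]NodeGroup, 0, len(ngs))
	for _, ng := range ngs {
		ok := true
		for _, pod := range p {
			if !schedulable[ng][pod] {
				ok = false
				break
			}
		}
		if ok {
			result = append(result, ng)
		}
	}
	return result
}
```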

If the user sets the corresponding flag to 'false', we skip step 3, leaving a
single element in NGS once step 5 adds NG (this makes step 4 a no-op and step
6 trivial).
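
For illustration, the flag could be wired up as below; the flag name and
default value are assumptions, since the proposal only requires that some flag
can disable the balancing logic:

```go
package main

import (
	"flag"
	"fmt"
)

// The flag name is an assumption for illustration purposes only.
var balanceSimilarNodeGroups = flag.Bool("balance-similar-node-groups", false,
	"Detect similar node groups and balance the number of nodes between them")

func main() {
	flag.Parse()
	fmt.Println("balancing enabled:", *balanceSimilarNodeGroups)
}
```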

## Similar node groups

We will balance the sizes of similar node groups. We want similar groups to
consist of machines with the same instance type and the same set of custom
labels. In particular, we define "similar" node groups as having the following
(a comparison sketch follows the list):

* The same Capacity for every resource.
* Allocatable for each resource within 5% of each other (this number can
  depend on a few different factors, so it's good to have some minor slack).
* "Free" resources (defined as Allocatable minus resources used by daemonsets
  and kube-proxy) for each resource within 5% of each other (again, this
  number can depend on a few different factors, so it's good to have some
  minor slack).
* The same set of labels, except for the zone and hostname labels (defined in
  https://github.com/kubernetes/kube-state-metrics/blob/master/vendor/k8s.io/client-go/pkg/api/unversioned/well_known_labels.go).
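
A sketch of the two non-trivial checks, with quantities simplified to float64
(the real implementation would compare resource.Quantity values) and the
zone/hostname label keys taken from the well-known labels file linked above:

```go
package balancer

// resourcesWithinTolerance reports whether, for every resource, the two
// quantities are within the given relative tolerance (0.05 for the 5%
// rule above). Assumes quantities are non-negative.
func resourcesWithinTolerance(a, b map[string]float64, tolerance float64) bool {
	if len(a) != len(b) {
		return false // different sets of resources
	}
	for name, va := range a {
		vb, ok := b[name]
		if !ok {
			return false
		}
		larger, smaller := va, vb
		if vb > va {
			larger, smaller = vb, va
		}
		if larger == 0 {
			continue // both quantities are zero
		}
		if (larger-smaller)/larger > tolerance {
			return false
		}
	}
	return true
}

// ignoredLabels lists the labels excluded from the comparison.
var ignoredLabels = map[string]bool{
	"failure-domain.beta.kubernetes.io/zone": true,
	"kubernetes.io/hostname":                 true,
}

// sameLabels checks that two label sets are equal after dropping the
// zone and hostname labels.
func sameLabels(a, b map[string]string) bool {
	filtered := func(labels map[string]string) map[string]string {
		out := map[string]string{}
		for k, v := range labels {
			if !ignoredLabels[k] {
				out[k] = v
			}
		}
		return out
	}
	fa, fb := filtered(a), filtered(b)
	if len(fa) != len(fb) {
		return false
	}
	for k, v := range fa {
		if fb[k] != v {
			return false
		}
	}
	return true
}
```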

---

## Other possible solutions and discussion

There are other ways to implement the general idea besides the proposed
solution. This section lists the other options that were considered and
discusses the pros and cons of each one. Feel free to skip it.

#### [S1] Split selected expansion option

This is the solution described in the "Implementation proposal" section.

Pros:
* Simplest solution.

Cons:
* If even a single pending pod uses zone-based scheduling features, the whole
  scale-up will likely go to a single node group.
* We add slightly different nodes than those chosen by expansion.Strategy. In
  particular, any zone-based choices made by expansion.Strategy will be
  discarded.
* We operate on single node groups when creating expansion options, so the
  maximum scale-up size is limited by the maximum size of a single group.

#### [S2] Update expansion options before choosing them

This idea is somewhat similar to [S1], but the new method would be called on
the set of expansion options before expansion.Strategy chooses one. The new
method would modify each option to contain a set of scale-ups on similar node
groups.

Pros:
* Addresses some issues of [S1], but not the issues related to pods using
  zone-based scheduling.

Cons:
* To fix the issues of [S1], the function processing expansion strategies
  needs to be more complex.
* We need to update expansion.Option to be based on multiple NodeGroups, not
  just one. This will make expansion.Strategy considerably more complex and
  difficult to implement.

#### [S3] Make a wrapper for cloudprovider.NodeGroup

This solution would work by implementing a NodeGroupSet wrapper that
implements the cloudprovider.NodeGroup interface. It would consist of one or
more NodeGroups and internally balance their sizes.

Pros:
* We could use an aggregated maximum size for all node groups when creating
  expansion options.
* We could add additional methods to NodeGroupSet to split pending pods
  between the underlying NodeGroups in a smart way, allowing us to deal with
  pod anti-affinity and zone-based label selectors without completely skipping
  size balancing.

Cons:
* The NodeGroup contract assumes all nodes in a NodeGroup are identical.
  However, we want to allow at least different zone labels between node groups
  in a node group set.
* Actually doing the smart handling of labelSelectors, etc. will be complex
  and hard to implement.

#### [S4] Refactor scale-up logic to include balancing similar node groups

This solution would change how expansion options are generated in
core/scale_up.go. The main ScaleUp function could be largely rewritten to take
balancing node groups into account.

Pros:
* Most benefits of [S3].
* Avoids the problems caused by breaking the NodeGroup interface contract.

Cons:
* Complex.
* Large changes to critical parts of the code, making this the option most
  likely to cause a regression.

#### Discussion

A lot of the difficulty of the problem comes from the fact that we can have
pods that can only schedule on some of the node groups in a given node group
set. Such pods require specific configuration by the user (a zone-based
labelSelector or anti-affinity) and are likely not very common in most
clusters. Additionally, one can argue that having a majority of pods
explicitly specify the zone they want to run in defeats the purpose of
automatically balancing the size of node groups between zones in the first
place.

If we treat those pods as an edge case, options [S3] and [S4] don't seem very
attractive. The main benefit of [S3] and [S4] is the ability to deal with such
edge cases, at the cost of significantly increased complexity.

That leaves options [S1] and [S2]. Once again, this is a decision between
better handling of difficult cases and complexity. This time the tradeoff
applies mostly to the expansion.Strategy interface. So far there are no
implementations of this interface that make zone-based decisions, and making
expansion options more complex (by having them consist of a set of NodeGroups)
would make all existing strategies more complex as well, for no benefit. So it
seems that [S1] is the best available option, by virtue of its relative
simplicity.