From d7a7fc659b9e0c037dfab9029a101cb5076f2379 Mon Sep 17 00:00:00 2001
From: Maciej Pytel
Date: Fri, 26 May 2017 15:02:34 +0200
Subject: [PATCH] Proposal for balancing node groups for multizone

---
 .../proposals/balance_similar.md | 175 ++++++++++++++++++
 1 file changed, 175 insertions(+)
 create mode 100644 cluster-autoscaler/proposals/balance_similar.md

diff --git a/cluster-autoscaler/proposals/balance_similar.md b/cluster-autoscaler/proposals/balance_similar.md
new file mode 100644
index 0000000000..a27176d4fe
--- /dev/null
+++ b/cluster-autoscaler/proposals/balance_similar.md
@@ -0,0 +1,175 @@
# Balance similar node groups between zones
##### Author: MaciekPytel

## Introduction
We have had multiple requests from people who want to use node groups with the
same instance type, but located in multiple zones for redundancy / HA.
Currently Cluster Autoscaler adds and deletes nodes in those node groups at
random, which results in an uneven node distribution across the different
zones. The goal of this proposal is to introduce a mechanism for balancing the
number of nodes in similar node groups.

## Constraints
We want this feature to work reasonably well with any currently supported CA
configuration. In particular we want to support both homogeneous and
heterogeneous clusters, allowing the user to easily choose or implement a
strategy for deciding what kind of node should be added (i.e. a large instance
vs. several small instances, etc.).

Those goals imply a few more specific constraints that we want to keep:
 * We want to avoid balancing the size of node groups composed of different
   instances (there is no point in trying to balance the number of 2-CPU and
   32-CPU machines).
 * We want to respect the size limits of each node group. In particular we
   shouldn't expect that the limits will be the same for corresponding node
   groups.
 * Some pods may have to run on a specific node group because they use a
   labelSelector on the zone label, or anti-affinity. In such cases we must
   scale the correct node group.
 * Preferably we should avoid implementing this using the existing
   expansion.Strategy interface, as that would likely require a
   non-backward-compatible refactor of the interface and we want to allow
   using this feature with different strategies.
 * The user must be able to disable this logic using a flag.

## General idea
The general idea behind this proposal is to introduce the concept of a "Node
Group Set", consisting of one or more "similar" node groups (the definition of
"similar" is provided in a separate section). When scaling up we would split
the new nodes between node groups in the same set to make their sizes as
similar as possible. For example, assume a node group set made of node groups
A (currently 1 node), B (3 nodes), and C (6 nodes). If we needed to add a new
node to the cluster it would go to group A. If we needed to add 4 nodes, 3 of
them would go to group A and 1 to group B.

Note that this does not guarantee that node groups will always have the same
size. Cluster Autoscaler will add exactly as many nodes as are required for
the pending pods, and this number may not be divisible by the number of node
groups in the node group set. Additionally, we scale down underutilized nodes,
which may happen to be in the same node group. Including relative sizes of
similar node groups in the scale-down logic will be covered by a separate
proposal later on.
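The splitting can be illustrated with a short sketch. The snippet below is only
meant to demonstrate the idea on the A/B/C example above; the names
(`groupInfo`, `splitNewNodes`) are made up for this proposal and do not refer
to existing Cluster Autoscaler code. It greedily gives each new node to the
currently smallest group that is still below its maximum size.

```go
package main

import "fmt"

// groupInfo holds the current and maximum size of one node group in a node
// group set.
type groupInfo struct {
	name    string
	size    int
	maxSize int
}

// splitNewNodes decides how to distribute k new nodes among groups by
// repeatedly giving one node to the smallest group that is still below its
// maximum size. It returns the number of nodes to add to each group; if every
// group is already at its maximum, the remaining nodes are not assigned.
func splitNewNodes(groups []groupInfo, k int) map[string]int {
	result := map[string]int{}
	for ; k > 0; k-- {
		best := -1
		for i, g := range groups {
			if g.size >= g.maxSize {
				continue // respect the per-group size limit
			}
			if best == -1 || g.size < groups[best].size {
				best = i
			}
		}
		if best == -1 {
			break // no group can grow any further
		}
		groups[best].size++
		result[groups[best].name]++
	}
	return result
}

func main() {
	// The example from the section above: A has 1 node, B has 3, C has 6,
	// and 4 new nodes are needed.
	groups := []groupInfo{{"A", 1, 10}, {"B", 3, 10}, {"C", 6, 10}}
	fmt.Println(splitNewNodes(groups, 4)) // prints: map[A:3 B:1]
}
```

Running it on the example yields 3 nodes for A and 1 for B, matching the split
described above.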
## Implementation proposal
There will be no change to how expansion options are generated in the ScaleUp
function. Instead the balancing will be executed after an expansion option is
chosen by expansion.Strategy and before the node group is resized. The
high-level algorithm will be as follows:
1. During the loop generating expansion options, create a map {node group ->
   set of pods that pass predicates for this node group}. We already calculate
   this, we just need to store it.
2. Take the expansion option chosen by expansion.Strategy and call it E. Let
   NG be the node group chosen by the strategy and K the number of nodes that
   need to be added. Let P be the set of all pods that will be scheduled
   thanks to E.
3. Find all node groups "similar" to NG and call them NGS.
4. Check if every pod in P passes scheduler predicates for a sample node in
   every node group in NGS (by checking if P is a subset of the set we stored
   in step 1). Remove from NGS any node group on which at least one pod from P
   can't be scheduled.
5. Add NG to NGS.
6. Get the current and maximum size of all node groups in NGS. Split K between
   the node groups as described in the example above.
7. Resize all groups in NGS as per the result of step 6.

If the user sets the corresponding flag to 'false' we skip step 3, which
leaves a single element in NGS (making step 4 a no-op and step 6 trivial).

## Similar node groups
We will balance the sizes of similar node groups. We want similar groups to
consist of machines with the same instance type and with the same set of
custom labels. In particular, we define "similar" node groups as having:
 * The same Capacity for every resource.
 * Allocatable for each resource within 5% of each other (the exact value can
   depend on a few different factors, so it's good to allow some minor slack).
 * "Free" resources (defined as Allocatable minus resources used by daemonsets
   and kube-proxy) for each resource within 5% of each other (same reasoning
   as above).
 * The same set of labels, except for the zone and hostname labels (defined in
   https://github.com/kubernetes/kube-state-metrics/blob/master/vendor/k8s.io/client-go/pkg/api/unversioned/well_known_labels.go).
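To make the criteria above concrete, here is a minimal, self-contained sketch
of such a comparison. It deliberately uses plain Go maps instead of the real
Kubernetes node types, and the names (`nodeInfo`, `matches`, `similar`) are
hypothetical; a real implementation would build this information from a sample
node of each node group and ignore the well-known zone and hostname labels
from the file linked above.

```go
// Package similarity sketches the "similar node groups" check; it is
// illustrative only and does not use the real Kubernetes API types.
package similarity

import "reflect"

// resources maps a resource name (e.g. "cpu", "memory") to its amount in the
// smallest unit (millicores, bytes, ...).
type resources map[string]int64

// nodeInfo describes a sample node taken from a node group.
type nodeInfo struct {
	capacity    resources
	allocatable resources
	free        resources // allocatable minus daemonset and kube-proxy usage
	labels      map[string]string
}

// Labels that are allowed to differ between otherwise similar node groups.
var ignoredLabels = map[string]bool{
	"failure-domain.beta.kubernetes.io/zone": true,
	"kubernetes.io/hostname":                 true,
}

const slack = 0.05 // 5% tolerance for allocatable and free resources

// matches reports whether both resource lists contain the same resources and
// every pair of values differs by at most maxRatio of the larger one
// (maxRatio == 0 means the values must be exactly equal).
func matches(a, b resources, maxRatio float64) bool {
	if len(a) != len(b) {
		return false
	}
	for name, av := range a {
		bv, ok := b[name]
		if !ok {
			return false
		}
		lo, hi := av, bv
		if lo > hi {
			lo, hi = hi, lo
		}
		if float64(hi-lo) > float64(hi)*maxRatio {
			return false
		}
	}
	return true
}

// similar applies the checks listed above to sample nodes from two node groups.
func similar(a, b nodeInfo) bool {
	if !matches(a.capacity, b.capacity, 0) { // identical Capacity
		return false
	}
	if !matches(a.allocatable, b.allocatable, slack) ||
		!matches(a.free, b.free, slack) {
		return false
	}
	// All labels except the ignored (zone / hostname) ones must match exactly.
	filter := func(labels map[string]string) map[string]string {
		out := map[string]string{}
		for k, v := range labels {
			if !ignoredLabels[k] {
				out[k] = v
			}
		}
		return out
	}
	return reflect.DeepEqual(filter(a.labels), filter(b.labels))
}
```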
---

## Other possible solutions and discussion
There are other ways to implement the general idea than the proposed solution.
This section lists the other options that were considered and discusses the
pros and cons of each one. Feel free to skip it.

#### [S1]Split selected expansion option
This is the solution described in the "Implementation proposal" section.

Pros:
 * Simplest solution.

Cons:
 * If even a single pending pod uses zone-based scheduling features, the whole
   scale-up will likely go to a single node group.
 * We add slightly different nodes than those chosen by expansion.Strategy. In
   particular, any zone-based choices made by expansion.Strategy will be
   discarded.
 * We operate on single node groups when creating expansion options, so the
   maximum scale-up size is limited by the maximum size of a single group.

#### [S2]Update expansion options before choosing them
This idea is somewhat similar to [S1], but the new method would be called on
the set of expansion options before expansion.Strategy chooses one. The new
method would modify each option to contain a set of scale-ups on similar node
groups.

Pros:
 * Addresses some issues of [S1], but not the issues related to pods using
   zone-based scheduling.

Cons:
 * To fix the issues of [S1], this new processing function needs to be more
   complex.
 * Need to update expansion.Option to be based on multiple NodeGroups, not
   just one. This will make expansion.Strategy considerably more complex and
   difficult to implement.

#### [S3]Make a wrapper for cloudprovider.CloudProvider
This solution would work by implementing a NodeGroupSet wrapper that
implements the cloudprovider.NodeGroup interface. It would consist of one or
more NodeGroups and internally balance their sizes.

Pros:
 * We could use an aggregated maximum size for all node groups when creating
   expansion options.
 * We could add additional methods to NodeGroupSet to split pending pods
   between the underlying NodeGroups in a smart way, allowing it to deal with
   pod anti-affinity and zone-based label selectors without completely
   skipping size balancing.

Cons:
 * The NodeGroup contract assumes all nodes in a NodeGroup are identical.
   However, we want to allow at least different zone labels between node
   groups in a node group set.
 * Actually implementing the smart handling of labelSelectors, etc. would be
   complex and hard to get right.

#### [S4]Refactor scale-up logic to include balancing similar node groups
This solution would change how expansion options are generated in
core/scale_up.go. The main ScaleUp function could be largely rewritten to take
balancing node groups into account.

Pros:
 * Most of the benefits of [S3].
 * Avoids the problems caused by breaking the NodeGroup interface contract.

Cons:
 * Complex.
 * Large changes to critical parts of the code, the most likely of all options
   to cause regressions.

#### Discussion
A lot of the difficulty of this problem comes from the fact that we can have
pods that can only be scheduled on some of the node groups in a given node
group set. Such pods require specific configuration by the user (a zone-based
labelSelector or anti-affinity) and are likely not very common in most
clusters. Additionally, one can argue that having a majority of pods
explicitly specify the zone they want to run in defeats the purpose of
automatically balancing the size of node groups between zones in the first
place.

If we treat those pods as an edge case, options [S3] and [S4] don't seem very
attractive. The main benefit of options [S3] and [S4] is being able to deal
with such edge cases, at the cost of significantly increased complexity.

That leaves options [S1] and [S2]. Once again this is a decision between
better handling of difficult cases and complexity. This time the trade-off
applies mostly to the expansion.Strategy interface. So far there are no
implementations of this interface that make zone-based decisions, and making
expansion options more complex (by having them consist of a set of NodeGroups)
would make all existing strategies more complex as well, for no benefit. So it
seems that [S1] is the best available option by virtue of its relative
simplicity.