community/contributors/guide/scalability-good-practices.md

*This document is written for contributors who would like to avoid their code being reverted for performance reasons.*

**Table of Contents**

- [Who should read this document and what is in it?](#who-should-read-this-document-and-what-is-in-it)
- [What does it mean to "break scalability"?](#what-does-it-mean-to-break-scalability)
- [Examples](#examples)
  - [Inefficient use of memory](#inefficient-use-of-memory)
  - [Explicit lists from the API server](#explicit-lists-from-the-api-server)
  - [Superfluous API calls](#superfluous-api-calls)
  - [Complex and expensive computations on a critical path](#complex-and-expensive-computations-on-a-critical-path)
  - [Big dependency changes](#big-dependency-changes)
- [Summary](#summary)
- [Closing remarks](#closing-remarks)

## Who should read this document and what is in it?

This document is targeted at developers of "vanilla Kubernetes" who do not want their
changes rolled-back or blocked because they cause performance regressions. It contains
some of the knowledge and experience gathered by the scalability team over more than two years.

It is presented as a set of examples from the past which broke scalability tests,
followed by some explanations and general suggestions on how to avoid causing similar problems.

## What does it mean to "break scalability"?

"Breaking scalability" means causing performance SLO violations in one of our performance tests.
Performance SLOs for Kubernetes are <sup>[2](#2)</sup>:

- 99th percentile of API call latencies <= 1s
- 99th percentile of e2e Pod startup, excluding image pulling, latencies <= 5s

We run density and load tests, and we invite anyone interested in the details to read the code.

We run those tests on large clusters (100+ Nodes). This makes the tests somewhat
resistant to limited concurrency in Kubelet. By contrast, they routinely fail on
very small clusters, where the Scheduler cannot spread Pod creations broadly enough.

## Examples

### Inefficient use of memory

Consider the following sample code snippet:

```golang
func (s Scheduler) ScheduleOne(pod v1.Pod, nodes []v1.Node) v1.Node {
  for _, node := range nodes {
    if s.FitsNode(pod, node) {
      …
    }
  }
}

func (s Scheduler) DoSchedule(podsChan chan v1.Pod) {
  for {
    …
    node := s.ScheduleOne(pod, s.nodes)
    …
  }
}
```

This snippet contains a number of problems that have always been present in the Kubernetes
codebase, and continue to appear. We try to address them in the most important places,
but the work never ends.

The first problem is that `func (s Scheduler) ScheduleOne…` has a value receiver, so each call
of `ScheduleOne` will run on a new copy of the `Scheduler` object. This in turn means Golang
will need to copy the entire `Scheduler` struct every time the `ScheduleOne` function is called.
The copy will then be discarded when the function returns. Clearly, this is a waste of resources,
and in some cases may be incorrect (for example, changes made to the copy are lost).

Next, `(pod v1.Pod, nodes []v1.Node)` has much in common with the first problem. By default,
Golang passes arguments by value, i.e. copies them when they are passed to the function.
_Note that this is very different from Java or Python_. Of course, some things are fine to
pass directly. Slices, maps, strings and interfaces are small values that hold a pointer to
the underlying data (in general an interface might wrap a non-pointer value, but in our code
they wrap pointers - see the first point), so only that small value is copied when they are
passed as an argument. For flat structures, copying is sometimes necessary (e.g. when doing
asynchronous modifications), but most often it is not. In such cases, use pointers.

As there are no constant references in Golang, this is the only option for passing objects
without copying them (except creating read-only interfaces for all types, but that is not feasible).
Note that it is (and should be) scary to pass a pointer to your object to strangers.
Before you do so, make sure the code to which you are passing the pointer will not modify the object,
and that concurrent access cannot cause data races. Note that objects in all `Informer` caches
(see the next section) are expected to be treated as immutable.
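
The difference between value and pointer receivers described above can be seen in a minimal, self-contained sketch. The `scheduler` and `node` types and all method names below are hypothetical stand-ins chosen for illustration. They are not the real Kubernetes types.

```go
package main

import "fmt"

// node and scheduler are hypothetical stand-ins for the real Kubernetes types.
type node struct {
	name   string
	labels [64]string // a large field makes every copy of the struct expensive
}

type scheduler struct {
	nodes     []node
	scheduled int
}

// countByValue has a value receiver: it runs on a copy of the scheduler,
// so the increment is (deliberately, for this demo) lost on return.
func (s scheduler) countByValue() { s.scheduled++ }

// countByPointer has a pointer receiver: only a pointer is copied,
// and the increment is visible to the caller.
func (s *scheduler) countByPointer() { s.scheduled++ }

// firstNode returns a pointer into the slice instead of copying the element.
// Callers must treat the result as read-only.
func firstNode(nodes []node) *node {
	if len(nodes) == 0 {
		return nil
	}
	return &nodes[0]
}

// describe formats the scheduler state without copying the struct.
func (s *scheduler) describe() string {
	return fmt.Sprintf("%d node(s), %d scheduled", len(s.nodes), s.scheduled)
}

func main() {
	s := &scheduler{nodes: []node{{name: "node-a"}}}
	s.countByValue()
	fmt.Println(s.describe()) // 1 node(s), 0 scheduled
	s.countByPointer()
	fmt.Println(s.describe()) // 1 node(s), 1 scheduled
	fmt.Println(firstNode(s.nodes).name)
}
```

Returning `*node` avoids copying the element, at the cost of trusting every caller not to modify it.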

We could go on and on, but the point is clear -- when writing code that will be executed often,
you need to think about memory management. From time to time we all forget to keep
this in mind, but we are reminded of it when we look at performance. General rules are:

- Using the heap is very expensive (because of garbage collection).
- Avoid unnecessary heap operations altogether.
- Repeatedly copying objects is noticeable and should be minimized.
- Learn how Golang manages memory. This is especially important in components running
  on the control plane. Otherwise we may end up in the situation where the API server
  is starved of CPU and cannot respond quickly to requests.

### Explicit lists from the API server

Some time ago most of our controllers looked like this:

```golang
func (c *ControllerX) Reconcile() {
  items, err := c.kubeClient.X(v1.NamespaceAll).List(&v1.ListOptions{})
  if err != nil {
    ...
  }
  for _, item := range items.Items {
    ...
  }
}

func (c *ControllerX) Run() {
  wait.Until(c.Reconcile, c.Period, wait.NeverStop)
  ...
}
```

This may look OK, but List() calls are expensive. Objects can have sizes of a few
kilobytes, and there can be 150,000 of those. This means List() would need to send
hundreds of megabytes through the network, not to mention the API server would need
to do conversions of all this data along the way. It is not the end of the world,
but it needs to be minimized. The solution is simple (quoting Clayton):

> As a rule, use Informer. If using Informer, use shared Informers.
> If your use case does not look like an Informer, look harder.
> If at the very end of that it still does not look like an Informer,
> consider using something else after talking to someone. But probably use Informer.

`Informer` is our library which provides a read interface to the store - it is a
read-only cache that provides you with a local copy of the store that contains
only the objects you are interested in (those matching a given selector). From it you
can Get(), List() or perform whatever read operations you desire. `Informer` also allows
you to register functions that will be called when an object is created, modified or deleted.

The magic behind `Informers` is that they are populated by a WATCH,
so they create minimal stress on the API server. The code for Informer is
[here](https://git.k8s.io/kubernetes/staging/src/k8s.io/client-go/tools/cache/shared_informer.go).

In general: use `Informers` - if we were able to rewrite most vanilla controllers to use them,
you should be able to do so as well. Otherwise, you may dramatically increase the CPU
requirements of the API server which will starve it and make it too slow to meet our SLOs.
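
To make the idea concrete, here is a toy, standard-library-only model of what an Informer does: a local cache updated by a stream of watch events, with reads served locally and a handler invoked on each change. All names (`toyInformer`, `event`, `logEvent` and so on) are hypothetical, and this is not the client-go API. Real controllers should use the shared Informers from client-go.

```go
package main

import (
	"fmt"
	"sync"
)

// eventType and event are hypothetical stand-ins for watch events.
type eventType int

const (
	added eventType = iota
	modified
	deleted
)

type event struct {
	kind  eventType
	key   string
	value string
}

// toyInformer keeps a local cache that is updated only by watch events.
type toyInformer struct {
	mu      sync.RWMutex
	store   map[string]string
	onEvent func(event) // optional handler, called after the cache is updated
}

func newToyInformer(onEvent func(event)) *toyInformer {
	return &toyInformer{store: map[string]string{}, onEvent: onEvent}
}

// run applies events until the channel is closed. In a real Informer the
// events come from a WATCH against the API server.
func (i *toyInformer) run(events <-chan event) {
	for ev := range events {
		i.mu.Lock()
		if ev.kind == deleted {
			delete(i.store, ev.key)
		} else {
			i.store[ev.key] = ev.value
		}
		i.mu.Unlock()
		if i.onEvent != nil {
			i.onEvent(ev)
		}
	}
}

// get reads from the local cache and never contacts the API server.
func (i *toyInformer) get(key string) (string, bool) {
	i.mu.RLock()
	defer i.mu.RUnlock()
	v, ok := i.store[key]
	return v, ok
}

// list returns the keys currently in the local cache (in no particular order).
func (i *toyInformer) list() []string {
	i.mu.RLock()
	defer i.mu.RUnlock()
	keys := make([]string, 0, len(i.store))
	for k := range i.store {
		keys = append(keys, k)
	}
	return keys
}

// logEvent is an example handler.
func logEvent(ev event) { fmt.Println("event for", ev.key) }

func main() {
	events := make(chan event, 3)
	events <- event{added, "default/a", "v1"}
	events <- event{modified, "default/a", "v2"}
	events <- event{added, "default/b", "v1"}
	close(events)

	inf := newToyInformer(logEvent)
	inf.run(events)

	v, _ := inf.get("default/a")
	fmt.Println(v, len(inf.list()))
}
```

However many times the controller calls `get` or `list`, the only traffic to the server in this model is the event stream itself.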

### Superfluous API calls

One past regression was caused by `Secret` refreshing logic in Kubelet. By contract
we want to update values of `Secrets` (update env variables, contents of `Secret` volume)
when the contents of `Secret` are updated in the API server. Normally we would use
`Informer` (see above), but there is an additional security constraint: Kubelet should
know only about `Secrets` that are attached to `Pods` scheduled on the corresponding `Node`,
so there should be no watching of all `Secret` updates (which is how `Informers` work).
We already know that List() calls are also bad (not to mention that they have the same
security problem as WATCH), so the only way we can read `Secrets` is through GET.

For each `Secret` we were periodically GETting its value and updating underlying
variables/volumes as necessary. We have the same logic for `ConfigMaps`. Everything
was great until we turned on the `ServiceAccount` admission controller in our
performance tests. Then everything went wrong for a very simple reason:
the `ServiceAccount` admission controller creates a `Secret` that it attaches to
every `Pod` (a different one in every Namespace, but this does not change anything).
Multiply this behavior by 150,000 `Pods` and, given a refresh period of 60 seconds,
an additional 2.5k QPS (150,000 / 60) were being sent to the API server, which of course caused it to fail.

To mitigate this issue we had to reimplement Informers using GETs instead of WATCHes.
The current solution consists of a `Secret` cache shared between all `Pods`.
When a `Pod` wants to check if the `Secret` has changed it looks in the cache.
If the `Secret` stored in the cache is too old, the cache issues a GET request to
the API server to refresh the value. As `Pods` within a single `Namespace` share
the `Secret` for `ServiceAccount`, it means Kubelet will need to refresh the
`Secret` only once in a while per `Namespace`, not per `Pod`, as it was before.
This of course is a stopgap and not a final solution, which is currently (as of early May 2017)
being designed as a ["Bulk Watch"](https://github.com/kubernetes/community/pull/443).

This example demonstrates why you need to treat API calls as a rather expensive
shared resource. This is especially important on the Node side, as every change
is multiplied by the number of Nodes, which can be as high as 5,000. In controllers,
especially when writing some disaster recovery logic, it is perfectly fine to add a
new call. There are not a lot of controllers, and disaster recovery should not happen
too often. That being said, whenever you add a new API server request you should do
a quick estimation of the QPS that will be added to the API server, and if the result
is a noticeable number you probably should think about a way to reduce it.

One obvious consequence of not reducing API calls is that you will starve the API
server of CPU. This particular pattern can also exhaust the in-flight request limit of
the API server (`--max-requests-inflight`), which will make it respond with 429s
(Too Many Requests) and thus slow down the system. At best it will only drain the local
client rate limiter for API calls in your component (the default value is 5 QPS,
controllers normally have 20). This will result in your component being very, very slow.

### Complex and expensive computations on a critical path

Let us use the `PodAntiAffinity` scheduling feature as an example.
The goal of this feature is to allow users to prevent co-scheduling of `Pods`
(using a very broad definition of co-scheduling). When defining `PodAntiAffinity`
you pass two things: `Node` grouping and `Pod` selector. The semantics is that for
each group of `Nodes` you check if any `Node` in the group runs a `Pod` matching
the selector. If it does, all `Nodes` from the group are discarded. This of course
needs to be symmetric: if you prevent `Pods` from set A from being co-scheduled with
`Pods` from set B, but not the other way around, then when adding a new `Pod` to set B
you will end up with `Pods` from A and B running in the same group, which is what you wanted to avoid.

This means that even when scheduling `Pods` that do not explicitly use the `PodAntiAffinity`
feature you need to check `PodAntiAffinities` of all `Pods` running in the cluster.
It also means that scheduling of every `Pod` gets an additional check of `O(#Pods * #Nodes)`
complexity, if naively implemented. Given the fact that we can have 150,000 `Pods` in the
cluster, it becomes obvious it is not a good idea to have quadratic algorithms on a critical
path for Pods - even for ones that do not use the PodAntiAffinity feature!

This was initially implemented in a very simple way, rapidly making the scheduler
unusable, and `Pod` startup times went through the roof. We were forced to block
this feature, and it did not make it into the target release. Later, we slightly
improved the algorithm to `O(#(scheduled Pods with PodAntiAffinity) * #Nodes)`,
which was enough to allow the feature to get in as beta, with a huge asterisk next to it.
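
To get a feel for the difference between the two bounds, the sketch below simply counts the (Pod, Node) pairs each approach would examine. It is not the scheduler's algorithm: the `pod` type and function names are hypothetical, and the cluster shape in `main` (1% of Pods using anti-affinity) is an assumption made up for illustration.

```go
package main

import "fmt"

// pod is a hypothetical, minimal stand-in for a scheduled Pod.
type pod struct {
	name            string
	hasAntiAffinity bool
}

// naiveChecks counts the (Pod, Node) pairs examined when every existing Pod
// is checked against every Node: O(#Pods * #Nodes).
func naiveChecks(pods []pod, nodes int) int {
	return len(pods) * nodes
}

// filteredChecks counts the pairs examined when only Pods that declare
// PodAntiAffinity are considered:
// O(#(scheduled Pods with PodAntiAffinity) * #Nodes).
// A real implementation would maintain this subset incrementally rather
// than rescanning all Pods on every scheduling attempt.
func filteredChecks(pods []pod, nodes int) int {
	withAntiAffinity := 0
	for _, p := range pods {
		if p.hasAntiAffinity {
			withAntiAffinity++
		}
	}
	return withAntiAffinity * nodes
}

// summary compares both counts for one scheduling attempt.
func summary(pods []pod, nodes int) string {
	return fmt.Sprintf("naive: %d checks, filtered: %d checks",
		naiveChecks(pods, nodes), filteredChecks(pods, nodes))
}

func main() {
	// Assumed cluster: 150,000 Pods, 1% with PodAntiAffinity, 5,000 Nodes.
	pods := make([]pod, 150000)
	for i := range pods {
		pods[i].hasAntiAffinity = i%100 == 0
	}
	fmt.Println(summary(pods, 5000))
}
```

Under these assumptions the naive approach examines 750,000,000 pairs per scheduled Pod, against 7,500,000 for the filtered one. The improvement disappears as more Pods adopt the feature, which is why the asterisk remained.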

This example illustrates how many problems in this area can be much more complex
than they seem. Not only that, they are non-linear, and some of them are NP-complete.
Understandably, sometimes you need to write something complex, but when you do,
you must protect the rest of the system from that complexity,
and add it only where it is absolutely necessary.

### Big dependency changes

Kubernetes depends on pretty much the whole universe. From time to time we need
to update some dependencies (Godeps, etcd, Go version). This can break us in
many ways, as has already happened a couple of times. We skipped one version
of Golang (1.5) precisely because it broke our performance. As this is being
written, we are working with the Golang team to try to understand why Golang
version 1.8 negatively affects Kubernetes performance.

If you are changing a large and important dependency, the only way to know
what performance impact it will have is to run the tests and check.

#### Where to look to get data?

If you want to check the impact of your changes there are a number of places to look.

- Density and load tests output quite a lot of data either to test logs, or to files
  inside `ReportDir`. Both include API call latencies, and density tests
  also include Pod e2e startup latency information.
- For resource usage you can either use monitoring tools (heapster + Grafana, but
  note that at the time of writing, this stops working at around 100 Nodes), or
  just plain `top` on the control plane (which scales as much as you want).
- More data is available on the `/metrics` endpoint of all our components
  (e.g. the one for the API server contains API call latencies).
- To profile a component, create an ssh tunnel to the machine running it,
  and run `go tool pprof localhost:<your_tunnel_port>` locally.

## Summary

To summarize, when writing code you should:

- understand how Golang manages memory and use it wisely,
- not List() from the API server,
- run performance tests when making large systemwide changes (e.g. updating big dependencies).

When designing new features or thinking about refactoring you should:

- estimate the number of additional QPS you will be sending to the API server when adding new API calls,
- make sure not to add any complex logic on a critical path of any basic workflow.

## Closing remarks

We know that thinking about the performance impact of changes is hard. This is
exactly why we want you to help us cater for it, by keeping all the knowledge
we have given you here in the back of your mind as you write your code.
In return, we will answer all your questions and doubts about the possible impact
of your changes if you post them either to the #sig-scalability Slack channel,
or cc @kubernetes/sig-scalability-pr-reviews in your PR/proposal.

---

<a name="1">1</a>: If you are using List() in tight loops, it is common to do so on a subset of a list (field, label, or namespace). Most Informers have indices on namespaces, but you may end up needing another index if profiling shows the need.

<a name="2">2</a>: We are working on adding new SLOs and improving the system to meet them.