kube-state-metrics/docs/design/metrics-store-performance-o...

# Kube-State-Metrics - Performance Optimization Proposal


---

Author: Max Inden (IndenML@gmail.com)

Date: 23. July 2018

Target release: v1.5.0

---


## Glossary

- kube-state-metrics: “Simple service that listens to the Kubernetes API server
  and generates metrics about the state of the objects”

- Time series: A single line in a /metrics response e.g.
  “metric_name{label="value"} 1”


## Problem Statement

There has been repeated reports of two issues running kube-state-metrics on
production Kubernetes clusters. First kube-state-metrics takes a long time
(“10s - 20s”) to respond on its /metrics endpoint, leading to Prometheus
instances dropping the scrape interval request and marking the given time series
as stale. Second kube-state-metrics uses a lot of memory and thereby being
out-of-memory killed due to low set Kubernetes resource limits.


## Goal

The goal of this proposal can be split into the following sub-goals ordered by
their priority:

1. Decrease response time on /metrics endpoint

2. Decrease overall runtime memory usage


## Status Quo

Instead of requesting the needed information from the Kubernetes API-Server on
demand (on scrape), kube-state-metrics uses the Kubernetes client-go cache tool
to keep a full in memory representation of all Kubernetes objects of a given
cluster. Using the cache speeds up the performance critical path of replying to
a scrape request, and reduces the load on the Kubernetes API-Server by only
sending deltas whenever they occur. Kube-state-metrics does not make use of all
properties and sub-objects of these Kubernetes objects that it stores in its
cache.

On a scrape request by e.g. Prometheus on the /metrics endpoint
kube-state-metrics calculates the configured time series on demand based on the
objects in its cache and converts them to the Prometheus string representation.


## Proposal

Instead of a full representation of all Kubernetes objects with all its
properties in memory via the Kubernetes client-go cache, use a map, addressable
by the Kubernetes object uuid, containing all time series of that object as a
single multi-line string.

```
var cache = map[uuid][]byte{}
```

Kube-state-metrics listens on add, update and delete events via Kubernetes
client-go reflectors. On add and update events kube-state-metrics generates all
time series related to the Kubernetes object based on the event’s payload,
concatenates the time series to a single byte slice and sets / replaces the byte
slice in the store at the uuid of the Kubernetes object. One can precompute the
length of a time series byte slice before allocation as the sum of the length of
the metric name, label keys and values as well as the metric value in string
representation. On delete events kube-state-metrics deletes the uuid entry of
the given Kubernetes object in the cache map.

On a scrape request on the /metrics endpoint, kube-state-metrics iterates over
the cache map and concatenates all time series string blobs into a single
string, which is finally passed on as a response.

```
       +---------------+ +-----------+                 +---------------+         +-------------------+
       | pod_reflector | | pod_store |                 | pod_collector |         | metrics_endpoint  |
       +---------------+ +-----------+                 +---------------+         +-------------------+
-------------\ |               |                               |                           |
| new pod p1 |-|               |                               |                           |
|------------| |               |                               |                           |
               |               |                               |                           |
               | Add(p1)       |                               |                           |
               |-------------->|                               |                           |
               |               | ----------------------\       |                           |
               |               |-| generateMetrics(p1) |       |                           |
               |               | |---------------------|       |                           |
               |               |                               |                           |
               |           nil |                               |                           |
               |<--------------|                               |                           |
               |               |                               |                           | ---------------\
               |               |                               |                           |-| GET /metrics |
               |               |                               |                           | |--------------|
               |               |                               |                           |
               |               |                               |                 Collect() |
               |               |                               |<--------------------------|
               |               |                               |                           |
               |               |                      GetAll() |                           |
               |               |<------------------------------|                           |
               |               |                               |                           |
               |               | []string{metrics}             |                           |
               |               |------------------------------>|                           |
               |               |                               |                           |
               |               |                               | concat(metrics)           |
               |               |                               |-------------------------->|
               |               |                               |                           |

```

<details>
 <summary>Code to reproduce diagram</summary>

Build via [text-diagram](http://weidagang.github.io/text-diagram/)

```
object pod_reflector pod_store pod_collector metrics_endpoint

note left of pod_reflector: new pod p1
pod_reflector -> pod_store: Add(p1)
note right of pod_store: generateMetrics(p1)
pod_store -> pod_reflector: nil

note right of metrics_endpoint: GET /metrics
metrics_endpoint -> pod_collector: Collect()
pod_collector -> pod_store: GetAll()
pod_store -> pod_collector: []string{metrics}
pod_collector -> metrics_endpoint: concat(metrics)
```

</details>


## FAQ / Follow up improvements

- If kube-state-metrics only listens on add, update and delete events, how is it
  aware of already existing Kubernetes objects created before kube-state-metrics
  was started? Leveraging Kubernetes client-go, reflectors can initialize all
  existing objects before any add, update or delete events. To ensure no events
  are missed in the long run, periodic resyncs via Kubernetes client-go can be
  triggered. This extra confidence is not a must and should be compared to its
  costs, as Kubernetes client-go already gives decent guarantees on event
  delivery.

- What about metadata (HELP and description) in the /metrics output? As a first
  iteration they would be skipped until we have a better idea on the design.

- How can the cache map be concurrently accessed? The core golang map
  implementation is not thread-safe. As a first iteration a simple mutex should
  be sufficient. Golang's sync.Map might be considered.

- To solve the problem of out of order events send by the Kubernetes API-Server
  to kube-state-metrics, to each blob of time series inside the cache map it can
  keep the Kubernetes resource version. On add and update events, first compare
  the resource version of the event with than the resource version in the cache.
  Only move forward if the former is higher than the latter.

- In case the memory consumption of the time series string blobs is a problem
  the following optimization can be considered: Among the time series strings,
  multiple sub-strings will be heavily duplicated like the metric name. Instead
  of saving unstructured strings inside the cache map, one can structure them,
  using pointers to deduplicate e.g. metric names.

- ...

- Kube-state-metrics does not make use of all properties of all Kubernetes
  objects. Instead of unmarshalling unused properties, their json struct tags or
  their Protobuf representation could be removed.