API-machinery SLIs and SLOs

The document was converted from a Google Doc. Please refer to the original for extended commentary and discussion.

Background

Scalability is an important aspect of Kubernetes. However, Kubernetes is such a large system that we need to manage users' expectations in this area. To achieve that, we are in the process of redefining what it means for Kubernetes to support X-node clusters; the high-level proposal is described in a separate doc. In this doc we describe the API-machinery-related SLIs we would like to introduce and suggest which of them should eventually get a corresponding SLO, replacing the current "99% of API calls return in under 1s" one.

The SLOs we are proposing in this doc are our goal - they may not be satisfied today. As a result, while in the future we would like to block a release when SLOs are violated, we first need to understand where exactly we are now, define and implement proper tests, and potentially improve the system. Only once this is done may we try to introduce a policy of blocking releases on SLO violations. That, however, is out of scope for this doc.

SLIs and SLOs proposal

Below we introduce all SLIs and SLOs we would like to have in the api-machinery area. Many of them are not easy for users to understand, as they are designed for developers or for tracking the performance behind higher-level, user-understandable SLOs. The user-oriented ones (which we want to announce publicly) are additionally highlighted in bold.

Prerequisite

The Kubernetes cluster is available and serving.

Latency[1] of API calls for single objects

SLI1: Latency of non-streaming API calls for single objects (POST, PUT, PATCH, DELETE, GET) for every (resource, verb) pair, measured as the 99th percentile over the last 5 minutes

SLI2: 99th percentile for (resource, verb) pairs [excluding virtual and aggregated resources and Custom Resource Definitions] combined

SLO: In a default Kubernetes installation, 99th percentile of SLI2 per cluster-day[2] <= 1s
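
To make the per-cluster-day aggregation concrete, here is a minimal Go sketch of how such an SLO could be evaluated. The exact aggregation is our reading of footnote [2], and sloCompliant is a hypothetical helper, not part of any Kubernetes component.

```go
package example

import (
	"sort"
	"time"
)

// sloCompliant evaluates one cluster-day: given one SLI2 sample per 5-minute
// window (each already a 99th percentile across requests), it takes the 99th
// percentile of those samples and compares it against the target (1s here).
// This is a sketch of the aggregation described in footnote [2], not an
// official implementation.
func sloCompliant(samples []time.Duration, target time.Duration) bool {
	if len(samples) == 0 {
		return true // no data for the day; treated as vacuously compliant in this sketch
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := (99*len(sorted) + 99) / 100 // ceil(0.99 * n)
	return sorted[idx-1] <= target
}
```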

User stories:

  • As a user of vanilla Kubernetes, I want some guarantee of how quickly I will get a response to an API call.
  • As an administrator of a Kubernetes cluster, if I know the characteristics of the apiserver's external dependencies (e.g. custom admission plugins, webhooks and initializers), I want to be able to provide guarantees on API call latency to users of my cluster.

Background:

  • We obviously can't give any guarantee in general, because cluster administrators are allowed to register custom admission plugins, webhooks and/or initializers, which we don't have any control over and which obviously impact API call latencies.
  • As a result, we define the SLIs to be very generic (no matter how your cluster is set up), but we provide an SLO only for default installations (where we have control over what the apiserver is doing). This avoids giving the false impression that we provide a guarantee no matter how the cluster is set up and what is installed on top of it.
  • At the same time, API calls are part of pretty much every non-trivial workflow in Kubernetes, so this metric is a building block for less trivial SLIs and SLOs.

Other notes:

  • The SLO has to be satisfied independently of the encoding used. This makes the mix of clients important while testing. However, we assume that all core components communicate with the apiserver using protocol buffers (otherwise the SLO doesn't have to be satisfied).
  • In the case of GET requests, the user has the option to opt in to accepting potentially stale data (the request is then served from the cache and does not hit the underlying storage; see the sketch below). However, the SLO has to be satisfied even if all requests ask for up-to-date data, which again makes a careful choice of requests important while testing.
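
As an illustration of that opt-in, below is a minimal client-go sketch (assuming a recent client-go; getPossiblyStale is a hypothetical helper): setting ResourceVersion to "0" tells the apiserver that any, possibly stale, version is acceptable, while leaving it empty asks for up-to-date data.

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getPossiblyStale reads a ConfigMap while accepting potentially stale data:
// ResourceVersion "0" lets the apiserver serve the object from its cache
// instead of hitting the underlying storage. Leaving ResourceVersion empty
// (the default GetOptions) requests up-to-date data, which the SLO above
// still has to cover.
func getPossiblyStale(ctx context.Context, c kubernetes.Interface, namespace, name string) (*corev1.ConfigMap, error) {
	return c.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{ResourceVersion: "0"})
}
```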

Latency of API calls for multiple objects

SLI1: Latency of non-streaming API calls for multiple objects (LIST) for every (resource, verb) pair, measured as the 99th percentile over the last 5 minutes

SLI2: 99th percentile for (resource, verb) pairs [excluding virtual and aggregated resources and Custom Resource Definitions] combined

SLO1: In a default Kubernetes installation, 99th percentile of SLI2 per cluster-day (a sketch of the kind of LIST request covered follows this list):

  • is <= 1s if total number of objects of the same type as the resource in the system <= X
  • is <= 5s if total number of objects of the same type as the resource in the system <= Y
  • is <= 30s if total number of objects of the same type as the resource in the system <= Z
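
As a point of reference, the sketch below (assuming a recent client-go; listPods is a hypothetical helper) shows the kind of non-streaming LIST request this SLO covers. Even when a label selector filters the response, the apiserver's work depends on the size of the whole collection, not just on the number of objects returned (see SLI3 below).

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPods issues a non-streaming LIST of Pods in a single namespace, i.e. a
// namespace-scoped collection in the sense of footnote [3]. The optional
// label selector reduces what is returned, but not the size of the collection
// the apiserver has to go through.
func listPods(ctx context.Context, c kubernetes.Interface, namespace, selector string) (*corev1.PodList, error) {
	return c.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
}
```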

User stories:

  • As a user of vanilla Kubernetes, I want some guarantee of how quickly I will get a response to an API call.
  • As an administrator of a Kubernetes cluster, if I know the characteristics of the apiserver's external dependencies (e.g. custom admission plugins, webhooks and initializers), I want to be able to provide guarantees on API call latency to users of my cluster.

Background:

  • In addition to the arguments from the latency of API calls for single objects, LIST operations are a crucial part of watch-related frameworks, which in turn are responsible for overall system performance and responsiveness.
  • The above SLO is user-oriented and may have a significant buffer in its thresholds. In fact, the latency of a request should be proportional to the amount of work to do (which in our case is the number of objects of a given type, potentially restricted to a requested namespace) plus some constant overhead. For better performance tracking, we define the following SLIs, which are supposed to be purely internal (developer-oriented).

SLI3: Latency of non-streaming API calls for multiple objects (LIST), minus 1s (floored at 0), divided by the number of objects in the collection[3] (which may be many more than the number of returned objects), for every (resource, verb) pair, measured as the 99th percentile over the last 5 minutes.

SLI4: 99th percentile for (resource, verb) pairs [excluding virtual and aggregated resources and Custom Resource Definitions] combined

SLO2: In a default Kubernetes installation, 99th percentile of SLI4 per cluster-day <= Xms
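
For clarity, the per-request quantity behind SLI3 can be written down as a small Go sketch (sli3Sample is a hypothetical helper, not part of any Kubernetes component):

```go
package example

import "time"

// sli3Sample computes the per-request value aggregated by SLI3: the LIST
// latency above a fixed 1s overhead, floored at zero and divided by the
// number of objects in the collection (footnote [3]), expressed here in
// milliseconds per object to match the "Xms" threshold in SLO2.
func sli3Sample(latency time.Duration, objectsInCollection int) float64 {
	if objectsInCollection <= 0 {
		return 0 // empty collection: no per-object work to normalize by
	}
	overhead := latency - time.Second
	if overhead < 0 {
		overhead = 0
	}
	return float64(overhead.Milliseconds()) / float64(objectsInCollection)
}
```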

Watch latency

SLI1: API-machinery watch latency (measured from the moment when the object is stored in the database to when it is ready to be sent to all watchers), measured as the 99th percentile over the last 5 minutes

SLO1 (developer-oriented): 99th percentile of SLI1 per cluster-day <= Xms

User stories:

  • As an administrator, if the system is slow, I would like to know whether the root cause is slow api-machinery or something further down the path (lack of network bandwidth, slow or CPU-starved controllers, ...).

Background:

  • Pretty much all control loops in Kubernetes are watch-based (see the sketch below), so a slow watch means a slow system in general. As a result, we want to give some guarantees on how fast it is.
  • Note that the way we measure it silently assumes no clock skew in the case of HA clusters.
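
To illustrate what a "watcher" is here, below is a minimal client-go sketch of the watch-based pattern used by control loops (assuming a recent client-go; watchPods is a hypothetical helper, not the SLI measurement itself, which happens inside the apiserver).

```go
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchPods consumes a stream of change events for Pods in all namespaces.
// Control loops rely on streams like this instead of re-LISTing, so any delay
// between an object being stored and its event being delivered slows the
// whole control loop down.
func watchPods(ctx context.Context, c kubernetes.Interface) error {
	w, err := c.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		// event.Type is Added/Modified/Deleted; event.Object is the changed Pod.
		fmt.Printf("%s %T\n", event.Type, event.Object)
	}
	return nil
}
```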

Admission plugin latency

SLI1: Admission latency for each admission plugin type, measured as the 99th percentile over the last 5 minutes

User stories:

  • As an administrator, if API calls are slow, I would like to know whether this is because of slow admission plugins and, if so, which ones are responsible.

Webhook latency

SLI1: Webhook call latency for each webhook type, measured as the 99th percentile over the last 5 minutes

User stories:

  • As an administrator, if API calls are slow, I would like to know whether this is because of slow webhooks and, if so, which ones are responsible.

Initializer latency

SLI1: Initializer latency for each initializer, measured as the 99th percentile over the last 5 minutes

User stories:

  • As an administrator, if API calls are slow, I would like to know whether this is because of slow initializers and, if so, which ones are responsible.

[1] By the latency of an API call in this doc we mean the time from the moment the apiserver receives the request to the last byte of the response being sent to the user.

[2] For the purpose of visualization it will be a sliding window. However, for the purpose of reporting the SLO, it means one point per day (whether the SLO was satisfied on a given day or not).

[3] A collection contains: (a) all objects of that type for cluster-scoped resources, (b) all objects of that type in a given namespace for namespace-scoped resources.