From 1fbbfc8a3d0096f8bb997ee70ef4d37c6a43dc2d Mon Sep 17 00:00:00 2001
From: Wojciech Tyczynski
Date: Mon, 7 Aug 2017 15:17:54 +0200
Subject: [PATCH] API-machinery SLIs

---
 sig-scalability/slis/apimachinery_slis.md | 196 ++++++++++++++++++
 .../{slo => slos}/throughput_burst_slo.md |   0
 2 files changed, 196 insertions(+)
 create mode 100644 sig-scalability/slis/apimachinery_slis.md
 rename sig-scalability/{slo => slos}/throughput_burst_slo.md (100%)

diff --git a/sig-scalability/slis/apimachinery_slis.md b/sig-scalability/slis/apimachinery_slis.md
new file mode 100644
index 000000000..512548ee8
--- /dev/null
+++ b/sig-scalability/slis/apimachinery_slis.md
@@ -0,0 +1,196 @@
# API-machinery SLIs and SLOs

The document was converted from [Google Doc]. Please refer to the original for
extended commentary and discussion.

## Background

Scalability is an important aspect of Kubernetes. However, Kubernetes is
such a large system that we need to manage users' expectations in this area.
To achieve that, we are in the process of redefining what it means for
Kubernetes to support X-node clusters - this doc describes the high-level
proposal. In this doc we describe the API-machinery-related SLIs we would
like to introduce and suggest which of them should eventually have a
corresponding SLO replacing the current "99% of API calls return in under 1s"
one.

The SLOs we are proposing in this doc are our goal - they may not currently be
satisfied. As a result, while in the future we would like to block the release
when we are violating SLOs, we first need to understand where exactly we are
now, define and implement proper tests, and potentially improve the system.
Only once this is done may we try to introduce a policy of blocking the
release on SLO violations. But this is out of scope for this doc.


## SLIs and SLOs proposal

Below we introduce all SLIs and SLOs we would like to have in the api-machinery
area. Some of them are not easy for users to understand, as they are designed
for developers or for performance tracking of higher-level, user-understandable
SLOs. The user-oriented ones (which we want to announce publicly) are
additionally highlighted in bold.

### Prerequisite

The Kubernetes cluster is available and serving.

### Latency[1](#footnote1) of API calls for single objects

__***SLI1: Non-streaming API calls for single objects (POST, PUT, PATCH, DELETE,
GET) latency for every (resource, verb) pair, measured as 99th percentile over
the last 5 minutes***__

__***SLI2: 99th percentile for (resource, verb) pairs \[excluding virtual and
aggregated resources and Custom Resource Definitions\] combined***__

__***SLO: In a default Kubernetes installation, 99th percentile of SLI2
per cluster-day[2](#footnote2) <= 1s***__
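
To make the measurement concrete, below is a minimal, hypothetical sketch of
computing SLI1 for a single (resource, verb) pair and comparing it against the
1s target. The `window`, `record`, and `p99` names, the in-memory sample
buffer, and the nearest-rank percentile are illustrative assumptions only; they
do not correspond to any existing Kubernetes code or metric.

```go
// Hypothetical sketch of SLI1: 99th percentile of request latency for one
// (resource, verb) pair over a trailing 5-minute window.
package main

import (
	"fmt"
	"sort"
	"time"
)

type sample struct {
	when    time.Time
	latency time.Duration
}

// window accumulates observed request latencies for one (resource, verb) pair.
type window struct {
	samples []sample
}

// record adds one observed request latency.
func (w *window) record(t time.Time, d time.Duration) {
	w.samples = append(w.samples, sample{when: t, latency: d})
}

// p99 returns the 99th percentile of latencies observed in the last 5 minutes.
func (w *window) p99(now time.Time) time.Duration {
	cutoff := now.Add(-5 * time.Minute)
	var recent []time.Duration
	for _, s := range w.samples {
		if s.when.After(cutoff) {
			recent = append(recent, s.latency)
		}
	}
	if len(recent) == 0 {
		return 0
	}
	sort.Slice(recent, func(i, j int) bool { return recent[i] < recent[j] })
	idx := int(float64(len(recent)) * 0.99) // nearest-rank style index
	if idx >= len(recent) {
		idx = len(recent) - 1
	}
	return recent[idx]
}

func main() {
	// SLI1: one window per (resource, verb) pair, e.g. ("pods", "GET").
	perPair := map[[2]string]*window{}
	key := [2]string{"pods", "GET"}
	perPair[key] = &window{}

	now := time.Now()
	perPair[key].record(now, 120*time.Millisecond)
	perPair[key].record(now, 300*time.Millisecond)

	p := perPair[key].p99(now)
	fmt.Println("p99 over last 5m:", p, "meets 1s target:", p <= time.Second)
}
```

SLI2 would be the same percentile computed over the samples from all
(resource, verb) pairs combined (excluding virtual and aggregated resources and
Custom Resource Definitions), and the SLO is evaluated per cluster-day as
described in footnote 2.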

User stories:
- As a user of vanilla Kubernetes, I want some guarantee of how quickly I get a
response from an API call.
- As an administrator of a Kubernetes cluster, if I know the characteristics of
my external apiserver dependencies (e.g. custom admission plugins, webhooks and
initializers), I want to be able to provide API call latency guarantees to
users of my cluster.

Background:
- We obviously can't give any guarantee in general, because cluster
administrators are allowed to register custom admission plugins, webhooks
and/or initializers, which we don't have any control over and which obviously
impact API call latencies.
- As a result, we define the SLIs to be very generic (no matter how your
cluster is set up), but we provide an SLO only for default installations (where
we have control over what the apiserver is doing). This avoids creating the
false impression that we provide a guarantee no matter how the cluster is set
up and what is installed on top of it.
- At the same time, API calls are part of pretty much every non-trivial workflow
in Kubernetes, so this metric is a building block for less trivial SLIs and
SLOs.

Other notes:
- The SLO has to be satisfied independently of the encoding used. This
makes the mix of clients important while testing. However, we assume that all
`core` components communicate with the apiserver using protocol buffers
(otherwise the SLO doesn't have to be satisfied).
- In the case of GET requests, the user has the option to opt in to accepting
potentially stale data (the request is then served from the cache without
hitting the underlying storage). However, the SLO has to be satisfied even if
all requests ask for up-to-date data, which again makes a careful choice of
requests important while testing.


### Latency of API calls for multiple objects

__***SLI1: Non-streaming API calls for multiple objects (LIST) latency for
every (resource, verb) pair, measured as 99th percentile over the last 5
minutes***__

__***SLI2: 99th percentile for (resource, verb) pairs [excluding virtual and
aggregated resources and Custom Resource Definitions] combined***__

__***SLO1: In a default Kubernetes installation, 99th percentile of SLI2 per
cluster-day***__
- __***is <= 1s if the total number of objects of the same type as the resource
in the system <= X***__
- __***is <= 5s if the total number of objects of the same type as the resource
in the system <= Y***__
- __***is <= 30s if the total number of objects of the same type as the resource
in the system <= Z***__

User stories:
- As a user of vanilla Kubernetes, I want some guarantee of how quickly I get a
response from an API call.
- As an administrator of a Kubernetes cluster, if I know the characteristics of
my external apiserver dependencies (e.g. custom admission plugins, webhooks and
initializers), I want to be able to provide API call latency guarantees to
users of my cluster.

Background:
- On top of the arguments from latency of API calls for single objects, LIST
operations are a crucial part of watch-related frameworks, which in turn are
responsible for overall system performance and responsiveness.
- The above SLO is user-oriented and may have a significant buffer in its
thresholds. In fact, the latency of a request should be proportional to the
amount of work to do (which in our case is the number of objects of a given
type, potentially restricted to the requested namespace if specified) plus some
constant overhead. For better tracking of performance, we define the following
SLIs, which are supposed to be purely internal (developer-oriented).


_SLI3: Non-streaming API calls for multiple objects (LIST) latency minus 1s
(maxed with 0) divided by the number of objects in the collection
[3](#footnote3) (which may be many more than the number of returned
objects) for every (resource, verb) pair, measured as 99th percentile over the
last 5 minutes._

_SLI4: 99th percentile for (resource, verb) pairs [excluding virtual and
aggregated resources and Custom Resource Definitions] combined_

_SLO2: In a default Kubernetes installation, 99th percentile of SLI4 per
cluster-day <= Xms_
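
To make the SLI3 normalization concrete, here is a minimal, hypothetical sketch
of the per-object latency it describes. The function name, its signature, and
the use of a flat 1s constant-overhead term are illustrative only, derived from
the definition above; they do not correspond to any existing implementation.

```go
// Hypothetical illustration of the SLI3 normalization described above.
package main

import (
	"fmt"
	"time"
)

// normalizedListLatency computes the per-object latency used by SLI3:
// the observed LIST latency minus a 1s constant overhead (floored at 0),
// divided by the number of objects in the collection (see footnote 3),
// which may be much larger than the number of objects actually returned.
func normalizedListLatency(observed time.Duration, collectionSize int) time.Duration {
	overBudget := observed - time.Second
	if overBudget < 0 {
		overBudget = 0
	}
	if collectionSize == 0 {
		return 0
	}
	return overBudget / time.Duration(collectionSize)
}

func main() {
	// A 4s LIST over a collection of 30000 objects contributes 100µs per object.
	fmt.Println(normalizedListLatency(4*time.Second, 30000))
}
```

Expressing the latency as a cost per object is what allows SLO2 to use a single
threshold (Xms) regardless of collection size.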


### Watch latency

_SLI1: API-machinery watch latency (measured from the moment when an object is
stored in the database to when it's ready to be sent to all watchers), measured
as 99th percentile over the last 5 minutes_

_SLO1 (developer-oriented): 99th percentile of SLI1 per cluster-day <= Xms_

User stories:
- As an administrator, if the system is slow, I would like to know whether the
root cause is slow api-machinery or something further down the path (lack of
network bandwidth, slow or cpu-starved controllers, ...).

Background:
- Pretty much all control loops in Kubernetes are watch-based, so a slow watch
means a slow system in general. As a result, we want to give some guarantees on
how fast it is.
- Note that the way we measure it silently assumes no clock skew in the case of
HA clusters.


### Admission plugin latency

_SLI1: Admission latency for each admission plugin type, measured as 99th
percentile over the last 5 minutes_

User stories:
- As an administrator, if API calls are slow, I would like to know whether this
is because of slow admission plugins and, if so, which ones are responsible.


### Webhook latency

_SLI1: Webhook call latency for each webhook type, measured as 99th percentile
over the last 5 minutes_

User stories:
- As an administrator, if API calls are slow, I would like to know whether this
is because of slow webhooks and, if so, which ones are responsible.


### Initializer latency

_SLI1: Initializer latency for each initializer, measured as 99th percentile
over the last 5 minutes_

User stories:
- As an administrator, if API calls are slow, I would like to know whether this
is because of slow initializers and, if so, which ones are responsible.

---
\[1\] By latency of an API call in this doc we mean the time from the moment
when the apiserver receives the request to when the last byte of the response
is sent to the user.

\[2\] For the purpose of visualization it will be a sliding window. However,
for the purpose of reporting the SLO, it means one point per day (whether the
SLO was satisfied on a given day or not).

\[3\] A collection contains: (a) all objects of that type for cluster-scoped
resources, (b) all objects of that type in a given namespace for
namespace-scoped resources.


[Google Doc]: https://docs.google.com/document/d/1Q5qxdeBPgTTIXZxdsFILg7kgqWhvOwY8uROEf0j5YBw/edit#

diff --git a/sig-scalability/slo/throughput_burst_slo.md b/sig-scalability/slos/throughput_burst_slo.md
similarity index 100%
rename from sig-scalability/slo/throughput_burst_slo.md
rename to sig-scalability/slos/throughput_burst_slo.md