Remove burst SLIs/SLOs

wojtekt 2019-06-01 21:41:21 +02:00
parent f042a6d212
commit f0c6e48b7b
2 changed files with 0 additions and 58 deletions


@@ -63,18 +63,6 @@ we will not provide any guarantees for users.
[Service Level Objectives]: https://en.wikipedia.org/wiki/Service_level_objective
[Service Level Agreement]: https://en.wikipedia.org/wiki/Service-level_agreement
## Types of SLOs
While SLIs are very generic and don't really depend on anything (they just
define what we measure and how), that is not the case for SLOs.
SLOs provide guarantees, and satisfying them may depend on meeting some
specific requirements.
As a result, we build our SLOs in a "you promise, we promise" format:
we provide you a guarantee only if you satisfy the requirements that we
put on you.
As a consequence, we introduce two types of SLOs.
### Steady state SLOs
### Steady state SLOs
@@ -87,12 +75,6 @@ We define the system to be in steady state when the cluster churn per second is <= 2
churn = #(Pod spec creations/updates/deletions) + #(user originated requests) in a given second
```
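To make the definition concrete, here is a minimal Go sketch of the churn
computation. The event counts and how they would be collected are assumptions
for illustration, not part of the SLO document itself:
```go
package main

import "fmt"

// churnPerSecond implements the formula above: Pod spec creations, updates
// and deletions, plus user-originated requests, all counted in one second.
func churnPerSecond(podCreations, podUpdates, podDeletions, userRequests int) int {
	return podCreations + podUpdates + podDeletions + userRequests
}

func main() {
	// Hypothetical second: 5 creations, 3 updates, 2 deletions, 8 user requests.
	churn := churnPerSecond(5, 3, 2, 8)
	fmt.Println("churn:", churn) // compare against the steady-state threshold above
}
```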
### Burst SLO
With burst SLOs, we provide guarantees on how the system behaves under heavy
load (when the user wants the system to do something as quickly as possible,
not caring too much about response time).
## Environment
In order to meet the SLOs, the system must run in an environment satisfying
@@ -145,12 +127,6 @@ sliding window. However, for the purposes of the SLO itself, it basically means
"fraction of good minutes per day" being within the threshold.
### Burst SLIs/SLOs
| Status | SLI | SLO | User stories, test scenarios, ... |
| --- | --- | --- | --- |
| WIP | Time to start 30\*#nodes pods, measured from test scenario start until observing last Pod as ready | Benchmark: when all images present on all Nodes, 99th percentile <= X minutes | [Details](./system_throughput.md) |
### Other SLIs
| Status | SLI | User stories, ... |


@@ -1,34 +0,0 @@
## System throughput SLI/SLO details
### Definition
| Status | SLI | SLO |
| --- | --- | --- |
| WIP | Time to start 30\*#nodes pods, measured from test scenario start until observing last Pod as ready | Benchmark: when all images present on all Nodes, 99th percentile <= X minutes |
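A minimal Go sketch of the quantities this SLI ties together. The
`waitForAllPodsReady` callback is a hypothetical stand-in for the real
observation logic (e.g. watching Pod status), not something this document
defines:
```go
package main

import (
	"fmt"
	"time"
)

// targetPods returns the pod count from the SLI definition: 30*#nodes.
func targetPods(nodes int) int {
	return 30 * nodes
}

// measureStartupLatency measures wall-clock time from test scenario start
// until the supplied waiter reports the last pod as ready. The 99th
// percentile of this value across runs is what the SLO bounds by X minutes.
func measureStartupLatency(nodes int, waitForAllPodsReady func(pods int)) time.Duration {
	start := time.Now()
	waitForAllPodsReady(targetPods(nodes))
	return time.Since(start)
}

func main() {
	// Hypothetical waiter; here it just simulates a delay.
	d := measureStartupLatency(100, func(pods int) {
		time.Sleep(10 * time.Millisecond)
		_ = pods
	})
	fmt.Println("time to start 3000 pods:", d)
}
```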
### User stories
- As a user, I want a guarantee that my workload of X pods can be started
within a given time
- As a user, I want to understand how quickly I can react to a dramatic
change in workload profile when my workload exhibits very bursty behavior
(e.g. a shop during a Black Friday sale)
- As a user, I want a guarantee of how quickly I can recreate the whole setup
in case of a serious disaster which brings the whole cluster down.
### Test scenario
- Start with a healthy (all nodes ready, all cluster addons already running)
cluster with N (>0) running pause pods per node.
- Create a number of `Namespaces` and a number of `Deployments` in each of them.
- All `Namespaces` should be isomorphic, possibly excluding the last one, which
should run all the pods that didn't fit in the previous ones.
- A single namespace should run 5000 `Pods` in the following configuration
(the arithmetic is worked out in the sketch after this list):
  - one big `Deployment` running ~1/3 of all `Pods` from this `Namespace`
  - medium `Deployments`, each with 120 `Pods`, in total running ~1/3 of all
  `Pods` from this `Namespace`
  - small `Deployments`, each with 10 `Pods`, in total running ~1/3 of all `Pods`
  from this `Namespace`
- Each `Deployment` should be covered by a single `Service`.
- Each `Pod` in any `Deployment` contains two pause containers, one `Secret`
(other than the default `ServiceAccount` one) and one `ConfigMap`. Additionally,
it has resource requests set and doesn't use any advanced scheduling features
or init containers.
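For illustration, a minimal Go sketch deriving the per-`Namespace`
`Deployment` counts implied by the split above. The integer rounding is an
assumption; the scenario only fixes the per-`Deployment` sizes and the rough
1/3 split:
```go
package main

import "fmt"

func main() {
	const podsPerNamespace = 5000
	third := podsPerNamespace / 3 // ~1666 Pods per size class

	bigDeploymentPods := third       // one big Deployment (~1/3 of all Pods)
	mediumDeployments := third / 120 // 13 medium Deployments of 120 Pods each
	smallDeployments := third / 10   // 166 small Deployments of 10 Pods each

	fmt.Println("big Deployment size:", bigDeploymentPods)
	fmt.Println("medium Deployments:", mediumDeployments)
	fmt.Println("small Deployments:", smallDeployments)
}
```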