Remove burst SLIs/SLOs
commit f0c6e48b7b (parent f042a6d212)
@@ -63,18 +63,6 @@ we will not provide any guarantees for users.

[Service Level Objectives]: https://en.wikipedia.org/wiki/Service_level_objective
[Service Level Agreement]: https://en.wikipedia.org/wiki/Service-level_agreement

## Types of SLOs

While SLIs are fairly generic and don't really depend on anything (they only
define what we measure and how), that is not the case for SLOs. SLOs provide
guarantees, and satisfying them may depend on meeting some specific
requirements.

As a result, we build our SLOs in a "you promise, we promise" format. That
means we provide you a guarantee only if you satisfy the requirements that we
put on you.

As a consequence, we introduce two types of SLOs.

### Steady state SLOs

@@ -87,12 +75,6 @@ We define system to be in steady state when the cluster churn per second is <= 2

```
churn = #(Pod spec creations/updates/deletions) + #(user originated requests) in a given second
```
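
To make the churn definition concrete, here is a minimal Go sketch (not part of the original doc) of evaluating the steady-state condition for a one-second bucket; the counter values and the threshold used in `main` are placeholders, not values from the SLO:

```go
package main

import "fmt"

// churnPerSecond computes the cluster churn for a single second, following
// the definition above: Pod spec creations/updates/deletions plus
// user-originated requests observed in that second.
func churnPerSecond(podSpecChanges, userRequests int) int {
	return podSpecChanges + userRequests
}

// inSteadyState reports whether a given second counts as steady state, i.e.
// whether its churn is within the threshold from the SLO definition.
func inSteadyState(churn, threshold int) bool {
	return churn <= threshold
}

func main() {
	// Hypothetical counts for one second: 12 Pod spec changes, 5 user requests.
	churn := churnPerSecond(12, 5)
	fmt.Println(inSteadyState(churn, 20)) // 20 is a placeholder threshold
}
```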

### Burst SLO

With burst SLOs, we provide guarantees on how the system behaves under heavy
load (when the user wants the system to do something as quickly as possible,
without caring too much about response time).

## Environment

In order to meet the SLOs, the system must run in an environment satisfying
@@ -145,12 +127,6 @@ sliding window. However, for the purpose of SLO itself, it basically means
"fraction of good minutes per day" being within threshold.

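As an illustration of the "fraction of good minutes per day" reading, here is a minimal sketch, assuming one boolean per minute that records whether that minute met the per-minute SLI; the toy data and the example threshold are assumptions:

```go
package main

import "fmt"

// goodMinuteFraction returns the fraction of "good" minutes in a day, where
// good[i] records whether minute i met the per-minute SLI threshold.
func goodMinuteFraction(good []bool) float64 {
	count := 0
	for _, ok := range good {
		if ok {
			count++
		}
	}
	return float64(count) / float64(len(good))
}

func main() {
	day := make([]bool, 1440) // one entry per minute of the day
	for i := range day {
		day[i] = i%100 != 0 // toy data: one bad minute in every hundred
	}
	// The SLO holds if this fraction stays within the agreed threshold, e.g. >= 0.99.
	fmt.Printf("good-minute fraction: %.4f\n", goodMinuteFraction(day))
}
```
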
### Burst SLIs/SLOs

| Status | SLI | SLO | User stories, test scenarios, ... |
| --- | --- | --- | --- |
| WIP | Time to start 30\*#nodes pods, measured from test scenario start until observing last Pod as ready | Benchmark: when all images present on all Nodes, 99th percentile <= X minutes | [Details](./system_throughput.md) |

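For a benchmark-style SLO like the one above, the 99th percentile would be taken over repeated runs. A minimal nearest-rank sketch with toy sample data (the real harness and the value of X are not specified here):

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of the measured
// durations using the nearest-rank method; real benchmark harnesses may use
// a different interpolation.
func percentile(durations []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), durations...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(float64(len(sorted))*p/100.0)) - 1
	if rank < 0 {
		rank = 0
	}
	return sorted[rank]
}

func main() {
	// Toy samples: per-run time from scenario start until the last Pod was ready.
	runs := []time.Duration{90 * time.Second, 2 * time.Minute, 3 * time.Minute, 4 * time.Minute}
	fmt.Println("p99 time to start all pods:", percentile(runs, 99))
}
```
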
### Other SLIs

| Status | SLI | User stories, ... |

@@ -1,34 +0,0 @@
## System throughput SLI/SLO details

### Definition

| Status | SLI | SLO |
| --- | --- | --- |
| WIP | Time to start 30\*#nodes pods, measured from test scenario start until observing last Pod as ready | Benchmark: when all images present on all Nodes, 99th percentile <= X minutes |

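As a sketch of how this SLI could be measured with client-go, the following polls until a target number of Pods is ready and reports the elapsed time since the scenario start. The namespace `test-ns`, the node count, and the use of polling (rather than a watch or informer, which a real harness would prefer) are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// podReady reports whether the Pod has the Ready condition set to True.
func podReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	const target = 30 * 100 // 30 * #nodes; 100 nodes is a placeholder
	start := time.Now()     // test scenario start

	// Poll until `target` Pods in the test namespace are ready.
	for {
		pods, err := client.CoreV1().Pods("test-ns").List(context.TODO(), metav1.ListOptions{})
		if err != nil {
			panic(err)
		}
		ready := 0
		for i := range pods.Items {
			if podReady(&pods.Items[i]) {
				ready++
			}
		}
		if ready >= target {
			break
		}
		time.Sleep(time.Second)
	}
	fmt.Println("time to start all pods:", time.Since(start))
}
```
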
### User stories
- As a user, I want a guarantee that my workload of X pods can be started
  within a given time.
- As a user, I want to understand how quickly I can react to a dramatic
  change in workload profile when my workload exhibits very bursty behavior
  (e.g. a shop during a Black Friday sale).
- As a user, I want a guarantee of how quickly I can recreate the whole setup
  in case of a serious disaster which brings the whole cluster down.

### Test scenario
- Start with a healthy (all nodes ready, all cluster addons already running)
  cluster with N (>0) running pause pods per node.
- Create a number of `Namespaces` and a number of `Deployments` in each of them.
- All `Namespaces` should be isomorphic, possibly excluding the last one, which
  should run all pods that didn't fit in the previous ones.
- A single namespace should run 5000 `Pods` in the following configuration
  (see the sizing sketch below):
  - one big `Deployment` running ~1/3 of all `Pods` from this `Namespace`
  - medium `Deployments`, each with 120 `Pods`, in total running ~1/3 of all
    `Pods` from this `Namespace`
  - small `Deployments`, each with 10 `Pods`, in total running ~1/3 of all
    `Pods` from this `Namespace`
- Each `Deployment` should be covered by a single `Service`.
- Each `Pod` in any `Deployment` contains two pause containers, one `Secret`
  (other than the default `ServiceAccount` one), and one `ConfigMap`.
  Additionally, it has resource requests set and doesn't use any advanced
  scheduling features or init containers.
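
A small sketch of one way the big/medium/small split above could be computed for the 5000-Pod namespace; the rounding and leftover-handling policy are assumptions, not part of the original scenario:

```go
package main

import "fmt"

// deploymentCounts splits the per-namespace Pod budget into the mix described
// above: one big Deployment with ~1/3 of the Pods, medium Deployments of 120
// Pods for the next ~1/3, and small Deployments of 10 Pods for the rest.
func deploymentCounts(podsPerNamespace int) (bigSize, mediumCount, smallCount int) {
	bigSize = podsPerNamespace / 3
	mediumCount = (podsPerNamespace / 3) / 120
	remaining := podsPerNamespace - bigSize - mediumCount*120
	smallCount = remaining / 10
	return bigSize, mediumCount, smallCount
}

func main() {
	big, medium, small := deploymentCounts(5000)
	fmt.Printf("1 big Deployment of %d Pods, %d medium Deployments, %d small Deployments\n",
		big, medium, small)
	// 5000 Pods -> 1x1666, 13x120 (=1560), 177x10 (=1770); 4 Pods are left
	// over, which a real harness would have to place somewhere.
}
```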