# System throughput SLI/SLO details

## Definition
| Status | SLI | SLO |
|---|---|---|
| WIP | Time to start 30*#nodes pods, measured from test scenario start until observing last Pod as ready | Benchmark: when all images present on all Nodes, 99th percentile <= X minutes |
## User stories
- As a user, I want a guarantee that my workload of X pods can be started within a given time.
- As a user, I want to understand how quickly I can react to a dramatic change in workload profile when my workload exhibits very bursty behavior (e.g. a shop during a Black Friday sale).
- As a user, I want a guarantee of how quickly I can recreate the whole setup in case of a serious disaster that brings the whole cluster down.
## Test scenario
- Start with a healthy (all nodes ready, all cluster addons already running) cluster with N (>0) running pause pods per node.
- Create a number of `Namespaces` and a number of `Deployments` in each of them.
- All `Namespaces` should be isomorphic, possibly excluding the last one, which should run all pods that didn't fit in the previous ones.
- A single namespace should run 5000 `Pods` in the following configuration:
  - one big `Deployment` running ~1/3 of all `Pods` from this namespace
  - medium `Deployments`, each with 120 `Pods`, in total running ~1/3 of all `Pods` from this namespace
  - small `Deployments`, each with 10 `Pods`, in total running ~1/3 of all `Pods` from this namespace
- Each `Deployment` should be covered by a single `Service`.
- Each `Pod` in any `Deployment` contains two pause containers, one `Secret` other than the default `ServiceAccount` one, and one `ConfigMap`. Additionally, it has resource requests set and doesn't use any advanced scheduling features or init containers.
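For concreteness, one of the "small" `Deployments` matching the pod shape above might look roughly like the following (an illustrative sketch; all names, the namespace, and the request values are hypothetical, and the `Secret`/`ConfigMap` are shown mounted as volumes as one possible way for the pod to reference them):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-deployment-1   # hypothetical name
  namespace: test-ns-1       # hypothetical namespace
spec:
  replicas: 10               # "small" Deployments run 10 Pods each
  selector:
    matchLabels:
      app: small-deployment-1
  template:
    metadata:
      labels:
        app: small-deployment-1
    spec:
      containers:            # two pause containers
      - name: pause-1
        image: registry.k8s.io/pause:3.9
        resources:
          requests:          # resource requests are set (values hypothetical)
            cpu: 10m
            memory: 10Mi
      - name: pause-2
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 10m
            memory: 10Mi
      volumes:               # one extra Secret and one ConfigMap per Pod
      - name: secret-volume
        secret:
          secretName: test-secret     # hypothetical
      - name: configmap-volume
        configMap:
          name: test-configmap        # hypothetical
```

A matching `Service` would select `app: small-deployment-1`, so that each `Deployment` is covered by a single `Service` as the scenario requires.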