System throughput SLI/SLO details
User stories
- As a user, I want a guarantee that my workload of X pods can be started within a given time.
- As a user, I want to understand how quickly I can react to a dramatic change in workload profile when my workload exhibits very bursty behavior (e.g. a shop during a Black Friday sale).
- As a user, I want a guarantee of how quickly I can recreate the whole setup in case of a serious disaster which brings the whole cluster down.
Test scenario
- Start with a healthy (all nodes ready, all cluster addons already running) cluster with N (>0) running pause pods per node.
- Create a number of Namespaces and a number of Deployments in each of them.
- All Namespaces should be isomorphic, possibly excluding the last one, which should run all pods that didn't fit in the previous ones.
- A single Namespace should run 5000 Pods in the following configuration:
  - one big Deployment running ~1/3 of all Pods from this Namespace
  - medium Deployments, each with 120 Pods, in total running ~1/3 of all Pods from this Namespace
  - small Deployments, each with 10 Pods, in total running ~1/3 of all Pods from this Namespace
- Each Deployment should be covered by a single Service.
- Each Pod in any Deployment contains two pause containers, one Secret other than the default ServiceAccount and one ConfigMap. Additionally, it has resource requests set and doesn't use any advanced scheduling features or init containers.
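
The per-namespace split above leaves the exact Deployment counts open, since 5000 is not divisible by 3. The following is a minimal sketch of one way to turn the "~1/3 each" rule into concrete numbers. The function names and the rounding choice (round the medium and small counts to the nearest whole Deployment, then let the big Deployment absorb the difference) are assumptions made for illustration, not something the scenario prescribes.

```python
def namespace_layout(pods_per_namespace=5000, medium_size=120, small_size=10):
    """Split one namespace's pods into big/medium/small Deployments.

    Assumption: medium and small counts are rounded to the nearest whole
    Deployment, and the single big Deployment takes whatever is left so
    that the total is exactly pods_per_namespace.
    """
    third = pods_per_namespace / 3
    medium_count = round(third / medium_size)
    small_count = round(third / small_size)
    big_pods = (
        pods_per_namespace
        - medium_count * medium_size
        - small_count * small_size
    )
    return {
        "big_pods": big_pods,
        "medium_deployments": medium_count,
        "small_deployments": small_count,
        # One Service per Deployment, as the scenario requires.
        "services": 1 + medium_count + small_count,
    }


def namespaces_needed(total_pods, pods_per_namespace=5000):
    """Return (full_namespaces, leftover_pods).

    A non-zero leftover corresponds to the last, non-isomorphic namespace
    that runs the pods which did not fit in the previous ones.
    """
    return divmod(total_pods, pods_per_namespace)


if __name__ == "__main__":
    print(namespace_layout())
    print(namespaces_needed(12000))
```

Under these assumptions, a 5000-pod namespace ends up with one big Deployment of 1650 Pods, 14 medium Deployments (1680 Pods) and 167 small Deployments (1670 Pods), covered by 182 Services. A different rounding rule would shift a few Pods between the three groups while staying within the "~1/3" tolerance.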