## System throughput SLI/SLO details

### Definition

| Status | SLI | SLO |
| --- | --- | --- |
| WIP | Time to start 30\*#nodes pods, measured from test scenario start until observing the last `Pod` as ready | Benchmark: when all images are already present on all `Nodes`, 99th percentile <= X minutes |

### User stories
- As a user, I want a guarantee that my workload of X pods can be started within a given time.
- As a user, I want to understand how quickly I can react to a dramatic change in workload profile when my workload exhibits very bursty behavior (e.g. a shop during a Black Friday sale).
- As a user, I want a guarantee of how quickly I can recreate the whole setup in case of a serious disaster that brings the whole cluster down.

### Test scenario
- Start with a healthy cluster (all nodes ready, all cluster addons already running) with N (> 0) running pause pods per node.
- Create a number of `Namespaces` and a number of `Deployments` in each of them.
- All `Namespaces` should be isomorphic, except possibly the last one, which should run all pods that didn't fit in the previous ones.
- A single namespace should run 5000 `Pods` in the following configuration:
  - one big `Deployment` running ~1/3 of all `Pods` from this `Namespace`
  - medium `Deployments`, each with 120 `Pods`, in total running ~1/3 of all `Pods` from this `Namespace`
  - small `Deployments`, each with 10 `Pods`, in total running ~1/3 of all `Pods` from this `Namespace`
- Each `Deployment` should be covered by a single `Service`.
- Each `Pod` in any `Deployment` contains two pause containers and references one `Secret` (other than the default `ServiceAccount` token) and one `ConfigMap`. Additionally, it has resource requests set and doesn't use any advanced scheduling features or init containers. A sketch of such a `Deployment` and its `Service` follows below.
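To make the pod shape above concrete: with 5000 `Pods` per namespace split into thirds, a namespace would hold roughly one big `Deployment` of ~1666 `Pods`, ~14 medium `Deployments` of 120 `Pods`, and ~167 small `Deployments` of 10 `Pods`. Below is a minimal sketch of one small `Deployment` plus its covering `Service`. All names, the namespace, the pause image tag, the request values, and the service port are illustrative assumptions, not part of the SLO definition; only the replica count, the two pause containers, the `Secret`/`ConfigMap` references, and the presence of resource requests come from the scenario above.

```yaml
# Illustrative sketch only: object names, namespace, image tag, request
# values, and port are assumptions; the scenario does not prescribe them.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-deployment-1        # hypothetical name
  namespace: test-namespace-1     # hypothetical namespace
spec:
  replicas: 10                    # "small" Deployment size from the scenario
  selector:
    matchLabels:
      app: small-deployment-1
  template:
    metadata:
      labels:
        app: small-deployment-1
    spec:
      containers:
      # Two pause containers per Pod, as required by the scenario.
      - name: pause-1
        image: registry.k8s.io/pause:3.9    # assumed pause image/tag
        resources:
          requests:                          # requests are set; values assumed
            cpu: 10m
            memory: 10Mi
        volumeMounts:
        - name: secret-volume
          mountPath: /etc/secret
        - name: configmap-volume
          mountPath: /etc/config
      - name: pause-2
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 10m
            memory: 10Mi
      volumes:
      # One Secret other than the default ServiceAccount token,
      # and one ConfigMap, as required by the scenario.
      - name: secret-volume
        secret:
          secretName: small-deployment-1-secret      # hypothetical Secret
      - name: configmap-volume
        configMap:
          name: small-deployment-1-configmap         # hypothetical ConfigMap
---
# Each Deployment is covered by a single Service.
apiVersion: v1
kind: Service
metadata:
  name: small-deployment-1
  namespace: test-namespace-1
spec:
  selector:
    app: small-deployment-1
  ports:
  - port: 80                      # assumed port; pause containers serve nothing
```

The medium and big `Deployments` would differ only in `replicas` (120 and ~1666, respectively) and in their names; keeping the pod template identical across sizes is what makes the namespaces isomorphic.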