# System throughput SLI/SLO details

## Definition
| Status | SLI | SLO |
|---|---|---|
| WIP | Time to start 30*#nodes pods, measured from test scenario start until observing last Pod as ready | Benchmark: when all images present on all Nodes, 99th percentile <= X minutes |
## User stories
- As a user, I want a guarantee that my workload of X pods can be started within a given time.
- As a user, I want to understand how quickly I can react to a dramatic change in workload profile when my workload exhibits very bursty behavior (e.g. a shop during a Black Friday sale).
- As a user, I want a guarantee of how quickly I can recreate the whole setup in case of a serious disaster that brings the whole cluster down.
## Test scenario
- Start with a healthy (all nodes ready, all cluster addons already running) cluster with N (>0) running pause pods per node.
- Create a number of `Namespaces` and a number of `Deployments` in each of them.
- All `Namespaces` should be isomorphic, possibly excluding the last one, which should run all pods that didn't fit in the previous ones.
- A single namespace should run 5000 `Pods` in the following configuration:
  - one big `Deployment` running ~1/3 of all `Pods` from this namespace
  - medium `Deployments`, each with 120 `Pods`, in total running ~1/3 of all `Pods` from this namespace
  - small `Deployments`, each with 10 `Pods`, in total running ~1/3 of all `Pods` from this namespace
- Each `Deployment` should be covered by a single `Service`.
- Each `Pod` in any `Deployment` contains two pause containers, one `Secret` other than the default `ServiceAccount` one, and one `ConfigMap`. Additionally, it has resource requests set and doesn't use any advanced scheduling features or init containers.
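For concreteness, one of the "small" `Deployments` matching the pod shape above might look roughly like the following (an illustrative sketch; all names, the namespace, and the request values are hypothetical, and the `Secret`/`ConfigMap` are shown mounted as volumes as one possible way for the pod to reference them):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-deployment-1   # hypothetical name
  namespace: test-ns-1       # hypothetical namespace
spec:
  replicas: 10               # "small" Deployments run 10 Pods each
  selector:
    matchLabels:
      app: small-deployment-1
  template:
    metadata:
      labels:
        app: small-deployment-1
    spec:
      containers:            # two pause containers
      - name: pause-1
        image: registry.k8s.io/pause:3.9
        resources:
          requests:          # resource requests are set (values hypothetical)
            cpu: 10m
            memory: 10Mi
      - name: pause-2
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: 10m
            memory: 10Mi
      volumes:               # one extra Secret and one ConfigMap per Pod
      - name: secret-volume
        secret:
          secretName: test-secret     # hypothetical
      - name: configmap-volume
        configMap:
          name: test-configmap        # hypothetical
```

A matching `Service` would select `app: small-deployment-1`, so that each `Deployment` is covered by a single `Service` as the scenario requires.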