Merge pull request #3242 from wojtek-t/statful_pod_startup_slo

Introduce pod startup SLI for stateful pods.
2019-02-28 06:39:07 -08:00 · 2019-02-28 06:39:07 -08:00 · fbc86e066d
parent 29e5d97ccc ab1edf266b
commit fbc86e066d
2 changed files with 38 additions and 10 deletions
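
The stateless/stateful split that this change introduces (footnotes 2 and 3 in `pod_startup_latency.md`) can be sketched as a small classifier. This is only an illustration, not code from the PR: the function name `is_stateful` and the dict-shaped pod are hypothetical, the key names are assumed to follow the Pod spec field names, and every declared volume is assumed to be mounted. The sketch reads the footnotes literally, so any source type outside the four listed ones (including `projected`) counts as stateful.

```python
# Volume source keys that the footnotes treat as "stateless" sources
# (assumed to match the Pod spec field names).
STATELESS_VOLUME_SOURCES = {"secret", "configMap", "downwardAPI", "emptyDir"}


def is_stateful(pod: dict) -> bool:
    """Return True if the pod declares at least one volume whose source is
    something other than secret, configMap, downwardAPI or emptyDir.

    Assumption: every declared volume is mounted by some container.
    """
    volumes = pod.get("spec", {}).get("volumes") or []
    for volume in volumes:
        # Every key except "name" identifies the volume source.
        sources = set(volume) - {"name"}
        if not sources <= STATELESS_VOLUME_SOURCES:
            return True
    return False


# Hypothetical pods, used only for illustration.
stateless_pod = {
    "spec": {
        "volumes": [
            {"name": "cfg", "configMap": {"name": "app-config"}},
            {"name": "scratch", "emptyDir": {}},
        ]
    }
}
stateful_pod = {
    "spec": {
        "volumes": [
            {"name": "data", "persistentVolumeClaim": {"claimName": "data-0"}},
        ]
    }
}

print(is_stateful(stateless_pod), is_stateful(stateful_pod))
```

A pod with no volumes at all is classified as stateless, which matches footnote 2.
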
--- a/sig-scalability/slos/pod_startup_latency.md
+++ b/sig-scalability/slos/pod_startup_latency.md
@@ -4,29 +4,48 @@

 | Status | SLI | SLO |
 | --- | --- | --- |
-| __Official__ | Startup latency of stateless<sup>[1](#footnote1)</sup> and schedulable<sup>[2](#footnote2)</sup> pods, excluding time to pull images and run init containers, measured from pod creation timestamp to when all its containers are reported as started and observed via watch, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile per cluster-day <= 5s |
+| __Official__ | Startup latency of schedulable<sup>[1](#footnote1)</sup> stateless<sup>[2](#footnote2)</sup> pods, excluding time to pull images and run init containers, measured from pod creation timestamp to when all its containers are reported as started and observed via watch, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile per cluster-day <= 5s |
+| __WIP__ | Startup latency of schedulable<sup>[1](#footnote1)</sup> stateful<sup>[3](#footnote3)</sup> pods, excluding time to pull images, run init containers, provision volumes (in delayed binding mode) and unmount/detach volumes (from previous pod if needed), measured from pod creation timestamp to when all its containers are reported as started and observed via watch, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile per cluster-day <= X where X depends on storage provider |

-<a name="footnote1">[1\]</a>A `stateless pod` is defined as a pod that doesn't
+<a name="footnote1">[1\]</a>By schedulable pod we mean a pod that can be
+scheduled in the cluster without causing any preemption.
+
+<a name="footnote2">[2\]</a>A `stateless pod` is defined as a pod that doesn't
 mount volumes with sources other than secrets, config maps, downward API and
 empty dir.

-<a name="footnote2">[2\]</a>By schedulable pod we mean a pod that can be
-scheduled in the cluster without causing any preemption.
+<a name="footnote3">[3\]</a>A `stateful pod` is defined as a pod that mounts
+at least one volume with sources other than secrets, config maps, downward API
+and empty dir.

 ### User stories
 - As a user of vanilla Kubernetes, I want some guarantee how quickly my pods
 will be started.

 ### Other notes
- Only schedulable and stateless pods contribute to the SLI:
+- Only schedulable pods contribute to the SLIs:
  - If there is no space in the cluster to place the pod, there is not much
    we can do about it (it is task for Cluster Autoscaler which should have
    separate SLIs/SLOs).
  - If placing a pod requires preempting other pods, that may heavily depend
    on the application (e.g. on their graceful termination period). We don't
    want that to contribute to this SLI.
-  - Mounting disks required by non-stateless pods may potentially also require
-    non-negligible time, not fully dependent on Kubernetes.
+- We are explicitly splitting stateless and stateful pods from each other:
+  - Starting a stateful pod requires attaching and mounting volumes, which
+    takes a non-negligible amount of time and doesn't fully depend on
+    Kubernetes. However, even though it depends on the chosen storage
+    provider, it isn't application-specific, so we make it part of the SLI
+    (though the exact SLO threshold may depend on the chosen storage provider).
+  - We explicitly exclude time to provision a volume (in delayed volume
+    binding mode), even though that also depends only on the storage provider,
+    not on the application itself. Volume provisioning can be
+    perceived as a bootstrapping operation in the lifetime of a stateful
+    workload, done only once at the beginning, not every time a pod is
+    created. As a result, we decided to exclude it.
+  - We also explicitly exclude time to unmount and detach the volume (if it
+    was previously mounted to a different pod). This situation is symmetric to
+    excluding pods that need to preempt others (it's a kind of cleaning up
+    after predecessors).
 - We are explicitly excluding image pulling from time the SLI. This is
 because it highly depends on locality of the image, image registry performance
 characteristic (e.g. throughput), image size itself, etc. Since we have
@@ -58,10 +77,18 @@ reported as started and observed via watch", because:
    can potentially be fired).

 ### TODOs
- We should try to provide guarantees for non-stateless pods (the threshold
-may be higher for them though).
 - Revisit whether we want "watch pod status" part to be included in the SLI.
+- Consider creating an SLI for pod deletion latency, given that for stateful
+pods, detaching an RWO volume is required before it can be attached to a
+different node (i.e. slow pod deletion may block pod startup).
+- While it's easy to exclude pods that require preempting other pods or
+volume unmounting in tests (where we fully control the environment), we need
+to figure out how to do that properly in production environments.

 ### Test scenario

 __TODO: Descibe test scenario.__
+
+Note: when running tests against clusters with nodes in multiple zones, the
+preprovisioned volumes should be balanced across zones so that we don't make
+pods unschedulable due to lack of resources in a single zone.
--- a/sig-scalability/slos/slos.md
+++ b/sig-scalability/slos/slos.md
@@ -105,7 +105,8 @@ Prerequisite: Kubernetes cluster is available and serving.
 | --- | --- | --- | --- |
 | __Official__ | Latency of mutating API calls for single objects for every (resource, verb) pair, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, for every (resource, verb) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= 1s | [Details](./api_call_latency.md) |
 | __Official__ | Latency of non-streaming read-only API calls for every (resource, scope pair, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, for every (resource, scope) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> (a) <= 1s if `scope=resource` (b) <= 5s if `scope=namespace` (c) <= 30s if `scope=cluster` | [Details](./api_call_latency.md) |
-| __Official__ | Startup latency of stateless and schedulable pods, excluding time to pull images and run init containers, measured from pod creation timestamp to when all its containers are reported as started and observed via watch, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= 5s | [Details](./pod_startup_latency.md) |
+| __Official__ | Startup latency of schedulable stateless pods, excluding time to pull images and run init containers, measured from pod creation timestamp to when all its containers are reported as started and observed via watch, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= 5s | [Details](./pod_startup_latency.md) |
+| __WIP__ | Startup latency of schedulable stateful pods, excluding time to pull images, run init containers, provision volumes (in delayed binding mode) and unmount/detach volumes (from previous pod if needed), measured from pod creation timestamp to when all its containers are reported as started and observed via watch, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= X where X depends on storage provider | [Details](./pod_startup_latency.md) |
 | __WIP__ | Latency of programming in-cluster load balancing mechanism (e.g. iptables), measured from when service spec or list of its `Ready` pods change to when it is reflected in load balancing mechanism, measured as 99th percentile over last 5 minutes aggregated across all programmers | In default Kubernetes installation, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./network_programming_latency.md) |
 | __WIP__ | Latency of programming dns instance, measured from when service spec or list of its `Ready` pods change to when it is reflected in that dns instance, measured as 99th percentile over last 5 minutes aggregated across all dns instances | In default Kubernetes installation, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./dns_programming_latency.md) |
 | __WIP__ | In-cluster network latency from a single prober pod, measured as latency of per second ping from that pod to "null service", measured as 99th percentile over last 5 minutes. | In default Kubernetes installataion with RTT between nodes <= Y, 99th percentile of (99th percentile over all prober pods) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./network_latency.md) |
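
The "99th percentile over last 5 minutes" wording used by the pod startup SLIs can be illustrated with a small sketch. The SLO text does not say how the percentile is computed, so the nearest-rank method here is an assumption, and the names `percentile_nearest_rank` and `startup_latency_p99_ms` are hypothetical. Each sample is a pair of (observation time in seconds, startup latency in milliseconds), with the excluded phases assumed to be already subtracted.

```python
WINDOW_SECONDS = 5 * 60  # "over last 5 minutes"


def percentile_nearest_rank(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    # ceil(q * n / 100) using integer arithmetic to avoid float rounding.
    rank = (q * len(ordered) + 99) // 100
    return ordered[rank - 1]


def startup_latency_p99_ms(samples, now):
    """99th percentile of latencies observed in the 5 minutes before `now`."""
    recent = [
        latency_ms
        for observed_at, latency_ms in samples
        if now - WINDOW_SECONDS < observed_at <= now
    ]
    return percentile_nearest_rank(recent, 99)
```

With 100 samples of 1 ms to 100 ms inside the window, the function returns 99. Samples older than five minutes are ignored, however slow they were.
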