Update PRR questionnaire

2020-04-01 19:47:05 +02:00 · 2020-04-01 19:47:05 +02:00 · 56d9ea2204
parent 7d8c4a06e5
commit 56d9ea2204
1 changed files with 73 additions and 48 deletions
--- a/sig-architecture/production-readiness.md
+++ b/sig-architecture/production-readiness.md
@@ -41,6 +41,67 @@ aspects of the process.
  with and without the feature are necessary. At the very least, think about
  conversion tests if API types are being modified.

+
+#### Rollout, Upgrade and Rollback Planning
+
+* **How can a rollout fail? Can it impact already running workloads?**
+  Try to be as paranoid as possible - e.g. what if some components restart
+  in the middle of the rollout?
+
+* **What specific metrics should inform a rollback?**
+
+* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?**
+  Describe manual testing that was done and the outcomes.
+  Longer term, we may want to require automated upgrade/rollback tests, but we
+  are missing a bunch of machinery and tooling to do that now.
+
+
+#### Monitoring requirements
+
+* **How can an operator determine if the feature is in use by workloads?**
+  Ideally, this should be a metric. Operations against the Kubernetes API (e.g.
+  checking if there are objects with field X set) may be a last resort. Avoid
+  logs or events for this purpose.
+
+* **How can an operator determine if the feature is functioning properly?**
+  Focus on metrics that cluster operators may gather from different
+  components and treat other signals as a last resort.
+  TODO: Provide examples to make answering this question easier.
+
+* **What are the SLIs (Service Level Indicators) an operator can use to
+  determine the health of the service?**
+  - [ ] Metrics
+    - Metric name:
+    - [Optional] Aggregation method:
+    - Components exposing the metric:
+  - [ ] Other (treat as last resort)
+    - Details:
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+  TODO: Provide examples for different features (e.g. server-side apply, user-space
+  proxy, cronjob controller) to make answering this question easier.
+
+* **Are there any missing metrics that would be useful to have to improve
+  observability of this feature?**
+  Describe the metrics themselves and the reason they weren't added (e.g. cost,
+  implementation difficulties, etc.).
+
+#### Dependencies
+
+* **Does this feature depend on any specific services running in the cluster?**
+  Think about both cluster-level services (e.g. metrics-server) as well
+  as node-level agents (e.g. specific version of CRI). Focus on external or
+  optional services that are needed. For example, if this feature depends on
+  a cloud provider API, or upon an external software-defined storage or network
+  control plane.
+  For each of these, fill in the following, thinking both about running user workloads
+  and creating new ones, as well as about cluster-level services (e.g. DNS):
+  - [Dependency name]
+    - Usage description:
+    - Impact of its outage on the feature:
+    - Impact of its degraded performance or high error rates on the feature:
+
+
 #### Scalability

 * **Will enabling / using this feature result in any new API calls?**
@@ -65,7 +126,7 @@ aspects of the process.
  provider?**

 * **Will enabling / using this feature result in increasing size or count
-  of the existing API objects?*
+  of the existing API objects?**
  Describe them providing:
  - API type(s):
  - Estimated increase in size: (e.g. new annotation of size 32B)
@ -84,43 +145,6 @@ aspects of the process.
  Think through this both in small and large cases, again with respect to the
  [supported limits][].

-#### Rollout, Upgrade and Rollback Planning
-
-#### Dependencies
-
-* **Does this feature depend on any specific services running in the cluster?**
-  Think about both cluster-level services (e.g. metrics-server) as well
-  as node-level agents (e.g. specific version of CRI). Focus on external or
-  optional services that are needed. For example, if this feature depends on
-  a cloud provider API, or upon an external software-defined storage or network
-  control plane.
-
-* **How does this feature respond to complete failures of the services on which
-  it depends?**
-  Think about both running and newly created user workloads as well as
-  cluster-level services (e.g. DNS).
-
-* **How does this feature respond to degraded performance or high error rates
-  from services on which it depends?**
-
-#### Monitoring requirements
-
-* **How can an operator determine if the feature is in use by workloads?**
-
-* **How can an operator determine if the feature is functioning properly?**
-  Focus on metrics that cluster operators may gather from different
-  components and treat other signals as last resort.
-
-* **What are the SLIs (Service Level Indicators) an operator can use to
-  determine the health of the service?**
-  - [ ] Metrics
-    - Metric name:
-    - [Optional] Aggregation method:
-    - Components exposing the metric:
-  - [ ] Other (treat as last resort)
-    - Details:
-
-* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

 #### Troubleshooting
 Troubleshooting section serves the `Playbook` role as of now. We may consider
@@ -128,18 +152,19 @@ splitting it into a dedicated `Playbook` document (potentially with some monitor
 details). For now we leave it here though, with some questions not required until
 further stages (e.g. Beta/GA) of the feature lifecycle.

-* **How does this feature react if the API server is unavailable?**
+* **How does this feature react if the API server and/or etcd is unavailable?**

 * **What are other known failure modes?**
-
-* **How can those be detected via metrics or logs?**
-  Stated another way: how can an operator troubleshoot without logging into a
-  master or worker node?
-
-* **What are the mitigations for each of those failure modes?**
-
-* **What are the most useful log messages and what logging levels do they require?**
-  Not required until feature graduates to Beta.
+  For each of them, fill in the following information by copying the template below:
+  - [Failure mode brief description]
+    - Detection: How can it be detected via metrics? Stated another way:
+      how can an operator troubleshoot without logging into a master or worker node?
+    - Mitigations: What can be done to stop the bleeding, especially for already
+      running user workloads?
+    - Diagnostics: What are the useful log messages and their required logging
+      levels that could help debug the issue?
+      Not required until the feature graduates to Beta.
+    - Testing: Are there any tests for this failure mode? If not, describe why.

 * **What steps should be taken if SLOs are not being met to determine the problem?**
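
Illustration for the SLI/SLO questions in this change: the sketch below shows one possible way an answer could be made concrete, by deriving a success-ratio SLI from two request counters and comparing it with an SLO target. It is not part of the questionnaire. The names `SLO`, `success_ratio`, and `slo_met`, the example target of 0.99, and the counter values are all hypothetical and do not refer to any real Kubernetes metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical objective: the SLI must stay at or above `target`."""
    name: str
    target: float  # e.g. 0.99 means 99% of requests must succeed


def success_ratio(total: int, errors: int) -> float:
    """SLI: fraction of requests that succeeded (1.0 when there was no traffic)."""
    if total < 0 or errors < 0 or errors > total:
        raise ValueError("expected 0 <= errors <= total")
    if total == 0:
        return 1.0
    return (total - errors) / total


def slo_met(slo: SLO, total: int, errors: int) -> bool:
    """Return True when the measured SLI satisfies the objective."""
    return success_ratio(total, errors) >= slo.target


if __name__ == "__main__":
    # Assumed counter values, as if scraped from a component's metrics endpoint.
    objective = SLO(name="feature-requests-succeed", target=0.99)
    print(slo_met(objective, total=1000, errors=5))
    print(slo_met(objective, total=1000, errors=20))
```

A check like this could back the "Detection" entry of a failure mode: when `slo_met` turns false, the operator starts the troubleshooting steps described in the answer to the last question.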