Rework PRR questionnaire
parent 31025aabc8 · commit 322dac026b
## Questionnaire

#### Feature enablement and rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [ ] Feature gate
    - Feature gate name:
    - Components depending on the feature gate:
  - [ ] Other
    - Describe the mechanism:
    - Will enabling / disabling the feature require downtime of the control
      plane?
    - Will enabling / disabling the feature require downtime or reprovisioning
      of a node? (Do not assume the `Dynamic Kubelet Config` feature is enabled.)
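
For the feature-gate path, a minimal sketch of the wiring, assuming the
standard `k8s.io/component-base/featuregate` machinery; the gate name
`MyFeature` is a hypothetical placeholder:

```go
package features

import (
	"k8s.io/apimachinery/pkg/util/runtime"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

// MyFeature is a hypothetical gate; real gates are declared in the owning
// component's features package.
const MyFeature featuregate.Feature = "MyFeature"

var defaultFeatureGates = map[featuregate.Feature]featuregate.FeatureSpec{
	// Alpha gates default to off, so the feature can be rolled back by
	// dropping --feature-gates=MyFeature=true from the component flags.
	MyFeature: {Default: false, PreRelease: featuregate.Alpha},
}

func init() {
	// Registering against the mutable gate lets each component flip the
	// feature at startup without a code change.
	runtime.Must(utilfeature.DefaultMutableFeatureGate.Add(defaultFeatureGates))
}
```

Call sites then guard the new behavior with
`utilfeature.DefaultFeatureGate.Enabled(features.MyFeature)`, which is what
makes live enable / disable possible.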

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
  the enablement)?**
  Describe the consequences for existing workloads (e.g. if this is a runtime
  feature, can it break existing applications?).

* **What happens if we re-enable the feature after it was previously rolled back?**

* **Are there any tests for feature enablement / disablement?**
  At the very least, think about conversion tests if API types are being modified.
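
A hedged sketch of such an enablement / disablement test, assuming the
`k8s.io/component-base/featuregate/testing` helper (older versions of which
return a restore func); it continues the `MyFeature` sketch above:

```go
package features

import (
	"fmt"
	"testing"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	featuregatetesting "k8s.io/component-base/featuregate/testing"
)

// featureIsActive is a hypothetical probe for the gated behavior.
func featureIsActive() bool {
	return utilfeature.DefaultFeatureGate.Enabled(MyFeature)
}

func TestMyFeatureEnablement(t *testing.T) {
	for _, enabled := range []bool{true, false} {
		t.Run(fmt.Sprintf("enabled=%v", enabled), func(t *testing.T) {
			// Flip the gate for this subtest only; the returned func
			// restores the previous value.
			defer featuregatetesting.SetFeatureGateDuringTest(
				t, utilfeature.DefaultFeatureGate, MyFeature, enabled)()

			// The feature should take effect when enabled and stay
			// inert when disabled.
			if got := featureIsActive(); got != enabled {
				t.Errorf("feature active = %v, want %v", got, enabled)
			}
		})
	}
}
```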

#### Scalability

* **Will enabling / using this feature result in any new API calls?**
  Describe them, providing:
  - API call type (e.g. PATCH pods)
  - estimated throughput
  - originating component(s) (e.g. Kubelet, Feature-X-controller)
  focusing mostly on:
  - components listing and/or watching resources they didn't watch before
  - API calls that may be triggered by changes of some Kubernetes resources
    (e.g. an update of object X triggers new updates of object Y)
  - periodic API calls to reconcile state (e.g. periodic fetching of state,
    heartbeats, leader election, etc.)
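
As an illustration of the first category, a sketch of a hypothetical
`Feature-X-controller` PATCHing an annotation onto pods (call type: PATCH
pods; throughput: roughly one call per watched pod transition), assuming a
recent client-go:

```go
package controller

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// annotatePod issues the new PATCH pods call this feature would introduce.
func annotatePod(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	// A strategic-merge patch keeps the payload to a few tens of bytes,
	// which matters once multiplied across 5000-node clusters.
	patch := []byte(`{"metadata":{"annotations":{"feature-x.example.com/state":"on"}}}`)
	_, err := client.CoreV1().Pods(ns).Patch(
		ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		return fmt.Errorf("patching pod %s/%s: %w", ns, name, err)
	}
	return nil
}
```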

* **Will enabling / using this feature result in introducing new API types?**
  Describe them, providing:
  - API type
  - Supported number of objects per cluster
  - Supported number of objects per namespace (for namespace-scoped objects)

* **Will enabling / using this feature result in any new calls to the cloud
  provider?**

* **Will enabling / using this feature result in increasing size or count
  of the existing API objects?**
  Describe them, providing:
  - API type(s):
  - Estimated increase in size: (e.g. new annotation of size 32B)
  - Estimated amount of new objects: (e.g. new Object X for every existing Pod)

* **Will enabling / using this feature result in increasing time taken by any
  operations covered by [existing SLIs/SLOs][]?**
  Think about adding additional work or introducing new steps in between
  (e.g. need to do X to start a container), etc. Please describe the details.

* **Will enabling / using this feature result in a non-negligible increase of
  resource usage (CPU, RAM, disk, IO, ...) in any components?**
  Things to keep in mind include: additional in-memory state, additional
  non-trivial computations, excessive access to disks (including increased log
  volume), significant amounts of data sent and/or received over the network, etc.
  Think through this in both small and large cases, again with respect to the
  [supported limits][].

#### Rollout, Upgrade and Rollback Planning

#### Dependencies

* **Does this feature depend on any specific services running in the cluster?**
  Think about both cluster-level services (e.g. metrics-server) as well as
  node-level agents (e.g. a specific version of the CRI).

* **How does this feature respond to complete failures of the services on which
  it depends?**
  Think about both running and newly created user workloads as well as
  cluster-level services (e.g. DNS).

* **How does this feature respond to degraded performance or high error rates
  from services on which it depends?**
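
One common way a feature can tolerate a degraded dependency is to retry with
bounded backoff rather than fail hard; a sketch using
`k8s.io/apimachinery/pkg/util/wait`, where `queryDependency` is a hypothetical
call to the dependent service:

```go
package featurex

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// queryDependency stands in for a call to the dependent service.
func queryDependency() (string, error) { return "ok", nil }

// fetchWithBackoff retries a flaky dependency a bounded number of times, so a
// slow or briefly failing service degrades the feature instead of breaking it.
func fetchWithBackoff() (string, error) {
	var result string
	backoff := wait.Backoff{
		Duration: 100 * time.Millisecond, // first retry delay
		Factor:   2.0,                    // double the delay each step
		Steps:    5,                      // give up after 5 attempts
	}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		r, err := queryDependency()
		if err != nil {
			return false, nil // transient: retry with backoff
		}
		result = r
		return true, nil // done
	})
	return result, err
}
```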

#### Monitoring requirements

* **How can an operator determine if the feature is in use by workloads?**

* **How can an operator determine if the feature is functioning properly?**
  Focus on metrics that cluster operators may gather from different
  components, and treat other signals as a last resort.

* **What are the SLIs (Service Level Indicators) an operator can use to
  determine the health of the service?**
  - [ ] Metrics
    - Metric name:
    - [Optional] Aggregation method:
    - Components exposing the metric:
  - [ ] Other (treat as last resort)
    - Details:

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
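
For the metrics path, a minimal sketch of exposing such an SLI, using plain
`prometheus/client_golang` for brevity (Kubernetes components typically go
through `k8s.io/component-base/metrics`); the metric name and label are
hypothetical:

```go
package featurex

import (
	"github.com/prometheus/client_golang/prometheus"
)

// featureXSyncDuration is a hypothetical SLI: how long one Feature-X
// reconciliation takes. A matching SLO might be "99% of syncs complete
// in under 1s over a 30-minute window".
var featureXSyncDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "feature_x_sync_duration_seconds",
		Help:    "Time taken to reconcile a single Feature-X object.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"result"}, // "success" or "error"
)

func init() {
	prometheus.MustRegister(featureXSyncDuration)
}

// observeSync records one reconciliation outcome.
func observeSync(seconds float64, result string) {
	featureXSyncDuration.WithLabelValues(result).Observe(seconds)
}
```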

#### Troubleshooting

The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now we leave it here, though, with some questions not required until
later stages (e.g. Beta/GA) of the feature lifecycle.

* **What are the known failure modes?**

* **How can those be detected via metrics or logs?**

* **What are the mitigations for each of those failure modes?**

* **What are the most useful log messages and what logging levels do they require?**
  Not required until the feature graduates to Beta.
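
When answering, it helps to show concrete messages with their verbosity
levels; a hedged sketch using `k8s.io/klog/v2`, with hypothetical message
text:

```go
package featurex

import (
	"k8s.io/klog/v2"
)

// reconcileLog shows the kinds of messages and levels worth documenting.
func reconcileLog(obj string, err error) {
	if err != nil {
		// Errors are always emitted; include enough context to tie the
		// failure mode back to a specific object.
		klog.Errorf("failed to reconcile %s: %v", obj, err)
		return
	}
	// V(2) is a common default: one message per significant change.
	klog.V(2).Infof("reconciled %s", obj)
	// V(4)+ is debug detail operators enable only while troubleshooting.
	klog.V(4).Infof("reconcile details for %s", obj)
}
```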

* **What steps should be taken if SLOs are not being met to determine the problem?**

[PRR KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/20190731-production-readiness-review-process.md
[supported limits]: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md