Rework PRR questionnaire

wojtekt 2019-12-11 10:48:11 +01:00
parent 31025aabc8
commit 322dac026b
1 changed files with 118 additions and 57 deletions

@@ -15,63 +15,124 @@ aspects of the process.
## Questionnaire ## Questionnaire
* Feature enablement and rollback #### Feature enablement and rollback
- How can this feature be enabled / disabled in a live cluster?
- Can the feature be disabled once it has been enabled (i.e., can we roll * **How can this feature be enabled / disabled in a live cluster?**
back the enablement)? - [ ] Feature gate
- Will enabling / disabling the feature require downtime for the control - Feature gate name:
- Components depending on the feature gate:
- [ ] Other
- Describe the mechanism:
- Will enabling / disabling the feature require downtime of the control
plane? plane?
- Will enabling / disabling the feature require downtime or reprovisioning - Will enabling / disabling the feature require downtime or reprovisioning
of a node? of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
- What happens if a cluster with this feature enabled is rolled back? What
happens if it is subsequently upgraded again? * **Can the feature be disabled once it has been enabled (i.e. can we roll back
- Are there tests for this? the enablement)?**
* Scalability Describe the consequences on existing workloads (e.g. if this is a runtime
- Will enabling / using the feature result in any new API calls? feature, can it break the existing applications?).
Describe them with their impact keeping in mind the [supported limits][]
(e.g. 5000 nodes per cluster, 100 pods/s churn) focusing mostly on: * **What happens if we reenable the feature if it was previously rolled back?**
* **Are there any tests for feature enablement / disablement?**
At the very least, think about conversion tests if API types are being modified.
#### Scalability
* **Will enabling / using this feature result in any new API calls?**
Describe them, providing:
- API call type (e.g. PATCH pods)
- estimated throughput
- originating component(s) (e.g. Kubelet, Feature-X-controller)
focusing mostly on:
- components listing and/or watching resources they didn't before - components listing and/or watching resources they didn't before
- API calls that may be triggered by changes of some Kubernetes - API calls that may be triggered by changes of some Kubernetes resources
resources (e.g. update object X based on changes of object Y) (e.g. update of object X triggers new updates of object Y)
- periodic API calls to reconcile state (e.g. periodic fetching state, - periodic API calls to reconcile state (e.g. periodic fetching state,
heartbeats, leader election, etc.) heartbeats, leader election, etc.)
- Will enabling / using the feature result in supporting new API types?
How many objects of that type will be supported (and how that translates * **Will enabling / using this feature result in introducing new API types?**
to limitations for users)? Describe them providing:
- Will enabling / using the feature result in increasing size or count - API type
of the existing API objects? - Supported number of objects per cluster
- Will enabling / using the feature result in increasing time taken - Supported number of objects per namespace (for namespace-scoped objects)
by any operations covered by [existing SLIs/SLOs][] (e.g. by adding
additional work, introducing new steps in between, etc.)? * **Will enabling / using this feature result in any new calls to cloud
Please describe the details if so. provider?**
- Will enabling / using the feature result in non-negligible increase
of resource usage (CPU, RAM, disk IO, ...) in any components? * **Will enabling / using this feature result in increasing size or count
of the existing API objects?**
Describe them providing:
- API type(s):
- Estimated increase in size: (e.g. new annotation of size 32B)
- Estimated amount of new objects: (e.g. new Object X for every existing Pod)
* **Will enabling / using this feature result in increasing time taken by any
operations covered by [existing SLIs/SLOs][]?**
Think about adding additional work or introducing new steps in between
(e.g. need to do X to start a container), etc. Please describe the details.
* **Will enabling / using this feature result in non-negligible increase of
resource usage (CPU, RAM, disk, IO, ...) in any components?**
Things to keep in mind include: additional in-memory state, additional Things to keep in mind include: additional in-memory state, additional
non-trivial computations, excessive access to disks (including increased non-trivial computations, excessive access to disks (including increased log
log volume), significant amount of data sent and/or received over volume), significant amount of data sent and/or received over network, etc.
network, etc. Think through this in both small and large cases, again Think through this both in small and large cases, again with respect to the
with respect to the [supported limits][]. [supported limits][].
* Rollout, Upgrade, and Rollback Planning
* Dependencies #### Rollout, Upgrade and Rollback Planning
- Does this feature depend on any specific services running in the cluster
(e.g., a metrics service)? #### Dependencies
- How does this feature respond to complete failures of the services on
which it depends? * **Does this feature depend on any specific services running in the cluster?**
- How does this feature respond to degraded performance or high error rates Think about both cluster-level services (e.g. metrics-server) as well
from services on which it depends? as node-level agents (e.g. specific version of CRI).
* Monitoring requirements
- How can an operator determine if the feature is in use by workloads? * **How does this feature respond to complete failures of the services on which
- How can an operator determine if the feature is functioning properly? it depends?**
- What are the service level indicators an operator can use to determine the Think about both running and newly created user workloads as well as
health of the service? cluster-level services (e.g. DNS).
- What are reasonable service level objectives for the feature?
* Troubleshooting * **How does this feature respond to degraded performance or high error rates
- What are the known failure modes? from services on which it depends?**
- How can those be detected via metrics or logs?
- What are the mitigations for each of those failure modes? #### Monitoring requirements
- What are the most useful log messages and what logging levels do they require?
- What steps should be taken if SLOs are not being met to determine the * **How can an operator determine if the feature is in use by workloads?**
problem?
* **How can an operator determine if the feature is functioning properly?**
Focus on metrics that cluster operators may gather from different
components and treat other signals as last resort.
* **What are the SLIs (Service Level Indicators) an operator can use to
determine the health of the service?**
- [ ] Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- [ ] Other (treat as last resort)
- Details:
* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
#### Troubleshooting
The Troubleshooting section serves the `Playbook` role as of now. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now we leave it here though, with some questions not required until
further stages (e.g. Beta/GA) of feature lifecycle.
* **What are the known failure modes?**
* **How can those be detected via metrics or logs?**
* **What are the mitigations for each of those failure modes?**
* **What are the most useful log messages and what logging levels do they require?**
Not required until feature graduates to Beta.
* **What steps should be taken if SLOs are not being met to determine the problem?**
[PRR KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/20190731-production-readiness-review-process.md [PRR KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/20190731-production-readiness-review-process.md
[supported limits]: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md [supported limits]: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
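
The Scalability question on new API calls asks for the call type, an estimated throughput, and the originating component(s). As a rough illustration of how such an estimate might be put together for periodic calls, here is a minimal Python sketch. The `PeriodicCall` class, the `total_qps` helper, and the call periods (10 s and 60 s) are assumptions made up for this example; only the 5000-node figure and the example names (`PATCH pods`, `Kubelet`, `Feature-X-controller`) come from the questionnaire text.

```python
from dataclasses import dataclass


@dataclass
class PeriodicCall:
    """One periodic API call pattern introduced by a feature (hypothetical model)."""
    call_type: str         # e.g. "PATCH pods"
    component: str         # originating component, e.g. "Kubelet"
    callers: int           # number of component instances issuing the call
    period_seconds: float  # how often each instance issues the call (assumed)

    def qps(self) -> float:
        # Steady-state calls per second across all instances.
        return self.callers / self.period_seconds


def total_qps(calls):
    """Sum the steady-state throughput of all listed call patterns."""
    return sum(c.qps() for c in calls)


# Assumed scenario: every node's Kubelet patches a pod every 10 s in a
# 5000-node cluster, and a single controller lists pods once per minute.
calls = [
    PeriodicCall("PATCH pods", "Kubelet", callers=5000, period_seconds=10.0),
    PeriodicCall("LIST pods", "Feature-X-controller", callers=1, period_seconds=60.0),
]

for c in calls:
    print(f"{c.call_type} from {c.component}: {c.qps():.2f} calls/s")
print(f"total: {total_qps(calls):.2f} calls/s")
```

This only covers periodic calls at a fixed rate. Calls triggered by changes to other objects would need an estimate of the change rate instead, for example the pod churn figure from the [supported limits][].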