Rework PRR questionnaire

wojtekt 2019-12-11 10:48:11 +01:00
parent 31025aabc8
commit 322dac026b
1 changed files with 118 additions and 57 deletions


@@ -15,63 +15,124 @@ aspects of the process.
## Questionnaire
-* Feature enablement and rollback
-  - How can this feature be enabled / disabled in a live cluster?
-  - Can the feature be disabled once it has been enabled (i.e., can we roll
-    back the enablement)?
-  - Will enabling / disabling the feature require downtime for the control
-    plane?
-  - Will enabling / disabling the feature require downtime or reprovisioning
-    of a node?
-  - What happens if a cluster with this feature enabled is rolled back? What
-    happens if it is subsequently upgraded again?
-  - Are there tests for this?
-* Scalability
-  - Will enabling / using the feature result in any new API calls?
-    Describe them with their impact keeping in mind the [supported limits][]
-    (e.g. 5000 nodes per cluster, 100 pods/s churn) focusing mostly on:
-     - components listing and/or watching resources they didn't before
-     - API calls that may be triggered by changes of some Kubernetes
-       resources (e.g. update object X based on changes of object Y)
-     - periodic API calls to reconcile state (e.g. periodic fetching state,
-       heartbeats, leader election, etc.)
-  - Will enabling / using the feature result in supporting new API types?
-    How many objects of that type will be supported (and how that translates
-    to limitations for users)?
-  - Will enabling / using the feature result in increasing size or count
-    of the existing API objects?
-  - Will enabling / using the feature result in increasing time taken
-    by any operations covered by [existing SLIs/SLOs][] (e.g. by adding
-    additional work, introducing new steps in between, etc.)?
-    Please describe the details if so.
-  - Will enabling / using the feature result in non-negligible increase
-    of resource usage (CPU, RAM, disk IO, ...) in any components?
-    Things to keep in mind include: additional in-memory state, additional
-    non-trivial computations, excessive access to disks (including increased
-    log volume), significant amount of data sent and/or received over
-    network, etc. Think through this in both small and large cases, again
-    with respect to the [supported limits][].
-* Rollout, Upgrade, and Rollback Planning
-* Dependencies
-  - Does this feature depend on any specific services running in the cluster
-    (e.g., a metrics service)?
-  - How does this feature respond to complete failures of the services on
-    which it depends?
-  - How does this feature respond to degraded performance or high error rates
-    from services on which it depends?
-* Monitoring requirements
-  - How can an operator determine if the feature is in use by workloads?
-  - How can an operator determine if the feature is functioning properly?
-  - What are the service level indicators an operator can use to determine the
-    health of the service?
-  - What are reasonable service level objectives for the feature?
-* Troubleshooting
-  - What are the known failure modes?
-  - How can those be detected via metrics or logs?
-  - What are the mitigations for each of those failure modes?
-  - What are the most useful log messages and what logging levels do they require?
-  - What steps should be taken if SLOs are not being met to determine the
-    problem?
+#### Feature enablement and rollback
+
+* **How can this feature be enabled / disabled in a live cluster?**
+  - [ ] Feature gate
+    - Feature gate name:
+    - Components depending on the feature gate:
+  - [ ] Other
+    - Describe the mechanism:
+    - Will enabling / disabling the feature require downtime of the control
+      plane?
+    - Will enabling / disabling the feature require downtime or reprovisioning
+      of a node? (Do not assume `Dynamic Kubelet Config` feature is enabled).
+
+* **Can the feature be disabled once it has been enabled (i.e. can we roll back
+  the enablement)?**
+  Describe the consequences on existing workloads (e.g. if this is a runtime
+  feature, can it break the existing applications?).
+
+* **What happens if we re-enable the feature if it was previously rolled back?**
+
+* **Are there any tests for feature enablement / disablement?**
+  At the very least, think about conversion tests if API types are being modified.
+
+#### Scalability
+
+* **Will enabling / using this feature result in any new API calls?**
+  Describe them, providing:
+  - API call type (e.g. PATCH pods)
+  - estimated throughput
+  - originating component(s) (e.g. Kubelet, Feature-X-controller)
+  focusing mostly on:
+  - components listing and/or watching resources they didn't before
+  - API calls that may be triggered by changes of some Kubernetes resources
+    (e.g. update of object X triggers new updates of object Y)
+  - periodic API calls to reconcile state (e.g. periodic fetching state,
+    heartbeats, leader election, etc.)
+
+* **Will enabling / using this feature result in introducing new API types?**
+  Describe them, providing:
+  - API type
+  - Supported number of objects per cluster
+  - Supported number of objects per namespace (for namespace-scoped objects)
+
+* **Will enabling / using this feature result in any new calls to the cloud
+  provider?**
+
+* **Will enabling / using this feature result in increasing size or count
+  of the existing API objects?**
+  Describe them, providing:
+  - API type(s):
+  - Estimated increase in size: (e.g. new annotation of size 32B)
+  - Estimated number of new objects: (e.g. new Object X for every existing Pod)
+
+* **Will enabling / using this feature result in increasing time taken by any
+  operations covered by [existing SLIs/SLOs][]?**
+  Think about adding additional work or introducing new steps in between
+  (e.g. need to do X to start a container), etc. Please describe the details.
+
+* **Will enabling / using this feature result in non-negligible increase of
+  resource usage (CPU, RAM, disk, IO, ...) in any components?**
+  Things to keep in mind include: additional in-memory state, additional
+  non-trivial computations, excessive access to disks (including increased log
+  volume), significant amount of data sent and/or received over network, etc.
+  Think through this in both small and large cases, again with respect to the
+  [supported limits][].
+
+#### Rollout, Upgrade and Rollback Planning
+
+#### Dependencies
+
+* **Does this feature depend on any specific services running in the cluster?**
+  Think about both cluster-level services (e.g. metrics-server) as well
+  as node-level agents (e.g. specific version of CRI).
+
+* **How does this feature respond to complete failures of the services on which
+  it depends?**
+  Think about both running and newly created user workloads as well as
+  cluster-level services (e.g. DNS).
+
+* **How does this feature respond to degraded performance or high error rates
+  from services on which it depends?**
+
+#### Monitoring requirements
+
+* **How can an operator determine if the feature is in use by workloads?**
+
+* **How can an operator determine if the feature is functioning properly?**
+  Focus on metrics that cluster operators may gather from different
+  components and treat other signals as last resort.
+
+* **What are the SLIs (Service Level Indicators) an operator can use to
+  determine the health of the service?**
+  - [ ] Metrics
+    - Metric name:
+    - [Optional] Aggregation method:
+    - Components exposing the metric:
+  - [ ] Other (treat as last resort)
+    - Details:
+
+* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+
+#### Troubleshooting
+
+The Troubleshooting section currently serves the `Playbook` role. We may consider
+splitting it into a dedicated `Playbook` document (potentially with some monitoring
+details). For now we leave it here, with some questions not required until
+later stages (e.g. Beta/GA) of the feature lifecycle.
+
+* **What are the known failure modes?**
+
+* **How can those be detected via metrics or logs?**
+
+* **What are the mitigations for each of those failure modes?**
+
+* **What are the most useful log messages and what logging levels do they require?**
+  Not required until the feature graduates to Beta.
+
+* **What steps should be taken if SLOs are not being met to determine the problem?**
+
[PRR KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/20190731-production-readiness-review-process.md
[supported limits]: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md
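
As an illustration of the "estimated throughput" item in the scalability questions, the following is a minimal sketch of how a questionnaire answer for periodic API calls might be worked out. It is not part of this commit. The `PeriodicCall` type, the `total_qps` helper, the 10-second and 2-second periods, and the `GET leases` call are all hypothetical; only the 5000-node figure and the `PATCH pods`, `Kubelet`, and `Feature-X-controller` examples come from the questionnaire text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodicCall:
    """One periodic API call issued by every instance of a component (hypothetical model)."""
    call_type: str         # e.g. "PATCH pods"
    component: str         # originating component, e.g. "Kubelet"
    instances: int         # number of component instances in the cluster
    period_seconds: float  # how often each instance issues the call

    def qps(self) -> float:
        # Steady-state rate: every instance issues one call per period.
        return self.instances / self.period_seconds


def total_qps(calls):
    """Sum the steady-state rates of all listed calls."""
    return sum(c.qps() for c in calls)


# Assumed figures: one Kubelet per node at the 5000-node example limit,
# plus a single controller instance polling every 2 seconds.
calls = [
    PeriodicCall("PATCH pods", "Kubelet", instances=5000, period_seconds=10.0),
    PeriodicCall("GET leases", "Feature-X-controller", instances=1, period_seconds=2.0),
]

for c in calls:
    print(f"{c.call_type} from {c.component}: {c.qps():.1f} QPS")
print(f"total: {total_qps(calls):.1f} QPS")
```

An answer built this way lists each call type with its originating component and rate, which maps directly onto the three bullets the question asks for. Event-driven calls (those triggered by changes to other objects) would need a different model based on the expected churn rate.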