Rework PRR questionnaire
parent 31025aabc8 · commit 322dac026b
## Questionnaire

#### Feature enablement and rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [ ] Feature gate
    - Feature gate name:
    - Components depending on the feature gate:
  - [ ] Other
    - Describe the mechanism:
    - Will enabling / disabling the feature require downtime of the control
      plane?
    - Will enabling / disabling the feature require downtime or reprovisioning
      of a node? (Do not assume the `Dynamic Kubelet Config` feature is enabled.)
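
For the feature-gate path, a minimal sketch of the wiring, assuming the
standard `k8s.io/component-base/featuregate` machinery; the gate name
`MyFeature` is a hypothetical placeholder:

```go
package features

import (
	"k8s.io/apimachinery/pkg/util/runtime"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

// MyFeature is a hypothetical gate; real gates are declared in the owning
// component's features package.
const MyFeature featuregate.Feature = "MyFeature"

var defaultFeatureGates = map[featuregate.Feature]featuregate.FeatureSpec{
	// Alpha gates default to off, so the feature can be rolled back by
	// dropping --feature-gates=MyFeature=true from the component flags.
	MyFeature: {Default: false, PreRelease: featuregate.Alpha},
}

func init() {
	// Registering against the mutable gate lets each component flip the
	// feature at startup without a code change.
	runtime.Must(utilfeature.DefaultMutableFeatureGate.Add(defaultFeatureGates))
}
```

Call sites then guard the new behavior with
`utilfeature.DefaultFeatureGate.Enabled(features.MyFeature)`, which is what
makes live enable / disable possible.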

* **Can the feature be disabled once it has been enabled (i.e. can we roll back
  the enablement)?**
  Describe the consequences for existing workloads (e.g. if this is a runtime
  feature, can it break existing applications?).

* **What happens if we re-enable the feature after it was previously rolled back?**

* **Are there any tests for feature enablement / disablement?**
  At the very least, think about conversion tests if API types are being modified.
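
A hedged sketch of such an enablement / disablement test, assuming the
`k8s.io/component-base/featuregate/testing` helper (older versions of which
return a restore func); it continues the `MyFeature` sketch above:

```go
package features

import (
	"fmt"
	"testing"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	featuregatetesting "k8s.io/component-base/featuregate/testing"
)

// featureIsActive is a hypothetical probe for the gated behavior.
func featureIsActive() bool {
	return utilfeature.DefaultFeatureGate.Enabled(MyFeature)
}

func TestMyFeatureEnablement(t *testing.T) {
	for _, enabled := range []bool{true, false} {
		t.Run(fmt.Sprintf("enabled=%v", enabled), func(t *testing.T) {
			// Flip the gate for this subtest only; the returned func
			// restores the previous value.
			defer featuregatetesting.SetFeatureGateDuringTest(
				t, utilfeature.DefaultFeatureGate, MyFeature, enabled)()

			// The feature should take effect when enabled and stay
			// inert when disabled.
			if got := featureIsActive(); got != enabled {
				t.Errorf("feature active = %v, want %v", got, enabled)
			}
		})
	}
}
```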

#### Scalability

* **Will enabling / using this feature result in any new API calls?**
  Describe them, providing:
  - API call type (e.g. PATCH pods)
  - estimated throughput
  - originating component(s) (e.g. Kubelet, Feature-X-controller)
  focusing mostly on:
  - components listing and/or watching resources they didn't watch before
  - API calls that may be triggered by changes of some Kubernetes resources
    (e.g. an update of object X triggers new updates of object Y)
  - periodic API calls to reconcile state (e.g. periodic fetching of state,
    heartbeats, leader election, etc.)
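
As an illustration of the first category, a sketch of a hypothetical
`Feature-X-controller` PATCHing an annotation onto pods (call type: PATCH
pods; throughput: roughly one call per watched pod transition), assuming a
recent client-go:

```go
package controller

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// annotatePod issues the new PATCH pods call this feature would introduce.
func annotatePod(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	// A strategic-merge patch keeps the payload to a few tens of bytes,
	// which matters once multiplied across 5000-node clusters.
	patch := []byte(`{"metadata":{"annotations":{"feature-x.example.com/state":"on"}}}`)
	_, err := client.CoreV1().Pods(ns).Patch(
		ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		return fmt.Errorf("patching pod %s/%s: %w", ns, name, err)
	}
	return nil
}
```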

* **Will enabling / using this feature result in introducing new API types?**
  Describe them, providing:
  - API type
  - Supported number of objects per cluster
  - Supported number of objects per namespace (for namespace-scoped objects)

* **Will enabling / using this feature result in any new calls to the cloud
  provider?**

* **Will enabling / using this feature result in increasing size or count
  of the existing API objects?**
  Describe them, providing:
  - API type(s):
  - Estimated increase in size: (e.g. new annotation of size 32B)
  - Estimated amount of new objects: (e.g. new Object X for every existing Pod)

* **Will enabling / using this feature result in increasing time taken by any
  operations covered by [existing SLIs/SLOs][]?**
  Think about adding additional work or introducing new steps in between
  (e.g. need to do X to start a container), etc. Please describe the details.

* **Will enabling / using this feature result in a non-negligible increase of
  resource usage (CPU, RAM, disk, IO, ...) in any components?**
  Things to keep in mind include: additional in-memory state, additional
  non-trivial computations, excessive access to disks (including increased log
  volume), significant amounts of data sent and/or received over the network, etc.
  Think through this in both small and large cases, again with respect to the
  [supported limits][].

#### Rollout, Upgrade and Rollback Planning

#### Dependencies

* **Does this feature depend on any specific services running in the cluster?**
  Think about both cluster-level services (e.g. metrics-server) as well as
  node-level agents (e.g. a specific version of the CRI).

* **How does this feature respond to complete failures of the services on which
  it depends?**
  Think about both running and newly created user workloads as well as
  cluster-level services (e.g. DNS).

* **How does this feature respond to degraded performance or high error rates
  from services on which it depends?**
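
One common way a feature can tolerate a degraded dependency is to retry with
bounded backoff rather than fail hard; a sketch using
`k8s.io/apimachinery/pkg/util/wait`, where `queryDependency` is a hypothetical
call to the dependent service:

```go
package featurex

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// queryDependency stands in for a call to the dependent service.
func queryDependency() (string, error) { return "ok", nil }

// fetchWithBackoff retries a flaky dependency a bounded number of times, so a
// slow or briefly failing service degrades the feature instead of breaking it.
func fetchWithBackoff() (string, error) {
	var result string
	backoff := wait.Backoff{
		Duration: 100 * time.Millisecond, // first retry delay
		Factor:   2.0,                    // double the delay each step
		Steps:    5,                      // give up after 5 attempts
	}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		r, err := queryDependency()
		if err != nil {
			return false, nil // transient: retry with backoff
		}
		result = r
		return true, nil // done
	})
	return result, err
}
```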

#### Monitoring requirements

* **How can an operator determine if the feature is in use by workloads?**

* **How can an operator determine if the feature is functioning properly?**
  Focus on metrics that cluster operators may gather from different
  components, and treat other signals as a last resort.

* **What are the SLIs (Service Level Indicators) an operator can use to
  determine the health of the service?**
  - [ ] Metrics
    - Metric name:
    - [Optional] Aggregation method:
    - Components exposing the metric:
  - [ ] Other (treat as last resort)
    - Details:

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
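
For the metrics path, a minimal sketch of exposing such an SLI, using plain
`prometheus/client_golang` for brevity (Kubernetes components typically go
through `k8s.io/component-base/metrics`); the metric name and label are
hypothetical:

```go
package featurex

import (
	"github.com/prometheus/client_golang/prometheus"
)

// featureXSyncDuration is a hypothetical SLI: how long one Feature-X
// reconciliation takes. A matching SLO might be "99% of syncs complete
// in under 1s over a 30-minute window".
var featureXSyncDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "feature_x_sync_duration_seconds",
		Help:    "Time taken to reconcile a single Feature-X object.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"result"}, // "success" or "error"
)

func init() {
	prometheus.MustRegister(featureXSyncDuration)
}

// observeSync records one reconciliation outcome.
func observeSync(seconds float64, result string) {
	featureXSyncDuration.WithLabelValues(result).Observe(seconds)
}
```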

#### Troubleshooting

The Troubleshooting section currently serves the `Playbook` role. We may consider
splitting it into a dedicated `Playbook` document (potentially with some monitoring
details). For now we leave it here, though, with some questions not required until
later stages (e.g. Beta/GA) of the feature lifecycle.

* **What are the known failure modes?**

* **How can those be detected via metrics or logs?**

* **What are the mitigations for each of those failure modes?**

* **What are the most useful log messages and what logging levels do they require?**
  Not required until the feature graduates to Beta.
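
When answering, it helps to show concrete messages with their verbosity
levels; a hedged sketch using `k8s.io/klog/v2`, with hypothetical message
text:

```go
package featurex

import (
	"k8s.io/klog/v2"
)

// reconcileLog shows the kinds of messages and levels worth documenting.
func reconcileLog(obj string, err error) {
	if err != nil {
		// Errors are always emitted; include enough context to tie the
		// failure mode back to a specific object.
		klog.Errorf("failed to reconcile %s: %v", obj, err)
		return
	}
	// V(2) is a common default: one message per significant change.
	klog.V(2).Infof("reconciled %s", obj)
	// V(4)+ is debug detail operators enable only while troubleshooting.
	klog.V(4).Infof("reconcile details for %s", obj)
}
```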

* **What steps should be taken if SLOs are not being met to determine the problem?**

[PRR KEP]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-architecture/20190731-production-readiness-review-process.md
[supported limits]: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md