Merge branch 'master' of https://github.com/kubernetes/community into patch-2

Jianfei Hu 2018-12-07 11:06:44 -08:00
commit 1b775253d3
634 changed files with 36223 additions and 9720 deletions


@ -1,5 +1,5 @@
<!-- Thanks for sending a pull request! Here are some tips for you:
- If this is your first contribution, read our Getting Started guide https://github.com/kubernetes/community#your-first-contribution
- If this is your first contribution, read our Getting Started guide https://github.com/kubernetes/community/blob/master/contributors/guide/README.md
- If you are editing SIG information, please follow these instructions: https://git.k8s.io/community/generator
You will need to follow these steps:
1. Edit sigs.yaml with your change

CLA.md

@ -1,28 +1,21 @@
# The Contributor License Agreement
The [Cloud Native Computing Foundation](https://www.cncf.io/community) defines
the legal status of the contributed code in a _Contributor License Agreement_
(CLA).
the legal status of the contributed code in two different types of _Contributor License Agreements_
(CLAs), [individual contributors](https://github.com/cncf/cla/blob/master/individual-cla.pdf) and [corporations](https://github.com/cncf/cla/blob/master/corporate-cla.pdf).
Only original source code from CLA signatories can be accepted into kubernetes.
Kubernetes can only accept original source code from CLA signatories.
This policy does not apply to [third_party](https://git.k8s.io/kubernetes/third_party)
and [vendor](https://git.k8s.io/kubernetes/vendor).
## What am I agreeing to?
There are two versions of the CLA:
1. One for [individual contributors](https://github.com/cncf/cla/blob/master/individual-cla.pdf)
submitting contributions on their own behalf.
1. One for [corporations](https://github.com/cncf/cla/blob/master/corporate-cla.pdf)
to sign for contributions submitted by their employees.
It is important to read and understand this legal agreement.
## How do I sign?
#### 1. Log into the Linux Foundation ID Portal with Github
If your work is done as an employee of your company, contact your company's legal department and ask to be put on the list of approved contributors for the Kubernetes CLA. Below, we have included steps for "Corporation signup" in case your company does not have a company agreement and would like to have one.
#### 1. Log in to the Linux Foundation ID Portal with Github
Click one of:
* [Individual signup](https://identity.linuxfoundation.org/projects/cncf) to
@ -46,8 +39,10 @@ person@organization.domain email address in the CNCF account registration page.
#### 3. Complete signing process
Once you have created your account, follow the instructions to complete the
signing process via Hellosign.
After creating your account, follow the instructions to complete the
signing process through HelloSign.
If you did not receive an email from HelloSign, [then request it here](https://identity.linuxfoundation.org/projects/cncf).
#### 4. Ensure your Github e-mail address matches address used to sign CLA
@ -58,8 +53,8 @@ on setting email addresses.
You must also set your [git e-mail](https://help.github.com/articles/setting-your-email-in-git)
to match this e-mail address as well.
If you've already submitted a PR you can correct your user.name and user.email
and then use use `git commit --amend --reset-author` and then `git push --force` to
If you already submitted a PR you can correct your user.name and user.email
and then use `git commit --amend --reset-author` and then `git push --force` to
correct the PR.
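For example, a minimal sketch of that fix, assuming your remote is named `origin` and your PR branch is `my-feature` (both placeholders), looks like:
```
# Point git at the name and e-mail address used to sign the CLA
git config --global user.name "Your Name"
git config --global user.email "person@organization.domain"

# Re-stamp the most recent commit with the corrected author information
git commit --amend --reset-author --no-edit

# Rewrite the PR branch on your fork; --force replaces its history
git push --force origin my-feature
```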
#### 5. Look for an email indicating successful signup.
@ -76,14 +71,27 @@ Once you have this, the CLA authorizer bot will authorize your PRs.
![CNCFCLA3](http://i.imgur.com/C5ZsNN6.png)
## Changing your Affiliation
If you've changed employers and still contribute to Kubernetes, your affiliation
needs to be updated. The Cloud Native Computing Foundation uses [gitdm](https://github.com/cncf/gitdm)
to track who is contributing and from where. Create a pull request to the gitdm
repository with a change to [developers_affiliations.txt](https://github.com/cncf/gitdm/blob/master/developers_affiliations.txt).
Your entry should look similar to this:
```
Jorge O. Castro*: jorge!heptio.com, jorge!ubuntu.com, jorge.castro!gmail.com
Heptio
Canonical until 2017-03-31
```
## Troubleshooting
If you are having problems with signed the CLA send a mail to: `helpdesk@rt.linuxfoundation.org`.
If you have problems signing the CLA, send an email message to: `helpdesk@rt.linuxfoundation.org`.
Someone from the CNCF will respond to your ticket to help.
## Setting up the CNCF CLA check
If you are a Kubernetes GitHub organization or repo owner, and would like to setup
the Linux Foundation CNCF CLA check for your repositories, please
[read the docs on setting up the CNCF CLA check](/setting-up-cla-check.md)
If you are a Kubernetes GitHub organization or repo owner and would like to setup
the Linux Foundation CNCF CLA check for your repositories, [read the docs on setting up the CNCF CLA check](/github-management/setting-up-cla-check.md)

Gopkg.lock

@ -7,14 +7,14 @@
".",
"cmd/misspell"
]
revision = "59894abde931a32630d4e884a09c682ed20c5c7c"
version = "v0.3.0"
revision = "b90dc15cfd220ecf8bbc9043ecb928cef381f011"
version = "v0.3.4"
[[projects]]
branch = "v2"
name = "gopkg.in/yaml.v2"
packages = ["."]
revision = "eb3733d160e74a9c7e442f435eb3bea458e1d19f"
revision = "5420a8b6744d3b0345ab293f6fcba19c978f1183"
version = "v2.2.1"
[solve-meta]
analyzer-name = "dep"


@ -1,2 +1,6 @@
required = ["github.com/client9/misspell/cmd/misspell"]
[prune]
go-tests = true
unused-packages = true
non-go = true

OWNERS

@ -4,10 +4,14 @@ reviewers:
- cblecker
- grodrigues3
- idvoretskyi
- jdumars
- parispittman
approvers:
- calebamiles
- castrojo
- cblecker
- grodrigues3
- idvoretskyi
- jdumars
- parispittman
- steering-committee


@ -9,9 +9,10 @@ aliases:
sig-architecture-leads:
- bgrant0607
- jdumars
- mattfarina
sig-auth-leads:
- ericchiang
- liggitt
- mikedanese
- enj
- tallclair
sig-autoscaling-leads:
- mwielgus
@ -19,20 +20,26 @@ aliases:
sig-aws-leads:
- justinsb
- kris-nova
- countspongebob
- d-nishi
sig-azure-leads:
- slack
- colemickens
- jdumars
- justaugustus
- dstrebel
- khenidak
- feiskyer
sig-big-data-leads:
- foxish
- erikerlandson
- liyinan926
sig-cli-leads:
- soltysh
- seans3
- soltysh
- pwittrock
- AdoHe
sig-cloud-provider-leads:
- andrewsykim
- hogepodge
- jagosan
sig-cluster-lifecycle-leads:
- lukemarsden
- roberthbailey
- luxas
- timothysc
@ -45,11 +52,15 @@ aliases:
- grodrigues3
- cblecker
sig-docs-leads:
- zacharysarah
- chenopis
- jaredbhatti
- zacharysarah
- bradamant3
sig-gcp-leads:
- abgworrall
sig-ibmcloud-leads:
- khahmed
- rtheis
- spzala
sig-instrumentation-leads:
- piosz
- brancz
@ -63,31 +74,29 @@ aliases:
sig-node-leads:
- dchen1107
- derekwaynecarr
sig-on-premise-leads:
- marcoceppi
- dghubble
sig-openstack-leads:
- hogepodge
- dklyle
- rjmorse
sig-product-management-leads:
sig-pm-leads:
- apsinha
- idvoretskyi
- calebamiles
sig-release-leads:
- jdumars
- calebamiles
- justaugustus
- tpepper
sig-scalability-leads:
- wojtek-t
- countspongebob
sig-scheduling-leads:
- bsalamat
- timothysc
- k82cn
sig-service-catalog-leads:
- pmorie
- arschles
- vaikas-google
- carolynvs
- kibbles-n-bytes
- duglin
- jboyd01
sig-storage-leads:
- saad-ali
- childsb
@ -104,52 +113,63 @@ aliases:
- cantbewong
sig-windows-leads:
- michmike
- patricklang
wg-app-def-leads:
- ant31
- bryanl
- garethr
wg-apply-leads:
- lavalamp
wg-cloud-provider-leads:
- wlan0
- jagosan
wg-cluster-api-leads:
- kris-nova
- roberthbailey
wg-container-identity-leads:
- smarterclayton
- destijl
wg-iot-edge-leads:
- cindyxing
- dejanb
- ptone
- cantbewong
wg-kubeadm-adoption-leads:
- luxas
- justinsb
wg-machine-learning-leads:
- vishh
- kow3ns
- balajismaniam
- ConnorDoyle
wg-multitenancy-leads:
- davidopp
- jessfraz
wg-policy-leads:
- hannibalhuang
- tsandall
- davidopp
- smarterclayton
- xuanjia
- easeway
- ericavonb
- mdelder
wg-resource-management-leads:
- vishh
- derekwaynecarr
wg-security-audit-leads:
- aasmall
- joelsmith
- cji
## BEGIN CUSTOM CONTENT
steering-committee:
- bgrant0607
- brendanburns
- derekwaynecarr
- dims
- jbeda
- michelleN
- philips
- pwittrock
- quinton-hoole
- sarahnovotny
- smarterclayton
- spiffxp
- thockin
- timothysc
code-of-conduct-committee:
- jdumars
- parispittman
- eparis
- carolynvs
- bradamant3
## END CUSTOM CONTENT


@ -13,16 +13,27 @@ issues, mailing lists, conferences, etc.
For more specific topics, try a SIG.
## SIGs
## Governance
Kubernetes is a set of subprojects, each shepherded by a Special Interest Group (SIG).
Kubernetes has three types of groups that are officially supported:
A first step to contributing is to pick from the [list of kubernetes SIGs](sig-list.md).
* **Committees** are named sets of people that are chartered to take on sensitive topics.
This group is encouraged to be as open as possible while achieving its mission but, because of the nature of the topics discussed, private communications are allowed.
Examples of committees include the steering committee and things like security or code of conduct.
* **Special Interest Groups (SIGs)** are persistent open groups that focus on a part of the project.
SIGs must have open and transparent proceedings.
Anyone is welcome to participate and contribute provided they follow the Kubernetes Code of Conduct.
The purpose of a SIG is to own and develop a set of **subprojects**.
* **Subprojects** Each SIG can have a set of subprojects.
These are smaller groups that can work independently.
Some subprojects will be part of the main Kubernetes deliverables while others will be more speculative and live in the `kubernetes-sigs` github org.
* **Working Groups** are temporary groups that are formed to address issues that cross SIG boundaries.
Working groups do not own any code or other long term artifacts.
Working groups can report back and act through involved SIGs.
A SIG can have its own policy for contribution,
described in a `README` or `CONTRIBUTING` file in the SIG
folder in this repo (e.g. [sig-cli/CONTRIBUTING](sig-cli/CONTRIBUTING.md)),
and its own mailing list, slack channel, etc.
See the [full governance doc](governance.md) for more details on these groups.
A SIG can have its own policy for contribution, described in a `README` or `CONTRIBUTING` file in the SIG folder in this repo (e.g. [sig-cli/CONTRIBUTING.md](sig-cli/CONTRIBUTING.md)), and its own mailing list, slack channel, etc.
If you want to edit details about a SIG (e.g. its weekly meeting time or its leads),
please follow [these instructions](./generator) that detail how our docs are auto-generated.
@ -34,6 +45,10 @@ lead to many relevant technical topics.
## Contribute
A first step to contributing is to pick from the [list of kubernetes SIGs](sig-list.md).
Start attending SIG meetings, join the slack channel and subscribe to the mailing list.
SIGs will often have a set of "help wanted" issues that can help new contributors get involved.
The [Contributor Guide](contributors/guide/README.md) provides detailed instruction on how to get your ideas and bug fixes seen and accepted, including:
1. How to [file an issue]
1. How to [find something to work on]
@ -55,4 +70,4 @@ contributors/guide/README.md#find-something-to-work-on
contributors/guide/README.md#open-a-pull-request
[Community Membership]:/community-membership.md
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/CONTRIBUTING.md?pixel)]()
![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/CONTRIBUTING.md?pixel)

SECURITY_CONTACTS

@ -0,0 +1,13 @@
# Defined below are the security contacts for this repo.
#
# They are the contact point for the Product Security Team to reach out
# to for triaging and handling of incoming issues.
#
# The below names agree to abide by the
# [Embargo Policy](https://github.com/kubernetes/sig-release/blob/master/security-release-process-documentation/security-release-process.md#embargo-policy)
# and will be removed and replaced if they violate that agreement.
#
# DO NOT REPORT SECURITY VULNERABILITIES DIRECTLY TO THESE NAMES, FOLLOW THE
# INSTRUCTIONS AT https://kubernetes.io/security/
cblecker


@ -1,3 +1,6 @@
# Kubernetes Community Code of Conduct
Kubernetes follows the [CNCF Code of Conduct](https://github.com/cncf/foundation/blob/master/code-of-conduct.md).
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting
the [Kubernetes Code of Conduct Committee](./committee-code-of-conduct) via <conduct@kubernetes.io>.


@ -0,0 +1,6 @@
reviewers:
- code-of-conduct-committee
approvers:
- code-of-conduct-committee
labels:
- committee/conduct


@ -0,0 +1,14 @@
# Kubernetes Code of Conduct Committee (CoCC)
The Kubernetes Code of Conduct Committee (CoCC) is the body that is responsible for enforcing and maintaining the Kubernetes Code of Conduct.
The members and their terms are as follows:
- Jaice Singer Dumars (Google) - 2 years
- Paris Pittman (Google) - 2 years
- Carolyn Van Slyck (Microsoft) - 1 year
- Eric Paris (Red Hat) - 1 year
- Jennifer Rondeau (Heptio) - 1 year
Please see the [bootstrapping document](./bootstrapping-process.md) for more information on how members are picked, their responsibilities, and how the committee will initially function.
_More information on how to contact this committee and learn about its process to come in the near future. For now, any Code of Conduct or Code of Conduct Committee concerns can be directed to <conduct@kubernetes.io>_.


@ -0,0 +1,36 @@
# Bootstrapping Document for Kubernetes Code of Conduct Committee
This document (created by the Kubernetes Steering Committee) outlines what the Code of Conduct Committee (CoCC) is responsible for, how to create it, and the rough processes for them to get started.
## Objectives of the CoCC
* Maintain the Code of Conduct (CoC) document and iterate as needed.
* All CoC revisions must be approved by the Steering Committee.
* Currently, we use the CNCF CoC. Any addendums or changes based off learnings in the Kubernetes community should be owned by this body of people, the Kubernetes Code of Conduct Committee (CoCC).
* Determine and make transparent how CoC issues and incidents are reported and handled in the community.
* Enforce the Code of Conduct within the Kubernetes community.
* Discuss with parties affected
* Come up with resolution
* For moderation on a day-to-day basis, work with the existing community-management subproject under SIG Contributor Experience
* Figure out how to keep, and then keep, a record of CoC reports and outcomes for precedent. _This is something we are still actively discussing how to do and is not fully hardened._
## Formation of the Code of Conduct Committee (CoCC):
* The CoCC consists of 5 members. In the first election, the 3 people receiving the most votes will be appointed to 2-year terms and the other 2 members will be appointed to 1-year terms.
* CoCC members appointed for a 1 year term may be elected again the following year.
* The Steering Committee and SIG Chairs are eligible to nominate people for the Code of Conduct Committee (CoCC). _This may change during the next election._
* The Steering Committee votes on nominees for the CoCC
* Characteristics and Guidance for nominating people for the CoCC:
* Do not have to be part of the Kubernetes or CNCF community
* Previous experience on an Ethics Committee or Code of Conduct Committee is appreciated
* Has demonstrated integrity, professionalism, and positive influence within the community
* Experience with the tools which we use to communicate (Zoom, Slack, GitHub, etc.) within the Kubernetes community is appreciated
* Is generally a responsible human
* The members of the Code of Conduct Committee (CoCC) will be public so the people who report CoC issues know exactly who they are working with.
## Upon appointment, the CoCC will:
* Review the current Code of Conduct and propose revisions or addendums if necessary.
* There is already work being done around making the CoC more enforceable, so this work will be carried forward in the CoCC.
* Add an addendum to the CoC with instructions on how to contact the CoCC
* Be briefed on past and current code of conduct issues within the Kubernetes community by people who have been dealing with these issues thus far
* Revisit how code of conduct issues are reported and escalated and ensure these processes are visible and known to the community
* Determine how they envision their process going forward. The Steering Committee will make the recommendation to the Code of Conduct Committee that they have a standing hour long meeting every other week for the first twelve months after they are appointed, but the CoCC can figure out what works for them.
* Propose a method for electing/appointing future members of the committee to the Steering Committee if they choose.


@ -1,44 +1,71 @@
# SIG Governance Template
# SIG Charter Guide
## Goals
All Kubernetes SIGs must define a charter defining the scope and governance of the SIG.
The following documents outline recommendations and requirements for SIG governance structure and provide
template documents for SIGs to adapt. The goals are to define the baseline needs for SIGs to self govern
and organize in a way that addresses the needs of the core Kubernetes project.
- The scope must define what areas the SIG is responsible for directing and maintaining.
- The governance must outline the responsibilities within the SIG as well as the roles
owning those responsibilities.
The documents are focused on:
## Steps to create a SIG charter
- Outlining organizational responsibilities
- Outlining organizational roles
- Outlining processes and tools
1. Copy [the template][Short Template] into a new file under community/sig-*YOURSIG*/charter.md ([sig-architecture example])
2. Read the [Recommendations and requirements] so you have context for the template
3. Fill out the template for your SIG
4. Update [sigs.yaml] with the individuals holding the roles as defined in the template.
5. Add subprojects owned by your SIG in the [sigs.yaml] (a hypothetical example entry is sketched after these steps).
6. Create a pull request with a draft of your charter.md and sigs.yaml changes. Communicate it within your SIG
and get feedback as needed.
7. Send the SIG Charter out for review to steering@kubernetes.io. Include the subject "SIG Charter Proposal: YOURSIG"
and a link to the PR in the body.
8. Typically expect feedback within a week of sending your draft. Expect a longer wait if it falls over an
event such as KubeCon/CloudNativeCon or holidays. Make any necessary changes.
9. Once accepted, the steering committee will ratify the PR by merging it.
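For the subprojects step, the addition to [sigs.yaml] is a small YAML fragment. The sketch below is hypothetical; the names are placeholders, so mirror the structure already present in [sigs.yaml] rather than copying it verbatim:
```
subprojects:
  - name: example-subproject            # placeholder subproject name
    owners:
      # one OWNERS file URL per repo or directory the subproject owns
      - https://raw.githubusercontent.com/kubernetes-sigs/example-subproject/master/OWNERS
```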
Specific attention has been given to:
## Steps to update an existing SIG charter
- The role of technical leadership
- The role of operational leadership
- Process for agreeing upon technical decisions
- Process for ensuring technical assets remain healthy
- For significant changes, or any changes that could impact other SIGs, such as the scope, create a
PR and send it to the steering committee for review with the subject: "SIG Charter Update: YOURSIG"
- For minor updates that only impact issues or areas within the scope of the SIG, the SIG Chairs should
facilitate the change.
## SIG Charter approval process
When introducing a SIG charter or modification of a charter the following process should be used.
As part of this we will define roles for the [OARP] process (Owners, Approvers, Reviewers, Participants)
- Identify a small set of Owners from the SIG to drive the changes.
Most typically this will be the SIG chairs.
- Work with the rest of the SIG in question (Reviewers) to craft the changes.
Make sure to keep the SIG in the loop as discussions progress with the Steering Committee (next step).
Including the SIG mailing list in communications with the steering committee would work for this.
- Work with the steering committee (Approvers) to gain approval.
This can simply be submitting a PR and sending mail to [steering@kubernetes.io].
If more substantial changes are desired it is advisable to socialize those before drafting a PR.
- The steering committee will be looking to ensure the scope of the SIG as represented in the charter is reasonable (and within the scope of Kubernetes) and that processes are fair.
- For large changes alert the rest of the Kubernetes community (Participants) as the scope of the changes becomes clear.
Sending mail to [kubernetes-dev@googlegroups.com] and/or announcing at the community meeting are good ways to do this.
If there are questions about this process please reach out to the steering committee at [steering@kubernetes.io].
## How to use the templates
When developing or modifying a SIG governance doc, the intention is for SIGs to use the templates (*under development*)
as a common set of options SIGs may choose to incorporate into their own governance structure. It is recommended that
SIGs start by looking at the [Recommendations and requirements] for SIG governance docs and consider what structure
they think will work best for them before pulling items from the templates.
SIGs should use [the template][Short Template] as a starting point. This document links to the recommended [SIG Governance][sig-governance] but SIGs may optionally record deviations from these defaults in their charter.
The expectation is that SIGs will pull and adapt the options in the templates to best meet the needs of the both the SIG
and project.
- [Recommendations and requirements]
## Goals
## Templates
- [Short Template]
The primary goal of the charters is to define the scope of the SIG within Kubernetes and how the SIG leaders exercise ownership of these areas by taking care of their responsibilities. A majority of the effort should be spent on these concerns.
## FAQ
See [frequently asked questions]
[OARP]: https://stumblingabout.com/tag/oarp/
[Recommendations and requirements]: sig-governance-requirements.md
[Short Template]: sig-governance-template-short.md
[sig-governance]: sig-governance.md
[Short Template]: sig-charter-template.md
[frequently asked questions]: FAQ.md
[sigs.yaml]: https://github.com/kubernetes/community/blob/master/sigs.yaml
[sig-architecture example]: ../../sig-architecture/charter.md
[steering@kubernetes.io]: mailto:steering@kubernetes.io
[kubernetes-dev@googlegroups.com]: mailto:kubernetes-dev@googlegroups.com


@ -0,0 +1,64 @@
# SIG YOURSIG Charter
This charter adheres to the conventions described in the [Kubernetes Charter README] and uses
the Roles and Organization Management outlined in [sig-governance].
## Scope
Include a 2-3 sentence summary of what work SIG TODO does. Imagine trying to
explain your work to a colleague who is familiar with Kubernetes but not
necessarily all of the internals.
### In scope
#### Code, Binaries and Services
- list of what qualifies a piece of code, binary or service
- as falling into the scope of this SIG
- e.g. *clis for working with Kubernetes APIs*,
- *CI for kubernetes repos*, etc
- **This is NOT** a list of specific code locations,
- or projects those go in [SIG Subprojects][sig-subprojects]
#### Cross-cutting and Externally Facing Processes
- list of the non-internal processes
- that are owned by this SIG
- e.g. qualifying and cutting a Kubernetes release,
- organizing mentorship programs, etc
### Out of scope
Outline of things that could be confused as falling into this SIG but don't or don't right now.
## Roles and Organization Management
This SIG adheres to the Roles and Organization Management outlined in [sig-governance]
and opts-in to updates and modifications to [sig-governance].
### Additional responsibilities of Chairs
- list of any additional responsibilities
- of Chairs
### Additional responsibilities of Tech Leads
- list of any additional responsibilities
- of Tech Leads
### Deviations from [sig-governance]
- list of other ways this SIG's roles and governance differ from
- the outline
- **If the SIG doesn't have either Chairs or Tech Leads specify that here.**
### Subproject Creation
Pick one:
1. SIG Technical Leads
2. Federation of Subprojects
[sig-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/sig-governance.md
[sig-subprojects]: https://github.com/kubernetes/community/blob/master/sig-YOURSIG/README.md#subprojects
[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md


@ -64,7 +64,7 @@ All technical assets *MUST* be owned by exactly 1 SIG subproject. The following
- *SHOULD* define a level of commitment for decisions that have gone through the formal process
(e.g. when is a decision revisited or reversed)
- *MUST* How technical assets of project remain healthy and can be released
- *MUST* define how technical assets of project remain healthy and can be released
- Publicly published signals used to determine if code is in a healthy and releasable state
- Commitment and process to *only* release when signals say code is releasable
- Commitment and process to ensure assets are in a releasable state for milestones / releases
@ -72,7 +72,7 @@ All technical assets *MUST* be owned by exactly 1 SIG subproject. The following
- *SHOULD* define target metrics for health signal (e.g. broken tests fixed within N days)
- *SHOULD* define process for meeting target metrics (e.g. all tests run as presubmit, build cop, etc)
[lazy-consensus]: http://communitymgt.wikia.com/wiki/Lazy_consensus
[lazy-consensus]: http://en.osswiki.info/concepts/lazy_consensus
[super-majority]: https://en.wikipedia.org/wiki/Supermajority#Two-thirds_vote
[warnocks-dilemma]: http://communitymgt.wikia.com/wiki/Warnock%27s_Dilemma
[slo]: https://en.wikipedia.org/wiki/Service_level_objective


@ -1,122 +0,0 @@
# SIG Governance Template (Short Version)
## Roles
Membership for roles tracked in: <link to OWNERS file>
- Chair
- Run operations and processes governing the SIG
- Seed members established at SIG founding
- Chairs *MAY* decide to step down at anytime and propose a replacement. Use lazy consensus amongst
chairs with fallback on majority vote to accept proposal. This *SHOULD* be supported by a majority of
SIG Members.
- Chairs *MAY* select additional chairs through a [super-majority] vote amongst chairs. This
*SHOULD* be supported by a majority of SIG Members.
- Chairs *MUST* remain active in the role and are automatically removed from the position if they are
unresponsive for > 3 months and *MAY* be removed if not proactively working with other chairs to fulfill
responsibilities.
- Number: 2-3
- Defined in [sigs.yaml]
- *Optional Role*: SIG Technical Leads
- Establish new subprojects
- Decommission existing subprojects
- Resolve X-Subproject technical issues and decisions
- Subproject Owners
- Scoped to a subproject defined in [sigs.yaml]
- Seed members established at subproject founding
- *MUST* be an escalation point for technical discussions and decisions in the subproject
- *MUST* set milestone priorities or delegate this responsibility
- *MUST* remain active in the role and are automatically removed from the position if they are unresponsive
for > 3 months.
- *MAY* be removed if not proactively working with other Subproject Owners to fulfill responsibilities.
- *MAY* decide to step down at anytime and propose a replacement. Use [lazy-consensus] amongst subproject owners
with fallback on majority vote to accept proposal. This *SHOULD* be supported by a majority of subproject
contributors (those having some role in the subproject).
- *MAY* select additional subproject owners through a [super-majority] vote amongst subproject owners. This
*SHOULD* be supported by a majority of subproject contributors (through [lazy-consensus] with fallback on voting).
- Number: 3-5
- Defined in [sigs.yaml] [OWNERS] files
- Members
- *MUST* maintain health of at least one subproject or the health of the SIG
- *MUST* show sustained contributions to at least one subproject or to the SIG
- *SHOULD* hold some documented role or responsibility in the SIG and / or at least one subproject
(e.g. reviewer, approver, etc)
- *MAY* build new functionality for subprojects
- *MAY* participate in decision making for the subprojects they hold roles in
- Includes all reviewers and approvers in [OWNERS] files for subprojects
## Organizational management
- SIG meets bi-weekly on zoom with agenda in meeting notes
- *SHOULD* be facilitated by chairs unless delegated to specific Members
- SIG overview and deep-dive sessions organized for Kubecon
- *SHOULD* be organized by chairs unless delegated to specific Members
- Contributing instructions defined in the SIG CONTRIBUTING.md
### Project management
#### Subproject creation
---
Option 1: by SIG Technical Leads
- Subprojects may be created by [KEP] proposal and accepted by [lazy-consensus] with fallback on majority vote of
SIG Technical Leads. The result *SHOULD* be supported by the majority of SIG members.
- KEP *MUST* establish subproject owners
- [sigs.yaml] *MUST* be updated to include subproject information and [OWNERS] files with subproject owners
- Where subprojects processes differ from the SIG governance, they must document how
- e.g. if subprojects release separately - they must document how release and planning is performed
Option 2: by federation of subprojects
- Subprojects may be created by [KEP] proposal and accepted by [lazy-consensus] with fallback on majority vote of
subproject owners in the SIG. The result *SHOULD* be supported by the majority of members.
- KEP *MUST* establish subproject owners
- [sigs.yaml] *MUST* be updated to include subproject information and [OWNERS] files with subproject owners
- Where subprojects processes differ from the SIG governance, they must document how
- e.g. if subprojects release separately - they must document how release and planning is performed
---
- Subprojects must define how releases are performed and milestones are set. Example:
> - Release milestones
> - Follows the kubernetes/kubernetes release milestones and schedule
> - Priorities for upcoming release are discussed during the SIG meeting following the preceding release and
> shared through a PR. Priorities are finalized before feature freeze.
> - Code and artifacts are published as part of the kubernetes/kubernetes release
### Technical processes
Subprojects of the SIG *MUST* use the following processes unless explicitly following alternatives
they have defined.
- Proposing and making decisions
- Proposals sent as [KEP] PRs and published to googlegroup as announcement
- Follow [KEP] decision making process
- Test health
- Canonical health of code published to <link to dashboard>
- Consistently broken tests automatically send an alert to <link to google group>
- SIG members are responsible for responding to broken tests alert. PRs that break tests should be rolled back
if not fixed within 24 hours (business hours).
- Test dashboard checked and reviewed at start of each SIG meeting. Owners assigned for any broken tests.
and followed up during the next SIG meeting.
Issues impacting multiple subprojects in the SIG should be resolved by either:
- Option 1: SIG Technical Leads
- Option 2: Federation of Subproject Owners
[lazy-consensus]: http://communitymgt.wikia.com/wiki/Lazy_consensus
[super-majority]: https://en.wikipedia.org/wiki/Supermajority#Two-thirds_vote
[KEP]: https://github.com/kubernetes/community/blob/master/keps/0000-kep-template.md
[sigs.yaml]: https://github.com/kubernetes/community/blob/master/sigs.yaml#L1454
[OWNERS]: contributors/devel/owners.md


@ -0,0 +1,165 @@
# SIG Roles and Organizational Governance
This charter adheres to the conventions described in the [Kubernetes Charter README].
This document will be updated as needed to meet the current needs of the Kubernetes project.
## Roles
### Notes on Roles
Unless otherwise stated, individuals are expected to be responsive and active within
their roles. Within this section "member" refers to a member of a Chair, Tech Lead or
Subproject Owner Role (this is different from a SIG or Organization Member).
- Initial members are defined at the founding of the SIG or Subproject as part of the acceptance
of that SIG or Subproject.
- Members *SHOULD* remain active and responsive in their Roles.
- Members taking an extended leave of 1 or more months *SHOULD*
coordinate with other members to ensure the
role is adequately staffed during the leave.
- Members going on leave for 1-3 months *MAY* work with other
members to identify a temporary replacement.
- Members of a role *SHOULD* remove any other members that have not communicated a
leave of absence and either cannot be reached for more than 1 month or are not
fulfilling their documented responsibilities for more than 1 month.
This may be done through a [super-majority] vote of members, or if there are not
enough *active* members to get a super-majority of votes cast, then removal may occur
through a [super-majority] vote between Chairs, Tech Leads and Subproject Owners.
- Membership disagreements may be escalated to the SIG Chairs. SIG Chair membership
disagreements may be escalated to the Steering Committee.
- Members *MAY* decide to step down at anytime and propose a replacement. Use lazy consensus amongst
other members with fallback on majority vote to accept proposal. The candidate *SHOULD* be supported by a
majority of SIG Members or Subproject Contributors (as applicable).
- Members *MAY* select additional members through a [super-majority] vote amongst members. This
*SHOULD* be supported by a majority of SIG Members or Subproject Contributors (as applicable).
### Chair
- Chair
- Run operations and processes governing the SIG
- Number: 2-3
- Membership tracked in [sigs.yaml]
### Tech Lead
- *Optional Role*: SIG Technical Leads
- Establish new subprojects
- Decommission existing subprojects
- Resolve X-Subproject technical issues and decisions
- Number: 2-3
- Membership tracked in [sigs.yaml]
### Subproject Owner
- Subproject Owners
- Scoped to a subproject defined in [sigs.yaml]
- Seed members established at subproject founding
- *SHOULD* be an escalation point for technical discussions and decisions in the subproject
- *SHOULD* set milestone priorities or delegate this responsibility
- Number: 2-3
- Membership tracked in [sigs.yaml]
### Member
- Members
- *SHOULD* maintain health of at least one subproject or the health of the SIG
- *SHOULD* show sustained contributions to at least one subproject or to the SIG
- *SHOULD* hold some documented role or responsibility in the SIG and / or at least one subproject
(e.g. reviewer, approver, etc)
- *MAY* build new functionality for subprojects
- *MAY* participate in decision making for the subprojects they hold roles in
- Includes all reviewers and approvers in [OWNERS] files for subprojects
### Security Contact
- Security Contact
- *MUST* be a contact point for the Product Security Team to reach out to for
triaging and handling of incoming issues
- *MUST* accept the [Embargo Policy]
- Defined in `SECURITY_CONTACTS` files, this is only relevant to the root file in
the repository. Template [SECURITY_CONTACTS]
## Organizational Management
- SIG meets bi-weekly on zoom with agenda in meeting notes
- *SHOULD* be facilitated by chairs unless delegated to specific Members
- SIG overview and deep-dive sessions organized for KubeCon/CloudNativeCon
- *SHOULD* be organized by chairs unless delegated to specific Members
- SIG updates to Kubernetes community meeting on a regular basis
- *SHOULD* be presented by chairs unless delegated to specific Members
- Contributing instructions defined in the SIG CONTRIBUTING.md
### Project Management
#### Subproject Creation
---
Option 1: by SIG Technical Leads
- Subprojects may be created by [KEP] proposal and accepted by [lazy-consensus] with fallback on majority vote of
SIG Technical Leads. The result *SHOULD* be supported by the majority of SIG members.
- KEP *MUST* establish subproject owners
- [sigs.yaml] *MUST* be updated to include subproject information and [OWNERS] files with subproject owners
- Where subprojects processes differ from the SIG governance, they must document how
- e.g. if subprojects release separately - they must document how release and planning is performed
Option 2: by Federation of Subprojects
- Subprojects may be created by [KEP] proposal and accepted by [lazy-consensus] with fallback on majority vote of
subproject owners in the SIG. The result *SHOULD* be supported by the majority of members.
- KEP *MUST* establish subproject owners
- [sigs.yaml] *MUST* be updated to include subproject information and [OWNERS] files with subproject owners
- Where subprojects processes differ from the SIG governance, they must document how
- e.g. if subprojects release separately - they must document how release and planning is performed
Subprojects may create repos under *github.com/kubernetes-sigs* through [lazy-consensus] of subproject owners.
---
- Subprojects must define how releases are performed and milestones are set. Example:
> - Release milestones
> - Follows the kubernetes/kubernetes release milestones and schedule
> - Priorities for upcoming release are discussed during the SIG meeting following the preceding release and
> shared through a PR. Priorities are finalized before feature freeze.
> - Code and artifacts are published as part of the kubernetes/kubernetes release
### Technical processes
Subprojects of the SIG *MUST* use the following processes unless explicitly following alternatives
they have defined.
- Proposing and making decisions
- Proposals sent as [KEP] PRs and published to googlegroup as announcement
- Follow [KEP] decision making process
- Test health
- Canonical health of code published to <link to dashboard>
- Consistently broken tests automatically send an alert to <link to google group>
- SIG members are responsible for responding to broken tests alert. PRs that break tests should be rolled back
if not fixed within 24 hours (business hours).
- Test dashboard checked and reviewed at the start of each SIG meeting. Owners assigned for any broken tests
and followed up on during the next SIG meeting.
Issues impacting multiple subprojects in the SIG should be resolved by either:
- Option 1: SIG Technical Leads
- Option 2: Federation of Subproject Owners
### SIG Retirement
- In the event that the SIG is unable to regularly establish consistent quorum
or otherwise fulfill its Organizational Management responsibilities
- after 3 or more months it *SHOULD* be retired
- after 6 or more months it *MUST* be retired
[lazy-consensus]: http://en.osswiki.info/concepts/lazy_consensus
[super-majority]: https://en.wikipedia.org/wiki/Supermajority#Two-thirds_vote
[KEP]: https://github.com/kubernetes/community/blob/master/keps/0000-kep-template.md
[sigs.yaml]: https://github.com/kubernetes/community/blob/master/sigs.yaml#L1454
[OWNERS]: contributors/devel/owners.md
[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md
[Embargo Policy]: https://github.com/kubernetes/sig-release/blob/master/security-release-process-documentation/security-release-process.md#embargo-policy
[SECURITY_CONTACTS]: https://github.com/kubernetes/kubernetes-template-project/blob/master/SECURITY_CONTACTS

View File

@ -0,0 +1,121 @@
# Kubernetes Working Group Formation and Disbandment
## Process Overview and Motivations
Working Groups provide a formal avenue for disparate groups to collaborate around a common problem, craft a balanced
position, and disband. Because they represent the interests of multiple groups, they are a vehicle for consensus
building. If code is developed as part of collaboration within the Working Group, that code will be housed in an
appropriate repository as described in the [repositories document]. The merging of this code into the repository
will be governed by the standard policies regarding submitting code to that repository (e.g. developed within one or
more Subprojects owned by SIGs).
Because a working group is an official part of the Kubernetes project it is subject to steering committee oversight
over its formation and disbanding.
## Goals of the process
- An easy-to-navigate process for those wishing to establish and eventually disband a new Working Group
- Simple guidance and differentiation on where a Working Group makes sense, and does not
- Clear understanding that no authority is vested in a Working Group
- Ensure all future Working Groups conform with this process
## Non-goals of the process
- Documenting other governance bodies such as sub-projects or SIGs
- Changing the status of existing Working Groups/SIGs/Sub-projects
## Working Group Relationship To SIGs
Assets owned by the Kubernetes project (e.g. code, docs, blogs, processes, etc) are owned and
managed by [SIGs](sig-governance.md). The exception to this is specific assets that may be owned
by Working Groups, as outlined below.
Working Groups provide structure for governance and communication channels, and as such may
own the following types of assets:
- Calendar Events
- Slack Channels
- Discussion Forum Groups
Working Groups are distinct from SIGs in that they are intended to:
- facilitate collaboration across SIGs
- facilitate an exploration of a problem / solution through a group with minimal governmental overhead
Working Groups will typically have stakeholders whose participation is in the
context of one or more SIGs. These SIGs should be documented as stakeholders of the Working Group
(see Creation Process).
## Is it a Working Group? Yes, if...
- It does not own any code
- It has a clear goal measured through a specific deliverable or deliverables
- It is temporary in nature and will be disbanded after reaching its stated goal(s)
## Creation Process Description
Working Group formation is less tightly-controlled than SIG formation since they:
- Do not own code
- Have clear entry and exit criteria
- Do not have any organizational authority, only influence
Therefore, Working Group formation begins with the organizers asking themselves some important questions that
should eventually be reflected in a pull request on sigs.yaml:
1. What is the exact problem this group is trying to solve?
1. What is the artifact that this group will deliver, and to whom?
1. How does the group know when the problem solving process is completed, and it is time for the Working Group to
dissolve?
1. Who are all of the stakeholders involved in the problem this group is trying to solve (SIGs, steering committee,
other Working Groups)?
1. What are the meeting mechanics (frequency, duration, roles)?
1. Does the goal of the Working Group represent the needs of the project as a whole, or is it focused on the interests
of a narrow set of contributors or companies?
1. Who will chair the group, and ensure it continues to meet these requirements?
1. Is diversity well-represented in the Working Group?
Once the above questions have been answered, the pull request against sigs.yaml can be created. Once the generator
is run, this will in turn create the OWNERS_ALIASES file, readme files, and the main SIGs list. The minimum
requirements for that are as follows (a hypothetical sketch of such an entry appears after the list):
- name
- directory
- mission statement
- chair information
- meeting information
- contact methods
- any [sig](sig-governance.md) stakeholders
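As an illustration of those minimum requirements, a hypothetical sigs.yaml entry for a new Working Group might look roughly like the sketch below; every name, time, and field spelling here is a placeholder and should be checked against the schema actually used in sigs.yaml:
```
- name: Example
  dir: wg-example
  mission_statement: >
    Explore a cross-SIG approach to an example problem, deliver a written
    recommendation to the stakeholder SIGs, and then disband.
  chairs:
    - name: Jane Doe                    # placeholder chair
      github: janedoe
  meetings:
    - description: Regular WG Example Meeting
      day: Wednesday
      time: "10:00"
      tz: PT
      frequency: biweekly
  contact:
    slack: wg-example
    mailing_list: https://groups.google.com/forum/#!forum/kubernetes-wg-example
  stakeholder_sigs:                     # SIG stakeholders, per the list above
    - Architecture
    - Contributor Experience
```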
The pull request should be labeled with any SIG stakeholders and committee/steering. And since GitHub notifications
are not a reliable means to contact people, an email should be sent to the mailing lists for the stakeholder SIGs,
and the steering committee with a link to the PR. A member of the community admin team will place a /hold on it
until it has an LGTM from at least one chair from each of the stakeholder SIGs, and a simple majority of the steering
committee.
Once merged, the Working Group is officially chartered until it either completes its stated goal, or disbands
voluntarily (e.g. due to new facts, member attrition, change in direction, etc). Working groups should strive to
make regular reports to stakeholder SIGs in order to ensure the mission is still aligned with the current state.
Deliverables (documents, diagrams) for the group should be stored in the directory created for the Working Group.
If the deliverable is a KEP, it would be helpful to link it in the closed formation/charter PR for future reference.
## Disbandment Process Description
Working Groups will be disbanded if either of the following is true:
- There is no longer a Chair
  - (with a 4 week grace period)
- None of the communication channels for the Working Group have been in use for the goals outlined at the founding of
the Working Group
  - (with a 3 month grace period)
  - Slack
  - Email Discussion Forum
  - Zoom
The current Chair may step down at any time. When they do so, a new Chair may be selected through lazy consensus
within the Working Group, and [sigs.yaml](/sigs.yaml) should be updated.
References
- [1] https://github.com/kubernetes/community/pull/1994
- [2] https://groups.google.com/a/kubernetes.io/d/msg/steering/zEY93Swa_Ss/C0ziwjkGCQAJ
[repositories document]: https://github.com/kubernetes/community/blob/master/github-management/kubernetes-repositories.md


@ -40,4 +40,4 @@ Collaboration should simplify things for everyone, but with privilege comes resp
Your community managers are happy to help with any questions that you may have and will do their best to help if anything goes wrong. Please get in touch via [SIG Contributor Experience](https://git.kubernetes.io/community/sig-contributor-experience).
- Check the [centralized list of administrators](./moderators.md) for contact information.


@ -10,7 +10,7 @@ The Kubernetes community abides by the [CNCF code of conduct]. Here is an excer
## SIGs
Kubernetes encompasses many projects, organized into [SIGs](sig-list.md).
Kubernetes encompasses many projects, organized into [SIGs](/sig-list.md).
Some communication has moved into SIG-specific channels - see
a given SIG subdirectory for details.
@ -22,13 +22,25 @@ and meetings devoted to Kubernetes.
* [Twitter]
* [Blog]
* Pose questions and help answer them on [Stack Overflow].
* [Slack] - sign up
Real time discussion at kubernetes.slack.io:
Discussions on most channels are archived at [kubernetes.slackarchive.io].
Start archiving by inviting the _slackarchive_ bot to a
channel via `/invite @slackarchive`.
To add new channels, contact one of the admins in the #slack-admins channel. Our guidelines are [here](/communication/slack-guidelines.md).
## Slack
[Join Slack] - sign up and join channels on topics that interest you, but please read our [Slack Guidelines] before participating.
If you want to add a new channel, contact one of the admins in the #slack-admins channel.
## Mailing lists
Kubernetes mailing lists are hosted through Google Groups. To
receive these lists' emails,
[join](https://support.google.com/groups/answer/1067205) the groups
relevant to you, as you would any other Google Group.
* [kubernetes-announce] broadcasts major project announcements such as releases and security issues
* [kubernetes-dev] hosts development announcements and discussions around developing kubernetes itself
* [Discuss Kubernetes] is where kubernetes users trade notes
* Additional Google groups exist and can be joined for discussion related to each SIG and Working Group. These are linked from the [SIG list](/sig-list.md).
## Issues
@ -38,15 +50,6 @@ please start with the [troubleshooting guide].
If that doesn't answer your questions, or if you think you found a bug,
please [file an issue].
## Mailing lists
Development announcements and discussions appear on the Google group
[kubernetes-dev] (send mail to `kubernetes-dev@googlegroups.com`).
Users trade notes on the Google group
[kubernetes-users] (send mail to `kubernetes-users@googlegroups.com`).
## Accessing community documents
In order to foster real time collaboration there are many working documents
@ -63,9 +66,9 @@ Office hours are held once a month. Please refer to [this document](/events/offi
## Weekly Meeting
We have PUBLIC and RECORDED [weekly meeting] every Thursday at 10am US Pacific Time over Zoom.
We have a public and recorded [weekly meeting] every Thursday at 10am US Pacific Time over Zoom.
Map that to your local time with this [timezone table].
Convert it to your local time using the [timezone table].
See it on the web at [calendar.google.com], or paste this [iCal url] into any iCal client.
@ -78,10 +81,10 @@ please propose a specific date on the [Kubernetes Community Meeting Agenda].
## Conferences
Kubernetes is the main focus of CloudNativeCon/KubeCon, held every spring in Europe and winter in North America. Information about these and other community events is available on the CNCF [events] pages.
Kubernetes is the main focus of KubeCon + CloudNativeCon, held every spring in Europe, summer in China, and winter in North America. Information about these and other community events is available on the CNCF [events] pages.
[Blog]: http://blog.kubernetes.io
[Blog]: https://kubernetes.io/blog/
[calendar.google.com]: https://calendar.google.com/calendar/embed?src=cgnt364vd8s86hr2phapfjc6uk%40group.calendar.google.com&ctz=America/Los_Angeles
[CNCF code of conduct]: https://github.com/cncf/foundation/blob/master/code-of-conduct.md
[communication]: /communication.md
@ -92,15 +95,16 @@ Kubernetes is the main focus of CloudNativeCon/KubeCon, held every spring in Eur
[iCal url]: https://calendar.google.com/calendar/ical/cgnt364vd8s86hr2phapfjc6uk%40group.calendar.google.com/public/basic.ics
[Kubernetes Community Meeting Agenda]: https://docs.google.com/document/d/1VQDIAB0OqiSjIHI8AWMvSdceWhnz56jNpZrLs6o7NJY/edit#
[kubernetes-community-video-chat]: https://groups.google.com/forum/#!forum/kubernetes-community-video-chat
[kubernetes-announce]: https://groups.google.com/forum/#!forum/kubernetes-announce
[kubernetes-dev]: https://groups.google.com/forum/#!forum/kubernetes-dev
[kubernetes-users]: https://groups.google.com/forum/#!forum/kubernetes-users
[kubernetes.slackarchive.io]: https://kubernetes.slackarchive.io
[Discuss Kubernetes]: https://discuss.kubernetes.io
[kubernetes.slack.com]: https://kubernetes.slack.com
[Slack]: http://slack.k8s.io
[Join Slack]: http://slack.k8s.io
[Slack Guidelines]: /communication/slack-guidelines.md
[Special Interest Group]: /README.md#SIGs
[Stack Overflow]: http://stackoverflow.com/questions/tagged/kubernetes
[Stack Overflow]: https://stackoverflow.com/questions/tagged/kubernetes
[timezone table]: https://www.google.com/search?q=1000+am+in+pst
[troubleshooting guide]: http://kubernetes.io/docs/troubleshooting
[troubleshooting guide]: https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/
[Twitter]: https://twitter.com/kubernetesio
[weekly meeting]: https://zoom.us/my/kubernetescommunity

File diff suppressed because it is too large.


@ -0,0 +1,81 @@
# Moderation on Kubernetes Communications Channels
This page describes the rules and best practices for people chosen to moderate Kubernetes communications channels.
This includes Slack, the mailing lists, and _any communication tool_ used in an official manner by the project.
- Check the [centralized list of administrators](./moderators.md) for contact information.
## Roles and Responsibilities
As part of volunteering to become a moderator, you are now a representative of the Kubernetes community, and it is your responsibility to remain aware of your contributions in this space.
These responsibilities apply to all Kubernetes official channels.
Moderators _MUST_:
- Take action as specified by these Kubernetes Moderator Guidelines.
- You are empowered to take _immediate action_ when there is a violation. You do not need to wait for review or approval if an egregious violation has occurred. Make a judgement call based on our Code of Conduct and Values (see below).
- Removing a bad actor or content from the medium is preferable to letting it sit there.
- Abide by the documented tasks and actions required of moderators.
- Ensure that the [CNCF Code of Conduct](https://github.com/cncf/foundation/blob/master/code-of-conduct.md) is in effect on all official Kubernetes communication channels.
- Become familiar with the [Kubernetes Community Values](https://github.com/kubernetes/steering/blob/master/values.md).
- Take care of spam as soon as possible, which may mean taking action by removing a member from that resource.
- Foster a safe and productive environment by being aware of potential multiple cultural differences between Kubernetes community members.
- Understand that you might be contacted by moderators, community managers, and other users via private email or a direct message.
- Report violations of the Code of Conduct to <conduct@kubernetes.io>.
Moderators _SHOULD_:
- Exercise compassion and empathy when communicating and collaborating with other community members.
- Understand the difference between a user abusing the resource or just having difficulty expressing comments and questions in English.
- Be an example and role model to others in the community.
- Remember to check and recognize if you need to take a break when you become frustrated or find yourself in a heated debate.
- Help your colleagues if you recognize them in one of the [stages of burnout](https://opensource.com/business/15/12/avoid-burnout-live-happy).
- Be helpful and have fun!
## Violations
The Kubernetes [Code of Conduct Committee](https://git.k8s.io/community/committee-code-of-conduct) will have the final authority regarding escalated moderation matters. Violations of the Code of Conduct will be handled on a case by case basis. Depending on severity this can range up to and including removal of the person from the community, though this is extremely rare.
## Specific Guidelines
These guidelines are for tool-specific policies that don't fit under a general umbrella.
### Mailing Lists
### Moderating a SIG/WG list
- SIG and Working Group mailing lists should have parispittman@google.com and jorge@heptio.com as co-owners so that administrative functions can be managed centrally across the project.
- Moderation of the SIG/WG lists is up to that individual SIG/WG; these admins are there to help facilitate leadership changes, reset lost passwords, etc.
- Users who violate the Code of Conduct or engage in other negative activities (like spamming) should be moderated.
  - [Lock the thread immediately](https://support.google.com/groups/answer/2466386?hl=en#) so that people cannot reply to the thread.
  - [Delete the post](https://support.google.com/groups/answer/1046523?hl=en).
  - In some cases you might need to ban a user from the group; follow [these instructions](https://support.google.com/groups/answer/2646833?hl=en&ref_topic=2458761#) on how to stop a member from being able to post to the group.
For more technical help on how to use Google Groups, check the [Groups Help](https://support.google.com/groups/answer/2466386?hl=en&ref_topic=2458761) page.
### New users posting to a SIG/WG list
New members who post to a group will automatically have their messages put in a moderation queue and will be sent the following message: "Since you're a new subscriber you're in a moderation queue, sorry for the inconvenience, a moderator will check your message shortly."
Moderators will receive emails when messages are in this queue and will process them accordingly.
### Slack
- [Slack Guidelines](./slack-guidelines.md)
### Zoom
- [Zoom Guidelines](./zoom-guidelines.md)
### References and Resources
Thanks to the following projects for making their moderation guidelines public, allowing us to build on the shoulders of giants.
Moderators are encouraged to learn how other projects moderate and learn from them in order to improve our guidelines:
- Mozilla's [Forum Moderation Guidelines](https://support.mozilla.org/en-US/kb/moderation-guidelines)
- OASIS [How to Moderate a Mailing List](https://www.oasis-open.org/khelp/kmlm/user_help/html/mailing_list_moderation.html)
- Community Spark's [How to effectively moderate forums](http://www.communityspark.com/how-to-effectively-moderate-forums/)
- [5 tips for more effective community moderation](https://www.socialmediatoday.com/social-business/5-tips-more-effective-community-moderation)
- [8 Helpful Moderation Tips for Community Managers](https://sproutsocial.com/insights/tips-community-managers/)
- [Setting Up Community Guidelines for Moderation](https://www.getopensocial.com/blog/community-management/setting-community-guidelines-moderation)


@ -0,0 +1,62 @@
# Community Moderators
The following people are responsible for moderating/administrating Kubernetes communication channels; each is listed with their home time zone.
See our [moderation guidelines](./moderating.md) for policies and recommendations.
## Mailing Lists
### kubernetes-dev
### Administrators
- Sarah Novotny (@sarahnovotny) - PT
- Brian Grant (@bgrant0607) - PT
### Moderators
- Paris Pittman (@parispittman) - PT
- Jorge Castro (@castrojo) - ET
- Jaice Singer DuMars (@jdumars) - PT
- Louis Taylor (@kragniz) - CET
- Nikhita Raghunath (@nikhita) - IT
## GitHub
- [GitHub Administration Team](https://github.com/kubernetes/community/tree/master/github-management#github-administration-team)
## discuss.kubernetes.io
### Administrators
- Paris Pittman (@parispittman) - PT
- Jorge Castro (@castrojo) - ET
- Bob Killen (@mrbobbytables) - ET
- Jeffrey Sica (@jeefy) - ET
### Additional Moderators
- Ihor Dvoretskyi (@idvoretskyi) - CET
## YouTube Channel
- Paris Pittman (@parispittman) - PT
- Sarah Novotny (@sarahnovotny) - PT
- Bob Hrdinsky - PT
- Ihor Dvoretskyi (@idvoretskyi) - CET
- Jeffrey Sica (@jeefy) - ET
- Jorge Castro (@castrojo) - ET
- Joe Beda (@joebeda) - PT
- Jaice Singer DuMars (@jdumars) - PT
## Slack
- Chris Aniszczyk (@caniszczyk) - CT
- Ihor Dvoretskyi (@idvoretskyi) - CET
- Jaice Singer DuMars (@jdumars) - PT
- Jorge Castro (@castrojo) - ET
- Paris Pittman (@parispittman) - PT
## Zoom
- Paris Pittman (@parispittman) - PT
- Jorge Castro (@castrojo) - ET

View File

@ -0,0 +1,77 @@
# Kubernetes Resources
> A collection of resources organized by medium (e.g. audio, text, video)
## Table of Contents
<!-- vim-markdown-toc GFM -->
- [Contributions](#contributions)
- [Resources](#resources)
- [Audio](#audio)
- [Text](#text)
- [Video](#video)
- [Learning Resources](#learning-resources)
<!-- vim-markdown-toc -->
## Contributions
If you would like to contribute to this list, please submit a PR and add `/sig contributor-experience` and `/assign @petermbenjamin`.
The criteria for contributions are simple:
- The resource must be related to Kubernetes.
- The resource must be free.
- Avoid undifferentiated search links (e.g. `https://example.com/search?q=kubernetes`), unless you can ensure the most relevant results (e.g. `https://example.com/search?q=kubernetes&category=technology`)
## Resources
### Audio
- [PodCTL](https://twitter.com/PodCTL)
- [Kubernetes Podcast](https://kubernetespodcast.com)
- [The New Stack Podcasts](https://thenewstack.io/podcasts/)
### Text
- [Awesome Kubernetes](https://github.com/ramitsurana/awesome-kubernetes)
- [CNCF Blog](https://www.cncf.io/newsroom/blog/)
- [Dev.To](https://dev.to/t/kubernetes)
- [Heptio Blog](https://blog.heptio.com)
- [KubeTips](http://kubetips.com)
- [KubeWeekly](https://twitter.com/kubeweekly)
- [Kubedex](https://kubedex.com/category/blog/)
- [Kubernetes Blog](https://kubernetes.io/blog/)
- [Kubernetes Enhancements Repo](https://github.com/kubernetes/enhancements)
- [Kubernetes Forum](https://discuss.kubernetes.io)
- [Last Week in Kubernetes Development](http://lwkd.info)
- [Medium](https://medium.com/tag/kubernetes)
- [Reddit](https://www.reddit.com/r/kubernetes)
- [The New Stack: CI/CD With Kubernetes](https://thenewstack.io/ebooks/kubernetes/ci-cd-with-kubernetes/)
- [The New Stack: Kubernetes Deployment & Security Patterns](https://thenewstack.io/ebooks/kubernetes/kubernetes-deployment-and-security-patterns/)
- [The New Stack: Kubernetes Solutions Directory](https://thenewstack.io/ebooks/kubernetes/kubernetes-solutions-directory/)
- [The New Stack: State of Kubernetes Ecosystem](https://thenewstack.io/ebooks/kubernetes/state-of-kubernetes-ecosystem/)
- [The New Stack: Use-Cases for Kubernetes](https://thenewstack.io/ebooks/use-cases/use-cases-for-kubernetes/)
- [Weaveworks Blog](https://www.weave.works/blog/category/kubernetes/)
### Video
- [BrightTALK Webinars](https://www.brighttalk.com/search/?q=kubernetes)
- [Ceph YouTube Channel](https://www.youtube.com/channel/UCno-Fry25FJ7B4RycCxOtfw)
- [CNCF YouTube Channel](https://www.youtube.com/channel/UCvqbFHwN-nwalWPjPUKpvTA)
- [Heptio YouTube Channel](https://www.youtube.com/channel/UCjQU5ZI2mHswy7OOsii_URg)
- [Joe Hobot YouTube Channel](https://www.youtube.com/channel/UCdxEoi9hB617EDLEf8NWzkA)
- [Kubernetes YouTube Channel](https://www.youtube.com/channel/UCZ2bu0qutTOM0tHYa_jkIwg)
- [Lachlan Evenson YouTube Channel](https://www.youtube.com/channel/UCC5NsnXM2lE6kKfJKdQgsRQ)
- [Rancher YouTube Channel](https://www.youtube.com/channel/UCh5Xtp82q8wjijP8npkVTBA)
- [Rook YouTube Channel](https://www.youtube.com/channel/UCa7kFUSGO4NNSJV8MJVlJAA)
- [Tigera YouTube Channel](https://www.youtube.com/channel/UC8uN3yhpeBeerGNwDiQbcgw)
- [Weaveworks YouTube Channel](https://www.youtube.com/channel/UCmIz9ew1lA3-XDy5FqY-mrA/featured)
### Learning Resources
- [edx Courses](https://www.edx.org/course?search_query=kubernetes)
- [Katacoda Interactive Tutorials](https://www.katacoda.com)
- [Udacity Course](https://www.udacity.com/course/scalable-microservices-with-kubernetes--ud615)
- [Udemy Courses](https://www.udemy.com/courses/search/?courseLabel=&sort=relevance&q=kubernetes&price=price-free)

View File

@ -1,6 +1,6 @@
# SLACK GUIDELINES
Slack is the main communication platform for Kubernetes outside of our mailing lists. It's important that conversation stays on topic in each channel, and that everyone abides by the Code of Conduct. We have over 30,000 members who should all expect to have a positive experience.
Slack is the main communication platform for Kubernetes outside of our mailing lists. It's important that conversation stays on topic in each channel, and that everyone abides by the Code of Conduct. We have over 50,000 members who should all expect to have a positive experience.
Chat is searchable and public. Do not make comments that you would not say on a video recording or in another public space. Please be courteous to others.
@ -10,12 +10,8 @@ Chat is searchable and public. Do not make comments that you would not say on a
Kubernetes adheres to Cloud Native Compute Foundation's [Code of Conduct](https://github.com/cncf/foundation/blob/master/code-of-conduct.md) throughout the project, and includes all communication mediums.
## ADMINS
(by Slack ID and timezone)
* caniszczyk - CT
* idvoretskyi - CET
* jdumars - ET
* jorge - CT
* paris - PT
- Check the [centralized list of administrators](./moderators.md) for contact information.
Slack Admins should make sure to mention this in the "What I do" section of their Slack profile, along with their time zone.
@ -37,7 +33,7 @@ Please reach out to the #slack-admins group with your request to create a new ch
Channels are dedicated to [SIGs, WGs](/sig-list.md), sub-projects, community topics, and related Kubernetes programs/projects.
Channels are not:
* company specific; cloud provider channels named after their products are OK, but discourse must be about Kubernetes-related topics and not the provider's proprietary information.
* private unless there is an exception: code of conduct matters, mentoring, security/vulnerabilities, or steering committee.
* private unless there is an exception: code of conduct matters, mentoring, security/vulnerabilities, github management, or steering committee.
Typical naming conventions:
#kubernetes-foo #sig-foo #meetup-foo #location-users #projectname
@ -51,11 +47,13 @@ Join the #slack-admins channel or contact one of the admins in the closest timez
What if you have a problem with an admin?
Send a DM to another listed Admin and describe the situation OR
If it's a code of conduct issue, please send an email to steering-private@kubernetes.io and describe the situation
If it's a code of conduct issue, please send an email to conduct@kubernetes.io and describe the situation
## BOTS, TOKENS, WEBHOOKS, OH MY
Bots, tokens, and webhooks are reviewed on a case-by-case basis with most requests being rejected due to security, privacy, and usability concerns.. Bots and the like tend to make a lot of noise in channels. Our Slack instance has over 30,000 people and we want everyone to have a great experience. Please join #Slack-admins and have a discussion about your request before requesting the access. GitHub workflow alerts into certain channels and requests from CNCF are typically OK.
Bots, tokens, and webhooks are reviewed on a case-by-case basis with most requests being rejected due to security, privacy, and usability concerns. Bots and the like tend to make a lot of noise in channels. Our Slack instance has over 50,000 people and we want everyone to have a great experience. Please join #slack-admins and have a discussion about your request before requesting the access.
Typically OK: GitHub, CNCF requests, and tools/platforms that we use to contribute to Kubernetes
## ADMIN MODERATION
@ -78,8 +76,15 @@ For reasons listed below, admins may inactivate individual Slack accounts. Due t
In the case that certain channels have rules or guidelines, they will be listed in the purpose or pinned docs of that channel.
#kubernetes-dev = questions and discourse around upstream contributions and development to kubernetes
#kubernetes-careers = job openings for positions working with/on/around Kubernetes. Postings should include contact details.
#kubernetes-careers = job openings for positions working with/on/around Kubernetes. Post the job once and pin it. Pins expire after 30 days. Postings must include:
- A link to the posting or job description
- The business name that will employ the Kubernetes hire
- The location of the role or if remote is OK
## DM (Direct Message) Conversations
Please do not engage in proprietary company specific conversations in the Kubernetes Slack instance. This is meant for conversations around related Kubernetes open source topics and community. Proprietary conversations should occur in your company Slack and/or communication platforms. As with all communication, please be mindful of appropriateness, professionalism, and applicability to the Kubernetes community.
Note:
We archive the entire workgroup's Slack data in zip files when we have time. [The latest archive is from June 2016-November 2018.](https://drive.google.com/drive/folders/1idJkWcDuSfs8nFUm-1BgvzZxCqPMpDCb?usp=sharing)

View File

@ -0,0 +1,114 @@
# Zoom Guidelines
Zoom is the main video communication platform for Kubernetes.
It is used for running the [community meeting](/events/community-meeting.md), [SIG/WG meetings](/sig-list.md), [Office Hours](/events/office-hours.md), [Meet Our Contributors](/mentoring/meet-our-contributors.md) and many other Kubernetes online events.
Since the Zoom meetings are open to the general public, a Zoom host or co-host has to moderate a meeting in all senses of the word, from starting and stopping the meeting to acting on code of conduct issues.
These guidelines are meant as a tool to help Kubernetes members manage their Zoom resources.
Check the main [moderation](./moderating.md) page for more information on other tools and general moderation guidelines.
## Current State
Zoom licenses are managed by the [CNCF Service Desk](https://github.com/cncf/servicedesk) through the Zoom Admins listed below. At the time of this update, we have 41 paid pro user licenses with 38 accounted for.
## Code of Conduct
Kubernetes adheres to Cloud Native Compute Foundation's [Code of Conduct](https://github.com/cncf/foundation/blob/master/code-of-conduct.md) throughout the project, and includes all communication mediums.
## Obtaining a Zoom License
Each SIG should have a paid Zoom account that all leads/chairs/necessary trusted owners have access to through their k-sig-foo-leads@googlegroups.com account.
See the [SIG Creation procedure](/sig-governance.md#sig-creation-procedure) document on how to set up an initial account.
## Setting Up Your Meeting and Moderation
Do not share your Zoom link on social media.
Moderation tools will not be available unless you are:
- running the [latest version](https://zoom.us/download) of Zoom
- logged in as the leads of that meeting OR have the host key (example: you need to use the leads account for SIG Architecture if you are running that meeting, or have their meeting key with "join before host" enabled)
- using a meeting that was set up through the "Meeting" tab in the Zoom account and NOT the personal meeting ID
After the meeting has started:
- Assign a cohost to help with moderation. It should never be your notetaker unless it's a very small group.
- Turn off screen sharing for everyone and allow only the host to share. If others need to share their screen, the host can enable that on the fly (via the ^ next to Share Screen).
If you're dealing with a troll or bad actor:
- You can put an attendee on hold. The participant will be put into a 'waiting room' and will not be able to chat or participate, in writing or verbally, until the host removes the hold.
- Remove the participant. Please be cautious when testing or using this feature, as it is permanent. They will never be able to come back into that meeting ID on that device. Do not joke around with this feature; it's better to use the hold first and then remove.
- After an action has been taken, use the 'lock meeting' feature so that no others come into the meeting and you can resume. If that fails, end the call immediately. Contact Zoom Admins after the meeting to report.
You can find these actions by clicking on the 'more' or '...' options after hovering over the participant's name/information.
It is required that a host be comfortable with these moderation tools and the Zoom settings in general. Make sure whoever is running your meeting is equipped with the right knowledge and skills.
### Other Related Documentation
Zoom has documentation on how to use their moderation tools:
- https://support.zoom.us/hc/en-us/articles/201362603-Host-Controls-in-a-Meeting
We created an extensive [best practices doc](https://docs.google.com/document/d/1fudC_diqhN2TdclGKnQ4Omu4mwom83kYbZ5uzVRI07w/edit?usp=sharing) with screenshots. Those who belong to kubernetes-sig-leads@ have access.
## Meeting Archive Videos
If a violation has been addressed by a host and it has been recorded by Zoom, the video should be edited before being posted on the [Kubernetes channel](https://www.youtube.com/c/kubernetescommunity).
Contact [SIG Contributor Experience](/sig-contributor-experience) if you need help editing a video before posting it to the public.
Chairs and TLs are responsible for posting all update meetings to their playlist on YouTube. [Please follow this guideline for more details.](K8sYoutubeCollaboration.md)
## Zoom Admins
Check the [centralized list of administrators](./moderators.md) for contact information.
### Escalating and/or Reporting a Problem
If an issue cannot be handled via normal moderation by the Zoom Admins above and there has been a clear code of conduct violation, please escalate to the Kubernetes Code of Conduct Committee at conduct@kubernetes.io.
## Screen sharing guidelines and recommendations
Zoom has documentation on how to use its screen sharing feature:
- https://support.zoom.us/hc/en-us/articles/201362153-How-Do-I-Share-My-Screen-
Recommendations:
- Turn off notifications to prevent any interference.
- Close all sensitive documents and unrelated programs before sharing the screen, e.g. email.
- Test your presentation beforehand to make sure everything goes smoothly.
- Keep your desktop clean. Make sure there is no offensive and/or distracting background.
## Audio/Video Quality Recommendations
While video conferencing has been a real boon to productivity there are still [lots of things that can go wrong](https://www.youtube.com/watch?v=JMOOG7rWTPg) during a conference video call.
There are some things that are just plain out of your control, but there are some things that you can control.
Here are some tips if you're just getting into remote meetings.
Keep in mind that sometimes things just break and sometimes it's just plain bad luck, so these aren't hard rules, more of a set of loose guidelines on how to tip the odds in your favor.
### Recommended Hardware to Have
- A dedicated microphone - This is the number one upgrade you can do. Sound is one of those things that can immediately change the quality of your call. If you plan on being here for the long haul something like a [Blue Yeti](https://www.bluedesigns.com/products/yeti/) will work great due to the simplicity of using USB audio and having a hardware mute button. Consider a [pop filter](https://en.wikipedia.org/wiki/Pop_filter) as well if necessary.
- A Video Camera - A bad image can be worked around if the audio is good. Certain models have noise cancelling dual-microphones, which are a great backup for a dedicated microphone or if you are travelling.
- A decent set of headphones - Personal preference, these cut down on the audio feedback when in larger meetings.
What about an integrated headset and microphone? This totally depends on the type. We recommend testing it with a friend or asking around for recommendations for which models work best.
### Hardware we don't recommend
- Earbuds. Generally speaking they are not ideal; while they might sound fine to you, when 50 people are on a call the ambient noise adds up. Some people join with earbuds and it sounds excellent, some people join and it sounds terrible. Practicing with someone ahead of time can help you determine how well your earbuds work.
### Pro-tips
- [Join on muted audio and video](https://support.zoom.us/hc/en-us/articles/203024649-Video-Or-Microphone-Off-By-Attendee) in order to prevent noise to those already in a call.
- If you don't have anything to say at that moment, MUTE. This is a common problem, you can help out a teammate by mentioning it on Zoom chat or asking them to mute on the call itself. Hopefully the meeting co-host can help mute before this is too disruptive. Don't feel bad if this happens to you, it's a common occurrence.
- Try to find a quiet meeting place to join from; some coworking spaces and coffee shops have a ton of ambient noise that won't be obvious to you but will be to other people in the meeting. When presenting to large groups consider delegating to another person who is in a quieter environment.
- Using your computer's built in microphone and speakers might work in a pinch, but in general won't work as well as a dedicated headset/microphone.
- Consider using visual signals to agree to points so that you don't have to mute/unmute often during a call. This can be an especially useful technique when people are asking for lazy consensus. A simple thumbs up can go a long way!
- It is common for people to step on each other when there's an audio delay, and both parties are trying to communicate something, so don't sweat it, just remember to try and pause before speaking, or consider raising your hand (if your video is on) to help the host determine who should speak first.
Thanks for making Kubernetes meetings work great!

View File

@ -2,8 +2,9 @@
**Note:** This document is in progress
This doc outlines the various responsibilities of contributor roles in Kubernetes. The Kubernetes
project is subdivided into subprojects under SIGs. Responsibilities for most roles is scoped to these subprojects.
This doc outlines the various responsibilities of contributor roles in
Kubernetes. The Kubernetes project is subdivided into subprojects under SIGs.
Responsibilities for most roles are scoped to these subprojects.
| Role | Responsibilities | Requirements | Defined by |
| -----| ---------------- | ------------ | -------|
@ -14,21 +15,24 @@ project is subdivided into subprojects under SIGs. Responsibilities for most ro
## New contributors
[New contributors] should be welcomed to the community by existing members, helped with PR workflow, and directed to
relevant documentation and communication channels.
[New contributors] should be welcomed to the community by existing members,
helped with PR workflow, and directed to relevant documentation and
communication channels.
## Established community members
Established community members are expected to demonstrate their adherence to the principles in this
document, familiarity with project organization, roles, policies, procedures, conventions, etc.,
and technical and/or writing ability. Role-specific expectations, responsibilities, and requirements
are enumerated below.
Established community members are expected to demonstrate their adherence to the
principles in this document, familiarity with project organization, roles,
policies, procedures, conventions, etc., and technical and/or writing ability.
Role-specific expectations, responsibilities, and requirements are enumerated
below.
## Member
Members are continuously active contributors in the community. They can have issues and PRs assigned to them,
participate in SIGs through GitHub teams, and pre-submit tests are automatically run for their PRs.
Members are expected to remain active contributors to the community.
Members are continuously active contributors in the community. They can have
issues and PRs assigned to them, participate in SIGs through GitHub teams, and
pre-submit tests are automatically run for their PRs. Members are expected to
remain active contributors to the community.
**Defined by:** Member of the Kubernetes GitHub organization
@ -41,6 +45,7 @@ Members are expected to remain active contributors to the community.
- Contributing to SIG, subproject, or community discussions (e.g. meetings, Slack, email discussion
forums, Stack Overflow)
- Subscribed to [kubernetes-dev@googlegroups.com]
- Have read the [contributor guide]
- Actively contributing to 1 or more subprojects.
- Sponsored by 2 reviewers. **Note the following requirements for sponsors**:
- Sponsors must have close interactions with the prospective member - e.g. code/design/proposal review, coordinating
@ -48,39 +53,24 @@ Members are expected to remain active contributors to the community.
- Sponsors must be reviewers or approvers in at least 1 OWNERS file (in any repo in the Kubernetes GitHub
organization)
- Sponsors must be from multiple member companies to demonstrate integration across community.
- Send an email to *kubernetes-membership@googlegroups.com* with:
- CC: your sponsors on the message
- Subject: `REQUEST: New membership for <your-GH-handle>`
- Body: Confirm that you have joined kubernetes-dev@googlegroups.com (e.g. `I have joined
kubernetes-dev@googlegroups.com`)
- Body: GitHub handles of sponsors
- Body: List of contributions (PRs authored / reviewed, Issues responded to, etc)
- **[Open an issue][membership request] against the kubernetes/org repo**
- Ensure your sponsors are @mentioned on the issue
- Complete every item on the checklist ([preview the current version of the template][membership template])
- Make sure that the list of contributions included is representative of your work on the project.
- Have your sponsoring reviewers reply confirmation of sponsorship: `+1`
- Wait for response to the message
- Have read the [developer guide]
- Once your sponsors have responded, your request will be reviewed by the [Kubernetes GitHub Admin team], in accordance with their [SLO]. Any missing information will be requested.
Example message:
### Kubernetes Ecosystem
```
To: kubernetes-membership@googlegroups.com
CC: <sponsor1>, <sponsor2>
Subject: REQUEST: New membership for <your-GH-handle>
Body:
I have joined kubernetes-dev@googlegroups.com.
Sponsors:
- <GH handle> / <email>
- <GH handle> / <email>
List of contributions:
- <PR reviewed / authored>
- <PR reviewed / authored>
- <PR reviewed / authored>
- <Issue responded to>
- <Issue responded to>
```
There are related [Kubernetes GitHub organizations], such as [kubernetes-sigs].
We are currently working on automation that would transfer membership in the
Kubernetes organization to any related orgs automatically, but such is not the
case currently. If you are a Kubernetes org member, you are implicitly eligible
for membership in related orgs, and can request membership when it becomes
relevant, by [opening an issue][membership request] against the kubernetes/org
repo, as above. However, if you are a member of any of the related
[Kubernetes GitHub organizations] but not of the [Kubernetes org],
you will need explicit sponsorship for your membership request.
### Responsibilities and privileges
@ -95,24 +85,28 @@ List of contributions:
- Tests can be run against their PRs automatically. No `/ok-to-test` needed.
- Members can do `/ok-to-test` for PRs that have a `needs-ok-to-test` label, and use commands like `/close` to close PRs as well.
**Note:** members who frequently contribute code are expected to proactively perform code reviews and work towards
becoming a primary *reviewer* for the subproject that they are active in.
**Note:** members who frequently contribute code are expected to proactively
perform code reviews and work towards becoming a primary *reviewer* for the
subproject that they are active in.
## Reviewer
Reviewers are able to review code for quality and correctness on some part of a subproject.
They are knowledgeable about both the codebase and software engineering principles.
Reviewers are able to review code for quality and correctness on some part of a
subproject. They are knowledgeable about both the codebase and software
engineering principles.
**Defined by:** *reviewers* entry in an OWNERS file in a repo owned by the Kubernetes project.
**Defined by:** *reviewers* entry in an OWNERS file in a repo owned by the
Kubernetes project.
Reviewer status is scoped to a part of the codebase.
**Note:** Acceptance of code contributions requires at least one approver in addition to the assigned reviewers.
**Note:** Acceptance of code contributions requires at least one approver in
addition to the assigned reviewers.
### Requirements
The following apply to the part of codebase for which one would be a reviewer in an [OWNERS] file
(for repos using the bot).
The following apply to the part of codebase for which one would be a reviewer in
an [OWNERS] file (for repos using the bot).
- member for at least 3 months
- Primary reviewer for at least 5 PRs to the codebase
@ -125,8 +119,8 @@ The following apply to the part of codebase for which one would be a reviewer in
### Responsibilities and privileges
The following apply to the part of codebase for which one would be a reviewer in an [OWNERS] file
(for repos using the bot).
The following apply to the part of codebase for which one would be a reviewer in
an [OWNERS] file (for repos using the bot).
- Tests are automatically run for PullRequests from members of the Kubernetes GitHub organization
- Code reviewer status may be a precondition to accepting large code contributions
@ -141,19 +135,21 @@ The following apply to the part of codebase for which one would be a reviewer in
## Approver
Code approvers are able to both review and approve code contributions. While code review is focused on
code quality and correctness, approval is focused on holistic acceptance of a contribution including:
backwards / forwards compatibility, adhering to API and flag conventions, subtle performance and correctness issues,
interactions with other parts of the system, etc.
Code approvers are able to both review and approve code contributions. While
code review is focused on code quality and correctness, approval is focused on
holistic acceptance of a contribution including: backwards / forwards
compatibility, adhering to API and flag conventions, subtle performance and
correctness issues, interactions with other parts of the system, etc.
**Defined by:** *approvers* entry in an OWNERS file in a repo owned by the Kubernetes project.
**Defined by:** *approvers* entry in an OWNERS file in a repo owned by the
Kubernetes project.
Approver status is scoped to a part of the codebase.
### Requirements
The following apply to the part of codebase for which one would be an approver in an [OWNERS] file
(for repos using the bot).
The following apply to the part of codebase for which one would be an approver
in an [OWNERS] file (for repos using the bot).
- Reviewer of the codebase for at least 3 months
- Primary reviewer for at least 10 substantial PRs to the codebase
@ -164,8 +160,8 @@ The following apply to the part of codebase for which one would be an approver i
### Responsibilities and privileges
The following apply to the part of codebase for which one would be an approver in an [OWNERS] file
(for repos using the bot).
The following apply to the part of codebase for which one would be an approver
in an [OWNERS] file (for repos using the bot).
- Approver status may be a precondition to accepting large code contributions
- Demonstrate sound technical judgement
@ -178,21 +174,24 @@ The following apply to the part of codebase for which one would be an approver i
## Subproject Owner
**Note:** This is a generalized high-level description of the role, and the specifics of the subproject owner role's
responsibilities and related processes *MUST* be defined for individual SIGs or subprojects.
**Note:** This is a generalized high-level description of the role, and the
specifics of the subproject owner role's responsibilities and related
processes *MUST* be defined for individual SIGs or subprojects.
Subproject Owners are the technical authority for a subproject in the Kubernetes project. They *MUST* have demonstrated
both good judgement and responsibility towards the health of that subproject. Subproject Owners *MUST* set technical
direction and make or approve design decisions for their subproject - either directly or through delegation
of these responsibilities.
Subproject Owners are the technical authority for a subproject in the Kubernetes
project. They *MUST* have demonstrated both good judgement and responsibility
towards the health of that subproject. Subproject Owners *MUST* set technical
direction and make or approve design decisions for their subproject - either
directly or through delegation of these responsibilities.
**Defined by:** *owners* entry in subproject [OWNERS] files as defined by [sigs.yaml] *subproject.owners*
### Requirements
The process for becoming an subproject Owner should be defined in the SIG charter of the SIG owning
the subproject. Unlike the roles outlined above, the Owners of a subproject are typically limited
to a relatively small group of decision makers and updated as fits the needs of the subproject.
The process for becoming a subproject Owner should be defined in the SIG
charter of the SIG owning the subproject. Unlike the roles outlined above, the
Owners of a subproject are typically limited to a relatively small group of
decision makers and updated as fits the needs of the subproject.
The following apply to the subproject for which one would be an owner.
@ -222,13 +221,20 @@ The following apply to the subproject for which one would be an owner.
**Status:** Removed
The Maintainer role has been removed and replaced with a greater focus on [owner](#owner)s.
The Maintainer role has been removed and replaced with a greater focus on [OWNERS].
[code reviews]: contributors/devel/collab.md
[community expectations]: contributors/guide/community-expectations.md
[developer guide]: contributors/devel/README.md
[two-factor authentication]: https://help.github.com/articles/about-two-factor-authentication
[code reviews]: /contributors/devel/collab.md
[community expectations]: /contributors/guide/community-expectations.md
[contributor guide]: /contributors/guide/README.md
[Kubernetes GitHub Admin team]: /github-management/README.md#github-administration-team
[Kubernetes GitHub organizations]: /github-management#actively-used-github-organizations
[Kubernetes org]: https://github.com/kubernetes
[kubernetes-dev@googlegroups.com]: https://groups.google.com/forum/#!forum/kubernetes-dev
[sigs.yaml]: sigs.yaml
[New contributors]: https://github.com/kubernetes/community/blob/master/CONTRIBUTING.md
[OWNERS]: contributors/guide/owners.md
[kubernetes-sigs]: https://github.com/kubernetes-sigs
[membership request]: https://github.com/kubernetes/org/issues/new?template=membership.md&title=REQUEST%3A%20New%20membership%20for%20%3Cyour-GH-handle%3E
[membership template]: https://git.k8s.io/org/.github/ISSUE_TEMPLATE/membership.md
[New contributors]: /CONTRIBUTING.md
[OWNERS]: /contributors/guide/owners.md
[sigs.yaml]: /sigs.yaml
[SLO]: /github-management/org-owners-guide.md#slos
[two-factor authentication]: https://help.github.com/articles/about-two-factor-authentication

View File

@ -8,4 +8,4 @@ Note that a number of these documents are historical and may be out of date or u
TODO: Add the current status to each document and clearly indicate which are up to date.
TODO: Document the [proposal process](../devel/faster_reviews.md#1-dont-build-a-cathedral-in-one-pr).
TODO: Document the [proposal process](../guide/pull-requests.md#best-practices-for-faster-reviews).

View File

@ -31,7 +31,7 @@ aggregated servers.
* Developers should be able to write their own API server and cluster admins
should be able to add them to their cluster, exposing new APIs at runtime. All
of this should not require any change to the core kubernetes API server.
* These new APIs should be seamless extension of the core kubernetes APIs (ex:
* These new APIs should be seamless extensions of the core kubernetes APIs (ex:
they should be operated upon via kubectl).
## Non Goals

View File

@ -89,7 +89,7 @@ Implementations that cannot offer consistent ranging (returning a set of results
#### etcd3
For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.
For etcd3 the continue token would contain a resource version (the snapshot that we are reading that is consistent across the entire LIST) and the start key for the next set of results. Upon receiving a valid continue token the apiserver would instruct etcd3 to retrieve the set of results at a given resource version, beginning at the provided start key, limited by the maximum number of requests provided by the continue token (or optionally, by a different limit specified by the client). If more results remain after reading up to the limit, the storage should calculate a continue token that would begin at the next possible key, and the continue token set on the returned list.
The storage layer in the apiserver must apply consistency checking to the provided continue token to ensure that malicious users cannot trick the server into serving results outside of its range. The storage layer must perform defensive checking on the provided value, check for path traversal attacks, and have stable versioning for the continue token.

View File

@ -35,7 +35,7 @@ while
## Constraints and Assumptions
* it is not the goal to implement all output formats one can imagine. The main goal is to be extensible with a clear golang interface. Implementations of e.g. CADF must be possible, but won't be discussed here.
* it is not the goal to implement all output formats one can imagine. The main goal is to be extensible with a clear golang interface. Implementations of e.g. CADF must be possible, but won't be discussed here.
* dynamic loading of backends for new output formats are out of scope.
## Use Cases
@ -243,7 +243,7 @@ type PolicyRule struct {
// An empty list implies every user.
Users []string
// The user groups this rule applies to. If a user is considered matching
// if the are a member of any of these groups
// if they are a member of any of these groups
// An empty list implies every user group.
UserGroups []string

View File

@ -0,0 +1,889 @@
# CRD Conversion Webhook
Status: Approved
Version: Alpha
Implementation Owner: @mbohlool
Authors: @mbohlool, @erictune
Thanks: @dbsmith, @deads2k, @sttts, @liggit, @enisoc
### Summary
This document proposes a detailed plan for adding support for version-conversion of Kubernetes resources defined via Custom Resource Definitions (CRD). The API Server is extended to call out to a webhook at appropriate parts of the handler stack for CRDs.
No new resources are added; the [CRD resource](https://github.com/kubernetes/kubernetes/blob/34383aa0a49ab916d74ea897cebc79ce0acfc9dd/staging/src/k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/types.go#L187) is extended to include conversion information as well as multiple schema definitions, one for each apiVersion that is to be served.
## Definitions
**Webhook Resource**: a Kubernetes resource (or portion of a resource) that informs the API Server that it should call out to a Webhook Host for certain operations.
**Webhook Host**: a process / binary which accepts HTTP connections, intended to be called by the Kubernetes API Server as part of a Webhook.
**Webhook**: In Kubernetes, refers to the idea of having the API server make an HTTP request to another service at a point in its request processing stack. Examples are [Authentication webhooks](https://kubernetes.io/docs/reference/access-authn-authz/webhook/) and [Admission Webhooks](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/). Usually refers to the system of Webhook Host and Webhook Resource together, but occasionally used to mean just Host or just Resource.
**Conversion Webhook**: Webhook that can convert an object from one version to another.
**Custom Resource**: In the context of this document, it refers to resources defined via a Custom Resource Definition (in contrast with extension API server resources).
**CRD Package**: CRD definition, plus associated controller deployment, RBAC roles, etc, which is released by a developer who uses CRDs to create new APIs.
## Motivation
Version conversion is, in our experience, the most requested improvement to CRDs. Prospective CRD users want to be certain they can evolve their API before they start down the path of developing a CRD + controller.
## Requirements
* As an existing author of a CRD, I can update my API's schema, without breaking existing clients. To that end, I can write a CRD(s) that supports one kind with two (or more) versions. Users of this API can access an object via either version (v1 or v2), and are accessing the same underlying storage (assuming that I have properly defined how to convert between v1 and v2.)
* As a prospective user of CRDs, I don't know what schema changes I may need in the future, but I want to know that they will be possible before I choose CRDs (over EAS, or over a non-Kubernetes API).
* As an author of a CRD Package, my users can upgrade to a new version of my package, and can downgrade to a prior version of my package (assuming that they follow proper upgrade and downgrade procedures; these should not require direct etcd access.)
* As a user, I should be able to request a CR in any supported version defined by the CRD and get an object that has been properly converted to the requested version (assuming the CRD Package Author has properly defined how to convert).
* As an author of a CRD that does not use validation, I can still have different versions which undergo conversion.
* As a user, when I request an object, and webhook-conversion fails, I get an error message that helps me understand the problem.
* As an API machinery code maintainer, this change should not make the API machinery code harder to maintain.
* As a cluster owner, when I upgrade to the version of Kubernetes that supports CRD multiple versions, but I don't use the new feature, my existing CRDs work fine. I can roll back to the previous version without any special action.
## Summary of Changes
1. A CRD object now represents a group/kind with one or more versions.
2. The CRD API (CustomResourceDefinitionSpec) is extended as follows:
1. It has a place to register 1 webhook.
2. It holds multiple "versions".
3. Some fields which were part of the .spec are now per-version; namely Schema, Subresources, and AdditionalPrinterColumns.
3. A Webhook Host is used to do conversion for a CRD.
4. CRD authors will need to write a Webhook Host that accepts any version and returns any version.
5. Toolkits like kube-builder and operator-sdk are expected to provide flows to assist users to generate Webhook Hosts.
## Detailed Design
### CRD API Changes
The CustomResourceDefinitionSpec is extended to have a new section where webhooks are defined:
```golang
// CustomResourceDefinitionSpec describes how a user wants their resource to appear
type CustomResourceDefinitionSpec struct {
Group string
Version string
Names CustomResourceDefinitionNames
Scope ResourceScope
// Optional, can only be provided if per-version schema is not provided.
Validation *CustomResourceValidation
// Optional, can only be provided if per-version subresource is not provided.
Subresources *CustomResourceSubresources
Versions []CustomResourceDefinitionVersion
// Optional, can only be provided if per-version additionalPrinterColumns is not provided.
AdditionalPrinterColumns []CustomResourceColumnDefinition
Conversion *CustomResourceConversion
}
type CustomResourceDefinitionVersion struct {
Name string
Served bool
Storage bool
// Optional, can only be provided if top level validation is not provided.
Schema *JSONSchemaProp
// Optional, can only be provided if top level subresource is not provided.
Subresources *CustomResourceSubresources
// Optional, can only be provided if top level additionalPrinterColumns is not provided.
AdditionalPrinterColumns []CustomResourceColumnDefinition
}
type CustomResourceConversion struct {
// Conversion strategy, either "nop" or "webhook". If webhook is set, the Webhook field is required.
Strategy string
// Additional information for external conversion if strategy is set to external
// +optional
Webhook *CustomResourceConversionWebhook
}
type CustomResourceConversionWebhook struct {
// ClientConfig defines how to communicate with the webhook. This is the same config used for validating/mutating webhooks.
ClientConfig WebhookClientConfig
}
```
### Top level fields to Per-Version fields
In *CRD v1beta1* (apiextensions.k8s.io/v1beta1), schema, additionalPrinterColumns, and subresources (each referred to as X in this section) can be defined per version as well as at the top level, and these validation rules will be applied to them:
* Either top level X or per-version X can be set, but not both. This rule applies to individual Xs, not the whole set. E.g. top level schema can be set while per-version subresources are set.
* Per-version X cannot all be the same. E.g. if all per-version schemas are the same, the CRD object will be rejected with an error message asking the user to use the top level schema.
In *CRD v1* (apiextensions.k8s.io/v1), there will be only the versions list, with no top level X. The second validation rule guarantees a clean move to v1. These are the conversion rules (a sketch of the v1beta1->v1 rule follows the lists below):
*v1beta1->v1:*
* If top level X is set in v1beta1, then it will be copied to all versions in v1.
* If per-version X are set in v1beta1, then they will be used for per-version X in v1.
*v1->v1beta1:*
* If all per-version X are the same in v1, they will be copied to top level X in v1beta1
* Otherwise, they will be used as per-version X in v1beta1
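To make the copy rule concrete, here is a minimal Go sketch of the v1beta1->v1 rule for the schema field, using simplified stand-in types rather than the real apiextensions structs (the same pattern would apply to subresources and additionalPrinterColumns):
```golang
// Simplified stand-in types for illustration only; the real apiextensions
// types have more fields and different names.
type JSONSchemaProps struct{ /* elided */ }

type v1beta1Version struct {
	Name   string
	Schema *JSONSchemaProps // per-version X
}

type v1Version struct {
	Name   string
	Schema *JSONSchemaProps // v1 only has per-version X
}

// toV1Versions illustrates the rule: a top-level X (topLevelSchema) is copied
// to every version; otherwise the per-version X values are used as-is.
// Validation guarantees that both cannot be set at the same time.
func toV1Versions(topLevelSchema *JSONSchemaProps, versions []v1beta1Version) []v1Version {
	out := make([]v1Version, 0, len(versions))
	for _, v := range versions {
		schema := v.Schema
		if topLevelSchema != nil {
			schema = topLevelSchema
		}
		out = append(out, v1Version{Name: v.Name, Schema: schema})
	}
	return out
}
```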
#### Alternative approaches considered
First, a defaulting approach was considered in which per-version fields would be defaulted to the top level fields, but that is a backward-incompatible change; quoting from the API [guidelines](https://github.com/kubernetes/community/blob/master/contributors/devel/api_changes.md#backward-compatibility-gotchas):
> A single feature/property cannot be represented using multiple spec fields in the same API version simultaneously
Hence the defaulting, either implicit or explicit, has the potential to break backward compatibility, as we would have two sets of fields representing the same feature.
Other solutions that do not involve defaulting were considered:
* Field Discriminator: Use `Spec.Conversion.Strategy` as a discriminator to decide which set of fields to use. This approach would work, but the proposed solution keeps the mutual exclusivity in a broader sense and is preferred.
* Per-version override: If a per-version X is specified, use it; otherwise use the top level X if provided. While with careful validation and feature gating this solution is also backward compatible, the overriding behaviour would need to be kept in CRD v1, which is too complicated and not clean enough to keep for a v1 API.
Refer to [this document](http://bit.ly/k8s-crd-per-version-defaulting) for more details and discussions on those solutions.
### Support Level
The feature will be alpha in the first implementation and will have a feature gate that is defaulted to false. The rollback story with a feature gate is much clearer: if the feature is alpha in Kubernetes release Y (> X, where X lacks the feature) and becomes beta in Kubernetes release Z, it is not safe to use the feature and then downgrade from Y to X, but that is acceptable because the feature is alpha in Y. It is safe to downgrade from Z to Y (given that we enable the feature gate in Y), and that is desirable as the feature is beta in Z.
On downgrading from a Z to Y, stored CRDs can have per-version fields set. While the feature gate can be off on Y (alpha cluster), it is dangerous to disable per-version Schema Validation or Status subresources as it makes the status field mutable and validation on CRs will be disabled. Thus the feature gate in Y only protects adding per-version fields not the actual behaviour. Thus if the feature gate is off in Y:
* Per-version X cannot be set on CRD create (per-version fields are auto-cleared).
* Per-version X can only be set/changed on CRD update *if* the existing CRD object already has per-version X set.
This way, even if we downgrade from Z to Y, per-version validations and subresources will be honored. This will not be the case for webhook conversion itself. The feature gate will also protect the implementation of webhook conversion, and an alpha cluster with the feature gate disabled will return an error for CRDs with webhook conversion (that were created with a future version of the cluster).
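As a sketch only, the gating on create/update described above could look roughly like the following, using the CustomResourceDefinitionSpec types sketched earlier in this proposal; the helper and its name are hypothetical, not the actual apiserver code:
```golang
// dropDisabledPerVersionFields is a hypothetical helper illustrating the rule
// above: with the feature gate off, per-version fields are cleared on create,
// and are kept on update only if the existing CRD already uses them (so that
// validation and subresources keep working after a downgrade).
func dropDisabledPerVersionFields(newSpec, oldSpec *CustomResourceDefinitionSpec, gateEnabled bool) {
	if gateEnabled {
		return
	}
	if oldSpec != nil && hasPerVersionFields(oldSpec) {
		return
	}
	for i := range newSpec.Versions {
		newSpec.Versions[i].Schema = nil
		newSpec.Versions[i].Subresources = nil
		newSpec.Versions[i].AdditionalPrinterColumns = nil
	}
}

func hasPerVersionFields(spec *CustomResourceDefinitionSpec) bool {
	for _, v := range spec.Versions {
		if v.Schema != nil || v.Subresources != nil || len(v.AdditionalPrinterColumns) > 0 {
			return true
		}
	}
	return false
}
```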
### Rollback
Users that need to roll back to version X of the apiserver (but may currently be running version Y > X) should not use CRD Webhook Conversion if X is not a version that supports these features. If a user were to create a CRD that uses CRD Webhook Conversion and then rolls back to version X that does not support conversion, then the following would happen:
1. The stored custom resources in etcd will not be deleted.
2. Any clients that try to get the custom resources will get a 500 (internal server error). This is distinguishable from a deleted object for get, and the list operation will also fail. That means the CRD is not served at all, and clients that try to garbage collect resources related to missing CRs should be aware of this.
3. Any client (e.g. controller) that tries to list the resource (in preparation for watching it) will get a 500 (this is distinguishable from an empty list or a 404).
4. If the user rolls forward again, then custom resources will be served again.
If a user does not use the webhook feature but uses the versioned schema, additionalPrinterColumns, and/or subresources and rolls back to a version that does not support them per-version, any value set per-version will be ignored and only values in the top level spec.* will be honored.
Please note that any of the fields added in this design that are not supported in previous Kubernetes releases can be removed on an update operation (e.g. status update). A Kubernetes release that defines the types but gates them behind an alpha feature gate, however, can keep these fields but ignore their values.
### Webhook Request/Response
The Conversion request and response would be similar to [Admission webhooks](https://github.com/kubernetes/kubernetes/blob/951962512b9cfe15b25e9c715a5f33f088854f97/staging/src/k8s.io/api/admission/v1beta1/types.go#L29). The AdmissionReview seems to be redundant but used by other Webhook APIs and added here for consistency.
```golang
// ConversionReview describes a conversion request/response.
type ConversionReview struct {
metav1.TypeMeta
// Request describes the attributes for the conversion request.
// +optional
Request *ConversionRequest
// Response describes the attributes for the conversion response.
// +optional
Response *ConversionResponse
}
type ConversionRequest struct {
// UID is an identifier for the individual request/response. Useful for logging.
UID types.UID
// The version to convert given object to. E.g. "stable.example.com/v1"
APIVersion string
// Object is the CRD object to be converted.
Object runtime.RawExtension
}
type ConversionResponse struct {
// UID is an identifier for the individual request/response.
// This should be copied over from the corresponding ConversionRequest.
UID types.UID
// ConvertedObject is the converted version of request.Object.
ConvertedObject runtime.RawExtension
}
```
If the conversion fails, the webhook should fail the HTTP request with a proper error code and message, which will be used to create a status error for the original API caller.
### Monitorability
There should be Prometheus metrics to show:
* CRD conversion latency
* Overall
* By webhook name
* By request (sum of all conversions in a request)
* By CRD
* Conversion Failures count
* Overall
* By webhook name
* By CRD
* Timeout failures count
* Overall
* By webhook name
* By CRD
Adding a webhook dynamically adds a key to a map-valued Prometheus metric. Webhook host process authors should consider how to make their webhook host monitorable: while eventually we hope to offer a set of best practices around this, for the initial release we won't have requirements here.
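As an illustration only, a latency metric of this shape could be exposed with the Prometheus Go client (`github.com/prometheus/client_golang/prometheus`) roughly as follows; the metric name, labels, and buckets are hypothetical, not the ones the implementation will necessarily use:
```golang
import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical metric; the actual name, labels, and buckets would be decided
// during implementation.
var crdConversionLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "apiserver_crd_conversion_webhook_duration_seconds",
		Help:    "CRD conversion webhook latency, labeled by webhook and CRD.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"webhook_name", "crd"},
)

func init() {
	prometheus.MustRegister(crdConversionLatency)
}

// observeConversion records the latency of one webhook call.
func observeConversion(webhookName, crd string, start time.Time) {
	crdConversionLatency.WithLabelValues(webhookName, crd).Observe(time.Since(start).Seconds())
}
```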
### Error Messages
When a conversion webhook fails, e.g. for a GET operation, the error message from the apiserver to its client should reflect that conversion failed and include additional information to help debug the problem. The error message and HTTP error code returned by the webhook should be included in the error message the API server returns to the user. For example:
```bash
$ kubectl get mykind somename
error on server: conversion from stored version v1 to requested version v2 for somename: "408 request timeout" while calling service "mywebhookhost.somens.cluster.local:443"
```
For operations that need more than one conversion (e.g. LIST), no partial result will be returned. Instead the whole operation will fail the same way, with detailed error messages. To help debug these kinds of operations, the UID of the first failing conversion will also be included in the error message.
### Caching
No new caching is planned as part of this work, but the API Server may in the future cache webhook POST responses.
Most API operations are reads. The most common kind of read is a watch. All watched objects are cached in memory. For CRDs, the cache
is per-version. That is the result of having one [REST store object](https://github.com/kubernetes/kubernetes/blob/3cb771a8662ae7d1f79580e0ea9861fd6ab4ecc0/staging/src/k8s.io/apiextensions-apiserver/pkg/registry/customresource/etcd.go#L72) per-version which
was an arbitrary design choice but would be required for better caching with webhook conversion. In this model, each GVK is cached, regardless of whether some GVKs share storage. Thus, watches do not cause conversion. So, conversion webhooks will not add overhead to the watch path. Watch cache is per api server and eventually consistent.
Non-watch reads are also cached (if the requested resourceVersion is 0, which is true for generated informers by default, but not for calls like `kubectl get ...`, namespace cleanup, etc). The cached objects are converted and per-version (TODO: fact check). So, conversion webhooks will not add overhead here either.
If in the future this proves to be a performance problem, we might need to add caching later. The Authorization and Authentication webhooks already use a simple scheme with APIserver-side caching and a single TTL for expiration. This has worked fine, so we can repeat this process. It does not require Webhook hosts to be aware of the caching.
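For illustration, the kind of APIserver-side TTL caching scheme mentioned above might look roughly like the sketch below; the types and keying strategy are hypothetical, and no such cache is part of this proposal:
```golang
import (
	"sync"
	"time"
)

// ttlCache is a hypothetical, minimal sketch of APIserver-side caching of
// webhook conversion results with a single TTL for expiration. The key would
// be derived from the input object and the target version.
type ttlCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]cacheEntry
}

type cacheEntry struct {
	converted []byte // serialized converted object
	expires   time.Time
}

func (c *ttlCache) get(key string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[key]
	if !ok || time.Now().After(e.expires) {
		return nil, false
	}
	return e.converted, true
}

func (c *ttlCache) put(key string, converted []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = cacheEntry{converted: converted, expires: time.Now().Add(c.ttl)}
}
```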
## Examples
### Example of Writing Conversion Webhook
Data model for v1:
|data model for v1|
|-----------------|
```yaml
properties:
spec:
properties:
cronSpec:
type: string
image:
type: string
```
|data model for v2|
|-----------------|
```yaml
properties:
spec:
properties:
min:
type: string
hour:
type: string
dayOfMonth:
type: string
month:
type: string
dayOfWeek:
type: string
image:
type: string
```
Both schemas can hold the same data (assuming the string format for V1 was a valid format).
|crontab_conversion.go|
|---------------------|
```golang
import .../types/v1
import .../types/v2
// Actual conversion methods
func convertCronV1toV2(cronV1 *v1.Crontab) (*v2.Crontab, error) {
items := strings.Split(cronV1.spec.cronSpec, " ")
if len(items) != 5 {
return nil, fmt.Errorf("invalid spec string, needs five parts: %s", cronV1.spec.cronSpec)
}
return &v2.Crontab{
ObjectMeta: cronV1.ObjectMeta,
TypeMeta: metav1.TypeMeta{
APIVersion: "stable.example.com/v2",
Kind: cronV1.Kind,
},
spec: v2.CrontabSpec{
image: cronV1.spec.image,
min: items[0],
hour: items[1],
dayOfMonth: items[2],
month: items[3],
dayOfWeek: items[4],
},
}, nil
}
func convertCronV2toV1(cronV2 *v2.Crontab) (*v1.Crontab, error) {
cronspec := cronV2.spec.min + " "
cronspec += cronV2.spec.hour + " "
cronspec += cronV2.spec.dayOfMonth + " "
cronspec += cronV2.spec.month + " "
cronspec += cronV2.spec.dayOfWeek
return &v1.Crontab{
ObjectMeta: cronV2.ObjectMeta,
TypeMeta: metav1.TypeMeta{
APIVersion: "stable.example.com/v1",
Kind: cronV2.Kind,
},
spec: v1.CrontabSpec{
image: cronV2.spec.image,
cronSpec: cronspec,
},
}, nil
}
// The rest of the file can go into an auto generated framework
func serveCronTabConversion(w http.ResponseWriter, r *http.Request) {
request, err := readConversionRequest(r)
if err != nil {
reportError(w, err)
return
}
response := ConversionResponse{}
response.UID = request.UID
converted, err := convert(request.Object, request.APIVersion)
if err != nil {
reportError(w, err)
return
}
response.ConvertedObject = *converted
writeConversionResponse(w, response)
}
func convert(in runtime.RawExtension, version string) (*runtime.RawExtension, error) {
inApiVersion, err := extractAPIVersion(in)
if err != nil {
return nil, err
}
switch inApiVersion {
case "stable.example.com/v1":
var cronV1 v1.Crontab
if err := json.Unmarshal(in.Raw, &cronV1); err != nil {
return nil, err
}
switch version {
case "stable.example.com/v1":
// This should not happen, as the API server will not call the webhook in this case
return &in, nil
case "stable.example.com/v2":
cronV2, err := convertCronV1toV2(&cronV1)
if err != nil {
return nil, err
}
raw, err := json.Marshal(cronV2)
if err != nil {
return nil, err
}
return &runtime.RawExtension{Raw: raw}, nil
}
case "stable.example.com/v2":
var cronV2 v2.Crontab
if err := json.Unmarshal(in.Raw, &cronV2); err != nil {
return nil, err
}
switch version {
case "stable.example.com/v2":
// This should not happen, as the API server will not call the webhook in this case
return &in, nil
case "stable.example.com/v1":
cronV1, err := convertCronV2toV1(&cronV2)
if err != nil {
return nil, err
}
raw, err := json.Marshal(cronV1)
if err != nil {
return nil, err
}
return &runtime.RawExtension{Raw: raw}, nil
}
default:
return nil, fmt.Errorf("invalid conversion fromVersion requested: %s", inApiVersion)
}
return nil, fmt.Errorf("invalid conversion toVersion requested: %s", version)
}
func extractAPIVersion(in runtime.RawExtension) (string, error) {
object := unstructured.Unstructured{}
if err := object.UnmarshalJSON(in.Raw); err != nil {
return "", err
}
return object.GetAPIVersion(), nil
}
```
Note: not all code is shown for running a web server.
Note: some of this is boilerplate that we expect tools like Kubebuilder will handle for the user.
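The helper functions used above (`readConversionRequest`, `writeConversionResponse`, `reportError`) are left out of the example; a minimal sketch of what they could look like, assuming the ConversionReview types from this proposal and standard library imports (`net/http`, `encoding/json`, `io/ioutil`, `fmt`), is shown below. This is illustrative only and omits TLS setup and content-type negotiation.
```golang
// Hypothetical minimal helpers for the webhook host above.
func readConversionRequest(r *http.Request) (*ConversionRequest, error) {
	body, err := ioutil.ReadAll(r.Body)
	if err != nil {
		return nil, err
	}
	review := ConversionReview{}
	if err := json.Unmarshal(body, &review); err != nil {
		return nil, err
	}
	if review.Request == nil {
		return nil, fmt.Errorf("conversion review contains no request")
	}
	return review.Request, nil
}

func writeConversionResponse(w http.ResponseWriter, response ConversionResponse) {
	review := ConversionReview{Response: &response}
	out, err := json.Marshal(review)
	if err != nil {
		reportError(w, err)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(out)
}

func reportError(w http.ResponseWriter, err error) {
	// Failing the HTTP request lets the API server surface the error message
	// to the original API caller, as described in the Error Messages section.
	http.Error(w, err.Error(), http.StatusInternalServerError)
}
```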
Also some appropriate tests, most importantly round trip test:
|crontab_conversion_test.go|
|-|
```golang
func TestRoundTripFromV1ToV2(t *testing.T) {
testObj := v1.Crontab{
ObjectMeta: metav1.ObjectMeta{
Name: "my-new-cron-object",
},
TypeMeta: metav1.TypeMeta{
APIVersion: "stable.example.com/v1",
Kind: "CronTab",
},
spec: v1.CrontabSpec{
image: "my-awesome-cron-image",
cronSpec: "* * * * */5",
},
}
testRoundTripFromV1(t, testObj)
}
func testRoundTripFromV1(t *testing.T, v1Object v1.CronTab) {
v2Object, err := convertCronV1toV2(v1Object)
if err != nil {
t.Fatalf("failed to convert v1 crontab to v2: %v", err)
}
v1Object2, err := convertCronV2toV1(v2Object)
if err != nil {
t.Fatalf("failed to convert v2 crontab to v1: %v", err)
}
if !reflect.DeepEqual(&v1Object, v1Object2) {
t.Errorf("round tripping failed for v1 crontab. v1Object: %v, v2Object: %v, v1ObjectConverted: %v",
v1Object, v2Object, v1Object2)
}
}
```
## Example of Updating a CRD from one to two versions
This example uses some files from the previous section.
**Step 1**: Start from a CRD with only one version
|crd1.yaml|
|-|
```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      properties:
        spec:
          properties:
            cronSpec:
              type: string
            image:
              type: string
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
    shortNames:
    - ct
```
And create it:
```bash
kubectl create -f crd1.yaml
```
(If you have an existing CRD that was installed on a Kubernetes version predating the "versions" field, you may need to move the top-level `version` field into a single-item `versions` list, or simply touch the CRD after upgrading to the new Kubernetes version, which will default the `versions` list to a single item equal to the top-level spec values.)
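For reference, such a pre-`versions` CRD would have looked roughly like the following sketch, carrying only the top-level `version` field that is later defaulted into the single-item `versions` list shown above:
```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  # Legacy single-version form: no "versions" list yet.
  version: v1
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
    shortNames:
    - ct
```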
**Step 2**: Create a CR within that one version:
|cr1.yaml|
|-|
```yaml
apiVersion: "stable.example.com/v1"
kind: CronTab
metadata:
name: my-new-cron-object
spec:
cronSpec: "* * * * */5"
image: my-awesome-cron-image
```
And create it:
```bash
kubectl create -f cr1.yaml
```
**Step 3**: Decide to introduce a new version of the API.
**Step 3a**: Write a new OpenAPI data model for the new version (see previous section). Use of a data model is not required, but it is recommended.
**Step 3b**: Write a conversion webhook and deploy it as a service named `crontab_conversion`.
See the "crontab_conversion.go" file in the previous section.
**Step 3c**: Update the CRD to add the second version.
Do this by adding a new item to the "versions" list, containing the new data model:
|crd2.yaml|
|-|
```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  versions:
  - name: v1
    served: true
    storage: false
    schema:
      properties:
        spec:
          properties:
            cronSpec:
              type: string
            image:
              type: string
  - name: v2
    served: true
    storage: true
    schema:
      properties:
        spec:
          properties:
            min:
              type: string
            hour:
              type: string
            dayOfMonth:
              type: string
            month:
              type: string
            dayOfWeek:
              type: string
            image:
              type: string
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
    shortNames:
    - ct
  conversion:
    strategy: external
    webhook:
      client_config:
        namespace: crontab
        service: crontab_conversion
        path: /crontab_convert
```
And apply it:
```bash
kubectl apply -f crd2.yaml
```
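At this point, reading the existing v1-stored object through the v2 endpoint exercises the conversion webhook; for example, using kubectl's fully-qualified resource syntax:
```bash
# Fetch the v1-stored object via the v2 API; the API server calls the
# conversion webhook to return it in v2 form.
kubectl get crontabs.v2.stable.example.com my-new-cron-object -o yaml
```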
**Step 4**: Add a new CR in v2:
|cr2.yaml|
|-|
```yaml
apiVersion: "stable.example.com/v2"
kind: CronTab
metadata:
name: my-second-cron-object
spec:
min: "*"
hour: "*"
day_of_month: "*"
dayOfWeek: "*/5"
month: "*"
image: my-awesome-cron-image
```
And create it:
```bash
kubectl create -f cr2.yaml
```
**Step 5**: Storage now holds custom resources in two different versions. To downgrade to the previous CRD, one could apply crd1.yaml, but that will fail because `status.storedVersions` contains both v1 and v2, and those cannot be removed from the `spec.versions` list. To downgrade, first create a crd2-b.yaml file that sets v1 as the storage version and apply it, then follow "*Upgrade existing objects to a new stored version*" in [this document](https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definition-versioning/). After all CRs in storage are at the v1 version, you can apply crd1.yaml.
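For illustration, crd2-b.yaml could be identical to crd2.yaml except for the storage flags; only the changed excerpt is sketched here, not a complete manifest:
```yaml
# crd2-b.yaml (excerpt): same as crd2.yaml, but v1 becomes the storage version
# again so that newly written objects are persisted as v1.
versions:
- name: v1
  served: true
  storage: true
- name: v2
  served: true
  storage: false
```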
**Step 5 alternative**: Create a crd1-b.yaml that keeps v2 in the versions list but leaves it neither served nor stored.
|crd1-b.yaml|
|-|
```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com
spec:
  group: stable.example.com
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      properties:
        spec:
          properties:
            cronSpec:
              type: string
            image:
              type: string
  - name: v2
    served: false
    storage: false
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
    shortNames:
    - ct
  conversion:
    strategy: external
    webhook:
      client_config:
        namespace: crontab
        service: crontab_conversion
        path: /crontab_convert
```
## Alternatives Considered
Other than webhook conversion, a declarative conversion mechanism was also considered and discussed. The main operator discussed was Rename/Move. This section explains why webhooks were chosen over declarative conversion. This does not mean the declarative approach will never be supported, but webhook conversion would be the first conversion method Kubernetes supports.
### Webhooks vs Declarative
The table below compares the webhook and declarative approaches in detail.
<table>
<tr>
<td></td>
<td>Webhook</td>
<td>Declarative</td>
</tr>
<tr>
<td>1. Limitations</td>
<td>There is no limitation on the type of conversion the CRD author can do.</td>
<td>Only a very limited set of conversions would be provided.</td>
</tr>
<tr>
<td>2. User Complexity</td>
<td>Harder to implement, and the author needs to run an HTTP server. This can be made simpler using tools such as Kubebuilder.</td>
<td>Easy to use, as the conversion rules live in the YAML configuration file.</td>
</tr>
<tr>
<td>3. Design Complexity</td>
<td>Because the API server calls out to an external webhook, there is no need to design specific conversion operators.</td>
<td>Designing declarative conversions can be tricky, especially if they change the values of fields. Challenges include meeting the round-trip-ability requirement, arguing for the usefulness of each operator, and keeping it simple enough for a declarative system.</td>
</tr>
<tr>
<td>4. Performance</td>
<td>Several calls to the webhook for one operation (e.g. apply) might hit performance issues. A monitoring metric helps measure this for later improvements, such as batch conversion.</td>
<td>Implemented directly in the API server, so there are no performance concerns.</td>
</tr>
<tr>
<td>5. User mistakes</td>
<td>Users have the freedom to implement any kind of conversion, which may not conform to our API conventions (e.g. round-trip-ability). If the conversion is not reversible, old clients may fail and downgrades will also be at risk.</td>
<td>Keeping the conversion operators sane and sound would not be the user's problem. For operations like rename/move there is already a design that keeps round-trip-ability, but that could be tricky for other operations.</td>
</tr>
<tr>
<td>6. Popularity</td>
<td>Because of the freedom webhooks give in conversion, they would probably be more popular.</td>
<td>The limited set of declarative operators makes it a safer but less popular choice, at least in the early stages of CRD development.</td>
</tr>
<tr>
<td>7. CRD Development Cycles</td>
<td>Fits well into the CRD development story: start with schema-less (blob-store) CRDs, then add a schema, then add webhook conversion for freedom of conversion, and finally move as much as possible to declarative conversion for safer production use.</td>
<td>Comes after webhooks in the CRD development cycle.</td>
</tr>
</table>
Webhook conversion imposes fewer limitations on the authors of APIs using CRDs, which is desirable especially in the early stages of development. Although there is a chance of user mistakes and a webhook may look more complex to implement, both can be mitigated with good tools and libraries such as Kubebuilder. Overall, webhook conversion is the clear winner here. The declarative approach may be considered at a later stage as an alternative, but it needs to be carefully designed.
### Caching
* Use HTTP caching conventions (Cache-Control, ETags, and a unique URL for each different request). This requires more complexity for the webhook author; a rough sketch follows this list. This change could be considered as part of an update to all 5 or so kinds of webhooks, but it is not justified for just this one kind of webhook.
* The CRD object could have a "conversionWebhookVersion" field which the user can increment/change when upgrading/downgrading the webhook to force invalidation of cached objects.
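For illustration only, the first option would mean the webhook host sets standard HTTP caching headers on its responses; a rough sketch (the max-age value and the ETag derivation are arbitrary assumptions, and `crypto/sha256`, `encoding/hex`, `encoding/json`, and `net/http` are assumed to be imported):
```golang
// writeCachedConversionResponse is illustrative: it emits HTTP caching headers
// on a conversion response so an HTTP-aware cache could reuse identical answers.
func writeCachedConversionResponse(w http.ResponseWriter, response ConversionResponse) {
	body, err := json.Marshal(response)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	// A content-derived ETag plus a max-age makes identical requests cacheable.
	sum := sha256.Sum256(body)
	w.Header().Set("ETag", `"`+hex.EncodeToString(sum[:])+`"`)
	w.Header().Set("Cache-Control", "max-age=60")
	w.Header().Set("Content-Type", "application/json")
	w.Write(body)
}
```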
## Advice to Users
* A proper webhook host implementation should accept every supported version as both an input and an output version.
* It should also be able to round trip between versions. E.g. converting an object from v1 to v2 and back to v1 should yield the same object.
* Consider testing your conversion webhook with a fuzz tester that generates random valid objects.
* The webhook should always give the same response for the same request; this allows the API server to potentially cache the responses in the future (modulo bug fixes: when an update is pushed that fixes a bug in the conversion operation, it might not take effect for a few minutes).
* If you need to add a new field, just add it. You don't need a new schema to add a field.
* Webhook Hosts should be side-effect free.
* Webhook Hosts should not expect to see every conversion operation. Some may be cached in the future.
* Toolkits like KubeBuilder and OperatorKit may assist users in using this new feature by:
* having a place in their file hierarchy to define multiple schemas for the same kind.
* having a place in their code templates to define a conversion function.
* generating a full Webhook Host from a conversion function.
* helping users create tests by writing directories containing sample yamls of an object in various versions.
* using fuzzing to generate random valid objects and checking if they convert.
## Test and Documentation Plan
* Test the upgrade/rollback scenarios below.
* Test conversion; refer to the test cases section.
* Document CRD conversion and best practices for webhook conversion.
* Document for CRD users how to upgrade and downgrade (the change-storage-version dance, and changes to the CRD's stored versions).
### Upgrade/Rollback Scenarios
Scenario 1: Upgrading an Operator to have more versions.
* Detect if the cluster version supports webhook conversion
* A Helm chart can require e.g. v1.12 of the Kubernetes API server (a sketch follows below).
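For example, a chart could declare a minimum Kubernetes version in its Chart.yaml; the chart name and version below are hypothetical:
```yaml
# Chart.yaml (excerpt) for a hypothetical "foo-operator" chart: refuse to
# install on clusters older than v1.12, where webhook conversion is unavailable.
name: foo-operator
version: 0.2.0
kubeVersion: ">=1.12.0"
```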
Scenario 2: Rolling back to a previous version of API Server that does not support CRD Conversions
* I have a cluster
* I use apiserver v1.11.x, which supports multiple versions of a CRD without conversion
* I start to use CRDs
* I install helm chart "Foo-Operator", which installs a CRD for resource Foo, with 1 version called v1beta1.
* This uses the old "version" and "
* I create some Foo resources.
* I upgrade apiserver to v1.12.x
* version-conversion now supported.
* I upgrade the Foo-Operator chart.
* This changes the CRD to have two versions, v1beta1 and v1beta2.
* It installs a Webhook Host to convert them.
* Assume: v1beta1 is still the storage version.
* I start using multiple versions, so that the CRs are now stored in a mix of versions.
* I downgrade kube-apiserver
* Emergency happens, I need to downgrade to v1.11.x. Conversion won't be possible anymore.
* Downgrade
* Any call that needs conversion should fail at this stage (we need to patch 1.11 for this; see issue [#65790](https://github.com/kubernetes/kubernetes/issues/65790))
### Test Cases
* Updating existing CRD to use multiple versions with conversion
* Define a CRD with one version.
* Create stored CRs.
* Update the CRD object to add another (non-storage) version with a conversion webhook
* Existing CRs are not harmed
* Can get existing CRs via the new API; the conversion webhook should be called
* Can create new CRs with the new API; the conversion webhook should be called
* Access new CRs with the new API; the conversion webhook should not be called
* Access new CRs with the old API; the conversion webhook should be called
## Development Plan
Google is able to staff development, testing, review, and documentation. Help is welcome too, especially with reviewing.
Not in scope for this work:
* Including CRDs in the aggregated OpenAPI spec (formerly known as swagger.json).
* Apply for CRDs
* Making CRDs powerful enough to convert any or all core types to CRDs (this work is in line with that goal, but it is only a step towards it).
### Work items
* Add APIs for conversion webhooks in CustomResourceDefinition type.
* Support multi-version schema (formerly called validation)
* Support multi-version subresources and AdditionalPrintColumns
* Add a Webhook converter call as a CRD converter (refactor conversion code as needed)
* Ensure webhook latency can be monitored (see the Monitorability section)
* Add Upgrade/Downgrade tests
* Add public documentation

View File

@ -0,0 +1,100 @@
CRD Versioning
=============
The objective of this design document is to provide machinery for Custom Resource Definition authors to define different resource versions and a conversion mechanism between them.
# **Background**
Custom Resource Definitions ([CRDs](https://kubernetes.io/docs/concepts/api-extension/custom-resources/)) are a popular mechanism for extending Kubernetes, due to their ease of use compared with the main alternative of building an Aggregated API Server. They are, however, lacking a very important feature that all other Kubernetes objects support: versioning. Today, each CR can only have one version and there is no clear way for authors to advance their resources to a newer version other than creating a completely new CRD and converting everything manually in sync with their client software.
This document proposes a mechanism to support multiple CRD versions. A few alternatives are also explored in [this document](https://docs.google.com/document/d/1Ucf7JwyHpy7QlgHIN2Rst_q6yT0eeN9euzUV6kte6aY).
**Goals:**
* Support versioning on API level
* Support conversion mechanism between versions
* Support ability to change storage version
* Support Validation/OpenAPI schema for all versions: All versions should have a schema. This schema can be provided by user or derived from a single schema.
**Non-Goals:**
* Support cohabitation (i.e. no group/kind move)
# **Proposed Design**
The basis of the design is a system that supports versioning but no conversion. The API here is designed in a way that can be extended with conversion later.
In summary, the CRD will support a list of versions that includes the current version. One of these versions can be flagged as the storage version, and all versions ever marked as the storage version will be listed in a StoredVersions field in the Status object, to enable authors to plan a migration for their stored objects.
The current `Version` field is planned to be deprecated in a later release and will be used to pre-populate the `Versions` field (the `Versions` field will be defaulted to a single version, constructed from the top-level `Version` field). The `Version` field will also be mutable, to give authors a way to remove it from the list.
```golang
// CustomResourceDefinitionSpec describes how a user wants their resource to appear
type CustomResourceDefinitionSpec struct {
// Group is the group this resource belongs in
Group string
// Version is the version this resource belongs in
// must always be the first item in the Versions field if provided.
Version string
// Names are the names used to describe this custom resource
Names CustomResourceDefinitionNames
// Scope indicates whether this resource is cluster or namespace scoped. Default is namespaced
Scope ResourceScope
// Validation describes the validation methods for CustomResources
Validation *CustomResourceValidation
// ***************
// ** New Field **
// ***************
// Versions is the list of all supported versions for this resource.
// Validation: All versions must use the same validation schema for now. i.e., top
// level Validation field is applied to all of these versions.
// Order: The order of these versions is used to determine the order in discovery API
// (preferred version first).
// The versions in this list may not be removed if they are in
// CustomResourceDefinitionStatus.StoredVersions list.
Versions []CustomResourceDefinitionVersion
}
// ***************
// ** New Type **
// ***************
type CustomResourceDefinitionVersion struct {
// Name is the version name, e.g. "v1", "v2beta1", etc.
Name string
// Served is a flag enabling/disabling this version from being served via REST APIs
Served bool
// Storage flags the release as a storage version. There must be exactly one version
// flagged as Storage.
Storage bool
}
```
The Status object will have a list of potential stored versions. This data is necessary to do a storage migration in future (the author can choose to do the migration themselves but there is [a plan](https://docs.google.com/document/d/1eoS1K40HLMl4zUyw5pnC05dEF3mzFLp5TPEEt4PFvsM/edit) to solve the problem of migration, potentially for both standard and custom types).
```golang
// CustomResourceDefinitionStatus indicates the state of the CustomResourceDefinition
type CustomResourceDefinitionStatus struct {
...
// StoredVersions are all versions ever marked as storage in spec. Tracking these
// versions allows a migration path for stored versions in etcd. The field is mutable
// so the migration controller can first make sure a version is certified (i.e. all
// stored objects are of that version) and then remove the rest of the versions from this list.
// None of the versions in this list can be removed from the spec.Versions field.
StoredVersions []string
}
```
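For example, an author planning a storage migration could inspect this list on a live cluster, assuming the field serializes as `storedVersions` and using the crontabs CRD from the earlier example:
```bash
# List every version that has ever been used as the storage version.
kubectl get crd crontabs.stable.example.com \
  -o jsonpath='{.status.storedVersions}'
```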
# **Validation**
Basic validations needed for the `version` field are listed below; a minimal sketch of these checks follows the list.
* `Spec.Version` field exists in `Spec.Versions` field.
* The version defined in the `Spec.Version` field should point to a `Served` version in the `Spec.Versions` list, except when no version is served at all (i.e. all versions in `Spec.Versions` are disabled by `Served` set to `False`). This is for backward compatibility: an old controller expects that version to be served whenever the CRD is served at all. The CRD registration controller should unregister a CRD with no served version.
* None of the versions in `Status.StoredVersions` can be removed from the `Spec.Versions` list.
* Only one of the versions in `Spec.Versions` can be flagged as the `Storage` version.
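A minimal sketch of these checks, written against the structs above (the function name is hypothetical, `fmt` is assumed to be imported, and this is not the actual apiextensions-apiserver implementation):
```golang
// validateVersions illustrates the validation rules listed above.
func validateVersions(spec CustomResourceDefinitionSpec, status CustomResourceDefinitionStatus) error {
	servedByName := map[string]bool{}
	storageCount := 0
	anyServed := false
	for _, v := range spec.Versions {
		servedByName[v.Name] = v.Served
		if v.Served {
			anyServed = true
		}
		if v.Storage {
			storageCount++
		}
	}
	// Rules 1 and 2: Spec.Version must be listed and, for backward
	// compatibility, served, unless no version is served at all.
	if spec.Version != "" {
		served, ok := servedByName[spec.Version]
		if !ok {
			return fmt.Errorf("spec.version %q must appear in spec.versions", spec.Version)
		}
		if anyServed && !served {
			return fmt.Errorf("spec.version %q must be a served version", spec.Version)
		}
	}
	// Rule 3: stored versions may never be dropped from spec.versions.
	for _, stored := range status.StoredVersions {
		if _, ok := servedByName[stored]; !ok {
			return fmt.Errorf("stored version %q may not be removed from spec.versions", stored)
		}
	}
	// Rule 4: exactly one version carries the Storage flag.
	if storageCount != 1 {
		return fmt.Errorf("exactly one version must be flagged as storage, found %d", storageCount)
	}
	return nil
}
```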

View File

@ -20,7 +20,7 @@ admission controller that uses code, rather than configuration, to map the
resource requests and limits of a pod to QoS, and attaches the corresponding
annotation.)
We anticipate a number of other uses for `MetadataPolicy`, such as defaulting
We anticipate a number of other uses for `MetadataPolicy`, such as defaulting
for labels and annotations, prohibiting/requiring particular labels or
annotations, or choosing a scheduling policy within a scheduler. We do not
discuss them in this doc.

View File

@ -267,7 +267,7 @@ ControllerRevisions, this approach is reasonable.
- A revision is considered to be live while any generated Object labeled
with its `.Name` is live.
- This method has the benefit of providing visibility, via the label, to
users with respect to the historical provenance of a generated Object.
users with respect to the historical provenance of a generated Object.
- The primary drawback is the lack of support for using garbage collection
to ensure that only non-live version snapshots are collected.
1. Controllers may also use the `OwnerReferences` field of the
@ -390,7 +390,7 @@ the following command.
### Rollback
For future work, `kubeclt rollout undo` can be implemented in the general case
For future work, `kubectl rollout undo` can be implemented in the general case
as an extension of the [above](#viewing-history ).
```bash

View File

@ -42,7 +42,7 @@ Here are some potential requirements that haven't been covered by this proposal:
- Uptime is critical for each pod of a DaemonSet during an upgrade (e.g. the time
from a DaemonSet pods being killed to recreated and healthy should be < 5s)
- Each DaemonSet pod can still fit on the node after being updated
- Some DaemonSets require the node to be drained before the DeamonSet's pod on it
- Some DaemonSets require the node to be drained before the DaemonSet's pod on it
is updated (e.g. logging daemons)
- DaemonSet's pods are implicitly given higher priority than non-daemons
- DaemonSets can only be operated by admins (i.e. people who manage nodes)

View File

@ -10,11 +10,17 @@
- [Example](#example)
- [Support in Deployment](#support-in-deployment)
- [Deployment Status](#deployment-status)
- [Deployment Version](#deployment-version)
- [Deployment Revision](#deployment-revision)
- [Pause Deployments](#pause-deployments)
- [Perm-failed Deployments](#perm-failed-deployments)
- [Failed Deployments](#failed-deployments)
# Deployment rolling update design proposal
**Author**: @janetkuo
**Status**: implemented
# Deploy through CLI
## Motivation
@ -33,10 +39,10 @@ So, instead, this document proposes another way to support easier deployment man
The followings are operations we need to support for the users to easily managing deployments:
- **Create**: To create deployments.
- **Rollback**: To restore to an earlier version of deployment.
- **Rollback**: To restore to an earlier revision of deployment.
- **Watch the status**: To watch for the status update of deployments.
- **Pause/resume**: To pause a deployment mid-way, and to resume it. (A use case is to support canary deployment.)
- **Version information**: To record and show version information that's meaningful to users. This can be useful for rollback.
- **Revision information**: To record and show revision information that's meaningful to users. This can be useful for rollback.
## Related `kubectl` Commands
@ -51,12 +57,11 @@ Users may use `kubectl scale` or `kubectl autoscale` to scale up and down Deploy
### `kubectl rollout`
`kubectl rollout` supports both Deployment and DaemonSet. It has the following subcommands:
- `kubectl rollout undo` works like rollback; it allows the users to rollback to a previous version of deployment.
- `kubectl rollout undo` works like rollback; it allows the users to rollback to a previous revision of deployment.
- `kubectl rollout pause` allows the users to pause a deployment. See [pause deployments](#pause-deployments).
- `kubectl rollout resume` allows the users to resume a paused deployment.
- `kubectl rollout status` shows the status of a deployment.
- `kubectl rollout history` shows meaningful version information of all previous deployments. See [development version](#deployment-version).
- `kubectl rollout retry` retries a failed deployment. See [perm-failed deployments](#perm-failed-deployments).
- `kubectl rollout history` shows meaningful revision information of all previous deployments. See [development revision](#deployment-revision).
### `kubectl set`
@ -88,7 +93,7 @@ $ kubectl run nginx --image=nginx --replicas=2 --generator=deployment/v1beta1
$ kubectl rollout status deployment/nginx
# Update the Deployment
$ kubectl set image deployment/nginx --container=nginx --image=nginx:<some-version>
$ kubectl set image deployment/nginx --container=nginx --image=nginx:<some-revision>
# Pause the Deployment
$ kubectl rollout pause deployment/nginx
@ -96,11 +101,11 @@ $ kubectl rollout pause deployment/nginx
# Resume the Deployment
$ kubectl rollout resume deployment/nginx
# Check the change history (deployment versions)
# Check the change history (deployment revisions)
$ kubectl rollout history deployment/nginx
# Rollback to a previous version.
$ kubectl rollout undo deployment/nginx --to-version=<version>
# Rollback to a previous revision.
$ kubectl rollout undo deployment/nginx --to-revision=<revision>
```
## Support in Deployment
@ -108,33 +113,39 @@ $ kubectl rollout undo deployment/nginx --to-version=<version>
### Deployment Status
Deployment status should summarize information about Pods, which includes:
- The number of pods of each version.
- The number of pods of each revision.
- The number of ready/not ready pods.
See issue [#17164](https://github.com/kubernetes/kubernetes/issues/17164).
### Deployment Version
### Deployment Revision
We store previous deployment version information in annotations `rollout.kubectl.kubernetes.io/change-source` and `rollout.kubectl.kubernetes.io/version` of replication controllers of the deployment, to support rolling back changes as well as for the users to view previous changes with `kubectl rollout history`.
- `rollout.kubectl.kubernetes.io/change-source`, which is optional, records the kubectl command of the last mutation made to this rollout. Users may use `--record` in `kubectl` to record current command in this annotation.
- `rollout.kubectl.kubernetes.io/version` records a version number to distinguish the change sequence of a deployment's
replication controllers. A deployment obtains the largest version number from its replication controllers and increments the number by 1 upon update or creation of the deployment, and updates the version annotation of its new replication controller.
We store previous deployment revision information in annotations `kubernetes.io/change-cause` and `deployment.kubernetes.io/revision` of ReplicaSets of the Deployment, to support rolling back changes as well as for the users to view previous changes with `kubectl rollout history`.
- `kubernetes.io/change-cause`, which is optional, records the kubectl command of the last mutation made to this rollout. Users may use `--record` in `kubectl` to record current command in this annotation.
- `deployment.kubernetes.io/revision` records a revision number to distinguish the change sequence of a Deployment's
ReplicaSets. A Deployment obtains the largest revision number from its ReplicaSets and increments the number by 1 upon update or creation of the Deployment, and updates the revision annotation of its new ReplicaSet.
When the users perform a rollback, i.e. `kubectl rollout undo`, the deployment first looks at its existing replication controllers, regardless of their number of replicas. Then it finds the one with annotation `rollout.kubectl.kubernetes.io/version` that either contains the specified rollback version number or contains the second largest version number among all the replication controllers (current new replication controller should obtain the largest version number) if the user didn't specify any version number (the user wants to rollback to the last change). Lastly, it
starts scaling up that replication controller it's rolling back to, and scaling down the current ones, and then update the version counter and the rollout annotations accordingly.
When the users perform a rollback, i.e. `kubectl rollout undo`, the Deployment first looks at its existing ReplicaSets, regardless of their number of replicas. Then it finds the one with annotation `deployment.kubernetes.io/revision` that either contains the specified rollback revision number or contains the second largest revision number among all the ReplicaSets (current new ReplicaSet should obtain the largest revision number) if the user didn't specify any revision number (the user wants to rollback to the last change). Lastly, it
starts scaling up that ReplicaSet it's rolling back to, and scaling down the current ones, and then update the revision counter and the rollout annotations accordingly.
Note that a deployment's replication controllers use PodTemplate hashes (i.e. the hash of `.spec.template`) to distinguish from each others. When doing rollout or rollback, a deployment reuses existing replication controller if it has the same PodTemplate, and its `rollout.kubectl.kubernetes.io/change-source` and `rollout.kubectl.kubernetes.io/version` annotations will be updated by the new rollout. At this point, the earlier state of this replication controller is lost in history. For example, if we had 3 replication controllers in
deployment history, and then we do a rollout with the same PodTemplate as version 1, then version 1 is lost and becomes version 4 after the rollout.
Note that ReplicaSets are distinguished by PodTemplate (i.e. `.spec.template`). When doing a rollout or rollback, a Deployment reuses existing ReplicaSet if it has the same PodTemplate, and its `kubernetes.io/change-cause` and `deployment.kubernetes.io/revision` annotations will be updated by the new rollout. All previous of revisions of this ReplicaSet will be kept in the annotation `deployment.kubernetes.io/revision-history`. For example, if we had 3 ReplicaSets in
Deployment history, and then we do a rollout with the same PodTemplate as revision 1, then revision 1 is lost and becomes revision 4 after the rollout, and the ReplicaSet that once represented revision 1 will then have an annotation `deployment.kubernetes.io/revision-history=1`.
To make deployment versions more meaningful and readable for the users, we can add more annotations in the future. For example, we can add the following flags to `kubectl` for the users to describe and record their current rollout:
To make Deployment revisions more meaningful and readable for users, we can add more annotations in the future. For example, we can add the following flags to `kubectl` for the users to describe and record their current rollout:
- `--description`: adds `description` annotation to an object when it's created to describe the object.
- `--note`: adds `note` annotation to an object when it's updated to record the change.
- `--commit`: adds `commit` annotation to an object with the commit id.
### Pause Deployments
Users sometimes need to temporarily disable a deployment. See issue [#14516](https://github.com/kubernetes/kubernetes/issues/14516).
Users sometimes need to temporarily disable a Deployment. See issue [#14516](https://github.com/kubernetes/kubernetes/issues/14516).
### Perm-failed Deployments
For more details, see [pausing and resuming a
Deployment](https://kubernetes.io/docs/user-guide/deployments/#pausing-and-resuming-a-deployment).
The deployment could be marked as "permanently failed" for a given spec hash so that the system won't continue thrashing on a doomed deployment. The users can retry a failed deployment with `kubectl rollout retry`. See issue [#14519](https://github.com/kubernetes/kubernetes/issues/14519).
### Failed Deployments
The Deployment could be marked as "failed" when it gets stuck trying to deploy
its newest ReplicaSet without completing within the given deadline (specified
with `.spec.progressDeadlineSeconds`), see document about
[failed Deployment](https://kubernetes.io/docs/user-guide/deployments/#failed-deployment).

View File

@ -143,7 +143,7 @@ For each creation or update for a Deployment, it will:
is the one that the new RS uses and collisionCount is a counter in the DeploymentStatus
that increments every time a [hash collision](#hashing-collisions) happens (hash
collisions should be rare with fnv).
- If the RSs and pods dont already have this label and selector:
- If the RSs and pods don't already have this label and selector:
- We will first add this to RS.PodTemplateSpec.Metadata.Labels for all RSs to
ensure that all new pods that they create will have this label.
- Then we will add this label to their existing pods
@ -197,7 +197,7 @@ For example, consider the following case:
Users can pause/cancel a rollout by doing a non-cascading deletion of the Deployment
before it is complete. Recreating the same Deployment will resume it.
For example, consider the following case:
- User creats a Deployment to perform a rolling-update for 10 pods from image:v1 to
- User creates a Deployment to perform a rolling-update for 10 pods from image:v1 to
image:v2.
- User then deletes the Deployment while the old and new RSs are at 5 replicas each.
User will end up with 2 RSs with 5 replicas each.

View File

@ -61,7 +61,7 @@ think about it.
about uniqueness, just labeling for user's own reasons.
- Defaulting logic sets `job.spec.selector` to
`matchLabels["controller-uid"]="$UIDOFJOB"`
- Defaulting logic appends 2 labels to the `.spec.template.metadata.labels`.
- Defaulting logic appends 2 labels to the `.spec.template.metadata.labels`.
- The first label is controller-uid=$UIDOFJOB.
- The second label is "job-name=$NAMEOFJOB".

View File

@ -304,7 +304,7 @@ as follows.
should be consistent with the version indicated by `Status.UpdateRevision`.
1. If the Pod does not meet either of the prior two conditions, and if
ordinal is in the sequence `[0, .Spec.UpdateStrategy.Partition.Ordinal)`,
it should be consistent with the version indicated by
it should be consistent with the version indicated by
`Status.CurrentRevision`.
1. Otherwise, the Pod should be consistent with the version indicated
by `Status.UpdateRevision`.
@ -446,7 +446,7 @@ object if any of the following conditions are true.
1. `.Status.UpdateReplicas` is negative or greater than `.Status.Replicas`.
## Kubectl
Kubectl will use the `rollout` command to control and provide the status of
Kubectl will use the `rollout` command to control and provide the status of
StatefulSet updates.
- `kubectl rollout status statefulset <StatefulSet-Name>`: displays the status
@ -648,7 +648,7 @@ spec:
### Phased Roll Outs
Users can create a canary using `kubectl apply`. The only difference between a
[canary](#canaries) and a phased roll out is that the
`.Spec.UpdateStrategy.Partition.Ordinal` is set to a value less than
`.Spec.UpdateStrategy.Partition.Ordinal` is set to a value less than
`.Spec.Replicas-1`.
```yaml
@ -747,7 +747,7 @@ kubectl rollout undo statefulset web
### Rolling Forward
Rolling back is usually the safest, and often the fastest, strategy to mitigate
deployment failure, but rolling forward is sometimes the only practical solution
for stateful applications (e.g. A users has a minor configuration error but has
for stateful applications (e.g. A user has a minor configuration error but has
already modified the storage format for the application). Users can use
sequential `kubectl apply`'s to update the StatefulSet's current
[target state](#target-state). The StatefulSet's `.Spec.GenerationPartition`
@ -810,7 +810,7 @@ intermittent compaction as a form of garbage collection. Applications that use
log structured merge trees with size tiered compaction (e.g Cassandra) or append
only B(+/*) Trees (e.g Couchbase) can temporarily double their storage requirement
during compaction. If there is insufficient space for compaction
to progress, these applications will either fail or degrade until
to progress, these applications will either fail or degrade until
additional capacity is added. While, if the user is using AWS EBS or GCE PD,
there are valid manual workarounds to expand the size of a PD, it would be
useful to automate the resize via updates to the StatefulSet's

View File

@ -65,7 +65,7 @@ The project is committed to the following (aspirational) [design ideals](princip
approach is key to the systems self-healing and autonomic capabilities.
* _Advance the state of the art_. While Kubernetes intends to support non-cloud-native
applications, it also aspires to advance the cloud-native and DevOps state of the art, such as
in the [participation of applications in their own management](http://blog.kubernetes.io/2016/09/cloud-native-application-interfaces.html).
in the [participation of applications in their own management](https://kubernetes.io/blog/2016/09/cloud-native-application-interfaces/).
However, in doing
so, we strive not to force applications to lock themselves into Kubernetes APIs, which is, for
example, why we prefer configuration over convention in the [downward API](https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/#the-downward-api).
@ -221,7 +221,7 @@ Kubelet does not link in the base container runtime. Instead, we're defining a
underlying runtime and facilitate pluggability of that layer.
This decoupling is needed in order to maintain clear component boundaries, facilitate testing, and facilitate pluggability.
Runtimes supported today, either upstream or by forks, include at least docker (for Linux and Windows),
[rkt](https://kubernetes.io/docs/getting-started-guides/rkt/),
[rkt](https://github.com/rkt/rkt),
[cri-o](https://github.com/kubernetes-incubator/cri-o), and [frakti](https://github.com/kubernetes/frakti).
#### Kube Proxy

View File

@ -0,0 +1,395 @@
# Declarative application management in Kubernetes
> This article was authored by Brian Grant (bgrant0607) on 8/2/2017. The original Google Doc can be found here: [https://goo.gl/T66ZcD](https://goo.gl/T66ZcD)
Most users will deploy a combination of applications they build themselves, also known as **_bespoke_** applications, and **common off-the-shelf (COTS)** components. Bespoke applications are typically stateless application servers, whereas COTS components are typically infrastructure (and frequently stateful) systems, such as databases, key-value stores, caches, and messaging systems.
In the case of the latter, users sometimes have the choice of using hosted SaaS products that are entirely managed by the service provider and are therefore opaque, also known as **_blackbox_** *services*. However, they often run open-source components themselves, and must configure, deploy, scale, secure, monitor, update, and otherwise manage the lifecycles of these **_whitebox_** *COTS applications*.
This document proposes a unified method of managing both bespoke and off-the-shelf applications declaratively using the same tools and application operator workflow, while leveraging developer-friendly CLIs and UIs, streamlining common tasks, and avoiding common pitfalls. The approach is based on observations of several dozen configuration projects and hundreds of configured applications within Google and in the Kubernetes ecosystem, as well as quantitative analysis of Borg configurations and work on the Kubernetes [system architecture](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture.md), [API](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md), and command-line tool ([kubectl](https://github.com/kubernetes/community/wiki/Roadmap:-kubectl)).
The central idea is that a toolbox of composable configuration tools should manipulate configuration data in the form of declarative API resource specifications, which serve as a [declarative data model](https://docs.google.com/document/d/1RmHXdLhNbyOWPW_AtnnowaRfGejw-qlKQIuLKQWlwzs/edit#), not express configuration as code or some other representation that is restrictive, non-standard, and/or difficult to manipulate.
## Declarative configuration
Why the heavy emphasis on configuration in Kubernetes? Kubernetes supports declarative control by specifying users' desired intent. The intent is carried out by asynchronous control loops, which interact through the Kubernetes API. This declarative approach is critical to the system's self-healing, autonomic capabilities, and application updates. This approach is in contrast to manual imperative operations or flowchart-like orchestration.
This is aligned with the industry trend towards [immutable infrastructure](http://thenewstack.io/a-brief-look-at-immutable-infrastructure-and-why-it-is-such-a-quest/), which facilitates predictability, reversibility, repeatability, scalability, and availability. Repeatability is even more critical for containers than for VMs, because containers typically have lifetimes that are measured in days, hours, even minutes. Production container images are typically built from configurable/scripted processes and have parameters overridden by configuration rather than modifying them interactively.
What form should this configuration take in Kubernetes? The requirements are as follows:
* Perhaps somewhat obviously, it should support **bulk** management operations: creation, deletion, and updates.
* As stated above, it should be **universal**, usable for both bespoke and off-the-shelf applications, for most major workload categories, including stateless and stateful, and for both development and production environments. It also needs to be applicable to use cases outside application definition, such as policy configuration and component configuration.
* It should **expose** the full power of Kubernetes (all CRUD APIs, API fields, API versions, and extensions), be **consistent** with concepts and properties presented by other tools, and should **teach** Kubernetes concepts and API, while providing a **bridge** for application developers that prefer imperative control or that need wizards and other tools to provide an onramp for beginners.
* It should feel **native** to Kubernetes. There is a place for tools that work across multiple platforms but which are native to another platform and for tools that are designed to work across multiple platforms but are native to none, but such non-native solutions would increase complexity for Kubernetes users by not taking full advantage of Kubernetes-specific mechanisms and conventions.
* It should **integrate** with key user tools and workflows, such as continuous deployment pipelines and application-level configuration formats, and **compose** with built-in and third-party API-based automation, such as [admission control](https://kubernetes.io/docs/admin/admission-controllers/), autoscaling, and [Operators](https://coreos.com/operators). In order to do this, it needs to support **separation of concerns** by supporting multiple distinct configuration sources and preserving declarative intent while allowing automatically set attributes.
* In particular, it should be straightforward (but not required) to manage declarative intent under **version control**, which is [standard industry best practice](http://martinfowler.com/bliki/InfrastructureAsCode.html) and what Google does internally. Version control facilitates reproducibility, reversibility, and an audit trail. Unlike generated build artifacts, configuration is primarily human-authored, or at least it is desirable for it to be human-readable, and it is typically changed with a human in the loop, as opposed to fully automated processes, such as autoscaling. Version control enables the use of familiar tools and processes for change control, review, and conflict resolution.
* Users need the ability to **customize** off-the-shelf configurations and to instantiate multiple **variants**, without crossing the [line into the ecosystem](https://docs.google.com/presentation/d/1oPZ4rznkBe86O4rPwD2CWgqgMuaSXguIBHIE7Y0TKVc/edit#slide=id.g21b1f16809_5_86) of [configuration domain-specific languages, platform as a service, functions as a service](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#what-kubernetes-is-not), and so on, though users should be able to [layer such tools/systems on top](https://kubernetes.io/blog/2017/02/caas-the-foundation-for-next-gen-paas/) of the mechanism, should they choose to do so.
* We need to develop clear **conventions**, **examples**, and mechanisms that foster **structure**, to help users understand how to combine Kubernetes's flexible mechanisms in an effective manner.
## Configuration customization and variant generation
The requirement that drives the most complexity in typical configuration solutions is the need to be able to customize configurations of off-the-shelf components and/or to instantiate multiple variants.
Deploying an application generally requires customization of multiple categories of configuration:
* Frequently customized
* Context: namespaces, [names, labels](https://github.com/kubernetes/kubernetes/issues/1698), inter-component references, identity
* Image: repository/registry (source), tag (image stream/channel), digest (specific image)
* Application configuration, overriding default values in images: command/args, env, app config files, static data
* Resource parameters: replicas, cpu, memory, volume sources
* Consumed services: coordinates, credentials, and client configuration
* Less frequently customized
* Management parameters: probe intervals, rollout constraints, utilization targets
* Customized per environment
* Environmental adapters: lifecycle hooks, helper sidecars for configuration, monitoring, logging, network/auth proxies, etc
* Infrastructure mapping: scheduling constraints, tolerations
* Security and other operational policies: RBAC, pod security policy, network policy, image provenance requirements
* Rarely customized
* Application topology, which makes up the basic structure of the application: new/replaced components
In order to make an application configuration reusable, users need to be able to customize each of those categories of configuration. There are multiple approaches that could be used:
* Fork: simple to understand; supports arbitrary changes and updates via rebasing, but hard to automate in a repeatable fashion to maintain multiple variants
* Overlay / patch: supports composition and useful for standard transformations, such as setting organizational defaults or injecting environment-specific configuration, but can be fragile with respect to changes in the base configuration
* Composition: useful for orthogonal concerns
* Pull: Kubernetes provides APIs for distribution of application secrets (Secret) and configuration data (ConfigMap), and there is a [proposal open](http://issues.k8s.io/831) to support application data as well
* the resource identity is fixed, by the object reference, but the contents are decoupled
* the explicit reference makes it harder to consume a continuously updated stream of such resources, and harder to generate multiple variants
* can give the PodSpec author some degree of control over the consumption of the data, such as environment variable names and volume paths (though service accounts are at conventional locations rather than configured ones)
* Push: facilitates separation of concerns and late binding
* can be explicit, such as with kubectl set or HorizontalPodAutoscaler
* can be implicit, such as with LimitRange, PodSecurityPolicy, PodPreset, initializers
* good for attaching policies to selected resources within a scope (namespace and/or label selector)
* Transformation: useful for common cases (e.g., names and labels)
* Generation: useful for static decisions, like "if this is a Java app…", which can be integrated into the declarative specification
* Automation: useful for dynamic adaptation, such as horizontal and vertical auto-scaling, improves ease of use and aids encapsulation (by not exposing those details), and can mitigate phase-ordering problems
* Parameterization: natural for small numbers of choices the user needs to make, but there are many pitfalls, discussed below
Rather than relying upon a single approach, we should combine these techniques such that disadvantages are mitigated.
Tools used to customize configuration [within Google](http://queue.acm.org/detail.cfm?id=2898444) have included:
* Many bespoke domain-specific configuration languages ([DSLs](http://flabbergast.org))
* Python-based configuration DSLs (e.g., [Skylark](https://github.com/google/skylark))
* Transliterate configuration DSLs into structured data models/APIs, layered over and under existing DSLs in order to provide a form that is more amenable to automatic manipulation
* Configuration overlay systems, override mechanisms, and template inheritance
* Configuration generators, manipulation CLIs, IDEs, and wizards
* Runtime config databases and spreadsheets
* Several workflow/push/reconciliation engines
* Autoscaling and resource-planning tools
Note that forking/branching generally isn't viable in Google's monorepo.
Despite many projects over the years, some of which have been very widely used, the problem is still considered to be not solved satisfactorily. Our experiences with these tools have informed this proposal, however, as well as the design of Kubernetes itself.
A non-exhaustive list of tools built by the Kubernetes community (see [spreadsheet](https://docs.google.com/spreadsheets/d/1FCgqz1Ci7_VCz_wdh8vBitZ3giBtac_H8SBw4uxnrsE/edit#gid=0) for up-to-date list), in no particular order, follows:
* [Helm](https://github.com/kubernetes/helm)
* [OC new-app](https://docs.openshift.com/online/dev_guide/application_lifecycle/new_app.html)
* [Kompose](https://github.com/kubernetes-incubator/kompose)
* [Spread](https://github.com/redspread/spread)
* [Draft](https://github.com/Azure/draft)
* [Ksonnet](https://github.com/ksonnet/ksonnet-lib)/[Kubecfg](https://github.com/ksonnet/kubecfg)
* [Databricks Jsonnet](https://databricks.com/blog/2017/06/26/declarative-infrastructure-jsonnet-templating-language.html)
* [Kapitan](https://github.com/deepmind/kapitan)
* [Konfd](https://github.com/kelseyhightower/konfd)
* [Templates](https://docs.openshift.com/online/dev_guide/templates.html)/[Ktmpl](https://github.com/InQuicker/ktmpl)
* [Fabric8 client](https://github.com/fabric8io/kubernetes-client)
* [Kubegen](https://github.com/errordeveloper/kubegen)
* [kenv](https://github.com/thisendout/kenv)
* [Ansible](https://docs.ansible.com/ansible/kubernetes_module.html)
* [Puppet](https://forge.puppet.com/garethr/kubernetes/readme)
* [KPM](https://github.com/coreos/kpm)
* [Nulecule](https://github.com/projectatomic/nulecule)
* [Kedge](https://github.com/kedgeproject/kedge) ([OpenCompose](https://github.com/redhat-developer/opencompose) is deprecated)
* [Chartify](https://github.com/appscode/chartify)
* [Podex](https://github.com/kubernetes/contrib/tree/master/podex)
* [k8sec](https://github.com/dtan4/k8sec)
* [kb8or](https://github.com/UKHomeOffice/kb8or)
* [k8s-kotlin-dsl](https://github.com/fkorotkov/k8s-kotlin-dsl)
* [KY](https://github.com/stellaservice/ky)
* [Kploy](https://github.com/kubernauts/kploy)
* [Kdeploy](https://github.com/flexiant/kdeploy)
* [Kubernetes-deploy](https://github.com/Shopify/kubernetes-deploy)
* [Generator-kubegen](https://www.sesispla.net/blog/language/en/2017/07/introducing-generator-kubegen-a-kubernetes-configuration-file-booster-tool/)
* [K8comp](https://github.com/cststack/k8comp)
* [Kontemplate](https://github.com/tazjin/kontemplate)
* [Kexpand](https://github.com/kopeio/kexpand)
* [Forge](https://github.com/datawire/forge/)
* [Psykube](https://github.com/CommercialTribe/psykube)
* [Koki](http://koki.io)
* [Deploymentizer](https://github.com/InVisionApp/kit-deploymentizer)
* [generator-kubegen](https://github.com/sesispla/generator-kubegen)
* [Broadway](https://github.com/namely/broadway)
* [Srvexpand](https://github.com/kubernetes/kubernetes/pull/1980/files)
* [Rok8s-scripts](https://github.com/reactiveops/rok8s-scripts)
* [ERB-Hiera](https://roobert.github.io/2017/08/16/Kubernetes-Manifest-Templating-with-ERB-and-Hiera/)
* [k8s-icl](https://github.com/archipaorg/k8s-icl)
* [sed](https://stackoverflow.com/questions/42618087/how-to-parameterize-image-version-when-passing-yaml-for-container-creation)
* [envsubst](https://github.com/fabric8io/envsubst)
* [Jinja](https://github.com/tensorflow/ecosystem/tree/master/kubernetes)
* [spiff](https://github.com/cloudfoundry-incubator/spiff)
Additionally, a number of continuous deployment systems use their own formats and/or schemas.
The number of tools is a signal of demand for a customization solution, as well as lack of awareness of and/or dissatisfaction with existing tools. [Many prefer](https://news.ycombinator.com/item?id=15029086) to use the simplest tool that meets their needs. Most of these tools support customization via simple parameter substitution or a more complex configuration domain-specific language, while not adequately supporting the other customization strategies. The pitfalls of parameterization and domain-specific languages are discussed below.
### Parameterization pitfalls
After simply forking (or just cut&paste), parameterization is the most commonly used customization approach. We have [previously discussed](https://github.com/kubernetes/kubernetes/issues/11492) requirements for parameterization mechanisms, such as explicit declaration of parameters for easy discovery, documentation, and validation (e.g., for [form generation](https://github.com/kubernetes/kubernetes/issues/6487)). It should also be straightforward to provide multiple sets of parameter values in support of variants and to manage them under version control, though many tools do not facilitate that.
Some existing template examples:
* [Openshift templates](https://github.com/openshift/library/tree/master/official) ([MariaDB example](https://github.com/luciddreamz/library/blob/master/official/mariadb/templates/mariadb-persistent.json))
* [Helm charts](https://github.com/kubernetes/charts/) ([Jenkins example](https://github.com/kubernetes/charts/blob/master/stable/jenkins/templates/jenkins-master-deployment.yaml))
* not Kubernetes, but a [Kafka Mesosphere Universe example](https://github.com/mesosphere/universe/blob/version-3.x/repo/packages/C/confluent-kafka/5/marathon.json.mustache)
Parameterization solutions are easy to implement and to use at small scale, but parameterized templates tend to become complex and difficult to maintain. Syntax-oblivious macro substitution (e.g., sed, jinja, envsubst) can be fragile, and parameter substitution sites generally have to be identified manually, which is tedious and error-prone, especially for the most common use cases, such as resource name prefixing.
Additionally, performing all customization via template parameters erodes template encapsulation. Some prior configuration-language design efforts made encapsulation a non-goal due to the widespread desire of users to override arbitrary parts of configurations. If used by enough people, someone will want to override each value in a template. Parameterizing every value in a template creates an alternative API schema that contains an out-of-date subset of the full API, and when [every value is a parameter](https://github.com/kubernetes/charts/blob/e002378c13e91bef4a3b0ba718c191ec791ce3f9/stable/artifactory/templates/artifactory-deployment.yaml), a template combined with its parameters is considerably less readable than the expanded result, and less friendly to data-manipulation scripts and tools.
### Pitfalls of configuration domain-specific languages (DSLs)
Since parameterization and file imports are common features of most configuration domain-specific languages (DSLs), they inherit the pitfalls of parameterization. The complex custom syntax (and/or libraries) of more sophisticated languages also tends to be more opaque, hiding information such as application topology from humans. Users generally need to understand the input language, transformations applied, and output generated, which is more complex for users to learn. Furthermore, custom-built languages [typically lack good tools](http://mikehadlow.blogspot.com/2012/05/configuration-complexity-clock.html) for refactoring, validation, testing, debugging, etc., and hard-coded translations are hard to maintain and keep up to date. And such syntax typically isn't friendly to tools, for example [hiding information](https://github.com/kubernetes/kubernetes/issues/13241#issuecomment-233731291) about parameters and source dependencies, and is hostile to composition with other tools, configuration sources, configuration languages, runtime automation, and so on. The configuration source must be modified in order to customize additional properties or to add additional resources, which fosters closed, monolithic, fat configuration ecosystems and obstructs separation of concerns. This is especially true of tools and libraries that don't facilitate post-processing of their output between pre-processing the DSL and actuation of the resulting API resources.
Additionally, the more powerful languages make it easy for users to shoot themselves in their feet. For instance, it can be easy to mix computation and data. Among other problems, embedded code renders the configuration unparsable by other tools (e.g., extraction, injection, manipulation, validation, diff, interpretation, reconciliation, conversion) and clients. Such languages also make it easy to reduce boilerplate, which can be useful, but when taken to the extreme, impairs readability and maintainability. Nested/inherited templates are seductive, for those languages that enable them, but very hard to make reusable and maintainable in practice. Finally, it can be tempting to use these capabilities for many purposes, such as changing defaults or introducing new abstractions, but this can create different and surprising behavior compared to direct API usage through CLIs, libraries, UIs, etc., and create accidental pseudo-APIs rather than intentional, actual APIs. If common needs can only be addressed using the configuration language, then the configuration transformer must be invoked by most clients, as opposed to using the API directly, which is contrary to the design of Kubernetes as an API-centric system.
Such languages are powerful and can perform complex transformations, but we found that to be a [mixed blessing within Google](http://research.google.com/pubs/pub44843.html). For instance, there have been many cases where users needed to generate configuration, manipulate configuration, backport altered API field settings into templates, integrate some kind of dynamic automation with declarative configuration, and so on. All of these scenarios were painful to implement with DSL templates in the way. Templates also created new abstractions, changed API default values, and diverged from the API in other ways that disoriented new users.
A few DSLs are in use in the Kubernetes community, including Go templates (used by Helm, discussed more below), [fluent DSLs](https://github.com/fabric8io/kubernetes-client), and [jsonnet](http://jsonnet.org/), which was inspired by [Google's Borg configuration language](https://research.google.com/pubs/pub43438.html) ([more on its root language, GCL](http://alexandria.tue.nl/extra1/afstversl/wsk-i/bokharouss2008.pdf)). [Ksonnet-lib](https://github.com/ksonnet/ksonnet-lib) is a community project aimed at building Kubernetes-specific jsonnet libraries. Unfortunately, the examples (e.g., [nginx](https://github.com/ksonnet/ksonnet-lib/blob/master/examples/readme/hello-nginx.jsonnet)) appear more complex than the raw Kubernetes API YAML, so while it may provide more expressive power, it is less approachable. Databricks looks like [the biggest success case](https://databricks.com/blog/2017/06/26/declarative-infrastructure-jsonnet-templating-language.html) with jsonnet to date, and uses an approach that is admittedly more readable than ksonnet-lib, as is [Kubecfg](https://github.com/ksonnet/kubecfg). However, they all encourage users to author and manipulate configuration code written in a DSL rather than configuration data written in a familiar and easily manipulated format, and are unnecessarily complex for most use cases.
Helm is discussed below, with package management.
In case it's not clear from the above, I do not consider configuration schemas expressed using common data formats such as JSON and YAML (sans use of substitution syntax) to be configuration DSLs.
## Configuration using REST API resource specifications
Given the pitfalls of parameterization and configuration DSLs, as mentioned at the beginning of this document, configuration tooling should manipulate configuration **data**, not convert configuration to code nor other marked-up syntax, and, in the case of Kubernetes, this data should primarily contain specifications of the **literal Kubernetes API resources** required to deploy the application in the manner desired by the user. The Kubernetes API and CLI (kubectl) were designed to support this model, and our documentation and examples use this approach.
[Kubernetes's API](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture.md#cluster-control-plane-aka-master) provides IaaS-like container-centric primitives such as Pods, Services, and Ingress, and also lifecycle controllers to support orchestration (self-healing, scaling, updates, termination) of common types of workloads, such as ReplicaSet (simple fungible/stateless app manager), Deployment (orchestrates updates of stateless apps), Job (batch), CronJob (cron), DaemonSet (cluster services), StatefulSet (stateful apps), and [custom third-party controllers/operators](https://coreos.com/blog/introducing-operators.html). The workload controllers, such as Deployment, support declarative upgrades using production-grade strategies such as rolling update, so that the client doesn't need to perform complex orchestration in the common case. (And we're moving [proven kubectl features to controllers](https://github.com/kubernetes/kubernetes/issues/12143), generally.) We also deliberately decoupled service naming/discovery and load balancing from application implementation in order to maximize deployment flexibility, which should be preserved by the configuration mechanism.
[Kubectl apply](https://github.com/kubernetes/kubernetes/issues/15894) [was designed](https://github.com/kubernetes/kubernetes/issues/1702) ([original proposal](https://github.com/kubernetes/kubernetes/issues/1178)) to support declarative updates without clobbering operationally and/or automatically set desired state. Properties not explicitly specified by the user are free to be changed by automated and other out-of-band mechanisms. Apply is implemented as a 3-way merge of the user's previous configuration, the new configuration, and the live state.
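For example, a user might declaratively manage only the fields they care about. In the illustrative sketch below (names and image are made up), `replicas` is intentionally omitted so that an autoscaler or another controller can own that field without being clobbered by the next kubectl apply:

```yaml
# Illustrative configuration managed with kubectl apply; "replicas" is omitted
# so that out-of-band automation (e.g., a horizontal pod autoscaler) can manage
# it without conflicting with the declared intent.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guestbook
  labels:
    app: guestbook
spec:
  selector:
    matchLabels:
      app: guestbook
  template:
    metadata:
      labels:
        app: guestbook
    spec:
      containers:
      - name: guestbook
        image: gcr.io/example/guestbook:v3
```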
We [chose this simple approach of using literal API resource specifications](https://github.com/kubernetes/kubernetes/pull/1007/files) for the following reasons:
* KISS: It was simple and natural, given that we designed the API to support CRUD on declarative primitives, and Kubernetes uses the API representation in all scenarios where API resources need to be serialized (e.g., in persistent cluster storage).
* It didn't require users to learn two different schemas: the API and a separate configuration format. We believe many/most production users will eventually want to use the API, and knowledge of the API transfers to other clients and tools. It doesn't obfuscate the API, which is relatively easy to read.
* It automatically stays up to date with the API, automatically supports all Kubernetes resources, versions, extensions, etc., and can be automatically converted to new API versions.
* It could share mechanisms with other clients (e.g., Swagger/OpenAPI, which is used for schema validation), which are now supported in several languages: Go, Python, Java, …
* Declarative configuration is only one interface to the system. There are also CLIs (e.g., kubectl), UIs (e.g., dashboard), mobile apps, chat bots, controllers, admission controllers, Operators, deployment pipelines, etc. Those clients will (and should) target the API. The user will need to interact with the system in terms of the API in these other scenarios.
* The API serves as a well-defined intermediate representation, pre- and post-creation, with a documented deprecation policy. Tools, libraries, controllers, UI wizards, etc. can be built on top, leaving room for exploration and innovation within the community. Example API-based transformations include:
* Overlay application: kubectl patch
* Generic resource tooling: kubectl label, kubectl annotate
* Common-case tooling: kubectl set image, kubectl set resources
* Dynamic pod transformations: LimitRange, PodSecurityPolicy, PodPreset
* Admission controllers and initializers
* API-based controllers, higher-level APIs, and controllers driven by custom resources
* Automation: horizontal and [vertical pod autoscaling](https://github.com/kubernetes/community/pull/338)
* It is inherently composable: just add more resource manifests, in the same file or another file. No embedded imports required.
Of course, there are downsides to the approach:
* Users need to learn some API schema details, though we believe operators will want to learn them, anyway.
* The API schema does contain a fair bit of boilerplate, though it could be auto-generated and generally increases clarity.
* The API introduces a significant number of concepts, though they exist for good reasons.
* The API has no direct representation of common generation steps (e.g., generation of ConfigMap or Secret resources from source data), though these can be described in a declarative format using API conventions, as we do with component configuration in Kubernetes.
* It is harder to fix warts in the API than to paper over them. Fixing "bugs" may break compatibility (e.g., as with changing the default imagePullPolicy). However, the API is versioned, so it is not impossible, and fixing the API benefits all clients, tools, UIs, etc.
* JSON is cumbersome and some users find YAML to be error-prone to write. It would also be nice to support a less error-prone data syntax than YAML, such as [Relaxed JSON](https://github.com/phadej/relaxed-json), [HJson](https://hjson.org/), [HCL](https://github.com/hashicorp/hcl), [StrictYAML](https://github.com/crdoconnor/strictyaml/blob/master/FAQ.rst), or [YAML2](https://github.com/yaml/YAML2/wiki/Goals). However, one major disadvantage would be the lack of library support in multiple languages. HCL also wouldn't directly map to our API schema due to our avoidance of maps. Perhaps there are YAML conventions that could result in less error-prone specifications.
## What needs to be improved?
While the basic mechanisms for this approach are in place, a number of common use cases could be made easier. Most user complaints are around discovering what features exist (especially annotations), documentation of and examples using those features, generating/finding skeleton resource specifications (including boilerplate and commonly needed features), formatting and validation of resource specifications, and determining appropriate cpu and memory resource requests and limits. Specific user scenarios are discussed below.
### Bespoke application deployment
Deployment of bespoke applications involves multiple steps:
1. Build the container image
2. Generate and/or modify Kubernetes API resource specifications to use the new image
3. Reconcile those resources with a Kubernetes cluster
Step 1, building the image, is out of scope for Kubernetes. Step 3 is covered by kubectl apply. Some tools in the ecosystem, such as [Draft](https://github.com/Azure/draft), combine the 3 steps.
Kubectl contains ["generator" commands](https://github.com/kubernetes/community/blob/master/contributors/devel/kubectl-conventions.md#generators), such as [kubectl run](https://kubernetes.io/docs/user-guide/kubectl/v1.7/#run), kubectl expose, and various kubectl create subcommands, to create commonly needed Kubernetes resource configurations. However, they don't help users understand current best practices and conventions, such as proper label and annotation usage. This is partly a matter of updating them and partly one of making the generated resources suitable for consumption by new users. Options supporting declarative output, such as dry run, local, export, etc., don't currently produce clean, readable, reusable resource specifications ([example](https://blog.heptio.com/using-kubectl-to-jumpstart-a-yaml-file-heptioprotip-6f5b8a63a3ea)). We should clean them up.
Openshift provides a tool, [oc new-app](https://docs.openshift.com/enterprise/3.1/dev_guide/new_app.html), that can pull source-code templates, [detect application types](https://github.com/kubernetes/kubernetes/issues/14801), and create Kubernetes resources for applications from source and from container images. [podex](https://github.com/kubernetes/contrib/tree/master/podex) was built to extract basic information from an image to facilitate creation of default Kubernetes resources, but hasn't been kept up to date. Similar resource generation tools would be useful for getting started, and even just [validating that the image really exists](https://github.com/kubernetes/kubernetes/issues/12428) would reduce user error.
For updating the image in an existing deployment, kubectl set image works both on the live state and locally. However, we should [make the image optional](https://github.com/kubernetes/kubernetes/pull/47246) in controllers so that the image could be updated independently of kubectl apply, if desired. And, we need to [automate image tag-to-digest translation](https://github.com/kubernetes/kubernetes/issues/33664) ([original issue](https://github.com/kubernetes/kubernetes/issues/1697)), which is the approach we'd expect users to use in production, as opposed to just immediately re-pulling the new image and restarting all existing containers simultaneously. We should keep the original tag in an imageStream annotation, which could eventually become a field.
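As a sketch of that direction (the annotation key and digest below are hypothetical placeholders, not an existing API or convention), a resource could pin the image by digest while preserving the original tag out of band:

```yaml
# Sketch only: the image is pinned to an immutable digest, while the original
# tag is preserved in an annotation. The annotation key and the digest value
# are placeholders for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: guestbook
  annotations:
    example.com/image-stream: gcr.io/example/guestbook:v3
spec:
  containers:
  - name: guestbook
    image: gcr.io/example/guestbook@sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
```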
### Continuous deployment
In addition to PaaSes, such as [Openshift](https://blog.openshift.com/openshift-3-3-pipelines-deep-dive/) and [Deis Workflow](https://github.com/deis/workflow), numerous continuous deployment systems have been integrated with Kubernetes, such as [Google Container Builder](https://github.com/GoogleCloudPlatform/cloud-builders/tree/master/kubectl), [Jenkins](https://github.com/GoogleCloudPlatform/continuous-deployment-on-kubernetes), [Gitlab](https://about.gitlab.com/2016/11/14/idea-to-production/), [Wercker](http://www.wercker.com/integrations/kubernetes), [Drone](https://open.blogs.nytimes.com/2017/01/12/continuous-deployment-to-google-cloud-platform-with-drone/), [Kit](https://invisionapp.github.io/kit/), [Bitbucket Pipelines](https://confluence.atlassian.com/bitbucket/deploy-to-kubernetes-892623297.html), [Codeship](https://blog.codeship.com/continuous-deployment-of-docker-apps-to-kubernetes/), [Shippable](https://www.shippable.com/kubernetes.html), [SemaphoreCI](https://semaphoreci.com/community/tutorials/continuous-deployment-with-google-container-engine-and-kubernetes), [Appscode](https://appscode.com/products/cloud-deployment/), [Kontinuous](https://github.com/AcalephStorage/kontinuous), [ContinuousPipe](https://continuouspipe.io/), [CodeFresh](https://docs.codefresh.io/docs/kubernetes#section-deploy-to-kubernetes), [CloudMunch](https://www.cloudmunch.com/continuous-delivery-for-kubernetes/), [Distelli](https://www.distelli.com/kubernetes/), [AppLariat](https://www.applariat.com/ci-cd-applariat-travis-gke-kubernetes/), [Weave Flux](https://github.com/weaveworks/flux), and [Argo](https://argoproj.github.io/argo-site/#/). Developers usually favor simplicity, whereas operators have more requirements, such as multi-stage deployment pipelines, deployment environment management (e.g., staging and production), and canary analysis. In either case, users need to be able to deploy both updated images and configuration updates, ideally using the same workflow. [Weave Flux](https://github.com/weaveworks/flux) and [Kube-applier](https://blog.box.com/blog/introducing-kube-applier-declarative-configuration-for-kubernetes/) support unified continuous deployment of this style. In other CD systems a unified flow may be achievable by making the image deployment step perform a local kubectl set image (or equivalent) and commit the change to the configuration, and then use another build/deployment trigger on the configuration repository to invoke kubectl apply --prune.
### Migrating from Docker Compose
Some developers like Docker's Compose format as a simplified all-in-one configuration schema, or are at least already familiar with it. Kubernetes supports the format using the [Kompose tool](https://github.com/kubernetes/kompose), which provides an easy migration path for these developers by translating the format to Kubernetes resource specifications.
The Compose format, even with extensions (e.g., replica counts, pod groupings, controller types), is inherently much more limited in expressivity than Kubernetes-native resource specifications, so users would not want to use it forever in production. But it provides a useful onramp, without introducing [yet another schema](https://github.com/kubernetes/kubernetes/pull/1980#issuecomment-60457567) to the community. We could potentially increase usage by including it in a [client-tool release bundle](https://github.com/kubernetes/release/issues/3).
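For illustration, a minimal Compose file along these lines (the service name and image are made up) is the kind of input Kompose translates into a Deployment plus a Service:

```yaml
# Minimal Docker Compose sketch; Kompose would translate the "web" service
# into a Kubernetes Deployment and a Service exposing port 8080.
version: "3"
services:
  web:
    image: gcr.io/example/guestbook:v3
    ports:
    - "8080:8080"
```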
### Reconciliation of multiple resources and multiple files
Most applications require multiple Kubernetes resources. Although kubectl supports multiple resources in a single file, most users store the resource specifications using one resource per file, for a number of reasons:
* It was the approach used by all of our early application-stack examples
* It provides more control by making it easier to specify which resources to operate on
* It's inherently composable -- just add more files
The control issue should be addressed by adding support for selecting the resources to mutate by label selector, name, and resource type, which has been planned from the beginning but hasn't yet been fully implemented. However, we should also [expand and improve kubectl's support for input from multiple files](https://github.com/kubernetes/kubernetes/issues/24649).
### Declarative updates
Kubectl apply (and strategic merge patch, upon which apply is built) has a [number of bugs and shortcomings](https://github.com/kubernetes/kubernetes/issues/35234), which we are fixing, since it is the underpinning of many things (declarative configuration, add-on management, controller diffs). Eventually we need [true API support](https://github.com/kubernetes/kubernetes/issues/17333) for apply so that clients can simply PUT their resource manifests and it can be used as the fundamental primitive for declarative updates for all clients. One of the trickier issues we should address with apply is how to handle [controller selector changes](https://github.com/kubernetes/kubernetes/issues/26202). We are likely to forbid changes for now, as we do with resource name changes.
Kubectl should also operate on resources in an intelligent order when presented with multiple resources. While we've tried to avoid creation-order dependencies, they do exist in a few places, such as with namespaces, custom resource definitions, and ownerReferences.
### ConfigMap and Secret updates
We need a declarative syntax for regenerating [Secrets](https://github.com/kubernetes/kubernetes/issues/24744) and [ConfigMaps](https://github.com/kubernetes/kubernetes/issues/30337) from their source files that could be used with apply, and provide easier ways to [roll out new ConfigMaps and garbage collect unneeded ones](https://github.com/kubernetes/kubernetes/issues/22368). This could be embedded in a manifest file, which we need for "package" metadata (see [Addon manager proposal](https://docs.google.com/document/d/1Laov9RCOPIexxTMACG6Ffkko9sFMrrZ2ClWEecjYYVg/edit) and [Helm chart.yaml](https://github.com/kubernetes/helm/blob/master/docs/charts.md)). There also needs to be an easier way to [generate names of the new resources](https://github.com/kubernetes/kubernetes/pull/49961) and to update references to ConfigMaps and Secrets, such as in env and volumes. This could be done via new kubectl set commands, but users primarily need the “stream” update model, as with images.
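As a purely illustrative sketch of what such a declarative generator entry might look like (none of these field names correspond to an existing API), a manifest could describe how to derive a ConfigMap from source files and let the tooling append a content-derived suffix to the generated name:

```yaml
# Hypothetical manifest entry for declarative ConfigMap generation;
# field names are illustrative, not an existing Kubernetes API.
configMapGenerators:
- name: app-config               # base name; a content-hash suffix could be appended
  files:
  - config/app.properties        # file contents become ConfigMap data
  - config/logging.conf
```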
### Determining success/failure
The declarative, [asynchronous control-loop-based approach](https://docs.google.com/presentation/d/1oPZ4rznkBe86O4rPwD2CWgqgMuaSXguIBHIE7Y0TKVc/edit#slide=id.g21b1f16809_3_155) makes it more challenging for the user to determine whether the change they made succeeded or failed, or whether the system is still converging towards the new desired state. Enough status information needs to be reported such that progress and problems are visible to controllers watching the status, and the status needs to be reported in a consistent enough way that a [general-purpose mechanism](https://github.com/kubernetes/kubernetes/issues/34363) can be built that works for arbitrary API types following Kubernetes API conventions. [Third-party attempts](https://github.com/Mirantis/k8s-AppController#dependencies) to monitor the status generally are not implemented correctly, since Kubernetes's extensible API model requires exposing distributed-system effects to clients. This complexity can be seen all over our [end-to-end tests](https://github.com/kubernetes/kubernetes/blob/master/test/utils/deployment.go#L74), which have been made robust over many thousands of executions. Authors of individual application configurations definitely should not be forced to figure out how to implement such checks, as they currently must in Helm charts (--wait, test).
### Configuration customization
The strategy for customization involves the following main approaches:
1. Fork or simply copy the resource specifications, and then locally modify them, imperatively, declaratively, or manually, in order to reuse off-the-shelf configuration. To facilitate these modifications, we should:
* Automate common customizations, especially [name prefixing and label injection](https://github.com/kubernetes/kubernetes/issues/1698) (including selectors, pod template labels, and object references), which would address the most common substitutions in existing templates
* Fix rough edges for local mutation via kubectl get --export and [kubectl set](https://github.com/kubernetes/kubernetes/issues/21648) ([--dry-run](https://github.com/kubernetes/kubernetes/issues/11488), --local, -o yaml), and enable kubectl to directly update files on disk
* Build fork/branch management tooling for common workflows, such as branch creation, cherrypicking (e.g., to copy configuration changes from a staging to production branch), rebasing, etc., perhaps as a plugin to kubectl.
* Build/improve structural diff, conflict detection, validation (e.g., [kubeval](https://github.com/garethr/kubeval), [ConfigMap element properties](https://github.com/kubernetes/kubernetes/issues/4210)), and comprehension tools
2. Resource overlays, for instantiating multiple variants. Kubectl patch already works locally using strategic merge patch, so the overlays have the same structure as the base resources. The main feature needed to facilitate that is automatic pairing of overlays with the resources they should patch.
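As a sketch (names and values are illustrative), an overlay has the same shape as the resource it patches, so a strategic merge only overrides the fields the overlay mentions:

```yaml
# Overlay patch written in the Deployment's own schema; strategic merge uses
# the container "name" as the merge key, so only the matching container's
# resources and the replica count are overridden.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: guestbook
spec:
  replicas: 5
  template:
    spec:
      containers:
      - name: guestbook
        resources:
          limits:
            memory: 512Mi
```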
Fork provides one-time customization, which is the most common case. Overlay patches provide deploy-time customization. These techniques can be combined with dynamic customization (PodPreset, other admission controllers, third-party controllers, etc.) and run-time customization (initContainers and entrypoint.sh scripts inside containers).
Benefits of these approaches:
* Easier for app developers and operators to build initial configurations (no special template syntax)
* Compatible with existing project tooling and conventions, and easy to read since it doesn't obfuscate the API and doesn't force users to learn a new way to configure their applications
* Supports best practices
* Handles cases the [original configuration author didn't envision](http://blog.shippable.com/the-new-devops-matrix-from-hell)
* Handles cases where original author changes things that break existing users
* Supports composition by adding resources: secrets, configmaps, autoscaling
* Supports injection of operational concerns, such as node affinity/anti-affinity and tolerations
* Supports selection among alternatives, and multiple simultaneous versions
* Supports canaries and multi-cluster deployment
* Usable for [add-on management](https://github.com/kubernetes/kubernetes/issues/23233), by avoiding [obstacles that Helm has](https://github.com/kubernetes/kubernetes/issues/23233#issuecomment-285524825), and should eliminate the need for the EnsureExists behavior
#### What about parameterization?
An area where more investigation is needed is explicit inline parameter substitution, which is [frequently requested](https://stackoverflow.com/questions/44832085/passing-variables-to-args-field-in-a-yaml-file-kubernetes) and has been reinvented many times by the community, even though it is overused and should be rendered unnecessary by the capabilities described above.
A [simple parameterization approach derived from Openshift's design](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/templates.md) was approved because it was constrained in functionality and solved other problems (e.g., instantiation of resource variants by other controllers, [project templates in Openshift](https://github.com/openshift/training/blob/master/content/default-project-template.yaml)). That proposal explains some of the reasoning behind the design tradeoffs, as well as the [use cases](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/templates.md#use-cases). Work started, but was abandoned, though there is an independent [client-based implementation](https://github.com/InQuicker/ktmpl). However, the Template resource wrapped the resource specifications in another object, which is suboptimal, since transformations would then need to be able to deal with standalone resources, Lists of resources, and Templates, or would need to be applied post-instantiation, and it couldn't be represented using multiple files, as users prefer.
What is more problematic is that our client libraries, schema validators, yaml/json parsers/decoders, initializers, and protobuf encodings all require that all specified fields have valid values, so parameters cannot currently be left in non-string (e.g., int, bool) fields in actual resources. Additionally, the API server requires at least complete/final resource names to be specified, and strategic merge also requires all merge keys to be specified. Therefore, some amount of pre-instantiation (though not necessarily client-side) transformation is necessary to create valid resources, and we may want to explicitly store the output, or the fields should just contain the default values initially. Parameterized fields could be automatically converted to patches to produce valid resources. Such a transformation could be made reversible, unlike traditional substitution approaches, since the patches could be preserved (e.g., using annotations). The Template API supported the declaration of parameter names, display names, descriptions, default values, required/optional, and types (string, int, bool, base64), and both string and raw json substitutions. If we were to update that specification, we could use the same mechanism for both parameter validation and ConfigMap validation, so that it could be used both for env substitution and for substitution of the values of other fields. As mentioned in the [env validation issue](https://github.com/kubernetes/kubernetes/issues/4210#issuecomment-305555589), we should consider a subset of [JSON schema](http://json-schema.org/example1.html), which we'll probably use for CRDs. The only [unsupported attribute](https://tools.ietf.org/html/draft-wright-json-schema-validation-00) appears to be the display name, which is non-critical. [Base64 could be represented using media](http://json-schema.org/latest/json-schema-hypermedia.html#rfc.section.5.3.2). That could be useful as a common parameter schema to facilitate parameter discovery and documentation that is independent of the substitution syntax and mechanism ([example from Deployment Manager](https://github.com/GoogleCloudPlatform/deploymentmanager-samples/blob/master/templates/replicated_service.py.schema)).
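As a sketch of what such a common parameter schema might look like, expressed as a subset of JSON Schema written in YAML (the parameter names, types, and defaults are illustrative, not an existing API):

```yaml
# Hypothetical parameter declarations using a JSON Schema subset;
# all names and values are made up for illustration.
parameters:
  replicaCount:
    type: integer
    description: Number of replicas to run
    default: 2
  enableDebug:
    type: boolean
    default: false
required:
- replicaCount
```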
Without parameters, how would we support a click-to-deploy experience? People who are kicking the tires, have undemanding use cases, are learning, etc. are unlikely to know what customization they want to perform initially, if they even need any. The main information users need to provide is the name prefix they want to apply. Otherwise, choosing among a few alternatives would suit their needs better than parameters. The overlay approach should support that pretty well. Beyond that, I suggest kicking users over to a Kubernetes-specific configuration wizard or schema-aware IDE, and/or supporting a fork workflow.
The other application-definition [use cases mentioned in the Template proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/templates.md#use-cases) are achievable without parameterization, as well.
#### What about application configuration generation?
A number of legacy applications have configuration mechanisms that couple application options and information about the deployment environment. In such cases, a ConfigMap containing the configuration data is not sufficient, since the runtime information (e.g., identities, secrets, service addresses) must be incorporated. There are a [number of tools used for this purpose outside Kubernetes](https://github.com/kubernetes/kubernetes/issues/2068). However, in Kubernetes, they would have to be run as Pod initContainers, sidecar containers, or container [entrypoint.sh init scripts](https://github.com/kubernetes/kubernetes/issues/30716). As this is only a need of some legacy applications, we should not complicate Kubernetes itself to solve it. Instead, we should be prepared to recommend a third-party tool, or provide one, and ensure the downward API provides the information it would need.
#### What about [package management](https://medium.com/@sdboyer/so-you-want-to-write-a-package-manager-4ae9c17d9527) and Helm?
[Helm](https://github.com/kubernetes/helm/blob/master/docs/chart_repository.md), [KPM](https://github.com/coreos/kpm), [App Registry](https://github.com/app-registry), [Kubepack](https://kubepack.com/), and [DCOS](https://docs.mesosphere.com/1.7/usage/managing-services/) (for Mesos) bundle whitebox off-the-shelf application configurations into **_packages_**. However, unlike traditional artifact repositories, which store and serve generated build artifacts, configurations are primarily human-authored. As mentioned above, it is industry best practice to manage such configurations using version control systems, and Helm package repositories are backed by source code repositories. (Example: [MariaDB](https://github.com/kubernetes/charts/tree/master/stable/mariadb).)
Advantages of packages:
1. Package formats add structure to raw Kubernetes primitives, which are deliberately flexible and freeform
* Starter resource specifications that illustrate API schema and best practices
* Labels for application topology (e.g., app, role, tier, track, env) -- similar to the goals of [Label Schema](http://label-schema.org/rc1/)
* File organization and manifest (list of files), to make it easier for users to navigate larger collections of application specifications, to reduce the need for tooling to search for information, and to facilitate segregation of resources from other artifacts (e.g., container sources)
* Application metadata: name, authors, description, icon, version, source(s), etc.
* Application lifecycle operations: build, test, debug, up, upgrade, down, etc.
1. [Package registries/repositories](https://github.com/app-registry/spec) facilitate [discovery](https://youtu.be/zGJsXyzE5A8?t=1159) of off-the-shelf applications and of their dependencies
* Scattered source repos are hard to find
* Ideally it would be possible to map the format type to a container containing the tool that understands the format.
Helm is probably the most-used configuration tool other than kubectl, many [application charts](https://github.com/kubernetes/charts) have been developed (as with the [Openshift template library](https://github.com/openshift/library)), and there is an ecosystem growing around it (e.g., [chartify](https://github.com/appscode/chartify), [helmfile](https://github.com/roboll/helmfile), [landscaper](https://github.com/Eneco/landscaper), [draughtsman](https://github.com/giantswarm/draughtsman), [chartmuseum](https://github.com/chartmuseum/chartmuseum)). Helm's users like the familiar analogy to package management and the structure that it provides. However, while Helm is useful and is the most comprehensive tool, it isn't suitable for all use cases, such as [add-on management](https://github.com/kubernetes/kubernetes/issues/23233). The biggest obstacle is that its [non-Kubernetes-compatible API and DSL syntax push it out of Kubernetes proper into the Kubernetes ecosystem](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#what-kubernetes-is-not). And, even though Helm targets only Kubernetes, it takes little advantage of that. Additionally, scenarios we'd like to support better include chart authoring (prefer simpler syntax and more straightforward management under version control), operational customization (e.g., via scripting, [forking](https://github.com/kubernetes/helm/issues/2554), or patching/injection), deployment pipelines (e.g., [canaries](https://groups.google.com/forum/#!topic/kubernetes-sig-apps/ouqXYXdsPYw)), multi-cluster / [multi-environment](https://groups.google.com/d/msg/kubernetes-users/GPaGOGxCDD8/NbNL-NPhCAAJ) deployment, and multi-tenancy.
Helm provides functionality covering several areas:
* Package conventions: metadata (e.g., name, version, descriptions, icons; Openshift has [something similar](https://github.com/luciddreamz/library/blob/master/official/java/templates/openjdk18-web-basic-s2i.json#L10)), labels, file organization
* Package bundling, unbundling, and hosting
* Package discovery: search and browse
* [Dependency management](https://github.com/kubernetes/helm/blob/master/docs/charts.md#chart-dependencies)
* Application lifecycle management framework: build, install, uninstall, upgrade, test, etc.
* a non-container-centric example of that would be [ElasticBox](https://www.ctl.io/knowledge-base/cloud-application-manager/automating-deployments/start-stop-and-upgrade-boxes/)
* Kubernetes drivers for creation, update, deletion, etc.
* Template expansion / schema transformation
* (It's currently lacking a formal parameter schema.)
It's useful for Helm to provide an integrated framework, but the independent functions could be decoupled, and re-bundled into multiple separate tools:
* Package management -- search, browse, bundle, push, and pull of off-the-shelf application packages and their dependencies.
* Application lifecycle management -- install, delete, upgrade, rollback -- and pre- and post- hooks for each of those lifecycle transitions, and success/failure tests.
* Configuration customization via parameter substitution, aka template expansion, aka rendering.
That would enable the package-like structure and conventions to be used with raw declarative management via kubectl or another tool that links in its [business logic](https://github.com/kubernetes/kubernetes/issues/7311), the lifecycle management to be used without the template expansion, and the template expansion to be used in declarative workflows without the lifecycle management. Support for both client-only and server-side operation and migration from grpc to Kubernetes API extension mechanisms would further expand the addressable use cases.
([Newer proposal, presented at the Helm Summit](https://docs.google.com/presentation/d/10dp4hKciccincnH6pAFf7t31s82iNvtt_mwhlUbeCDw/edit#slide=id.p).)
#### What about the service broker?
The [Open Service Broker API](https://openservicebrokerapi.org/) provides a standardized way to provision and bind to blackbox services. It enables late binding of clients to service providers and enables usage of higher-level application services (e.g., caches, databases, messaging systems, object stores) portably, mitigating lock-in and facilitating hybrid and multi-cloud usage of these services, extending the portability of cloud-native applications running on Kubernetes. The service broker is not intended to be a solution for whitebox applications that require any level of management by the user. That degree of abstraction/encapsulation requires full automation, essentially creating a software appliance (cf. [autonomic computing](https://en.wikipedia.org/wiki/Autonomic_computing)): autoscaling, auto-repair, auto-update, automatic monitoring / logging / alerting integration, etc. Operators, initializers, autoscalers, and other automation may eventually achieve this, and we need that for [cluster add-ons](https://github.com/kubernetes/kubernetes/issues/23233) and other [self-hosted components](https://github.com/kubernetes/kubernetes/issues/246), but the typical off-the-shelf application template doesn't achieve that.
#### What about configurations with high cyclomatic complexity or massive numbers of variants?
Consider more automation, such as autoscaling, self-configuration, etc., to reduce the amount of explicit configuration necessary. One could also write a program in some widely used conventional programming language to generate the resource specifications. Such a language is more likely to have IDE support, test frameworks, documentation generators, etc. than a DSL. Better yet, create composable transformations, applying [the Unix Philosophy](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules). In any case, don't look for a silver bullet to solve all configuration-related problems. Decouple solutions instead.
#### What about providing an intentionally restrictive simplified, tailored developer experience to streamline a specific use case, environment, workflow, etc.?
This is essentially a [DIY PaaS](https://kubernetes.io/blog/2017/02/caas-the-foundation-for-next-gen-paas/). Write a configuration generator, either client-side or using CRDs ([example](https://github.com/pearsontechnology/environment-operator/blob/dev/User_Guide.md)). The effort involved to document the format, validate it, test it, etc. is similar to building a new API, but I could imagine someone eventually building an SDK to make that easier.
#### What about more sophisticated deployment orchestration?
Deployment pipelines, [canary deployments](https://groups.google.com/forum/#!topic/kubernetes-sig-apps/ouqXYXdsPYw), [blue-green deployments](https://groups.google.com/forum/#!topic/kubernetes-sig-apps/mwIq9bpwNCA), dependency-based orchestration, event-driven orchestrations, and [workflow-driven orchestration](https://github.com/kubernetes/kubernetes/issues/1704) should be able to use the building blocks discussed in this document. [AppController](https://github.com/Mirantis/k8s-AppController) and [Smith](https://github.com/atlassian/smith) are examples of tools built by the community.
#### What about UI wizards, IDE integration, application frameworks, etc.?
Representing configuration using the literal API types should facilitate programmatic manipulation of the configuration via user-friendly tools, such as UI wizards (e.g., [dashboard](https://kubernetes.io/docs/tasks/access-application-cluster/web-ui-dashboard/#deploying-containerized-applications), [Yipee.io](https://yipee.io/), and many CD tools, such as [Distelli](https://www.distelli.com/docs/k8s/add-container-to-a-project/)) and IDEs (e.g., [VSCode](https://www.youtube.com/watch?v=QfqS9OSVWGs), [IntelliJ](https://github.com/tinselspoon/intellij-kubernetes)), as well as configuration generation and manipulation by application frameworks (e.g., [Spring Cloud](https://github.com/fabric8io/spring-cloud-kubernetes)).

Binary file not shown.


View File

@ -0,0 +1,128 @@
# Kubernetes Resource Management
> This article was authored by Brian Grant (bgrant0607) on 2/20/2018. The original Google Doc can be found [here](https://docs.google.com/document/d/1RmHXdLhNbyOWPW_AtnnowaRfGejw-qlKQIuLKQWlwzs/edit#).
Kubernetes is not just API-driven, but is *API-centric*.
At the center of the Kubernetes control plane is the [apiserver](https://kubernetes.io/docs/admin/kube-apiserver/), which implements common functionality for all of the system's APIs. Both user clients and components implementing the business logic of Kubernetes, called controllers, interact with the same APIs. The APIs are REST-like, supporting primarily CRUD operations on (mostly) persistent resources. All persistent cluster state is stored in one or more instances of the [etcd](https://github.com/coreos/etcd) key-value store.
![apiserver](./images/apiserver.png)
With the growth in functionality over the past four years, the number of built-in APIs has grown by more than an order of magnitude. Moreover, Kubernetes now supports multiple API extension mechanisms that are not only used to add new functionality to Kubernetes itself, but also to provide frameworks for constructing an ecosystem of components, such as [Operators](https://coreos.com/operators/), for managing applications, platforms, infrastructure, and other things beyond the scope of Kubernetes itself. In addition to providing an overview of the common behaviors of built-in Kubernetes API resources, this document attempts to explain the assumptions, expectations, principles, conventions, and goals of the **Kubernetes Resource Model** so as to foster consistency and interoperability within that ecosystem as the uses of its API mechanisms and patterns expand. Any API using the same mechanisms and patterns will automatically work with any libraries and tools (e.g., CLIs, UIs, configuration, deployment, workflow) that have already integrated support for the model, which means that integrating support for N APIs implemented using the model in M tools is merely O(M) work rather than O(NM) work.
## Declarative control
In Kubernetes, declarative abstractions are primary, rather than layered on top of the system. The Kubernetes control plane is analogous to cloud-provider declarative resource-management systems (NOTE: Kubernetes also doesn't bake in templating, for reasons discussed in the last section.), but presents higher-level (e.g., containerized workloads and services), portable abstractions. Imperative operations and flowchart-like workflow orchestration can be built on top of its declarative model, however.
Kubernetes supports declarative control by recording user intent as the desired state in its API resources. This enables a single API schema for each resource to serve as a declarative data model, as both a source and a target for automated components (e.g., autoscalers), and even as an intermediate representation for resource transformations prior to instantiation.
The intent is carried out by asynchronous [controllers](https://github.com/kubernetes/community/blob/master/contributors/devel/controllers.md), which interact through the Kubernetes API. Controllers don't access the state store, etcd, directly, and don't communicate via private direct APIs. Kubernetes itself does expose some features similar to key-value stores such as etcd and [Zookeeper](https://zookeeper.apache.org/), however, in order to facilitate centralized [state and configuration management and distribution](https://sysgears.com/articles/managing-configuration-of-distributed-system-with-apache-zookeeper/) to decentralized components.
Controllers continuously strive to make the observed state match the desired state, and report back their status to the apiserver asynchronously. All of the state, desired and observed, is made visible through the API to users and to other controllers. The API resources serve as coordination points, common intermediate representation, and shared state.
Controllers are level-based (as described [here](http://gengnosis.blogspot.com/2007/01/level-triggered-and-edge-triggered.html) and [here](https://hackernoon.com/level-triggering-and-reconciliation-in-kubernetes-1f17fe30333d)) to maximize fault tolerance, which enables the system to operate correctly just given the desired state and the observed state, regardless of how many intermediate state updates may have been missed. However, they can achieve the benefits of an edge-triggered implementation by monitoring changes to relevant resources via a notification-style watch API, which minimizes reaction latency and redundant work. This facilitates efficient decentralized and decoupled coordination in a more resilient manner than message buses. (NOTE: Polling is simple, and messaging is simple, but neither is ideal. There should be a CAP-like theorem about simultaneously achieving low latency, resilience, and simplicity -- pick any 2. Challenges with using "reliable" messaging for events/updates include bootstrapping consumers, events lost during bus outages, consumers not keeping up, bounding queue state, and delivery to unspecified numbers of consumers.)
## Additional resource model properties
The Kubernetes control-plane design is intended to make the system resilient and extensible, supporting both declarative configuration and automation, while providing a consistent experience to users and clients. Adding functionality that conforms to these objectives should be as easy as defining a new resource type and adding a new controller.
The Kubernetes resource model is designed to reinforce these objectives through its core assumptions (e.g., lack of exclusive control and multiple actors), principles (e.g., transparency and loose coupling), and goals (e.g., composability and extensibility):
* There are few direct inter-component APIs, and no hidden internal resource-oriented APIs. All APIs are visible and available (subject to authorization policy). The distinction between being part of the system and being built on top of the system is deliberately blurred. In order to handle more complex use cases, there's no glass to break. One can just access lower-level APIs in a fully transparent manner, or add new APIs, as necessary.
* Kubernetes operates in a distributed environment, and the control-plane itself may be sharded and distributed (e.g., as in the case of aggregated APIs). Desired state is updated immediately but actuated asynchronously and eventually. Kubernetes does not support atomic transactions across multiple resources and (especially) resource types, pessimistic locking, other durations where declarative intent cannot be updated (e.g., unavailability while busy), discrete synchronous long-running operations, nor synchronous success preconditions based on the results of actuation (e.g., failing to write a new image tag to a PodSpec when the image cannot be pulled). The Kubernetes API also does not provide strong ordering or consistency across multiple resources, and does not enforce referential integrity. Providing stronger semantics would compromise the resilience, extensibility, and observability of the system, while providing less benefit than one might expect, especially given other assumptions, such as the lack of exclusive control and multiple actors. Resources could be modified or deleted immediately after being created. Failures could occur immediately after success, or even prior to apparent success, if not adequately monitored. Caching and concurrency generally obfuscate event ordering. Workflows often involve external, non-transactional resources, such as git repositories and cloud resources. Therefore, graceful tolerance of out-of-order events and problems that could be self-healed automatically is expected. As an example, if a resource can't properly function due to a nonexistent dependent resource, that should be reported as the reason the resource isn't fully functional in the resource's status field.
* Typically each resource specifies a single desired state. However, for safety reasons, changes to that state may not be fully realized immediately. Since progressive transitions (e.g., rolling updates, traffic shifting, data migrations) are dependent on the underlying mechanisms being controlled, they must be implemented for each resource type, as needed. If multiple versions of some desired state need to coexist simultaneously (e.g., previous and next versions), they each need to be represented explicitly in the system. The convention is to generate a separate resource for each version, each with a content-derived generated name. Comprehensive version control is the responsibility of other systems (e.g., git).
* The reported observed state is truth. Controllers are expected to reconcile observed and desired state and repair discrepancies, and Kubernetes avoids maintaining opaque, internal persistent state. Resource status must be reconstructable by observation.
* The current status is represented using as many properties as necessary, rather than being modeled by state machines with explicit, enumerated states. Such state machines are not extensible (states can neither be added nor removed), and they encourage inference of implicit properties from the states rather than representing the properties explicitly.
* Resources are not assumed to have single, exclusive "owners". They may be read and written by multiple parties and/or components, often, but not always, responsible for orthogonal concerns (not unlike [aspects](https://en.wikipedia.org/wiki/Aspect-oriented_programming)). Controllers cannot assume their decisions will not be overridden or rejected, must continually verify assumptions, and should gracefully adapt to external events and/or actors. Example: we allow users to kill pods under control of a controller; it just replaces them.
* Object references are usually represented using predictable, client-provided names, to facilitate loose coupling, declarative references, disaster recovery, deletion and re-creation (e.g., to change immutable properties or to transition between incompatible APIs), and more. They are also represented as tuples (of name, namespace, API version, and resource type, or subsets of those) rather than URLs in order to facilitate inference of reference components from context.
## API topology and conventions
The [API](https://kubernetes.io/docs/reference/api-concepts/) URL structure is of the following form:
<p align="center">
/prefix/group/version/namespaces/namespace/resourcetype/name
</p>
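For example, `/api/v1/namespaces/default/pods/explorer` addresses a Pod named explorer in the default namespace (the core group uses the legacy `/api` prefix rather than `/apis/group`).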
The API is divided into [**groups**](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-groups) of related **resource types** that co-evolve together, in API [**version**s](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#api-versioning). A client may interact with a resource in any supported version for that group and type.
Instances of a given resource type are usually (NOTE: There are a small number of non-namespaced resources, also, which have global scope within a particular API service.) grouped by user-created [**namespaces**](https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/), which scope [**names**](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/), references, and some policies.
All resources contain common **metadata**, including the information contained within the URL path, to enable content-based path discovery. Because the resources are uniform and self-describing, they may be operated on generically and in bulk. The metadata also include user-provided key-value metadata in the form of [**labels**](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) and [**annotations**](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/). Labels are used for filtering and grouping by identifying attributes, and annotations are generally used by extensions for configuration and checkpointing.
Most resources also contain the [desired state (**spec**) and observed state (**status**)](https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/#object-spec-and-status). Status is written using the /status subresource (appended to the standard resource path (NOTE: Note that subresources don't follow the collection-name/collection-item convention. They are singletons.)), using the same API schema, in order to enable distinct authorization policies for users and controllers.
A few other subresources (e.g., `/scale`), with their own API types, similarly enable distinct authorization policies for controllers, and also polymorphism, since the same subresource type may be implemented for multiple parent resource types. Where distinct authorization policies are not required, polymorphism may be achieved simply by convention, using patch, akin to duck typing.
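For instance, the object served by the `/scale` subresource is a small, polymorphic type that looks roughly like the following (values are illustrative); the same shape is served for Deployments, ReplicaSets, StatefulSets, and other scalable resources:

```yaml
# Illustrative Scale subresource object; spec.replicas can be written by
# autoscalers without granting them access to the full parent resource.
apiVersion: autoscaling/v1
kind: Scale
metadata:
  name: guestbook
  namespace: default
spec:
  replicas: 3
status:
  replicas: 3
  selector: app=guestbook
```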
Supported data formats include YAML, JSON, and protocol buffers.
Example resource:
```yaml
apiVersion: v1
kind: Pod
metadata:
  namespace: default
  name: explorer
  labels:
    category: demo
  annotations:
    commit: 483ac937f496b2f36a8ff34c3b3ba84f70ac5782
spec:
  containers:
  - name: explorer
    image: gcr.io/google_containers/explorer:1.1.3
    args: ["-port=8080"]
    ports:
    - containerPort: 8080
      protocol: TCP
status:
```
API groups may be exposed as a unified API surface while being served by distinct [servers](https://kubernetes.io/docs/tasks/access-kubernetes-api/setup-extension-api-server/) using [**aggregation**](https://kubernetes.io/docs/concepts/api-extension/apiserver-aggregation/), which is particularly useful for APIs with special storage needs. However, Kubernetes also supports [**custom resources**](https://kubernetes.io/docs/concepts/api-extension/custom-resources/) (CRDs), which enable users to define new types that fit the standard API conventions without needing to build and run another server. CRDs can be used to make systems declaratively and dynamically configurable in a Kubernetes-compatible manner, without needing another storage system.
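A minimal CRD (shown here using the apiextensions.k8s.io/v1beta1 schema current at the time of writing; the group and kind are made up for illustration) is itself just another declarative resource:

```yaml
# Illustrative CustomResourceDefinition; once created, the apiserver serves
# /apis/stable.example.com/v1/namespaces/*/crontabs endpoints for the new type.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: crontabs.stable.example.com   # must be <plural>.<group>
spec:
  group: stable.example.com
  version: v1
  scope: Namespaced
  names:
    plural: crontabs
    singular: crontab
    kind: CronTab
```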
Each API server supports a custom [discovery API](https://github.com/kubernetes/client-go/blob/master/discovery/discovery_client.go) to enable clients to discover available API groups, versions, and types, and also [OpenAPI](https://kubernetes.io/blog/2016/12/kubernetes-supports-openapi/), which can be used to extract documentation and validation information about the resource types.
See the [Kubernetes API conventions](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md) for more details.
## Resource semantics and lifecycle
Each API resource undergoes [a common sequence of behaviors](https://kubernetes.io/docs/admin/accessing-the-api/) upon each operation. For a mutation, these behaviors include:
1. [Authentication](https://kubernetes.io/docs/admin/authentication/)
2. [Authorization](https://kubernetes.io/docs/admin/authorization/): [Built-in](https://kubernetes.io/docs/admin/authorization/rbac/) and/or [administrator-defined](https://kubernetes.io/docs/admin/authorization/webhook/) identity-based policies
3. [Defaulting](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#defaulting): API-version-specific default values are made explicit and persisted
4. Conversion: The apiserver converts between the client-requested [API version](https://kubernetes.io/docs/concepts/overview/kubernetes-api/#API-versioning) and the version it uses to store each resource type in etcd
5. [Admission control](https://kubernetes.io/docs/admin/admission-controllers/): [Built-in](https://kubernetes.io/docs/admin/admission-controllers/) and/or [administrator-defined](https://kubernetes.io/docs/admin/extensible-admission-controllers/) resource-type-specific policies
6. [Validation](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#validation): Resource field values are validated. Other than the presence of required fields, the API resource schema is not currently validated, but optional validation may be added in the future
7. Idempotence: Resources are accessed via immutable client-provided, declarative-friendly names
8. [Optimistic concurrency](https://github.com/kubernetes/community/blob/master/contributors/devel/api-conventions.md#concurrency-control-and-consistency): Writes may specify a precondition that the **resourceVersion** last reported for a resource has not changed
9. [Audit logging](https://kubernetes.io/docs/tasks/debug-application-cluster/audit/): Records the sequence of changes to each resource by all actors
Additional behaviors are supported upon deletion:
* Graceful termination: Some resources support delayed deletion, which is indicated by **deletionTimestamp** and **deletionGracePeriodSeconds** being set upon deletion
* Finalization: A **finalizer** is a block on deletion placed by an external controller; it must be removed before deletion of the resource can complete
* [Garbage collection](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-machinery/garbage-collection.md): A resource may specify **ownerReferences**, in which case the resource will be deleted once all of the referenced resources have been deleted
And get:
* List: All resources of a particular type within a particular namespace may be requested; [response chunking](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/api-machinery/api-chunking.md) is supported
* [Label selection](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#list-and-watch-filtering): Lists may be filtered by their label keys and values
* Watch: A client may subscribe to changes to listed resources using the resourceVersion returned with the list results
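The lifecycle behaviors above all surface in a resource's common metadata. A sketch, with every value made up for illustration:

```yaml
# Illustrative metadata of a resource undergoing graceful deletion:
# resourceVersion supports optimistic concurrency and watch resumption,
# the deletion timestamp and grace period are set by the apiserver,
# the finalizer blocks completion until its controller removes it, and
# the ownerReference makes the resource eligible for garbage collection
# once its owner is deleted.
metadata:
  name: explorer
  namespace: default
  resourceVersion: "123456"
  deletionTimestamp: "2018-02-20T10:00:00Z"
  deletionGracePeriodSeconds: 30
  finalizers:
  - example.com/cleanup                         # hypothetical finalizer
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: explorer-5d569d8b4f
    uid: 9a1b2c3d-0000-4000-8000-000000000000   # placeholder UID
```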
## Declarative configuration
Kubernetes API resource specifications are designed for humans to directly author and read as declarative configuration data, as well as to enable composable configuration tools and automated systems to manipulate them programmatically. We chose this simple approach of using literal API resource specifications for configuration, rather than other representations, because it was natural, given that we designed the API to support CRUD on declarative primitives. The API schema must already be well defined, documented, and supported. With this approach, there's no other representation to keep up to date with new resources and versions, or to require users to learn. [Declarative configuration](https://goo.gl/T66ZcD) is only one client use case; there are also CLIs (e.g., kubectl), UIs, deployment pipelines, etc. The user will need to interact with the system in terms of the API in these other scenarios, and knowledge of the API transfers to other clients and tools. Additionally, configuration, macro/substitution, and templating languages are generally more difficult to manipulate programmatically than pure data, and involve complexity/expressiveness tradeoffs that prevent any one solution from being ideal for all use cases. Such languages/tools could be layered over the native API schemas, if desired, but they should not assume exclusive control over all API fields, because doing so obstructs automation and creates undesirable coupling with the configuration ecosystem.
The Kubernetes Resource Model encourages separation of concerns by supporting multiple distinct configuration sources and preserving declarative intent while allowing automatically set attributes. Properties not explicitly declaratively managed by the user are free to be changed by other clients, enabling the desired state to be cooperatively determined by both users and systems. This is achieved by an operation, called [**Apply**](https://docs.google.com/document/d/1q1UGAIfmOkLSxKhVg7mKknplq3OTDWAIQGWMJandHzg/edit#heading=h.xgjl2srtytjt) ("make it so"), that performs a 3-way merge of the previous configuration, the new configuration, and the live state. A 2-way merge operation, called [strategic merge patch](https://github.com/kubernetes/community/blob/master/contributors/devel/strategic-merge-patch.md), enables patches to be expressed using the same schemas as the resources themselves. Such patches can be used to perform automated updates without custom mutation operations, common updates (e.g., container image updates), combinations of configurations of orthogonal concerns, and configuration customization, such as for overriding properties of variants.

View File

@ -0,0 +1,239 @@
# Bound Service Account Tokens
Author: @mikedanese
# Objective
This document describes an API that would allow workloads running on Kubernetes
to request JSON Web Tokens that are audience, time and eventually key bound.
# Background
Kubernetes already provisions JWTs to workloads. This functionality is on by
default and thus widely deployed. The current workload JWT system has serious
issues:
1. Security: JWTs are not audience bound. Any recipient of a JWT can masquerade
as the presenter to anyone else.
1. Security: The current model of storing the service account token in a Secret
and delivering it to nodes results in a broad attack surface for the
Kubernetes control plane when powerful components are run - giving a service
account a permission means that any component that can see that service
account's secrets is at least as powerful as the component.
1. Security: JWTs are not time bound. A JWT compromised via 1 or 2 is valid
for as long as the service account exists. This may be mitigated with
service account signing key rotation, but rotation is not supported by
client-go, is not automated by the control plane, and thus is not widely deployed.
1. Scalability: JWTs require a Kubernetes secret per service account.
# Proposal
Infrastructure to support on demand token requests will be implemented in the
core apiserver. Once this API exists, a client of the apiserver will request an
attenuated token for its own use. The API will enforce required attenuations,
e.g. audience and time binding.
## Token attenuations
### Audience binding
Tokens issued from this API will be audience bound. The audience of requested tokens
will be bound by the `aud` claim. The `aud` claim is an array of strings
(usually URLs) that correspond to the intended audience of the token. A
recipient of a token is responsible for verifying that it identifies as one of
the values in the audience claim, and should otherwise reject the token. The
TokenReview API will support this validation.
### Time binding
Tokens issued from this API will be time bound. Time validity of these tokens
will be claimed in the following fields:
* `exp`: expiration time
* `nbf`: not before
* `iat`: issued at
A recipient of a token should verify that the token is valid at the time that
the token is presented, and should otherwise reject the token. The TokenReview
API will support this validation.
Cluster administrators will be able to configure the maximum validity duration
for expiring tokens. During the migration off of the old service account tokens,
clients of this API may request tokens that are valid for many years. These
tokens will be drop-in replacements for the current service account tokens.
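To make these checks concrete, below is a minimal, hedged sketch of the audience and time validation a token recipient (or the TokenReview implementation) would perform on already-verified claims; the claim names come from the sections above, while the Go struct and function names are purely illustrative.
```go
package tokencheck

import (
	"errors"
	"time"
)

// claims carries only the fields needed for the checks described above.
type claims struct {
	Audiences []string // "aud"
	Expiry    int64    // "exp", seconds since the epoch
	NotBefore int64    // "nbf", seconds since the epoch
}

// validate rejects a token that is not intended for this recipient or that is
// outside its validity window. Signature verification is assumed to have
// already happened and is out of scope for this sketch.
func validate(c claims, myAudience string, now time.Time) error {
	found := false
	for _, aud := range c.Audiences {
		if aud == myAudience {
			found = true
			break
		}
	}
	if !found {
		return errors.New("token is not intended for this audience")
	}
	if now.Unix() < c.NotBefore {
		return errors.New("token is not valid yet")
	}
	if now.Unix() >= c.Expiry {
		return errors.New("token has expired")
	}
	return nil
}
```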
### Object binding
Tokens issued from this API may be bound to a Kubernetes object in the same
namespace as the service account. The name, group, version, kind and uid of the
object will be embedded as claims in the issued token. A token bound to an
object will only be valid for as long as that object exists.
Only a subset of object kinds will support object binding. Initially the only
kinds that will be supported are:
* v1/Pod
* v1/Secret
The TokenRequest API will validate this binding.
## API Changes
### Add `tokenrequests.authentication.k8s.io`
We will add an imperative API (a la TokenReview) to the
`authentication.k8s.io` API group:
```golang
type TokenRequest struct {
	Spec   TokenRequestSpec
	Status TokenRequestStatus
}

type TokenRequestSpec struct {
	// Audiences are the intended audiences of the token. A token issued
	// for multiple audiences may be used to authenticate against any of
	// the audiences listed. This implies a high degree of trust between
	// the target audiences.
	Audiences []string

	// ValidityDuration is the requested duration of validity of the request. The
	// token issuer may return a token with a different validity duration so a
	// client needs to check the 'expiration' field in a response.
	ValidityDuration metav1.Duration

	// BoundObjectRef is a reference to an object that the token will be bound to.
	// The token will only be valid for as long as the bound object exists.
	BoundObjectRef *BoundObjectReference
}

type BoundObjectReference struct {
	// Kind of the referent. Valid kinds are 'Pod' and 'Secret'.
	Kind string
	// API version of the referent.
	APIVersion string
	// Name of the referent.
	Name string
	// UID of the referent.
	UID types.UID
}

type TokenRequestStatus struct {
	// Token is the token data
	Token string
	// Expiration is the time of expiration of the returned token. Empty means the
	// token does not expire.
	Expiration metav1.Time
}
```
This API will be exposed as a subresource under a serviceaccount object. A
requestor for a token for a specific service account will `POST` a
`TokenRequest` to the `/token` subresource of that serviceaccount object.
### Modify `tokenreviews.authentication.k8s.io`
The TokenReview API will be extended to support passing an additional audience
field which the service account authenticator will validate.
```golang
type TokenReviewSpec struct {
	// Token is the opaque bearer token.
	Token string

	// Audiences are the identifiers that the client identifies as.
	Audiences []string
}
```
### Example Flow
```
> POST /apis/v1/namespaces/default/serviceaccounts/default/token
> {
> "kind": "TokenRequest",
> "apiVersion": "authentication.k8s.io/v1",
> "spec": {
> "audience": [
> "https://kubernetes.default.svc"
> ],
> "validityDuration": "99999h",
> "boundObjectRef": {
> "kind": "Pod",
> "apiVersion": "v1",
> "name": "pod-foo-346acf"
> }
> }
> }
{
"kind": "TokenRequest",
"apiVersion": "authentication.k8s.io/v1",
"spec": {
"audience": [
"https://kubernetes.default.svc"
],
"validityDuration": "99999h",
"boundObjectRef": {
"kind": "Pod",
"apiVersion": "v1",
"name": "pod-foo-346acf"
}
},
"status": {
"token":
"eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJz[payload omitted].EkN-[signature omitted]",
"expiration": "Jan 24 16:36:00 PST 3018"
}
}
```
The token payload will be:
```
{
  "iss": "https://example.com/some/path",
  "sub": "system:serviceaccount:default:default",
  "aud": [
    "https://kubernetes.default.svc"
  ],
  "exp": 24412841114,
  "iat": 1516841043,
  "nbf": 1516841043,
  "kubernetes.io": {
    "serviceAccountUID": "c0c98eab-0168-11e8-92e5-42010af00002",
    "boundObjectRef": {
      "kind": "Pod",
      "apiVersion": "v1",
      "uid": "a4bb8aa4-0168-11e8-92e5-42010af00002",
      "name": "pod-foo-346acf"
    }
  }
}
```
## Service Account Authenticator Modification
The service account token authenticator will be extended to support validation
of time and audience binding claims.
## ACLs for TokenRequest
The NodeAuthorizer will allow the kubelet to use its credentials to request a
service account token on behalf of pods running on that node. The
NodeRestriction admission controller will require that these tokens are pod
bound.
## Footnotes
* New apiserver flags:
* --service-account-issuer: Identifier of the issuer.
* --service-account-signing-key: Path to issuer private key used for signing.
* --service-account-api-audience: Identifier of the API. Used to validate
tokens authenticating to the Kubernetes API.
* The Kubernetes apiserver will identify itself as `kubernetes.default.svc`
which is the DNS name of the Kubernetes apiserver. When no audience is
requested, the audience is defaulted to an array
containing only this identifier.

View File

@ -49,7 +49,7 @@ while creating containers, for example
`docker run --security-opt=no_new_privs busybox`.
Docker provides via their Go api an object named `ContainerCreateConfig` to
configure container creation parameters. In this object, there is a string
configure container creation parameters. In this object, there is a string
array `HostConfig.SecurityOpt` to specify the security options. Client can
utilize this field to specify the arguments for security options while
creating new containers.

View File

@ -0,0 +1,93 @@
# ProcMount/ProcMountType Option
## Background
Currently the way docker and most other container runtimes work is by masking
and setting as read-only certain paths in `/proc`. This is to prevent data
from being exposed into a container that should not be. However, there are
certain use-cases where it is necessary to turn this off.
## Motivation
For end-users who would like to run unprivileged containers using user namespaces
_nested inside_ CRI containers, we need a `ProcMount` option. That is,
we need a way to explicitly turn off the masking and read-only setting of
these paths so that we can
mount `/proc` in the nested container as an unprivileged user.
Please see the following filed issues for more information:
- [opencontainers/runc#1658](https://github.com/opencontainers/runc/issues/1658#issuecomment-373122073)
- [moby/moby#36597](https://github.com/moby/moby/issues/36597)
- [moby/moby#36644](https://github.com/moby/moby/pull/36644)
Please also see the [use case for building images securely in kubernetes](https://github.com/jessfraz/blog/blob/master/content/post/building-container-images-securely-on-kubernetes.md).
The option to unmask the paths in `/proc` really only makes sense when a user
is nesting
unprivileged containers with user namespaces, as it exposes more information
than is necessary to the program running in the container spawned by
kubernetes.
The main use case for this option is to run
[genuinetools/img](https://github.com/genuinetools/img) inside a kubernetes
container. That program then launches sub-containers that take advantage of
user namespaces and re-mask /proc and set /proc as read-only. Therefore
there is no concern with having an unmasked /proc open in the top-level container.
It should be noted that this is different from the host /proc. It is still
a newly mounted /proc; the container runtime just will not mask the paths.
Since the only use case for this option is to run unprivileged nested
containers,
this option should only be allowed or used if the user in the container is not `root`.
This can be easily enforced with `MustRunAs`.
Since the user inside is still unprivileged,
doing things to `/proc` would be off limits regardless, since linux user
support already prevents this.
## Existing SecurityContext objects
Kubernetes defines `SecurityContext` for `Container` and `PodSecurityContext`
for `PodSpec`. `SecurityContext` objects define the related security options
for Kubernetes containers, e.g. selinux options.
To support "ProcMount" options in Kubernetes, it is proposed to make
the following changes:
## Changes of SecurityContext objects
Add to the `SecurityContext` definition a new field, `procMount`, whose
`string` type `ProcMountType` will hold the viable
options.
By default, `procMount` is `Default`, i.e. the same behavior as today: the
paths are masked.
This will look like the following in the spec:
```go
type ProcMountType string

const (
	// DefaultProcMount uses the container runtime's default /proc behavior. Most
	// container runtimes mask certain paths in /proc to avoid accidental security
	// exposure of special devices or information.
	DefaultProcMount ProcMountType = "Default"

	// UnmaskedProcMount bypasses the default masking behavior of the container
	// runtime and ensures the newly created /proc for the container stays intact
	// with no modifications.
	UnmaskedProcMount ProcMountType = "Unmasked"
)

procMount *ProcMountType
```
This requires changes to the CRI runtime integrations so that
kubelet will add the specific `unmasked` or `whatever_it_is_named` option.
## Pod Security Policy changes
A new `[]ProcMountType{}` field named `allowedProcMounts` will be added to the Pod
Security Policy as well, to gate which ProcMountTypes a user is allowed to
set. This field will default to `[]ProcMountType{ DefaultProcMount }`.
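As a rough illustration of the shape of that change (a sketch only; field placement and naming follow this proposal rather than any final API), the Pod Security Policy spec would gain something like:
```go
type PodSecurityPolicySpec struct {
	// ... existing fields elided ...

	// allowedProcMounts gates which ProcMountTypes a pod is allowed to request.
	// An unset or empty list is treated as []ProcMountType{ DefaultProcMount },
	// i.e. only the default masked /proc is permitted.
	AllowedProcMounts []ProcMountType `json:"allowedProcMounts,omitempty"`
}
```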

View File

@ -42,7 +42,7 @@ containers.
In order to support external integration with shared storage, processes running
in a Kubernetes cluster should be able to be uniquely identified by their Unix
UID, such that a chain of ownership can be established. Processes in pods will
UID, such that a chain of ownership can be established. Processes in pods will
need to have consistent UID/GID/SELinux category labels in order to access
shared disks.

View File

@ -211,6 +211,8 @@ the ReplicationController being autoscaled.
```yaml
kind: HorizontalPodAutoscaler
apiVersion: autoscaling/v2alpha1
metadata:
name: WebFrontend
spec:
scaleTargetRef:
kind: ReplicationController

View File

@ -81,8 +81,7 @@ on historical utilization. It is designed to only kick in on Pod creation.
VPA is intended to supersede this feature.
#### In-place updates ####
In-place Pod updates ([#5774]
(https://github.com/kubernetes/kubernetes/issues/5774)) is a planned feature to
In-place Pod updates ([#5774](https://github.com/kubernetes/kubernetes/issues/5774)) is a planned feature to
allow changing resources (request/limit) of existing containers without killing them, assuming sufficient free resources available on the node.
Vertical Pod Autoscaler will greatly benefit from this ability, however it is
not considered a blocker for the MVP.
@ -190,7 +189,7 @@ Design
### API ###
We introduce a new type of API object `VerticalPodAutoscaler`, which
consists of the Target, that is a [label selector](https://kubernetes.io/docs/api-reference/v1.5/#labelselector-unversioned)
consists of the Target, that is a [label selector](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#label-selectors)
for matching Pods and two policy sections: the update policy and the resources
policy.
Additionally it holds the most recent recommendation computed by VPA.
@ -558,7 +557,7 @@ VPA controls the request (memory and CPU) of containers. In MVP it always sets
the limit to infinity. It is not yet clear whether there is a use-case for VPA
setting the limit.
The request is calculated based on analysis of the current and revious runs of
The request is calculated based on analysis of the current and previous runs of
the container and other containers with similar properties (name, image,
command, args).
The recommendation model (MVP) assumes that the memory and CPU consumption are

View File

@ -7,7 +7,7 @@ Author: @mengqiy
Background of the Strategic Merge Patch is covered [here](../devel/strategic-merge-patch.md).
The Kubernetes API may apply semantic meaning to the ordering of items within a list,
however the strategic merge patch does not keeping the ordering of elements.
however the strategic merge patch does not keep the ordering of elements.
Ordering has semantic meaning for Environment variables,
as later environment variables may reference earlier environment variables,
but not the other way around.
@ -30,7 +30,7 @@ Add to the current patch, a directive ($setElementOrder) containing a list of el
either the patch merge key, or for primitives the value. When applying the patch,
the server ensures that the relative ordering of elements matches the directive.
The server will reject the patch if it doesn't satisfy the following 2 requirement.
The server will reject the patch if it doesn't satisfy the following 2 requirements.
- the relative order of any two items in the `$setElementOrder` list
matches that in the patch list if they present.
- the items in the patch list must be a subset or the same as the `$setElementOrder` list if the directive presents.
@ -45,7 +45,7 @@ The relative order of two items are determined by the following order:
If the relative order of the live config in the server is different from the order of the parallel list,
the user's patch will always override the order in the server.
Here is an simple example of the patch format:
Here is a simple example of the patch format:
Suppose we have a type called list. The patch will look like below.
The order from the parallel list ($setElementOrder/list) will be respected.
@ -60,7 +60,7 @@ list:
- C
```
All the items in the server's live list but not in the parallel list will be come before the parallel list.
All the items in the server's live list but not in the parallel list will come before the parallel list.
The relative order between these appended items are kept.
The patched list will look like:
@ -114,7 +114,7 @@ list:
### `$setElementOrder` may contain elements not present in the patch list
The $setElementOrder value may contain elements that are not present in the patch
but present in the list to be merge to reorder the elements as part of the merge.
but present in the list to be merged to reorder the elements as part of the merge.
Example where A & B have not changed:
@ -481,15 +481,15 @@ we send a whole list from user's config.
It is NOT backward compatible in terms of list of primitives.
When patching a list of maps:
- An old client sends a old patch to a new server, the server just merges the change and no reordering.
- An old client sends an old patch to a new server, the server just merges the change and no reordering.
The server behaves the same as before.
- An new client sends a new patch to an old server, the server doesn't understand the new directive.
- A new client sends a new patch to an old server, the server doesn't understand the new directive.
So it just simply does the merge.
When patching a list of primitives:
- An old client sends a old patch to a new server, the server will reorder the patch list which is sublist of user's.
- An old client sends an old patch to a new server, the server will reorder the patch list which is sublist of user's.
The server has the WRONG behavior.
- An new client sends a new patch to an old server, the server will deduplicate after merging.
- A new client sends a new patch to an old server, the server will deduplicate after merging.
The server behaves the same as before.
## Example

View File

@ -5,7 +5,7 @@ The current local cluster experience is sub-par and often not functional.
There are several options to setup a local cluster (docker, vagrant, linux processes, etc) and we do not test any of them continuously.
Here are some highlighted issues:
- Docker based solution breaks with docker upgrades, does not support DNS, and many kubelet features are not functional yet inside a container.
- Vagrant based solution are too heavy and have mostly failed on OS X.
- Vagrant based solution are too heavy and have mostly failed on macOS.
- Local linux cluster is poorly documented and is undiscoverable.
From an end user perspective, they want to run a kubernetes cluster. They care less about *how* a cluster is setup locally and more about what they can do with a functional cluster.
@ -15,7 +15,7 @@ From an end user perspective, they want to run a kubernetes cluster. They care l
From a high level the goal is to make it easy for a new user to run a Kubernetes cluster and play with curated examples that require least amount of knowledge about Kubernetes.
These examples will only use kubectl and only a subset of Kubernetes features that are available will be exposed.
- Works across multiple OSes - OS X, Linux and Windows primarily.
- Works across multiple OSes - macOS, Linux and Windows primarily.
- Single command setup and teardown UX.
- Unified UX across OSes
- Minimal dependencies on third party software.
@ -68,7 +68,7 @@ This is only a part of the overall local cluster solution.
The kube-up.sh script included in Kubernetes release supports a few Vagrant based local cluster deployments.
kube-up.sh is not user friendly.
It typically takes a long time for the cluster to be set up using vagrant and often times is unsuccessful on OS X.
It typically takes a long time for the cluster to be set up using vagrant and often times is unsuccessful on macOS.
The [Core OS single machine guide](https://coreos.com/kubernetes/docs/latest/kubernetes-on-vagrant-single.html) uses Vagrant as well and it just works.
Since we are targeting a single command install/teardown experience, vagrant needs to be an implementation detail and not be exposed to our users.
@ -90,8 +90,8 @@ For running and managing the kubernetes components themselves, we can re-use [S
Localkube is a self-contained go binary that includes all the master components including DNS and runs them using multiple go threads.
Each Kubernetes release will include a localkube binary that has been tested exhaustively.
To support Windows and OS X, minikube will use [libmachine](https://github.com/docker/machine/tree/master/libmachine) internally to create and destroy virtual machines.
Minikube will be shipped with an hypervisor (virtualbox) in the case of OS X.
To support Windows and macOS, minikube will use [libmachine](https://github.com/docker/machine/tree/master/libmachine) internally to create and destroy virtual machines.
Minikube will be shipped with an hypervisor (virtualbox) in the case of macOS.
Minikube will include a base image that will be well tested.
In the case of Linux, since the cluster can be run locally, we ideally want to avoid setting up a VM.
@ -105,7 +105,7 @@ Alternatives to docker for running the localkube core includes using [rkt](https
To summarize the pipeline is as follows:
##### OS X / Windows
##### macOS / Windows
minikube -> libmachine -> virtualbox/hyper V -> linux VM -> localkube
@ -126,7 +126,7 @@ minikube -> docker -> localkube
##### Cons
- Not designed to be wrapped, may be unstable
- Might make configuring networking difficult on OS X and Windows
- Might make configuring networking difficult on macOS and Windows
- Versioning and updates will be challenging. We can mitigate some of this with testing at HEAD, but we'll - inevitably hit situations where it's infeasible to work with multiple versions of docker.
- There are lots of different ways to install docker, networking might be challenging if we try to support many paths.

View File

@ -0,0 +1,87 @@
# **External Metrics API**
# Overview
[HPA v2 API extension proposal](https://github.com/kubernetes/community/blob/hpa_external/contributors/design-proposals/autoscaling/hpa-external-metrics.md) introduces a new External metric type for autoscaling based on metrics coming from outside of the Kubernetes cluster. This document proposes a new External Metrics API that will be used by the HPA controller to get those metrics.
This API performs a similar role to, and is based on, the existing [Custom Metrics API](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md). Unless explicitly specified otherwise, all sections related to semantics, implementation and design decisions in the [Custom Metrics API design](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md) apply to the External Metrics API as well. It is generally expected that a Custom Metrics Adapter will provide both the Custom Metrics API and the External Metrics API; however, this is not a requirement and both APIs can be implemented and used separately.
# API
The API will consist of a single path:
```
/apis/external.metrics.k8s.io/v1beta1/namespaces/<namespace_name>/<metric_name>?labelSelector=<selector>
```
Similar to endpoints in Custom Metrics API it would only support GET requests.
The query would return the `ExternalMetricValueList` type described below:
```go
// a list of values for a given metric for some set of labels
type ExternalMetricValueList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`

	// value of the metric matching a given set of labels
	Items []ExternalMetricValue `json:"items"`
}

// a metric value for external metric
type ExternalMetricValue struct {
	metav1.TypeMeta `json:",inline"`

	// the name of the metric
	MetricName string `json:"metricName"`

	// label set identifying the value within metric
	MetricLabels map[string]string `json:"metricLabels"`

	// indicates the time at which the metrics were produced
	Timestamp unversioned.Time `json:"timestamp"`

	// indicates the window ([Timestamp-Window, Timestamp]) from
	// which these metrics were calculated, when returning rate
	// metrics calculated from cumulative metrics (or zero for
	// non-calculated instantaneous metrics).
	WindowSeconds *int64 `json:"window,omitempty"`

	// the value of the metric
	Value resource.Quantity
}
```
# Semantics
## Namespaces
Kubernetes namespaces don't have a natural 1-1 mapping to metrics coming from outside of Kubernetes. It is up to the adapter implementing the API to decide which metric is available in which namespace. In particular, a single metric may be available through many different namespaces.
## Metric Values
A request for a given metric may return multiple values if MetricSelector matches multiple time series. Each value should include a complete set of labels, which is sufficient to uniquely identify a timeseries.
A single value should always be returned if MetricSelector specifies a single value for every label defined for a given metric.
## Metric names
The Custom Metrics API [doesn't allow using certain characters in metric names](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/instrumentation/custom-metrics-api.md#metric-names). The reason for that is a technical limitation in Go libraries. This list of forbidden characters includes slash (`/`). This is problematic as many systems use slashes in their metric naming convention.
Rather than expect metric adapters to come up with their own custom ways of handling that, this document proposes introducing `\|` as a custom escape sequence for slash. The HPA controller will automatically replace any slashes in the MetricName field for an External metric with this escape sequence.
Otherwise the allowed metric names are the same as in Custom Metrics API.
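A hedged sketch of that escaping rule (the helper names and the example metric name are made up; only the `\|` sequence itself comes from this proposal):
```go
package externalmetrics

import "strings"

// escapeMetricName replaces every slash in an external metric name with the
// `\|` escape sequence, so that a name such as "queue/depth" becomes a legal
// Custom-Metrics-style name ("queue\|depth").
func escapeMetricName(name string) string {
	return strings.Replace(name, "/", `\|`, -1)
}

// unescapeMetricName reverses the transformation on the adapter side.
func unescapeMetricName(name string) string {
	return strings.Replace(name, `\|`, "/", -1)
}
```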
## Access Control
Access can be controlled with per-metric granularity, same as in Custom Metrics API. The API has been designed to allow adapters to implement more granular access control if required. Possible future extension of API supporting label level access control is described in [ExternalMetricsPolicy](#externalmetricspolicy) section.
# Future considerations
## ExternalMetricsPolicy
If more granular access control turns out to be a common requirement, an ExternalMetricPolicy object could be added to the API. This object could be defined at the cluster level, per namespace or per user, and would consist of a list of rules. Each rule would consist of a mandatory regexp and either a label selector or a 'deny' statement. For each metric the rules would be applied top to bottom, with the first matching rule being used. A query that hits a deny rule, or that specifies a selector that is not a subset of the selector specified by the policy, would be rejected with a 403 error.
Additionally an admission controller could be used to check the policy when creating HPA object.
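If such a policy were ever added, its rules might look roughly like the following; this is purely illustrative, no such type exists, and all names are invented here:
```go
// ExternalMetricPolicyRule is a sketch of a single policy rule.
type ExternalMetricPolicyRule struct {
	// MetricNameRegexp selects the metrics this rule applies to (mandatory).
	MetricNameRegexp string `json:"metricNameRegexp"`
	// Deny, if true, rejects any query matching MetricNameRegexp with a 403.
	Deny bool `json:"deny,omitempty"`
	// AllowedSelector, if set, is the selector that a query's labelSelector
	// must be a subset of; otherwise the query is rejected with a 403.
	AllowedSelector *metav1.LabelSelector `json:"allowedSelector,omitempty"`
}

// ExternalMetricPolicy is a sketch of the policy object; rules are applied
// top to bottom and the first matching rule wins.
type ExternalMetricPolicy struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Rules             []ExternalMetricPolicyRule `json:"rules"`
}
```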

View File

@ -78,7 +78,7 @@ horizontally, though its rather complicated and is out of the scope of this d
Metrics server will be Kubernetes addon, create by kube-up script and managed by
[addon-manager](https://git.k8s.io/kubernetes/cluster/addons/addon-manager).
Since there is a number of dependent components, it will be marked as a critical addon.
Since there are a number of dependent components, it will be marked as a critical addon.
In the future when the priority/preemption feature is introduced we will migrate to use this
proper mechanism for marking it as a high-priority, system component.

View File

@ -209,7 +209,7 @@ Go 1.5 introduced many changes. To name a few that are relevant to Kubernetes:
- The garbage collector became more efficient (but also [confused our latency test](https://github.com/golang/go/issues/14396)).
- `linux/arm64` and `linux/ppc64le` were added as new ports.
- The `GO15VENDOREXPERIMENT` was started. We switched from `Godeps/_workspace` to the native `vendor/` in [this PR](https://github.com/kubernetes/kubernetes/pull/24242).
- It's not required to pre-build the whole standard library `std` when cross-compliling. [Details](#prebuilding-the-standard-library-std)
- It's not required to pre-build the whole standard library `std` when cross-compiling. [Details](#prebuilding-the-standard-library-std)
- Builds are approximately twice as slow as earlier. That affects the CI. [Details](#releasing)
- The native Go DNS resolver will suffice in the most situations. This makes static linking much easier.

View File

@ -286,7 +286,7 @@ enumerated the key idea elements:
+ [E1] Master rejects LRS creation (for known or unknown
reason). In this case another attempt to create a LRS should be
attempted in 1m or so. This action can be tied with
[[I5]](#heading=h.ififs95k9rng). Until the the LRS is created
[[I5]](#heading=h.ififs95k9rng). Until the LRS is created
the situation is the same as [E5]. If this happens multiple
times all due replicas should be moved elsewhere and later moved
back once the LRS is created.
@ -348,7 +348,7 @@ to that LRS along with their current status and status change timestamp.
+ [I6] If a cluster is removed from the federation then the situation
is equal to multiple [E4]. It is assumed that if a connection with
a cluster is lost completely then the cluster is removed from the
the cluster list (or marked accordingly) so
cluster list (or marked accordingly) so
[[E6]](#heading=h.in6ove1c1s8f) and [[E7]](#heading=h.37bnbvwjxeda)
don't need to be handled.
@ -383,7 +383,7 @@ To calculate the (re)scheduling moves for a given FRS:
1. For each cluster FRSC calculates the number of replicas that are placed
(not necessary up and running) in the cluster and the number of replicas that
failed to be scheduled. Cluster capacity is the difference between the
the placed and failed to be scheduled.
placed and failed to be scheduled.
2. Order all clusters by their weight and hash of the name so that every time
we process the same replica-set we process the clusters in the same order.

View File

@ -0,0 +1,51 @@
# Multicluster reserved namespaces
@perotinus
06/06/2018
## Background
sig-multicluster has identified the need for a canonical set of namespaces that
can be used for supporting multicluster applications and use cases. Initially,
an [issue](https://github.com/kubernetes/cluster-registry/issues/221) was filed
in the cluster-registry repository describing the need for a namespace that
would be used for public, global cluster records. This topic was further
discussed at the
[SIG meeting on June 5, 2018](https://www.youtube.com/watch?v=j6tHK8_mWz8&t=3012)
and in a
[thread](https://groups.google.com/forum/#!topic/kubernetes-sig-multicluster/8u-li_ZJpDI)
on the SIG mailing list.
## Reserved namespaces
We determined that there is currently a strong case for two reserved namespaces
for multicluster use:
- `kube-multicluster-public`: a global, public namespace for storing cluster
registry Cluster objects. If there are other custom resources that
correspond with the global, public Cluster objects, they can also be stored
here. For example, a custom resource that contains cloud-provider-specific
metadata about a cluster. Tools built against the cluster registry can
expect to find the canonical set of Cluster objects in this namespace[1].
- `kube-multicluster-system`: an administrator-accessible namespace that
contains components, such as multicluster controllers and their
dependencies, that are not meant to be seen by most users directly.
The definition of these namespaces is not intended to be exhaustive: in the
future, there may be reason to define more multicluster namespaces, and
potentially conventions for namespaces that are replicated between clusters (for
example, to support a global cluster list that is replicated to all clusters
that are contained in the list).
## Conventions for reserved namespaces
By convention, resources in these namespaces are local to the clusters in which
they exist and will not be replicated to other clusters. In other words, these
namespaces are private to the clusters they are in, and multicluster operations
must not replicate them or their resources into other clusters.
[1] Tools are by no means compelled to look in this namespace for clusters, and
can choose to reference Cluster objects from other namespaces as is suitable to
their design and environment.

View File

@ -0,0 +1,89 @@
# Support traffic shaping for CNI network plugin
Version: Alpha
Authors: @m1093782566
## Motivation and background
Currently the kubenet code supports applying basic traffic shaping during pod setup. This will happen if bandwidth-related annotations have been added to the pod's metadata, for example:
```json
{
"kind": "Pod",
"metadata": {
"name": "iperf-slow",
"annotations": {
"kubernetes.io/ingress-bandwidth": "10M",
"kubernetes.io/egress-bandwidth": "10M"
}
}
}
```
Our current implementation uses Linux `tc` to add a download (ingress) and upload (egress) rate limiter using 1 root `qdisc`, 2 `class`es (one for ingress and one for egress) and 2 `filter`s (one for ingress and one for egress, attached to the ingress and egress classes respectively).
The kubelet CNI code doesn't support it yet, though CNI has already added a [traffic shaping plugin](https://github.com/containernetworking/plugins/tree/master/plugins/meta/bandwidth). We can replicate the behavior we have today in kubenet for the kubelet CNI network plugin if we feel this is an important feature.
## Goal
Support traffic shaping for CNI network plugin in Kubernetes.
## Non-goal
Requiring all CNI plugins to implement this sort of traffic shaping guarantee.
## Proposal
If kubelet starts up with `network-plugin = cni` and the user has enabled traffic shaping via the network plugin configuration, it will populate the `runtimeConfig` section of the config when calling the `bandwidth` plugin.
Traffic shaping in Kubelet CNI network plugin can work with ptp and bridge network plugins.
### Pod Setup
When we create a pod with bandwidth configuration in its metadata, for example,
```json
{
"kind": "Pod",
"metadata": {
"name": "iperf-slow",
"annotations": {
"kubernetes.io/ingress-bandwidth": "10M",
"kubernetes.io/egress-bandwidth": "10M"
}
}
}
```
Kubelet would first parse the ingress and egress bandwidth values and transform them to Kbps, because both `ingressRate` and `egressRate` in the CNI bandwidth plugin are in Kbps. A user would add something like this to their CNI config list if they want to enable traffic shaping via the plugin:
```json
{
"type": "bandwidth",
"capabilities": {"trafficShaping": true}
}
```
Kubelet would then populate the `runtimeConfig` section of the config when calling the `bandwidth` plugin:
```json
{
"type": "bandwidth",
"runtimeConfig": {
"trafficShaping": {
"ingressRate": "X",
"egressRate": "Y"
}
}
}
```
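A hedged sketch of the annotation-to-Kbps conversion described above (the helper name is made up, and error handling is minimal); as in kubenet, the annotation value is treated as bits per second:
```go
package bandwidth

import "k8s.io/apimachinery/pkg/api/resource"

// annotationToKbps parses a bandwidth annotation value such as "10M"
// (bits per second) and returns it in Kbps, the unit that the CNI
// bandwidth plugin expects for ingressRate and egressRate.
func annotationToKbps(value string) (int64, error) {
	q, err := resource.ParseQuantity(value)
	if err != nil {
		return 0, err
	}
	// "10M" -> 10,000,000 bit/s -> 10,000 Kbps
	return q.Value() / 1000, nil
}
```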
### Pod Teardown
When we delete a pod, kubelet will build the runtime config for calling the CNI plugin `DelNetwork`/`DelNetworkList` API, which will remove this pod's bandwidth configuration.
## Next step
* Support ingress and egress burst bandwidth in Pod.
* Graduate annotations to Pod Spec.

View File

@ -53,7 +53,7 @@ type AcceleratorStats struct {
// ID of the accelerator. device minor number? Or UUID?
ID string `json:"id"`
// Total acclerator memory.
// Total accelerator memory.
// unit: bytes
MemoryTotal uint64 `json:"memory_total"`
@ -75,7 +75,7 @@ From the summary API, they will flow to heapster and stackdriver.
## Caveats
- As mentioned before, this would add a requirement that cAdvisor and kubelet are dynamically linked.
- We would need to make sure that kubelet is able to access the nvml libraries. Some existing container based nvidia driver installers install drivers in a special directory. We would need to make sure that that directory is in kubelets `LD_LIBRARY_PATH`.
- We would need to make sure that kubelet is able to access the nvml libraries. Some existing container based nvidia driver installers install drivers in a special directory. We would need to make sure that directory is in kubelets `LD_LIBRARY_PATH`.
## Testing Plan
- Adding unit tests and e2e tests to cAdvisor for this code.

View File

@ -20,7 +20,7 @@ On the Windows platform, processes may be assigned to a job object, which can ha
[#547](https://github.com/kubernetes/features/issues/547)
## Motivation
The goal is to start filling the gap of platform support in CRI, specifically for Windows platform. For example, currrently in dockershim Windows containers are scheduled using the default resource constraints and does not respect the resource requests and limits specified in POD. With this proposal, Windows containers will be able to leverage POD spec and CRI to allocate compute resource and respect restriction.
The goal is to start filling the gap of platform support in CRI, specifically for Windows platform. For example, currently in dockershim Windows containers are scheduled using the default resource constraints and does not respect the resource requests and limits specified in POD. With this proposal, Windows containers will be able to leverage POD spec and CRI to allocate compute resource and respect restriction.
## Proposed design
@ -85,7 +85,7 @@ The implementation will mainly be in two parts:
In both parts, we need to implement:
* Fork code for Windows from Linux.
* Convert from Resources.Requests and Resources.Limits to Windows configuration in CRI, and convert from Windows configration in CRI to container configuration.
* Convert from Resources.Requests and Resources.Limits to Windows configuration in CRI, and convert from Windows configuration in CRI to container configuration.
To implement resource controls for Windows containers, refer to [this MSDN documentation](https://docs.microsoft.com/en-us/virtualization/windowscontainers/manage-containers/resource-controls) and [Docker's conversion to OCI spec](https://github.com/moby/moby/blob/master/daemon/oci_windows.go).

View File

@ -133,7 +133,7 @@ PodSandboxConfig.LogDirectory: /var/log/pods/<podUID>/
ContainerConfig.LogPath: <containerName>_<instance#>.log
```
Because kubelet determines where the logs are stores and can access them
Because kubelet determines where the logs are stored and can access them
directly, this meets requirement (1). As for requirement (2), the log collector
can easily extract basic pod metadata (e.g., pod UID, container name) from
the paths, and watch the directly for any changes. In the future, we can
@ -142,14 +142,25 @@ extend this by maintaining a metadata file in the pod directory.
**Log format**
The runtime should decorate each log entry with a RFC 3339Nano timestamp
prefix, the stream type (i.e., "stdout" or "stderr"), and ends with a newline.
prefix, the stream type (i.e., "stdout" or "stderr"), the tags of the log
entry, the log content that ends with a newline.
The `tags` fields can support multiple tags, delimited by `:`. Currently, only
one tag is defined in CRI to support multi-line log entries: partial or full.
Partial (`P`) is used when a log entry is split into multiple lines by the
runtime, and the entry has not ended yet. Full (`F`) indicates that the log
entry is completed -- it is either a single-line entry, or this is the last
line of the multiple-line entry.
For example,
```
2016-10-06T00:17:09.669794202Z stdout The content of the log entry 1
2016-10-06T00:17:10.113242941Z stderr The content of the log entry 2
2016-10-06T00:17:09.669794202Z stdout F The content of the log entry 1
2016-10-06T00:17:09.669794202Z stdout P First line of log entry 2
2016-10-06T00:17:09.669794202Z stdout P Second line of the log entry 2
2016-10-06T00:17:10.113242941Z stderr F Last line of the log entry 2
```
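A hedged sketch of how a consumer might split a line of this format into its parts (the function name is illustrative; only the four-field layout and the `:`-delimited tags come from the text above):
```go
package crilog

import (
	"fmt"
	"strings"
	"time"
)

// parseLine splits a CRI log line of the form
//   <RFC3339Nano timestamp> <stream> <tags> <content>
// into its parts. Tags are ':'-delimited; "P" marks a partial entry and "F"
// a full one.
func parseLine(line string) (ts time.Time, stream string, tags []string, content string, err error) {
	parts := strings.SplitN(line, " ", 4)
	if len(parts) != 4 {
		return ts, "", nil, "", fmt.Errorf("malformed log line: %q", line)
	}
	ts, err = time.Parse(time.RFC3339Nano, parts[0])
	if err != nil {
		return ts, "", nil, "", err
	}
	return ts, parts[1], strings.Split(parts[2], ":"), parts[3], nil
}
```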
With the knowledge, kubelet can parses the logs and serve them for `kubectl
With the knowledge, kubelet can parse the logs and serve them for `kubectl
logs` requests. This meets requirement (3). Note that the format is defined
deliberately simple to provide only information necessary to serve the requests.
We do not intend for kubelet to host various logging plugins. It is also worth
@ -165,7 +176,7 @@ to rotate the logs periodically, similar to today's implementation.
We do not rule out the possibility of letting kubelet or a per-node daemon
(`DaemonSet`) to take up the responsibility, or even declare rotation policy
in the kubernetes API as part of the `PodSpec`, but it is beyond the scope of
the this proposal.
this proposal.
**What about non-supported log formats?**

View File

@ -249,7 +249,7 @@ Users are expected to specify `KubeReserved` and `SystemReserved` based on their
Resource requirements for Kubelet and the runtime is typically proportional to the number of pods running on a node.
Once a user identified the maximum pod density for each of their nodes, they will be able to compute `KubeReserved` using [this performance dashboard](http://node-perf-dash.k8s.io/#/builds).
[This blog post](http://blog.kubernetes.io/2016/11/visualize-kubelet-performance-with-node-dashboard.html) explains how the dashboard has to be interpreted.
[This blog post](https://kubernetes.io/blog/2016/11/visualize-kubelet-performance-with-node-dashboard/) explains how the dashboard has to be interpreted.
Note that this dashboard provides usage metrics for docker runtime only as of now.
Support for evictions based on Allocatable will be introduced in this phase.

View File

@ -0,0 +1,209 @@
# Support Node-Level User Namespaces Remapping
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [User Stories](#user-stories)
- [Proposal](#proposal)
- [Future Work](#future-work)
- [Risks and Mitigations](#risks-and-mitigations)
- [Graduation Criteria](#graduation-criteria)
- [Alternatives](#alternatives)
_Authors:_
* Mrunal Patel &lt;mpatel@redhat.com&gt;
* Jan Pazdziora &lt;jpazdziora@redhat.com&gt;
* Vikas Choudhary &lt;vichoudh@redhat.com&gt;
## Summary
Container security consists of many different kernel features that work together to make containers secure. User namespaces is one such feature that enables interesting possibilities for containers by allowing them to be root inside the container while not being root on the host. This gives more capabilities to the containers while protecting the host from the container being root and adds one more layer to container security.
In this proposal we discuss:
- use-cases/user-stories that benefit from this enhancement
- implementation design and scope for alpha release
- long-term roadmap to fully support this feature beyond alpha
## Motivation
From user_namespaces(7):
> User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs, the root directory, keys, and capabilities. A process's user and group IDs can be different inside and outside a user namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace; in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace.
In order to run Pods with software which expects to run as root or with elevated privileges, while still containing the processes and protecting both the Nodes and other Pods, the Linux kernel mechanism of user namespaces can be used to make the processes in the Pods view their environment as having those privileges, while at the host (Node) level these processes appear to be without privileges, or with privileges affecting only processes in the same Pods.
The purpose of using user namespaces in Kubernetes is to let the processes in Pods think they run as one uid set when in fact they run as different “real” uids on the Nodes.
In this text, most everything said about uids can also be applied to gids.
## Goals
Enable user namespace support in a kubernetes cluster so that workloads that work today also work with user namespaces enabled at runtime. Furthermore, make workloads that require a root/privileged user inside the container safer for the node, using the additional security of user namespaces. Containers will run in a user namespace different from the user namespace of the underlying host.
## Non-Goals
- Supporting pod/container-level user namespace isolation is a non-goal. There can be images using different users, but on the node, pods/containers running with these images will share a common user namespace remapping configuration. In other words, all containers on a node share a common user-namespace range.
- Remote volume support, e.g. NFS
## User Stories
- As a cluster admin, I want to protect the node from the rogue container process(es) running inside pod containers with root privileges. If such a process is able to break out into the node, it could be a security issue.
- As a cluster admin, I want to support all the images irrespective of what user/group that image is using.
- As a cluster admin, I want to allow some pods to disable user namespaces if they require elevated privileges.
## Proposal
Proposal is to support user-namespaces for the pod containers. This can be done at two levels:
- Node-level : This proposal explains this part in detail.
- Namespace-Level/Pod-level: Plan is to target this in future due to missing support in the low level system components such as runtimes and kernel. More on this in the `Future Work` section.
Node-level user-namespace support means that, if the feature is enabled, all pods on a node will share a common user namespace and a common UID (and GID) range (which is a subset of the node's total UIDs (and GIDs)). This common user namespace is the runtime's default user-namespace range, which is remapped to container UIDs (and GIDs), with the first host UID mapped to the container's root.
By general Linux convention, a UID (or GID) mapping consists of three parts:
1. Host (U/G)ID: the first (U/G)ID of the range on the host that is being remapped to the (U/G)IDs in the container user-namespace
2. Container (U/G)ID: the first (U/G)ID of the range in the container namespace; this is mapped to the first (U/G)ID on the host (mentioned in the previous point).
3. Count/Size: the total number of consecutive mappings between the host and container user-namespaces, starting from (and including) the first ones mentioned above.
As an example, `host_id 1000, container_id 0, size 10`
In this case, UIDs 1000 to 1009 on the host will be mapped to UIDs 0 to 9 inside the container.
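To make the mapping arithmetic concrete, a small hedged sketch (the Go type and method are illustrative; the three fields mirror the host_id/container_id/size triple described above):
```go
package userns

import "fmt"

// idMapping mirrors the host_id/container_id/size triple described above.
type idMapping struct {
	HostID      uint32
	ContainerID uint32
	Size        uint32
}

// toHostID translates a container UID (or GID) into the corresponding host ID.
// With the example mapping {host_id: 1000, container_id: 0, size: 10},
// container UID 0 maps to host UID 1000 and container UID 9 to host UID 1009.
func (m idMapping) toHostID(containerID uint32) (uint32, error) {
	if containerID < m.ContainerID || containerID >= m.ContainerID+m.Size {
		return 0, fmt.Errorf("container id %d is outside the mapping", containerID)
	}
	return m.HostID + (containerID - m.ContainerID), nil
}
```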
User-namespace support should be enabled only when the container runtime on the node supports user-namespace remapping and it is enabled in the runtime's configuration. To enable user namespaces, a feature-gate flag will need to be passed to Kubelet like this: `--feature-gates="NodeUserNamespace=true"`
A new CRI API, `GetRuntimeConfigInfo`, will be added. Kubelet will use this API:
- To verify whether user-namespace remapping is enabled in the runtime. If it is found disabled, kubelet will fail to start.
- To determine the default user-namespace range at the runtime, the starting UID of which is mapped to UID '0' in the container.
### Volume Permissions
Kubelet will change the file ownership (i.e., chown) under `/var/lib/kubelet/pods` prior to any container start, so that file permissions are updated according to the remapped UID and GID.
This proposal will work only for local volumes and not with remote volumes such as NFS.
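A hedged sketch of that ownership fix-up (this is not the actual kubelet code; the path and the idea of shifting ownership by the remapped range come from the description above, everything else is illustrative):
```go
package userns

import (
	"os"
	"path/filepath"
	"syscall"
)

// chownPodDir shifts the ownership of everything under a pod directory
// (e.g. a directory under /var/lib/kubelet/pods) by the host UID/GID offsets
// learned from GetRuntimeConfigInfo, so that container UID 0 becomes the
// remapped host UID.
func chownPodDir(podDir string, hostUIDOffset, hostGIDOffset int) error {
	return filepath.Walk(podDir, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		st, ok := info.Sys().(*syscall.Stat_t)
		if !ok {
			return nil // not a Unix stat; nothing to do
		}
		return os.Chown(path, int(st.Uid)+hostUIDOffset, int(st.Gid)+hostGIDOffset)
	})
}
```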
### How to disable `NodeUserNamespace` for a specific pod
This can be done in two ways:
- **Alpha:** Implicitly using host namespace for the pod containers
This support is already present (currently it seems broken, and will be fixed) in Kubernetes as experimental functionality, which can be enabled using `feature-gates="ExperimentalHostUserNamespaceDefaulting=true"`.
If Pod-Security-Policy is configured to allow the following to be requested by a pod, host user-namespace will be enabled for the container:
- host namespaces (pid, ipc, net)
- non-namespaced capabilities (mknod, sys_time, sys_module)
- the pod contains a privileged container or using host path volumes.
- https://github.com/kubernetes/kubernetes/commit/d0d78f478ce0fb9d5e121db3b7c6993b482af82c#diff-a53fa76e941e0bdaee26dcbc435ad2ffR437 introduced via https://github.com/kubernetes/kubernetes/commit/d0d78f478ce0fb9d5e121db3b7c6993b482af82c.
- **Beta:** Explicit API to request host user-namespace in pod spec
This is being targeted under Beta graduation plans.
### CRI API Changes
Proposed CRI API changes:
```golang
// Runtime service defines the public APIs for remote container runtimes
service RuntimeService {
    // Version returns the runtime name, runtime version, and runtime API version.
    rpc Version(VersionRequest) returns (VersionResponse) {}
    …….
    …….
    // GetRuntimeConfigInfo returns the configuration details of the runtime.
    rpc GetRuntimeConfigInfo(GetRuntimeConfigInfoRequest) returns (GetRuntimeConfigInfoResponse) {}
}

// LinuxIDMapping represents a single user namespace mapping in Linux.
message LinuxIDMapping {
    // container_id is the starting id for the mapping inside the container.
    uint32 container_id = 1;

    // host_id is the starting id for the mapping on the host.
    uint32 host_id = 2;

    // size is the length of the mapping.
    uint32 size = 3;
}

message LinuxUserNamespaceConfig {
    // is_enabled, if true, indicates that user namespaces are supported and enabled in the container runtime
    bool is_enabled = 1;

    // uid_mappings is an array of user id mappings.
    repeated LinuxIDMapping uid_mappings = 2;

    // gid_mappings is an array of group id mappings.
    repeated LinuxIDMapping gid_mappings = 3;
}

message GetRuntimeConfig {
    LinuxUserNamespaceConfig user_namespace_config = 1;
}

message GetRuntimeConfigInfoRequest {}

message GetRuntimeConfigInfoResponse {
    GetRuntimeConfig runtime_config = 1;
}

...

// NamespaceOption provides options for Linux namespaces.
message NamespaceOption {
    // Network namespace for this container/sandbox.
    // Note: There is currently no way to set CONTAINER scoped network in the Kubernetes API.
    // Namespaces currently set by the kubelet: POD, NODE
    NamespaceMode network = 1;

    // PID namespace for this container/sandbox.
    // Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER.
    // The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods.
    // Namespaces currently set by the kubelet: POD, CONTAINER, NODE
    NamespaceMode pid = 2;

    // IPC namespace for this container/sandbox.
    // Note: There is currently no way to set CONTAINER scoped IPC in the Kubernetes API.
    // Namespaces currently set by the kubelet: POD, NODE
    NamespaceMode ipc = 3;

    // User namespace for this container/sandbox.
    // Note: There is currently no way to set CONTAINER scoped user namespace in the Kubernetes API.
    // The container runtime should ignore this if user namespace is NOT enabled.
    // POD is the default value. Kubelet will set it to NODE when trying to use host user-namespace
    // Namespaces currently set by the kubelet: POD, NODE
    NamespaceMode user = 4;
}
```
### Runtime Support
- Docker: Here is the [user-namespace documentation](https://docs.docker.com/engine/security/userns-remap/) and this is the [implementation PR](https://github.com/moby/moby/pull/12648)
- Concerns:
The Docker API does not expose the user-namespace mapping. Therefore, to handle the `GetRuntimeConfigInfo` API, changes will be made in `dockershim` to read the system files `/etc/subuid` and `/etc/subgid` to figure out the default user-namespace mapping (see the sketch after this list). The `/info` API will be used to figure out whether user-namespace remapping is enabled, and `Docker Root Dir` will be used to figure out the host uid mapped to uid `0` in the container; e.g., `Docker Root Dir: /var/lib/docker/2131616.2131616` shows that host uid `2131616` will be mapped to uid `0`.
- CRI-O: https://github.com/kubernetes-incubator/cri-o/pull/1519
- Containerd: https://github.com/containerd/containerd/blob/129167132c5e0dbd1b031badae201a432d1bd681/container_opts_unix.go#L149
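As referenced in the dockershim concerns above, a hedged sketch of reading a remap range from an `/etc/subuid`-style file (the format is one `name:start:count` entry per line; the function name is illustrative):
```go
package userns

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// lookupSubIDRange returns the first start/count range registered for the
// given remap user in a /etc/subuid or /etc/subgid style file.
func lookupSubIDRange(path, user string) (start, count uint32, err error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Split(strings.TrimSpace(scanner.Text()), ":")
		if len(fields) != 3 || fields[0] != user {
			continue
		}
		s, err := strconv.ParseUint(fields[1], 10, 32)
		if err != nil {
			return 0, 0, err
		}
		c, err := strconv.ParseUint(fields[2], 10, 32)
		if err != nil {
			return 0, 0, err
		}
		return uint32(s), uint32(c), nil
	}
	if err := scanner.Err(); err != nil {
		return 0, 0, err
	}
	return 0, 0, fmt.Errorf("no subordinate id range for %q in %s", user, path)
}
```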
### Implementation Roadmap
#### Phase 1: Support in Kubelet, Alpha, [Target: Kubernetes v1.11]
- Add feature gate `NodeUserNamespace`, disabled by default
- Add new CRI API, `GetRuntimeConfigInfo()`
- Add logic in Kubelet to handle pod creation which includes parsing GetRuntimeConfigInfo response and changing file-permissions in /var/lib/kubelet with learned userns mapping.
- Add changes in dockershim to implement GetRuntimeConfigInfo() for docker runtime
- Add changes in CRI-O to implement userns support and GetRuntimeConfigInfo() support
- Unit test cases
- e2e tests
#### Phase 2: Beta Support [Target: Kubernetes v1.12]
- PSP integration
- To grow ExperimentalHostUserNamespaceDefaulting from experimental feature gate to a Kubelet flag
- API changes to allow a pod to request HostUserNamespace in the pod spec
- e2e tests
### References
- Default host user namespace via experimental flag
- https://github.com/kubernetes/kubernetes/pull/31169
- Enable userns support for containers launched by kubelet
- https://github.com/kubernetes/features/issues/127
- Track Linux User Namespaces in the Pod Security Policy
- https://github.com/kubernetes/kubernetes/issues/59152
- Add support for experimental-userns-remap-root-uid and experimental-userns-remap-root-gid options to match the remapping used by the container runtime.
- https://github.com/kubernetes/kubernetes/pull/55707
- rkt User Namespaces Background
- https://coreos.com/rkt/docs/latest/devel/user-namespaces.html
## Future Work
### Namespace-Level/Pod-Level user-namespace support
There is no runtime today which supports creating containers with a specified user namespace configuration. For example here is the discussion related to this support in Docker https://github.com/moby/moby/issues/28593
Once the user-namespace feature in the runtimes has evolved to support a container's request for a specific user-namespace mapping (UID and GID range), we can extend the current Node-level user-namespace support in Kubernetes to Namespace-level isolation (or, if desired, even pod-level isolation) by dividing the mapping learned from the runtime and allocating it among Kubernetes namespaces (or pods). From an end-user UI perspective, we don't expect any change related to user namespace support.
### Remote Volumes
Remote Volumes support should be investigated and should be targeted in future once support is there at lower infra layers.
## Risks and Mitigations
The main risk with this change stems from the fact that processes in Pods will run with different “real” uids than they used to, while expecting the original uids when performing operations on the Nodes or consistently accessing shared persistent storage.
- This can be mitigated by turning the feature on gradually, per-Pod or per Kubernetes namespace.
- For the Kubernetes' cluster Pods (that provide the Kubernetes functionality), testing of their behaviour and ability to run in user namespaced setups is crucial.
## Graduation Criteria
- PSP integration
- API changes to allow a pod to request the host user namespace using, for example, `HostUserNamespace: True` in the pod spec
- e2e tests
## Alternatives
User Namespace mappings can be passed explicitly through kubelet flags, similar to https://github.com/kubernetes/kubernetes/pull/55707, but we do not prefer this option because it is very prone to misconfiguration.

View File

@ -0,0 +1,169 @@
# Plugin Watcher Utility
## Background
Portability and extensibility have been major goals of Kubernetes from its beginning, and we have seen more plugin mechanisms developed on Kubernetes to further improve them. Moving in this direction, Kubelet is starting to support pluggable [device exporting](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-management/device-plugin.md) and [CSI volume plugins](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/container-storage-interface.md). We are seeing the need for a common Kubelet plugin discovery model that can be used by different types of node-level plugins, such as device plugins, CSI, and CNI, to establish communication channels with Kubelet. This document lists two possible approaches to implementing this common Kubelet plugin discovery model. We are hoping to discuss these proposals with the OSS community to gather consensus on which model we would like to take forward.
## General Requirements
The primary goal of the Kubelet plugin discovery model is to provide a common mechanism for users to dynamically deploy vendor-specific plugins that make different types of devices, storage systems, or network components available on a Kubernetes node.
Here are the general requirements to consider when designing this system:
* Security/authentication requirements
* Underlying communication channel to use: stay with gRPC vs. flexibility to support multiple communication protocols
* How to detect API version mismatches
* Ping-pong plugin registration
* Failure recovery or upgrade story upon kubelet restart and/or plugin restart
* How to prevent a single misbehaving plugin from flooding kubelet
* How to handle de-registration
* Need to support existing protocols that are not bound to Kubernetes, such as CSI
## Proposed Models
#### Model 1: Plugin registers with Kubelet through gRPC (currently used by device plugins)
* Currently requires the plugin to run with privilege and communicate with kubelet through a unix socket under a canonical directory, but has the flexibility to support different communication channels or authentication methods.
* API version mismatch is detected during registration.
* Currently the newest plugin is always taken upon re-registration. We can implement a policy to reject plugin re-registration if a plugin re-registers too frequently, and we can terminate the communication channel if a plugin sends too many updates to Kubelet.
* In the current implementation, kubelet removes all of the device plugin unix sockets on restart. Device plugins are expected to watch for this event and re-register with the new kubelet instance. This solution is a bit ad hoc. There is also a temporary period after a kubelet restart during which new pods requiring a device plugin resource cannot be scheduled on the node, until the corresponding device plugin re-registers. This temporary period can be avoided if we also checkpoint device plugin socket information on the Kubelet side: pods previously scheduled can continue with the device plugin allocation information already recorded in a checkpoint file. Checkpointing plugin socket information is easier to add in DevicePlugins, which already maintains a checkpoint file for other purposes; it could, however, be a new requirement for other plugin systems like CSI.
![image](https://user-images.githubusercontent.com/11936386/42627970-3bf055a4-85ec-11e8-93cb-f4f393b2bd76.png)
#### Model 2: Kubelet watches for new plugins under a canonical path through inotify (preferred and currently implemented)
* The plugin can export a registration RPC for API version checking or further authentication; Kubelet doesn't need to export an RPC service.
* We will take gRPC as the single supported communication channel.
* We can take the newest plugin from the latest inotify creation event. This may require socket names to follow a certain naming convention (e.g., resourceName.timestamp) to detect ping-pong plugin registration, and to ignore socket creations from a plugin that creates too many sockets during a short period of time. We can even require that the resource name embedded in the socket path be part of the identification process; e.g., a plugin at `/var/lib/kubelet/plugins/resourceName.timestamp` must identify itself as resourceName or it will be rejected (see the sketch after the diagram below).
* It is easy to avoid temporary plugin unavailability after a kubelet restart: Kubelet just needs to scan through the special directory. It can remove plugin sockets that fail to respond, and always take the last live socket when multiple registrations happen with the same plugin name. This simplifies device plugin implementations because they don't need to detect Kubelet restarts and re-register.
* A plugin should remove its socket upon termination to avoid leaving dead sockets in the canonical path, although this is not strictly required.
* CSI needs the flexibility to not be bound only to Kubernetes. With the probe model, we may need to add an interface for Kubernetes to get plugin information.
* We can introduce a special plugin pod whose environment we automatically set up to communicate with kubelet. Even if Kubelet runs in a container, it is easy to configure the communication path between the plugin and Kubelet.
![image](https://user-images.githubusercontent.com/11936386/42628430-be3ab5bc-85ed-11e8-93ac-173511fdd39a.png)
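As one possible realization of the naming convention mentioned in the list above, the sketch below parses and validates `resourceName.timestamp` socket names. The function, its error messages, and the example path are illustrative assumptions rather than part of the proposal.
```golang
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
	"time"
)

// parseSocketName checks the "<resourceName>.<timestamp>" convention suggested
// above for sockets dropped under the watched directory. Illustrative only; the
// exact convention is still an open design point.
func parseSocketName(socketPath string) (resource string, created time.Time, err error) {
	base := filepath.Base(socketPath)
	i := strings.LastIndex(base, ".")
	if i <= 0 {
		return "", time.Time{}, fmt.Errorf("socket %q does not match resourceName.timestamp", base)
	}
	secs, err := strconv.ParseInt(base[i+1:], 10, 64)
	if err != nil {
		return "", time.Time{}, fmt.Errorf("socket %q has a malformed timestamp: %v", base, err)
	}
	return base[:i], time.Unix(secs, 0), nil
}

func main() {
	name, ts, err := parseSocketName("/var/lib/kubelet/plugins/vendor.example.com-gpu.1540000000")
	fmt.Println(name, ts, err)
}
```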
**More Implementation Details on Model 2:**
* Kubelet will have a new module, PluginWatcher, which will probe a canonical path recursively
* On detecting a socket creation, the Watcher will try to get the plugin's identity details using a gRPC client on the discovered socket and the RPCs of a newly introduced `Identity` service.
* Plugins must implement the `Identity` service RPCs for initial communication with the Watcher.
**Identity Service Primitives:**
```proto
// PluginInfo is the message sent from a plugin to the Kubelet pluginwatcher for plugin registration
message PluginInfo {
  // Type of the Plugin. CSIPlugin or DevicePlugin
  string type = 1;
  // Plugin name that uniquely identifies the plugin for the given plugin type.
  // For DevicePlugin, this is the resource name that the plugin manages and
  // should follow the extended resource name convention.
  // For CSI, this is the CSI driver registrar name.
  string name = 2;
  // Optional endpoint location. If found set by Kubelet component,
  // Kubelet component will use this endpoint for specific requests.
  // This allows the plugin to register using one endpoint and possibly use
  // a different socket for control operations. CSI uses this model to delegate
  // its registration external from the plugin.
  string endpoint = 3;
  // Plugin service API versions the plugin supports.
  // For DevicePlugin, this maps to the deviceplugin API versions the
  // plugin supports at the given socket.
  // The Kubelet component communicating with the plugin should be able
  // to choose any preferred version from this list, or returns an error
  // if none of the listed versions is supported.
  repeated string supported_versions = 4;
}

// RegistrationStatus is the message sent from Kubelet pluginwatcher to the plugin for notification on registration status
message RegistrationStatus {
  // True if plugin gets registered successfully at Kubelet
  bool plugin_registered = 1;
  // Error message in case plugin fails to register, empty string otherwise
  string error = 2;
}

// RegistrationStatusResponse is sent by plugin to kubelet in response to RegistrationStatus RPC
message RegistrationStatusResponse {
}

// InfoRequest is the empty request message from Kubelet
message InfoRequest {
}

// Registration is the service advertised by the Plugins.
service Registration {
  rpc GetInfo(InfoRequest) returns (PluginInfo) {}
  rpc NotifyRegistrationStatus(RegistrationStatus) returns (RegistrationStatusResponse) {}
}
```
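For orientation, here is a hedged sketch of the plugin side of this contract, assuming Go bindings generated from the messages above are importable. The `example.com/pluginregistration` import path, the plugin name, and the socket paths are illustrative, not real packages or established conventions.
```golang
package main

import (
	"context"
	"log"
	"net"
	"os"

	"google.golang.org/grpc"

	registerapi "example.com/pluginregistration" // assumption: generated from the proto above
)

type identityServer struct{}

// GetInfo identifies the plugin to the kubelet's PluginWatcher.
func (s *identityServer) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {
	return &registerapi.PluginInfo{
		Type:              "DevicePlugin",
		Name:              "vendor.example.com/gpu",
		Endpoint:          "/var/lib/kubelet/device-plugins/vendor-gpu.sock",
		SupportedVersions: []string{"v1beta1"},
	}, nil
}

// NotifyRegistrationStatus receives the watcher's verdict on the registration.
func (s *identityServer) NotifyRegistrationStatus(ctx context.Context, status *registerapi.RegistrationStatus) (*registerapi.RegistrationStatusResponse, error) {
	if !status.PluginRegistered {
		log.Printf("registration failed: %s", status.Error)
	}
	return &registerapi.RegistrationStatusResponse{}, nil
}

func main() {
	// Creating this socket under the watched directory is what triggers discovery.
	socket := "/var/lib/kubelet/plugins/vendor.example.com-gpu.sock"
	os.Remove(socket) // clean up a stale socket from a previous run, if any
	lis, err := net.Listen("unix", socket)
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	registerapi.RegisterRegistrationServer(srv, &identityServer{})
	log.Fatal(srv.Serve(lis))
}
```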
**PluginWatcher primitives:**
```golang
// Watcher is the plugin watcher
type Watcher struct {
	path      string
	handlers  map[string]RegisterCallbackFn
	stopCh    chan interface{}
	fs        utilfs.Filesystem
	fsWatcher *fsnotify.Watcher
	wg        sync.WaitGroup
	mutex     sync.Mutex
}

// RegisterCallbackFn is the type of the callback function that handlers will provide
type RegisterCallbackFn func(pluginName string, endpoint string, versions []string, socketPath string) (chan bool, error)

// AddHandler registers a callback to be invoked for a particular type of plugin
func (w *Watcher) AddHandler(pluginType string, handlerCbkFn RegisterCallbackFn) {
	w.handlers[pluginType] = handlerCbkFn
}

// Start watches for the creation of plugin sockets at the path
func (w *Watcher) Start() error {
	// Probes the canonical path for socket creations in a forever loop.
	// For any new socket creation, invokes `GetInfo()` on the plugin's Identity service:
	resp, err := client.GetInfo(context.Background(), &registerapi.InfoRequest{})
	// Keeps the connection open and passes the plugin's identity details, along with the
	// socket path, to the callback function registered by the handler. The handler callback
	// is selected based on the Type of the plugin, for example device plugin or CSI plugin.
	// The handler callback is supposed to authenticate the plugin details and, if all is
	// correct, register the plugin with the kubelet subsystem.
	if handlerCbkFn, ok := w.handlers[resp.Type]; ok {
		_, err = handlerCbkFn(resp.Name, resp.Endpoint, resp.SupportedVersions, event.Name)
		...
	}
	// After the callback returns, the PluginWatcher notifies the registration status back
	// to the plugin:
	client.NotifyRegistrationStatus(ctx, &registerapi.RegistrationStatus{
		...
	})
}
```
**How any Kubelet sub-module can use PluginWatcher:**
* The sub-module must define a callback function with the following signature:
```golang
type RegisterCallbackFn func(pluginName string, endpoint string, versions []string, socketPath string) (chan bool, error)
```
* Just after the sub-module starts, this callback should be registered with the PluginWatcher, e.g.:
```golang
kl.pluginWatcher.AddHandler(pluginwatcherapi.DevicePlugin, kl.containerManager.GetPluginRegistrationHandlerCbkFunc())
```
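For completeness, here is a hedged sketch of what such a handler callback could look like inside a sub-module. `exampleManager`, its `registerPlugin` helper, and the accepted version string are hypothetical; only the callback signature comes from the proposal.
```golang
type exampleManager struct{} // hypothetical kubelet sub-module

// GetPluginRegistrationHandlerCbkFunc returns the callback this sub-module
// hands to the PluginWatcher via AddHandler.
func (m *exampleManager) GetPluginRegistrationHandlerCbkFunc() RegisterCallbackFn {
	return func(pluginName string, endpoint string, versions []string, socketPath string) (chan bool, error) {
		// Reject plugins that do not offer a version this sub-module understands.
		supported := false
		for _, v := range versions {
			if v == "v1beta1" { // assumed accepted version
				supported = true
				break
			}
		}
		if !supported {
			return nil, fmt.Errorf("plugin %s offers no supported API version (got %v)", pluginName, versions)
		}
		// Register the plugin with the sub-module, preferring the reported
		// endpoint over the discovery socket when one is set.
		target := socketPath
		if endpoint != "" {
			target = endpoint
		}
		ack := make(chan bool, 1)
		go m.registerPlugin(pluginName, target, ack) // hypothetical helper doing the actual registration
		return ack, nil
	}
}
```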
**Open issues (Points from the meeting notes for the record):**
* Discuss with security team if this is a viable approach (and if cert auth can be added on top for added security).
* Plugin author should be able to write yaml once, so the plugin dir should not be hard coded. 3 options:
* Downward API param for plugin directory that will be used as hostpath src
* A new volume plugin that can be used by plugin to drop a socket
* Have plugins call kubelet -- link local interface
* Bigger change -- kubelet doesn't do this
* Path of most resistance

View File

@ -228,7 +228,7 @@ of a compressible resource that was requested by a pod in a higher QoS tier.
The `kubelet` will support a flag `experimental-qos-reserved` that
takes a set of percentages per incompressible resource that controls how the
QoS cgroup sandbox attempts to reserve resources for its tier. It attempts
to reserve requested resources to exclude pods from lower OoS classes from
to reserve requested resources to exclude pods from lower QoS classes from
using resources requested by higher QoS classes. The flag will accept values
in a range from 0-100%, where a value of `0%` instructs the `kubelet` to attempt
no reservation, and a value of `100%` will instruct the `kubelet` to attempt to
@ -564,10 +564,10 @@ Pod3 and Pod4 are both classified as Burstable and are hence nested under
the Burstable cgroup.
```
/ROOT/burstable/cpu.shares = 30m
/ROOT/burstable/cpu.shares = 130m
/ROOT/burstable/memory.limit_in_bytes = Allocatable - 5Gi
/ROOT/burstable/Pod3/cpu.quota = 150m
/ROOT/burstable/Pod3/cpu.shares = 20m
/ROOT/burstable/Pod3/cpu.shares = 120m
/ROOT/burstable/Pod3/memory.limit_in_bytes = 3Gi
/ROOT/burstable/Pod4/cpu.quota = 20m
/ROOT/burstable/Pod4/cpu.shares = 10m

View File

@ -95,7 +95,7 @@ When `limits` are not specified, they default to the node capacity.
Examples:
Container `bar` has not resources specified.
Container `bar` has no resources specified.
```yaml
containers:

View File

@ -28,6 +28,7 @@ This design should:
* be container-runtime agnostic
* allow use of custom profiles
* facilitate containerized applications that link directly to libseccomp
* enable a default seccomp profile for containers
## Use Cases
@ -40,6 +41,8 @@ This design should:
unmediated by Kubernetes
4. As a user, I want to be able to use a custom seccomp profile and use
it with my containers
5. As a user and administrator, I want Kubernetes to apply a sane default
seccomp profile to containers unless I specify otherwise.
### Use Case: Administrator access control
@ -47,7 +50,7 @@ Controlling access to seccomp profiles is a cluster administrator
concern. It should be possible for an administrator to control which users
have access to which profiles.
The [pod security policy](https://github.com/kubernetes/kubernetes/pull/7893)
The [Pod Security Policy](https://github.com/kubernetes/kubernetes/pull/7893)
API extension governs the ability of users to make requests that affect pod
and container security contexts. The proposed design should deal with
required changes to control access to new functionality.
@ -101,9 +104,7 @@ implement a sandbox for user-provided code, such as
## Community Work
### Container runtime support for seccomp
#### Docker / opencontainers
### Docker / OCI
Docker supports the open container initiative's API for
seccomp, which is very close to the libseccomp API. It allows full
@ -112,14 +113,21 @@ specification of seccomp filters, with arguments, operators, and actions.
Docker allows the specification of a single seccomp filter. There are
community requests for:
Issues:
* [docker/22109](https://github.com/docker/docker/issues/22109): composable
seccomp filters
* [docker/21105](https://github.com/docker/docker/issues/22105): custom
seccomp filters for builds
#### rkt / appcontainers
Implementation details:
* [docker/17989](https://github.com/moby/moby/pull/17989): initial
implementation
* [docker/18780](https://github.com/moby/moby/pull/18780): default blacklist
profile
* [docker/18979](https://github.com/moby/moby/pull/18979): default whitelist
profile
### rkt / appcontainers
The `rkt` runtime delegates to systemd for seccomp support; there is an open
issue to add support once `appc` supports it. The `appc` project has an open
@ -133,16 +141,11 @@ Issues:
* [appc/529](https://github.com/appc/spec/issues/529)
* [rkt/1614](https://github.com/coreos/rkt/issues/1614)
#### HyperContainer
### HyperContainer
[HyperContainer](https://hypercontainer.io) does not support seccomp.
### Other platforms and seccomp-like capabilities
FreeBSD has a seccomp/capability-like facility called
[Capsicum](https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4).
#### lxd
### lxd
[`lxd`](http://www.ubuntu.com/cloud/lxd) constrains containers using a default profile.
@ -150,6 +153,11 @@ Issues:
* [lxd/1084](https://github.com/lxc/lxd/issues/1084): add knobs for seccomp
### Other platforms and seccomp-like capabilities
FreeBSD has a seccomp/capability-like facility called
[Capsicum](https://www.freebsd.org/cgi/man.cgi?query=capsicum&sektion=4).
## Proposed Design
### Seccomp API Resource?
@ -168,8 +176,6 @@ Instead of implementing a new API resource, we propose that pods be able to
reference seccomp profiles by name. Since this is an alpha feature, we will
use annotations instead of extending the API with new fields.
### API changes?
In the alpha version of this feature we will use annotations to store the
names of seccomp profiles. The keys will be:
@ -191,7 +197,8 @@ profiles to be opaque to kubernetes for now.
The following format is scoped as follows:
1. `docker/default` - the default profile for the container runtime
1. `runtime/default` - the default profile for the container runtime, can be
overwritten by the following two.
2. `unconfined` - unconfined profile, ie, no seccomp sandboxing
3. `localhost/<profile-name>` - the profile installed to the node's local seccomp profile root
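As a small illustration of how a component might interpret these names, the sketch below (which needs `fmt` and `strings`) splits a profile reference into its scope and profile parts; the function and its error messages are not part of the proposal.
```golang
// parseSeccompProfile splits a profile reference of the form described above
// into a scope and an optional profile name. Illustrative only.
func parseSeccompProfile(name string) (scope, profile string, err error) {
	switch {
	case name == "runtime/default" || name == "docker/default":
		return "runtime", "default", nil
	case name == "unconfined":
		return "unconfined", "", nil
	case strings.HasPrefix(name, "localhost/"):
		p := strings.TrimPrefix(name, "localhost/")
		if p == "" {
			return "", "", fmt.Errorf("empty localhost seccomp profile name")
		}
		return "localhost", p, nil
	default:
		return "", "", fmt.Errorf("unrecognized seccomp profile reference %q", name)
	}
}
```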

View File

@ -169,7 +169,7 @@ Adding it there allows the user to change the mode bits of every file in the
object, so it achieves the goal, while having the option to have a default and
not specify all files in the object.
The are two downside:
There are two downsides:
* The files are symlinks pointing to the real file, and only the real file's
permissions are set. The symlink has the classic symlink permissions.

View File

@ -1,6 +1,6 @@
# Troubleshoot Running Pods
* Status: Pending
* Status: Implementing
* Version: Alpha
* Implementation Owner: @verb
@ -16,9 +16,9 @@ Many developers of native Kubernetes applications wish to treat Kubernetes as an
execution platform for custom binaries produced by a build system. These users
can forgo the scripted OS install of traditional Dockerfiles and instead `COPY`
the output of their build system into a container image built `FROM scratch` or
a [distroless container
image](https://github.com/GoogleCloudPlatform/distroless). This confers several
advantages:
a
[distroless container image](https://github.com/GoogleCloudPlatform/distroless).
This confers several advantages:
1. **Minimal images** lower operational burden and reduce attack vectors.
1. **Immutable images** improve correctness and reliability.
@ -45,26 +45,25 @@ A solution to troubleshoot arbitrary container images MUST:
* fetch troubleshooting utilities at debug time rather than at the time of pod
creation
* be compatible with admission controllers and audit logging
* allow discovery of debugging status
* allow discovery of current debugging status
* support arbitrary runtimes via the CRI (possibly with reduced feature set)
* require no administrative access to the node
* have an excellent user experience (i.e. should be a feature of the platform
rather than config-time trickery)
* have no *inherent* side effects to the running container image
* have no _inherent_ side effects to the running container image
* v1.Container must be available for inspection by admission controllers
## Feature Summary
Any new debugging functionality will require training users. We can ease the
transition by building on an existing usage pattern. We will create a new
command, `kubectl debug`, which parallels an existing command, `kubectl exec`.
Whereas `kubectl exec` runs a *process* in a *container*, `kubectl debug` will
be similar but run a *container* in a *pod*.
Whereas `kubectl exec` runs a _process_ in a _container_, `kubectl debug` will
be similar but run a _container_ in a _pod_.
A container created by `kubectl debug` is a *Debug Container*. Just like a
process run by `kubectl exec`, a Debug Container is not part of the pod spec and
has no resource stored in the API. Unlike `kubectl exec`, a Debug Container
*does* have status that is reported in `v1.PodStatus` and displayed by `kubectl
describe pod`.
A container created by `kubectl debug` is a _Debug Container_. Unlike `kubectl
exec`, Debug Containers have status that is reported in `PodStatus` and
displayed by `kubectl describe pod`.
For example, the following command would attach to a newly created container in
a pod:
@ -82,22 +81,16 @@ kubectl debug target-pod
This creates an interactive shell in a pod which can examine and signal other
processes in the pod. It has access to the same network and IPC as processes in
the pod. It can access the filesystem of other processes by `/proc/$PID/root`.
As is already the case with regular containers, Debug Containers can enter
arbitrary namespaces of another container via `nsenter` when run with
`CAP_SYS_ADMIN`.
the pod. When [process namespace sharing](https://features.k8s.io/495) is
enabled, it can access the filesystem of other processes by `/proc/$PID/root`.
Debug Containers can enter arbitrary namespaces of another visible container via
`nsenter` when run with `CAP_SYS_ADMIN`.
*Please see the User Stories section for additional examples and Alternatives
Considered for the considerable list of other solutions we considered.*
_Please see the User Stories section for additional examples and Alternatives
Considered for the considerable list of other solutions we considered._
## Implementation Details
The implementation of `kubectl debug` closely mirrors the implementation of
`kubectl exec`, with most of the complexity implemented in the `kubelet`. How
functionality like this best fits into Kubernetes API has been contentious. In
order to make progress, we will start with the smallest possible API change,
extending `/exec` to support Debug Containers, and iterate.
From the perspective of the user, there's a new command, `kubectl debug`, that
creates a Debug Container and attaches to its console. We believe a new command
will be less confusing for users than overloading `kubectl exec` with a new
@ -106,192 +99,154 @@ subsequently be used to reattach and is reported by `kubectl describe`.
### Kubernetes API Changes
#### Chosen Solution: "exec++"
This will be implemented in the Core API to avoid new dependencies in the
kubelet. The user-level concept of a _Debug Container_ is implemented with the
API-level concept of an _Ephemeral Container_. The API doesn't require an
Ephemeral Container to be used as a Debug Container. It's intended as a
general-purpose construct for running a short-lived process in a pod.
We will extend `v1.Pod`'s `/exec` subresource to support "executing" container
images. The current `/exec` endpoint must implement `GET` to support streaming
for all clients. We don't want to encode a (potentially large) `v1.Container` as
an HTTP parameter, so we must extend `v1.PodExecOptions` with the specific
fields required for creating a Debug Container:
#### Pod Changes
Ephemeral Containers are represented in `PodSpec` and `PodStatus`:
```
// PodExecOptions is the query options to a Pod's remote exec call
type PodExecOptions struct {
...
// EphemeralContainerName is the name of an ephemeral container in which the
// command ought to be run. Either both EphemeralContainerName and
// EphemeralContainerImage fields must be set, or neither.
EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...`
// EphemeralContainerImage is the image of an ephemeral container in which the command
// ought to be run. Either both EphemeralContainerName and EphemeralContainerImage
// fields must be set, or neither.
EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...`
type PodSpec struct {
...
// List of user-initiated ephemeral containers to run in this pod.
// This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
// +optional
EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,29,opt,name=ephemeralContainers"`
}
```
After creating the Debug Container, the kubelet will upgrade the connection to
streaming and perform an attach to the container's console. If disconnected, the
Debug Container can be reattached using the pod's `/attach` endpoint with
`EphemeralContainerName`.
Debug Containers cannot be removed via the API and instead the process must
terminate. While not ideal, this parallels existing behavior of `kubectl exec`.
To kill a Debug Container one would `attach` and exit the process interactively
or create a new Debug Container to send a signal with `kill(1)` to the original
process.
#### Alternative 1: Debug Subresource
Rather than extending an existing subresource, we could create a new,
non-streaming `debug` subresource. We would create a new API Object:
```
// DebugContainer describes a container to attach to a running pod for troubleshooting.
type DebugContainer struct {
metav1.TypeMeta
metav1.ObjectMeta
// Name is the name of the Debug Container. Its presence will cause
// exec to create a Debug Container rather than performing a runtime exec.
Name string `json:"name,omitempty" ...`
// Image is an optional container image name that will be used to for the Debug
// Container in the specified Pod with Command as ENTRYPOINT. If omitted a
// default image will be used.
Image string `json:"image,omitempty" ...`
}
```
The pod would gain a new `/debug` subresource that allows the following:
1. A `POST` of a `PodDebugContainer` to
`/api/v1/namespaces/$NS/pods/$POD_NAME/debug/$NAME` to create Debug
Container named `$NAME` running in pod `$POD_NAME`.
1. A `DELETE` of `/api/v1/namespaces/$NS/pods/$POD_NAME/debug/$NAME` will stop
the Debug Container `$NAME` in pod `$POD_NAME`.
Once created, a client would attach to the console of a debug container using
the existing attach endpoint, `/api/v1/namespaces/$NS/pods/$POD_NAME/attach`.
However, this pattern does not resemble any other current usage of the API, so
we prefer to start with exec++ and reevaluate if we discover a compelling
reason.
#### Alternative 2: Declarative Configuration
Using subresources is an imperative style API where the client instructs the
kubelet to perform an action, but in general Kubernetes prefers declarative APIs
where the client declares a state for Kubernetes to enact.
We could implement this in a declarative manner by creating a new
`EphemeralContainer` type:
```
type EphemeralContainer struct {
metav1.TypeMeta
metav1.ObjectMeta
Spec EphemeralContainerSpec
Status v1.ContainerStatus
}
```
`EphemeralContainerSpec` is similar to `v1.Container`, but contains only fields
relevant to Debug Containers:
```
type EphemeralContainerSpec struct {
// Target is the pod in which to run the EphemeralContainer
// Required.
Target v1.ObjectReference
Name string
Image String
Command []string
Args []string
ImagePullPolicy PullPolicy
SecurityContext *SecurityContext
}
```
A new controller in the kubelet would watch for EphemeralContainers and
create/delete debug containers. `EphemeralContainer.Status` would be updated by
the kubelet at the same time it updates `ContainerStatus` for regular and init
containers. Clients would create a new `EphemeralContainer` object, wait for it
to be started and then attach using the pod's attach subresource and the name of
the `EphemeralContainer`.
Debugging is inherently imperative, however, rather than a state for Kubernetes
to enforce. Once a Debug Container is started it should not be automatically
restarted, for example. This solution imposes additional complexity and
dependencies on the kubelet, but it's not yet clear if the complexity is
justified.
### Debug Container Status
The status of a Debug Container is reported in a new field in `v1.PodStatus`:
```
type PodStatus struct {
...
EphemeralContainerStatuses []v1.ContainerStatus
...
// Status for any Ephemeral Containers that running in this pod.
// This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
// +optional
EphemeralContainerStatuses []ContainerStatus `json:"ephemeralContainerStatuses,omitempty" protobuf:"bytes,12,rep,name=ephemeralContainerStatuses"`
}
```
This status is only populated for Debug Containers, but there's interest in
tracking status for traditional exec in a similar manner.
`EphemeralContainerStatuses` resembles the existing `ContainerStatuses` and
`InitContainerStatuses`, but `EphemeralContainers` introduces a new type:
Note that `Command` and `Args` would have to be tracked in the status object
because there is no spec for Debug Containers or exec. These must either be made
available by the runtime or tracked by the kubelet. For Debug Containers this
could be stored as runtime labels, but the kubelet currently has no method of
storing state across restarts for exec. Solving this problem for exec is out of
scope for Debug Containers, but we will look for a solution as we implement this
feature.
```
// An EphemeralContainer is a container which runs temporarily in a pod for human-initiated actions
// such as troubleshooting. This is an alpha feature enabled by the EphemeralContainers feature flag.
type EphemeralContainer struct {
// Spec describes the Ephemeral Container to be created.
Spec Container `json:"spec,omitempty" protobuf:"bytes,1,opt,name=spec"`
`EphemeralContainerStatuses` is populated by the kubelet in the same way as
regular and init container statuses. This is sent to the API server and
displayed by `kubectl describe pod`.
// If set, the name of the container from PodSpec that this ephemeral container targets.
// The ephemeral container will be run in the namespaces (IPC, PID, etc) of this container.
// If not set then the ephemeral container is run in whatever namespaces are shared
// for the pod.
// +optional
TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,2,opt,name=targetContainerName"`
}
```
Much of the utility of Ephemeral Containers comes from the ability to run a
container within the PID namespace of another container. `TargetContainerName`
allows targeting a container that doesn't share its PID namespace with the rest
of the pod. We must modify the CRI to enable this functionality (see below).
##### Alternative Considered: Omitting TargetContainerName
It would be simpler for the API, kubelet and kubectl if `EphemeralContainers`
was a `[]Container`, but as isolated PID namespaces will be the default for some
time, being able to target a container will provide a better user experience.
#### Updates
Most fields of `Pod.Spec` are immutable once created. There is a short whitelist
of fields which may be updated, and we could extend this to include
`EphemeralContainers`. The ability to add new containers is a large change for
Pod, however, and we'd like to begin conservatively by enforcing the following
best practices:
1. Ephemeral Containers lack guarantees for resources or execution, and they
will never be automatically restarted. To avoid pods that depend on
Ephemeral Containers, we allow their addition only in pod updates and
disallow them during pod create.
1. Some fields of `v1.Container` imply a fundamental role in a pod. We will
disallow the following fields in Ephemeral Containers: `resources`, `ports`,
`livenessProbe`, `readinessProbe`, and `lifecycle.`
1. Cluster administrators may want to restrict access to Ephemeral Containers
independent of other pod updates.
To enforce these restrictions and new permissions, we will introduce a new Pod
subresource, `/ephemeralcontainers`. `EphemeralContainers` can only be modified
via this subresource. `EphemeralContainerStatuses` is updated with everything
else in `Pod.Status` via `/status`.
To create a new Ephemeral Container, one appends a new `EphemeralContainer` with
the desired `v1.Container` as `Spec` in `Pod.Spec.EphemeralContainers` and
`PUT`s the pod to `/ephemeralcontainers`.
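A hedged sketch of what this update could look like from a Go client, using the generic REST client from client-go. The `EphemeralContainer` type and the `/ephemeralcontainers` subresource are the proposal above, not released API; `clientset` is assumed to be a `*kubernetes.Clientset`, and the container name and image are arbitrary examples.
```golang
pod.Spec.EphemeralContainers = append(pod.Spec.EphemeralContainers, v1.EphemeralContainer{
	Spec: v1.Container{
		Name:  "debugger",
		Image: "busybox",
		Stdin: true,
		TTY:   true,
	},
})
// PUT the whole pod to the proposed /ephemeralcontainers subresource.
err := clientset.CoreV1().RESTClient().
	Put().
	Namespace(pod.Namespace).
	Resource("pods").
	Name(pod.Name).
	SubResource("ephemeralcontainers").
	Body(pod).
	Do().
	Error()
```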
The subresources `attach`, `exec`, `log`, and `portforward` are available for
Ephemeral Containers and will be forwarded by the apiserver. This means `kubectl
attach`, `kubectl exec`, `kubectl log`, and `kubectl port-forward` will work for
Ephemeral Containers.
Once the pod is updated, the kubelet worker watching this pod will launch the
Ephemeral Container and update its status. The client is expected to watch for
the creation of the container status and then attach to the console of a debug
container using the existing attach endpoint,
`/api/v1/namespaces/$NS/pods/$POD_NAME/attach`. Note that any output of the new
container occurring between its creation and attach will not be replayed, but it
can be viewed using `kubectl log`.
##### Alternative Considered: Standard Pod Updates
It would simplify initial implementation if we updated the pod spec via the
normal means, and switched to a new update subresource if required at a future
date. It's easier to begin with a too-restrictive policy than a too-permissive
one on which users come to rely, and we expect to be able to remove the
`/ephemeralcontainers` subresource prior to exiting alpha should it prove
unnecessary.
### Container Runtime Interface (CRI) changes
The CRI requires no changes for basic functionality, but it will need to be
updated to support container namespace targeting, as described in the
[Shared PID Namespace Proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/pod-pid-namespace.md#targeting-a-specific-containers-namespace).
### Creating Debug Containers
1. `kubectl` invokes the exec API as described in the preceding section.
1. The API server checks for name collisions with existing containers, performs
admission control and proxies the connection to the kubelet's
`/exec/$NS/$POD_NAME/$CONTAINER_NAME` endpoint.
1. The kubelet instructs the Runtime Manager to create a Debug Container.
1. The runtime manager uses the existing `startContainer()` method to create a
container in an existing pod. `startContainer()` has one modification for
Debug Containers: it creates a new runtime label (e.g. a docker label) that
identifies this container as a Debug Container.
1. After creating the container, the kubelet schedules an asynchronous update
of `PodStatus`. The update publishes the debug container status to the API
server at which point the Debug Container becomes visible via `kubectl
describe pod`.
1. The kubelet will upgrade the connection to streaming and attach to the
container's console.
To create a debug container, kubectl will take the following steps:
Rather than performing the implicit attach the kubelet could return success to
the client and require the client to perform an explicit attach, but the
implicit attach maintains consistent semantics across `/exec` rather than
varying behavior based on parameters.
The apiserver detects container name collisions with both containers in the pod
spec and other running Debug Containers by checking
`EphemeralContainerStatuses`. In a race to create two Debug Containers with the
same name, the API server will pass both requests and the kubelet must return an
error to all but one request.
1. `kubectl` constructs an `EphemeralContainer` based on command line arguments
and appends it to `Pod.Spec.EphemeralContainers`. It `PUT`s the modified pod
to the pod's `/ephemeralcontainers`.
1. The apiserver discards changes other than additions to
`Pod.Spec.EphemeralContainers` and validates the pod update.
1. Pod validation fails if container spec contains fields disallowed for
Ephemeral Containers or the same name as a container in the spec or
`EphemeralContainers`.
1. API resource versioning resolves update races.
1. The kubelet's pod watcher notices the update and triggers a `syncPod()`.
During the sync, the kubelet calls `kuberuntime.StartEphemeralContainer()`
for any new Ephemeral Container.
1. `StartEphemeralContainer()` uses the existing `startContainer()` to
start the Ephemeral Container.
1. After initial creation, future invocations of `syncPod()` will publish
its ContainerStatus but otherwise ignore the Ephemeral Container. It
will exist until it exits or for the life of the pod sandbox; in no event
will it be restarted.
1. `syncPod()` finishes a regular sync, publishing an updated PodStatus (which
includes the new `EphemeralContainer`) by its normal, existing means.
1. The client performs an attach to the debug container's console.
There are no limits on the number of Debug Containers that can be created in a
pod, but exceeding a pod's resource allocation may cause the pod to be evicted.
### Restarting and Reattaching Debug Containers
Debug Containers will never be restarted automatically. It is possible to
replace a Debug Container that has exited by re-using a Debug Container name. It
is an error to attempt to replace a Debug Container that is still running, which
is detected by both the API server and the kubelet.
Debug Containers will not be restarted.
We want to be more user-friendly by allowing re-use of the name of an exited
debug container, but this will be left for a future improvement.
One can reattach to a Debug Container using `kubectl attach`. When supported by
a runtime, multiple clients can attach to a single debug container and share the
@ -299,50 +254,25 @@ terminal. This is supported by Docker.
### Killing Debug Containers
Debug containers will not be killed automatically until the pod (specifically,
the pod sandbox) is destroyed. Debug Containers will stop when their command
exits, such as exiting a shell. Unlike `kubectl exec`, processes in Debug
Containers will not receive an EOF if their connection is interrupted.
Debug containers will not be killed automatically unless the pod is destroyed.
Debug Containers will stop when their command exits, such as exiting a shell.
Unlike `kubectl exec`, processes in Debug Containers will not receive an EOF if
their connection is interrupted.
### Container Lifecycle Changes
Implementing debug requires no changes to the Container Runtime Interface as
it's the same operation as creating a regular container. The following changes
are necessary in the kubelet:
1. `SyncPod()` must not kill any Debug Container even though it is not part of
the pod spec.
1. As an exception to the above, `SyncPod()` will kill Debug Containers when
the pod sandbox changes since a lone Debug Container in an abandoned sandbox
is not useful. Debug Containers are not automatically started in the new
sandbox.
1. `convertStatusToAPIStatus()` must sort Debug Containers status into
`EphemeralContainerStatuses` similar to as it does for
`InitContainerStatuses`
1. The kubelet must preserve `ContainerStatus` on debug containers for
reporting.
1. Debug Containers must be excluded from calculation of pod phase and
condition
It's worth noting some things that do not change:
1. `KillPod()` already operates on all running containers returned by the
runtime.
1. Containers created prior to this feature being enabled will have a
`containerType` of `""`. Since this does not match `"EPHEMERAL"` the special
handling of Debug Containers is backwards compatible.
A future improvement to Ephemeral Containers could allow killing Debug
Containers when they're removed from `EphemeralContainers`, but it's not clear
that we want to allow this. Removing an Ephemeral Container spec makes it
unavailable for future authorization decisions (e.g. whether to authorize exec
in a pod that had a privileged Ephemeral Container).
### Security Considerations
Debug Containers have no additional privileges above what is available to any
`v1.Container`. It's the equivalent of configuring a shell container in a pod
spec but created on demand.
spec except that it is created on demand.
Admission plugins that guard `/exec` must be updated for the new parameters. In
particular, they should enforce the same container image policy on the `Image`
parameter as is enforced for regular containers. During the alpha phase we will
additionally support a container image whitelist as a kubelet flag to allow
cluster administrators to easily constrain debug container images.
Admission plugins must be updated to guard `/ephemeralcontainers`. They should
apply the same container image and security policy as for regular containers.
### Additional Consideration
@ -352,116 +282,33 @@ cluster administrators to easily constraint debug container images.
troubleshooting causes a pod to exceed its resource limit it may be evicted.
1. There's an output stream race inherent to creating then attaching a
container which causes output generated between the start and attach to go
to the log rather than the client. This is not specific to Debug Containers
and exists because Kubernetes has no mechanism to attach a container prior
to starting it. This larger issue will not be addressed by Debug Containers,
but Debug Containers would benefit from future improvements or work arounds.
1. We do not want to describe Debug Containers using `v1.Container`. This is to
reinforce that Debug Containers are not general purpose containers by
limiting their configurability. Debug Containers should not be used to build
services.
1. Debug Containers are of limited usefulness without a shared PID namespace.
If a pod is configured with isolated PID namespaces, the Debug Container
will join the PID namespace of the target container. Debug Containers will
not be available with runtimes that do not implement PID namespace sharing
in some form.
to the log rather than the client. This is not specific to Ephemeral
Containers and exists because Kubernetes has no mechanism to attach a
container prior to starting it. This larger issue will not be addressed by
Ephemeral Containers, but Ephemeral Containers would benefit from future
improvements or work arounds.
1. Ephemeral Containers should not be used to build services, which we've
attempted to reflect in the API.
## Implementation Plan
### Alpha Release
### 1.12: Initial Alpha Release
#### Goals and Non-Goals for Alpha Release
We're targeting an alpha release in Kubernetes 1.9 that includes the following
We're targeting an alpha release in Kubernetes 1.12 that includes the following
basic functionality:
* Support in the kubelet for creating debug containers in a running pod
* A `kubectl debug` command to initiate a debug container
* `kubectl describe pod` will list status of debug containers running in a pod
1. Approval for basic core API changes to Pod
1. Basic support in the kubelet for creating Ephemeral Containers
Functionality out of scope for 1.12:
* Killing running Ephemeral Containers by removing them from the Pod Spec.
* Updating `pod.Spec.EphemeralContainers` when containers are garbage
collected.
* `kubectl` commands for creating Ephemeral Containers
Functionality will be hidden behind an alpha feature flag and disabled by
default. The following are explicitly out of scope for the 1.9 alpha release:
* Exited Debug Containers will be garbage collected as regular containers and
may disappear from the list of Debug Container Statuses.
* Security Context for the Debug Container is not configurable. It will always
be run with `CAP_SYS_PTRACE` and `CAP_SYS_ADMIN`.
* Image pull policy for the Debug Container is not configurable. It will
always be run with `PullAlways`.
#### kubelet Implementation
Debug Containers are implemented in the kubelet's generic runtime manager.
Performing this operation with a legacy (non-CRI) runtime will result in a not
implemented error. Implementation in the kubelet will be split into the
following steps:
##### Step 1: Container Type
The first step is to add a feature gate to ensure all changes are off by
default. This will be added in the `pkg/features` `DefaultFeatureGate`.
The runtime manager stores metadata about containers in the runtime via labels
(e.g. docker labels). These labels are used to populate the fields of
`kubecontainer.ContainerStatus`. Since the runtime manager needs to handle Debug
Containers differently in a few situations, we must add a new piece of metadata
to distinguish Debug Containers from regular containers.
`startContainer()` will be updated to write a new label
`io.kubernetes.container.type` to the runtime. Existing containers will be
started with a type of `REGULAR` or `INIT`. When added in a subsequent step,
Debug Containers will start with the type `EPHEMERAL`.
##### Step 2: Creation and Handling of Debug Containers
This step adds methods for creating debug containers, but doesn't yet modify the
kubelet API. Since the runtime manager discards runtime (e.g. docker) labels
after populating `kubecontainer.ContainerStatus`, the label value will be stored
in a the new field `ContainerStatus.Type` so it can be used by `SyncPod()`.
The kubelet gains a `RunDebugContainer()` method which accepts a `v1.Container`
and passes it on to the Runtime Manager's `RunDebugContainer()` if implemented.
Currently only the Generic Runtime Manager (i.e. the CRI) implements the
`DebugContainerRunner` interface.
The Generic Runtime Manager's `RunDebugContainer()` calls `startContainer()` to
create the Debug Container. Additionally, `SyncPod()` is modified to skip Debug
Containers unless the sandbox is restarted.
##### Step 3: kubelet API changes
The kubelet exposes the new functionality in its existing `/exec/` endpoint.
`ServeExec()` constructs a `v1.Container` based on `PodExecOptions`, calls
`RunDebugContainer()`, and performs the attach.
##### Step 4: Reporting EphemeralContainerStatus
The last major change to the kubelet is to populate
v1.`PodStatus.EphemeralContainerStatuses` based on the
`kubecontainer.ContainerStatus` for the Debug Container.
#### Kubernetes API Changes
There are two changes to be made to the Kubernetes, which will be made
independently:
1. `v1.PodExecOptions` must be extended with new fields.
1. `v1.PodStatus` gains a new field to hold Debug Container statuses.
In all cases, new fields will be prepended with `Alpha` for the duration of this
feature's alpha status.
#### kubectl changes
In anticipation of this change, [#46151](https://pr.k8s.io/46151) added a
`kubectl alpha` command to contain alpha features. We will add `kubectl alpha
debug` to invoke Debug Containers. `kubectl` does not use feature gates, so
`kubectl alpha debug` will be visible by default in `kubectl` 1.9 and return an
error when used on a cluster with the feature disabled.
`kubectl describe pod` will report the contents of `EphemeralContainerStatuses`
when not empty as it means the feature is enabled. The field will be hidden when
empty.
default.
## Appendices
@ -592,10 +439,10 @@ container image distribution mechanisms to fetch images when the debug command
is run.
**Respect admission restrictions.** Requests from kubectl are proxied through
the apiserver and so are available to existing [admission
controllers](https://kubernetes.io/docs/admin/admission-controllers/). Plugins
already exist to intercept `exec` and `attach` calls, but extending this to
support `debug` has not yet been scoped.
the apiserver and so are available to existing
[admission controllers](https://kubernetes.io/docs/admin/admission-controllers/).
Plugins already exist to intercept `exec` and `attach` calls, but extending this
to support `debug` has not yet been scoped.
**Allow introspection of pod state using existing tools**. The list of
`EphemeralContainerStatuses` is never truncated. If a debug container has run in
@ -629,26 +476,146 @@ active debug container.
### Appendix 3: Alternatives Considered
#### Mutable Pod Spec
#### Container Spec in PodStatus
Rather than adding an operation to have Kubernetes attach a pod we could instead
make the pod spec mutable so the client can generate an update adding a
container. `SyncPod()` has no issues adding the container to the pod at that
point, but an immutable pod spec has been a basic assumption in Kubernetes thus
far and changing it carries risk. It's preferable to keep the pod spec immutable
as a best practice.
Originally there was a desire to keep the pod spec immutable, so we explored
modifying only the pod status. An `EphemeralContainer` would contain a Spec, a
Status and a Target:
#### Ephemeral container
```
// EphemeralContainer describes a container to attach to a running pod for troubleshooting.
type EphemeralContainer struct {
metav1.TypeMeta `json:",inline"`
An earlier version of this proposal suggested running an ephemeral container in
the pod namespaces. The container would not be added to the pod spec and would
exist only as long as the process it ran. This has the advantage of behaving
similarly to the current kubectl exec, but it is opaque and likely violates
design assumptions. We could add constructs to track and report on both
traditional exec process and exec containers, but this would probably be more
work than adding to the pod spec. Both are generally useful, and neither
precludes the other in the future, so we chose mutating the pod spec for
expedience.
// Spec describes the Ephemeral Container to be created.
Spec *Container `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
// Most recently observed status of the container.
// This data may not be up to date.
// Populated by the system.
// Read-only.
// +optional
Status *ContainerStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
// If set, the name of the container from PodSpec that this ephemeral container targets.
// If not set then the ephemeral container is run in whatever namespaces are shared
// for the pod.
TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,4,opt,name=targetContainerName"`
}
```
Ephemeral Containers for a pod would be listed in the pod's status:
```
type PodStatus struct {
...
// List of user-initiated ephemeral containers that have been run in this pod.
// +optional
EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"`
}
```
To create a new Ephemeral Container, one would append a new `EphemeralContainer`
with the desired `v1.Container` as `Spec` in `Pod.Status` and update the `Pod`
in the API. Users cannot normally modify the pod status, so we'd create a new
subresource `/ephemeralcontainers` that allows an update of solely
`EphemeralContainers` and enforces append-only semantics.
Since we have a requirement to describe the Ephemeral Container with a
`v1.Container`, this led to a "spec in status" that seemed to violate API best
practices. It was confusing, and it required added complexity in the kubelet to
persist and publish user intent, which is rightfully the job of the apiserver.
#### Extend the Existing Exec API ("exec++")
A simpler change is to extend `v1.Pod`'s `/exec` subresource to support
"executing" container images. The current `/exec` endpoint must implement `GET`
to support streaming for all clients. We don't want to encode a (potentially
large) `v1.Container` into a query string, so we must extend `v1.PodExecOptions`
with the specific fields required for creating a Debug Container:
```
// PodExecOptions is the query options to a Pod's remote exec call
type PodExecOptions struct {
...
// EphemeralContainerName is the name of an ephemeral container in which the
// command ought to be run. Either both EphemeralContainerName and
// EphemeralContainerImage fields must be set, or neither.
EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...`
// EphemeralContainerImage is the image of an ephemeral container in which the command
// ought to be run. Either both EphemeralContainerName and EphemeralContainerImage
// fields must be set, or neither.
EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...`
}
```
After creating the Ephemeral Container, the kubelet would upgrade the connection
to streaming and perform an attach to the container's console. If disconnected,
the Ephemeral Container could be reattached using the pod's `/attach` endpoint
with `EphemeralContainerName`.
Ephemeral Containers could not be removed via the API and instead the process
must terminate. While not ideal, this parallels existing behavior of `kubectl
exec`. To kill an Ephemeral Container one would `attach` and exit the process
interactively or create a new Ephemeral Container to send a signal with
`kill(1)` to the original process.
Since the user cannot specify the `v1.Container`, this approach sacrifices a
great deal of flexibility. This solution still requires the kubelet to publish a
`Container` spec in the `PodStatus` that can be examined for future admission
decisions and so retains many of the downsides of the Container Spec in
PodStatus approach.
#### Ephemeral Container Controller
Kubernetes prefers declarative APIs where the client declares a state for
Kubernetes to enact. We could implement this in a declarative manner by creating
a new `EphemeralContainer` type:
```
type EphemeralContainer struct {
metav1.TypeMeta
metav1.ObjectMeta
Spec v1.Container
Status v1.ContainerStatus
}
```
A new controller in the kubelet would watch for EphemeralContainers and
create/delete debug containers. `EphemeralContainer.Status` would be updated by
the kubelet at the same time it updates `ContainerStatus` for regular and init
containers. Clients would create a new `EphemeralContainer` object, wait for it
to be started and then attach using the pod's attach subresource and the name of
the `EphemeralContainer`.
A new controller is a significant amount of complexity to add to the kubelet,
especially considering that the kubelet is already watching for changes to pods.
The kubelet would have to be modified to create containers in a pod from
multiple config sources. SIG Node strongly prefers to minimize kubelet
complexity.
#### Mutable Pod Spec Containers
Rather than adding to the pod API, we could instead make the pod spec mutable so
the client can generate an update adding a container. `SyncPod()` has no issues
adding the container to the pod at that point, but an immutable pod spec has
been a basic assumption and best practice in Kubernetes. Changing this
assumption complicates the requirements of the kubelet state machine. Since the
kubelet was not written with this in mind, we should expect such a change would
create bugs we cannot predict.
#### Image Exec
An earlier version of this proposal suggested simply adding `Image` parameter to
the exec API. This would run an ephemeral container in the pod namespaces
without adding it to the pod spec or status. This container would exist only as
long as the process it ran. This parallels the current kubectl exec, including
its lack of transparency. We could add constructs to track and report on both
traditional exec process and exec containers. In the end this failed to meet our
transparency requirements.
#### Attaching Container Type Volume
@ -669,9 +636,8 @@ this simplifies the solution by working within the existing constraints of
If Kubernetes supported the concept of an "inactive" container, we could
configure it as part of a pod and activate it at debug time. In order to avoid
coupling the debug tool versions with those of the running containers, we would
need to ensure the debug image was pulled at debug time. The container could
then be run with a TTY and attached using kubectl. We would need to figure out a
solution that allows access the filesystem of other containers.
want to ensure the debug image was pulled at debug time. The container could
then be run with a TTY and attached using kubectl.
The downside of this approach is that it requires prior configuration. In
addition to requiring prior consideration, it would increase boilerplate config.
@ -681,14 +647,14 @@ than a feature of the platform.
#### Implicit Empty Volume
Kubernetes could implicitly create an EmptyDir volume for every pod which would
then be available as target for either the kubelet or a sidecar to extract a
then be available as a target for either the kubelet or a sidecar to extract a
package of binaries.
Users would have to be responsible for hosting a package build and distribution
infrastructure or rely on a public one. The complexity of this solution makes it
undesirable.
#### Standalone Pod in Shared Namespace
#### Standalone Pod in Shared Namespace ("Debug Pod")
Rather than inserting a new container into a pod namespace, Kubernetes could
instead support creating a new pod with container namespaces shared with
@ -698,21 +664,21 @@ useful, the containers in this "Debug Pod" should be run inside the namespaces
(network, pid, etc) of the target pod but remain in a separate resource group
(e.g. cgroup for container-based runtimes).
This would be a rather fundamental change to pod, which is currently treated as
an atomic unit. The Container Runtime Interface has no provisions for sharing
This would be a rather large change for pod, which is currently treated as an
atomic unit. The Container Runtime Interface has no provisions for sharing
outside of a pod sandbox and would need a refactor. This could be a complicated
change for non-container runtimes (e.g. hypervisor runtimes) which have more
rigid boundaries between pods.
Effectively, Debug Pod must be implemented by the runtimes while Debug
Containers are implemented by the kubelet. Minimizing change to the Kubernetes
API is not worth the increased complexity for the kubelet and runtimes.
This is pushing the complexity of the solution from the kubelet to the runtimes.
Minimizing change to the Kubernetes API is not worth the increased complexity
for the kubelet and runtimes.
It could also be possible to implement a Debug Pod as a privileged pod that runs
in the host namespace and interacts with the runtime directly to run a new
container in the appropriate namespace. This solution would be runtime-specific
and effectively pushes the complexity of debugging to the user. Additionally,
requiring node-level access to debug a pod does not meet our requirements.
and pushes the complexity of debugging to the user. Additionally, requiring
node-level access to debug a pod does not meet our requirements.
#### Exec from Node
@ -729,8 +695,7 @@ coupling it with container images.
* [Pod Troubleshooting Tracking Issue](https://issues.k8s.io/27140)
* [CRI Tracking Issue](https://issues.k8s.io/28789)
* [CRI: expose optional runtime features](https://issues.k8s.io/32803)
* [Resource QoS in
Kubernetes](resource-qos.md)
* [Resource QoS in Kubernetes](resource-qos.md)
* Related Features
* [#1615](https://issues.k8s.io/1615) - Shared PID Namespace across
containers in a pod

Two binary image files added (50 KiB and 43 KiB).

View File

@ -48,13 +48,15 @@ Thus, we decide to reuse the existing `Scopes` of `ResourceQuotaSpec` to provide
## Overview
This design doc introduces how to define a group of priority class scopes for the quota to match with and explains how quota enforcement logic is changed to apply the quota to pods with the given priority classes.
This design doc introduces how to define a priority class scope and scope selectors for the quota to match with and explains how quota enforcement logic is changed to apply the quota to pods with the given priority classes.
## Detailed Design
### Changes in ResourceQuota
ResourceQuotaSpec contains an array of filters, `Scopes`, that, if specified, must match each object tracked by a ResourceQuota.
A new field `scopeSelector` will be introduced.
```go
// ResourceQuotaSpec defines the desired hard limits to enforce for Quota
type ResourceQuotaSpec struct {
@ -64,22 +66,56 @@ type ResourceQuotaSpec struct {
// If not specified, the quota matches all objects.
// +optional
Scopes []ResourceQuotaScope
// ScopeSelector is also a collection of filters like Scopes that must match each object tracked by a quota
// but expressed using ScopeSelectorOperator in combination with possible values.
// +optional
ScopeSelector *ScopeSelector
}
// A scope selector represents the AND of the selectors represented
// by the scoped-resource selector terms.
type ScopeSelector struct {
// A list of scope selector requirements by scope of the resources.
// +optional
MatchExpressions []ScopedResourceSelectorRequirement
}
// A scoped-resource selector requirement is a selector that contains values, a scope name, and an operator
// that relates the scope name and values.
type ScopedResourceSelectorRequirement struct {
// The name of the scope that the selector applies to.
ScopeName ResourceQuotaScope
// Represents a scope's relationship to a set of values.
// Valid operators are In, NotIn, Exists, DoesNotExist.
Operator ScopeSelectorOperator
// An array of string values. If the operator is In or NotIn,
// the values array must be non-empty. If the operator is Exists or DoesNotExist,
// the values array must be empty.
// This array is replaced during a strategic merge patch.
// +optional
Values []string
}
// A scope selector operator is the set of operators that can be used in
// a scope selector requirement.
type ScopeSelectorOperator string
const (
ScopeSelectorOpIn ScopeSelectorOperator = "In"
ScopeSelectorOpNotIn ScopeSelectorOperator = "NotIn"
ScopeSelectorOpExists ScopeSelectorOperator = "Exists"
ScopeSelectorOpDoesNotExist ScopeSelectorOperator = "DoesNotExist"
)
```
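To make the operator semantics concrete, here is a minimal, self-contained sketch of how a single requirement could be evaluated against a pod's priority class name. The package, helper types, and function names below are invented for illustration and are not the quota evaluator's real code; an empty string stands for a pod without a priority class.

```go
package main

import "fmt"

type ScopeSelectorOperator string

const (
	OpIn           ScopeSelectorOperator = "In"
	OpNotIn        ScopeSelectorOperator = "NotIn"
	OpExists       ScopeSelectorOperator = "Exists"
	OpDoesNotExist ScopeSelectorOperator = "DoesNotExist"
)

// requirement mirrors ScopedResourceSelectorRequirement for the
// PriorityClass scope only; the real type also carries ScopeName.
type requirement struct {
	Operator ScopeSelectorOperator
	Values   []string
}

// matches evaluates the requirement against a pod's priority class name
// (empty string means the pod has no priority class).
func matches(priorityClassName string, r requirement) bool {
	switch r.Operator {
	case OpExists:
		return priorityClassName != ""
	case OpDoesNotExist:
		return priorityClassName == ""
	case OpIn:
		for _, v := range r.Values {
			if v == priorityClassName {
				return true
			}
		}
		return false
	case OpNotIn:
		for _, v := range r.Values {
			if v == priorityClassName {
				return false
			}
		}
		return true
	}
	return false
}

func main() {
	r := requirement{Operator: OpIn, Values: []string{"cluster-services"}}
	fmt.Println(matches("cluster-services", r)) // true
	fmt.Println(matches("", r))                 // false
}
```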
Four new `ResourceQuotaScope` will be defined for matching pods based on priority class names.
A new `ResourceQuotaScope` will be defined for matching pods based on priority class names.
```go
// A ResourceQuotaScope defines a filter that must match each object tracked by a quota
type ResourceQuotaScope string
const (
...
ResourceQuotaScopePriorityClassNameExists ResourceQuotaScope = "PriorityClassNameExists"
// Match all pod objects that do not have any priority class mentioned
ResourceQuotaScopePriorityClassNameNotExists ResourceQuotaScope = "PriorityClassNameNotExists"
// Match all pod objects that have priority class from the set
ResourceQuotaScopePriorityClassNameIn ResourceQuotaScope = "PriorityClassNameIn"
// Match all pod objects that do not have priority class from the set
ResourceQuotaScopePriorityClassNameNotIn ResourceQuotaScope = "PriorityClassNameNotIn"
ResourceQuotaScopePriorityClass ResourceQuotaScope = "PriorityClass"
)
```
@ -99,17 +135,15 @@ type Configuration struct {
// its consumption.
type LimitedResource struct {
...
// MatchScopes is a collection of filters based on priority classes.
// If the object in the intercepted request matches these rules,
// quota system will ensure that corresponding quota MUST have
// priority based Scopes matching the object in request.
//
// If MatchScopes has matched on an object, request for the resource will be denied
// if there is no quota with matching Scopes. In this case, matching priority class based Scopes
// will be an additional requirement for any quota to qualified as covering quota.
// For each intercepted request, the quota system will figure out if the input object
// satisfies a scope which is present in this listing, and if so the
// quota system will ensure that there is a covering quota. In the
// absence of a covering quota, the quota system will deny the request.
// For example, if an administrator wants to globally enforce that
// a quota must exist to create a pod with "cluster-services" priorityclass
// the list would include "scopeName=PriorityClass, Operator=In, Value=cluster-services"
// +optional
MatchScopes []string `json:"matchScopes,omitempty"`
MatchScopes []v1.ScopedResourceSelectorRequirement `json:"matchScopes,omitempty"`
}
```
@ -141,24 +175,42 @@ kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
configuration:
apiVersion: resourcequota.admission.k8s.io/v1alpha1
kind: Configuration
limitedResources:
- resource: pods
matchScopes:
- "ResourceQuotaScopePriorityClassNameIn:cluster-services"
apiVersion: resourcequota.admission.k8s.io/v1alpha1
kind: Configuration
limitedResources:
- resource: pods
  matchScopes:
  - scopeName: PriorityClass
    operator: In
    values: ["cluster-services"]
```
2. Admin will then create a corresponding resource quota object in `kube-system` namespace:
`$ kubectl create quota critical --hard=count/pods=10 --scopes=ResourceQuotaScopePriorityClassNameIn:cluster-services -n kube-system`
```shell
$ cat ./quota.yml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: pods-cluster-services
spec:
  hard:
    pods: "10"
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values: ["cluster-services"]
$ kubectl create -f ./quota.yml -n kube-system
```
In this case, a pod creation will be allowed if:
1. Pod has no priority class and is created in any namespace.
2. Pod has priority class other than `cluster-service` and created in any namespace.
3. Pod has priority class `cluster-service` and created in `kube-system` namespace, and passed resource quota check.
2. Pod has a priority class other than `cluster-services` and is created in any namespace.
3. Pod has the priority class `cluster-services`, is created in the `kube-system` namespace, and passes the resource quota check.
Pod creation will be rejected if pod has priority class `cluster-service` and created in namespace other than `kube-system`
Pod creation will be rejected if the pod has the priority class `cluster-services` and is created in a namespace other than `kube-system`.
#### Sample User Story 2
@ -172,15 +224,31 @@ kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
configuration:
apiVersion: resourcequota.admission.k8s.io/v1alpha1
kind: Configuration
limitedResources:
- resource: pods
matchScopes:
- "ResourceQuotaScopePriorityClassNameExists"
apiVersion: resourcequota.admission.k8s.io/v1alpha1
kind: Configuration
limitedResources:
- resource: pods
  matchScopes:
  - operator: Exists
    scopeName: PriorityClass
```
2. Create a resource quota to match all pods that have a priority class set:
`$ kubectl create quota example --hard=count/pods=10 --scopes=ResourceQuotaScopePriorityClassNameExists`
```shell
$ cat ./quota.yml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: example
spec:
  hard:
    pods: "10"
  scopeSelector:
    matchExpressions:
    - operator: Exists
      scopeName: PriorityClass
$ kubectl create -f ./quota.yml
```

View File

@ -28,7 +28,7 @@ implied. However, describing the process as "moving" the pod is approximately ac
and easier to understand, so we will use this terminology in the document.
We use the term "rescheduling" to describe any action the system takes to move an
already-running pod. The decision may be made and executed by any component; we wil
already-running pod. The decision may be made and executed by any component; we will
introduce the concept of a "rescheduler" component later, but it is not the only
component that can do rescheduling.
@ -177,7 +177,7 @@ topic that is outside the scope of this document. For example, resource fragment
RequiredDuringScheduling node and pod affinity and anti-affinity means that even if the
sum of the quotas at the top priority level is less than or equal to the total aggregate
capacity of the cluster, some pods at the top priority level might still go pending. In
general, priority provdes a *probabilistic* guarantees of pod schedulability in the face
general, priority provides a *probabilistic* guarantee of pod schedulability in the face
of overcommitment, by allowing prioritization of which pods should be allowed to run
when demand for cluster resources exceeds supply.

View File

@ -45,22 +45,21 @@ This option is to leverage NodeAffinity feature to avoid introducing scheduler
1. DS controller filter nodes by nodeSelector, but does NOT check against schedulers predicates (e.g. PodFitHostResources)
2. For each node, DS controller creates a Pod for it with the following NodeAffinity
```yaml
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- nodeSelectorTerms:
matchExpressions:
- key: kubernetes.io/hostname
operator: in
values:
- dest_hostname
```
3. When syncing Pods, the DS controller maps nodes to Pods by this NodeAffinity to check whether Pods have been started for the nodes
4. In the scheduler, DaemonSet Pods will stay pending if scheduling predicates fail. To avoid this, an appropriate priority must
be set on all critical DaemonSet Pods. The scheduler will preempt other pods to ensure critical pods are scheduled even when
the cluster is under resource pressure.
```yaml
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: In
        values:
        - dest_hostname
```
## Reference
* [DaemonsetController can't feel it when node has more resources, e.g. other Pod exits](https://github.com/kubernetes/kubernetes/issues/46935)

View File

@ -190,7 +190,7 @@ Please note with the change of predicates in subsequent development, this doc wi
- **Invalid predicates:**
- `MaxPDVolumeCountPredicate` (only if the added/deleted PVC as a binded volume so it drops to the PV change case, otherwise it should not affect scheduler).
- `MaxPDVolumeCountPredicate` (only if the added/deleted PVC is used as a bound volume, in which case it reduces to the PV change case; otherwise it should not affect the scheduler).
- **Scope:**
- All nodes (we don't know which node this PV will be attached to).
@ -229,14 +229,14 @@ Please note with the change of predicates in subsequent development, this doc wi
- **Invalid predicates:**
- `GeneralPredicates`. This invalidate should be done during `scheduler.assume(...)` because binding can be asynchronous. So we just optimistically invalidate predicate cached result there, and if later this pod failed to bind, the following pods will go through normal predicate functions and nothing breaks.
- No `MatchInterPodAffinity`: the scheduler will make sure newly binded pod will not break the existing inter pod affinity. So we does not need to invalidate MatchInterPodAffinity when pod added. But when a pod is deleted, existing inter pod affinity may become invalid. (e.g. this pod was preferred by some else, or vice versa).
- No `MatchInterPodAffinity`: the scheduler will make sure a newly bound pod will not break the existing inter-pod affinity. So we do not need to invalidate MatchInterPodAffinity when a pod is added. But when a pod is deleted, existing inter-pod affinity may become invalid (e.g. this pod was preferred by some other pod, or vice versa).
- NOTE: assumptions above **will not** stand when we implemented features like `RequiredDuringSchedulingRequiredDuringExecution`.
- No `NoDiskConflict`: if the newly scheduled pod fits with the existing pods on this node, it will also fit the equivalence class of the existing pods.
- **Scope:**
- The node which the pod was binded with.
- The node where the pod is bound.
@ -252,7 +252,7 @@ Please note with the change of predicates in subsequent development, this doc wi
- `MatchInterPodAffinity` if the pod's labels are updated.
- **Scope:**
- The node which the pod was binded with
- The node where the pod is bound.
@ -270,7 +270,7 @@ Please note with the change of predicates in subsequent development, this doc wi
- `NoDiskConflict` if the pod has special volume like `RBD`, `ISCSI`, `GCEPersistentDisk` etc.
- **Scope:**
- The node which the pod was binded with.
- The node where the pod is bound.
### 3.5 Node

View File

@ -0,0 +1,436 @@
Status: Draft
Created: 2018-04-09 / Last updated: 2018-08-15
Author: bsalamat
Contributors: misterikkit
---
#
- [SUMMARY ](#summary-)
- [OBJECTIVE](#objective)
- [Terminology](#terminology)
- [BACKGROUND](#background)
- [OVERVIEW](#overview)
- [Non-goals](#non-goals)
- [DETAILED DESIGN](#detailed-design)
- [Bare bones of scheduling](#bare-bones-of-scheduling)
- [Communication and statefulness of plugins](#communication-and-statefulness-of-plugins)
- [Plugin registration](#plugin-registration)
- [Extension points](#extension-points)
- [Scheduling queue sort](#scheduling-queue-sort)
- [Pre-filter](#pre-filter)
- [Filter](#filter)
- [Post-filter](#post-filter)
- [Scoring](#scoring)
- [Post-scoring/pre-reservation](#post-scoringpre-reservation)
- [Reserve](#reserve)
- [Permit](#permit)
- [Approving a Pod binding](#approving-a-pod-binding)
- [Reject](#reject)
- [Pre-Bind](#pre-bind)
- [Bind](#bind)
- [Post Bind](#post-bind)
- [USE-CASES](#use-cases)
- [Dynamic binding of cluster-level resources](#dynamic-binding-of-cluster-level-resources)
- [Gang Scheduling](#gang-scheduling)
- [OUT OF PROCESS PLUGINS](#out-of-process-plugins)
- [CONFIGURING THE SCHEDULING FRAMEWORK](#configuring-the-scheduling-framework)
- [BACKWARD COMPATIBILITY WITH SCHEDULER v1](#backward-compatibility-with-scheduler-v1)
- [DEVELOPMENT PLAN](#development-plan)
- [TESTING PLAN](#testing-plan)
- [WORK ESTIMATES ](#work-estimates)
# SUMMARY
This document describes the Kubernetes Scheduling Framework. The scheduling
framework implements only basic functionality, but exposes many extension points
for plugins to expand its functionality. The plan is that this framework (with
its plugins) will eventually replace the current Kubernetes scheduler.
# OBJECTIVE
- Make the scheduler more extendable.
- Make scheduler core simpler by moving some of its features to plugins.
- Propose extension points in the framework.
- Propose a mechanism to receive plugin results and continue or abort based
on the received results.
- Propose a mechanism to handle errors and communicate them to plugins.
## Terminology
Scheduler v1, current scheduler: refers to the existing Kubernetes scheduler.
Scheduler v2, scheduling framework: refers to the new scheduler proposed in this
doc.
# BACKGROUND
Many features are being added to the Kubernetes default scheduler. They keep
making the code larger and logic more complex. A more complex scheduler is
harder to maintain, its bugs are harder to find and fix, and those users running
a custom scheduler have a hard time catching up and integrating new changes.
The current Kubernetes scheduler provides
[webhooks to extend](./scheduler_extender.md)
its functionality. However, these are limited in a few ways:
1. The number of extension points is limited: "Filter" extenders are called
after the default predicate functions. "Prioritize" extenders are called after
the default priority functions. "Preempt" extenders are called after running the
default preemption mechanism. The "Bind" verb of the extenders is used to bind
a Pod. Only one of the extenders can be a binding extender, and that
extender performs binding instead of the scheduler. Extenders cannot be
invoked at other points, for example, they cannot be called before running
predicate functions.
1. Every call to the extenders involves marshaling and unmarshalling JSON.
Calling a webhook (HTTP request) is also slower than calling native functions.
1. It is hard to inform an extender that the scheduler has aborted scheduling of
a Pod. For example, suppose an extender provisions a cluster resource: the
scheduler asks the extender to provision an instance of the resource for the
Pod being scheduled, then faces errors and decides to abort the scheduling. It
is hard to communicate the error to the extender and ask it to undo the
provisioning of the resource.
1. Since current extenders run as a separate process, they cannot use
scheduler's cache. They must either build their own cache from the API
server or process only the information they receive from the default scheduler.
The above limitations hinder building high-performance and versatile scheduler
extensions. We would ideally like to have an extension mechanism that is fast
enough to allow keeping a bare minimum of logic in the scheduler core and to
convert many of the existing features of the default scheduler, such as
predicate and priority functions and preemption, into plugins. Such plugins will
be compiled with the scheduler. We would also like to provide an extension
mechanism that does not need recompilation of the scheduler. The expected
performance of such plugins is lower than in-process plugins. Such
out-of-process plugins should be used in cases where quick invocation of the
plugin is not a constraint.
# OVERVIEW
Scheduler v2 allows both built-in and out-of-process extenders. This new
architecture is a scheduling framework that exposes several extension points
during a scheduling cycle. Scheduler plugins can register to run at one or more
extension points.
#### Non-goals
- We will keep Kubernetes API backward compatibility, but keeping scheduler
v1 backward compatibility is a non-goal. Particularly, scheduling policy
config and v1 extenders won't work in this new framework.
- Solve all the scheduler v1 limitations, although we would like to ensure
that the new framework allows us to address known limitations in the future.
- Provide implementation details of plugins and call-back functions, such as
all of their arguments and return values.
# DETAILED DESIGN
## Bare bones of scheduling
Pods that are not assigned to any node go to a scheduling queue and are sorted
in the order specified by plugins (described [here](#scheduling-queue-sort)). The
scheduling framework picks the head of the queue and starts a **scheduling
cycle** to schedule the pod. At the end of the cycle the scheduler determines
whether the pod is schedulable or not. If the pod is not schedulable, its status
is updated and it goes back to the scheduling queue. If the pod is schedulable (one
or more nodes are found that can run the Pod), the scoring process is started.
The scoring process finds the best node to run the Pod. Once the best node is
picked, the scheduler updates its cache and then a bind goroutine is started to
bind the pod.
The above process is the same as what Kubernetes scheduler v1 does. Some of the
essential features of scheduler v1, such as leader election, will also be
transferred to the scheduling framework.
In the rest of this section we describe how various plugins are used to enrich
this basic workflow. This document focuses on in-process plugins.
Out-of-process plugins are discussed later in a separate doc.
## Communication and statefulness of plugins
The scheduling framework provides a library that plugins can use to pass
information to other plugins. This library keeps a map from keys of type string
to opaque pointers of type interface{}. A write operation takes a key and a
pointer and stores the opaque pointer in the map with the given key. Other
plugins can provide the key and receive the opaque pointer. Multiple plugins can
share the state or communicate via this mechanism.
The saved state is preserved only during a single scheduling cycle. At the end
of a scheduling cycle, this map is destructed. So, plugins cannot keep shared
state across multiple scheduling cycles. They can, however, update the scheduler
cache via the provided interface of the cache. The cache interface allows
limited state preservation across multiple scheduling cycles.
It is worth noting that plugins are assumed to be **trusted**. Scheduler does
not prevent one plugin from accessing or modifying another plugin's state.
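As a rough illustration of this per-cycle store, a sketch could look like the following; the type and method names are assumptions for this example, not the framework's actual API.

```go
package framework

import "sync"

// cycleState is a sketch of the per-scheduling-cycle key/value store
// described above. It is created at the start of a scheduling cycle and
// discarded at the end, so nothing stored here outlives the cycle.
type cycleState struct {
	mu   sync.RWMutex
	data map[string]interface{}
}

func newCycleState() *cycleState {
	return &cycleState{data: map[string]interface{}{}}
}

// Write stores an opaque value under key; later plugins in the same
// cycle can read it back.
func (c *cycleState) Write(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
}

// Read returns the value stored under key, if any.
func (c *cycleState) Read(key string) (interface{}, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}
```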
## Plugin registration
Plugin registration is done by providing an extension point and a function that
should be called at that extension point. This step will be something like:
```go
register("pre-filter", plugin.foo)
```
The details of the function signature will be provided later.
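Purely as an illustration of how such registration could be backed (the real signatures are intentionally left open by this doc), a registry might keep an ordered list of callbacks per extension point:

```go
package framework

// plugin is a placeholder for whatever callback type each extension
// point ends up using; the doc leaves the real signatures unspecified.
type plugin func(args ...interface{}) error

// registry keeps plugins per extension point, in registration order,
// which is also the order the framework calls them.
var registry = map[string][]plugin{}

func register(extensionPoint string, p plugin) {
	registry[extensionPoint] = append(registry[extensionPoint], p)
}
```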
## Extension points
The following picture shows the scheduling cycle of a Pod and the extension
points that the scheduling framework exposes. In this picture "Filter" is
equivalent to "Predicate" in scheduler v1 and "Scoring" is equivalent to
"Priority function". Plugins are go functions. They are registered to be called
at one of these extension points. They are called by the framework in the same
order they are registered for each extension point.
In the following sections we describe each extension point in the same order
they are called in a scheduling cycle.
![image](images/scheduling-framework-extensions.png)
### Scheduling queue sort
These plugins indicate how Pods should be sorted in the scheduling queue. A
plugin registered at this point only returns greater, smaller, or equal to
indicate an ordering between two Pods. In other words, a plugin at this
extension point returns the answer to "less(pod1, pod2)". Multiple plugins may
be registered at this point. Plugins registered at this point are called in
order and the invocation continues as long as plugins return "equal". Once a
plugin returns "greater" or "smaller", the invocation of these plugins is
stopped.
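A sketch of how the framework could chain queue-sort plugins; the types and names are illustrative only.

```go
package framework

// podInfo stands in for whatever pod representation the framework passes
// to queue-sort plugins; only what the example needs is included.
type podInfo struct {
	name     string
	priority int
}

// queueSortPlugin returns -1 if a should come before b, +1 if b should
// come before a, and 0 if this plugin considers them equal.
type queueSortPlugin func(a, b podInfo) int

// less walks the registered plugins in order and stops at the first one
// that reports a strict ordering, as described above.
func less(plugins []queueSortPlugin, a, b podInfo) bool {
	for _, p := range plugins {
		switch p(a, b) {
		case -1:
			return true
		case 1:
			return false
		}
		// 0 means "equal": fall through to the next plugin.
	}
	return false
}
```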
### Pre-filter
These plugins are generally useful to check certain conditions that the cluster
or the Pod must meet. These are also useful to perform pre-processing on the pod
and store some information about the pod that can be used by other plugins.
The pod pointer is passed as an argument to these plugins. If any of these
plugins return an error, the scheduling cycle is aborted.
These plugins are called serially in the same order registered.
### Filter
Filter plugins filter out nodes that cannot run the Pod. Scheduler runs these
plugins per node in the same order that they are registered, but scheduler may
run these filter functions for multiple nodes in parallel. So, these plugins must
use synchronization when they modify state.
Scheduler stops running the remaining filter functions for a node once one of
these filters fails for the node.
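A sketch of the per-node filter loop with the early exit described above; the pod and node types are placeholders, and the real framework is expected to parallelize this across nodes.

```go
package framework

// pod and node are placeholders for the framework's real types.
type pod struct{ name string }
type node struct{ name string }

// filterPlugin reports whether the given pod can run on the given node.
type filterPlugin func(p pod, n node) (fits bool, err error)

// nodeFits runs the registered filter plugins in order and stops at the
// first one that fails for this node, as described above.
func nodeFits(plugins []filterPlugin, p pod, n node) (bool, error) {
	for _, f := range plugins {
		fits, err := f(p, n)
		if err != nil {
			return false, err
		}
		if !fits {
			return false, nil
		}
	}
	return true, nil
}
```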
### Post-filter
The Pod and the set of nodes that can run the Pod are passed to these plugins.
They are called whether Pod is schedulable or not (whether the set of nodes is
empty or non-empty).
If any of these plugins return an error or if the Pod is determined
unschedulable, the scheduling cycle is aborted.
These plugins are called serially.
### Scoring
These plugins are similar to priority function in scheduler v1. They are
utilized to rank nodes that have passed the filtering stage. Similar to Filter
plugins, these are called per node serially in the same order registered, but
scheduler may run them for multiple nodes in parallel.
Each one of these functions returns a score for the given node. The score is
multiplied by the weight of the function and aggregated with the results of other
scoring functions to yield a total score for the node.
These functions can never block scheduling. In case of an error they should
return zero for the Node being ranked.
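The weighted aggregation described above might look roughly like this sketch, where a plugin error is translated into a zero score as the text requires; the names are illustrative.

```go
package framework

// scorePlugin returns a score for one node; weight scales its influence.
type scorePlugin struct {
	score  func(nodeName string) (int64, error)
	weight int64
}

// totalScore aggregates all scoring plugins for a single node. A plugin
// error never blocks scheduling: that plugin simply contributes zero.
func totalScore(plugins []scorePlugin, nodeName string) int64 {
	var total int64
	for _, p := range plugins {
		s, err := p.score(nodeName)
		if err != nil {
			s = 0
		}
		total += s * p.weight
	}
	return total
}
```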
### Post-scoring/pre-reservation
After all scoring plugins are invoked and the scores of the nodes are determined, the
framework picks the node with the highest score and then it calls
post-scoring plugins. The Pod and the chosen Node are passed to these plugins.
These plugins have one more chance to check any conditions about the assignment
of the Pod to this Node and reject the node if needed.
![image](images/scheduling-framework-threads.png)
### Reserve
At this point scheduler updates its cache by "reserving" a Node (partially or
fully) for the Pod. In scheduler v1 this stage is called "assume".
At this point, only the scheduler cache is updated to
reflect that the Node is (partially) reserved for the Pod. The scheduling
framework calls plugins registered at this extension point so that they get a
chance to perform cache updates or other accounting activities. These plugins
do not return any value (except errors).
The actual assignment of the Node to the Pod happens during the "Bind" phase.
That is when the API server updates the Pod object with the Node information.
### Permit
Permit plugins run in a separate goroutine (in parallel). Each plugin can return
one of the three possible values: 1) "permit", 2) "deny", or 3) "wait". If all
plugins registered at this extension point return "permit", the pod is sent to
the next step for binding. If any of the plugins returns "deny", the pod is
rejected and sent back to the scheduling queue. If any of the plugins returns
"wait", the Pod is kept in reserved state until it is explicitly approved for
binding. A plugin that returns "wait" must return a "timeout" as well. If the
timeout expires, the pod is rejected and goes back to the scheduling queue.
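A sketch of the three-valued result this implies; the type names and the timeout plumbing are assumptions for illustration.

```go
package framework

import "time"

type permitDecision int

const (
	permitAllow permitDecision = iota
	permitDeny
	permitWait
)

// permitResult is what a Permit plugin could return: "wait" must be
// accompanied by a timeout after which the pod is rejected and requeued.
type permitResult struct {
	decision permitDecision
	timeout  time.Duration // only meaningful when decision == permitWait
}
```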
#### Approving a Pod binding
While any plugin can receive the list of reserved Pods from the cache and approve
them, we expect only the "Permit" plugins to approve binding of reserved Pods
that are in "waiting" state. Once a Pod is approved, it is sent to the Bind
stage.
### Reject
Plugins called at "Permit" may perform some operations that should be undone if
the Pod reservation fails. The "Reject" extension point allows such clean-up
operations to happen. Plugins registered at this point are called if the
reservation of the Pod is cancelled. The reservation is cancelled if any of the
"Permit" plugins returns "reject" or if a Pod reservation, which is in "wait"
state, times out.
### Pre-Bind
When a Pod is approved for binding, it reaches this stage. These plugins run
before the actual binding of the Pod to a Node happens. The binding starts only
if all of these plugins return true. If any returns false, the Pod is rejected
and sent back to the scheduling queue. These plugins run in a separate
goroutine. The same goroutine runs "Bind" after these plugins when all of them
return true.
### Bind
Once all pre-bind plugins return true, the Bind plugins are executed. Multiple
plugins may be registered at this extension point. Each plugin may return true
or false (or an error). If a plugin returns false, the next plugin will be
called until a plugin returns true. Once a true is returned **the remaining
plugins are skipped**. If any of the plugins returns an error or all of them
return false, the Pod is rejected and sent back to the scheduling queue.
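A sketch of the bind chain described above, where the first plugin that handles the pod stops the chain; names are illustrative.

```go
package framework

import "fmt"

// bindPlugin returns true if it bound the pod, false if it declined, or
// an error. podName and nodeName stand in for the framework's arguments.
type bindPlugin func(podName, nodeName string) (bool, error)

// bind calls the registered bind plugins in order; the first one that
// returns true handles the binding and the remaining plugins are skipped.
// If a plugin errors or every plugin declines, the pod is rejected.
func bind(plugins []bindPlugin, podName, nodeName string) error {
	for _, b := range plugins {
		ok, err := b(podName, nodeName)
		if err != nil {
			return err
		}
		if ok {
			return nil
		}
	}
	return fmt.Errorf("no bind plugin bound pod %s", podName)
}
```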
### Post Bind
The Post Bind plugins can be useful for housekeeping after a pod is scheduled.
These plugins do not return any value and are not expected to influence the
scheduling decision made in the scheduling cycle.
### Informer Events
The scheduling framework, similar to Scheduler v1, will have informers that let
the framework keep its copy of the state of the cluster up-to-date. The
informers generate events, such as "PodAdd", "PodUpdate", "PodDelete", etc. The
framework allows plugins to register their own handlers for any of these events.
The handlers allow plugins with internal state or caches to keep their state
updated.
# USE-CASES
In this section we provide a couple of examples on how the scheduling framework
can be used to solve common scheduling scenarios.
### Dynamic binding of cluster-level resources
Cluster level resources are resources which are not immediately available on
nodes at the time of scheduling Pods. Scheduler needs to ensure that such
cluster level resources are bound to a chosen Node before it can schedule a Pod
that requires such resources to the Node. We refer to this type of binding of
resources to Nodes at the time of scheduling Pods as dynamic resource binding.
Dynamic resource binding has proven to be a challenge in Scheduler v1, because
Scheduler v1 is not flexible enough to support various types of plugins at
different phases of scheduling. As a result, binding of storage volumes is
integrated in the scheduler code and some non-trivial changes are done to the
scheduler extender to support dynamic binding of network GPUs.
The scheduling framework allows such dynamic bindings in a cleaner way. The main
thread of the scheduling framework processes a pending Pod that requests a network
resource, finds a node for the Pod, and reserves the Pod. A dynamic resource
binder plugin installed at the "Pre-Bind" stage is invoked (in a separate thread).
It analyzes the Pod and, when it detects that the Pod needs dynamic binding of the
resource, the plugin tries to attach the cluster resource to the chosen node and
then returns true so that the Pod can be bound. If the resource attachment
fails, it returns false and the Pod will be retried.
When there are multiple such network resources, each one of them installs its own
"pre-bind" plugin. Each plugin looks at the Pod and, if the Pod is not requesting
the resource it is interested in, it simply returns "true" for the
pod.
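As a sketch, such a pre-bind plugin could look like the following; the plugin type, its fields, and the way pod requests are represented are invented for illustration.

```go
package framework

// networkResourceBinder is a hypothetical pre-bind plugin for one kind of
// cluster-level resource. attach is whatever call actually binds the
// resource to a node; its signature is invented for this sketch.
type networkResourceBinder struct {
	resourceName string
	attach       func(resourceName, nodeName string) error
}

// PreBind returns true when binding may proceed. Pods that do not
// request this plugin's resource are let through immediately.
func (b *networkResourceBinder) PreBind(podRequests map[string]bool, nodeName string) bool {
	if !podRequests[b.resourceName] {
		return true // not our resource; nothing to do
	}
	if err := b.attach(b.resourceName, nodeName); err != nil {
		return false // attachment failed; the pod will be retried
	}
	return true
}
```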
### Gang Scheduling
Gang scheduling allows a certain number of Pods to be scheduled simultaneously.
If all the members of the gang cannot be scheduled at the same time, none of
them should be scheduled. Gang scheduling may have various other features as
well, but in this context we are interested in simultaneous scheduling of Pods.
Gang scheduling in the scheduling framework can be done with a "Permit" plugin.
The main scheduling thread processes pods one by one and reserves nodes for
them. The gang scheduling plugin at the Permit stage is invoked for each pod.
When it finds that the pod belongs to a gang, it checks the properties of the
gang. If there are not enough members of the gang which are scheduled or in
"wait" state, the plugin returns "wait". When the number reaches the desired
value, all the Pods in wait state are approved and sent for binding.
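A sketch of the counting logic such a Permit plugin could use, reusing the illustrative permitResult type from the Permit section above; gang membership lookup and the waiting-pod registry are assumed to exist elsewhere.

```go
package framework

import "time"

// gangPermit decides, for one pod of a gang, whether to let it proceed.
// scheduledOrWaiting is the number of gang members already bound or held
// in "wait" state (including this one); desired is the gang size.
func gangPermit(scheduledOrWaiting, desired int, waitTimeout time.Duration) permitResult {
	if scheduledOrWaiting < desired {
		// Not enough members yet: hold this pod in "wait" state.
		return permitResult{decision: permitWait, timeout: waitTimeout}
	}
	// The gang is complete: this pod (and the waiting members, approved
	// separately) can move on to binding.
	return permitResult{decision: permitAllow}
}
```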
# OUT OF PROCESS PLUGINS
Out of process plugins (OOPP) are called via JSON over an HTTP interface. In
other words, the scheduler will support webhooks at most (maybe all) of the
extension points. Data sent to an OOPP must be marshalled to JSON and data
received must be unmarshalled. So, calling an OOPP is significantly slower than
calling an in-process plugin.
We do not plan to build OOPPs in the first version of the scheduling framework.
So, more details on them are to be determined.
# DEVELOPMENT PLAN
Earlier, we wanted to develop the scheduling framework as an independent project
from scheduler V1. However, that would require significant engineering resources.
It would also be more difficult to roll out a new and not fully-backward
compatible scheduler in Kubernetes where tens of thousands of users depend on
the behavior of the scheduler.
After revisiting the ideas and challenges, we changed our plan and have decided
to build some of the ideas of the scheduling framework into Scheduler V1 to make
it more extendable.
As the first step, we would like to build:
1. [Pre-bind](#pre-bind) and [Reserve](#reserve) plugin points. These will
help us move our existing cluster resource binding code, such as persistent
volume binding, to plugins.
1. We will also build
[the plugin communication mechanism](#communication-and-statefulness-of-plugins).
This will allow us to build more sophisticated plugins that would require
communication and also help us clean up existing scheduler's code by removing
existing transient cache data.
More features of the framework can be added to the Scheduler in the future based
on the requirements.
<s>
# CONFIGURING THE SCHEDULING FRAMEWORK
TBD
# BACKWARD COMPATIBILITY WITH SCHEDULER v1
We will build a new set of plugins for scheduler v2 to ensure that the existing
behavior of scheduler v1 in placing Pods on nodes is preserved. This includes
building plugins that replicate default predicate and priority functions of
scheduler v1 and its binding mechanism, but scheduler extenders built for
scheduler v1 won't be compatible with scheduler v2. Also, predicate and priority
functions which are not enabled by default (such as service affinity) are not
guaranteed to exist in scheduler v2.
# DEVELOPMENT PLAN
We will develop the scheduling framework as an incubator project in SIG
scheduling. It will be built in a separate code-base independently from
scheduler v1, but we will probably use a lot of code from scheduler v1.
# TESTING PLAN
We will add unit-tests as we build functionalities of the scheduling framework.
The scheduling framework should eventually be able to pass integration and e2e
tests of scheduler v1, excluding those tests that involve scheduler extensions.
The e2e and integration tests may need to be modified slightly as the
initialization and configuration of the scheduling framework will be different
than scheduler v1.
# WORK ESTIMATES
We expect to see an early version of the scheduling framework in two release
cycles (end of 2018). If things go well, we will start offering it as an
alternative to the scheduler v1 by the end of Q1 2019 and start the deprecation
of scheduler v1. We will make it the default scheduler of Kubernetes in Q2 2019,
but we will keep the option of using scheduler v1 for at least two more release
cycles.
</s>

View File

@ -19,8 +19,8 @@ In addition to this, with taint-based-eviction, the Node Controller already tain
| ------------------ | ------------------ | ------------ | -------- |
|Ready |True | - | |
| |False | NoExecute | node.kubernetes.io/not-ready |
| |Unknown | NoExecute | node.kubernetes.io/unreachable |
|OutOfDisk |True | NoSchedule | node.kubernetes.io/out-of-disk |
| |Unknown | NoExecute | node.kubernetes.io/unreachable |
|OutOfDisk |True | NoSchedule | node.kubernetes.io/out-of-disk |
| |False | - | |
| |Unknown | - | |
|MemoryPressure |True | NoSchedule | node.kubernetes.io/memory-pressure |
@ -32,6 +32,9 @@ In addition to this, with taint-based-eviction, the Node Controller already tain
|NetworkUnavailable |True | NoSchedule | node.kubernetes.io/network-unavailable |
| |False | - | |
| |Unknown | - | |
|PIDPressure |True | NoSchedule | node.kubernetes.io/pid-pressure |
| |False | - | |
| |Unknown | - | |
For example, if a CNI network is not detected on the node (e.g. a network is unavailable), the Node Controller will taint the node with `node.kubernetes.io/network-unavailable=:NoSchedule`. This will then allow users to add a toleration to their `PodSpec`, ensuring that the pod can be scheduled to this node if necessary. If the kubelet does not update the node's status after a grace period, the Node Controller will only taint the node with `node.kubernetes.io/unreachable`; it will not taint the node with any unknown condition.

View File

@ -87,7 +87,7 @@ allowed to use that new dedicated node group.
```go
// The node this Taint is attached to has the effect "effect" on
// any pod that that does not tolerate the Taint.
// any pod that does not tolerate the Taint.
type Taint struct {
Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
Value string `json:"value,omitempty"`

View File

@ -0,0 +1,281 @@
---
title: Attacher/Detacher refactor for local storage
authors:
- "@NickrenREN"
owning-sig: sig-storage
participating-sigs:
- nil
reviewers:
- "@msau42"
- "@jsafrane"
approvers:
- "@jsafrane"
- "@msau42"
- "@saad-ali"
editor: TBD
creation-date: 2018-07-30
last-updated: 2018-07-30
status: provisional
---
## Table of Contents
* [Table of Contents](#table-of-contents)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [Implementation](#implementation)
* [Volume plugin interface change](#volume-plugin-interface-change)
* [MountVolume/UnmountDevice generation function change](#MountVolume/UnmountDevice-generation-function-change)
* [Volume plugin change](#volume-plugin-change)
* [Future](#future)
## Summary
Today, the workflow for a volume to be used by a pod is:
- attach a remote volume to the node instance (if it is attachable)
- wait for the volume to be attached (if it is attachable)
- mount the device to a global path (if it is attachable)
- mount the global path to a pod directory
This works well for remote block storage plugins which have a remote attach API, such as `GCE PD` and `AWS EBS`,
and for remote fs storage plugins such as `NFS` and `Cephfs`.
But it is not a good fit for plugins which need a local attach, such as `fc`, `iscsi` and `RBD`.
It is not a good fit for local storage either, which is not attachable but needs `MountDevice`.
## Motivation
### Goals
Update Attacher/Detacher interfaces for local storage
### Non-Goals
Update `fc`, `iscsi` and `RBD` implementation according to the new interfaces
## Proposal
Here we propose to update the Attacher/Detacher interfaces only for local storage.
We may expand this in the future to `iscsi`, `RBD` and `fc`, if we figure out how to prevent multiple local attaches without implementing the Attacher interface.
## Implementation
### Volume plugin interface change
We can create a new interface, `DeviceMounter`, and move `GetDeviceMountPath` and `MountDevice` from `Attacher` to it.
We can embed `DeviceMounter` in `Attacher`, which means anyone who implements the `Attacher` interface must also implement `DeviceMounter`.
```
// Attacher can attach a volume to a node.
type Attacher interface {
DeviceMounter
// Attaches the volume specified by the given spec to the node with the given Name.
// On success, returns the device path where the device was attached on the
// node.
Attach(spec *Spec, nodeName types.NodeName) (string, error)
// VolumesAreAttached checks whether the list of volumes still attached to the specified
// node. It returns a map which maps from the volume spec to the checking result.
// If an error is occurred during checking, the error will be returned
VolumesAreAttached(specs []*Spec, nodeName types.NodeName) (map[*Spec]bool, error)
// WaitForAttach blocks until the device is attached to this
// node. If it successfully attaches, the path to the device
// is returned. Otherwise, if the device does not attach after
// the given timeout period, an error will be returned.
WaitForAttach(spec *Spec, devicePath string, pod *v1.Pod, timeout time.Duration) (string, error)
}
// DeviceMounter can mount a block volume to a global path.
type DeviceMounter interface {
// GetDeviceMountPath returns a path where the device should
// be mounted after it is attached. This is a global mount
// point which should be bind mounted for individual volumes.
GetDeviceMountPath(spec *Spec) (string, error)
// MountDevice mounts the disk to a global path which
// individual pods can then bind mount
// Note that devicePath can be empty if the volume plugin does not implement any of Attach and WaitForAttach methods.
MountDevice(spec *Spec, devicePath string, deviceMountPath string) error
}
```
Note: we also need to make sure that if a plugin implements the `DeviceMounter` interface,
then executing mount operations from multiple pods referencing the same volume in parallel is avoided,
even if the plugin does not implement the `Attacher` interface.
Since `NestedPendingOperations` can achieve this by using the same volumeName and the same (or an empty) podName in one operation,
we just need to add another check in `MountVolume`: check whether the volume is DeviceMountable.
We also need to create another new interface, `DeviceUnmounter`, and move `UnmountDevice` to it.
```
// Detacher can detach a volume from a node.
type Detacher interface {
DeviceUnmounter
// Detach the given volume from the node with the given Name.
// volumeName is name of the volume as returned from plugin's
// GetVolumeName().
Detach(volumeName string, nodeName types.NodeName) error
}
// DeviceUnmounter can unmount a block volume from the global path.
type DeviceUnmounter interface {
// UnmountDevice unmounts the global mount of the disk. This
// should only be called once all bind mounts have been
// unmounted.
UnmountDevice(deviceMountPath string) error
}
```
Accordingly, we need to create a new interface `DeviceMountableVolumePlugin` and move `GetDeviceMountRefs` to it.
```
// AttachableVolumePlugin is an extended interface of VolumePlugin and is used for volumes that require attachment
// to a node before mounting.
type AttachableVolumePlugin interface {
DeviceMountableVolumePlugin
NewAttacher() (Attacher, error)
NewDetacher() (Detacher, error)
}
// DeviceMountableVolumePlugin is an extended interface of VolumePlugin and is used
// for volumes that requires mount device to a node before binding to volume to pod.
type DeviceMountableVolumePlugin interface {
VolumePlugin
NewDeviceMounter() (DeviceMounter, error)
NewDeviceUnmounter() (DeviceUnmounter, error)
GetDeviceMountRefs(deviceMountPath string) ([]string, error)
}
```
### MountVolume/UnmountDevice generation function change
Currently, `GenerateMountVolumeFunc` checks whether the volume plugin is attachable; if it is, we need to call `WaitForAttach`, `GetDeviceMountPath` and `MountDevice` first, and then set up the volume.
After the refactor, we can split that into three sections: check whether the volume is attachable, check whether it is device mountable, and set up the volume.
```
devicePath := volumeToMount.DevicePath
if volumeAttacher != nil {
devicePath, err = volumeAttacher.WaitForAttach(
volumeToMount.VolumeSpec, devicePath, volumeToMount.Pod, waitForAttachTimeout)
if err != nil {
// On failure, return error. Caller will log and retry.
return volumeToMount.GenerateError("MountVolume.WaitForAttach failed", err)
}
// Write the attached device path back to volumeToMount, which can be used for MountDevice.
volumeToMount.DevicePath = devicePath
}
if volumeDeviceMounter != nil {
deviceMountPath, err :=
volumeDeviceMounter.GetDeviceMountPath(volumeToMount.VolumeSpec)
if err != nil {
// On failure, return error. Caller will log and retry.
return volumeToMount.GenerateError("MountVolume.GetDeviceMountPath failed", err)
}
err = volumeDeviceMounter.MountDevice(volumeToMount.VolumeSpec, devicePath, deviceMountPath)
if err != nil {
// On failure, return error. Caller will log and retry.
return volumeToMount.GenerateError("MountVolume.MountDevice failed", err)
}
glog.Infof(volumeToMount.GenerateMsgDetailed("MountVolume.MountDevice succeeded", fmt.Sprintf("device mount path %q", deviceMountPath)))
// Update actual state of world to reflect volume is globally mounted
markDeviceMountedErr := actualStateOfWorld.MarkDeviceAsMounted(
volumeToMount.VolumeName)
if markDeviceMountedErr != nil {
// On failure, return error. Caller will log and retry.
return volumeToMount.GenerateError("MountVolume.MarkDeviceAsMounted failed", markDeviceMountedErr)
}
}
```
Note that since the local storage plugin will not implement the Attacher interface, we can get the device path directly from `spec.PersistentVolume.Spec.Local.Path` when we run `MountDevice`.
The device unmounting operation will be executed in `GenerateUnmountDeviceFunc`; we can update the device unmounting generation function as below:
```
// Get DeviceMounter plugin
deviceMountableVolumePlugin, err :=
og.volumePluginMgr.FindDeviceMountablePluginByName(deviceToDetach.PluginName)
if err != nil || deviceMountableVolumePlugin == nil {
return volumetypes.GeneratedOperations{}, deviceToDetach.GenerateErrorDetailed("UnmountDevice.FindDeviceMountablePluginByName failed", err)
}
volumeDeviceUnmounter, err := deviceMountableVolumePlugin.NewDeviceUnmounter()
if err != nil {
return volumetypes.GeneratedOperations{}, deviceToDetach.GenerateErrorDetailed("UnmountDevice.NewDeviceUnmounter failed", err)
}
volumeDeviceMounter, err := deviceMountableVolumePlugin.NewDeviceMounter()
if err != nil {
return volumetypes.GeneratedOperations{}, deviceToDetach.GenerateErrorDetailed("UnmountDevice.NewDeviceMounter failed", err)
}
unmountDeviceFunc := func() (error, error) {
deviceMountPath, err :=
volumeDeviceMounter.GetDeviceMountPath(deviceToDetach.VolumeSpec)
if err != nil {
// On failure, return error. Caller will log and retry.
return deviceToDetach.GenerateError("GetDeviceMountPath failed", err)
}
refs, err := deviceMountableVolumePlugin.GetDeviceMountRefs(deviceMountPath)
if err != nil || mount.HasMountRefs(deviceMountPath, refs) {
if err == nil {
err = fmt.Errorf("The device mount path %q is still mounted by other references %v", deviceMountPath, refs)
}
return deviceToDetach.GenerateError("GetDeviceMountRefs check failed", err)
}
// Execute unmount
unmountDeviceErr := volumeDeviceUnmounter.UnmountDevice(deviceMountPath)
if unmountDeviceErr != nil {
// On failure, return error. Caller will log and retry.
return deviceToDetach.GenerateError("UnmountDevice failed", unmountDeviceErr)
}
// Before logging that UnmountDevice succeeded and moving on,
// use mounter.PathIsDevice to check if the path is a device,
// if so use mounter.DeviceOpened to check if the device is in use anywhere
// else on the system. Retry if it returns true.
deviceOpened, deviceOpenedErr := isDeviceOpened(deviceToDetach, mounter)
if deviceOpenedErr != nil {
return nil, deviceOpenedErr
}
// The device is still in use elsewhere. Caller will log and retry.
if deviceOpened {
return deviceToDetach.GenerateError(
"UnmountDevice failed",
fmt.Errorf("the device is in use when it was no longer expected to be in use"))
}
...
return nil, nil
}
```
### Volume plugin change
We only need to implement the DeviceMounter/DeviceUnmounter interfaces for local storage since it is not attachable.
And we can keep `fc`, `iscsi` and `RBD` unchanged at the first stage.
## Future
Update the `iscsi`, `RBD` and `fc` volume plugins accordingly, if we figure out how to prevent multiple local attaches without implementing the Attacher interface.

View File

@ -0,0 +1,80 @@
# Skip attach for non-attachable CSI volumes
Author: @jsafrane
## Goal
* Non-attachable CSI volumes should not require external attacher and `VolumeAttachment` instance creation. This will speed up pod startup.
## Motivation
Currently, CSI requires admin to start external CSI attacher for **all** CSI drivers, including those that don't implement attach/detach operation (such as NFS or all ephemeral Secrets-like volumes). Kubernetes Attach/Detach controller always creates `VolumeAttachment` objects for them and always waits until they're reported as "attached" by external CSI attacher.
We want to skip creation of `VolumeAttachment` objects in A/D controller for CSI volumes that don't require 3rd party attach/detach.
## Dependencies
In order to skip both the A/D controller attaching a volume and the kubelet waiting for the attachment, both of them need to know whether a particular CSI driver is attachable or not. In this document we expect that proposal #2514 is implemented and both the A/D controller and the kubelet have an informer on `CSIDriver`, so they can check easily whether a volume is attachable.
## Design
### CSI volume plugin
* Rework [`Init`](https://github.com/kubernetes/kubernetes/blob/43f805b7bdda7a5b491d34611f85c249a63d7f97/pkg/volume/csi/csi_plugin.go#L58) to get or create informer to cache CSIDriver instances.
* Depending on where the API for CSIDriver ends up, we may:
* Rework VolumeHost to provide the informer. This leaks CSI implementation details to the A/D controller and kubelet.
* Or the CSI volume plugin can create and run the CSIDriver informer by itself. No other component in controller-manager or kubelet needs the informer right now, so a non-shared informer is a viable option. Depending on where the API for CSIDriver ends up, `VolumeHost` may need to be extended to provide a client interface to the API, and the kubelet and A/D controller may need to be updated to create the interface (somewhere in `cmd/`, where RESTConfig is still available to create new clients) and pass it to their `VolumeHost` implementations.
* Rework `Attach`, `Detach`, `VolumesAreAttached` and `WaitForAttach` to check for the `CSIDriver` instance using the informer (a sketch of this check follows the list below).
* If CSIDriver for the driver exists and it's attachable, perform usual logic.
* If CSIDriver for the driver exists and it's not attachable, return success immediately (basically NOOP). A/D controller will still mark the volume as attached in `Node.Status.VolumesAttached`.
* If CSIDriver for the driver does not exist, perform usual logic (i.e. treat the volume as attachable).
* This keeps the behavior the same as in old Kubernetes version without CSIDriver object.
* This also happens when CSIDriver informer has not been quick enough. It is suggested that CSIDriver instance is created **before** any pod that uses corresponding CSI driver can run.
* In case that CSIDriver informer (or user) is too slow, CSI volume plugin `Attach()` will create `VolumeAttachment` instance and wait for (non-existing) external attacher to fulfill it. The CSI plugin shall recover when `CSIDriver` instance is created and skip attach. Any `VolumeAttachment` instance created here will be deleted on `Detach()`, see the next bullet.
* In addition to the above, `Detach()` removes `VolumeAttachment` instance even if the volume is not attachable. This deletes `VolumeAttachment` instances created by old A/D controller or before `CSIDriver` instance was created.
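Purely as an illustration of the per-driver decision in the list above, a sketch of the check could look like this; the lister callback, the stand-in type, and its field name are assumptions, not the final API.

```go
package csi

// csiDriverInfo is a stand-in for the CSIDriver API object from proposal
// #2514; only the field relevant to this check is shown.
type csiDriverInfo struct {
	attachRequired bool
}

// attachRequired sketches the check the CSI volume plugin would perform:
// if a CSIDriver object exists, honor its flag; if it does not exist (or
// the informer has not synced yet), fall back to the old behavior and
// treat the volume as attachable.
func attachRequired(driverName string, get func(name string) (*csiDriverInfo, error)) bool {
	d, err := get(driverName)
	if err != nil || d == nil {
		return true // no CSIDriver object: assume attach is required
	}
	return d.attachRequired
}
```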
### Authorization
* A/D controller and kubelet must be allowed to list+watch CSIDriver instances. Updating RBAC rules should be enough.
## API
No API changes.
## Upgrade
This chapter covers:
* Upgrade from old Kubernetes that has `CSISkipAttach` disabled to new Kubernetes with `CSISkipAttach` enabled.
* Update from Kubernetes that has `CSISkipAttach` disabled to the same Kubernetes with `CSISkipAttach` enabled.
* Creation of CSIDriver instance with non-attachable CSI driver.
In all cases listed above, an "attachable" CSI driver becomes non-attachable. Upgrade does not affect attachable CSI drivers, both "old" and "new" Kubernetes processes them in the same way.
For non-attachable volumes, if the volume was attached by "old" Kubernetes (or "new" Kubernetes before CSIDriver instance was created), it has `VolumeAttachment` instance. It will be deleted by `Detach()`, as it deletes `VolumeAttachment` instance also for non-attachable volumes.
## Downgrade
This chapter covers:
* Downgrade from new Kubernetes that has `CSISkipAttach` enabled to old Kubernetes with `CSISkipAttach` disabled.
* Update from Kubernetes that has `CSISkipAttach` feature enabled to the same Kubernetes with `CSISkipAttach` disabled.
* Deletion of CSIDriver instance with non-attachable CSI driver.
In all cases listed above, a non-attachable CSI driver becomes "attachable" (i.e. requires external attacher). Downgrade does not affect attachable CSI drivers, both "old" and "new" Kubernetes processes them in the same way.
For non-attachable volumes, if the volume was mounted by "new" Kubernetes, it has no VolumeAttachment instance. "Old" A/D controller does not know about it. However, it will periodically call plugin's `VolumesAreAttached()` that checks for `VolumeAttachment` presence. Volumes without `VolumeAttachment` will be reported as not attached and A/D controller will call `Attach()` on these. Since "old" Kubernetes required an external attacher even for non-attachable CSI drivers, the external attacher will pick the `VolumeAttachment` instances and fulfil them in the usual way.
## Performance considerations
* Flow suggested in this proposal adds new `CSIDriver` informer both to A/D controller and kubelet. We don't expect any high amount of instances of `CSIDriver` nor any high frequency of updates. `CSIDriver` should have negligible impact on performance.
* A/D controller will not create `VolumeAttachment` instances for non-attachable volumes. Etcd load will be reduced.
* On the other hand, all CSI volumes still must go through the A/D controller. The A/D controller **must** process every CSI volume and the kubelet **must** wait until the A/D controller marks a volume as attached, even if the A/D controller basically does nothing. All CSI volumes must be added to `Node.Status.VolumesInUse` and `Node.Status.VolumesAttached`. This does not introduce any new API calls; all this is already implemented. However, this proposal won't reduce the `Node.Status` update frequency in any way.
* If *all* volumes move to CSI eventually, pod startup will be slower than when using in-tree volume plugins that don't go through A/D controller and `Node.Status` will grow in size.
## Implementation
Expected timeline:
* Alpha: 1.12 (behind feature gate `CSISkipAttach`)
* Beta: 1.13 (enabled by default)
* GA: 1.14
## Alternatives considered
A/D controller and kubelet can be easily extended to check if a given volume is attachable. This would make mounting of non-attachable volumes easier, as kubelet would not need to wait for A/D controller to mark the volume as attached. However, there would be issues when upgrading or downgrading Kubernetes (or marking CSIDriver as attachable or non-attachable, which has basically the same handling).
* On upgrade (i.e. a previously attachable CSI volume becomes non-attachable, e.g. when user creates CSIDriver instance while corresponding CSI driver is already running), A/D controller could discover that an attached volume is not attachable any longer. A/D controller could clean up `Node.Status.VolumesAttached`, but since A/D controller does not know anything about `VolumeAttachment`, we would either need to introduce a new volume plugin call to clean it up in CSI volume plugin, or something else would need to clean it.
* On downgrade (i.e. a previously non-attachable CSI volume becomes attachable, e.g. when user deletes CSIDriver instance or downgrades to old Kubernetes without this feature), kubelet must discover that already mounted volume has changed from non-attachable to attachable and put it into `Node.Status.VolumesInUse`. This would race with A/D controller detaching the volume when a pod was deleted at the same time a CSIDriver instance was made attachable.
Passing all volumes through A/D controller saves us from these difficulties and even races.

View File

@ -89,17 +89,60 @@ CSI volume drivers should create a socket at the following path on the node mach
`Sanitized CSIDriverName` is a CSI driver name that does not contain dangerous characters and can be used as an annotation name. It can follow the same pattern that we use for [volume plugins](https://git.k8s.io/kubernetes/pkg/util/strings/escape.go#L27). Too long or too ugly driver names can be rejected, i.e. all components described in this document will report an error and won't talk to this CSI driver. The exact sanitization method is an implementation detail (SHA in the worst case).
Upon initialization of the external “CSI volume driver”, some external component must call the CSI method `GetNodeId` to get the mapping from Kubernetes Node names to CSI driver NodeID. It must then add the CSI driver NodeID to the `csi.volume.kubernetes.io/nodeid` annotation on the Kubernetes Node API object. The key of the annotation must be `csi.volume.kubernetes.io/nodeid`. The value of the annotation is a JSON blob, containing key/value pairs for each CSI driver.
Upon initialization of the external “CSI volume driver”, kubelet must call the CSI method `NodeGetInfo` to get the mapping from Kubernetes Node names to CSI driver NodeID and the associated `accessible_topology`. It must:
For example:
```
csi.volume.kubernetes.io/nodeid: "{ \"driver1\": \"name1\", \"driver2\": \"name2\" }
```
* Create/update a `CSINodeInfo` object instance for the node with the NodeID and topology keys from `accessible_topology`.
* This will enable the component that will issue `ControllerPublishVolume` calls to use the `CSINodeInfo` as a mapping from cluster node ID to storage node ID.
* This will enable the component that will issue `CreateVolume` to reconstruct `accessible_topology` and provision a volume that is accessible from a specific node.
* Each driver must completely overwrite its previous version of NodeID and topology keys, if they exist.
* If the `NodeGetInfo` call fails, kubelet must delete any previous NodeID and topology keys for this driver.
* When kubelet plugin unregistration mechanism is implemented, delete NodeID and topology keys when a driver is unregistered.
This will enable the component that will issue `ControllerPublishVolume` calls to use the annotation as a mapping from cluster node ID to storage node ID.
* Update Node API object with the CSI driver NodeID as the `csi.volume.kubernetes.io/nodeid` annotation. The value of the annotation is a JSON blob, containing key/value pairs for each CSI driver. For example:
```
csi.volume.kubernetes.io/nodeid: "{ \"driver1\": \"name1\", \"driver2\": \"name2\" }
```
*This annotation is deprecated and will be removed according to deprecation policy (1 year after deprecation). TODO mark deprecation date.*
* If the `NodeGetInfo` call fails, kubelet must delete any previous NodeID for this driver.
* When kubelet plugin unregistration mechanism is implemented, delete NodeID and topology keys when a driver is unregistered.
* Create/update Node API object with `accessible_topology` as labels.
There are no hard restrictions on the label format, but for the format to be used by the recommended setup, please refer to [Topology Representation in Node Objects](#topology-representation-in-node-objects).
To enable easy deployment of an external containerized CSI volume driver, the Kubernetes team will provide a sidecar "Kubernetes CSI Helper" container that can manage the unix domain socket registration and NodeId initialization. This is detailed in the “Suggested Mechanism for Deploying CSI Drivers on Kubernetes” section below.
The new API object called `CSINodeInfo` will be defined as follows:
```go
// CSINodeInfo holds information about status of all CSI drivers installed on a node.
type CSINodeInfo struct {
	metav1.TypeMeta

	// ObjectMeta.Name must be node name.
	metav1.ObjectMeta

	// List of CSI drivers running on the node and their properties.
	CSIDrivers []CSIDriverInfo
}

// Information about one CSI driver installed on a node.
type CSIDriverInfo struct {
	// CSI driver name.
	Name string

	// ID of the node from the driver point of view.
	NodeID string

	// Topology keys reported by the driver on the node.
	TopologyKeys []string
}
```
A new object type `CSINodeInfo` is chosen instead of a `Node.Status` field because the Node object is already big enough and there are issues with its size. `CSINodeInfo` is a CRD installed by TODO (jsafrane) on cluster startup and defined in `kubernetes/kubernetes/pkg/apis/storage-csi/v1alpha1/types.go`, so k8s.io/client-go and k8s.io/api are generated automatically. All users of `CSINodeInfo` will tolerate the CRD not being installed and will retry anything they need to do with it with exponential backoff and proper error reporting. In particular, kubelet is able to serve its usual duties when the CRD is missing.
Each node must have zero or one `CSINodeInfo` instance. This is ensured by `CSINodeInfo.Name == Node.Name`. TODO: how to validate this? Each `CSINodeInfo` is "owned" by the corresponding Node for garbage collection.
#### Master to CSI Driver Communication
Because CSI volume driver code is considered untrusted, it might not be allowed to run on the master. Therefore, the Kube controller manager (responsible for create, delete, attach, and detach) cannot communicate via a Unix Domain Socket with the “CSI volume driver” container. Instead, the Kube controller manager will communicate with the external “CSI volume driver” through the Kubernetes API.
@ -116,7 +159,27 @@ Provisioning and deletion operations are handled using the existing [external pr
In short, to dynamically provision a new CSI volume, a cluster admin would create a `StorageClass` with the provisioner corresponding to the name of the external provisioner handling provisioning requests on behalf of the CSI volume driver.
To provision a new CSI volume, an end user would create a `PersistentVolumeClaim` object referencing this `StorageClass`. The external provisioner will react to the creation of the PVC and issue the `CreateVolume` call against the CSI volume driver to provision the volume. The `CreateVolume` name will be auto-generated as it is for other dynamically provisioned volumes. The `CreateVolume` capacity will be taken from the `PersistentVolumeClaim` object. The `CreateVolume` parameters will be passed through from the `StorageClass` parameters (opaque to Kubernetes).
If the `PersistentVolumeClaim` has the `volume.alpha.kubernetes.io/selected-node` annotation set (only added if delayed volume binding is enabled in the `StorageClass`), the provisioner will get relevant topology keys from the corresponding `CSINodeInfo` instance and the topology values from `Node` labels and use them to generate preferred topology in the `CreateVolume()` request. If the annotation is unset, preferred topology will not be specified (unless the PVC follows StatefulSet naming format, discussed later in this section). `AllowedTopologies` from the `StorageClass` is passed through as requisite topology. If `AllowedTopologies` is unspecified, the provisioner will pass in a set of aggregated topology values across the whole cluster as requisite topology.
To perform this topology aggregation, the external provisioner will cache all existing Node objects. In order to prevent a compromised node from affecting the provisioning process, it will pick a single node as the source of truth for keys, instead of relying on keys stored in `CSINodeInfo` for each node object. For PVCs to be provisioned with late binding, the selected node is the source of truth; otherwise a random node is picked. The provisioner will then iterate through all cached nodes that contain a node ID from the driver, aggregating labels using those keys. Note that if topology keys are different across the cluster, only a subset of nodes matching the topology keys of the chosen node will be considered for provisioning.
To generate preferred topology, the external provisioner will generate N segments for preferred topology in the `CreateVolume()` call, where N is the size of requisite topology. Multiple segments are included to support volumes that are available across multiple topological segments. The topology segment from the selected node will always be the first in preferred topology. All other segments are some reordering of remaining requisite topologies such that given a requisite topology (or any arbitrary reordering of it) and a selected node, the set of preferred topology is guaranteed to always be the same.
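The deterministic reordering described above could look like the following sketch. It assumes topology segments are represented as flat key/value maps and uses a simple canonical sort; the real external provisioner may use a different (but equally deterministic) scheme:

```go
package main

import (
	"fmt"
	"sort"
)

// topology is one topology segment, e.g. {"zone": "a", "rack": "1"}.
type topology map[string]string

// canonicalKey flattens a segment into a sortable string so that ordering is
// independent of the order in which segments were supplied.
func canonicalKey(t topology) string {
	keys := make([]string, 0, len(t))
	for k := range t {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	s := ""
	for _, k := range keys {
		s += k + "=" + t[k] + ";"
	}
	return s
}

// preferredTopology puts the selected node's segment first and orders the
// remaining requisite segments canonically, so the result is the same for any
// permutation of the requisite input. This is one possible implementation.
func preferredTopology(selected topology, requisite []topology) []topology {
	rest := make([]topology, 0, len(requisite))
	for _, t := range requisite {
		if canonicalKey(t) != canonicalKey(selected) {
			rest = append(rest, t)
		}
	}
	sort.Slice(rest, func(i, j int) bool {
		return canonicalKey(rest[i]) < canonicalKey(rest[j])
	})
	return append([]topology{selected}, rest...)
}

func main() {
	requisite := []topology{
		{"zone": "b", "rack": "2"},
		{"zone": "a", "rack": "1"},
		{"zone": "b", "rack": "1"},
	}
	selected := topology{"zone": "b", "rack": "1"}
	fmt.Println(preferredTopology(selected, requisite))
}
```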
If immediate volume binding mode is set and the PVC follows StatefulSet naming format, then the provisioner will choose, as the first segment in preferred topology, a segment from requisite topology based on the PVC name that ensures an even spread of topology across the StatefulSet's volumes. The logic will be similar to the name hashing logic inside the GCE Persistent Disk provisioner. Other segments in preferred topology are ordered the same way as described above. This feature will be flag-gated in the external provisioner provided as part of the recommended deployment method.
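As a rough illustration of the spreading idea (the actual logic follows the GCE Persistent Disk provisioner's name hashing, which differs in detail), a hash of the PVC name can be mapped onto the requisite segments:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// preferredIndexForPVC is a hypothetical spreading function: PVCs created by a
// StatefulSet share a prefix and differ by ordinal ("data-web-0",
// "data-web-1", ...), so hashing the name spreads the first preferred topology
// segment across the requisite segments.
func preferredIndexForPVC(pvcName string, requisiteCount int) int {
	h := fnv.New32a()
	h.Write([]byte(pvcName))
	return int(h.Sum32() % uint32(requisiteCount))
}

func main() {
	for _, name := range []string{"data-web-0", "data-web-1", "data-web-2"} {
		fmt.Println(name, "->", preferredIndexForPVC(name, 3))
	}
}
```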
Once the operation completes successfully, the external provisioner creates a `PersistentVolume` object to represent the volume using the information returned in the `CreateVolume` response. The topology of the returned volume is translated to the `PersistentVolume` `NodeAffinity` field. The `PersistentVolume` object is then bound to the `PersistentVolumeClaim` and available for use.
The format of topology key/value pairs is defined by the user and must match among the following locations:
* `Node` topology labels
* `PersistentVolume` `NodeAffinity` field
* `StorageClass` `AllowedTopologies` field
When a `StorageClass` has delayed volume binding enabled, the scheduler uses the topology information of a `Node` in the following ways:
1. During dynamic provisioning, the scheduler selects a candidate node for the provisioner by comparing each `Node`'s topology with the `AllowedTopologies` in the `StorageClass`.
1. During volume binding and pod scheduling, the scheduler selects a candidate node for the pod by comparing `Node` topology with `VolumeNodeAffinity` in `PersistentVolume`s.
A more detailed description can be found in the [topology-aware volume scheduling design doc](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/volume-topology-scheduling.md). See [Topology Representation in Node Objects](#topology-representation-in-node-objects) for the format used by the recommended deployment approach.
To delete a CSI volume, an end user would delete the corresponding `PersistentVolumeClaim` object. The external provisioner will react to the deletion of the PVC and, based on its reclamation policy, issue the `DeleteVolume` call against the CSI volume driver to delete the volume. It will then delete the `PersistentVolume` object.
@ -131,13 +194,14 @@ Once the following conditions are true, the external-attacher should call `Contr
1. A new `VolumeAttachment` Kubernetes API object is created by the Kubernetes attach/detach controller.
2. The `VolumeAttachment.Spec.Attacher` value in that object corresponds to the name of the external attacher.
3. The `VolumeAttachment.Status.Attached` value is not yet set to true.
4. Either:
    * A Kubernetes Node API object exists with the name matching `VolumeAttachment.Spec.NodeName` and that object contains a `csi.volume.kubernetes.io/nodeid` annotation. This annotation contains a JSON blob, a list of key/value pairs, where one of the keys corresponds to the CSI volume driver name and the value is the NodeID for that driver. This NodeID mapping can be retrieved and used in the `ControllerPublishVolume` calls.
    * Or a `CSINodeInfo` API object exists with the name matching `VolumeAttachment.Spec.NodeName` and the object contains `CSIDriverInfo` for the CSI volume driver. The `CSIDriverInfo` contains the NodeID for the `ControllerPublishVolume` call.
5. The `VolumeAttachment.Metadata.DeletionTimestamp` is not set.
Before starting the `ControllerPublishVolume` operation, the external-attacher should add these finalizers to these Kubernetes API objects:
* To the `VolumeAttachment` so that when the object is deleted, the external-attacher has an opportunity to detach the volume first. External attacher removes this finalizer once the volume is fully detached from the node.
* To the `PersistentVolume` referenced by `VolumeAttachment` so the PV cannot be deleted while the volume is attached. External attacher needs information from the PV to perform detach operation. The attacher will remove the finalizer once all `VolumeAttachment` objects that refer to the PV are deleted, i.e. the volume is detached from all nodes.
If the operation completes successfully, the external-attacher will:
@ -314,7 +378,7 @@ The attach/detach controller,running as part of the kube-controller-manager bina
When the controller decides to attach a CSI volume, it will call the in-tree CSI volume plugins attach method. The in-tree CSI volume plugins attach method will do the following:
1. Create a new `VolumeAttachment` object (defined in the “Communication Channels” section) to attach the volume.
* The name of the `VolumeAttachment` object will be `pv-<SHA256(PVName+NodeName)>` (a sketch of this computation follows this list).
* `pv-` prefix is used to allow using other scheme(s) for inline volumes in the future, with their own prefix.
* SHA256 hash is to reduce length of `PVName` plus `NodeName` string, each of which could be max allowed name length (hexadecimal representation of SHA256 is 64 characters).
* `PVName` is `PV.name` of the attached PersistentVolume.
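A direct sketch of the name computation described above (a hexadecimal encoding of the hash is assumed here, matching the 64-character length noted in the list):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// attachmentName computes the VolumeAttachment object name for a
// PersistentVolume attached to a node, following the scheme above:
// "pv-" + hex(SHA256(PVName + NodeName)).
func attachmentName(pvName, nodeName string) string {
	sum := sha256.Sum256([]byte(pvName + nodeName))
	return fmt.Sprintf("pv-%x", sum)
}

func main() {
	// Prints "pv-" followed by 64 hexadecimal characters.
	fmt.Println(attachmentName("pvc-1234-abcd", "node-1"))
}
```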
@ -387,6 +451,127 @@ To deploy a containerized third-party CSI volume driver, it is recommended that
Alternatively, deployment could be simplified by having all components (including external-provisioner and external-attacher) in the same pod (DaemonSet). Doing so, however, would consume more resources, and require a leader election protocol (likely https://git.k8s.io/contrib/election) in the `external-provisioner` and `external-attacher` components.
#### Topology Representation in Node Objects
Topology information will be represented as labels.
Requirements:
* Must adhere to the [label format](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#syntax-and-character-set).
* Must support different drivers on the same node.
* The format of each key/value pair must match those in `PersistentVolume` and `StorageClass` objects, as described in the [Provisioning and Deleting](#provisioning-and-deleting) section.
Proposal: `"com.example.topology/rack": "rack1"`
The list of topology keys known to the driver is stored separately in the `CSINodeInfo` object.
Justifications:
* No strange separators are needed, compared to the alternative. Cleaner format.
* The same topology key could be used across different components (different storage plugin, network plugin, etc.)
* Once NodeRestriction is moved to the newer model (see [here](https://github.com/kubernetes/community/pull/911) for context), for each new label prefix introduced by a new driver, the cluster admin has to configure NodeRestriction to allow the driver to update labels with that prefix. Cluster installations could include certain prefixes for pre-installed drivers by default. This is less convenient compared to the alternative, which can allow editing of all CSI drivers by default using the “csi.kubernetes.io” prefix, but often cluster admins have to whitelist those prefixes anyway (for example, cloud.google.com).
Considerations:
* Upon driver deletion/upgrade/downgrade, stale labels will be left untouched. It's difficult for the driver to decide whether other components outside CSI rely on this label.
* During driver installation/upgrade/downgrade, the controller deployment must be brought down before the node deployment, and the node deployment must be deployed before the controller deployment, because provisioning relies on up-to-date node information. One possible issue arises if only topology values change while keys remain the same: if AllowedTopologies is not specified, requisite topology will contain both old and new topology values, and the CSI driver may fail the CreateVolume() call. Given that CSI drivers should be backward compatible, this is more of an issue when a node rolling upgrade happens before the controller update. It's not an issue if keys are changed as well, since requisite and preferred topology generation handles that appropriately.
* During driver installation/upgrade/downgrade, if a version of the controller (either old or new) is running while there is an ongoing rolling upgrade of the node deployment, and the new version of the CSI driver reports different topology information, nodes in the cluster may have different versions of topology information. However, this doesn't pose an issue. If AllowedTopologies is specified, a subset of nodes matching the version of topology information in AllowedTopologies will be used as provisioning candidates. If AllowedTopologies is not specified, a single node is used as the source of truth for keys.
* Topology keys inside `CSINodeInfo` must reflect the topology keys from drivers currently installed on the node. If no driver is installed, the collection must be empty. However, due to the possible race condition between kubelet (the writer) and the external provisioner (the reader), the provisioner must gracefully handle the case where `CSINodeInfo` is not up-to-date. In the current design, the provisioner will erroneously provision a volume on a node where it's inaccessible.
Alternative:
1. `"csi.kubernetes.io/topology.example.com_rack": "rack1"`
#### Topology Representation in PersistentVolume Objects
There exist multiple ways to represent a single topology as NodeAffinity. For example, suppose a `CreateVolumeResponse` contains the following accessible topology:
```yaml
- zone: "a"
rack: "1"
- zone: "b"
rack: "1"
- zone: "b"
rack: "2"
```
There are at least 3 ways to represent this in NodeAffinity (excluding `nodeAffinity`, `required`, and `nodeSelectorTerms` for simplicity):
Form 1 - `values` contain exactly 1 element.
```yaml
- matchExpressions:
  - key: zone
    operator: In
    values:
    - "a"
  - key: rack
    operator: In
    values:
    - "1"
- matchExpressions:
  - key: zone
    operator: In
    values:
    - "b"
  - key: rack
    operator: In
    values:
    - "1"
- matchExpressions:
  - key: zone
    operator: In
    values:
    - "b"
  - key: rack
    operator: In
    values:
    - "2"
```
Form 2 - Reduced by `rack`.
```yaml
- matchExpressions:
  - key: zone
    operator: In
    values:
    - "a"
    - "b"
  - key: rack
    operator: In
    values:
    - "1"
- matchExpressions:
  - key: zone
    operator: In
    values:
    - "b"
  - key: rack
    operator: In
    values:
    - "2"
```
Form 3 - Reduced by `zone`.
```yaml
- matchExpressions:
  - key: zone
    operator: In
    values:
    - "a"
  - key: rack
    operator: In
    values:
    - "1"
- matchExpressions:
  - key: zone
    operator: In
    values:
    - "b"
  - key: rack
    operator: In
    values:
    - "1"
    - "2"
```
The provisioner will always choose Form 1, i.e. all `values` will have at most 1 element. Reduction logic could be added in future versions to arbitrarily choose a valid and simpler form like Forms 2 & 3.
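A sketch of how a provisioner could emit Form 1 terms from the accessible topology returned by `CreateVolume` (the struct names are simplified stand-ins for the core/v1 `NodeSelectorTerm` and `NodeSelectorRequirement` types):

```go
package main

import (
	"fmt"
	"sort"
)

// requirement and term are simplified stand-ins for the core/v1
// NodeSelectorRequirement and NodeSelectorTerm types.
type requirement struct {
	Key      string
	Operator string
	Values   []string
}

type term struct {
	MatchExpressions []requirement
}

// termsFromTopology builds "Form 1" node selector terms: one term per topology
// segment, and every requirement carries exactly one value.
func termsFromTopology(segments []map[string]string) []term {
	terms := make([]term, 0, len(segments))
	for _, seg := range segments {
		keys := make([]string, 0, len(seg))
		for k := range seg {
			keys = append(keys, k)
		}
		sort.Strings(keys) // deterministic output
		reqs := make([]requirement, 0, len(keys))
		for _, k := range keys {
			reqs = append(reqs, requirement{Key: k, Operator: "In", Values: []string{seg[k]}})
		}
		terms = append(terms, term{MatchExpressions: reqs})
	}
	return terms
}

func main() {
	segments := []map[string]string{
		{"zone": "a", "rack": "1"},
		{"zone": "b", "rack": "1"},
		{"zone": "b", "rack": "2"},
	}
	for _, t := range termsFromTopology(segments) {
		fmt.Println(t)
	}
}
```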
#### Upgrade & Downgrade Considerations
When drivers are uninstalled, topology information stored in Node labels remains untouched. The recommended label format allows multiple sources (such as CSI, networking resources, etc.) to share the same label key, so it's nontrivial to accurately determine whether a label is still used.
In order to upgrade drivers using the recommended driver deployment mechanism, it is recommended that the user tear down the StatefulSet (controller components) before the DaemonSet (node components), and deploy the DaemonSet before the StatefulSet. There may be design improvements to eliminate this constraint, but they will be evaluated in a later iteration.
### Example Walkthrough
#### Provisioning Volumes
@ -402,7 +587,7 @@ Alternatively, deployment could be simplified by having all components (includin
#### Deleting Volumes
1. A user deletes a `PersistentVolumeClaim` object bound to a CSI volume.
2. The external-provisioner for the CSI driver sees the `PersistentVolumeClaim` was deleted and triggers the retention policy:
1. If the retention policy is `delete`
1. The external-provisioner triggers volume deletion by issuing a `DeleteVolume` call against the CSI volume plugin container.
2. Once the volume is successfully deleted, the external-provisioner deletes the corresponding `PersistentVolume` object.


@ -0,0 +1,340 @@
# In-tree Storage Plugin to CSI Migration Design Doc
Authors: @davidz627, @jsafrane
This document presents a detailed design for migrating in-tree storage plugins
to CSI. This will be an opt-in feature turned on at cluster creation time that
will redirect in-tree plugin operations to a corresponding CSI Driver.
## Background and Motivations
The Kubernetes volume plugins are currently in-tree, meaning all logic and
handling for each plugin lives in the Kubernetes codebase itself. With the
Container Storage Interface (CSI), the goal is to move those plugins out-of-tree.
CSI defines a standard interface for communication between the Container
Orchestrator (CO), Kubernetes in our case, and the storage plugins.
As the CSI Spec moves towards GA and more storage plugins are being created and
becoming production ready, we will want to migrate our in-tree plugin logic to
use CSI plugins instead. This is motivated by the fact that we are currently
supporting two versions of each plugin (one in-tree and one CSI), and that we
want to eventually transition all storage users to CSI.
In order to do this we need to migrate the internals of the in-tree plugins to
call out to CSI Plugins because we will be unable to deprecate the current
internal plugin APIs due to Kubernetes API deprecation policies. This will
lower cost of development as we only have to maintain one version of each
plugin, as well as ease the transition to CSI when we are able to deprecate the
internal APIs.
## Goals
* Compile all requirements for a successful transition of the in-tree plugins to
CSI
* As little code as possible remains in the Kubernetes Repo
* In-tree plugin API is untouched, user Pods and PVs continue working after
upgrades
* Minimize user visible changes
* Design a robust mechanism for redirecting in-tree plugin usage to appropriate
CSI drivers, while supporting seamless upgrade and downgrade between a new
Kubernetes version that uses CSI drivers for in-tree volume plugins and an old
Kubernetes version that uses old-fashioned volume plugins without CSI.
* Design framework for migration that allows for easy interface extension by
in-tree plugin authors to “migrate” their plugins.
* Migration must be modular so that each plugin can have migration turned on
and off separately
## Non-Goals
* Design a mechanism for deploying CSI drivers on all systems so that users can
use the current storage system the same way they do today without having to do
extra set up.
* Implementing CSI Drivers for existing plugins
* Define set of volume plugins that should be migrated to CSI
## Implementation Schedule
Alpha [1.14]
* Off by default
* Proof of concept migration of at least 2 storage plugins [AWS, GCE]
* Framework for plugin migration built for Dynamic provisioning, pre-provisioned
volumes, and in-tree volumes
Beta [Target 1.15]
* On by default
* Migrate all of the cloud provider plugins*
GA [TBD]
* Feature on by default, per-plugin toggle on for relevant cloud provider by
default
* CSI Drivers for migrated plugins available on related cloud provider cluster
by default
## Feature Gating
We will have an alpha feature gate for the whole feature that can turn the CSI
migration on or off; when it is off, all code paths should revert to/stay with
the in-tree plugins. We will also have individual flags for each driver so that
admins can toggle them on or off.
The feature gate can exist at the interception points in the OperationGenerator
for Attach and Mount, as well as in the PV Controller for Provisioning.
We will also have one feature flag for each driver's migration so that each
driver migration can be turned on and off individually.
The new feature gates for alpha are:
```
// Enables the in-tree storage to CSI Plugin migration feature.
CSIMigration utilfeature.Feature = "CSIMigration"
// Enables the GCE PD in-tree driver to GCE CSI Driver migration feature.
CSIMigrationGCE utilfeature.Feature = "CSIMigrationGCE"
// Enables the AWS in-tree driver to AWS CSI Driver migration feature.
CSIMigrationAWS utilfeature.Feature = "CSIMigrationAWS"
```
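As an illustration of how these gates could be combined at an interception
point (a sketch only; the real check would go through
`utilfeature.DefaultFeatureGate` and the plugin's own migration logic):

```go
package main

import "fmt"

// featureEnabled is a stand-in for utilfeature.DefaultFeatureGate.Enabled; in
// a real build the gates below would be the ones declared above.
var featureEnabled = map[string]bool{
	"CSIMigration":    true,
	"CSIMigrationGCE": true,
	"CSIMigrationAWS": false,
}

// useCSIForPlugin is an illustrative interception check: a plugin is routed to
// its CSI driver only when the umbrella gate and its per-driver gate are both
// on.
func useCSIForPlugin(perDriverGate string) bool {
	return featureEnabled["CSIMigration"] && featureEnabled[perDriverGate]
}

func main() {
	fmt.Println("gce-pd ->", useCSIForPlugin("CSIMigrationGCE"))  // true
	fmt.Println("aws-ebs ->", useCSIForPlugin("CSIMigrationAWS")) // false
}
```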
## Translation Layer
The main mechanism we will use to migrate plugins is redirecting in-tree
operation calls to the CSI Driver instead of the in-tree driver. The external
components will pick up these in-tree PVs and use a translation library to
translate them to a CSI Source.
Pros:
* Keeps old API objects as they are
* Facilitates gradual roll-over to CSI
Cons:
* Somewhat complicated and error prone.
* Bespoke translation logic for each in-tree plugin
### Dynamically Provisioned Volumes
#### Kubernetes Changes
Dynamically provisioned volumes will continue to be provisioned with the
in-tree `PersistentVolumeSource`. The CSI external-provisioner will pick up the
in-tree PVCs when migration is turned on and provision using the CSI Drivers;
it will then use the imported translation library to return a PV that contains
an equivalent of the original in-tree PV. The PV will then go through all the
same steps outlined below in the "Pre-Provisioned Volumes (and volumes
provisioned before migration)" section for the rest of the volume lifecycle.
#### Leader Election
There will have to be some mechanism to switch between in-tree and external
provisioner when the migration feature is turned on/off. The two should be
compatible as they both will create the same volume and PV based on the same
PVC, as well as both be able to delete the same PV/PVCs. The in-tree provisioner
will have logic added so that it will stand down and mark the PV as "migrated"
with an annotation when the migration is turned on and the external provisioner
will take care of the PV when it sees the annotation.
### Translation Library
In order to make this on-the-fly translation work we will develop a separate
translation library. This library will have to be able to translate from in-tree
PV Source to the equivalent CSI Source. This library can then be imported by
both Kubernetes and the external CSI Components to translate Volume Sources when
necessary. The cost of doing this translation will be very low as it will be an
imported library and part of whatever binary needs the translation (no extra
API or RPC calls).
#### Library Interface
```
type CSITranslator interface {
	// TranslateToCSI takes a volume.Spec and will translate it to a
	// CSIPersistentVolumeSource if the translation logic for that
	// specific in-tree volume spec has been implemented
	TranslateToCSI(spec volume.Spec) (CSIPersistentVolumeSource, error)

	// TranslateToInTree takes a CSIPersistentVolumeSource and will translate
	// it to a volume.Spec for the specific in-tree volume specified by
	// `inTreePlugin`, if that translation logic has been implemented
	TranslateToInTree(source CSIPersistentVolumeSource, inTreePlugin string) (volume.Spec, error)

	// IsMigrated returns true if the plugin has migration logic,
	// false if it does not
	IsMigrated(inTreePlugin string) bool
}
```
#### Library Versioning
Since the library will be imported by various components it is imperative that
all components import a version of the library that supports in-tree driver x
before the migration feature flag for x is turned on. If not, the TranslateToCSI
function will return an error when the translation is attempted.
### Pre-Provisioned Volumes (and volumes provisioned before migration)
In the OperationGenerator at the start of each volume operation call we will
check to see whether the plugin has been migrated.
For Controller calls, we will call the CSI calls instead of the in-tree calls.
The OperationGenerator can do the translation of the PV Source before handing it
to the CSI calls; therefore the in-tree CSI plugin will only have to deal with
what it sees as a CSI Volume. Special care must be taken that `volumeHandle` is
unique and also deterministic so that we can always find the correct volume.
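A hedged sketch of this interception and translation step (the types, the
driver name, and the volumeHandle format below are illustrative assumptions,
not the real translation library):

```go
package main

import "fmt"

// inTreePVSource and csiPVSource are simplified stand-ins for the in-tree and
// CSI PersistentVolume sources handled by the OperationGenerator.
type inTreePVSource struct {
	Plugin   string // e.g. "kubernetes.io/gce-pd"
	DiskName string
}

type csiPVSource struct {
	Driver       string
	VolumeHandle string
}

// translateToCSI is a hypothetical stand-in for the translation library: it
// maps an in-tree source to an equivalent CSI source with a deterministic
// volumeHandle. The driver name and handle format are illustrative only.
func translateToCSI(src inTreePVSource) (csiPVSource, error) {
	switch src.Plugin {
	case "kubernetes.io/gce-pd":
		return csiPVSource{Driver: "pd.csi.storage.gke.io", VolumeHandle: src.DiskName}, nil
	default:
		return csiPVSource{}, fmt.Errorf("no migration logic for plugin %q", src.Plugin)
	}
}

// attach is an illustrative interception point: when migration is enabled for
// the plugin, the in-tree source is translated and handed to the CSI path;
// otherwise the in-tree path is used.
func attach(src inTreePVSource, migrationEnabled bool) string {
	if migrationEnabled {
		if csiSrc, err := translateToCSI(src); err == nil {
			return "CSI attach of " + csiSrc.VolumeHandle + " via " + csiSrc.Driver
		}
	}
	return "in-tree attach of " + src.DiskName + " via " + src.Plugin
}

func main() {
	src := inTreePVSource{Plugin: "kubernetes.io/gce-pd", DiskName: "my-disk"}
	fmt.Println(attach(src, true))
	fmt.Println(attach(src, false))
}
```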
We also foresee that future controller calls such as resize and snapshot will
use a similar mechanism. All these external components will also need to be
updated to accept PVs of any source type and use the translation library to
translate the in-tree PV Source into a CSI Source when necessary.
For Node calls, the VolumeToMount object will contain the in-tree PV Source;
this can then be translated by the translation library when needed and the
information can be fed to the CSI components when necessary.
Then the rest of the code in the Operation Generator can execute as normal with
the CSI Plugin and the annotation in the requisite locations.
Caveat: For ALL detach calls of plugins that MAY have already been migrated we
have to attempt to DELETE the VolumeAttachment object that would have been
created if that plugin was migrated. This is because Attach after migration
creates a VolumeAttachment object, and if for some reason we are doing a detach
with the in-tree plugin, the VolumeAttachment object becomes orphaned.
### In-Line Volumes
In-line controller calls are a special case because there is no PV. In this case
we will add the CSI Source JSON to the VolumeToAttach object and in Attach we
will put the Source in a new field in the VolumeAttachment object
VolumeAttachment.Spec.Source.VolumeAttachmentSource.InlineVolumeSource. The CSI Attacher will have to
be modified to also check this location for a source before checking the PV
itself.
We need to be careful with naming VolumeAttachments for in-line volumes. The
name needs to be unique and A/D controller must be able to find the right
VolumeAttachment when a pod is deleted (i.e. using only info in Node.Status).
CSI driver in kubelet must be able to find the VolumeAttachment too to get
AttachmentMetadata for NodeStage/NodePublish.
In a downgrade scenario where the migration is then turned off, we will have to
remove these floating VolumeAttachment objects; the same issue is outlined above
in the "Pre-Provisioned Volumes (and volumes provisioned before migration)"
section.
For more details on this see the PR that specs out CSI Inline Volumes in more detail:
https://github.com/kubernetes/community/pull/2273. Basically we will just translate
the in-tree inline volumes into the format specified/implemented in the
container-storage-interface-inline-volumes proposal.
## Interactions with PV-PVC Protection Finalizers
PV-PVC Protection finalizers prevent deletion of a PV when it is bound to a PVC,
and prevent deletion of a PVC when it is in use by a pod.
There is no known issue with the interaction here. The finalizers will still
work in the same way, as we are not removing/adding PVs or PVCs in
out-of-the-ordinary ways.
## Dealing with CSI Driver Failures
Plugin should fail if the CSI Driver is down and migration is turned on. When
the driver recovers we should be able to resume gracefully.
We will also create a playbook entry for how to turn off the CSI Driver
migration gracefully, how to tell when the CSI Driver is broken or non-existent,
and how to redeploy a CSI Driver in a cluster.
## Upgrade/Downgrade, Migrate/Un-migrate
### Kubelet Node Annotation
When the Kubelet starts, it will check whether the feature gate is
enabled and if so will annotate its node with `csi.attach.kubernetes.io/gce-pd`
for example to communicate to the A/D Controller that it supports migration of
the gce-pd to CSI. The A/D Controller will have to choose on a per-node basis
whether to use the CSI or the in-tree plugin for attach based on 3 criteria:
1. Feature gate
2. Plugin Migratable (Implements MigratablePlugin interface)
3. Node to Attach to has requisite Annotation
Note: All 3 criteria must be satisfied for A/D controller to Attach/Detach with
CSI instead of in-tree plugin. For example if a Kubelet has feature on and marks
the annotation, but the A/D Controller does not have the feature gate flipped,
we consider this user error and will throw some errors.
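A sketch of that three-way decision (type and annotation names are
illustrative; the real A/D Controller would consult the feature gates and the
plugin interface directly):

```go
package main

import "fmt"

// node is an illustrative stand-in carrying only the annotations the check
// needs.
type node struct {
	Annotations map[string]string
}

// useCSIAttach returns true only when the feature gate is on, the plugin has
// migration logic, and the target node has announced (via its annotation) that
// its Kubelet performs migrated CSI operations.
func useCSIAttach(featureGateEnabled, pluginMigratable bool, n node, annotationKey string) bool {
	if !featureGateEnabled || !pluginMigratable {
		return false
	}
	_, annotated := n.Annotations[annotationKey]
	return annotated
}

func main() {
	n := node{Annotations: map[string]string{"csi.attach.kubernetes.io/gce-pd": "true"}}
	fmt.Println(useCSIAttach(true, true, n, "csi.attach.kubernetes.io/gce-pd"))      // true
	fmt.Println(useCSIAttach(true, true, node{}, "csi.attach.kubernetes.io/gce-pd")) // false: node not annotated
}
```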
This can cause a race between the A/D Controller and the Kubelet annotating, if
a volume is attached before the Kubelet completes annotation the A/D controller
could attach using in-tree plugin instead of CSI while the Kubelet is expecting
a CSI Attach. The same issue exists on downgrade if the Annotation is not
removed before a volume is attached. An additional consideration is that we
cannot have the Kubelet downgraded to a version that does not have the
Annotation removal code.
### Node Drain Requirement
We require nodes to be drained whenever the Kubelet is upgraded/downgraded or
migrated/unmigrated to ensure that the entire volume lifecycle is maintained
inside one code branch (CSI or in-tree). This simplifies upgrade/downgrade
significantly and reduces the chance of errors and races.
### Upgrade/Downgrade Migrate/Unmigrate Scenarios
For upgrade, starting from a non-migrated cluster you must turn on migration for
A/D Controller first, then drain your node before turning on migration for the
Kubelet. The workflow is as follows:
1. A/D Controller and Kubelet are both not migrated
2. A/D Controller restarted and migrated (flags flipped)
3. A/D Controller continues to use in-tree code for this node b/c node
annotation doesn't exist
4. Node drained and made unschedulable. All volumes unmounted/detached with in-tree code
5. Kubelet restarted and migrated (flags flipped)
6. Kubelet annotates node to tell A/D controller this node has been migrated
7. Kubelet is made schedulable
8. Both A/D Controller & Kubelet Migrated, node is in "fresh" state so all new
volumes lifecycle is CSI
For downgrade, starting from a fully migrated cluster you must drain your node
first, then turn off migration for your Kubelet, then turn off migration for the
A/D Controller. The workflow is as follows:
1. A/D Controller and Kubelet are both migrated
2. Kubelet drained and made unschedulable, all volumes unmounted/detached with CSI code
3. Kubelet restarted and un-migrated (flags flipped)
4. Kubelet removes node annotation to tell A/D Controller this node is not
migrated. In case kubelet does not have annotation removal code, admin must
remove the annotation manually.
5. Kubelet is made schedulable.
6. At this point all volumes going onto the node would be using in-tree code for
both A/D Controller (b/c of annotation) and Kubelet
7. Restart and un-migrate A/D Controller
With these workflows a volume attached with CSI will be handled by CSI code for
its entire lifecycle, and a volume attached with in-tree code will be handled by
in-tree code for its entire lifecycle.
## Cloud Provider Requirements
There is a push to remove CloudProvider code from kubernetes.
There will not be any general auto-deployment mechanism for ALL CSI drivers
covered in this document so the timeline to remove CloudProvider code using this
design is undetermined: For example: At some point GKE could auto-deploy the GCE
PD CSI driver and have migration for that turned on by default, however it may
not deploy any other drivers by default. And at this point we can only remove
the code for the GCE In-tree plugin (this would still break anyone doing their
own deployments while using GCE unless they install the GCE PD CSI Driver).
We could have auto-deploy depending on what cloud provider kubernetes is running
on. But AFAIK there is no standard mechanism to guarantee this on all Cloud
Providers.
For example the requirements for just the GCE Cloud Provider code for storage
with minimal disruption to users would be:
* In-tree to CSI Plugin migration goes GA
* GCE PD CSI Driver deployed on GCE/GKE by default (resource requirements of
driver need to be determined)
* GCE PD CSI Migration turned on by default
* Remove in-tree plugin code and cloud provider code
And at this point users doing their own deployment and not installing the GCE PD
CSI driver encounter an error.
## Testing
### Standard
The good news is that all “normal functionality” can be tested by simply
bringing up a cluster with “migrated” drivers and running the existing e2e
tests for that driver. We will create CI jobs that run in this configuration
for each new volume plugin.
### Migration/Non-migration (Upgrade/Downgrade)
Write tests where, in a normal workflow of attach/mount/unmount/detach, any one
of these operations actually happens with the old volume plugin, not the CSI
one. This makes sure that the workflow is resilient to rollback at any point in
time.
### Version Skew
Master/Node can have a version skew of up to 2 versions. The Master must always
be of an equal or higher version than the node. This should be covered by the
tests in the Migration/Non-migration section.


@ -0,0 +1,376 @@
Kubernetes CSI Snapshot Proposal
================================
**Authors:** [Jing Xu](https://github.com/jingxu97), [Xing Yang](https://github.com/xing-yang), [Tomas Smetana](https://github.com/tsmetana), [Huamin Chen ](https://github.com/rootfs)
## Background
Many storage systems (GCE PD, Amazon EBS, etc.) provide the ability to create "snapshots" of persistent volumes to protect against data loss. Snapshots can be used in place of a traditional backup system to back up and restore primary and critical data. Snapshots allow for quick data backup (for example, it takes a fraction of a second to create a GCE PD snapshot) and offer fast recovery time objectives (RTOs) and recovery point objectives (RPOs). Snapshots can also be used for data replication, distribution and migration.
As the initial effort to support snapshot in Kubernetes, volume snapshotting has been released as a prototype in Kubernetes 1.8. An external controller and provisioner (i.e. two separate binaries) have been added in the [external storage repo](https://github.com/kubernetes-incubator/external-storage/tree/master/snapshot). The prototype currently supports GCE PD, AWS EBS, OpenStack Cinder, GlusterFS, and Kubernetes hostPath volumes. Volume snapshots APIs are using [CRD](https://kubernetes.io/docs/tasks/access-kubernetes-api/extend-api-custom-resource-definitions/).
To continue that effort, this design is proposed to add the snapshot support for CSI Volume Drivers. Because the overall trend in Kubernetes is to keep the core APIs as small as possible and use CRD for everything else, this proposal adds CRD definitions to represent snapshots, and an external snapshot controller to handle volume snapshotting. Out-of-tree external provisioner can be upgraded to support creating volume from snapshot. In this design, only CSI volume drivers will be supported. The CSI snapshot spec is proposed [here](https://github.com/container-storage-interface/spec/pull/224).
## Objectives
For the first version of snapshotting support in Kubernetes, only on-demand snapshots for CSI Volume Drivers will be supported.
### Goals
* Goal 1: Expose standardized snapshotting operations to create, list, and delete snapshots in Kubernetes REST API.
Currently the APIs will be implemented with CRD (CustomResourceDefinitions).
* Goal 2: Implement CSI volume snapshot support.
An external snapshot controller will be deployed with other external components (e.g., external-attacher, external-provisioner) for each CSI Volume Driver.
* Goal 3: Provide a convenient way of creating new and restoring existing volumes from snapshots.
### Non-Goals
The following are non-goals for the current phase, but will be considered at a later phase.
* Goal 4: Offer application-consistent snapshots by providing pre/post snapshot hooks to freeze/unfreeze applications and/or unmount/mount file system.
* Goal 5: Provide higher-level management, such as backing up and restoring a pod and statefulSet, and creating a consistent group of snapshots.
## Design Details
In this proposal, volume snapshots are considered another type of storage resource managed by Kubernetes. Therefore the snapshot API and controller follow the design pattern of existing volume management. There are three APIs, VolumeSnapshot, VolumeSnapshotContent, and VolumeSnapshotClass, which are similar in structure to PersistentVolumeClaim, PersistentVolume, and StorageClass. The external snapshot controller functions similarly to the in-tree PV controller. Along with the snapshot APIs, we also propose to add a new data source struct to the PersistentVolumeClaim (PVC) API in order to support restoring snapshots to volumes. The following section explains the APIs and the controller design in more detail.
### Snapshot API Design
The API design of VolumeSnapshot and VolumeSnapshotContent is modeled after PersistentVolumeClaim and PersistentVolume. In the first version, the VolumeSnapshot lifecycle is completely independent of its volume source (PVC). When the PVC/PV is deleted, the corresponding VolumeSnapshot and VolumeSnapshotContent objects will continue to exist. However, for some volume plugins, snapshots have a dependency on their volumes. In a future version, we plan to have complete lifecycle management which can better handle the relationship between snapshots and their volumes (e.g., a finalizer to prevent deleting volumes while there are snapshots depending on them).
#### The `VolumeSnapshot` Object
```GO
// +genclient
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
// VolumeSnapshot is a user's request for taking a snapshot. Upon successful creation of the actual
// snapshot by the volume provider it is bound to the corresponding VolumeSnapshotContent.
// Only the VolumeSnapshot object is accessible to the user in the namespace.
type VolumeSnapshot struct {
metav1.TypeMeta `json:",inline"`
// Standard object's metadata.
// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
// +optional
metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`
// Spec defines the desired characteristics of a snapshot requested by a user.
Spec VolumeSnapshotSpec `json:"spec" protobuf:"bytes,2,opt,name=spec"`
// Status represents the latest observed state of the snapshot
// +optional
Status VolumeSnapshotStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
}
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
// VolumeSnapshotList is a list of VolumeSnapshot objects
type VolumeSnapshotList struct {
metav1.TypeMeta `json:",inline"`
// +optional
metav1.ListMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`
// Items is the list of VolumeSnapshots
Items []VolumeSnapshot `json:"items" protobuf:"bytes,2,rep,name=items"`
}
// VolumeSnapshotSpec describes the common attributes of a volume snapshot
type VolumeSnapshotSpec struct {
// Source has the information about where the snapshot is created from.
// In Alpha version, only PersistentVolumeClaim is supported as the source.
// If not specified, user can create VolumeSnapshotContent and bind it with VolumeSnapshot manually.
// +optional
Source *TypedLocalObjectReference `json:"source" protobuf:"bytes,1,opt,name=source"`
// SnapshotContentName binds the VolumeSnapshot object with the VolumeSnapshotContent
// +optional
SnapshotContentName string `json:"snapshotContentName" protobuf:"bytes,2,opt,name=snapshotContentName"`
// Name of the VolumeSnapshotClass used by the VolumeSnapshot. If not specified, a default snapshot class will
// be used if it is available.
// +optional
VolumeSnapshotClassName *string `json:"snapshotClassName" protobuf:"bytes,3,opt,name=snapshotClassName"`
}
// VolumeSnapshotStatus is the status of the VolumeSnapshot
type VolumeSnapshotStatus struct {
// CreationTime is the time the snapshot was successfully created. If it is set,
// it means the snapshot was created; Otherwise the snapshot was not created.
// +optional
CreationTime *metav1.Time `json:"creationTime" protobuf:"bytes,1,opt,name=creationTime"`
// When restoring volume from the snapshot, the volume size should be equal or
// larger than the Restoresize if it is specified. If RestoreSize is set to nil, it means
// that the storage plugin does not have this information available.
// +optional
RestoreSize *resource.Quantity `json:"restoreSize" protobuf:"bytes,2,opt,name=restoreSize"`
// Ready is set to true only if the snapshot is ready to use (e.g., finish uploading if
// there is an uploading phase) and also VolumeSnapshot and its VolumeSnapshotContent
// bind correctly with each other. If any of the above condition is not true, Ready is
// set to false
// +optional
Ready bool `json:"ready" protobuf:"varint,3,opt,name=ready"`
// The last error encountered during create snapshot operation, if any.
// This field must only be set by the entity completing the create snapshot
// operation, i.e. the external-snapshotter.
// +optional
Error *storage.VolumeError
}
```
Note that if an error occurs before the snapshot is cut, `Error` will be set and none of `CreatedAt`/`AvailableAt` will be set. If an error occurs after the snapshot is cut but before it is available, `Error` will be set and `CreatedAt` should still be set, but `AvailableAt` will not be set. If an error occurs after the snapshot is available, `Error` will be set and `CreatedAt` should still be set, but `AvailableAt` will no longer be set.
#### The `VolumeSnapshotContent` Object
```GO
// +genclient
// +genclient:nonNamespaced
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
// VolumeSnapshotContent represents the actual snapshot object
type VolumeSnapshotContent struct {
metav1.TypeMeta `json:",inline"`
// Standard object's metadata.
// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
// +optional
metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`
// Spec defines a specification of a volume snapshot
Spec VolumeSnapshotContentSpec `json:"spec" protobuf:"bytes,2,opt,name=spec"`
}
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
// VolumeSnapshotContentList is a list of VolumeSnapshotContent objects
type VolumeSnapshotContentList struct {
metav1.TypeMeta `json:",inline"`
// +optional
metav1.ListMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`
// Items is the list of VolumeSnapshotContents
Items []VolumeSnapshotContent `json:"items" protobuf:"bytes,2,rep,name=items"`
}
// VolumeSnapshotContentSpec is the spec of the volume snapshot content
type VolumeSnapshotContentSpec struct {
// Source represents the location and type of the volume snapshot
VolumeSnapshotSource `json:",inline" protobuf:"bytes,1,opt,name=volumeSnapshotSource"`
// VolumeSnapshotRef is part of bi-directional binding between VolumeSnapshot
// and VolumeSnapshotContent. It becomes non-nil when bound.
// +optional
VolumeSnapshotRef *core_v1.ObjectReference `json:"volumeSnapshotRef" protobuf:"bytes,2,opt,name=volumeSnapshotRef"`
// PersistentVolumeRef represents the PersistentVolume that the snapshot has been
// taken from. It becomes non-nil when VolumeSnapshot and VolumeSnapshotContent are bound.
// +optional
PersistentVolumeRef *core_v1.ObjectReference `json:"persistentVolumeRef" protobuf:"bytes,3,opt,name=persistentVolumeRef"`
// Name of the VolumeSnapshotClass used by the VolumeSnapshotContent. If not specified, a default snapshot class will
// be used if it is available.
// +optional
VolumeSnapshotClassName *string `json:"snapshotClassName" protobuf:"bytes,4,opt,name=snapshotClassName"`
}
// VolumeSnapshotSource represents the actual location and type of the snapshot. Only one of its members may be specified.
type VolumeSnapshotSource struct {
// CSI (Container Storage Interface) represents storage that handled by an external CSI Volume Driver (Alpha feature).
// +optional
CSI *CSIVolumeSnapshotSource `json:"csiVolumeSnapshotSource,omitempty"`
}
// Represents the source from CSI volume snapshot
type CSIVolumeSnapshotSource struct {
// Driver is the name of the driver to use for this snapshot.
// Required.
Driver string `json:"driver"`
// SnapshotHandle is the unique snapshot id returned by the CSI volume
// plugins CreateSnapshot to refer to the snapshot on all subsequent calls.
// Required.
SnapshotHandle string `json:"snapshotHandle"`
// Timestamp when the point-in-time snapshot is taken on the storage
// system. This timestamp will be generated by the CSI volume driver after
// the snapshot is cut. The format of this field should be a Unix nanoseconds
// time encoded as an int64. On Unix, the command `date +%s%N` returns
// the current time in nanoseconds since 1970-01-01 00:00:00 UTC.
CreationTime *int64 `json:"creationTime,omitempty" protobuf:"varint,3,opt,name=creationTime"`
// When restoring volume from the snapshot, the volume size should be equal or
// larger than the Restoresize if it is specified. If RestoreSize is set to nil, it means
// that the storage plugin does not have this information available.
// +optional
RestoreSize *resource.Quantity `json:"restoreSize" protobuf:"bytes,2,opt,name=restoreSize"`
}
```
#### The `VolumeSnapshotClass` Object
A new VolumeSnapshotClass API object will be added instead of reusing the existing StorageClass, in order to avoid mixing parameters between snapshots and volumes. Each CSI Volume Driver can have its own default VolumeSnapshotClass. If a VolumeSnapshotClass is not provided, a default will be used. It allows new parameters to be added for snapshots.
```
// +genclient
// +genclient:nonNamespaced
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
// VolumeSnapshotClass describes the parameters used by storage system when
// provisioning VolumeSnapshots from PVCs.
// The name of a VolumeSnapshotClass object is significant, and is how users can request a particular class.
type VolumeSnapshotClass struct {
metav1.TypeMeta `json:",inline"`
// Standard object's metadata.
// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
// +optional
metav1.ObjectMeta `json:"metadata,omitempty" protobuf:"bytes,1,opt,name=metadata"`
// Snapshotter is the driver expected to handle this VolumeSnapshotClass.
Snapshotter string `json:"snapshotter" protobuf:"bytes,2,opt,name=snapshotter"`
// Parameters holds parameters for the snapshotter.
// These values are opaque to the system and are passed directly
// to the snapshotter.
// +optional
Parameters map[string]string `json:"parameters,omitempty" protobuf:"bytes,3,rep,name=parameters"`
}
```
### Volume API Changes
With Snapshot API available, users could provision volumes from snapshot and data will be pre-populated to the volumes. Also considering clone and other possible storage operations, there could be many different types of sources used for populating the data to the volumes. In this proposal, we add a general "DataSource" which could be used to represent different types of data sources.
#### The `DataSource` Object in PVC
Add a new `DataSource` field to the PVC to represent the source of the data which is populated to the provisioned volume. The external-provisioner will check the `DataSource` field and try to provision the volume from the source. In the first version, VolumeSnapshot is the only supported `Type` of data source object reference. Other types will be added in a future version. If an unsupported `Type` is used, the PV Controller SHALL fail the operation. Please see more details [here](https://github.com/kubernetes/community/pull/2495)
Possible `DataSource` types may include the following:
* VolumeSnapshot: restore snapshot to a new volume
* PersistentVolumeClaim: clone volume which is represented by PVC
```
type PersistentVolumeClaimSpec struct {
// If specified when creating, volume will be prepopulated with data from the DataSource.
// +optional
DataSource *TypedLocalObjectReference `json:"dataSource" protobuf:"bytes,2,opt,name=dataSource"`
}
```
Add a TypedLocalObjectReference in core API.
```
// TypedLocalObjectReference contains enough information to let you locate the referenced object inside the same namespace.
type TypedLocalObjectReference struct {
// Name of the object reference.
Name string
// Kind indicates the type of the object reference.
Kind string
}
```
### Snapshot Controller Design
As the figure below shows, the CSI snapshot controller architecture consists of an external snapshotter which talks to out-of-tree CSI Volume Driver over socket (/run/csi/socket by default, configurable by -csi-address). External snapshotter is part of Kubernetes implementation of [Container Storage Interface (CSI)](https://github.com/container-storage-interface/spec). It is an external controller that monitors `VolumeSnapshot` and `VolumeSnapshotContent` objects and creates/deletes snapshot.
![CSI Snapshot Diagram](csi-snapshot_diagram.png?raw=true "CSI Snapshot Diagram")
* External snapshotter uses ControllerGetCapabilities to find out if CSI driver supports CREATE_DELETE_SNAPSHOT calls. It degrades to trivial mode if not.
* External snapshotter is responsible for creating/deleting snapshots and binding snapshot and SnapshotContent objects. It follows the [controller](https://github.com/kubernetes/community/blob/master/contributors/devel/controllers.md) pattern and uses informers to watch for `VolumeSnapshot` and `VolumeSnapshotContent` create/update/delete events. It filters `VolumeSnapshot` instances, keeping those with `Snapshotter==<CSI driver name>` (see the sketch after this list), and processes these events in workqueues with exponential backoff.
* For dynamically created snapshot, it should have a VolumeSnapshotClass associated with it. User can explicitly specify a VolumeSnapshotClass in the VolumeSnapshot API object. If user does not specify a VolumeSnapshotClass, a default VolumeSnapshotClass created by the admin will be used. This is similar to how a default StorageClass created by the admin will be used for the provisioning of a PersistentVolumeClaim.
* For statically bound snapshots, the user/admin must specify bi-directional pointers correctly for both VolumeSnapshot and VolumeSnapshotContent, so that the controller knows how to bind them. Otherwise, if the VolumeSnapshot points to a non-existent VolumeSnapshotContent, or the VolumeSnapshotContent does not point back to the VolumeSnapshot, the Error status will be set for the VolumeSnapshot.
* External snapshotter is running in the sidecar along with external-attacher and external-provisioner for each CSI Volume Driver.
* In current design, when the storage system fails to create snapshot, retry will not be performed in the controller. This is because users may not want to retry when taking consistent snapshots or scheduled snapshots when the timing of the snapshot creation is important. In a future version, a maxRetries flag or retry termination timestamp will be added to allow users to control whether retries are needed.
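As a small illustration of the filtering described in the list above (simplified stand-in types; the real external snapshotter uses informers and the API types generated from the CRDs):

```go
package main

import "fmt"

// Minimal stand-ins for the API fields the filter needs.
type volumeSnapshotClass struct {
	Name        string
	Snapshotter string
}

type volumeSnapshot struct {
	Name              string
	SnapshotClassName string
}

// shouldProcess is an illustrative version of the filter: the external
// snapshotter only enqueues VolumeSnapshots whose class names a Snapshotter
// equal to the CSI driver it is deployed with.
func shouldProcess(vs volumeSnapshot, classes map[string]volumeSnapshotClass, driverName string) bool {
	class, ok := classes[vs.SnapshotClassName]
	return ok && class.Snapshotter == driverName
}

func main() {
	classes := map[string]volumeSnapshotClass{
		"csi-hostpath-snapclass": {Name: "csi-hostpath-snapclass", Snapshotter: "csi-hostpath"},
	}
	vs := volumeSnapshot{Name: "snapshot-demo", SnapshotClassName: "csi-hostpath-snapclass"}
	fmt.Println(shouldProcess(vs, classes, "csi-hostpath")) // true
}
```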
#### Changes in CSI External Provisioner
`DataSource` is available in `PersistentVolumeClaim` to represent the source of the data which is prepopulated to the provisioned volume. The operation of the provisioning of a volume from a snapshot data source will be handled by the out-of-tree CSI External Provisioner. The in-tree PV Controller will handle the binding of the PV and PVC once they are ready.
#### CSI Volume Driver Snapshot Support
The out-of-tree CSI Volume Driver creates a snapshot on the backend storage system or cloud provider, and calls CreateSnapshot through CSI ControllerServer and returns CreateSnapshotResponse. The out-of-tree CSI Volume Driver needs to implement the following functions:
* CreateSnapshot, DeleteSnapshot, and create volume from snapshot if it supports CREATE_DELETE_SNAPSHOT.
* ListSnapshots if it supports LIST_SNAPSHOTS.
ListSnapshots can be an expensive operation because it will try to list all snapshots on the storage system. For a storage system that takes nightly periodic snapshots, the total number of snapshots on the system can be huge. Kubernetes should try to avoid this call if possible. Instead, calling ListSnapshots with a specific snapshot_id as filtering to query the status of the snapshot will be more desirable and efficient.
CreateSnapshot is a synchronous function and it must be blocking until the snapshot is cut. For cloud providers that support the uploading of a snapshot as part of creating snapshot operation, CreateSnapshot function must also be blocking until the snapshot is cut and after that it shall return an operation pending gRPC error code until the uploading process is complete.
Refer to [Container Storage Interface (CSI)](https://github.com/container-storage-interface/spec) for detailed instructions on how CSI Volume Driver shall implement snapshot functions.
## Transition to the New Snapshot Support
### Existing Implementation in External Storage Repo
For the snapshot implementation in [external storage repo](https://github.com/kubernetes-incubator/external-storage/tree/master/snapshot), an external snapshot controller and an external provisioner need to be deployed.
* The old implementation does not support CSI volume drivers.
* VolumeSnapshotClass concept does not exist in the old design.
* To restore a volume from the snapshot, however, a user needs to create a new StorageClass that is different from the original one for the PVC.
Here is an example yaml file to create a snapshot in the old design:
```yaml
apiVersion: volumesnapshot.external-storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: hostpath-test-snapshot
spec:
  persistentVolumeClaimName: pvc-test-hostpath
```
### New Snapshot Design for CSI
For the new snapshot model, a sidecar "Kubernetes to CSI" proxy container called "external-snapshotter" needs to be deployed in addition to the sidecar container for the external provisioner. This deployment model is shown in the CSI Snapshot Diagram in the CSI External Snapshot Controller section.
* The new design supports CSI volume drivers.
* To create a snapshot for CSI, a VolumeSnapshotClass can be created and specified in the spec of VolumeSnapshot.
* To restore a volume from the snapshot, users could use the same StorageClass that is used for the original PVC.
Here is an example to create a VolumeSnapshotClass and to create a snapshot in the new design:
```yaml
apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshotClass
metadata:
  name: csi-hostpath-snapclass
snapshotter: csi-hostpath
---
apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshot
metadata:
  name: snapshot-demo
spec:
  snapshotClassName: csi-hostpath-snapclass
  source:
    name: hpvc
    kind: PersistentVolumeClaim
```


@ -0,0 +1,121 @@
# Add DataSource for Volume Operations
Note: this proposal is part of the [Volume Snapshot](https://github.com/kubernetes/community/pull/2335) feature design, and is also relevant to the recently proposed [Volume Clone](https://github.com/kubernetes/community/pull/2533) feature.
## Goal
Currently in Kubernetes, volume plugins only support provisioning empty volumes. With the new storage features being proposed (including [Volume Snapshot](https://github.com/kubernetes/community/pull/2335) and [Volume Clone](https://github.com/kubernetes/community/pull/2533)), there is a need to support data population during volume provisioning. For example, a volume can be created from a snapshot source, or cloned from another volume source. Depending on the source used to create the volume, there are two scenarios:
1. The volume provisioner recognizes the source and can create the volume from it directly (e.g., restore a snapshot to a volume or clone a volume).
2. The volume provisioner does not recognize the volume source and creates an empty volume. Another external component (a data populator) can watch the volume creation and implement the logic to populate/import the data into the provisioned volume. Only after the data is populated is the PVC ready for use.
There could be many different types of sources used for populating data into volumes. In this proposal, we propose to add a generic "DataSource" field to PersistentVolumeClaimSpec to represent different types of data sources.
## Design
### API Change
A new DataSource field is proposed to be added to the PVC to represent the source of the data that is pre-populated into the provisioned volume. For the DataSource field, we propose to define a new type, `TypedLocalObjectReference`. It is similar to the `LocalObjectReference` type, with an additional Kind field in order to support multiple data source types. In the alpha version, the data source is restricted to the same namespace as the PVC. The following are the APIs we propose to add:
```
type PersistentVolumeClaimSpec struct {
	// If specified, volume will be pre-populated with data from the specified data source.
	// +optional
	DataSource *TypedLocalObjectReference `json:"dataSource" protobuf:"bytes,2,opt,name=dataSource"`
}

// TypedLocalObjectReference contains enough information to let you locate the referenced object inside the same namespace.
type TypedLocalObjectReference struct {
	// Name of the object reference.
	Name string
	// Kind indicates the type of the object reference.
	Kind string
	// APIGroup is the group for the resource being referenced
	APIGroup string
}
```
### Design Details
In the first alpha version, we only support data sources of type snapshot, so the Kind in DataSource must be "VolumeSnapshot". In this case, the provisioner should provision the volume and populate the data in one step; there is no need for an external data populator yet.
For other types of data sources that require an external data populator, volume creation and data population are two separate steps. Only when the data is ready can the PVC/PV be marked as ready (Bound) so that users can start to use them. We are working on a separate proposal to address this using a similar idea to ["Pod Ready++"](https://github.com/kubernetes/community/blob/master/keps/sig-network/0007-pod-ready%2B%2B.md).
Note: in order to use this data source feature, the user/admin needs to update to the new external provisioner, which can recognize the snapshot data source. Otherwise, the data source will be ignored and an empty volume will be created.
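As an illustration of the split described above, the following is a hedged sketch of how a provisioner might branch on the proposed field; the helper name, comments, and import path are illustrative, not actual external-provisioner code:
```go
import v1 "k8s.io/api/core/v1"

// handleDataSource is a hypothetical helper that decides how to treat the
// proposed DataSource field on a claim.
func handleDataSource(claim *v1.PersistentVolumeClaim) {
	if claim.Spec.DataSource == nil {
		return // no data source: provision an empty volume as today
	}
	switch claim.Spec.DataSource.Kind {
	case "VolumeSnapshot":
		// Alpha scope: the provisioner restores the snapshot into the new
		// volume, so provisioning and data population happen in one step.
	default:
		// Unknown source: either ignore it (pre-feature behavior, yielding an
		// empty volume) or leave it to an external data populator, which marks
		// the PVC Bound only after the data is ready.
	}
}
```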
## Use cases
* Use a snapshot to back up data: Alice wants to take a snapshot of her Mongo database. If she accidentally deletes her tables, she wants to restore her volumes from the snapshot.
To create a snapshot for a volume (represented by a PVC), use the following snapshot.yaml:
```
apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshot
metadata:
  name: snapshot-pd-1
  namespace: mynamespace
spec:
  source:
    kind: PersistentVolumeClaim
    name: podpvc
  snapshotClassName: snapshot-class
```
After the snapshot is ready, create a new volume from the snapshot:
```
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: snapshot-pvc
  namespace: mynamespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: csi-gce-pd
  dataSource:
    kind: VolumeSnapshot
    name: snapshot-pd-1
  resources:
    requests:
      storage: 6Gi
```
* Clone volume: Bob wants to copy the data from one volume to another by cloning the volume.
```
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: clone-pvc
  namespace: mynamespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: csi-gce-pd
  dataSource:
    kind: PersistentVolumeClaim
    name: pvc-1
  resources:
    requests:
      storage: 10Gi
```
* Import data from a GitHub repo: Alice wants to import data from a GitHub repo into her volume. The GitHub repo is represented by a PVC (gitrepo-1). The difference compared with use case 2 is that, for cloning, the data source must be the same kind of volume as the provisioned volume.
```
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: clone-pvc
  namespace: mynamespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: csi-gce-pd
  dataSource:
    kind: PersistentVolumeClaim
    name: gitrepo-1
  resources:
    requests:
      storage: 100Gi
```

View File

@ -0,0 +1,170 @@
# Proposal for Growing FlexVolume Size
**Authors:** [xingzhou](https://github.com/xingzhou)
## Goals
Since PVC resizing was introduced in Kubernetes v1.8, several volume plugins have added support for this feature, e.g. GlusterFS and AWS EBS. In this proposal, we propose to support FlexVolume expansion, so that when a user uses FlexVolume and a corresponding volume driver to connect to a backend storage system, he/she can expand the PV size by updating the PVC in Kubernetes.
## Non Goals
* We only consider expanding FlexVolume size in this proposal. Decreasing the size of a FlexVolume will be designed in the future.
* In this proposal, the user can only expand the FlexVolume size manually by updating the PVC. Auto-expansion of FlexVolume based on specific metrics is not considered.
* The proposal only contains the changes made in FlexVolume; volume driver changes, which must be made by the user, are not included.
## Implementation Designs
### Prerequisites
* Kubernetes should be at least v1.8.
* Enable resizing by setting the feature gate `ExpandPersistentVolumes` to `true`.
* Enable the `PersistentVolumeClaimResize` admission plugin (optional).
* Follow the UI of PV resizing, including:
* Only dynamic provisioning supports volume resizing
* Set StorageClass attribute `allowVolumeExpansion` to `true`
### Admission Control Changes
Whether or not a specific volume plugin supports volume expansion is validated in the PV resize admission plugin. In general, we can list FlexVolume among the plugins that support volume expansion and leave the actual expansion capability check to the underlying volume driver when the PV resize controller calls the `ExpandVolumeDevice` method of FlexVolume.
In the PV resize admission plugin, add the following check to the `checkVolumePlugin` method:
```
// checkVolumePlugin checks whether the volume plugin supports resize
func (pvcr *persistentVolumeClaimResize) checkVolumePlugin(pv *api.PersistentVolume) bool {
	...
	if pv.Spec.FlexVolume != nil {
		return true
	}
	...
}
```
### FlexVolume Plugin Changes
FlexVolume relies on the underlying volume driver to implement the various volume functions, e.g. attach/detach. As a result, the volume driver decides whether a volume can be expanded or not.
By default, we assume all FlexVolume drivers support resizing. If a driver does not, the FlexVolume plugin can detect this during the resizing call to the driver and return an error to stop the resizing process. As a result, to implement the resizing feature, the FlexVolume plugin itself must implement the following methods of the `ExpandableVolumePlugin` interface:
#### ExpandVolumeDevice
The volume resizing controller invokes this method when it receives a valid PVC resizing request. The FlexVolume plugin calls the underlying volume driver's corresponding `expandvolume` method with three parameters (the new size of the volume in bytes, the old size of the volume in bytes, and the volume spec) to expand the PV. Once the expansion is done, the volume driver should return the new size (in bytes) of the volume to FlexVolume.
A sample implementation of the `ExpandVolumeDevice` method looks like:
```
func (plugin *flexVolumePlugin) ExpandVolumeDevice(spec *volume.Spec, newSize resource.Quantity, oldSize resource.Quantity) (resource.Quantity, error) {
	const timeout = 10 * time.Minute
	call := plugin.NewDriverCallWithTimeout(expandVolumeCmd, timeout)
	call.Append(newSize.Value())
	call.Append(oldSize.Value())
	call.AppendSpec(spec, plugin.host, nil)

	// If the volume driver does not support resizing, the FlexVolume plugin can return an error here
	// to stop the expand controller's resizing process.
	ds, err := call.Run()
	if err != nil {
		return *resource.NewQuantity(0, resource.BinarySI), err
	}
	return *resource.NewQuantity(ds.ActualVolumeSize, resource.BinarySI), nil
}
```
Add a new field in type `DriverStatus`, named `ActualVolumeSize`, to identify the new expanded size of the volume returned by the underlying volume driver:
```
// DriverStatus represents the return value of the driver callout.
type DriverStatus struct {
	...
	ActualVolumeSize int64 `json:"volumeNewSize,omitempty"`
}
```
#### RequiresFSResize
`RequiresFSResize` is a method of the `ExpandableVolumePlugin` interface. Its return value indicates whether a file system resize is required once the physical volume has been expanded. If the return value is `false`, the PV resize controller considers the volume resize operation done and updates the PV object's capacity in K8s directly; if the return value is `true`, the PV resize controller leaves the file system resize to the kubelet, and the kubelet on the worker node calls the `ExpandFS` method of FlexVolume to finish the file system resize step (at present, only offline FS resize is supported; online resize support is under community discussion [here](https://github.com/kubernetes/community/pull/1535)).
The return value of `RequiresFSResize` is collected from the underlying volume driver when FlexVolume invokes the `init` method of the volume driver. The sample code for `RequiresFSResize` in FlexVolume looks like:
```
func (plugin *flexVolumePlugin) RequiresFSResize() bool {
	return plugin.capabilities.RequiresFSResize
}
```
As a result, the FlexVolume type `DriverCapabilities` can be redefined as:
```
type DriverCapabilities struct {
	Attach           bool `json:"attach"`
	RequiresFSResize bool `json:"requiresFSResize"`
	SELinuxRelabel   bool `json:"selinuxRelabel"`
}

func defaultCapabilities() *DriverCapabilities {
	return &DriverCapabilities{
		Attach:           true,
		RequiresFSResize: true, // By default, we require file system resize which will be done by kubelet
		SELinuxRelabel:   true,
	}
}
```
#### ExpandFS
`ExpandFS` is another method of the `ExpandableVolumePlugin` interface. This method allows the volume plugin itself, instead of the kubelet, to resize the file system. If the volume plugin returns `true` for `RequiresFSResize`, the PV resize controller will leave the FS resize to the kubelet on the worker node. The kubelet will then call FlexVolume's `ExpandFS` to resize the file system once the physical volume expansion is done.
As `ExpandFS` is called on the worker node, the volume driver can also take this chance to do the physical volume resize together with the file system resize. Also, the current code only supports offline FS resize; online resize support is under discussion [here](https://github.com/kubernetes/community/pull/1535). Once online resize is implemented, we can also leverage it for FlexVolume through the `ExpandFS` method.
Note that `ExpandFS` is a new API for `ExpandableVolumePlugin`; the community ticket can be found [here](https://github.com/kubernetes/kubernetes/issues/58786).
`ExpandFS` will call the underlying volume driver's `expandfs` method to finish the FS resize. The sample code looks like:
```
func (plugin *flexVolumePlugin) ExpandFS(spec *volume.Spec, newSize resource.Quantity, oldSize resource.Quantity) error {
	const timeout = 10 * time.Minute
	call := plugin.NewDriverCallWithTimeout(expandFSCmd, timeout)
	call.Append(newSize.Value())
	call.Append(oldSize.Value())
	call.AppendSpec(spec, plugin.host, nil)
	_, err := call.Run()
	return err
}
```
For more design details on how the kubelet resizes the volume file system, please refer to the volume resizing proposal at:
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/grow-volume-size.md
Based on the above design, the resizing process for FlexVolume can be summarized as:
* If the FlexVolume driver does not support resizing, it shall not implement the `expandvolume` method, and the FlexVolume plugin will return an error to stop the expand controller's resizing process.
* If the FlexVolume driver supports resizing, it shall implement the `expandvolume` method, and the driver shall be installed at least on the master node.
* If the FlexVolume driver supports resizing and does not need file system resizing, it shall set the "requiresFSResize" capability to `false`. Otherwise the kubelet on the worker node will call `ExpandFS` to resize the file system.
* If the FlexVolume driver supports resizing and requires file system resizing (`RequiresFSResize` returns `true`), `ExpandFS` will be called from the kubelet on the worker node after the physical volume resizing is done.
* If the FlexVolume driver supports resizing and needs the physical volume to be resized from the worker node, the driver shall be installed on both the master node and the worker nodes. The driver on the master node can implement `ExpandVolumeDevice` as a no-op and return a success message, and for `RequiresFSResize` the driver on the master node must return `true`. This gives the drivers on the worker nodes a chance to perform the physical volume resize and the file system resize together through the `ExpandFS` call from the kubelet. This scenario is useful for some local storage resizing cases.
### Volume Driver Changes
The volume driver needs to implement two new calls, `expandvolume` and `expandfs`, to support volume resizing.
`expandvolume` takes three parameters: the new size of the volume (in bytes), the old size of the volume (in bytes), and the volume spec JSON string. `expandvolume` expands the physical backend volume and returns the new size (in bytes) of the volume.
For volume drivers that need a file system resize after the physical volume is expanded, the `expandfs` method takes over the FS resize work. If the volume driver sets the `requiresFSResize` capability to `true`, this method will be called from the kubelet on the worker node. The volume driver can do the file system resize (or the physical volume resize together with the file system resize) inside this method.
In addition, volume drivers that support resizing but do not require file system resizing shall set the `requiresFSResize` capability to `false`:
```
if [ "$op" = "init" ]; then
log '{"status": "Success", "capabilities": {“requiresFSResize”: false}}'
exit 0
fi
```
### UI
Expanding a FlexVolume follows the same process as expanding other volume plugins, such as GlusterFS. The user creates and binds the PVC and PV first. Then, using the `kubectl edit pvc xxx` command, the user can update the PVC with the new size.
## References
* [Proposal for Growing Persistent Volume Size](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/grow-volume-size.md)
* [PR for Volume Resizing Controller](https://github.com/kubernetes/kubernetes/commit/cd2a68473a5a5966fa79f455415cb3269a3f7462)
* [Online FS resize support](https://github.com/kubernetes/community/pull/1535)
* [Add “ExpandFS” method to “ExpandableVolumePlugin” interface](https://github.com/kubernetes/kubernetes/issues/58786)

View File

@ -198,7 +198,7 @@ we have considered following options:
Cons:
* I don't know if there is a pattern that exists in kube today for shipping shell scripts that are called out from code in Kubernetes. Flex is
different because, none of the flex scripts are shipped with Kuberntes.
different because, none of the flex scripts are shipped with Kubernetes.
3. Ship resizing tools in a container.

View File

@ -0,0 +1,128 @@
# RBD Volume to PV Mapping
Authors: krmayankk@
### Problem
The RBD dynamic provisioner currently generates RBD volume names that are random.
The current implementation generates a UUID, and the RBD image name becomes
`image := fmt.Sprintf("kubernetes-dynamic-pvc-%s", uuid.NewUUID())`. This RBD image
name is stored in the PV. The PV also has a reference to the PVC to which it binds.
The problem with this approach is that if there is a catastrophic etcd data loss
and all PVs are gone, there is no way to recover the mapping from RBD to PVC. The
RBD volumes for the customer still exist, but we have no way to tell which RBD
volumes belong to which customer.
## Goal
We want to store some information about the PVC in the RBD image name/metadata, so that
in catastrophic situations we can derive the PVC name from the RBD image name/metadata
and give the customer the following options:
- Back up RBD volume data for specific customers and hand them their copy before deleting
  the RBD volume. Without knowing from the RBD image name/metadata which customers they
  belong to, we cannot hand those customers their data.
- Create a PV with the given RBD name and pre-bind it to the desired PVC so that the customer
  can get their data back.
## Non Goals
This proposal doesn't attempt to undermine the importance of etcd backups for restoring
data in catastrophic situations. This is one additional line of defense in case our
backups are not working.
## Motivation
We recently had an etcd data loss which resulted in the loss of this RBD-to-PV mapping,
and there was no way to restore customer data. This proposal aims to store the PVC name
as metadata in the RBD image so that in catastrophic scenarios the mapping can be
restored just by looking at the RBDs.
## Current Implementation
```go
func (r *rbdVolumeProvisioner) Provision() (*v1.PersistentVolume, error) {
	...
	// create random image name
	image := fmt.Sprintf("kubernetes-dynamic-pvc-%s", uuid.NewUUID())
	r.rbdMounter.Image = image
```
## Finalized Proposal
Use the `rbd image-meta set` command to store additional metadata in the RBD image about the PVC which owns
the RBD image.
`rbd image-meta set --pool hdd kubernetes-dynamic-pvc-fabd715f-0d24-11e8-91fa-1418774b3e9d pvcname <pvcname>`
`rbd image-meta set --pool hdd kubernetes-dynamic-pvc-fabd715f-0d24-11e8-91fa-1418774b3e9d pvcnamespace <pvcnamespace>`
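For illustration only, here is a hedged sketch of how the provisioner could run the two commands above right after creating the image; the helper name and the direct use of `os/exec` are assumptions for this sketch, not the actual RBD plugin code, which may wire this differently:
```go
import (
	"fmt"
	"os/exec"
)

// setPVCMetadata is a hypothetical helper that records the owning PVC on the
// RBD image using `rbd image-meta set`, as described above.
func setPVCMetadata(pool, image, pvcNamespace, pvcName string) error {
	meta := map[string]string{
		"pvcname":      pvcName,
		"pvcnamespace": pvcNamespace,
	}
	for key, value := range meta {
		out, err := exec.Command("rbd", "image-meta", "set", "--pool", pool, image, key, value).CombinedOutput()
		if err != nil {
			return fmt.Errorf("rbd image-meta set %s failed: %v, output: %s", key, err, string(out))
		}
	}
	return nil
}
```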
### Pros
- Simple to implement
- Does not cause a regression in RBD image names, which remain the same as before.
- The metadata information is not immediately visible to RBD admins.
### Cons
- NA
Since this proposal does not change the RBD image name and is able to store additional metadata about
the PVC to which it belongs, it is preferred over the other two proposals. It also does a better job
of hiding the PVC name in the metadata rather than making it obvious in the RBD image name. The
metadata can only be seen by admins with appropriate permissions to run the `rbd image-meta` command. In
addition, this proposal doesn't impose any limitations on the length of metadata that can be stored
and hence can accommodate any PVC names and namespaces, which are stored as arbitrary key-value pairs.
It also leaves room for storing any other metadata about the PVC.
### Upgrade/Downgrade Behavior
#### Upgrading from a K8s version without this metadata to a version with this metadata
The metadata for an image is populated on CreateImage. After an upgrade, existing RBD images will not have that
metadata set. When the next AttachDisk happens, we can check whether the metadata is set and, if not, set it. Cluster
administrators could also run a one-time script to set this manually. For all newly created RBD images,
the RBD image metadata will be set properly.
#### Downgrade from a K8s version with this metadata to a version without this metadata
After a downgrade, all existing RBD images will have the metadata set. New RBD images created after the
downgrade will not have this metadata.
## Proposal 1
Make the RBD image name the base64-encoded PVC name (namespace + name):
```go
import b64 "encoding/base64"
...
func (r *rbdVolumeProvisioner) Provision() (*v1.PersistentVolume, error) {
	...
	// Create a base64 encoding of the PVC name and namespace
	rbdImageName := b64.StdEncoding.EncodeToString([]byte(r.options.PVC.Name + "/" + r.options.PVC.Namespace))
	// Append the base64 encoding to the string `kubernetes-dynamic-pvc-`
	rbdImageName = fmt.Sprintf("kubernetes-dynamic-pvc-%s", rbdImageName)
	r.rbdMounter.Image = rbdImageName
```
### Pros
- Simple scheme which encodes the fully qualified PVC name in the RBD image name
### Cons
- Causes regression since RBD image names will change from one version of K8s to another.
- Some older versions of librbd/krbd start having issues with names longer than 95 characters.
## Proposal 2
Make the RBD Image name as the stringified PVC namespace plus PVC name.
### Pros
- Simple to implement.
### Cons
- Causes regression since RBD image names will change from one version of K8s to another.
- This exposes the customer name directly to Ceph admins; in Proposal 1 it was at least hidden behind the base64 encoding.
## Misc
- Document how pre-binding of a PV to a PVC works in dynamic provisioning
- Document/test whether there are other issues with restoring PVC/PV after an
  etcd backup is restored

View File

@ -0,0 +1,148 @@
# Service Account Token Volumes
Authors:
@smarterclayton
@liggitt
@mikedanese
## Summary
Kubernetes is able to provide pods with unique identity tokens that can prove
to a Kubernetes API server that the caller is a particular pod. These tokens are
injected into pods as secrets. This proposal describes a new mechanism of
distribution with support for [improved service account tokens][better-tokens]
and explores how to migrate from the existing mechanism backwards-compatibly.
## Motivation
Many workloads running on Kubernetes need to prove to external parties who they
are in order to participate in a larger application environment. This identity
must be attested to by the orchestration system in a way that allows a third
party to trust that an arbitrary container on the cluster is who it says it is.
In addition, infrastructure running on top of Kubernetes needs a simple
mechanism to communicate with the Kubernetes APIs and to provide more complex
tooling. Finally, a significant set of security challenges is associated with
storing service account tokens as secrets in Kubernetes, and limiting the methods
whereby malicious parties can get access to these tokens will reduce the risk of
platform compromise.
As a platform, Kubernetes should evolve to allow identity management systems to
provide more powerful workload identity without breaking existing use cases, and
provide a simple out of the box workload identity that is sufficient to cover
the requirements of bootstrapping low-level infrastructure running on
Kubernetes. We expect other systems to cover the more advanced scenarios,
and see this effort as necessary glue to allow more powerful systems to succeed.
With this feature, we hope to provide a backwards compatible replacement for
service account tokens that strengthens the security and improves the
scalability of the platform.
## Proposal
Kubernetes should implement a ServiceAccountToken volume projection that
maintains a service account token requested by the node from the TokenRequest
API.
### Token Volume Projection
A new volume projection will be implemented with an API that closely matches the
TokenRequest API.
```go
type ProjectedVolumeSource struct {
	Sources     []VolumeProjection
	DefaultMode *int32
}

type VolumeProjection struct {
	Secret              *SecretProjection
	DownwardAPI         *DownwardAPIProjection
	ConfigMap           *ConfigMapProjection
	ServiceAccountToken *ServiceAccountTokenProjection
}

// ServiceAccountTokenProjection represents a projected service account token
// volume. This projection can be used to insert a service account token into
// the pod's runtime filesystem for use against APIs (Kubernetes API Server or
// otherwise).
type ServiceAccountTokenProjection struct {
	// Audience is the intended audience of the token. A recipient of a token
	// must identify itself with an identifier specified in the audience of the
	// token, and otherwise should reject the token. The audience defaults to the
	// identifier of the apiserver.
	Audience string
	// ExpirationSeconds is the requested duration of validity of the service
	// account token. As the token approaches expiration, the kubelet volume
	// plugin will proactively rotate the service account token. The kubelet will
	// start trying to rotate the token if the token is older than 80 percent of
	// its time to live or if the token is older than 24 hours. Defaults to 1 hour
	// and must be at least 10 minutes.
	ExpirationSeconds int64
	// Path is the relative path of the file to project the token into.
	Path string
}
```
A volume plugin implemented in the kubelet will project a service account token
sourced from the TokenRequest API into volumes created from
ProjectedVolumeSources. As the token approaches expiration, the kubelet volume
plugin will proactively rotate the service account token. The kubelet will start
trying to rotate the token if the token is older than 80 percent of its time to
live or if the token is older than 24 hours.
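As a minimal sketch of that rotation rule, with an assumed helper name (this is illustrative, not the actual kubelet code):
```go
import "time"

// needsRotation reports whether a projected token should be refreshed:
// when more than 80% of its time-to-live has elapsed, or when it is older
// than 24 hours, whichever comes first.
func needsRotation(issuedAt time.Time, ttl time.Duration, now time.Time) bool {
	age := now.Sub(issuedAt)
	return age > time.Duration(float64(ttl)*0.8) || age > 24*time.Hour
}
```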
To replace the current service account token secrets, we also need to inject the
cluster's CA certificate bundle. Initially we will deploy the data in a ConfigMap
per namespace and reference it using a ConfigMapProjection.
A projected volume source that is equivalent to the current service account
secret:
```yaml
sources:
- serviceAccountToken:
    expirationSeconds: 3153600000 # 100 years
    path: token
- configMap:
    name: kube-cacrt
    items:
    - key: ca.crt
      path: ca.crt
- downwardAPI:
    items:
    - path: namespace
      fieldRef: metadata.namespace
```
This fixes one scalability issue with the current service account token
deployment model where secret GETs are a large portion of overall apiserver
traffic.
A projected volume source that requests a token for vault and Istio CA:
```yaml
sources:
- serviceAccountToken:
    path: vault-token
    audience: vault
- serviceAccountToken:
    path: istio-token
    audience: ca.istio.io
```
### Alternatives
1. Instead of implementing a service account token volume projection, we could
implement all injection as a flex volume or CSI plugin.
1. Both flex volume and CSI are alpha and are unlikely to graduate soon.
1. Virtual kubelets (like Fargate or ACS) may not be able to run flex
volumes.
1. Service account tokens are a fundamental part of our API.
1. Remove service accounts and service account tokens completely from core, use
an alternate mechanism that sits outside the platform.
1. Other core features need service account integration, leading to all
users needing to install this extension.
1. Complicates installation for the majority of users.
[better-tokens]: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/auth/bound-service-account-tokens.md

View File

@ -86,7 +86,7 @@ We propose that:
### Controller workflow for provisioning volumes
0. Kubernetes administator can configure name of a default StorageClass. This
0. Kubernetes administrator can configure name of a default StorageClass. This
StorageClass instance is then used when user requests a dynamically
provisioned volume, but does not specify a StorageClass. In other words,
`claim.Spec.Class == ""`

View File

@ -196,7 +196,7 @@ Open questions:
* Do we call them snapshots or backups?
* From the SIG email: "The snapshot should not be suggested to be a backup in any documentation, because in practice is is necessary, but not sufficient, when conducting a backup of a stateful application."
* From the SIG email: "The snapshot should not be suggested to be a backup in any documentation, because in practice is necessary, but not sufficient, when conducting a backup of a stateful application."
* At what minimum granularity should snapshots be allowed?

Binary image file changed (not shown; 22 KiB before, 40 KiB after).

View File

@ -1,14 +1,16 @@
reviewers:
- grodrigues3
- Phillels
- idvoretskyi
- calebamiles
- cblecker
- grodrigues3
- idvoretskyi
- Phillels
- spiffxp
approvers:
- grodrigues3
- Phillels
- idvoretskyi
- calebamiles
- cblecker
- grodrigues3
- idvoretskyi
- lavalamp
- Phillels
- spiffxp
- thockin

View File

@ -11,7 +11,7 @@ Guide](http://kubernetes.io/docs/admin/).
* **Contributor Guide**
([Please start here](/contributors/guide/README.md)) to learn about how to contribute to Kubernetes
* **GitHub Issues** ([issues.md](issues.md)): How incoming issues are triaged.
* **GitHub Issues** ([/contributors/guide/issue-triage.md](/contributors/guide/issue-triage.md)): How incoming issues are triaged.
* **Pull Request Process** ([/contributors/guide/pull-requests.md](/contributors/guide/pull-requests.md)): When and why pull requests are closed.
@ -26,6 +26,9 @@ Guide](http://kubernetes.io/docs/admin/).
* **Testing** ([testing.md](testing.md)): How to run unit, integration, and end-to-end tests in your development sandbox.
* **Conformance Testing** ([conformance-tests.md](conformance-tests.md))
What is conformance testing and how to create/manage them.
* **Hunting flaky tests** ([flaky-tests.md](flaky-tests.md)): We have a goal of 99.9% flake free tests.
Here's how to run your tests many times.

View File

@ -306,34 +306,57 @@ response reduces the complexity of these clients.
##### Typical status properties
**Conditions** represent the latest available observations of an object's
current state. Objects may report multiple conditions, and new types of
conditions may be added in the future. Therefore, conditions are represented
using a list/slice, where all have similar structure.
state. They are an extension mechanism intended to be used when the details of
an observation are not a priori known or would not apply to all instances of a
given Kind. For observations that are well known and apply to all instances, a
regular field is preferred. An example of a Condition that probably should
have been a regular field is Pod's "Ready" condition - it is managed by core
controllers, it is well understood, and it applies to all Pods.
Objects may report multiple conditions, and new types of conditions may be
added in the future or by 3rd party controllers. Therefore, conditions are
represented using a list/slice, where all have similar structure.
The `FooCondition` type for some resource type `Foo` may include a subset of the
following fields, but must contain at least `type` and `status` fields:
```go
Type FooConditionType `json:"type" description:"type of Foo condition"`
Status ConditionStatus `json:"status" description:"status of the condition, one of True, False, Unknown"`
Type FooConditionType `json:"type" description:"type of Foo condition"`
Status ConditionStatus `json:"status" description:"status of the condition, one of True, False, Unknown"`
// +optional
LastHeartbeatTime unversioned.Time `json:"lastHeartbeatTime,omitempty" description:"last time we got an update on a given condition"`
Reason *string `json:"reason,omitempty" description:"one-word CamelCase reason for the condition's last transition"`
// +optional
LastTransitionTime unversioned.Time `json:"lastTransitionTime,omitempty" description:"last time the condition transit from one status to another"`
Message *string `json:"message,omitempty" description:"human-readable message indicating details about last transition"`
// +optional
Reason string `json:"reason,omitempty" description:"one-word CamelCase reason for the condition's last transition"`
LastHeartbeatTime *unversioned.Time `json:"lastHeartbeatTime,omitempty" description:"last time we got an update on a given condition"`
// +optional
Message string `json:"message,omitempty" description:"human-readable message indicating details about last transition"`
LastTransitionTime *unversioned.Time `json:"lastTransitionTime,omitempty" description:"last time the condition transit from one status to another"`
```
Additional fields may be added in the future.
Do not use fields that you don't need - simpler is better.
Use of the `Reason` field is encouraged.
Use the `LastHeartbeatTime` with great caution - frequent changes to this field
can cause a large fan-out effect for some resources.
Conditions should be added to explicitly convey properties that users and
components care about rather than requiring those properties to be inferred from
other observations.
other observations. Once defined, the meaning of a Condition can not be
changed arbitrarily - it becomes part of the API, and has the same backwards-
and forwards-compatibility concerns of any other part of the API.
Condition status values may be `True`, `False`, or `Unknown`. The absence of a
condition should be interpreted the same as `Unknown`.
condition should be interpreted the same as `Unknown`. How controllers handle
`Unknown` depends on the Condition in question.
Condition types should indicate state in the "abnormal-true" polarity. For
example, if the condition indicates when a policy is invalid, the "is valid"
case is probably the norm, so the condition should be called "Invalid".
In general, condition values may change back and forth, but some condition
transitions may be monotonic, depending on the resource and condition type.
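For illustration, here is a hypothetical helper (not part of the conventions themselves) showing the list/slice shape described above, assuming the pointer-typed fields from the updated listing; it bumps `LastTransitionTime` only when `Status` actually changes:
```go
// setCondition updates the condition with a matching Type, or appends it.
func setCondition(conds []FooCondition, newCond FooCondition) []FooCondition {
	now := unversioned.Now()
	for i := range conds {
		if conds[i].Type != newCond.Type {
			continue
		}
		if conds[i].Status != newCond.Status {
			newCond.LastTransitionTime = &now
		} else {
			newCond.LastTransitionTime = conds[i].LastTransitionTime
		}
		conds[i] = newCond
		return conds
	}
	newCond.LastTransitionTime = &now
	return append(conds, newCond)
}
```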
@ -742,7 +765,9 @@ APIs may return alternative representations of any resource in response to an
Accept header or under alternative endpoints, but the default serialization for
input and output of API responses MUST be JSON.
Protobuf serialization of API objects are currently **EXPERIMENTAL** and will change without notice.
A protobuf encoding is also accepted for built-in resources. As proto is not
self-describing, there is an envelope wrapper which describes the type of
the contents.
All dates should be serialized as RFC3339 strings.
@ -753,6 +778,9 @@ must be specified as part of the value (e.g., `resource.Quantity`). Which
approach is preferred is TBD, though currently we use the `fooSeconds`
convention for durations.
Duration fields must be represented as integer fields with units being
part of the field name (e.g. `leaseDurationSeconds`). We don't use Duration
in the API since that would require clients to implement go-compatible parsing.
## Selecting Fields
@ -1147,6 +1175,13 @@ be ambiguous and they are not specified by the value or value type.
* The name of a field expressing a boolean property called 'fooable' should be
called `Fooable`, not `IsFooable`.
### Namespace Names
* The name of a namespace must be a
[DNS_LABEL](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/architecture/identifiers.md).
* The `kube-` prefix is reserved for Kubernetes system namespaces, e.g. `kube-system` and `kube-public`.
* See
[the namespace docs](https://kubernetes.io/docs/user-guide/namespaces/) for more information.
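As an illustrative check (not from the conventions text itself), a DNS_LABEL namespace name can be validated roughly as follows:
```go
import "regexp"

// dnsLabel matches RFC 1123 labels: lowercase alphanumerics and '-',
// starting and ending with an alphanumeric, at most 63 characters.
var dnsLabel = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

func isValidNamespaceName(name string) bool {
	return len(name) <= 63 && dnsLabel.MatchString(name)
}
```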
## Label, selector, and annotation conventions
Labels are the domain of users. They are intended to facilitate organization and

View File

@ -95,36 +95,49 @@ backward-compatibly.
Before talking about how to make API changes, it is worthwhile to clarify what
we mean by API compatibility. Kubernetes considers forwards and backwards
compatibility of its APIs a top priority.
compatibility of its APIs a top priority. Compatibility is *hard*, especially
handling issues around rollback-safety. This is something every API change
must consider.
An API change is considered forward and backward-compatible if it:
An API change is considered compatible if it:
* adds new functionality that is not required for correct behavior (e.g.,
does not add a new required field)
* does not change existing semantics, including:
* default values *and behavior*
* the semantic meaning of default values *and behavior*
* interpretation of existing API types, fields, and values
* which fields are required and which are not
* mutable fields do not become immutable
* valid values do not become invalid
* explicitly invalid values do not become valid
Put another way:
1. Any API call (e.g. a structure POSTed to a REST endpoint) that worked before
your change must work the same after your change.
2. Any API call that uses your change must not cause problems (e.g. crash or
degrade behavior) when issued against servers that do not include your change.
3. It must be possible to round-trip your change (convert to different API
1. Any API call (e.g. a structure POSTed to a REST endpoint) that succeeded
before your change must succeed after your change.
2. Any API call that does not use your change must behave the same as it did
before your change.
3. Any API call that uses your change must not cause problems (e.g. crash or
degrade behavior) when issued against API servers that do not include your
change.
4. It must be possible to round-trip your change (convert to different API
versions and back) with no loss of information.
4. Existing clients need not be aware of your change in order for them to
continue to function as they did previously, even when your change is utilized.
5. Existing clients need not be aware of your change in order for them to
continue to function as they did previously, even when your change is in use.
6. It must be possible to rollback to a previous version of API server that
does not include your change and have no impact on API objects which do not use
your change. API objects that use your change will be impacted in case of a
rollback.
If your change does not meet these criteria, it is not considered strictly
compatible, and may break older clients, or result in newer clients causing
undefined behavior.
If your change does not meet these criteria, it is not considered compatible,
and may break older clients, or result in newer clients causing undefined
behavior. Such changes are generally disallowed, though exceptions have been
made in extreme cases (e.g. security or obvious bugs).
Let's consider some examples. In a hypothetical API (assume we're at version
v6), the `Frobber` struct looks something like this:
Let's consider some examples.
In a hypothetical API (assume we're at version v6), the `Frobber` struct looks
something like this:
```go
// API v6.
@ -134,7 +147,7 @@ type Frobber struct {
}
```
You want to add a new `Width` field. It is generally safe to add new fields
You want to add a new `Width` field. It is generally allowed to add new fields
without changing the API version, so you can simply change it to:
```go
@ -146,29 +159,55 @@ type Frobber struct {
}
```
The onus is on you to define a sane default value for `Width` such that rule #1
above is true - API calls and stored objects that used to work must continue to
work.
The onus is on you to define a sane default value for `Width` such that rules
#1 and #2 above are true - API calls and stored objects that used to work must
continue to work.
For your next change you want to allow multiple `Param` values. You can not
simply change `Param string` to `Params []string` (without creating a whole new
API version) - that fails rules #1 and #2. You can instead do something like:
simply remove `Param string` and add `Params []string` (without creating a
whole new API version) - that fails rules #1, #2, #3, and #6. Nor can you
simply add `Params []string` and use it instead - that fails #2 and #6.
You must instead define a new field and the relationship between that field and
the existing field(s). Start by adding the new plural field:
```go
// Still API v6, but kind of clumsy.
// Still API v6.
type Frobber struct {
Height int `json:"height"`
Width int `json:"width"`
Param string `json:"param"` // the first param
ExtraParams []string `json:"extraParams"` // additional params
Params []string `json:"params"` // all of the params
}
```
Now you can satisfy the rules: API calls that provide the old style `Param`
will still work, while servers that don't understand `ExtraParams` can ignore
it. This is somewhat unsatisfying as an API, but it is strictly compatible.
This new field must be inclusive of the singular field. In order to satisfy
the compatibility rules you must handle all the cases of version skew, multiple
clients, and rollbacks. This can be handled by defaulting or admission control
logic linking the fields together with context from the API operation to get as
close as possible to the user's intentions.
Part of the reason for versioning APIs and for using internal structs that are
Upon any mutating API operation:
* If only the singular field is specified (e.g. an older client), API logic
must populate plural[0] from the singular value, and de-dup the plural
field.
* If only the plural field is specified (e.g. a newer client), API logic must
populate the singular value from plural[0].
* If both the singular and plural fields are specified, API logic must
validate that the singular value matches plural[0].
* Any other case is an error and must be rejected.
For this purpose "is specified" means the following:
* On a create or patch operation: the field is present in the user-provided input
* On an update operation: the field is present and has changed from the
current value
Older clients that only know the singular field will continue to succeed and
produce the same results as before the change. Newer clients can use your
change without impacting older clients. The API server can be rolled back and
only objects that use your change will be impacted.
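As an entirely illustrative sketch of the defaulting logic described above for the `Param`/`Params` example, treating an empty value as "not specified" for simplicity:
```go
import "fmt"

// defaultFrobberParams links the singular and plural fields on a mutating
// operation, per the rules above; this is a sketch, not real API machinery.
func defaultFrobberParams(f *Frobber) error {
	switch {
	case f.Param != "" && len(f.Params) == 0:
		// Older client: populate the plural field from the singular value.
		f.Params = []string{f.Param}
	case f.Param == "" && len(f.Params) > 0:
		// Newer client: populate the singular value from plural[0].
		f.Param = f.Params[0]
	case f.Param != "" && len(f.Params) > 0 && f.Param != f.Params[0]:
		// Both specified: they must agree.
		return fmt.Errorf("param %q must match params[0] %q", f.Param, f.Params[0])
	}
	return nil
}
```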
Part of the reason for versioning APIs and for using internal types that are
distinct from any one version is to handle growth like this. The internal
representation can be implemented as:
@ -181,24 +220,26 @@ type Frobber struct {
}
```
The code that converts to/from versioned APIs can decode this into the somewhat
uglier (but compatible!) structures. Eventually, a new API version, let's call
it v7beta1, will be forked and it can use the clean internal structure.
The code that converts to/from versioned APIs can decode this into the
compatible structure. Eventually, a new API version, e.g. v7beta1,
will be forked and it can drop the singular field entirely.
We've seen how to satisfy rules #1 and #2. Rule #3 means that you can not
We've seen how to satisfy rules #1, #2, and #3. Rule #4 means that you can not
extend one versioned API without also extending the others. For example, an
API call might POST an object in API v7beta1 format, which uses the cleaner
`Params` field, but the API server might store that object in trusty old v6
form (since v7beta1 is "beta"). When the user reads the object back in the
v7beta1 API it would be unacceptable to have lost all but `Params[0]`. This
means that, even though it is ugly, a compatible change must be made to the v6
API.
API, as above.
However, this is very challenging to do correctly. It often requires multiple
For some changes, this can be challenging to do correctly. It may require multiple
representations of the same information in the same API resource, which need to
be kept in sync in the event that either is changed. For example, let's say you
decide to rename a field within the same API version. In this case, you add
units to `height` and `width`. You implement this by adding duplicate fields:
be kept in sync should either be changed.
For example, let's say you decide to rename a field within the same API
version. In this case, you add units to `height` and `width`. You implement
this by adding new fields:
```go
type Frobber struct {
@ -211,17 +252,17 @@ type Frobber struct {
You convert all of the fields to pointers in order to distinguish between unset
and set to 0, and then set each corresponding field from the other in the
defaulting pass (e.g., `heightInInches` from `height`, and vice versa), which
runs just prior to conversion. That works fine when the user creates a resource
from a hand-written configuration -- clients can write either field and read
either field, but what about creation or update from the output of GET, or
update via PATCH (see
[In-place updates](https://kubernetes.io/docs/user-guide/managing-deployments/#in-place-updates-of-resources))?
In this case, the two fields will conflict, because only one field would be
updated in the case of an old client that was only aware of the old field (e.g.,
`height`).
defaulting logic (e.g. `heightInInches` from `height`, and vice versa). That
works fine when the user sends a hand-written configuration --
clients can write either field and read either field.
Say the client creates:
But what about creation or update from the output of a GET, or update via PATCH
(see [In-place updates](https://kubernetes.io/docs/user-guide/managing-deployments/#in-place-updates-of-resources))?
In these cases, the two fields will conflict, because only one field would be
updated in the case of an old client that was only aware of the old field
(e.g. `height`).
Suppose the client creates:
```json
{
@ -252,17 +293,16 @@ then PUTs back:
}
```
The update should not fail, because it would have worked before `heightInInches`
was added.
As per the compatibility rules, the update must not fail, because it would have
worked before the change.
## Backward compatibility gotchas
* A single feature/property cannot be represented using multiple spec fields in the same API version
simultaneously, as the example above shows. Only one field can be populated in any resource at a time, and the client
needs to be able to specify which field they expect to use (typically via API version),
on both mutation and read. Old clients must continue to function properly while only manipulating
the old field. New clients must be able to function properly while only manipulating the new
field.
* A single feature/property cannot be represented using multiple spec fields
simultaneously within an API version. Only one representation can be
populated at a time, and the client needs to be able to specify which field
they expect to use (typically via API version), on both mutation and read. As
above, older clients must continue to function properly.
* A new representation, even in a new API version, that is more expressive than an
old one breaks backward compatibility, since clients that only understood the
@ -283,7 +323,7 @@ was added.
be set, it is acceptable to add a new option to the union if the [appropriate
conventions](api-conventions.md#objects) were followed in the original object.
Removing an option requires following the [deprecation process](https://kubernetes.io/docs/reference/deprecation-policy/).
* Changing any validation rules always has the potential of breaking some client, since it changes the
assumptions about part of the API, similar to adding new enum values. Validation rules on spec fields can
neither be relaxed nor strengthened. Strengthening cannot be permitted because any requests that previously
@ -291,23 +331,32 @@ was added.
of the API resource. Status fields whose writers are under our control (e.g., written by non-pluggable
controllers), may potentially tighten validation, since that would cause a subset of previously valid
values to be observable by clients.
* Do not add a new API version of an existing resource and make it the preferred version in the same
release, and do not make it the storage version. The latter is necessary so that a rollback of the
apiserver doesn't render resources in etcd undecodable after rollback.
* Any field with a default value in one API version must have *non-nil default
values* in all API versions. If a default value is added to a field in one API
version, and the field didn't have a default value in previous API versions,
it is required to add a default value semantically equivalent to an unset
value to the field in previous API versions, to preserve the semantic
meaning of the value being unset. This includes:
* a new optional field with a default value is introduced in a new API version
* an old optional field without a default value (i.e. can be nil) has a
default value in a new API version
## Incompatible API changes
There are times when this might be OK, but mostly we want changes that meet this
definition. If you think you need to break compatibility, you should talk to the
Kubernetes team first.
There are times when incompatible changes might be OK, but mostly we want
changes that meet the above definitions. If you think you need to break
compatibility, you should talk to the Kubernetes API reviewers first.
Breaking compatibility of a beta or stable API version, such as v1, is
unacceptable. Compatibility for experimental or alpha APIs is not strictly
required, but breaking compatibility should not be done lightly, as it disrupts
all users of the feature. Experimental APIs may be removed. Alpha and beta API
versions may be deprecated and eventually removed wholesale, as described in the
[versioning document](../design-proposals/release/versioning.md).
all users of the feature. Alpha and beta API versions may be deprecated and
eventually removed wholesale, as described in the [deprecation policy](https://kubernetes.io/docs/reference/deprecation-policy/).
If your change is going to be backward incompatible or might be a breaking
change for API consumers, please send an announcement to
@ -365,10 +414,20 @@ being required otherwise.
### Edit defaults.go
If your change includes new fields for which you will need default values, you
need to add cases to `pkg/apis/<group>/<version>/defaults.go` (the core v1 API
is special, its defaults.go is at `pkg/api/v1/defaults.go`. For simplicity, we
will not mention this special case in the rest of the article). Of course, since
you have added code, you have to add a test:
need to add cases to `pkg/apis/<group>/<version>/defaults.go`.
**Note:** When adding default values to new fields, you *must* also add default
values in all API versions, instead of leaving new fields unset (e.g. `nil`) in
old API versions. This is required because defaulting happens whenever a
serialized version is read (see [#66135]). When possible, pick meaningful values
as sentinels for unset values.
In the past the core v1 API
was special. Its `defaults.go` used to live at `pkg/api/v1/defaults.go`.
If you see code referencing that path, you can be sure it's outdated. Now the core v1 API lives at
`pkg/apis/core/v1/defaults.go`, which follows the above convention.
Of course, since you have added code, you have to add a test:
`pkg/apis/<group>/<version>/defaults_test.go`.
Do use pointers to scalars when you need to distinguish between an unset value
@ -379,6 +438,8 @@ pick a default.
Don't forget to run the tests!
[#66135]: https://github.com/kubernetes/kubernetes/issues/66135
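For example, a hypothetical defaulting function for the `Width` field used in the examples later in this document might look like the following; the function name follows the usual `SetDefaults_` pattern and the chosen sentinel value is only an assumption for this sketch:
```go
// SetDefaults_Frobber applies a non-nil default to the new optional field, as
// required above; the value chosen stands in for "unset".
func SetDefaults_Frobber(obj *Frobber) {
	if obj.Width == nil {
		width := int32(10) // assumed sentinel meaning "unspecified"
		obj.Width = &width
	}
}
```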
### Edit conversion.go
Given that you have not yet changed the internal structs, this might feel
@ -601,7 +662,6 @@ Due to the fast changing nature of the project, the following content is probabl
to generate protobuf IDL and marshallers.
* You must add the new version to
[cmd/kube-apiserver/app#apiVersionPriorities](https://github.com/kubernetes/kubernetes/blob/v1.8.0-alpha.2/cmd/kube-apiserver/app/aggregator.go#L172)
to let the aggregator list it. This list will be removed before release 1.8.
* You must setup storage for the new version in
[pkg/registry/group_name/rest](https://github.com/kubernetes/kubernetes/blob/v1.8.0-alpha.2/pkg/registry/authentication/rest/storage_authentication.go)
@ -788,9 +848,9 @@ For example, consider the following object:
// API v6.
type Frobber struct {
// height ...
Height *int32 `json:"height" protobuf:"varint,1,opt,name=height"`
Height *int32 `json:"height"`
// param ...
Param string `json:"param" protobuf:"bytes,2,opt,name=param"`
Param string `json:"param"`
}
```
@ -800,11 +860,11 @@ A developer is considering adding a new `Width` parameter, like this:
// API v6.
type Frobber struct {
// height ...
Height *int32 `json:"height" protobuf:"varint,1,opt,name=height"`
Height *int32 `json:"height"`
// param ...
Param string `json:"param" protobuf:"bytes,2,opt,name=param"`
Param string `json:"param"`
// width ...
Width *int32 `json:"width,omitempty" protobuf:"varint,3,opt,name=width"`
Width *int32 `json:"width,omitempty"`
}
```
@ -858,13 +918,13 @@ The preferred approach adds an alpha field to the existing object, and ensures i
// API v6.
type Frobber struct {
// height ...
Height int32 `json:"height" protobuf:"varint,1,opt,name=height"`
Height int32 `json:"height"`
// param ...
Param string `json:"param" protobuf:"bytes,2,opt,name=param"`
Param string `json:"param"`
// width indicates how wide the object is.
// This field is alpha-level and is only honored by servers that enable the Frobber2D feature.
// +optional
Width *int32 `json:"width,omitempty" protobuf:"varint,3,opt,name=width"`
Width *int32 `json:"width,omitempty"`
}
```
@ -931,15 +991,15 @@ In future Kubernetes versions:
Another option is to introduce a new type with a new `alpha` or `beta` version
designator, like this:
```
```go
// API v7alpha1
type Frobber struct {
// height ...
Height *int32 `json:"height" protobuf:"varint,1,opt,name=height"`
Height *int32 `json:"height"`
// param ...
Param string `json:"param" protobuf:"bytes,2,opt,name=param"`
Param string `json:"param"`
// width ...
Width *int32 `json:"width,omitempty" protobuf:"varint,3,opt,name=width"`
Width *int32 `json:"width,omitempty"`
}
```

Some files were not shown because too many files have changed in this diff.