Incorrect punctuation types
Signed-off-by: yanan Lee <energylyn@zju.edu.cn>

Single quotes incorrect
Signed-off-by: yanan Lee <energylyn@zju.edu.cn>

Incorrect
Signed-off-by: yanan Lee <energylyn@zju.edu.cn>

delete
Signed-off-by: yanan Lee <energylyn@zju.edu.cn>
This commit is contained in:
parent c9f881b7bb
commit 1644134799

CLA.md | 4
@@ -25,6 +25,6 @@
 
 **Step 5**: The status on your old PRs will be updated when any new comment is made on it.
 
-### I’m having issues with signing the CLA.
+### I'm having issues with signing the CLA.
 
-If you’re facing difficulty with signing the CNCF CLA, please explain your case on https://github.com/kubernetes/kubernetes/issues/27796 and we (@sarahnovotny and @foxish), along with the CNCF will help sort it out.
+If you're facing difficulty with signing the CNCF CLA, please explain your case on https://github.com/kubernetes/kubernetes/issues/27796 and we (@sarahnovotny and @foxish), along with the CNCF will help sort it out.
@@ -36,7 +36,7 @@ Additions include:
 * In an HA world, API servers may come and go and it is necessary to make sure we are talking to the same cluster as we thought we were talking to.
 * A _set_ of addresses for finding the cluster.
 * It is implied that all of these are equivalent and that a client can try multiple until an appropriate target is found.
-* Initially I’m proposing a flat set here. In the future we can introduce more structure that hints to the user which addresses to try first.
+* Initially I'm proposing a flat set here. In the future we can introduce more structure that hints to the user which addresses to try first.
 * Better documentation and exposure of:
 * The root certificates can be a bundle to enable rotation.
 * If no root certificates are given (and the insecure bit isn't set) then the client trusts the system managed list of CAs.
@@ -45,7 +45,7 @@ Additions include:
 
 **This is to be implemented in a later phase**
 
-Any client of the cluster will want to have this information. As the configuration of the cluster changes we need the client to keep this information up to date. It is assumed that the information here won’t drift so fast that clients won’t be able to find *some* way to connect.
+Any client of the cluster will want to have this information. As the configuration of the cluster changes we need the client to keep this information up to date. It is assumed that the information here won't drift so fast that clients won't be able to find *some* way to connect.
 
 In exceptional circumstances it is possible that this information may be out of date and a client would be unable to connect to a cluster. Consider the case where a user has kubectl set up and working well and then doesn't run kubectl for quite a while. It is possible that over this time (a) the set of servers will have migrated so that all endpoints are now invalid or (b) the root certificates will have rotated so that the user can no longer trust any endpoint.
@@ -83,7 +83,7 @@ If the user requires some auth to the HTTPS server (to keep the ClusterInfo obje
 
 ### Method: Bootstrap Token
 
-There won’t always be a trusted external endpoint to talk to and transmitting
+There won't always be a trusted external endpoint to talk to and transmitting
 the locator file out of band is a pain. However, we want something more secure
 than just hitting HTTP and trusting whatever we get back. In this case, we
 assume we have the following:
@@ -130,7 +130,7 @@ All functions listed above are expected to be thread-safe.
 
 ### Pod/Container Lifecycle
 
-The PodSandbox’s lifecycle is decoupled from the containers, i.e., a sandbox
+The PodSandbox's lifecycle is decoupled from the containers, i.e., a sandbox
 is created before any containers, and can exist after all containers in it have
 terminated.
@@ -22,7 +22,7 @@ Approvers:
 
 Main goal of `ControllerReference` effort is to solve a problem of overlapping controllers that fight over some resources (e.g. `ReplicaSets` fighting with `ReplicationControllers` over `Pods`), which cause serious [problems](https://github.com/kubernetes/kubernetes/issues/24433) such as exploding memory of Controller Manager.
 
-We don’t want to have (just) an in-memory solution, as we don’t want a Controller Manager crash to cause massive changes in object ownership in the system. I.e. we need to persist the information about "owning controller".
+We don't want to have (just) an in-memory solution, as we don’t want a Controller Manager crash to cause massive changes in object ownership in the system. I.e. we need to persist the information about "owning controller".
 
 Secondary goal of this effort is to improve performance of various controllers and schedulers, by removing the need for expensive lookup for all matching "controllers".
@@ -75,7 +75,7 @@ and
 
 By design there are possible races during adoption if multiple controllers can own a given object.
 
-To prevent re-adoption of an object during deletion the `DeletionTimestamp` will be set when deletion is starting. When a controller has a non-nil `DeletionTimestamp` it won’t take any actions except updating its `Status` (in particular it won’t adopt any objects).
+To prevent re-adoption of an object during deletion the `DeletionTimestamp` will be set when deletion is starting. When a controller has a non-nil `DeletionTimestamp` it won't take any actions except updating its `Status` (in particular it won't adopt any objects).
 
 # Implementation plan (sketch):
@@ -46,7 +46,7 @@ For other uses, see the related [feature request](https://issues.k8s.io/1518)
 The DaemonSet supports standard API features:
 - create
 - The spec for DaemonSets has a pod template field.
-- Using the pod’s nodeSelector field, DaemonSets can be restricted to operate
+- Using the pod's nodeSelector field, DaemonSets can be restricted to operate
 over nodes that have a certain label. For example, suppose that in a cluster
 some nodes are labeled ‘app=database’. You can use a DaemonSet to launch a
 datastore pod on exactly those nodes labeled ‘app=database’.
@@ -118,7 +118,7 @@ replica of the daemon pod on the node.
 
 - When a new node is added to the cluster, the DaemonSet controller starts
 daemon pods on the node for DaemonSets whose pod template nodeSelectors match
-the node’s labels.
+the node's labels.
 - Suppose the user launches a DaemonSet that runs a logging daemon on all
 nodes labeled “logger=fluentd”. If the user then adds the “logger=fluentd” label
 to a node (that did not initially have the label), the logging daemon will
@@ -179,7 +179,7 @@ expapi/v1/register.go
 #### Daemon Manager
 
 - Creates new DaemonSets when requested. Launches the corresponding daemon pod
-on all nodes with labels matching the new DaemonSet’s selector.
+on all nodes with labels matching the new DaemonSet's selector.
 - Listens for addition of new nodes to the cluster, by setting up a
 framework.NewInformer that watches for the creation of Node API objects. When a
 new node is added, the daemon manager will loop through each DaemonSet. If the
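The two hunks above describe the DaemonSet controller matching a pod template's nodeSelector against node labels. A minimal Go sketch of that matching rule, with hypothetical helper and variable names (not taken from the controller's actual code):

```go
package main

import "fmt"

// selectorMatches reports whether a node's labels satisfy a DaemonSet's
// nodeSelector: every key in the selector must be present on the node with
// an identical value. An empty selector matches every node.
func selectorMatches(nodeSelector, nodeLabels map[string]string) bool {
	for key, want := range nodeSelector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"app": "database"}
	node := map[string]string{"app": "database", "zone": "us-central1-a"}
	fmt.Println(selectorMatches(selector, node)) // true: a daemon pod would be started on this node
}
```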
@@ -193,7 +193,7 @@ via its hostname.)
 
 - Does not need to be modified, but health checking will occur for the daemon
 pods and revive the pods if they are killed (we set the pod restartPolicy to
-Always). We reject DaemonSet objects with pod templates that don’t have
+Always). We reject DaemonSet objects with pod templates that don't have
 restartPolicy set to Always.
 
 ## Open Issues
@@ -8,7 +8,7 @@ This proposal is an attempt to come up with a means for accounting disk usage in
 
 ### Why is disk accounting necessary?
 
-As of kubernetes v1.1 clusters become unusable over time due to the local disk becoming full. The kubelets on the node attempt to perform garbage collection of old containers and images, but that doesn’t prevent running pods from using up all the available disk space.
+As of kubernetes v1.1 clusters become unusable over time due to the local disk becoming full. The kubelets on the node attempt to perform garbage collection of old containers and images, but that doesn't prevent running pods from using up all the available disk space.
 
 Kubernetes users have no insight into how the disk is being consumed.
@@ -42,13 +42,13 @@ Disk can be consumed for:
 
 1. Container images
 
-2. Container’s writable layer
+2. Container's writable layer
 
-3. Container’s logs - when written to stdout/stderr and default logging backend in docker is used.
+3. Container's logs - when written to stdout/stderr and default logging backend in docker is used.
 
 4. Local volumes - hostPath, emptyDir, gitRepo, etc.
 
-As of Kubernetes v1.1, kubelet exposes disk usage for the entire node and the container’s writable layer for aufs docker storage driver.
+As of Kubernetes v1.1, kubelet exposes disk usage for the entire node and the container's writable layer for aufs docker storage driver.
 This information is made available to end users via the heapster monitoring pipeline.
 
 #### Image layers
@@ -86,7 +86,7 @@ In addition to this, the changes introduced by a pod on the source of a hostPath
 
 ### Docker storage model
 
-Before we start exploring solutions, let’s get familiar with how docker handles storage for images, writable layer and logs.
+Before we start exploring solutions, let's get familiar with how docker handles storage for images, writable layer and logs.
 
 On all storage drivers, logs are stored under `<docker root dir>/containers/<container-id>/`
@@ -123,7 +123,7 @@ Everything under `/var/lib/docker/overlay/<id>` are files required for running
 
 Disk accounting is dependent on the storage driver in docker. A common solution that works across all storage drivers isn't available.
 
-I’m listing a few possible solutions for disk accounting below along with their limitations.
+I'm listing a few possible solutions for disk accounting below along with their limitations.
 
 We need a plugin model for disk accounting. Some storage drivers in docker will require special plugins.
@@ -136,7 +136,7 @@ But isolated usage isn't of much use because image layers are shared between con
 
 Continuing to use the entire partition availability for garbage collection purposes in kubelet, should not affect reliability.
 We might garbage collect more often.
-As long as we do not expose features that require persisting old containers, computing image layer usage wouldn’t be necessary.
+As long as we do not expose features that require persisting old containers, computing image layer usage wouldn't be necessary.
 
 Main goals for images are
 1. Capturing total image disk usage
@@ -208,7 +208,7 @@ Both `uids` and `gids` are meant for security. Overloading that concept for disk
 
 Kubelet needs to define a gid for tracking image layers and make that gid or group the owner of `/var/lib/docker/[aufs | overlayfs]` recursively. Once this is done, the quota sub-system in the kernel will report the blocks being consumed by the storage driver on the underlying partition.
 
-Since this number also includes the container’s writable layer, we will have to somehow subtract that usage from the overall usage of the storage driver directory. Luckily, we can use the same mechanism for tracking container’s writable layer. Once we apply a different `gid` to the container’s writable layer, which is located under `/var/lib/docker/<storage_driver>/diff/<container_id>`, the quota subsystem will not include the container’s writable layer usage.
+Since this number also includes the container's writable layer, we will have to somehow subtract that usage from the overall usage of the storage driver directory. Luckily, we can use the same mechanism for tracking container’s writable layer. Once we apply a different `gid` to the container's writable layer, which is located under `/var/lib/docker/<storage_driver>/diff/<container_id>`, the quota subsystem will not include the container's writable layer usage.
 
 Xfs on the other hand support project quota which lets us track disk usage of arbitrary directories using a project. Support for this feature in ext4 is being reviewed. So on xfs, we can use quota without having to clobber the writable layer's uid and gid.
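As an illustrative aside, the recursive ownership change described in the hunk above could look roughly like the sketch below; the tracking gid value and the storage-driver path are assumptions for the example, not values defined by this proposal.

```go
package main

import (
	"os"
	"path/filepath"
)

// chownTreeToGroup recursively assigns the tracking gid to every file and
// directory under root, leaving the owning uid untouched (-1), so that the
// kernel quota subsystem attributes the consumed blocks to that group.
func chownTreeToGroup(root string, gid int) error {
	return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		return os.Chown(path, -1, gid)
	})
}

func main() {
	// Illustrative values; a real kubelet would allocate the gid and discover
	// the storage-driver directory itself.
	const imageTrackingGID = 9000
	_ = chownTreeToGroup("/var/lib/docker/aufs", imageTrackingGID)
}
```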
@@ -219,7 +219,7 @@ Xfs on the other hand support project quota which lets us track disk usage of ar
 
 **Cons**
 
-* Requires updates to default ownership on docker’s internal storage driver directories. We will have to deal with storage driver implementation details in any approach that is not docker native.
+* Requires updates to default ownership on docker's internal storage driver directories. We will have to deal with storage driver implementation details in any approach that is not docker native.
 
 * Requires additional node configuration - quota subsystem needs to be setup on the node. This can either be automated or made a requirement for the node.
@@ -238,11 +238,11 @@ Project Quota support for ext4 is currently being reviewed upstream. If that fea
 
 Devicemapper storage driver will setup two volumes, metadata and data, that will be used to store image layers and container writable layer. The volumes can be real devices or loopback. A Pool device is created which uses the underlying volume for real storage.
 
-A new thinly-provisioned volume, based on the pool, will be created for running container’s.
+A new thinly-provisioned volume, based on the pool, will be created for running container's.
 
-The kernel tracks the usage of the pool device at the block device layer. The usage here includes image layers and container’s writable layers.
+The kernel tracks the usage of the pool device at the block device layer. The usage here includes image layers and container's writable layers.
 
-Since the kubelet has to track the writable layer usage anyways, we can subtract the aggregated root filesystem usage from the overall pool device usage to get the image layer’s disk usage.
+Since the kubelet has to track the writable layer usage anyways, we can subtract the aggregated root filesystem usage from the overall pool device usage to get the image layer's disk usage.
 
 Linux quota and `du` will not work with device mapper.
@@ -253,7 +253,7 @@ A docker dry run option (mentioned above) is another possibility.
 
 ###### Overlayfs / Aufs
 
-Docker creates a separate directory for the container’s writable layer which is then overlayed on top of read-only image layers.
+Docker creates a separate directory for the container's writable layer which is then overlayed on top of read-only image layers.
 
 Both the previously mentioned options of `du` and `Linux Quota` will work for this case as well.
@@ -268,14 +268,14 @@ If local disk becomes a schedulable resource, `linux quota` can be used to impos
 
 FIXME: How to calculate writable layer usage with devicemapper?
 
-To enforce `limits` the volume created for the container’s writable layer filesystem can be dynamically [resized](https://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/), to not use more than `limit`. `request` will have to be enforced by the kubelet.
+To enforce `limits` the volume created for the container's writable layer filesystem can be dynamically [resized](https://jpetazzo.github.io/2014/01/29/docker-device-mapper-resize/), to not use more than `limit`. `request` will have to be enforced by the kubelet.
 
 
 #### Container logs
 
 Container logs are not storage driver specific. We can use either `du` or `quota` to track log usage per container. Log files are stored under `/var/lib/docker/containers/<container-id>`.
 
-In the case of quota, we can create a separate gid for tracking log usage. This will let users track log usage and writable layer’s usage individually.
+In the case of quota, we can create a separate gid for tracking log usage. This will let users track log usage and writable layer's usage individually.
 
 For the purposes of enforcing limits though, kubelet will use the sum of logs and writable layer.
@@ -340,9 +340,9 @@ In this milestone, we will add support for quota and make it opt-in. There shoul
 
 * Configure linux quota automatically on startup. Do not set any limits in this phase.
 
-* Allocate gids for pod volumes, container’s writable layer and logs, and also for image layers.
+* Allocate gids for pod volumes, container's writable layer and logs, and also for image layers.
 
-* Update the docker runtime plugin in kubelet to perform the necessary `chown’s` and `chmod’s` between container creation and startup.
+* Update the docker runtime plugin in kubelet to perform the necessary `chown's` and `chmod's` between container creation and startup.
 
 * Pass the allocated gids as supplementary gids to containers.
@@ -363,7 +363,7 @@ In this milestone, we will make local disk a schedulable resource.
 
 * Quota plugin sets hard limits equal to user specified `limits`.
 
-* Devicemapper plugin resizes writable layer to not exceed the container’s disk `limit`.
+* Devicemapper plugin resizes writable layer to not exceed the container's disk `limit`.
 
 * Disk manager evicts pods based on `usage` - `request` delta instead of just QoS class.
@@ -448,7 +448,7 @@ Track the space occupied by images after it has been pulled locally as follows.
 
 3. Any new images pulled or containers created will be accounted to the `docker-images` group by default.
 
-4. Once we update the group ownership on newly created containers to a different gid, the container writable layer’s specific disk usage gets dropped from this group.
+4. Once we update the group ownership on newly created containers to a different gid, the container writable layer's specific disk usage gets dropped from this group.
 
 #### Overlayfs
@@ -574,7 +574,7 @@ Capacity in MB = 1638400 * 512 * 128 bytes = 100 GB
 
 ##### Testing titbits
 
-* Ubuntu 15.10 doesn’t ship with the quota module on virtual machines. [Install ‘linux-image-extra-virtual’](http://askubuntu.com/questions/109585/quota-format-not-supported-in-kernel) package to get quota to work.
+* Ubuntu 15.10 doesn't ship with the quota module on virtual machines. [Install ‘linux-image-extra-virtual’](http://askubuntu.com/questions/109585/quota-format-not-supported-in-kernel) package to get quota to work.
 
 * Overlay storage driver needs kernels >= 3.18. I used Ubuntu 15.10 to test Overlayfs.
@@ -221,7 +221,7 @@ Unable to join mesh network. Check your token.
 
 * @jbeda & @philips?
 
-1. Documentation - so that new users can see this in 1.4 (even if it’s caveated with alpha/experimental labels and flags all over it)
+1. Documentation - so that new users can see this in 1.4 (even if it's caveated with alpha/experimental labels and flags all over it)
 
 * @lukemarsden
@@ -18,7 +18,7 @@ control whether he has both enough application replicas running
 locally in each of the clusters (so that, for example, users are
 handled by a nearby cluster, with low latency) and globally (so that
 there is always enough capacity to handle all traffic). If one of the
-clusters has issues or hasn’t enough capacity to run the given set of
+clusters has issues or hasn't enough capacity to run the given set of
 replicas the replicas should be automatically moved to some other
 cluster to keep the application responsive.
@@ -71,7 +71,7 @@ A component that checks how many replicas are actually running in each
 of the subclusters and if the number matches to the
 FederatedReplicaSet preferences (by default spread replicas evenly
 across the clusters but custom preferences are allowed - see
-below). If it doesn’t and the situation is unlikely to improve soon
+below). If it doesn't and the situation is unlikely to improve soon
 then the replicas should be moved to other subclusters.
 
 ### API and CLI
@@ -104,7 +104,7 @@ type FederatedReplicaSetPreferences struct {
 Rebalance bool `json:"rebalance,omitempty"`
 
 // Map from cluster name to preferences for that cluster. It is assumed that if a cluster
-// doesn’t have a matching entry then it should not have local replica. The cluster matches
+// doesn't have a matching entry then it should not have local replica. The cluster matches
 // to "*" if there is no entry with the real cluster name.
 Clusters map[string]LocalReplicaSetPreferences
 }
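To illustrate how the preferences map above might be populated (for example, to spread replicas evenly while capping one cluster), here is a hedged, self-contained sketch; the `Weight` and `MaxReplicas` fields on `LocalReplicaSetPreferences` are assumptions for the example and do not appear in this hunk.

```go
package main

import "fmt"

// Stand-in for the type referenced above; only illustrative fields are shown.
type LocalReplicaSetPreferences struct {
	Weight      int64  // assumed field: relative share of replicas
	MaxReplicas *int64 // assumed field: optional per-cluster cap
}

type FederatedReplicaSetPreferences struct {
	Rebalance bool
	Clusters  map[string]LocalReplicaSetPreferences
}

func main() {
	maxC := int64(20)
	// Spread replicas evenly over all clusters ("*"), but never place more
	// than 20 replicas in cluster C.
	prefs := FederatedReplicaSetPreferences{
		Clusters: map[string]LocalReplicaSetPreferences{
			"*": {Weight: 1},
			"C": {Weight: 1, MaxReplicas: &maxC},
		},
	}
	fmt.Printf("%+v\n", prefs)
}
```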
@@ -194,7 +194,7 @@ FederatedReplicaSetPreferences {
 There is a global target for 50, however clusters require 60. So some clusters will have less replicas.
 Replica layout: A=20 B=20 C=10.
 
-**Scenario 4**. I want to have equal number of replicas in clusters A,B,C, however don’t put more than 20 replicas to cluster C.
+**Scenario 4**. I want to have equal number of replicas in clusters A,B,C, however don't put more than 20 replicas to cluster C.
 
 ```go
 FederatedReplicaSetPreferences {
@@ -312,7 +312,7 @@ enumerated the key idea elements:
 + [E4] LRS is manually deleted from the local cluster. In this case
 a new LRS should be created. It is the same case as
 [[E1]](#heading=h.wn3dfsyc4yuh). Any pods that were left behind
-won’t be killed and will be adopted after the LRS is recreated.
+won't be killed and will be adopted after the LRS is recreated.
 
 + [E5] LRS fails to create (not necessary schedule) the desired
 number of pods due to master troubles, admission control
@@ -341,7 +341,7 @@ elsewhere. For that purpose FRSC will maintain a data structure
 where for each FRS controlled LRS we store a list of pods belonging
 to that LRS along with their current status and status change timestamp.
 
-+ [I5] If a new cluster is added to the federation then it doesn’t
++ [I5] If a new cluster is added to the federation then it doesn't
 have a LRS and the situation is equal to
 [[E1]](#heading=h.wn3dfsyc4yuh)/[[E4]](#heading=h.vlyovyh7eef).
@@ -350,7 +350,7 @@ to that LRS along with their current status and status change timestamp.
 a cluster is lost completely then the cluster is removed from the
 the cluster list (or marked accordingly) so
 [[E6]](#heading=h.in6ove1c1s8f) and [[E7]](#heading=h.37bnbvwjxeda)
-don’t need to be handled.
+don't need to be handled.
 
 + [I7] All ToBeChecked FRS are browsed every 1 min (configurable),
 checked against the current list of clusters, and all missing LRS
@@ -449,7 +449,7 @@ goroutines (however if needed the function can be parallelized for
 different FRS). It takes data only from store maintained by GR2_* and
 GR3_*. The external communication is only required to:
 
-+ Create LRS. If a LRS doesn’t exist it is created after the
++ Create LRS. If a LRS doesn't exist it is created after the
 rescheduling, when we know how much replicas should it have.
 
 + Update LRS replica targets.
@@ -470,7 +470,7 @@ as events.
 ## Workflow
 
 Here is the sequence of tasks that need to be done in order for a
-typical FRS to be split into a number of LRS’s and to be created in
+typical FRS to be split into a number of LRS's and to be created in
 the underlying federated clusters.
 
 Note a: the reason the workflow would be helpful at this phase is that
@@ -489,7 +489,7 @@ Note c: federation-apiserver populates the clusterid field in the FRS
 before persisting it into the federation etcd
 
 Step 3: the federation-level “informer” in FRSC watches federation
-etcd for new/modified FRS’s, with empty clusterid or clusterid equal
+etcd for new/modified FRS's, with empty clusterid or clusterid equal
 to federation ID, and if detected, it calls the scheduling code
 
 Step 4.
@@ -503,7 +503,7 @@ distribution, i.e., equal weights for all of the underlying clusters
 Step 5. As soon as the scheduler function returns the control to FRSC,
 the FRSC starts a number of cluster-level “informer”s, one per every
 target cluster, to watch changes in every target cluster etcd
-regarding the posted LRS’s and if any violation from the scheduled
+regarding the posted LRS's and if any violation from the scheduled
 number of replicase is detected the scheduling code is re-called for
 re-scheduling purposes.
@@ -59,7 +59,7 @@ clusters.
 
 ## SCOPE
 
-It’s difficult to have a perfect design with one click that implements
+It's difficult to have a perfect design with one click that implements
 all the above requirements. Therefore we will go with an iterative
 approach to design and build the system. This document describes the
 phase one of the whole work. In phase one we will cover only the
@@ -95,7 +95,7 @@ Some design principles we are following in this architecture:
 1. Keep the Ubernetes API interface compatible with K8S API as much as
 possible.
 1. Re-use concepts from K8S as much as possible. This reduces
-customers’ learning curve and is good for adoption. Below is a brief
+customers' learning curve and is good for adoption. Below is a brief
 description of each module contained in above diagram.
 
 ## Ubernetes API Server
@@ -105,7 +105,7 @@ Server in K8S. It talks to a distributed key-value store to persist,
 retrieve and watch API objects. This store is completely distinct
 from the kubernetes key-value stores (etcd) in the underlying
 kubernetes clusters. We still use `etcd` as the distributed
-storage so customers don’t need to learn and manage a different
+storage so customers don't need to learn and manage a different
 storage system, although it is envisaged that other storage systems
 (consol, zookeeper) will probably be developedand supported over
 time.
@@ -200,7 +200,7 @@ $version.clusterSpec
 <td style="padding:5px;">Credential<br>
 </td>
 <td style="padding:5px;">the type (e.g. bearer token, client
-certificate etc) and data of the credential used to access cluster. It’s used for system routines (not behalf of users)<br>
+certificate etc) and data of the credential used to access cluster. It's used for system routines (not behalf of users)<br>
 </td>
 <td style="padding:5px;">yes<br>
 </td>
@@ -263,7 +263,7 @@ $version.clusterStatus
 </tbody>
 </table>
 
-**For simplicity we didn’t introduce a separate “cluster metrics” API
+**For simplicity we didn't introduce a separate “cluster metrics” API
 object here**. The cluster resource metrics are stored in cluster
 status section, just like what we did to nodes in K8S. In phase one it
 only contains available CPU resources and memory resources. The
@@ -295,7 +295,7 @@ cases it may be complex. For example:
 + This workload can only be scheduled to cluster Foo. It cannot be
 scheduled to any other clusters. (use case: sensitive workloads).
 + This workload prefers cluster Foo. But if there is no available
-capacity on cluster Foo, it’s OK to be scheduled to cluster Bar
+capacity on cluster Foo, it's OK to be scheduled to cluster Bar
 (use case: workload )
 + Seventy percent of this workload should be scheduled to cluster Foo,
 and thirty percent should be scheduled to cluster Bar (use case:
@@ -373,7 +373,7 @@ plane:
 1. Each cluster control is watching the sub RCs bound to its
 corresponding cluster. It picks up the newly created sub RC.
 1. The cluster controller issues requests to the underlying cluster
-API Server to create the RC. In phase one we don’t support complex
+API Server to create the RC. In phase one we don't support complex
 distribution policies. The scheduling rule is basically:
 1. If a RC does not specify any nodeSelector, it will be scheduled
 to the least loaded K8S cluster(s) that has enough available
@@ -388,7 +388,7 @@ the cluster is working independently it still accepts workload
 requests from other K8S clients or even another Cluster Federation control
 plane. The Cluster Federation scheduling decision is based on this data of
 available resources. However when the actual RC creation happens to
-the cluster at time _T2_, the cluster may don’t have enough resources
+the cluster at time _T2_, the cluster may don't have enough resources
 at that time. We will address this problem in later phases with some
 proposed solutions like resource reservation mechanisms.
@@ -83,7 +83,7 @@ The Garbage Collector consists of a scanner, a garbage processor, and a propagat
 * Worker:
 * Dequeues an item from the *Event Queue*.
 * If the item is an creation or update, then updates the DAG accordingly.
-* If the object has an owner and the owner doesn’t exist in the DAG yet, then apart from adding the object to the DAG, also enqueues the object to the *Dirty Queue*.
+* If the object has an owner and the owner doesn't exist in the DAG yet, then apart from adding the object to the DAG, also enqueues the object to the *Dirty Queue*.
 * If the item is a deletion, then removes the object from the DAG, and enqueues all its dependent objects to the *Dirty Queue*.
 * The propagator shouldn't need to do any RPCs, so a single worker should be sufficient. This makes locking easier.
 * With the Propagator, we *only* need to run the Scanner when starting the GC to populate the DAG and the *Dirty Queue*.
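A rough, self-contained Go sketch of the propagator worker loop in the bullets above; the queue and DAG types here are simplified stand-ins, not the actual garbage collector types.

```go
package main

import "fmt"

type object struct {
	uid      string
	ownerUID string // empty if the object has no owner
}

type event struct {
	eventType string // "add", "update" or "delete"
	obj       object
}

type dag struct{ nodes map[string]object }

func (d *dag) upsert(o object)   { d.nodes[o.uid] = o }
func (d *dag) remove(uid string) { delete(d.nodes, uid) }
func (d *dag) exists(uid string) bool {
	_, ok := d.nodes[uid]
	return ok
}
func (d *dag) dependentsOf(uid string) []object {
	var deps []object
	for _, o := range d.nodes {
		if o.ownerUID == uid {
			deps = append(deps, o)
		}
	}
	return deps
}

// worker is a single-threaded propagator loop: it dequeues events, keeps the
// DAG up to date, and enqueues objects that need attention to the dirty queue.
func worker(events <-chan event, dirty chan<- object, d *dag) {
	for e := range events {
		switch e.eventType {
		case "add", "update":
			d.upsert(e.obj)
			// The owner has not been seen yet: the object may be orphaned.
			if e.obj.ownerUID != "" && !d.exists(e.obj.ownerUID) {
				dirty <- e.obj
			}
		case "delete":
			d.remove(e.obj.uid)
			for _, dep := range d.dependentsOf(e.obj.uid) {
				dirty <- dep
			}
		}
	}
}

func main() {
	events := make(chan event, 1)
	dirty := make(chan object, 8)
	d := &dag{nodes: map[string]object{}}
	// A pod arrives whose owning controller is not in the graph yet, so the
	// worker marks it dirty for the garbage processor to examine.
	events <- event{eventType: "add", obj: object{uid: "pod-1", ownerUID: "rs-1"}}
	close(events)
	worker(events, dirty, d)
	close(dirty)
	for o := range dirty {
		fmt.Println("dirty:", o.uid)
	}
}
```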
@@ -162,8 +162,8 @@ Adding a fourth component to the Garbage Collector, the"orphan" finalizer:
 ## Orphan adoption
 
 Controllers are responsible for adopting orphaned dependent resources. To do so, controllers
-* Checks a potential dependent object’s OwnerReferences to determine if it is orphaned.
-* Fills the OwnerReferences if the object matches the controller’s selector and is orphaned.
+* Checks a potential dependent object's OwnerReferences to determine if it is orphaned.
+* Fills the OwnerReferences if the object matches the controller's selector and is orphaned.
 
 There is a potential race between the "orphan" finalizer removing an owner reference and the controllers adding it back during adoption. Imagining this case: a user deletes an owning object and intends to orphan the dependent objects, so the GC removes the owner from the dependent object's OwnerReferences list, but the controller of the owner resource hasn't observed the deletion yet, so it adopts the dependent again and adds the reference back, resulting in the mistaken deletion of the dependent object. This race can be avoided by implementing Status.ObservedGeneration in all resources. Before updating the dependent Object's OwnerReferences, the "orphan" finalizer checks Status.ObservedGeneration of the owning object to ensure its controller has already observed the deletion.
@@ -173,7 +173,7 @@ For the master, after upgrading to a version that supports cascading deletion, t
 
 For nodes, cascading deletion does not affect them.
 
-For kubectl, we will keep the kubectl’s cascading deletion logic for one more release.
+For kubectl, we will keep the kubectl's cascading deletion logic for one more release.
 
 # End-to-End Examples
@@ -299,7 +299,7 @@ The only new component is the Garbage Collector, which consists of a scanner, a
 * Worker:
 * Dequeues an item from the *Event Queue*.
 * If the item is an creation or update, then updates the DAG accordingly.
-* If the object has a parent and the parent doesn’t exist in the DAG yet, then apart from adding the object to the DAG, also enqueues the object to the *Dirty Queue*.
+* If the object has a parent and the parent doesn't exist in the DAG yet, then apart from adding the object to the DAG, also enqueues the object to the *Dirty Queue*.
 * If the item is a deletion, then removes the object from the DAG, and enqueues all its children to the *Dirty Queue*.
 * The propagator shouldn't need to do any RPCs, so a single worker should be sufficient. This makes locking easier.
 * With the Propagator, we *only* need to run the Scanner when starting the Propagator to populate the DAG and the *Dirty Queue*.
@@ -310,14 +310,14 @@ The only new component is the Garbage Collector, which consists of a scanner, a
 
 * API Server: when handling a deletion request, if DeleteOptions.OrphanChildren is true, then the API Server either creates a tombstone with TTL if the tombstone doesn't exist yet, or updates the TTL of the existing tombstone. The API Server deletes the object after the tombstone is created.
 
-* Controllers: when creating child objects, controllers need to fill up their ObjectMeta.ParentReferences field. Objects that don’t have a parent should have the namespace object as the parent.
+* Controllers: when creating child objects, controllers need to fill up their ObjectMeta.ParentReferences field. Objects that don't have a parent should have the namespace object as the parent.
 
 ## Comparison with the selected design
 
 The main difference between the two designs is when to update the ParentReferences. In design #1, because a tombstone is created to indicate "orphaning" is desired, the updates to ParentReferences can be deferred until the deletion of the tombstone. In design #2, the updates need to be done before the parent object is deleted from the registry.
 
 * Advantages of "Tombstone + GC" design
-* Faster to free the resource name compared to using finalizers. The original object can be deleted to free the resource name once the tombstone is created, rather than waiting for the finalizers to update all children’s ObjectMeta.ParentReferences.
+* Faster to free the resource name compared to using finalizers. The original object can be deleted to free the resource name once the tombstone is created, rather than waiting for the finalizers to update all children's ObjectMeta.ParentReferences.
 * Advantages of "Finalizer Framework + GC"
 * The finalizer framework is needed for other purposes as well.
@@ -49,7 +49,7 @@ It will be covered in a separate doc.
 All etcd instances will be clustered together and one of them will be an elected master.
 In order to commit any change quorum of the cluster will have to confirm it. Etcd will be
 configured in such a way that all writes and reads will go through the master (requests
-will be forwarded by the local etcd server such that it’s invisible for the user). It will
+will be forwarded by the local etcd server such that it's invisible for the user). It will
 affect latency for all operations, but it should not increase by much more than the network
 latency between master replicas (latency between GCE zones with a region is < 10ms).
@@ -57,7 +57,7 @@ Currently etcd exposes port only using localhost interface. In order to allow cl
 and inter-VM communication we will also have to use public interface. To secure the
 communication we will use SSL (as described [here](https://coreos.com/etcd/docs/latest/security.html)).
 
-When generating command line for etcd we will always assume it’s part of a cluster
+When generating command line for etcd we will always assume it's part of a cluster
 (initially of size 1) and list all existing kubernetes master replicas.
 Based on that, we will set the following flags:
 * `-initial-cluster` - list of all hostnames/DNS names for master replicas (including the new one)
@@ -5,7 +5,7 @@ and set them before the container is run. This document describes design of the
 
 ## Motivation
 
-Since we want to make Kubernetes as simple as possible for its users we don’t want to require setting [Resources](../design/resource-qos.md) for container by its owner.
+Since we want to make Kubernetes as simple as possible for its users we don't want to require setting [Resources](../design/resource-qos.md) for container by its owner.
 On the other hand having Resources filled is critical for scheduling decisions.
 Current solution to set up Resources to hardcoded value has obvious drawbacks.
 We need to implement a component which will set initial Resources to a reasonable value.
@@ -22,7 +22,7 @@ InitialResources will set only [request](../design/resource-qos.md#requests-and-
 To make the component work with LimitRanger the estimated value will be capped by min and max possible values if defined.
 It will prevent from situation when the pod is rejected due to too low or too high estimation.
 
-The container won’t be marked as managed by this component in any way, however appropriate event will be exported.
+The container won't be marked as managed by this component in any way, however appropriate event will be exported.
 The predicting algorithm should have very low latency to not increase significantly e2e pod startup latency
 [#3954](https://github.com/kubernetes/kubernetes/pull/3954).
@@ -160,7 +160,7 @@ arbitrary container logs.
 **Who should rotate the logs?**
 
 We assume that a separate task (e.g., cron job) will be configured on the node
-to rotate the logs periodically, similar to today’s implementation.
+to rotate the logs periodically, similar to today's implementation.
 
 We do not rule out the possibility of letting kubelet or a per-node daemon
 (`DaemonSet`) to take up the responsibility, or even declare rotation policy
@@ -10,7 +10,7 @@ detailed discussion.
 
 Currently performance testing happens on ‘live’ clusters of up to 100 Nodes. It takes quite a while to start such cluster or to push
 updates to all Nodes, and it uses quite a lot of resources. At this scale the amount of wasted time and used resources is still acceptable.
-In the next quarter or two we’re targeting 1000 Node cluster, which will push it way beyond ‘acceptable’ level. Additionally we want to
+In the next quarter or two we're targeting 1000 Node cluster, which will push it way beyond ‘acceptable’ level. Additionally we want to
 enable people without many resources to run scalability tests on bigger clusters than they can afford at given time. Having an ability to
 cheaply run scalability tests will enable us to run some set of them on "normal" test clusters, which in turn would mean ability to run
 them on every PR.
@@ -18,7 +18,7 @@ them on every PR.
 This means that we need a system that will allow for realistic performance testing on (much) smaller number of “real” machines. First
 assumption we make is that Nodes are independent, i.e. number of existing Nodes do not impact performance of a single Node. This is not
 entirely true, as number of Nodes can increase latency of various components on Master machine, which in turn may increase latency of Node
-operations, but we’re not interested in measuring this effect here. Instead we want to measure how number of Nodes and the load imposed by
+operations, but we're not interested in measuring this effect here. Instead we want to measure how number of Nodes and the load imposed by
 Node daemons affects the performance of Master components.
 
 ## Kubemark architecture overview
@@ -30,7 +30,7 @@ initial version). To teach Hollow components replaying recorded traffic they wil
 should die (e.g. observed lifetime). Such data can be extracted e.g. from etcd Raft logs, or it can be reconstructed from Events. In the
 initial version we only want them to be able to fool Master components and put some configurable (in what way TBD) load on them.
 
-When we have Hollow Node ready, we’ll be able to test performance of Master Components by creating a real Master Node, with API server,
+When we have Hollow Node ready, we'll be able to test performance of Master Components by creating a real Master Node, with API server,
 Controllers, etcd and whatnot, and create number of Hollow Nodes that will register to the running Master.
 
 To make Kubemark easier to maintain when system evolves Hollow components will reuse real "production" code for Kubelet and KubeProxy, but
@@ -83,8 +83,8 @@ Pod on each Node that exports logs to Elasticsearch (or Google Cloud Logging). B
 cluster so do not add any load on a Master components by themselves. There can be other systems that scrape Heapster through proxy running
 on Master, which adds additional load, but they're not the part of default setup, so in the first version we won't simulate this behavior.
 
-In the first version we’ll assume that all started Pods will run indefinitely if not explicitly deleted. In the future we can add a model
-of short-running batch jobs, but in the initial version we’ll assume only serving-like Pods.
+In the first version we'll assume that all started Pods will run indefinitely if not explicitly deleted. In the future we can add a model
+of short-running batch jobs, but in the initial version we'll assume only serving-like Pods.
 
 ### Heapster
@@ -138,7 +138,7 @@ don't need to solve this problem now.
 - new HollowNode combining the two,
 - make sure that Master can talk to two HollowKubelets running on the same machine
 - Make sure that we can run Hollow cluster on top of Kubernetes [option 2](#option-2)
-- Write a player that will automatically put some predefined load on Master, <- this is the moment when it’s possible to play with it and is useful by itself for
+- Write a player that will automatically put some predefined load on Master, <- this is the moment when it's possible to play with it and is useful by itself for
 scalability tests. Alternatively we can just use current density/load tests,
 - Benchmark our machines - see how many Watch clients we can have before everything explodes,
 - See how many HollowNodes we can run on a single machine by attaching them to the real master <- this is the moment it starts to useful
@@ -99,12 +99,12 @@ version of today's Heapster. metrics-server stores locally only latest values an
 metrics-server exposes the master metrics API. (The configuration described here is similar
 to the current Heapster in “standalone” mode.)
 [Discovery summarizer](../../docs/proposals/federated-api-servers.md)
-makes the master metrics API available to external clients such that from the client’s perspective
+makes the master metrics API available to external clients such that from the client's perspective
 it looks the same as talking to the API server.
 
 Core (system) metrics are handled as described above in all deployment environments. The only
 easily replaceable part is resource estimator, which could be replaced by power users. In
-theory, metric-server itself can also be substituted, but it’d be similar to substituting
+theory, metric-server itself can also be substituted, but it'd be similar to substituting
 apiserver itself or controller-manager - possible, but not recommended and not supported.
 
 Eventually the core metrics pipeline might also collect metrics from Kubelet and Docker daemon
@@ -170,7 +170,7 @@ cAdvisor + Heapster + InfluxDB (or any other sink)
 * snapd + SNAP cluster-level agent
 * Sysdig
 
-As an example we’ll describe a potential integration with cAdvisor + Prometheus.
+As an example we'll describe a potential integration with cAdvisor + Prometheus.
 
 Prometheus has the following metric sources on a node:
 * core and non-core system metrics from cAdvisor
@@ -96,7 +96,7 @@ The user needs a way to explicitly declare which connections are allowed into po
 This is accomplished through ingress rules on `NetworkPolicy`
 objects (of which there can be multiple in a single namespace). Pods selected by
 one or more NetworkPolicy objects should allow any incoming connections that match any
-ingress rule on those NetworkPolicy objects, per the network plugin’s capabilities.
+ingress rule on those NetworkPolicy objects, per the network plugin's capabilities.
 
 NetworkPolicy objects and the above namespace isolation both act on _connections_ rather than individual packets. That is to say that if traffic from pod A to pod B is allowed by the configured
 policy, then the return packets for that connection from B -> A are also allowed, even if the policy in place would not allow B to initiate a connection to A. NetworkPolicy objects act on a broad definition of _connection_ which includes both TCP and UDP streams. If new network policy is applied that would block an existing connection between two endpoints, the enforcer of policy
@@ -35,7 +35,7 @@ among other problems.
 ## Container to container
 
 All containers within a pod behave as if they are on the same host with regard
-to networking. They can all reach each other’s ports on localhost. This offers
+to networking. They can all reach each other's ports on localhost. This offers
 simplicity (static ports know a priori), security (ports bound to localhost
 are visible within the pod but never outside it), and performance. This also
 reduces friction for applications moving from the world of uncontainerized apps
@@ -20,13 +20,13 @@ lost because of this issue.
 ### Adding per-pod probe-time, which increased the number of PodStatus updates, causing major slowdown
 
 In September 2015 we tried to add per-pod probe times to the PodStatus. It caused (https://github.com/kubernetes/kubernetes/issues/14273) a massive increase in both number and
-total volume of object (PodStatus) changes. It drastically increased the load on API server which wasn’t able to handle new number of requests quickly enough, violating our
+total volume of object (PodStatus) changes. It drastically increased the load on API server which wasn't able to handle new number of requests quickly enough, violating our
 response time SLO. We had to revert this change.
 
 ### Late Ready->Running PodPhase transition caused test failures as it seemed like slowdown
 
 In late September we encountered a strange problem (https://github.com/kubernetes/kubernetes/issues/14554): we observed an increased observed latencies in small clusters (few
-Nodes). It turned out that it’s caused by an added latency between PodRunning and PodReady phases. This was not a real regression, but our tests thought it were, which shows
+Nodes). It turned out that it's caused by an added latency between PodRunning and PodReady phases. This was not a real regression, but our tests thought it were, which shows
 how careful we need to be.
 
 ### Huge number of handshakes slows down API server
@@ -350,7 +350,7 @@ Two top level cgroups for `Bu` and `BE` QoS classes are created when Kubelet sta
 #### Pod level Cgroup creation and deletion (Docker runtime)
 
 - When a new pod is brought up, its QoS class is firstly determined.
-- We add an interface to Kubelet’s ContainerManager to create and delete pod level cgroups under the cgroup that matches the pod’s QoS class.
+- We add an interface to Kubelet's ContainerManager to create and delete pod level cgroups under the cgroup that matches the pod's QoS class.
 - This interface will be pluggable. Kubelet will support both systemd and raw cgroups based __cgroup__ drivers. We will be using the --cgroup-driver flag proposed in the [Systemd Node Spec](kubelet-systemd.md) to specify the cgroup driver.
 - We inject creation and deletion of pod level cgroups into the pod workers.
 - As new pods are added QoS class cgroup parameters are updated to match the resource requests by the Pod.
@@ -365,7 +365,7 @@ We want to have rkt create pods under a root QoS class that kubelet specifies, a
 
 #### Add Pod level metrics to Kubelet's metrics provider
 
-Update Kubelet’s metrics provider to include Pod level metrics. Use cAdvisor's cgroup subsystem information to determine various Pod level usage metrics.
+Update Kubelet's metrics provider to include Pod level metrics. Use cAdvisor's cgroup subsystem information to determine various Pod level usage metrics.
 
 `Note: Changes to cAdvisor might be necessary.`
@@ -393,7 +393,7 @@ Updating QoS limits needs to happen before pod cgroups values are updated. When
 
 Other smaller work items that we would be good to have before the release of this feature.
 - [ ] Add Pod UID to the downward api which will help simplify the e2e testing logic.
-- [ ] Check if parent cgroup exist and error out if they don’t.
+- [ ] Check if parent cgroup exist and error out if they don't.
 - [ ] Set top level cgroup limit to resource allocatable until we support QoS level cgroup updates. If cgroup root is not `/` then set node resource allocatable as the cgroup resource limits on cgroup root.
 - [ ] Add a NodeResourceAllocatableProvider which returns the amount of allocatable resources on the nodes. This interface would be used both by the Kubelet and ContainerManager.
 - [ ] Add top level feasibility check to ensure that pod can be admitted on the node by estimating left over resources on the node.
@@ -403,7 +403,7 @@ Other smaller work items that we would be good to have before the release of thi
 
 To better support our requirements we needed to make some changes/add features to Libcontainer as well
 
 - [x] Allowing or denying all devices by writing 'a' to devices.allow or devices.deny is
-not possible once the device cgroups has children. Libcontainer doesn’t have the option of skipping updates on parent devices cgroup. opencontainers/runc/pull/958
+not possible once the device cgroups has children. Libcontainer doesn't have the option of skipping updates on parent devices cgroup. opencontainers/runc/pull/958
 - [x] To use libcontainer for creating and managing cgroups in the Kubelet, I would like to just create a cgroup with no pid attached and if need be apply a pid to the cgroup later on. But libcontainer did not support cgroup creation without attaching a pid. opencontainers/runc/pull/956
@@ -9,7 +9,7 @@ by evicting a critical addon (either manually or as a side effect of an other op
 which possibly can become pending (for example when the cluster is highly utilized).
 To avoid such situation we want to have a mechanism which guarantees that
 critical addons are scheduled assuming the cluster is big enough.
-This possibly may affect other pods (including production user’s applications).
+This possibly may affect other pods (including production user's applications).
 
 ## Design
@@ -33,13 +33,13 @@ Later we may want to introduce some heuristic:
 * minimize number of evicted pods with violation of disruption budget or shortened termination grace period
 * minimize number of affected pods by choosing a node on which we have to evict less pods
 * increase probability of scheduling of evicted pods by preferring a set of pods with the smallest total sum of requests
-* avoid nodes which are ‘non-drainable’ (according to drain logic), for example on which there is a pod which doesn’t belong to any RC/RS/Deployment
+* avoid nodes which are ‘non-drainable’ (according to drain logic), for example on which there is a pod which doesn't belong to any RC/RS/Deployment
 
 #### Evicting pods
 
 There are 2 mechanism which possibly can delay a pod eviction: Disruption Budget and Termination Grace Period.
 
-While removing a pod we will try to avoid violating Disruption Budget, though we can’t guarantee it
+While removing a pod we will try to avoid violating Disruption Budget, though we can't guarantee it
 since there is a chance that it would block this operation for longer period of time.
 We will also try to respect Termination Grace Period, though without any guarantee.
 In case we have to remove a pod with termination grace period longer than 10s it will be shortened to 10s.
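A tiny sketch of the grace-period rule stated in the hunk above (the pod-supplied value is honored up to a 10-second ceiling); the function name is illustrative, not taken from the rescheduler's code.

```go
package main

import "fmt"

// evictionGracePeriod returns the termination grace period the rescheduler
// would use when evicting a pod: the pod's own value, shortened to 10 seconds
// if it is longer.
func evictionGracePeriod(podGracePeriodSeconds int64) int64 {
	if podGracePeriodSeconds > 10 {
		return 10
	}
	return podGracePeriodSeconds
}

func main() {
	fmt.Println(evictionGracePeriod(30)) // 10
	fmt.Println(evictionGracePeriod(5))  // 5
}
```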
@@ -70,7 +70,7 @@ This situation would be rare and usually an extra node would be anyway needed fo
 In the worst case CA will add and then remove the node.
 To not complicate architecture by introducing interaction between those 2 components we accept this overlap.
 
-We want to ensure that CA won’t remove nodes with critical addons by adding appropriate logic there.
+We want to ensure that CA won't remove nodes with critical addons by adding appropriate logic there.
 
 ### Rescheduler control loop
@@ -167,7 +167,7 @@ Under system memory pressure, these containers are more likely to be killed once
 
 Pod OOM score configuration
 - Note that the OOM score of a process is 10 times the % of memory the process consumes, adjusted by OOM_SCORE_ADJ, barring exceptions (e.g. process is launched by root). Processes with higher OOM scores are killed.
-- The base OOM score is between 0 and 1000, so if process A’s OOM_SCORE_ADJ - process B’s OOM_SCORE_ADJ is over a 1000, then process A will always be OOM killed before B.
+- The base OOM score is between 0 and 1000, so if process A's OOM_SCORE_ADJ - process B's OOM_SCORE_ADJ is over a 1000, then process A will always be OOM killed before B.
 - The final OOM score of a process is also between 0 and 1000
 
 *Best-effort*
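A worked example of the arithmetic in the hunk above, using the simplified model "10 × memory% + OOM_SCORE_ADJ, clamped to [0, 1000]" rather than the kernel's exact heuristic; the adjustment values below are chosen only for illustration.

```go
package main

import "fmt"

// effectiveOOMScore applies the simplified model from the bullets above:
// the base score is ten times the percentage of memory used, the adjustment
// is added on top, and the result is clamped to the [0, 1000] range.
func effectiveOOMScore(memoryPercent float64, oomScoreAdj int) int {
	score := int(10*memoryPercent) + oomScoreAdj
	if score < 0 {
		return 0
	}
	if score > 1000 {
		return 1000
	}
	return score
}

func main() {
	// A process with adjustment +1000 using 5% of memory always outranks one
	// with adjustment -998 using 30% of memory, since the gap exceeds 1000.
	fmt.Println(effectiveOOMScore(5, 1000))  // 1000 (clamped)
	fmt.Println(effectiveOOMScore(30, -998)) // 0 (clamped)
}
```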
@@ -194,7 +194,7 @@ Pod OOM score configuration
 - OOM_SCORE_ADJ: -998
 
 *Kubelet, Docker*
-- OOM_SCORE_ADJ: -999 (won’t be OOM killed)
+- OOM_SCORE_ADJ: -999 (won't be OOM killed)
 - Hack, because these critical tasks might die if they conflict with guaranteed containers. In the future, we should place all user-pods into a separate cgroup, and set a limit on the memory they can consume.
 
 ## Known issues and possible improvements
@@ -203,7 +203,7 @@ The above implementation provides for basic oversubscription with protection, bu
 
 #### Support for Swap
 
-- The current QoS policy assumes that swap is disabled. If swap is enabled, then resource guarantees (for pods that specify resource requirements) will not hold. For example, suppose 2 guaranteed pods have reached their memory limit. They can continue allocating memory by utilizing disk space. Eventually, if there isn’t enough swap space, processes in the pods might get killed. The node must take into account swap space explicitly for providing deterministic isolation behavior.
+- The current QoS policy assumes that swap is disabled. If swap is enabled, then resource guarantees (for pods that specify resource requirements) will not hold. For example, suppose 2 guaranteed pods have reached their memory limit. They can continue allocating memory by utilizing disk space. Eventually, if there isn't enough swap space, processes in the pods might get killed. The node must take into account swap space explicitly for providing deterministic isolation behavior.
 
 ## Alternative QoS Class Policy
@@ -51,7 +51,7 @@ pods and service accounts within a project
as a new cluster-scoped object called `PodSecurityPolicy`.
1. User information in `user.Info` must be available to admission controllers. (Completed in
https://github.com/GoogleCloudPlatform/kubernetes/pull/8203)
-1. Some authorizers may restrict a user’s ability to reference a service account. Systems requiring
+1. Some authorizers may restrict a user's ability to reference a service account. Systems requiring
the ability to secure service accounts on a user level must be able to add a policy that enables
referencing specific service accounts themselves.
1. Admission control must validate the creation of Pods against the allowed set of constraints.
@@ -20,7 +20,7 @@ If a user and/or password is required then this information can be passed using

`Service Scheme` - Services can be deployed using different schemes. Some popular schemes include `http`,`https`,`file`,`ftp` and `jdbc`.

-`Service Protocol` - Services use different protocols that clients need to speak in order to communicate with the service, some examples of service level protocols are SOAP, REST (Yes, technically REST isn’t a protocol but an architectural style). For service consumers it can be hard to tell what protocol is expected.
+`Service Protocol` - Services use different protocols that clients need to speak in order to communicate with the service, some examples of service level protocols are SOAP, REST (Yes, technically REST isn't a protocol but an architectural style). For service consumers it can be hard to tell what protocol is expected.

## Service Description

@@ -37,7 +37,7 @@ Kubernetes allows the creation of Service Annotations. Here we propose the use o
* `api.service.kubernetes.io/path` - the path part of the service endpoint url. An example value could be `cxfcdi`,
* `api.service.kubernetes.io/scheme` - the scheme part of the service endpoint url. Some values could be `http` or `https`.
* `api.service.kubernetes.io/protocol` - the protocol of the service. Known values are `SOAP`, `XML-RPC` and `REST`,
-* `api.service.kubernetes.io/description-path` - the path part of the service description document’s endpoint. It is a pretty safe assumption that the service self-documents. An example value for a swagger 2.0 document can be `cxfcdi/swagger.json`,
+* `api.service.kubernetes.io/description-path` - the path part of the service description document's endpoint. It is a pretty safe assumption that the service self-documents. An example value for a swagger 2.0 document can be `cxfcdi/swagger.json`,
* `api.kubernetes.io/description-language` - the type of Description Language used. Known values are `WSDL`, `WADL`, `SwaggerJSON`, `SwaggerYAML`.

The fragment below is taken from the service section of the kubernetes.json were these annotations are used
@@ -106,7 +106,7 @@ When validating the ownerReference, API server needs to query the `Authorizer` t

**Modifications to processEvent()**

-Currently `processEvent()` manages GC’s internal owner-dependency relationship graph, `uidToNode`. It updates `uidToNode` according to the Add/Update/Delete events in the cluster. To support synchronous GC, it has to:
+Currently `processEvent()` manages GC's internal owner-dependency relationship graph, `uidToNode`. It updates `uidToNode` according to the Add/Update/Delete events in the cluster. To support synchronous GC, it has to:

* handle Add or Update events where `obj.Finalizers.Has(GCFinalizer) && obj.DeletionTimestamp != nil`. The object will be added into the `dirtyQueue`. The object will be marked as “GC in progress” in `uidToNode`.
* Upon receiving the deletion event of an object, put its owner into the `dirtyQueue` if the owner node is marked as "GC in progress". This is to force the `processItem()` (described next) to re-check if all dependents of the owner is deleted.
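A rough Go sketch of the `processEvent()` additions the hunk above describes; the event and node types, field names, and the finalizer string are assumptions made for this illustration, not taken from the garbage collector's actual code:

```go
package garbagecollector

import "time"

type eventType int

const (
	addOrUpdateEvent eventType = iota
	deleteEvent
)

// node is an illustrative stand-in for an entry in the GC's uidToNode graph.
type node struct {
	finalizers        []string
	deletionTimestamp *time.Time
	gcInProgress      bool
	owners            []*node
}

func hasGCFinalizer(finalizers []string) bool {
	for _, f := range finalizers {
		if f == "GCFinalizer" { // placeholder for the real finalizer string
			return true
		}
	}
	return false
}

// processEvent sketch: Add/Update events for objects carrying the GC finalizer
// and a deletion timestamp are enqueued and marked "GC in progress"; deletion
// events re-enqueue any owner that is still waiting on its dependents.
func processEvent(t eventType, obj *node, dirtyQueue chan<- *node) {
	switch t {
	case addOrUpdateEvent:
		if hasGCFinalizer(obj.finalizers) && obj.deletionTimestamp != nil {
			obj.gcInProgress = true
			dirtyQueue <- obj
		}
	case deleteEvent:
		for _, owner := range obj.owners {
			if owner.gcInProgress {
				dirtyQueue <- owner
			}
		}
	}
}
```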
@@ -208,7 +208,7 @@ type Parameter struct {
```

As seen above, parameters allow for metadata which can be fed into client implementations to display information about the
-parameter’s purpose and whether a value is required. In lieu of type information, two reference styles are offered: `$(PARAM)`
+parameter's purpose and whether a value is required. In lieu of type information, two reference styles are offered: `$(PARAM)`
and `$((PARAM))`. When the single parens option is used, the result of the substitution will remain quoted. When the double
parens option is used, the result of the substitution will not be quoted. For example, given a parameter defined with a value
of "BAR", the following behavior will be observed:
@@ -402,7 +402,7 @@ The api endpoint will then:

1. Validate the template including confirming “required” parameters have an explicit value.
2. Walk each api object in the template.
-3. Adding all labels defined in the template’s ObjectLabels field.
+3. Adding all labels defined in the template's ObjectLabels field.
4. For each field, check if the value matches a parameter name and if so, set the value of the field to the value of the parameter.
* Partial substitutions are accepted, such as `SOME_$(PARAM)` which would be transformed into `SOME_XXXX` where `XXXX` is the value
of the `$(PARAM)` parameter.
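A hedged sketch of the `$(PARAM)` / `$((PARAM))` substitution behaviour described in the two hunks above, written against the Go standard library; it is an illustration only, not the template processor's actual implementation, and it handles a single string field rather than a full object walk:

```go
package template

import "strings"

// substituteParams applies the two reference styles described above to one
// string field: an exact $((NAME)) reference yields an unquoted value, while
// $(NAME) references (including partial ones such as SOME_$(NAME)) are replaced
// in place and the field stays quoted. Escaping and error handling are omitted.
func substituteParams(field string, params map[string]string) (value string, keepQuotes bool) {
	for name, val := range params {
		if field == "$(("+name+"))" {
			// Double-paren form: the substituted value is emitted unquoted, so
			// it can become a number or boolean in the rendered object.
			return val, false
		}
		field = strings.ReplaceAll(field, "$("+name+")", val)
	}
	return field, true
}
```

With `params = map[string]string{"PARAM": "BAR"}`, the field `SOME_$(PARAM)` renders as the quoted string `SOME_BAR`, while `$((PARAM))` renders as the bare token `BAR`.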
@@ -190,7 +190,7 @@ Open questions:

* Can the API call methods on VolumePlugins? Yeah via controller

-* The scheduler gives users functionality that doesn’t already exist, but required adding an entirely new controller
+* The scheduler gives users functionality that doesn't already exist, but required adding an entirely new controller

* Should the list and restore operations be part of v1?

@@ -446,7 +446,7 @@ Users will specify a snapshotting schedule for particular volumes, which Kuberne

17. If the pod dies do we continue creating snapshots?

-18. How to communicate errors (PD doesn’t support snapshotting, time period unsupported)
+18. How to communicate errors (PD doesn't support snapshotting, time period unsupported)

19. Off schedule snapshotting like before an application upgrade

@@ -456,7 +456,7 @@ Options, pros, cons, suggestion/recommendation

Example 1b

-During pod creation, a user can specify a pod definition in a yaml file. As part of this specification, users should be able to denote a [list of] times at which an existing snapshot command can be executed on the pod’s associated volume.
+During pod creation, a user can specify a pod definition in a yaml file. As part of this specification, users should be able to denote a [list of] times at which an existing snapshot command can be executed on the pod's associated volume.

For a simple example, take the definition of a [pod using a GCE PD](http://kubernetes.io/docs/user-guide/volumes/#example-pod-2):

@@ -18,10 +18,10 @@ Watches, etc, are all merely optimizations of this logic.

## Guidelines

-When you’re writing controllers, there are few guidelines that will help make sure you get the results and performance
-you’re looking for.
+When you're writing controllers, there are few guidelines that will help make sure you get the results and performance
+you're looking for.

-1. Operate on one item at a time. If you use a `workqueue.Interface`, you’ll be able to queue changes for a
+1. Operate on one item at a time. If you use a `workqueue.Interface`, you'll be able to queue changes for a
particular resource and later pop them in multiple “worker” gofuncs with a guarantee that no two gofuncs will
work on the same item at the same time.

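A small sketch of the "one item at a time" pattern from the hunk above; the import path is the present-day client-go location of the workqueue package (an assumption of this illustration, since the document predates it), and the helper is illustrative:

```go
package controller

import (
	"log"

	"k8s.io/client-go/util/workqueue"
)

// runWorkers pops keys from the queue in several gofuncs; the workqueue
// guarantees that the same key is never handled by two workers at once.
func runWorkers(queue workqueue.Interface, workers int, sync func(key string) error) {
	for i := 0; i < workers; i++ {
		go func() {
			for {
				item, shutdown := queue.Get()
				if shutdown {
					return
				}
				if err := sync(item.(string)); err != nil {
					log.Printf("sync of %q failed: %v", item, err)
				}
				// Done must be called so the key can be queued and processed again.
				queue.Done(item)
			}
		}()
	}
}
```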
@@ -37,11 +37,11 @@ you’re looking for.
resourceB/Y”, your controller could observe “created resourceB/Y” and “created resourceA/X”.


-1. Level driven, not edge driven. Just like having a shell script that isn’t running all the time, your controller
+1. Level driven, not edge driven. Just like having a shell script that isn't running all the time, your controller
may be off for an indeterminate amount of time before running again.

-If an API object appears with a marker value of `true`, you can’t count on having seen it turn from `false` to `true`,
-only that you now observe it being `true`. Even an API watch suffers from this problem, so be sure that you’re not
+If an API object appears with a marker value of `true`, you can't count on having seen it turn from `false` to `true`,
+only that you now observe it being `true`. Even an API watch suffers from this problem, so be sure that you're not
counting on seeing a change unless your controller is also marking the information it last made the decision on in
the object's status.

@@ -61,18 +61,18 @@ you’re looking for.


1. Never mutate original objects! Caches are shared across controllers, this means that if you mutate your "copy"
-(actually a reference or shallow copy) of an object, you’ll mess up other controllers (not just your own).
+(actually a reference or shallow copy) of an object, you'll mess up other controllers (not just your own).

The most common point of failure is making a shallow copy, then mutating a map, like `Annotations`. Use
`api.Scheme.Copy` to make a deep copy.


1. Wait for your secondary caches. Many controllers have primary and secondary resources. Primary resources are the
-resources that you’ll be updating `Status` for. Secondary resources are resources that you’ll be managing
+resources that you'll be updating `Status` for. Secondary resources are resources that you'll be managing
(creating/deleting) or using for lookups.

Use the `framework.WaitForCacheSync` function to wait for your secondary caches before starting your primary sync
-functions. This will make sure that things like a Pod count for a ReplicaSet isn’t working off of known out of date
+functions. This will make sure that things like a Pod count for a ReplicaSet isn't working off of known out of date
information that results in thrashing.


@@ -87,14 +87,14 @@ you’re looking for.
1. Percolate errors to the top level for consistent re-queuing. We have a `workqueue.RateLimitingInterface` to allow
simple requeuing with reasonable backoffs.

-Your main controller func should return an error when requeuing is necessary. When it isn’t, it should use
+Your main controller func should return an error when requeuing is necessary. When it isn't, it should use
`utilruntime.HandleError` and return nil instead. This makes it very easy for reviewers to inspect error handling
-cases and to be confident that your controller doesn’t accidentally lose things it should retry for.
+cases and to be confident that your controller doesn't accidentally lose things it should retry for.


1. Watches and Informers will “sync”. Periodically, they will deliver every matching object in the cluster to your
`Update` method. This is good for cases where you may need to take additional action on the object, but sometimes you
-know there won’t be more work to do.
+know there won't be more work to do.

In cases where you are *certain* that you don't need to requeue items when there are no new changes, you can compare the
resource version of the old and new objects. If they are the same, you skip requeuing the work. Be careful when you
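A sketch of the error-percolation guideline above. Per that guideline, the sync function reports non-retryable problems itself (via `utilruntime.HandleError`) and returns nil, so the worker only requeues what genuinely needs a retry; the import path is the present-day client-go location and is an assumption of this illustration:

```go
package controller

import "k8s.io/client-go/util/workqueue"

// processNextItem pops one key and requeues it with rate-limited backoff only
// when syncHandler returns an error. syncHandler is expected to call
// utilruntime.HandleError and return nil for problems that should not be retried.
func processNextItem(queue workqueue.RateLimitingInterface, syncHandler func(key string) error) bool {
	item, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(item)

	if err := syncHandler(item.(string)); err != nil {
		// A returned error means "please retry": requeue with backoff.
		queue.AddRateLimited(item)
		return true
	}
	// Success (or an already-reported, non-retryable error): drop the retry history.
	queue.Forget(item)
	return true
}
```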
@@ -41,19 +41,19 @@ background on k8s networking could be found
[here](http://kubernetes.io/docs/admin/networking/)

## Requirements
-1. Kubelet expects the runtime shim to manage pod’s network life cycle. Pod
+1. Kubelet expects the runtime shim to manage pod's network life cycle. Pod
networking should be handled accordingly along with pod sandbox operations.
-* `RunPodSandbox` must set up pod’s network. This includes, but is not limited
-to allocating a pod IP, configuring the pod’s network interfaces and default
+* `RunPodSandbox` must set up pod's network. This includes, but is not limited
+to allocating a pod IP, configuring the pod's network interfaces and default
network route. Kubelet expects the pod sandbox to have an IP which is
routable within the k8s cluster, if `RunPodSandbox` returns successfully.
-`RunPodSandbox` must return an error if it fails to set up the pod’s network.
-If the pod’s network has already been set up, `RunPodSandbox` must skip
+`RunPodSandbox` must return an error if it fails to set up the pod's network.
+If the pod's network has already been set up, `RunPodSandbox` must skip
network setup and proceed.
-* `StopPodSandbox` must tear down the pod’s network. The runtime shim
-must return error on network tear down failure. If pod’s network has
+* `StopPodSandbox` must tear down the pod's network. The runtime shim
+must return error on network tear down failure. If pod's network has
already been torn down, `StopPodSandbox` must skip network tear down and proceed.
-* `RemovePodSandbox` may tear down pod’s network, if the networking has
+* `RemovePodSandbox` may tear down pod's network, if the networking has
not been torn down already. `RemovePodSandbox` must return error on
network tear down failure.
* Response from `PodSandboxStatus` must include pod sandbox network status.
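A hypothetical sketch of how a runtime shim could satisfy the idempotency requirements above for `RunPodSandbox` and `StopPodSandbox`; the `netPlugin` interface and the shim type are inventions of this illustration, not part of the CRI:

```go
package shim

import "fmt"

// netPlugin is an illustrative stand-in for whatever network plugin the shim
// drives (CNI, kubenet, ...).
type netPlugin interface {
	SetUp(podSandboxID string) (podIP string, err error)
	TearDown(podSandboxID string) error
	IsSetUp(podSandboxID string) bool
}

type runtimeShim struct {
	net netPlugin
}

// setUpSandboxNetwork follows the RunPodSandbox requirement: skip setup if the
// network already exists, otherwise set it up and fail the whole call on error.
func (s *runtimeShim) setUpSandboxNetwork(id string) error {
	if s.net.IsSetUp(id) {
		return nil // already configured: RunPodSandbox must proceed, not redo it
	}
	if _, err := s.net.SetUp(id); err != nil {
		return fmt.Errorf("failed to set up network for sandbox %s: %v", id, err)
	}
	return nil
}

// tearDownSandboxNetwork follows the StopPodSandbox requirement: tear-down is
// skipped when already done, and failures are surfaced to the caller.
func (s *runtimeShim) tearDownSandboxNetwork(id string) error {
	if !s.net.IsSetUp(id) {
		return nil
	}
	if err := s.net.TearDown(id); err != nil {
		return fmt.Errorf("failed to tear down network for sandbox %s: %v", id, err)
	}
	return nil
}
```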
@@ -43,7 +43,7 @@ Common workflow for Kubemark is:
- monitoring test execution and debugging problems
- turning down Kubemark cluster

-Included in descriptions there will be comments helpful for anyone who’ll want to
+Included in descriptions there will be comments helpful for anyone who'll want to
port Kubemark to different providers.

### Starting a Kubemark cluster
@@ -58,7 +58,7 @@ configuration stored in `cluster/kubemark/config-default.sh` - you can tweak it
however you want, but note that some features may not be implemented yet, as
implementation of Hollow components/mocks will probably be lagging behind ‘real’
one. For performance tests interesting variables are `NUM_NODES` and
-`MASTER_SIZE`. After start-kubemark script is finished you’ll have a ready
+`MASTER_SIZE`. After start-kubemark script is finished you'll have a ready
Kubemark cluster, a kubeconfig file for talking to the Kubemark cluster is
stored in `test/kubemark/kubeconfig.kubemark`.

@@ -87,7 +87,7 @@ be easy to do outside of GCE*).

- Creates a ReplicationController for HollowNodes and starts them up. (*will
work exactly the same everywhere as long as MASTER_IP will be populated
-correctly, but you’ll need to update docker image address if you’re not using
+correctly, but you'll need to update docker image address if you're not using
GCR and default image name*)

- Waits until all HollowNodes are in the Running phase (*will work exactly the
@@ -129,7 +129,7 @@ Master machine (currently) differs from the ordinary one.
If you need to debug master machine you can do similar things as you do on your
ordinary master. The difference between Kubemark setup and ordinary setup is
that in Kubemark etcd is run as a plain docker container, and all master
-components are run as normal processes. There’s no Kubelet overseeing them. Logs
+components are run as normal processes. There's no Kubelet overseeing them. Logs
are stored in exactly the same place, i.e. `/var/logs/` directory. Because
binaries are not supervised by anything they won't be restarted in the case of a
crash.
@@ -145,7 +145,7 @@ one of them you need to learn which hollow-node pod corresponds to a given
HollowNode known by the Master. During self-registeration HollowNodes provide
their cluster IPs as Names, which means that if you need to find a HollowNode
named `10.2.4.5` you just need to find a Pod in external cluster with this
-cluster IP. There’s a helper script
+cluster IP. There's a helper script
`test/kubemark/get-real-pod-for-hollow-node.sh` that does this for you.

When you have a Pod name you can use `kubectl logs` on external cluster to get
@@ -190,7 +190,7 @@ All those things should work exactly the same on all cloud providers.

On GCE you just need to execute `test/kubemark/stop-kubemark.sh` script, which
will delete HollowNode ReplicationController and all the resources for you. On
-other providers you’ll need to delete all this stuff by yourself.
+other providers you'll need to delete all this stuff by yourself.

## Some current implementation details

@@ -198,12 +198,12 @@ Kubemark master uses exactly the same binaries as ordinary Kubernetes does. This
means that it will never be out of date. On the other hand HollowNodes use
existing fake for Kubelet (called SimpleKubelet), which mocks its runtime
manager with `pkg/kubelet/dockertools/fake_manager.go`, where most logic sits.
-Because there’s no easy way of mocking other managers (e.g. VolumeManager), they
-are not supported in Kubemark (e.g. we can’t schedule Pods with volumes in them
+Because there's no easy way of mocking other managers (e.g. VolumeManager), they
+are not supported in Kubemark (e.g. we can't schedule Pods with volumes in them
yet).

As the time passes more fakes will probably be plugged into HollowNodes, but
-it’s crucial to make it as simple as possible to allow running a big number of
+it's crucial to make it as simple as possible to allow running a big number of
Hollows on a single core.


@@ -17,7 +17,7 @@ They should **NOT**:
* Apply priority labels to PRs
* Apply cherrypick labels to PRs
* Edit text of other people's PRs and issues, including deleting comments
-* Modify anyone else’s release note
+* Modify anyone else's release note
* Create, edit, delete labels
* Create, edit, close, delete milestones
* Create, edit, delete releases
@@ -18,7 +18,7 @@ We focus on the developer and devops experience of running applications in Kuber
* Show early features/demos of tools that make running apps easier

## Non-goals:
-* Our job is not to go implement stacks. We’re helping people to help themselves. We will help connect people to the right folks * but we do not want to own a set of examples (as a group)
+* Our job is not to go implement stacks. We're helping people to help themselves. We will help connect people to the right folks * but we do not want to own a set of examples (as a group)
* Do not endorse one particular tool
* Do not pick which apps to run on top of the platform
* Do not recommend one way to do things
@@ -4,7 +4,7 @@
* Intro from Michelle
* Discussion on the future of the SIG:
* Mike from Rackspace offered to do a demo of the recursive functionality ([issue](https://github.com/kubernetes/kubernetes/pull/25110))
-* Idea: solicit the community for cases where their use cases aren’t met.
+* Idea: solicit the community for cases where their use cases aren't met.
* Demo from Prashanth B on PetSets ([issue](https://github.com/kubernetes/kubernetes/issues/260))
* Supposed to make deploying and managing stateful apps easier. Will be alpha in 1.3.
* Zookeeper, mysql, cassandra are example apps to run in this
@@ -2,7 +2,7 @@

- Intro by Michelle Noorali
- Adnan Abdulhussein, Stacksmith lead at Bitnami, did a demo of Stacksmith
-- In the container world, updates to your application’s stack or environment are rolled out by bringing down outdated containers and replacing them with an updated container image. Tools like Docker and Kubernetes make it incredibly easy to do this, however, knowing when your stack is outdated or vulnerable and starting the upgrade process is still a manual step. Stacksmith is a service that aims to solve this by maintaining your base Dockerfiles and proactively keeping them up-to-date and secure. This demo walked through how you can use Stacksmith with your application on GitHub to provide continuous delivery of your application container images.
+- In the container world, updates to your application's stack or environment are rolled out by bringing down outdated containers and replacing them with an updated container image. Tools like Docker and Kubernetes make it incredibly easy to do this, however, knowing when your stack is outdated or vulnerable and starting the upgrade process is still a manual step. Stacksmith is a service that aims to solve this by maintaining your base Dockerfiles and proactively keeping them up-to-date and secure. This demo walked through how you can use Stacksmith with your application on GitHub to provide continuous delivery of your application container images.
- Adnan is available as @prydonius on the Kubernetes slack as well as on [twitter](https://twitter.com/prydonius) for questions and feedback.
- Feel free to leave feedback on the [Stacksmith](https://stacksmith.bitnami.com/) feedback tab.
- Matt Farina gave an update on the SIG-Apps survey.
@@ -25,7 +25,7 @@ A characteristic list of issues (as not all of them were well captured in GitHub
5. External object bleeding. Much of the logic was centered on a state machine that lived in the kubelet. Other kube components had to be aware of the state machine and other aspects of the binding framework to use Volumes.
6. Maintenance was difficult as this work was implemented in three different controllers that spread the logic for provisioning, binding, and recycling Volumes.
7. Kubelet failures on the Node could “strand” storage. Requiring users to manually unmount storage.
-8. A pod’s long running detach routine could impact other pods as the operations run synchronously in the kubelet sync loop.
+8. A pod's long running detach routine could impact other pods as the operations run synchronously in the kubelet sync loop.
9. Nodes required elevated privileges to be able to trigger attach/detach. Ideally attach/detach should be triggered from master which is considered more secure (see Issue [#12399](https://github.com/kubernetes/kubernetes/issues/12399)).

Below are the Github Issues that were filed for this area:
@@ -41,7 +41,7 @@ Below are the Github Issues that were filed for this area:
## How Did We Solve the Problem?
Addressing these issues was the main deliverable for storage in 1.3. This required an in depth rewrite of several components.

-Early in the 1.3 development cycle (March 28 to April 1, 2016) several community members in the Storage SIG met at a week long face-to-face summit at Google’s office in Mountain View to address these issues. A plan was established to approach the attach/detach/mount/unmount issues as a deliberate effort with contributors already handling the design. Since that work was already in flight and a plan established, the majority of the summit was devoted to resolving the PV/PVC controller issues. Meeting notes were captured [in this document](https://github.com/kubernetes/community/blob/master/sig-storage/1.3-retrospective/2016-03-28_Storage-SIG-F2F_Notes.pdf).
+Early in the 1.3 development cycle (March 28 to April 1, 2016) several community members in the Storage SIG met at a week long face-to-face summit at Google's office in Mountain View to address these issues. A plan was established to approach the attach/detach/mount/unmount issues as a deliberate effort with contributors already handling the design. Since that work was already in flight and a plan established, the majority of the summit was devoted to resolving the PV/PVC controller issues. Meeting notes were captured [in this document](https://github.com/kubernetes/community/blob/master/sig-storage/1.3-retrospective/2016-03-28_Storage-SIG-F2F_Notes.pdf).

Three projects were planned to fix the issues outlined above:
* PV/PVC Controller Redesign (a.k.a. Provisioner/Binder/Recycler controller)
@@ -64,7 +64,7 @@ The Kubelet Volume Redesign involved changing fundamental assumptions of data fl

1. **Release delay**
* The large amount of churn so late in the release with little stabilization time resulted in the delay of the release by one week: The Kubernetes 1.3 release [was targeted](https://github.com/kubernetes/features/blob/master/release-1.3/release-1.3.md) for June 20 to June 24, 2016. It ended up [going out on July 1, 2016](https://github.com/kubernetes/kubernetes/releases/tag/v1.3.0). This was mostly due to the time to resolve a data corruption issue on ungracefully terminated pods caused by detaching of mounted volumes ([#27691](https://github.com/kubernetes/kubernetes/issues/27691)). A large number of the bugs introduced in the release were fixed in the 1.3.4 release which [was cut on August 1, 2016](https://github.com/kubernetes/kubernetes/releases/tag/v1.3.4).
-2. **Instability in 1.3’s Storage stack**
+2. **Instability in 1.3's Storage stack**
* The Kubelet volume redesign shipped in 1.3.0 with several bugs. These were mostly due to unexpected interactions between the new functionality and other Kubernetes components. For example, secrets were handled serially not in parallel, namespace dependencies were not well understood, etc. Most of these issues were quickly identified and addressed but waited for 1.3 patch releases.
* Issues related to this include:
* PVC Volume will not detach if PVC or PV is deleted before pod ([#29051](https://github.com/kubernetes/kubernetes/issues/29051))
@@ -93,4 +93,4 @@ The value of the feature freeze date is to ensure the release has time to stabil
2. Establish a formal exception process for merging large changes after feature complete dates.
* Status: [Drafted as of 1.4](https://github.com/kubernetes/features/blob/master/EXCEPTIONS.md)

-Kubernetes is an incredibly fast moving project, with hundreds of active contributors creating a solution that thousands of organization rely on. Stability, trust, and openness are paramount in both the product and the community around Kubernetes. We undertook this retrospective effort to learn from the 1.3 release’s shipping delay. These action items and other work in the upcoming releases are part of our commitment to continually improve our project, our community, and our ability to deliver production-grade infrastructure platform software.
+Kubernetes is an incredibly fast moving project, with hundreds of active contributors creating a solution that thousands of organization rely on. Stability, trust, and openness are paramount in both the product and the community around Kubernetes. We undertook this retrospective effort to learn from the 1.3 release's shipping delay. These action items and other work in the upcoming releases are part of our commitment to continually improve our project, our community, and our ability to deliver production-grade infrastructure platform software.
@@ -1,6 +1,6 @@
# sig-testing

-The Kubernetes Testing SIG (sig-testing) is a working group within the Kubernetes contributor community interested in how we can most effectively test Kubernetes. We’re interested specifically in making it easier for the community to run tests and contribute test results, to ensure Kubernetes is stable across a variety of cluster configurations and cloud providers.
+The Kubernetes Testing SIG (sig-testing) is a working group within the Kubernetes contributor community interested in how we can most effectively test Kubernetes. We're interested specifically in making it easier for the community to run tests and contribute test results, to ensure Kubernetes is stable across a variety of cluster configurations and cloud providers.

## video conference

@@ -10,7 +10,7 @@ We meet weekly on Tuesdays at 9:30am PDT (16:30 UTC) at [this zoom room](https:/

We use [a public google doc](https://docs.google.com/document/d/1z8MQpr_jTwhmjLMUaqQyBk1EYG_Y_3D4y4YdMJ7V1Kk) to track proposed agenda items, as well as take notes during meetings.

-The agenda is open for comment. Please contact the organizers listed below if you’d like to propose a topic. Typically in the absence of anything formal we poll attendees for topics, and discuss tactical work.
+The agenda is open for comment. Please contact the organizers listed below if you'd like to propose a topic. Typically in the absence of anything formal we poll attendees for topics, and discuss tactical work.

## slack

@@ -22,7 +22,7 @@ Signup for access at http://slack.kubernetes.io/
- [our github team: @kubernetes/sig-testing](https://github.com/orgs/kubernetes/teams/sig-testing)
- [issues mentioning @kubernetes/sig-testing](https://github.com/issues?q=is%3Aopen+team%3Akubernetes%2Fsig-testing)

-We use the @kubernetes/sig-testing team to notify SIG members of particular issues or PR’s of interest. If you would like to be added to this team, please contact the organizers listed below.
+We use the @kubernetes/sig-testing team to notify SIG members of particular issues or PR's of interest. If you would like to be added to this team, please contact the organizers listed below.

## google group
