Fix examples, remove node annotation on PVC, update block device handling.

This commit is contained in:
Michelle Au 2017-03-08 17:45:15 -08:00
parent 1a68dab16b
commit 8c08c3cf2e
1 changed file with 136 additions and 127 deletions

@@ -73,12 +73,12 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
metadata:
  name: foo
status:
  capacity:
    storage.kubernetes.io/runtime: 100Gi
    storage.kubernetes.io/root: 100Gi
  allocatable:
    storage.kubernetes.io/runtime: 100Gi
    storage.kubernetes.io/root: 90Gi
```
2. Alice adds new storage resource requirements to her pod, specifying limits for the container's writeable and overlay layers, and emptyDir volumes.
@@ -92,15 +92,18 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
  containers:
  - name: fooc
    resources:
      limits:
        storage.kubernetes.io/logs: 500Mi
        storage.kubernetes.io/overlay: 1Gi
    volumeMounts:
    - name: myEmptyDir
      mountPath: /mnt/data
  volumes:
  - name: myEmptyDir
    emptyDir:
      resources:
        limits:
          size: 20Gi
```
3. Alice's pod “foo” is Guaranteed a total of “21.5Gi” of local storage (500Mi logs + 1Gi writable layer + 20Gi emptyDir). The container “fooc” in her pod cannot consume more than 1Gi for its writable layer and 500Mi for logs, and the “myEmptyDir” volume cannot consume more than 20Gi.
@@ -123,6 +126,9 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
spec:
  containers:
  - name: fooc
    volumeMounts:
    - name: myEmptyDir
      mountPath: /mnt/data
  volumes:
  - name: myEmptyDir
    emptyDir:
@@ -141,7 +147,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
      storage.kubernetes.io/overlay: 200Mi
    type: Container
  - default:
      size: 1Gi
    type: EmptyDir
```
@@ -159,12 +165,15 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
      limits:
        storage.kubernetes.io/logs: 200Mi
        storage.kubernetes.io/overlay: 200Mi
    volumeMounts:
    - name: myEmptyDir
      mountPath: /mnt/data
  volumes:
  - name: myEmptyDir
    emptyDir:
      resources:
        limits:
          size: 1Gi
```
4. Bob's “foo” pod can use up to “200Mi” each for its container's logs and writable layer, and “1Gi” for its “myEmptyDir” volume.
@@ -179,15 +188,18 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
  containers:
  - name: fooc
    resources:
      requests:
        storage.kubernetes.io/logs: 500Mi
        storage.kubernetes.io/overlay: 500Mi
    volumeMounts:
    - name: myEmptyDir
      mountPath: /mnt/data
  volumes:
  - name: myEmptyDir
    emptyDir:
      resources:
        limits:
          size: 2Gi
```
6. It is recommended to require `limits` to be specified for `storage` in all pods. `storage` will not affect the `QoS` class of a pod, since no SLA is intended to be provided for storage capacity isolation. It is recommended to use Persistent Durable Volumes as much as possible and to avoid primary partitions.
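Since pods must specify a value for any compute resource that a `ResourceQuota` in their namespace constrains, one way an administrator could enforce this recommendation is a quota over the proposed storage resources. The quota keys below are an assumption extrapolated from this proposal's resource names, not an existing API:
```yaml
# Hypothetical sketch: quota keys for the proposed local storage resources are
# assumed to follow the limits.<resource> convention used for cpu and memory.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: local-storage-quota
  namespace: bob-batch                             # illustrative namespace
spec:
  hard:
    limits.storage.kubernetes.io/logs: 10Gi        # assumed quota key
    limits.storage.kubernetes.io/overlay: 20Gi     # assumed quota key
```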
@@ -195,7 +207,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
### Alice manages a Database which needs access to “durable” and fast scratch space
1. Cluster administrator provisions machines with local SSDs and brings up the cluster
2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well-known location and creates Local PVs for them if one doesn't exist already. The PVs will include a path to the secondary device mount points, and a new node annotation that ties the volume to a specific node. A StorageClass name that is prefixed with "local-" is required for the system to be able to differentiate between local and remote storage. Labels may also be specified. The volume consumes the entire partition. (A sketch of what such an addon DaemonSet might look like follows the example output below.)
```yaml
kind: PersistentVolume
@@ -204,7 +216,6 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
  name: local-pv-1
  annotations:
    volume.kubernetes.io/node: node-1
spec:
  capacity:
    storage: 100Gi
@@ -213,6 +224,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-fast
```
```
$ kubectl get pv
@@ -224,7 +236,7 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
local-pv-1   100Gi   RWO   Delete   Available   node-3
local-pv-2   10Gi    RWO   Delete   Available   node-3
```
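The addon DaemonSet itself is not shown in this excerpt. The following is a minimal, hypothetical sketch of what such a discovery addon could look like, assuming `/mnt/disks` as the well-known mount location and a made-up provisioner image; none of these names come from the proposal:
```yaml
# Hypothetical sketch: a DaemonSet that watches a well-known directory on every
# node and creates Local PVs for the secondary partitions mounted there.
# Image name, namespace, and mount path are assumptions for illustration.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: local-volume-provisioner
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: local-volume-provisioner
    spec:
      containers:
      - name: provisioner
        image: example.com/local-volume-provisioner:v0.1   # hypothetical image
        volumeMounts:
        - name: local-disks
          mountPath: /mnt/disks     # well-known location for secondary partitions
      volumes:
      - name: local-disks
        hostPath:
          path: /mnt/disks
```
Running the addon as a DaemonSet gives one discovery pod per node, which is what allows each node's secondary partitions to be turned into node-local PVs.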
3. Alice creates a StatefulSet that uses local PVCs. The StorageClass prefix of "local-" indicates that the user wants local storage. The PVC will only be bound to PVs that match the StorageClass name.
```yaml
apiVersion: apps/v1beta1
@@ -254,27 +266,23 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: local-fast
      resources:
        requests:
          storage: 100Gi
  - metadata:
      name: log
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: local-slow
      resources:
        requests:
          storage: 1Gi
```
4. The scheduler identifies nodes for each pod that can satisfy its cpu, memory, and storage requirements and that also contain available local PVs to satisfy the pod's PVC claims. It then binds the pod's PVCs to specific PVs on the node and then binds the pod to the node.
```
$ kubectl get pvc
NAME      STATUS    VOLUME      CAPACITY   ACCESSMODES   …   NODE
@@ -297,35 +305,27 @@ Since local PVs are only accessible from specific nodes, a new PV-node associati
```
5. If a pod dies and is replaced by a new one that reuses existing PVCs, the pod will be placed on the same node where the corresponding PVs exist. Stateful Pods are expected to have a high enough priority, which will result in such pods preempting other low-priority pods if necessary to run on a specific node.
6. Forgiveness policies can be specified as tolerations in the pod spec for each failure scenario. No toleration specified means that the failure is not tolerated. In that case, the PVC will immediately be unbound, and the pod will be rescheduled to obtain a new PV. If a toleration is set, by default, it will be tolerated forever. `tolerationSeconds` can be specified to allow for a timeout period before the PVC gets unbound.
Node taints already exist today. New PV and scheduling taints can be added to handle additional failure use cases when using local storage. A new PV taint will be introduced to handle unhealthy volumes. The addon or another external entity can monitor the volumes and add a taint when it detects that it is unhealthy. A scheduling taint could signal a scheduling failure for the pod due to being unable to fit on the node.
```yaml
nodeTolerations:
- key: node.alpha.kubernetes.io/notReady
  operator: TolerationOpExists
  tolerationSeconds: 600
- key: node.alpha.kubernetes.io/unreachable
  operator: TolerationOpExists
  tolerationSeconds: 1200
pvTolerations:
- key: storage.kubernetes.io/pvUnhealthy
  operator: TolerationOpExists
schedulingTolerations:
- key: scheduler.kubernetes.io/podCannotFit
  operator: TolerationOpExists
  tolerationSeconds: 600
```
7. Once Alice decides to delete the database, she destroys the StatefulSet, and then destroys the PVCs. The PVs will then get deleted and cleaned up according to the reclaim policy, and the addon adds them back to the cluster.
### Bob manages a distributed filesystem which needs access to all available storage on each node
@@ -354,20 +354,20 @@ volumes:
  - name: fooc
    resources:
      limits:
        storage.kubernetes.io/logs: 500Mi
        storage.kubernetes.io/overlay: 1Gi
    volumeMounts:
    - name: myEphemeralPersistentVolume
      mountPath: /mnt/tmpdata
  volumes:
  - name: myEphemeralPersistentVolume
    inline:
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: local-fast
        resources:
          limits:
            size: 1Gi
```
4. Phippy notices some of her pods are experiencing spurious downtimes. With the help of monitoring (`iostat`), she notices that the nodes her pods are running on are overloaded with I/O operations. She then updates her pods to use Logging Volumes, which are backed by persistent storage. If a logging volumeMount is associated with a container, Kubelet will place log data from stdout & stderr of the container under the volume mount path within the container. Kubelet will continue to expose stdout/stderr log data to external logging agents using symlinks as it does already.
@@ -380,39 +380,36 @@ volumes:
spec:
  containers:
  - name: fooc
    volumeMounts:
    - name: myLoggingVolume
      mountPath: /var/log/
      policy:
        logDir:
          subDir: foo
          glob: "*.log"
  - name: barc
    volumeMounts:
    - name: myInMemoryLogVolume
      mountPath: /var/log/
      policy:
        logDir:
          subDir: bar
          glob: "*.log"
  volumes:
  - name: myLoggingVolume
    inline:
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: local-slow
        resources:
          requests:
            storage: 1Gi
  - name: myInMemoryLogVolume
    emptyDir:
      medium: memory
      resources:
        limits:
          size: 100Mi
```
5. Phippy notices some of her pods are suffering hangs while writing to their writable layer. Phippy again notices that I/O contention is the root cause, and she updates her Pod Spec to use memory-backed or persistent volumes for her pods' writable layer. Kubelet will instruct the runtimes to overlay the volume with the `overlay` policy over the writable layer of the container.
@@ -425,41 +422,38 @@ volumes:
spec:
  containers:
  - name: fooc
    volumeMounts:
    - name: myWritableLayer
      policy:
        overlay:
          subDir: foo
  - name: barc
    volumeMounts:
    - name: myDurableWritableLayer
      policy:
        overlay:
          subDir: bar
  volumes:
  - name: myWritableLayer
    emptyDir:
      medium: memory
      resources:
        limits:
          size: 100Mi
  - name: myDurableWritableLayer
    inline:
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: local-fast
        resources:
          requests:
            storage: 1Gi
```
### Bob manages a specialized application that needs access to Block level storage
1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet.
2. The same addon DaemonSet can discover block devices in the same directory as the filesystem mount points and create corresponding PVs for them with a new `volumeType = block` field. This field indicates the original volume type upon PV creation.
```yaml
kind: PersistentVolume
@@ -468,9 +462,8 @@ volumes:
  name: foo
  annotations:
    storage.kubernetes.io/node: node-1
spec:
  volumeType: block
  capacity:
    storage: 100Gi
  local:
@@ -478,6 +471,7 @@ volumes:
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-fast
```
3. Bob creates a pod with a PVC that requests block level access and, similar to the StatefulSet scenario, the scheduler will identify nodes that can satisfy the pod's request. The block devices will not be formatted, so that the application can handle the device using its own methods. (A hedged sketch of a pod consuming such a claim follows the example below.)
@@ -487,17 +481,36 @@ volumes:
apiVersion: v1
metadata:
  name: myclaim
spec:
  volumeType: block
  storageClassName: local-fast
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 80Gi
```
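For illustration only, a pod could reference the claim above like any other PVC. Since this excerpt does not define how an unformatted device is surfaced inside the container, the sketch below (with assumed pod and image names) omits any mount:
```yaml
# Hypothetical pod referencing the block-mode claim "myclaim" from above.
# The pod name and image are illustrative assumptions; no volumeMounts are
# shown because this excerpt does not specify how a raw device is exposed
# to the container.
apiVersion: v1
kind: Pod
metadata:
  name: bob-block-app
spec:
  containers:
  - name: app
    image: example.com/block-consumer:v0.1   # hypothetical image
  volumes:
  - name: rawdevice
    persistentVolumeClaim:
      claimName: myclaim
```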
4. It is also possible for a PVC that requests `volumeType: file` to use a PV with `volumeType: block`, if no file-based PVs are available. In this situation, the block device would get formatted with the filesystem type specified in the PV spec. When the PV gets destroyed, the filesystem also gets destroyed to return to the original block state. (A sketch of such a file-mode claim follows the PV example below.)
```yaml
kind: PersistentVolume
apiVersion: v1
metadata:
  name: foo
  annotations:
    storage.kubernetes.io/node: node-1
spec:
  volumeType: block
  capacity:
    storage: 100Gi
  local:
    path: /var/lib/kubelet/storage-raw-devices/foo
    fsType: ext4
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-fast
```
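As a companion to the PV above, a file-mode claim that could fall back onto that block PV might look like the following sketch; it simply mirrors the block-mode claim from item 3, and the claim name and size are assumptions:
```yaml
# Hypothetical file-mode claim. If bound to the block PV above, the device
# would be formatted with the PV's fsType (ext4 here) and wiped back to a raw
# block device when the PV is reclaimed.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myfileclaim          # illustrative name
spec:
  volumeType: file
  storageClassName: local-fast
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 80Gi
```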
*The lifecycle of the block level PV will be similar to that of the scenarios explained earlier.*
@@ -508,10 +521,6 @@ volumes:
* Local Persistent Volume bindings happening in the scheduler vs in PV controller
  * Should the PV controller fold into the scheduler
  * This will help spread PVs and pods across matching zones.
* Repair/replace scenarios.
  * What are the implications of removing a disk and replacing it with a new one?
  * We may not do anything in the system, but may need a special workflow