Merge pull request #699 from ddysher/update-local-storage-overview

Update local storage overview
Michelle Au 2017-06-16 10:57:41 -07:00 committed by GitHub
commit 77fb10f952
1 changed file with 51 additions and 47 deletions


@@ -47,13 +47,13 @@ A node's local storage can be broken into primary and secondary partitions.
Primary partitions are shared partitions that can provide ephemeral local storage. The two supported primary partitions are:
### Root
This partition holds the kubelet's root directory (`/var/lib/kubelet` by default) and the `/var/log` directory. This partition may be shared between user pods, the OS, and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers, and container writable layers. Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IOPS, for example) from it.
### Runtime
This is an optional partition which runtimes can use for overlay filesystems. Kubelet will attempt to identify and provide shared access along with isolation to this partition. Container image layers and writable layers are stored here. If the runtime partition exists, the `root` partition will not hold any image layers or writable layers.
## Secondary Partitions
All other partitions are exposed as local persistent volumes. Each local volume uses an entire partition. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details from the pod. All the local PVs can be queried and viewed from a cluster level using the existing PV object. Applications can continue to use their existing PVC specifications with minimal changes to request local storage.
All other partitions are exposed as local persistent volumes. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details from the pod. All the local PVs can be queried and viewed from a cluster level using the existing PV object. Applications can continue to use their existing PVC specifications with minimal changes to request local storage.
The local PVs can be precreated by an addon DaemonSet that discovers all the secondary partitions at well-known directories, and can create new PVs as partitions are added to the node. A default addon can be provided to handle common configurations.
@@ -61,6 +61,8 @@ Local PVs can only provide semi-persistence, and are only suitable for specific
Since local PVs are only accessible from specific nodes, the scheduler needs to take into account a PV's node constraint when placing pods. This can be generalized to a storage topology constraint, which can also work with zones, and in the future: racks, clusters, etc.
The term `partition` is used here to describe the main use cases for local storage. However, the proposal doesn't require a local volume to be an entire disk or partition - it supports arbitrary directories. This implies that a cluster administrator can create multiple local volumes within one partition, each reporting the capacity of the whole partition, or even create local volumes under the primary partitions. Unless strictly required (e.g. the host has only one partition), this is strongly discouraged. For this reason, the following description uses `partition` or `mount point` exclusively.
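To illustrate the discouraged case, here is a minimal sketch (the paths, names, and capacities are hypothetical; the `local` volume source and the `local-fast` StorageClass follow the format used later in this proposal, and node affinity is omitted for brevity): two local PVs backed by directories on the same 100Gi partition, each advertising the full capacity of that partition, so their reported capacities overlap and are not isolated from each other.

```yaml
# Hypothetical: two local volumes carved out of one 100Gi partition
# mounted at /mnt/disks/ssd1. Each PV reports the capacity of the whole
# partition, so the advertised capacities overlap.
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-dir-a
spec:
  capacity:
    storage: 100Gi
  local:
    path: /mnt/disks/ssd1/dir-a
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-fast
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-dir-b
spec:
  capacity:
    storage: 100Gi
  local:
    path: /mnt/disks/ssd1/dir-b
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-fast
```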
# User Workflows
### Alice manages a deployment and requires “Guaranteed” ephemeral storage
@@ -94,20 +96,20 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
resources:
limits:
storage.kubernetes.io/logs: 500Mi
storage.kubernetes.io/writable: 1Gi
storage.kubernetes.io/overlay: 1Gi
volumeMounts:
- name: myEmptyDir
mountPath: /mnt/data
volumes:
- name: myEmptyDir
emptyDir:
      sizeLimit: 20Gi
```
3. Alice's pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for its writable layer and 500Mi for logs, and the “myEmptyDir” volume cannot consume more than 20Gi. (A consolidated manifest for this pod is sketched below.)
4. For the pod resources, `storage.kubernetes.io/logs` resource is meant for logs. `storage.kubernetes.io/writable` is meant for the writable layer.
4. For the pod resources, the `storage.kubernetes.io/logs` resource is meant for logs. `storage.kubernetes.io/overlay` is meant for the writable layer.
5. `storage.kubernetes.io/logs` is satisfied by `storage.kubernetes.io/scratch`.
6. `storage.kubernetes.io/writable` resource can be satisfied by `storage.kubernetes.io/overlay` if exposed by nodes or by `storage.kubernetes.io/scratch` otherwise. The scheduler follows this policy to find an appropriate node which can satisfy the storage resource requirements of the pod.
6. `storage.kubernetes.io/overlay` resource can be satisfied by `storage.kubernetes.io/overlay` if exposed by nodes or by `storage.kubernetes.io/scratch` otherwise. The scheduler follows this policy to find an appropriate node which can satisfy the storage resource requirements of the pod.
7. EmptyDir.size is both a request and limit that is satisfied by `storage.kubernetes.io/scratch`.
8. Kubelet will rotate logs to keep scratch space usage of “fooc” under 500Mi.
9. Kubelet will track the usage of pods across logs and overlay filesystem and restart the container if its total usage exceeds its storage limits. If usage of an `EmptyDir` volume exceeds its `limit`, then the pod will be evicted by the kubelet. By performing soft limiting, users will be able to easily identify pods that run out of storage.
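The hunks above show only the `resources` and `volumes` stanzas of Alice's pod. A minimal sketch of what the complete manifest from step 2 might look like, assuming standard Pod boilerplate (the `image` is a placeholder; the storage values simply mirror the fragment above):

```yaml
# Hypothetical reassembly of pod "foo" for illustration only; apiVersion,
# kind, metadata and the container image are assumed, while the resource
# limits and emptyDir sizeLimit mirror the fragment shown above.
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
  - name: fooc
    image: registry.example.com/fooc:latest   # placeholder image
    resources:
      limits:
        storage.kubernetes.io/logs: 500Mi
        storage.kubernetes.io/overlay: 1Gi
    volumeMounts:
    - name: myEmptyDir
      mountPath: /mnt/data
  volumes:
  - name: myEmptyDir
    emptyDir:
      sizeLimit: 20Gi
```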
@@ -145,7 +147,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
spec:
- default:
storage.kubernetes.io/logs: 200Mi
storage.kubernetes.io/writable: 200Mi
storage.kubernetes.io/overlay: 200Mi
type: Container
- default:
sizeLimit: 1Gi
@@ -165,14 +167,14 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
resources:
limits:
storage.kubernetes.io/logs: 200Mi
storage.kubernetes.io/writable: 200Mi
storage.kubernetes.io/overlay: 200Mi
volumeMounts:
- name: myEmptyDir
mountPath: /mnt/data
volumes:
- name: myEmptyDir
emptyDir:
      sizeLimit: 1Gi
```
4. Bob's “foo” pod can use up to “200Mi” each for its container's logs and writable layer, and “1Gi” for its “myEmptyDir” volume.
@@ -189,29 +191,28 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
resources:
requests:
storage.kubernetes.io/logs: 500Mi
storage.kubernetes.io/writable: 500Mi
storage.kubernetes.io/overlay: 500Mi
volumeMounts:
- name: myEmptyDir
mountPath: /mnt/data
volumes:
- name: myEmptyDir
emptyDir:
      sizeLimit: 2Gi
```
6. It is recommended to require `limits` to be specified for `storage` in all pods. `storage` will not affect the `QoS` Class of a pod since no SLA is intended to be provided for storage capacity isolation. it is recommended to use Persistent Volumes as much as possible and avoid primary partitions.
6. It is recommended to require `limits` to be specified for `storage` in all pods. `storage` will not affect the `QoS` Class of a pod since no SLA is intended to be provided for storage capacity isolation. It is recommended to use Persistent Volumes as much as possible and avoid primary partitions.
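To make the QoS point in step 6 concrete, a minimal sketch (pod name, container name, and image are placeholders): a pod that sets only the proposed storage limits would still be classified as `BestEffort`, because the QoS class is derived from cpu and memory requests and limits only.

```yaml
# Hypothetical pod with storage limits but no cpu/memory requests or limits.
# Its QoS class stays BestEffort; the storage values only affect scheduling
# against scratch capacity and kubelet enforcement.
apiVersion: v1
kind: Pod
metadata:
  name: storage-only-demo   # placeholder name
spec:
  containers:
  - name: worker            # placeholder name
    image: busybox          # placeholder image
    resources:
      limits:
        storage.kubernetes.io/logs: 200Mi
        storage.kubernetes.io/overlay: 200Mi
```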
### Alice manages a Database which needs access to “durable” and fast scratch space
1. The cluster administrator provisions machines with local SSDs and brings up the cluster.
2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well-known location and creates Local PVs for them if one doesn't exist already. The PVs will include a path to the secondary device mount points, and a hostname label ties the volume to a specific node. A StorageClass is required and will have a new optional field `toplogyKey`. This field tells the scheduler to filter PVs with the same `topologyKey` value on the node. The `topologyKey` can be any label key applied to a node. For the local storage case, the `topologyKey` is `kubernetes.io/hostname`, but the same mechanism could be used for zone constraints as well.
2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well-known location and creates Local PVs for them if one doesn't exist already. The PVs will include a path to the secondary device mount points, and a node affinity ties the volume to a specific node. The node affinity specification tells the scheduler to filter PVs with the same affinity key/value on the node. For the local storage case, the key is `kubernetes.io/hostname`, but the same mechanism could be used for zone constraints as well.
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
name: local-fast
provisioner: ""
toplogyKey: kubernetes.io/hostname
```
```yaml
@@ -219,14 +220,19 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
apiVersion: v1
metadata:
name: local-pv-1
labels:
kubernetes.io/hostname: node-1
spec:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- node-1
capacity:
storage: 100Gi
localStorage:
fs:
path: /var/lib/kubelet/storage-partitions/local-pv-1
local:
path: /var/lib/kubelet/storage-partitions/local-pv-1
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Delete
@@ -289,9 +295,10 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
```
4. The scheduler identifies nodes for each pod that can satisfy all the existing predicates.
5. The nodes list is further filtered by looking at the PVC's StorageClass `topologyKey`, and checking if there are enough available PVs that have the same `topologyKey` value as the node. In the case of local PVs, it checks that there are enough PVs with the same `kubernetes.io/hostname` value as the node.
5. The nodes list is further filtered by looking at the PVC's StorageClass, and checking if there is an available PV of the same StorageClass on the node.
6. The scheduler chooses a node for the pod based on a ranking algorithm.
7. Once the pod is assigned to a node, then the pod's local PVCs get bound to specific local PVs on the node.
```
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESSMODES … NODE
@@ -376,7 +383,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
resources:
limits:
storage.kubernetes.io/logs: 500Mi
storage.kubernetes.io/writable: 1Gi
storage.kubernetes.io/overlay: 1Gi
volumeMounts:
- name: myEphemeralPersistentVolume
mountPath: /mnt/tmpdata
@@ -475,7 +482,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
Note: Block access will be considered as a separate feature because it can work for both remote and local storage. The examples here are a suggestion on how such a feature can be applied to this local storage model, but are subject to change based on the final design for block access.
1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet.
2. The same addon DaemonSet can also discover block devices and create corresponding PVs for them with the `block` field.
2. The same addon DaemonSet can also discover block devices and create corresponding PVs for them with the `volumeType: block` spec. `path` is overloaded here to mean both the fs path and the block device path.
```yaml
kind: PersistentVolume
@@ -487,9 +494,9 @@ spec:
spec:
capacity:
storage: 100Gi
localStorage:
block:
device: /var/lib/kubelet/storage-raw-devices/foo
volumeType: block
local:
path: /var/lib/kubelet/storage-raw-devices/foo
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Delete
@@ -512,26 +519,23 @@ Note: Block access will be considered as a separate feature because it can work
requests:
storage: 80Gi
```
4. It is also possible for a PVC that requests `volumeType: file` to also use a block-based PV. In this situation, the block device would get formatted with the filesystem type specified in the PV spec. And when the PV gets destroyed, then the filesystem also gets destroyed to return back to the original block state.
4. It is also possible for a PVC that requests `volumeType: block` to also use a file-based volume. In this situation, the block device would get formatted with the filesystem type specified in the PVC spec. When the PVC gets destroyed, the filesystem also gets destroyed to return back to the original block state.
```yaml
kind: PersistentVolume
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: foo
labels:
kubernetes.io/hostname: node-1
name: myclaim
spec:
capacity:
storage: 100Gi
local:
block:
path: /var/lib/kubelet/storage-raw-devices/foo
fsType: ext4
volumeType: block
fsType: ext4
storageClassName: local-fast
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Delete
storageClassName: local-fast
resources:
requests:
storage: 80Gi
```
*The lifecycle of the block level PV will be similar to that of the scenarios explained earlier.*