Merge pull request #699 from ddysher/update-local-storage-overview

Update local storage overview
Michelle Au 2017-06-16 10:57:41 -07:00 committed by GitHub
commit 77fb10f952
1 changed file with 51 additions and 47 deletions


@@ -47,13 +47,13 @@ A node's local storage can be broken into primary and secondary partitions.
Primary partitions are shared partitions that can provide ephemeral local storage. The two supported primary partitions are:
### Root
This partition holds the kubelet's root directory (`/var/lib/kubelet` by default) and the `/var/log` directory. This partition may be shared between user pods, the OS, and Kubernetes system daemons. This partition can be consumed by pods via EmptyDir volumes, container logs, image layers, and container writable layers. Kubelet will manage shared access and isolation of this partition. This partition is “ephemeral” and applications cannot expect any performance SLAs (Disk IOPS, for example) from it.
### Runtime
This is an optional partition which runtimes can use for overlay filesystems. Kubelet will attempt to identify and provide shared access along with isolation to this partition. Container image layers and writable layers are stored here. If the runtime partition exists, the `root` partition will not hold any image layers or writable layers.
## Secondary Partitions
All other partitions are exposed as local persistent volumes. Each local volume uses an entire partition. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details from the pod. All the local PVs can be queried and viewed from a cluster level using the existing PV object. Applications can continue to use their existing PVC specifications with minimal changes to request local storage.
All other partitions are exposed as local persistent volumes. The PV interface allows for varying storage configurations to be supported, while hiding specific configuration details from the pod. All the local PVs can be queried and viewed from a cluster level using the existing PV object. Applications can continue to use their existing PVC specifications with minimal changes to request local storage.
The local PVs can be precreated by an addon DaemonSet that discovers all the secondary partitions at well-known directories, and can create new PVs as partitions are added to the node. A default addon can be provided to handle common configurations.
@@ -61,6 +61,8 @@ Local PVs can only provide semi-persistence, and are only suitable for specific
Since local PVs are only accessible from specific nodes, the scheduler needs to take into account a PV's node constraint when placing pods. This can be generalized to a storage topology constraint, which can also work with zones, and in the future: racks, clusters, etc.
The term `partition` is used here to describe the main use cases for local storage. However, the proposal doesn't require a local volume to be an entire disk or partition - it supports arbitrary directories. This implies that a cluster administrator can create multiple local volumes within one partition, each reporting the capacity of the whole partition, or even create local volumes under the primary partitions. Unless strictly required (e.g. the host has only one partition), this is strongly discouraged. For this reason, the following description uses `partition` or `mount point` exclusively.
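To illustrate the discouraged case, here is a minimal sketch (the paths, names, and capacities are hypothetical; the `local` volume source and the `local-fast` StorageClass follow the format used later in this proposal, and node affinity is omitted for brevity): two local PVs backed by directories on the same 100Gi partition, each advertising the full capacity of that partition, so their reported capacities overlap and are not isolated from each other.

```yaml
# Hypothetical: two local volumes carved out of one 100Gi partition
# mounted at /mnt/disks/ssd1. Each PV reports the capacity of the whole
# partition, so the advertised capacities overlap.
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-dir-a
spec:
  capacity:
    storage: 100Gi
  local:
    path: /mnt/disks/ssd1/dir-a
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-fast
---
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-dir-b
spec:
  capacity:
    storage: 100Gi
  local:
    path: /mnt/disks/ssd1/dir-b
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-fast
```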
# User Workflows
### Alice manages a deployment and requires “Guaranteed” ephemeral storage
@@ -94,20 +96,20 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
resources:
limits:
storage.kubernetes.io/logs: 500Mi
storage.kubernetes.io/writable: 1Gi
storage.kubernetes.io/overlay: 1Gi
volumeMounts:
- name: myEmptyDir
mountPath: /mnt/data
volumes:
- name: myEmptyDir
emptyDir:
      sizeLimit: 20Gi
```
3. Alice's pod “foo” is Guaranteed a total of “21.5Gi” of local storage. The container “fooc” in her pod cannot consume more than 1Gi for its writable layer and 500Mi for logs, and the “myEmptyDir” volume cannot consume more than 20Gi. (A consolidated manifest for this pod is sketched below.)
4. For the pod resources, `storage.kubernetes.io/logs` resource is meant for logs. `storage.kubernetes.io/writable` is meant for the writable layer.
4. For the pod resources, the `storage.kubernetes.io/logs` resource is meant for logs. `storage.kubernetes.io/overlay` is meant for the writable layer.
5. `storage.kubernetes.io/logs` is satisfied by `storage.kubernetes.io/scratch`.
6. `storage.kubernetes.io/writable` resource can be satisfied by `storage.kubernetes.io/overlay` if exposed by nodes or by `storage.kubernetes.io/scratch` otherwise. The scheduler follows this policy to find an appropriate node which can satisfy the storage resource requirements of the pod.
6. `storage.kubernetes.io/overlay` resource can be satisfied by `storage.kubernetes.io/overlay` if exposed by nodes or by `storage.kubernetes.io/scratch` otherwise. The scheduler follows this policy to find an appropriate node which can satisfy the storage resource requirements of the pod.
7. EmptyDir.size is both a request and limit that is satisfied by `storage.kubernetes.io/scratch`.
8. Kubelet will rotate logs to keep scratch space usage of “fooc” under 500Mi.
9. Kubelet will track the usage of pods across logs and overlay filesystem and restart the container if its total usage exceeds its storage limits. If usage of an `EmptyDir` volume exceeds its `limit`, then the pod will be evicted by the kubelet. By performing soft limiting, users will be able to easily identify pods that run out of storage.
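The hunks above show only the `resources` and `volumes` stanzas of Alice's pod. A minimal sketch of what the complete manifest from step 2 might look like, assuming standard Pod boilerplate (the `image` is a placeholder; the storage values simply mirror the fragment above):

```yaml
# Hypothetical reassembly of pod "foo" for illustration only; apiVersion,
# kind, metadata and the container image are assumed, while the resource
# limits and emptyDir sizeLimit mirror the fragment shown above.
apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
  - name: fooc
    image: registry.example.com/fooc:latest   # placeholder image
    resources:
      limits:
        storage.kubernetes.io/logs: 500Mi
        storage.kubernetes.io/overlay: 1Gi
    volumeMounts:
    - name: myEmptyDir
      mountPath: /mnt/data
  volumes:
  - name: myEmptyDir
    emptyDir:
      sizeLimit: 20Gi
```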
@@ -145,7 +147,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
spec:
- default:
storage.kubernetes.io/logs: 200Mi
storage.kubernetes.io/writable: 200Mi
storage.kubernetes.io/overlay: 200Mi
type: Container
- default:
sizeLimit: 1Gi
@@ -165,14 +167,14 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
resources:
limits:
storage.kubernetes.io/logs: 200Mi
storage.kubernetes.io/writable: 200Mi
storage.kubernetes.io/overlay: 200Mi
volumeMounts:
- name: myEmptyDir
mountPath: /mnt/data
volumes:
- name: myEmptyDir
emptyDir:
      sizeLimit: 1Gi
```
4. Bob's “foo” pod can use up to “200Mi” each for its container's logs and writable layer, and “1Gi” for its “myEmptyDir” volume.
@@ -189,29 +191,28 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
resources:
requests:
storage.kubernetes.io/logs: 500Mi
storage.kubernetes.io/writable: 500Mi
storage.kubernetes.io/overlay: 500Mi
volumeMounts:
- name: myEmptyDir
mountPath: /mnt/data
volumes:
- name: myEmptyDir
emptyDir:
      sizeLimit: 2Gi
```
6. It is recommended to require `limits` to be specified for `storage` in all pods. `storage` will not affect the `QoS` Class of a pod since no SLA is intended to be provided for storage capacity isolation. it is recommended to use Persistent Volumes as much as possible and avoid primary partitions.
6. It is recommended to require `limits` to be specified for `storage` in all pods. `storage` will not affect the `QoS` Class of a pod since no SLA is intended to be provided for storage capacity isolation. It is recommended to use Persistent Volumes as much as possible and avoid primary partitions.
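To make the QoS point in step 6 concrete, a minimal sketch (pod name, container name, and image are placeholders): a pod that sets only the proposed storage limits would still be classified as `BestEffort`, because the QoS class is derived from cpu and memory requests and limits only.

```yaml
# Hypothetical pod with storage limits but no cpu/memory requests or limits.
# Its QoS class stays BestEffort; the storage values only affect scheduling
# against scratch capacity and kubelet enforcement.
apiVersion: v1
kind: Pod
metadata:
  name: storage-only-demo   # placeholder name
spec:
  containers:
  - name: worker            # placeholder name
    image: busybox          # placeholder image
    resources:
      limits:
        storage.kubernetes.io/logs: 200Mi
        storage.kubernetes.io/overlay: 200Mi
```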
### Alice manages a Database which needs access to “durable” and fast scratch space
1. The cluster administrator provisions machines with local SSDs and brings up the cluster.
2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well-known location and creates Local PVs for them if one doesn't exist already. The PVs will include a path to the secondary device mount points, and a hostname label ties the volume to a specific node. A StorageClass is required and will have a new optional field `toplogyKey`. This field tells the scheduler to filter PVs with the same `topologyKey` value on the node. The `topologyKey` can be any label key applied to a node. For the local storage case, the `topologyKey` is `kubernetes.io/hostname`, but the same mechanism could be used for zone constraints as well.
2. When a new node instance starts up, an addon DaemonSet discovers local “secondary” partitions which are mounted at a well-known location and creates Local PVs for them if one doesn't exist already. The PVs will include a path to the secondary device mount points, and a node affinity ties the volume to a specific node. The node affinity specification tells the scheduler to filter PVs with the same affinity key/value on the node. For the local storage case, the key is `kubernetes.io/hostname`, but the same mechanism could be used for zone constraints as well.
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
name: local-fast
provisioner: ""
toplogyKey: kubernetes.io/hostname
```
```yaml
@@ -219,14 +220,19 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
apiVersion: v1
metadata:
name: local-pv-1
labels:
kubernetes.io/hostname: node-1
spec:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- node-1
capacity:
storage: 100Gi
localStorage:
fs:
path: /var/lib/kubelet/storage-partitions/local-pv-1
local:
path: /var/lib/kubelet/storage-partitions/local-pv-1
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Delete
@@ -289,9 +295,10 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
```
4. The scheduler identifies nodes for each pod that can satisfy all the existing predicates.
5. The nodes list is further filtered by looking at the PVC's StorageClass `topologyKey`, and checking if there are enough available PVs that have the same `topologyKey` value as the node. In the case of local PVs, it checks that there are enough PVs with the same `kubernetes.io/hostname` value as the node.
5. The nodes list is further filtered by looking at the PVC's StorageClass, and checking if there is an available PV of the same StorageClass on the node.
6. The scheduler chooses a node for the pod based on a ranking algorithm.
7. Once the pod is assigned to a node, then the pod's local PVCs get bound to specific local PVs on the node.
```
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESSMODES … NODE
@@ -376,7 +383,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
resources:
limits:
storage.kubernetes.io/logs: 500Mi
storage.kubernetes.io/writable: 1Gi
storage.kubernetes.io/overlay: 1Gi
volumeMounts:
- name: myEphemeralPersistentVolume
mountPath: /mnt/tmpdata
@@ -475,7 +482,7 @@ Since local PVs are only accessible from specific nodes, the scheduler needs to
Note: Block access will be considered as a separate feature because it can work for both remote and local storage. The examples here are a suggestion on how such a feature can be applied to this local storage model, but are subject to change based on the final design for block access.
1. The cluster that Bob uses has nodes that contain raw block devices that have not been formatted yet.
2. The same addon DaemonSet can also discover block devices and create corresponding PVs for them with the `block` field.
2. The same addon DaemonSet can also discover block devices and create corresponding PVs for them with the `volumeType: block` spec. `path` is overloaded here to mean both the fs path and the block device path.
```yaml
kind: PersistentVolume
@@ -487,9 +494,9 @@ spec:
spec:
capacity:
storage: 100Gi
localStorage:
block:
device: /var/lib/kubelet/storage-raw-devices/foo
volumeType: block
local:
path: /var/lib/kubelet/storage-raw-devices/foo
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Delete
@@ -512,26 +519,23 @@ Note: Block access will be considered as a separate feature because it can work
requests:
storage: 80Gi
```
4. It is also possible for a PVC that requests `volumeType: file` to also use a block-based PV. In this situation, the block device would get formatted with the filesystem type specified in the PV spec. And when the PV gets destroyed, then the filesystem also gets destroyed to return back to the original block state.
4. It is also possible for a PVC that requests `volumeType: block` to also use a file-based volume. In this situation, the block device would get formatted with the filesystem type specified in the PVC spec. When the PVC gets destroyed, the filesystem also gets destroyed to return back to the original block state.
```yaml
kind: PersistentVolume
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: foo
labels:
kubernetes.io/hostname: node-1
name: myclaim
spec:
capacity:
storage: 100Gi
local:
block:
path: /var/lib/kubelet/storage-raw-devices/foo
fsType: ext4
volumeType: block
fsType: ext4
storageClassName: local-fast
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Delete
storageClassName: local-fast
resources:
requests:
storage: 80Gi
```
*The lifecycle of the block level PV will be similar to that of the scenarios explained earlier.*