# Local Storage Persistent Volumes

Authors: @msau42, @vishh

This document presents a detailed design for supporting persistent local storage,
as outlined in [Local Storage Overview](local-storage-overview.md).
Supporting all the use cases for persistent local storage will take many releases,
so this document will be extended for each new release as we add more features.

## Goals

* Allow pods to mount any local block or filesystem based volume.
* Allow pods to mount dedicated local disks, or channeled partitions, as volumes for
IOPS isolation.
* Allow pods to access local volumes without root privileges.
* Allow pods to access local volumes without needing to understand the storage
layout on every node.
* Persist local volumes and provide data gravity for pods. Any pod
using the local volume will be scheduled to the same node that the local volume
is on.
* Allow pods to release their local volume bindings and lose that volume's data
during failure conditions, such as node, storage, or scheduling failures, where
the volume is not accessible for some user-configurable time.
* Allow pods to specify local storage as part of a Deployment or StatefulSet.
* Allow administrators to set up and configure local volumes with simple methods.
* Do not require administrators to manage the local volumes once provisioned
for a node.

## Non-Goals

* Provide data availability for a local volume beyond its local node.
* Support the use of HostPath volumes and Local PVs on the same volume.

## Background

In Kubernetes, there are two main types of storage: remote and local.

Remote storage is typically used with persistent volumes where the data can
persist beyond the lifetime of the pod.

Local storage is typically used with ephemeral volumes where the data only
persists during the lifetime of the pod.

There is increasing demand for using local storage as persistent volumes,
especially for distributed filesystems and databases such as GlusterFS and
Cassandra. The main motivations for using persistent local storage instead
of persistent remote storage include:

* Performance: Local SSDs achieve higher IOPS and throughput than many
remote storage solutions.

* Cost: Operational costs may be reduced by leveraging existing local storage,
especially in bare metal environments. Network storage can be expensive to
set up and maintain, and it may not be necessary for certain applications.

## Use Cases

### Distributed filesystems and databases

Many distributed filesystem and database implementations, such as Cassandra and
GlusterFS, utilize the local storage on each node to form a storage cluster.
These systems typically have a replication feature that sends copies of the data
to other nodes in the cluster in order to provide fault tolerance in case of
node failures. Non-distributed, but replicated, databases like MySQL can also
utilize local storage to store replicas.

The main motivations for using local persistent storage are performance and
cost. Since the application handles data replication and fault tolerance, these
application pods do not need networked storage to provide shared access to data.
In addition, installing a high-performing NAS or SAN solution can be more
expensive, and more complex to configure and maintain, than utilizing local
disks, especially if the node was already pre-installed with disks. Datacenter
infrastructure and operational costs can be reduced by increasing storage
utilization.

These distributed systems are generally stateful, infrastructure applications
that provide data services to higher-level applications. They are expected to
run in a cluster with many other applications potentially sharing the same
nodes. Therefore, they expect to have high priority and node resource
guarantees. They are typically deployed using StatefulSets, custom
controllers, or operators.

### Caching

Caching is one of the recommended use cases for ephemeral local storage. The
cached data is backed by persistent storage, so local storage data durability is
not required. However, there is a use case for persistent local storage to
achieve data gravity for large caches. For large caches, if a pod restarts,
rebuilding the cache can take a long time. As an example, rebuilding a 100GB
cache from a hard disk with 150MB/s read throughput can take around 10 minutes.
If the service gets restarted and all the pods have to restart, then performance
and availability can be impacted while the pods are rebuilding. If the cache is
persisted, then cold startup latencies are reduced.

Content-serving applications and producer/consumer workflows commonly utilize
caches for better performance. They are typically deployed using Deployments,
and can be isolated in their own cluster or shared with other applications.

## Environments

### Baremetal

In a baremetal environment, nodes may be configured with multiple local disks of
varying capacity, speeds, and mediums. Mediums include spinning disks (HDDs) and
solid-state drives (SSDs), and capacities of each disk can range from hundreds
of GBs to tens of TBs. Multiple disks may be arranged in JBOD or RAID
configurations to be consumed as persistent storage.

Currently, the methods to use the additional disks are to:

* Configure a distributed filesystem
* Configure a HostPath volume

It is also possible to configure a NAS or SAN on a node. Speeds and
capacities will vary widely depending on the solution.

### GCE/GKE

GCE and GKE both have a local SSD feature that can create a VM instance with up
to 8 fixed-size 375GB local SSDs physically attached to the instance host, which
appear as additional disks in the instance. The local SSDs have to be
configured at VM creation time and cannot be dynamically attached to an
instance later. If the VM gets shut down, terminated, preempted, or the host
encounters a non-recoverable error, then the SSD data will be lost. If the
guest OS reboots, or a live migration occurs, then the SSD data will be
preserved.

### EC2

In EC2, the instance store feature attaches local HDDs or SSDs to a new instance
as additional disks. HDD capacities can go up to 24 x 2TB disks for the largest
configuration. SSD capacities can go up to 8 x 800GB disks or 2 x 2TB disks for the
largest configurations. Data on the instance store only persists across
instance reboots.

## Limitations of current volumes

The following is an overview of existing volume types in Kubernetes, and how
they cannot completely address the use cases for local persistent storage.

* EmptyDir: A temporary directory for a pod that is created under the kubelet
root directory.
The contents are deleted when a pod dies. Limitations:

  * Volume lifetime is bound to the pod lifetime. Pod failure is more likely
than node failure, so there can be increased network and storage activity to
recover data via replication and data backups when a replacement pod is started.
  * Multiple disks are not supported unless the administrator aggregates them
into a spanned or RAID volume. In this case, all the storage is shared, and
IOPS guarantees cannot be provided.
  * There is currently no method of distinguishing between HDDs and SSDs. The
"medium" field could be expanded, but it is not easily generalizable to
arbitrary types of mediums.

* HostPath: A direct mapping to a specified directory on the node. The
directory is not managed by the cluster. Limitations:

  * The admin needs to manually set up directory permissions for the volume's users.
  * The admin has to manage the volume lifecycle manually and clean up the data and
directories.
  * All nodes have to have their local storage provisioned the same way in order to
use the same pod template.
  * There can be path collision issues if multiple pods scheduled to the same
node want the same path.
  * If node affinity is specified, then the user has to do the pod scheduling
manually.

* Provider's block storage (GCE PD, AWS EBS, etc.): A remote disk that can be
attached to a VM instance. The disk's lifetime is independent of the pod's
lifetime. Limitations:

  * Doesn't meet performance requirements.
[Performance benchmarks on GCE](https://cloud.google.com/compute/docs/disks/performance)
show that local SSD can perform better than SSD persistent disks:

    * 16x read IOPS
    * 11x write IOPS
    * 6.5x read throughput
    * 4.5x write throughput

* Networked filesystems (NFS, GlusterFS, etc.): A filesystem reachable over the
network that can provide shared access to data. Limitations:

  * Requires more configuration and setup, which adds operational burden and
cost.
  * Requires a high-performance network to achieve performance equivalent to
local disks, especially when compared to high-performance SSDs.

Due to the current limitations in the existing volume types, a new method for
providing persistent local storage should be considered.

## Feature Plan

A detailed implementation plan can be found in the
[Storage SIG planning spreadsheet](https://docs.google.com/spreadsheets/d/1t4z5DYKjX2ZDlkTpCnp18icRAQqOE85C1T1r2gqJVck/view#gid=1566770776).
The following is a high level summary of the goals in each phase.

### Phase 1

* Support Pod, Deployment, and StatefulSet requesting a single local volume
* Support pre-configured, statically partitioned, filesystem-based local volumes

### Phase 2

* Block devices and raw partitions
* Smarter PV binding to consider local storage and pod scheduling constraints,
such as pod affinity/anti-affinity, and requesting multiple local volumes

### Phase 3

* Support common partitioning patterns
* Volume taints and tolerations for unbinding volumes in error conditions

### Phase 4

* Dynamic provisioning

## Design

A high level proposal with user workflows is available in the
[Local Storage Overview](local-storage-overview.md).

This design section will focus on one phase at a time. Each new release will
extend this section.

### Phase 1: 1.7 alpha

#### Local Volume Plugin

A new volume plugin will be introduced to represent logical block partitions and
filesystem mounts that are local to a node.
Some examples include whole disks,
disk partitions, RAID volumes, LVM volumes, or even directories in a shared
partition. Multiple Local volumes can be created on a node, and each is
accessed through a local mount point or path that is bind-mounted into the
container. A Local volume is only consumable as a PersistentVolumeSource because the PV
interface solves the pod spec portability problem and provides the following:

* Abstracts volume implementation details for the pod and expresses volume
requirements in terms of general concepts, like capacity and class. This allows
for portable configuration, as the pod is not tied to specific volume instances.
* Allows volume management to be independent of the pod lifecycle. The volume can
survive container, pod, and node restarts.
* Allows volume classification by StorageClass.
* Is uniquely identifiable within a cluster and is managed from a cluster-wide
view.

There are major changes in PV and pod semantics when using Local volumes
compared to the typical remote storage volumes.

* Since Local volumes are fixed to a node, a pod using that volume has to
always be scheduled on that node.
* Volume availability is tied to the node's availability. If the node is
unavailable, then the volume is also unavailable, which impacts pod
availability.
* The volume's data durability characteristics are determined by the underlying
storage system, and cannot be guaranteed by the plugin. A Local volume
in one environment can provide data durability, but in another environment may
only be ephemeral. As an example, in the GCE/GKE/AWS cloud environments, the
data in directly attached, physical SSDs is immediately deleted when the VM
instance terminates or becomes unavailable.

Due to these differences in behavior, Local volumes are not suitable for
general purpose use cases, and are only suitable for specific applications that
need storage performance and data gravity, and can tolerate data loss or
unavailability. Applications need to be aware of, and be able to handle, these
differences in data durability and availability.

Local volumes are similar to HostPath volumes in the following ways:

* Partitions need to be configured by the storage administrator beforehand.
* Volume is referenced by the path to the partition.
* Provides the same IOPS isolation support as the underlying partition.
* Volume is permanently attached to one node.
* Volume can be mounted by multiple pods on the same node.

However, Local volumes will address these current issues with HostPath
volumes:

* Security concerns with allowing a pod to access any path on a node. Local
volumes cannot be consumed directly by a pod. They must be specified as a PV
source, so only users with storage provisioning privileges can determine which
paths on a node are available for consumption.
* Difficulty in permissions setup. Local volumes will support fsGroup so
that admins do not need to set up the permissions beforehand, tying that
particular volume to a specific user/group. During the mount, the fsGroup
settings will be applied on the path. However, multiple pods
using the same volume should use the same fsGroup.
* Volume lifecycle is not clearly defined, and the volume has to be manually
cleaned up by users. For Local volumes, the PV has a clearly defined
lifecycle. Upon PVC deletion, the PV will be released (if it has the Delete
policy), and all the contents under the path will be deleted.
In the future,
advanced cleanup options, like zeroing, can also be specified for a more
comprehensive cleanup.

##### API Changes

All new changes are protected by a new feature gate, `PersistentLocalVolumes`.

A new `LocalVolumeSource` type is added as a `PersistentVolumeSource`. For this
initial phase, the path can only be a mount point or a directory in a shared
filesystem.

```
type LocalVolumeSource struct {
    // The full path to the volume on the node
    // For alpha, this path must be a directory
    // Once block as a source is supported, then this path can point to a block device
    Path string
}

type PersistentVolumeSource struct {

    // Local represents directly-attached storage with node affinity.
    // +optional
    Local *LocalVolumeSource
}
```

The relationship between a Local volume and its node will be expressed using
PersistentVolume node affinity, described in the following section.

Users request Local volumes using PersistentVolumeClaims in the same manner as any
other volume type. The PVC will bind to a matching PV with the appropriate capacity,
AccessMode, and StorageClassName. Then the user specifies that PVC in their
Pod spec. There are no special annotations or fields that need to be set in the Pod
or PVC to distinguish between local and remote storage; that distinction is
abstracted by the StorageClass.

#### PersistentVolume Node Affinity

PersistentVolume node affinity is a new concept, similar to Pod node affinity,
except that instead of specifying the nodes a Pod has to be scheduled to, it specifies
the nodes a PersistentVolume can be attached and mounted to, which in turn influences
the scheduling of Pods that use local volumes.

For a Pod that uses a PV with node affinity, a new scheduler predicate
will evaluate that node affinity against the node's labels. For this initial phase, the
PV node affinity is only considered by the scheduler for already-bound PVs. It is not
considered during the initial PVC/PV binding, which will be addressed in a future release.

Only the `requiredDuringSchedulingIgnoredDuringExecution` field will be supported.

##### API Changes

For the initial alpha phase, node affinity is expressed as an optional
annotation in the PersistentVolume object.

```
// AlphaStorageNodeAffinityAnnotation defines node affinity policies for a PersistentVolume.
// Value is a string of the json representation of type NodeAffinity
AlphaStorageNodeAffinityAnnotation = "volume.alpha.kubernetes.io/node-affinity"
```

#### Local volume initial configuration

There are countless ways to configure local storage on a node, with different patterns to
follow depending on application requirements and use cases. Some use cases may require
dedicated disks; others may only need small partitions and can share disks.
Instead of forcing a partitioning scheme on storage administrators, the Local volume
is represented by a path, and lets administrators partition their storage however they
like, with a few minimum requirements:

* The paths to the mount points are always consistent, even across reboots or when storage
is added or removed.
* The paths are backed by a filesystem (block devices or raw partitions are not supported for
the first phase).
* The directories have appropriate permissions for the provisioner to be able to set owners and
clean up the volume.

#### Local volume management

Local PVs are statically created and not dynamically provisioned for the first phase.
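
To make the static workflow concrete, here is a sketch of what a pre-created Local PV
and a consuming PVC could look like with this alpha API. The object names, node name,
capacity, and path below are illustrative assumptions; the `local-fast` StorageClass name
matches the provisioner examples later in this document.

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-example
  annotations:
    # JSON representation of the NodeAffinity type, as described above
    "volume.alpha.kubernetes.io/node-affinity": |
      {
        "requiredDuringSchedulingIgnoredDuringExecution": {
          "nodeSelectorTerms": [
            {
              "matchExpressions": [
                {
                  "key": "kubernetes.io/hostname",
                  "operator": "In",
                  "values": ["node-1"]
                }
              ]
            }
          ]
        }
      }
spec:
  capacity:
    storage: 375Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-fast
  local:
    path: /mnt/ssds/vol1
```

A user would then consume the volume with an ordinary PVC, and reference that PVC in
their Pod spec as usual; nothing local-storage-specific appears in the Pod or PVC:

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myapp-local-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-fast
  resources:
    requests:
      storage: 300Gi
```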
To reduce the amount of time an administrator has to spend managing Local volumes,
a Local static provisioner application will be provided to handle common scenarios. For
uncommon scenarios, a specialized provisioner can be written.

The Local static provisioner will be developed in the
[kubernetes-incubator/external-storage](https://github.com/kubernetes-incubator)
repository, and will loosely follow the external provisioner design, with a few differences:

* A provisioner instance needs to run on each node and only manage the local storage on its node.
* For phase 1, it does not handle dynamic provisioning. Instead, it performs static provisioning
by discovering available partitions mounted under configurable discovery directories.

The basic design of the provisioner will have two separate handlers: one for PV deletion and
cleanup, and the other for static PV creation. A PersistentVolume informer will be created
and its cache will be used by both handlers.

PV deletion will operate on the Update event. If a PV it provisioned changes to the "Released"
state, and if the reclaim policy is Delete, then it will clean up the volume and then delete the PV,
removing it from the cache.

PV creation does not operate on any informer events. Instead, it periodically monitors the discovery
directories, and will create a new PV for each path in the directory that is not in the PV cache. It
sets the "pv.kubernetes.io/provisioned-by" annotation so that it can distinguish which PVs it created.

For phase 1, the allowed discovery file types are directories and mount points. The PV capacity
will be the capacity of the underlying filesystem. Therefore, PVs that are backed by shared
directories will report their capacity as the entire filesystem, potentially causing overcommitment.
Separate partitions are recommended for capacity isolation.

The name of the PV needs to be unique across the cluster. The provisioner will hash the node name,
StorageClass name, and base file name in the volume path to generate a unique name.

##### Packaging

The provisioner is packaged as a container image and will run on each node in the cluster as part of
a DaemonSet. It needs to be run with a user or service account with the following permissions:

* Create/delete/list/get PersistentVolumes - Can use the `system:persistentvolumeprovisioner` ClusterRoleBinding
* Get ConfigMaps - To access user configuration for the provisioner
* Get Nodes - To get the node's UID and labels

These are broader permissions than necessary (a node's access to PVs should be restricted to only
those local to the node). A redesign will be considered in a future release to address this issue.

In addition, it should run with high priority so that it can reliably handle all the local storage
partitions on each node, and with enough permissions to be able to clean up volume contents upon
deletion.
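
As a rough sketch, the service account and the PV permissions above could be granted with
RBAC along the following lines, assuming the `system:persistentvolumeprovisioner` role
referenced above is available as a ClusterRole. The ClusterRoleBinding name is illustrative,
and separate bindings would still be needed for the ConfigMap and Node read permissions:

```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: local-storage-admin
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  # Illustrative name; any unique name works
  name: local-storage-provisioner-pv-binding
subjects:
- kind: ServiceAccount
  name: local-storage-admin
  namespace: kube-system
roleRef:
  # ClusterRole providing the PV permissions referenced above
  kind: ClusterRole
  name: system:persistentvolumeprovisioner
  apiGroup: rbac.authorization.k8s.io
```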
The provisioner DaemonSet requires the following configuration:

* The node's name set as the MY_NODE_NAME environment variable
* ConfigMap with StorageClass -> discovery directory mappings
* Each mapping in the ConfigMap needs a hostPath volume
* User/service account with all the required permissions

Here is an example ConfigMap:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-volume-config
  namespace: kube-system
data:
  "local-fast": |
    {
      "hostDir": "/mnt/ssds",
      "mountDir": "/local-ssds"
    }
  "local-slow": |
    {
      "hostDir": "/mnt/hdds",
      "mountDir": "/local-hdds"
    }
```

The `hostDir` is the discovery path on the host, and the `mountDir` is the path it is mounted to in
the provisioner container. The `hostDir` is required because the provisioner needs to create Local PVs
with the `Path` based off of `hostDir`, not `mountDir`.

The DaemonSet for this example looks like:

```
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: local-storage-provisioner
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        system: local-storage-provisioner
    spec:
      containers:
      - name: provisioner
        image: "gcr.io/google_containers/local-storage-provisioner:v1.0"
        imagePullPolicy: Always
        volumeMounts:
        - name: vol1
          mountPath: "/local-ssds"
        - name: vol2
          mountPath: "/local-hdds"
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
      volumes:
      - name: vol1
        hostPath:
          path: "/mnt/ssds"
      - name: vol2
        hostPath:
          path: "/mnt/hdds"
      serviceAccount: local-storage-admin
```

##### Provisioner Bootstrapper

Manually setting up this DaemonSet spec can be tedious, and it requires duplicate specification
of the StorageClass -> directory mappings both in the ConfigMap and as hostPath volumes. To
make it simpler and less error prone, a bootstrapper application will be provided to generate
and launch the provisioner DaemonSet based off of the ConfigMap. It can also create a service
account with all the required permissions.

The bootstrapper accepts the following optional arguments:

* `-image`: Name of the local volume provisioner image (default
"quay.io/external_storage/local-volume-provisioner:latest")
* `-volume-config`: Name of the local volume configuration configmap. The configmap must reside in the same
namespace as the bootstrapper. (default "local-volume-default-config")
* `-serviceaccount`: Name of the service account for the local volume provisioner (default "local-storage-admin")

The bootstrapper requires the following permissions:

* Get/Create/Update ConfigMap
* Create ServiceAccount
* Create ClusterRoleBindings
* Create DaemonSet

Since the bootstrapper generates the DaemonSet spec, the ConfigMap can be simplified to just specify the
host directories:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-volume-config
  namespace: kube-system
data:
  "local-fast": |
    {
      "hostDir": "/mnt/ssds"
    }
  "local-slow": |
    {
      "hostDir": "/mnt/hdds"
    }
```

The bootstrapper will update the ConfigMap with the generated `mountDir`. It generates the `mountDir`
by stripping off the initial "/" in `hostDir`, replacing the remaining "/" with "~", and adding the
prefix path "/mnt/local-storage".

In the above example, the generated `mountDir` values are `/mnt/local-storage/mnt~ssds` and
`/mnt/local-storage/mnt~hdds`, respectively.
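
For illustration, after the bootstrapper runs, the ConfigMap from this example would be
updated to look roughly like the following (a sketch assuming the `mountDir` generation
rule described above):

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-volume-config
  namespace: kube-system
data:
  "local-fast": |
    {
      "hostDir": "/mnt/ssds",
      "mountDir": "/mnt/local-storage/mnt~ssds"
    }
  "local-slow": |
    {
      "hostDir": "/mnt/hdds",
      "mountDir": "/mnt/local-storage/mnt~hdds"
    }
```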
#### Use Case Deliverables

This alpha phase of Local PV support will provide the following capabilities:

* Local directories to be specified as Local PVs with node affinity
* A Pod using a PVC that is bound to a Local PV will always be scheduled to that node
* External static provisioner DaemonSet that discovers local directories, and creates, cleans up,
and deletes Local PVs

#### Limitations

However, some use cases will not work:

* Specifying multiple Local PVCs in a pod. Most likely, the PVCs will be bound to Local PVs on
different nodes, making the pod unschedulable.
* Specifying Pod affinity/anti-affinity with Local PVs. PVC binding does not look at Pod scheduling
constraints at all.
* Using Local PVs in a highly utilized cluster. PVC binding does not look at Pod resource requirements
and Node resource availability.

These issues will be solved in a future release with advanced storage topology scheduling.

As a workaround, PVCs can be manually prebound to Local PVs to essentially manually schedule Pods to
specific nodes.

#### Test Cases

##### API unit tests

* LocalVolumeSource cannot be specified without the feature gate
* Non-empty PV node affinity is required for LocalVolumeSource
* Preferred node affinity is not allowed
* Path is required to be non-empty
* Invalid json representation of type NodeAffinity returns an error

##### PV node affinity unit tests

* Nil or empty node affinity evaluates to true for any node
* Node affinity specifying existing node labels evaluates to true
* Node affinity specifying non-existing node label keys evaluates to false
* Node affinity specifying non-existing node label values evaluates to false

##### Local volume plugin unit tests

* Plugin can support PersistentVolumeSource
* Plugin cannot support VolumeSource
* Plugin supports ReadWriteOnce access mode
* Plugin does not support remaining access modes
* Plugin supports Mounter and Unmounter
* Plugin does not support Provisioner, Recycler, Deleter
* Plugin supports readonly
* Plugin GetVolumeName() returns PV name
* Plugin ConstructVolumeSpec() returns PV info
* Plugin disallows backsteps in the Path

##### Local volume provisioner unit tests

* Directory not in the cache and PV should be created
* Directory is in the cache and PV should not be created
* Directories created later are discovered and PV is created
* Unconfigured directories are ignored
* PVs are created with the configured StorageClass
* PV name generation hashed correctly using node name, StorageClass, and filename
* PV creation failure should not add directory to cache
* Non-directory type should not create a PV
* PV is released, PV should be deleted
* PV should not be deleted for any other PV phase
* PV deletion failure should not remove PV from cache
* PV cleanup failure should not delete PV or remove from cache

##### E2E tests

* Pod that is bound to a Local PV is scheduled to the correct node
and can mount, read, and write
* Two pods serially accessing the same Local PV can mount, read, and write
* Two pods simultaneously accessing the same Local PV can mount, read, and write
* Test both directory-based Local PV, and mount point-based Local PV
* Launch local volume provisioner, create some directories under the discovery path,
and verify that PVs are created and a Pod can mount, read, and write.
* After destroying a PVC managed by the local volume provisioner, it should clean up
the volume and recreate a new PV.
* Pod using a Local PV with a non-existent path fails to mount
* Pod that sets nodeName to a different node than the PV node affinity cannot schedule.


### Phase 2: 1.9 alpha

#### Smarter PV binding

The issue of PV binding not taking into account pod scheduling requirements affects any
type of volume that imposes topology constraints, such as local storage and zonal disks.

Because this problem affects more than just local volumes, it will be treated as a
separate feature with a separate proposal. Once that feature is implemented, then the
limitations outlined above will be fixed.

#### Block devices and raw partitions

Pods accessing raw block storage is a new alpha feature in 1.8. Changes are required in
the Local volume plugin and provisioner to be able to support raw block devices.

Design TBD

#### Provisioner redesign for stricter K8s API access control

In 1.7, each instance of the provisioner on each node has full permissions to create and
delete all PVs in the system. This is unnecessary and potentially a vulnerability if the
node gets compromised.

To address this issue, the provisioner will be redesigned into two major components:

1. A central manager pod that handles the creation and deletion of PV objects.
This central pod can run on a trusted node and be given PV create/delete permissions.
2. Worker pods on each node, run as a DaemonSet, that discover and clean up the local
volumes on that node. These workers do not interact with PV objects; however,
they still require permissions to be able to read the `Node.Labels` on their node.

The central manager will poll each worker for their discovered volumes and create PVs for
them. When a PV is released, then it will send the cleanup request to the worker.

Detailed design TBD