| # Local Storage Persistent Volumes | ||||
| 
 | ||||
| Authors: @msau42, @vishh | ||||
| 
 | ||||
| This document presents a detailed design for supporting persistent local storage, | ||||
| as outlined in [Local Storage Overview](local-storage-overview.md). | ||||
| Supporting all the use cases for persistent local storage will take many releases, | ||||
| so this document will be extended for each new release as we add more features. | ||||
| 
 | ||||
| ## Goals | ||||
| 
 | ||||
| * Allow pods to mount any local block or filesystem based volume. | ||||
| * Allow pods to mount dedicated local disks, or channeled partitions as volumes for | ||||
| IOPS isolation. | ||||
* Allow pods to access local volumes without root privileges.
* Allow pods to access local volumes without needing to understand the storage
layout on every node.
| * Persist local volumes and provide data gravity for pods.  Any pod | ||||
| using the local volume will be scheduled to the same node that the local volume | ||||
| is on. | ||||
| * Allow pods to release their local volume bindings and lose that volume's data | ||||
| during failure conditions, such as node, storage or scheduling failures, where | ||||
| the volume is not accessible for some user-configurable time. | ||||
| * Allow pods to specify local storage as part of a Deployment or StatefulSet. | ||||
| * Allow administrators to set up and configure local volumes with simple methods. | ||||
| * Do not require administrators to manage the local volumes once provisioned | ||||
| for a node. | ||||
| 
 | ||||
| ## Non-Goals | ||||
| 
 | ||||
| * Provide data availability for a local volume beyond its local node. | ||||
| * Support the use of HostPath volumes and Local PVs on the same volume. | ||||
| 
 | ||||
| ## Background | ||||
| 
 | ||||
| In Kubernetes, there are two main types of storage: remote and local. | ||||
| 
 | ||||
| Remote storage is typically used with persistent volumes where the data can | ||||
| persist beyond the lifetime of the pod. | ||||
| 
 | ||||
| Local storage is typically used with ephemeral volumes where the data only | ||||
| persists during the lifetime of the pod. | ||||
| 
 | ||||
| There is increasing demand for using local storage as persistent volumes, | ||||
| especially for distributed filesystems and databases such as GlusterFS and | ||||
| Cassandra.  The main motivations for using persistent local storage, instead | ||||
of persistent remote storage, include:
| 
 | ||||
| * Performance:  Local SSDs achieve higher IOPS and throughput than many | ||||
| remote storage solutions. | ||||
| 
 | ||||
| * Cost: Operational costs may be reduced by leveraging existing local storage, | ||||
| especially in bare metal environments.  Network storage can be expensive to | ||||
set up and maintain, and it may not be necessary for certain applications.
| 
 | ||||
| ## Use Cases | ||||
| 
 | ||||
| ### Distributed filesystems and databases | ||||
| 
 | ||||
| Many distributed filesystem and database implementations, such as Cassandra and | ||||
| GlusterFS, utilize the local storage on each node to form a storage cluster. | ||||
| These systems typically have a replication feature that sends copies of the data | ||||
| to other nodes in the cluster in order to provide fault tolerance in case of | ||||
| node failures.  Non-distributed, but replicated databases, like MySQL, can also | ||||
| utilize local storage to store replicas. | ||||
| 
 | ||||
| The main motivations for using local persistent storage are performance and | ||||
| cost.  Since the application handles data replication and fault tolerance, these | ||||
| application pods do not need networked storage to provide shared access to data. | ||||
| In addition, installing a high-performing NAS or SAN solution can be more | ||||
| expensive, and more complex to configure and maintain than utilizing local | ||||
| disks, especially if the node was already pre-installed with disks.  Datacenter | ||||
| infrastructure and operational costs can be reduced by increasing storage | ||||
| utilization. | ||||
| 
 | ||||
| These distributed systems are generally stateful, infrastructure applications | ||||
| that provide data services to higher-level applications.  They are expected to | ||||
| run in a cluster with many other applications potentially sharing the same | ||||
| nodes.  Therefore, they expect to have high priority and node resource | ||||
| guarantees.  They typically are deployed using StatefulSets, custom | ||||
| controllers, or operators. | ||||
| 
 | ||||
| ### Caching | ||||
| 
 | ||||
| Caching is one of the recommended use cases for ephemeral local storage.  The | ||||
| cached data is backed by persistent storage, so local storage data durability is | ||||
| not required.  However, there is a use case for persistent local storage to | ||||
| achieve data gravity for large caches.  For large caches, if a pod restarts, | ||||
| rebuilding the cache can take a long time.  As an example, rebuilding a 100GB | ||||
| cache from a hard disk with 150MB/s read throughput can take around 10 minutes. | ||||
| If the service gets restarted and all the pods have to restart, then performance | ||||
| and availability can be impacted while the pods are rebuilding.  If the cache is | ||||
| persisted, then cold startup latencies are reduced. | ||||
| 
 | ||||
| Content-serving applications and producer/consumer workflows commonly utilize | ||||
| caches for better performance.  They are typically deployed using Deployments, | ||||
and could be isolated in their own clusters, or shared with other applications.
| 
 | ||||
| ## Environments | ||||
| 
 | ||||
| ### Baremetal | ||||
| 
 | ||||
In a baremetal environment, nodes may be configured with multiple local disks of
varying capacities, speeds and mediums.  Mediums include spinning disks (HDDs) and
solid-state drives (SSDs), and the capacity of each disk can range from hundreds
of GBs to tens of TBs.  Multiple disks may be arranged in JBOD or RAID configurations
to be consumed as persistent storage.
| 
 | ||||
| Currently, the methods to use the additional disks are to: | ||||
| 
 | ||||
| * Configure a distributed filesystem | ||||
| * Configure a HostPath volume | ||||
| 
 | ||||
It is also possible to configure a NAS or SAN on a node.  Speeds and
| capacities will widely vary depending on the solution. | ||||
| 
 | ||||
| ### GCE/GKE | ||||
| 
 | ||||
GCE and GKE both have a local SSD feature that can create a VM instance with up
to 8 fixed-size 375GB local SSDs physically attached to the instance host, which
appear as additional disks in the instance.  The local SSDs have to be
| configured at the VM creation time and cannot be dynamically attached to an | ||||
| instance later.  If the VM gets shutdown, terminated, pre-empted, or the host | ||||
| encounters a non-recoverable error, then the SSD data will be lost.  If the | ||||
| guest OS reboots, or a live migration occurs, then the SSD data will be | ||||
| preserved. | ||||
| 
 | ||||
| ### EC2 | ||||
| 
 | ||||
| In EC2, the instance store feature attaches local HDDs or SSDs to a new instance | ||||
as additional disks.  HDD capacities can go up to 24 x 2TB disks for the largest
configuration.  SSD capacities can go up to 8 x 800GB disks or 2 x 2TB disks for the
| largest configurations.  Data on the instance store only persists across | ||||
| instance reboot. | ||||
| 
 | ||||
| ## Limitations of current volumes | ||||
| 
 | ||||
The following is an overview of existing volume types in Kubernetes, and why
| they cannot completely address the use cases for local persistent storage. | ||||
| 
 | ||||
| * EmptyDir: A temporary directory for a pod that is created under the kubelet | ||||
| root directory.  The contents are deleted when a pod dies.  Limitations: | ||||
| 
 | ||||
|   * Volume lifetime is bound to the pod lifetime.  Pod failure is more likely | ||||
| than node failure, so there can be increased network and storage activity to | ||||
| recover data via replication and data backups when a replacement pod is started. | ||||
|   * Multiple disks are not supported unless the administrator aggregates them | ||||
| into a spanned or RAID volume.  In this case, all the storage is shared, and | ||||
| IOPS guarantees cannot be provided. | ||||
  * There is currently no method of distinguishing between HDDs and SSDs.  The
| “medium” field could be expanded, but it is not easily generalizable to | ||||
| arbitrary types of mediums. | ||||
| 
 | ||||
| * HostPath: A direct mapping to a specified directory on the node.  The | ||||
| directory is not managed by the cluster.  Limitations: | ||||
| 
 | ||||
  * Admin needs to manually set up directory permissions for the volume’s users.
  * Admin has to manage the volume lifecycle manually and clean up the data and
directories.
  * All nodes have to have their local storage provisioned the same way in order to
use the same pod template.
  * There can be path collision issues if multiple pods that want the same path get
scheduled to the same node.
|   * If node affinity is specified, then the user has to do the pod scheduling | ||||
| manually. | ||||
| 
 | ||||
| * Provider’s block storage (GCE PD, AWS EBS, etc): A remote disk that can be | ||||
| attached to a VM instance.  The disk’s lifetime is independent of the pod’s | ||||
| lifetime.  Limitations: | ||||
| 
 | ||||
|   * Doesn’t meet performance requirements. | ||||
| [Performance benchmarks on GCE](https://cloud.google.com/compute/docs/disks/performance) | ||||
| show that local SSD can perform better than SSD persistent disks: | ||||
| 
 | ||||
|     * 16x read IOPS | ||||
|     * 11x write IOPS | ||||
|     * 6.5x read throughput | ||||
|     * 4.5x write throughput | ||||
| 
 | ||||
| * Networked filesystems (NFS, GlusterFS, etc): A filesystem reachable over the | ||||
| network that can provide shared access to data.  Limitations: | ||||
| 
 | ||||
|   * Requires more configuration and setup, which adds operational burden and | ||||
| cost. | ||||
|   * Requires a high performance network to achieve equivalent performance as | ||||
| local disks, especially when compared to high-performance SSDs. | ||||
| 
 | ||||
| Due to the current limitations in the existing volume types, a new method for | ||||
| providing persistent local storage should be considered. | ||||
| 
 | ||||
| ## Feature Plan | ||||
| 
 | ||||
| A detailed implementation plan can be found in the | ||||
[Storage SIG planning spreadsheet](https://docs.google.com/spreadsheets/d/1t4z5DYKjX2ZDlkTpCnp18icRAQqOE85C1T1r2gqJVck/view#gid=1566770776).
| The following is a high level summary of the goals in each phase. | ||||
| 
 | ||||
| ### Phase 1 | ||||
| 
 | ||||
| * Support Pod, Deployment, and StatefulSet requesting a single local volume | ||||
| * Support pre-configured, statically partitioned, filesystem-based local volumes | ||||
| 
 | ||||
| ### Phase 2 | ||||
| 
 | ||||
| * Block devices and raw partitions | ||||
| * Smarter PV binding to consider local storage and pod scheduling constraints, | ||||
| such as pod affinity/anti-affinity, and requesting multiple local volumes | ||||
| 
 | ||||
| ### Phase 3 | ||||
| 
 | ||||
| * Support common partitioning patterns | ||||
| * Volume taints and tolerations for unbinding volumes in error conditions | ||||
| 
 | ||||
| ### Phase 4 | ||||
| 
 | ||||
| * Dynamic provisioning | ||||
| 
 | ||||
| ## Design | ||||
| 
 | ||||
| A high level proposal with user workflows is available in the | ||||
| [Local Storage Overview](local-storage-overview.md). | ||||
| 
 | ||||
| This design section will focus on one phase at a time.  Each new release will | ||||
| extend this section. | ||||
| 
 | ||||
| ### Phase 1: 1.7 alpha | ||||
| 
 | ||||
| #### Local Volume Plugin | ||||
| 
 | ||||
| A new volume plugin will be introduced to represent logical block partitions and | ||||
| filesystem mounts that are local to a node.  Some examples include whole disks, | ||||
| disk partitions, RAID volumes, LVM volumes, or even directories in a shared | ||||
partition.  Multiple Local volumes can be created on a node, and each is
accessed through a local mount point or path that is bind-mounted into the
| container.  It is only consumable as a PersistentVolumeSource because the PV | ||||
| interface solves the pod spec portability problem and provides the following: | ||||
| 
 | ||||
| * Abstracts volume implementation details for the pod and expresses volume | ||||
| requirements in terms of general concepts, like capacity and class.  This allows | ||||
| for portable configuration, as the pod is not tied to specific volume instances. | ||||
| * Allows volume management to be independent of the pod lifecycle.  The volume can | ||||
| survive container, pod and node restarts. | ||||
| * Allows volume classification by StorageClass. | ||||
| * Is uniquely identifiable within a cluster and is managed from a cluster-wide | ||||
| view. | ||||
| 
 | ||||
| There are major changes in PV and pod semantics when using Local volumes | ||||
| compared to the typical remote storage volumes. | ||||
| 
 | ||||
| * Since Local volumes are fixed to a node, a pod using that volume has to | ||||
| always be scheduled on that node. | ||||
| * Volume availability is tied to the node’s availability.  If the node is | ||||
| unavailable, then the volume is also unavailable, which impacts pod | ||||
| availability. | ||||
| * The volume’s data durability characteristics are determined by the underlying | ||||
| storage system, and cannot be guaranteed by the plugin.  A Local volume | ||||
| in one environment can provide data durability, but in another environment may | ||||
| only be ephemeral.  As an example, in the GCE/GKE/AWS cloud environments, the | ||||
| data in directly attached, physical SSDs is immediately deleted when the VM | ||||
| instance terminates or becomes unavailable. | ||||
| 
 | ||||
| Due to these differences in behaviors, Local volumes are not suitable for | ||||
| general purpose use cases, and are only suitable for specific applications that | ||||
| need storage performance and data gravity, and can tolerate data loss or | ||||
| unavailability.  Applications need to be aware of, and be able to handle these | ||||
| differences in data durability and availability. | ||||
| 
 | ||||
| Local volumes are similar to HostPath volumes in the following ways: | ||||
| 
 | ||||
| * Partitions need to be configured by the storage administrator beforehand. | ||||
| * Volume is referenced by the path to the partition. | ||||
* Provides the same IOPS isolation as the underlying partition.
| * Volume is permanently attached to one node. | ||||
| * Volume can be mounted by multiple pods on the same node. | ||||
| 
 | ||||
| However, Local volumes will address these current issues with HostPath | ||||
| volumes: | ||||
| 
 | ||||
| * Security concerns allowing a pod to access any path in a node.  Local | ||||
| volumes cannot be consumed directly by a pod.  They must be specified as a PV | ||||
| source, so only users with storage provisioning privileges can determine which | ||||
| paths on a node are available for consumption. | ||||
| * Difficulty in permissions setup.  Local volumes will support fsGroup so | ||||
that the admins do not need to set up the permissions beforehand, tying that
| particular volume to a specific user/group.  During the mount, the fsGroup | ||||
| settings will be applied on the path.  However, multiple pods | ||||
| using the same volume should use the same fsGroup. | ||||
| * Volume lifecycle is not clearly defined, and the volume has to be manually | ||||
| cleaned up by users.  For Local volumes, the PV has a clearly defined | ||||
| lifecycle.  Upon PVC deletion, the PV will be released (if it has the Delete | ||||
| policy), and all the contents under the path will be deleted.  In the future, | ||||
advanced cleanup options, like zeroing, can also be specified for a more
| comprehensive cleanup. | ||||
| 
 | ||||
| ##### API Changes | ||||
| 
 | ||||
| All new changes are protected by a new feature gate, `PersistentLocalVolumes`. | ||||
| 
 | ||||
| A new `LocalVolumeSource` type is added as a `PersistentVolumeSource`.  For this | ||||
| initial phase, the path can only be a mount point or a directory in a shared | ||||
| filesystem. | ||||
| 
 | ||||
| ``` | ||||
| type LocalVolumeSource struct { | ||||
|         // The full path to the volume on the node | ||||
|         // For alpha, this path must be a directory | ||||
|         // Once block as a source is supported, then this path can point to a block device | ||||
|         Path string | ||||
| } | ||||
| 
 | ||||
| type PersistentVolumeSource struct { | ||||
|     <snip> | ||||
|     // Local represents directly-attached storage with node affinity. | ||||
|     // +optional | ||||
|     Local *LocalVolumeSource | ||||
| } | ||||
| ``` | ||||
| 
 | ||||
| The relationship between a Local volume and its node will be expressed using | ||||
| PersistentVolume node affinity, described in the following section. | ||||
| 
 | ||||
| Users request Local volumes using PersistentVolumeClaims in the same manner as any | ||||
| other volume type. The PVC will bind to a matching PV with the appropriate capacity, | ||||
| AccessMode, and StorageClassName.  Then the user specifies that PVC in their | ||||
| Pod spec.  There are no special annotations or fields that need to be set in the Pod | ||||
| or PVC to distinguish between local and remote storage.  It is abstracted by the | ||||
| StorageClass. | ||||
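
For illustration, a claim and pod using local storage might look like the following
sketch.  The claim name, the `local-fast` StorageClass (taken from the provisioner
ConfigMap example later in this document), the capacity, and the container image are
placeholders; the `fsGroup` setting is shown because Local volumes apply it at mount
time, as described above.

```
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: example-local-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-fast
  resources:
    requests:
      storage: 100Gi
---
kind: Pod
apiVersion: v1
metadata:
  name: example-app
spec:
  securityContext:
    fsGroup: 1234
  containers:
  - name: app
    image: gcr.io/example-project/example-app:latest
    volumeMounts:
    - name: local-data
      mountPath: /data
  volumes:
  - name: local-data
    persistentVolumeClaim:
      claimName: example-local-claim
```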
| 
 | ||||
| #### PersistentVolume Node Affinity | ||||
| 
 | ||||
| PersistentVolume node affinity is a new concept and is similar to Pod node affinity, | ||||
| except instead of specifying which nodes a Pod has to be scheduled to, it specifies which nodes | ||||
| a PersistentVolume can be attached and mounted to, influencing scheduling of Pods that | ||||
| use local volumes. | ||||
| 
 | ||||
| For a Pod that uses a PV with node affinity, a new scheduler predicate | ||||
| will evaluate that node affinity against the node's labels.  For this initial phase, the | ||||
| PV node affinity is only considered by the scheduler for already-bound PVs.  It is not | ||||
| considered during the initial PVC/PV binding, which will be addressed in a future release. | ||||
| 
 | ||||
| Only the `requiredDuringSchedulingIgnoredDuringExecution` field will be supported. | ||||
| 
 | ||||
| ##### API Changes | ||||
| 
 | ||||
| For the initial alpha phase, node affinity is expressed as an optional | ||||
| annotation in the PersistentVolume object. | ||||
| 
 | ||||
| ``` | ||||
| // AlphaStorageNodeAffinityAnnotation defines node affinity policies for a PersistentVolume. | ||||
| // Value is a string of the json representation of type NodeAffinity | ||||
| AlphaStorageNodeAffinityAnnotation = "volume.alpha.kubernetes.io/node-affinity" | ||||
| ``` | ||||
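
For example, a Local PV for a filesystem mounted at `/mnt/ssds/ssd1` on node
`my-node` could be expressed as in the sketch below.  The PV name, node name,
capacity, and StorageClass are placeholders; the lowercase `local`/`path` field
names follow the API types above, and the annotation value is the JSON
representation of `NodeAffinity` using only the required field.

```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv
  annotations:
    "volume.alpha.kubernetes.io/node-affinity": '{
      "requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [
          {
            "matchExpressions": [
              {
                "key": "kubernetes.io/hostname",
                "operator": "In",
                "values": ["my-node"]
              }
            ]
          }
        ]
      }
    }'
spec:
  capacity:
    storage: 375Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-fast
  local:
    path: /mnt/ssds/ssd1
```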
| 
 | ||||
| #### Local volume initial configuration | ||||
| 
 | ||||
| There are countless ways to configure local storage on a node, with different patterns to | ||||
| follow depending on application requirements and use cases.  Some use cases may require | ||||
| dedicated disks; others may only need small partitions and are ok with sharing disks. | ||||
| Instead of forcing a partitioning scheme on storage administrators, the Local volume | ||||
| is represented by a path, and lets the administrators partition their storage however they | ||||
| like, with a few minimum requirements: | ||||
| 
 | ||||
| * The paths to the mount points are always consistent, even across reboots or when storage | ||||
| is added or removed. | ||||
| * The paths are backed by a filesystem (block devices or raw partitions are not supported for | ||||
| the first phase) | ||||
| * The directories have appropriate permissions for the provisioner to be able to set owners and | ||||
clean up the volume.
| 
 | ||||
| #### Local volume management | ||||
| 
 | ||||
| Local PVs are statically created and not dynamically provisioned for the first phase. | ||||
To reduce the amount of time an administrator has to spend managing Local volumes,
| a Local static provisioner application will be provided to handle common scenarios.  For | ||||
| uncommon scenarios, a specialized provisioner can be written. | ||||
| 
 | ||||
| The Local static provisioner will be developed in the | ||||
| [kubernetes-incubator/external-storage](https://github.com/kubernetes-incubator) | ||||
| repository, and will loosely follow the external provisioner design, with a few differences: | ||||
| 
 | ||||
| * A provisioner instance needs to run on each node and only manage the local storage on its node. | ||||
| * For phase 1, it does not handle dynamic provisioning.  Instead, it performs static provisioning | ||||
| by discovering available partitions mounted under configurable discovery directories. | ||||
| 
 | ||||
| The basic design of the provisioner will have two separate handlers: one for PV deletion and | ||||
| cleanup, and the other for static PV creation.  A PersistentVolume informer will be created | ||||
| and its cache will be used by both handlers. | ||||
| 
 | ||||
| PV deletion will operate on the Update event.  If the PV it provisioned changes to the “Released” | ||||
state, and if the reclaim policy is Delete, then it will clean up the volume and then delete the PV,
| removing it from the cache. | ||||
| 
 | ||||
| PV creation does not operate on any informer events.  Instead, it periodically monitors the discovery | ||||
| directories, and will create a new PV for each path in the directory that is not in the PV cache.  It | ||||
| sets the "pv.kubernetes.io/provisioned-by" annotation so that it can distinguish which PVs it created. | ||||
| 
 | ||||
| For phase 1, the allowed discovery file types are directories and mount points.  The PV capacity | ||||
| will be the capacity of the underlying filesystem.  Therefore, PVs that are backed by shared | ||||
directories will report their capacity as that of the entire filesystem, potentially causing overcommitment.
| Separate partitions are recommended for capacity isolation. | ||||
| 
 | ||||
| The name of the PV needs to be unique across the cluster.  The provisioner will hash the node name, | ||||
| StorageClass name, and base file name in the volume path to generate a unique name. | ||||
| 
 | ||||
| ##### Packaging | ||||
| 
 | ||||
| The provisioner is packaged as a container image and will run on each node in the cluster as part of | ||||
| a DaemonSet.  It needs to be run with a user or service account with the following permissions: | ||||
| 
 | ||||
| * Create/delete/list/get PersistentVolumes - Can use the `system:persistentvolumeprovisioner` ClusterRoleBinding | ||||
| * Get ConfigMaps - To access user configuration for the provisioner | ||||
| * Get Nodes - To get the node's UID and labels | ||||
| 
 | ||||
| These are broader permissions than necessary (a node's access to PVs should be restricted to only | ||||
| those local to the node).  A redesign will be considered in a future release to address this issue. | ||||
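
For illustration, the permissions could be granted to the `local-storage-admin`
service account with RBAC objects along the following lines.  This is only a
sketch: the role and binding names are placeholders, and the built-in
`system:persistentvolumeprovisioner` binding mentioned above could be used for
the PersistentVolume rules instead.

```
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: local-storage-provisioner-role
rules:
- apiGroups: [""]
  resources: ["persistentvolumes"]
  verbs: ["get", "list", "create", "delete"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: local-storage-provisioner-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: local-storage-provisioner-role
subjects:
- kind: ServiceAccount
  name: local-storage-admin
  namespace: kube-system
```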
| 
 | ||||
| In addition, it should run with high priority so that it can reliably handle all the local storage | ||||
partitions on each node, and with enough permissions to be able to clean up volume contents upon
| deletion. | ||||
| 
 | ||||
| The provisioner DaemonSet requires the following configuration: | ||||
| 
 | ||||
| * The node's name set as the MY_NODE_NAME environment variable | ||||
| * ConfigMap with StorageClass -> discovery directory mappings | ||||
| * Each mapping in the ConfigMap needs a hostPath volume | ||||
| * User/service account with all the required permissions | ||||
| 
 | ||||
| Here is an example ConfigMap: | ||||
| 
 | ||||
| ``` | ||||
apiVersion: v1
kind: ConfigMap
| metadata: | ||||
|   name: local-volume-config | ||||
|   namespace: kube-system | ||||
| data: | ||||
|   "local-fast": | | ||||
|     { | ||||
|       "hostDir": "/mnt/ssds", | ||||
|       "mountDir": "/local-ssds" | ||||
|     } | ||||
|   "local-slow": | | ||||
|     { | ||||
|       "hostDir": "/mnt/hdds", | ||||
|       "mountDir": "/local-hdds" | ||||
|     } | ||||
| ``` | ||||
| 
 | ||||
| The `hostDir` is the discovery path on the host, and the `mountDir` is the path it is mounted to in | ||||
| the provisioner container.  The `hostDir` is required because the provisioner needs to create Local PVs | ||||
| with the `Path` based off of `hostDir`, not `mountDir`. | ||||
| 
 | ||||
| The DaemonSet for this example looks like: | ||||
| ``` | ||||
| 
 | ||||
| apiVersion: extensions/v1beta1 | ||||
| kind: DaemonSet | ||||
| metadata: | ||||
|   name: local-storage-provisioner | ||||
|   namespace: kube-system | ||||
| spec: | ||||
|   template: | ||||
|     metadata: | ||||
|       labels: | ||||
|         system: local-storage-provisioner | ||||
|     spec: | ||||
|       containers: | ||||
|       - name: provisioner | ||||
|         image: "gcr.io/google_containers/local-storage-provisioner:v1.0" | ||||
|         imagePullPolicy: Always | ||||
|         volumeMounts: | ||||
|         - name: vol1 | ||||
|           mountPath: "/local-ssds" | ||||
|         - name: vol2 | ||||
|           mountPath: "/local-hdds" | ||||
|         env: | ||||
|         - name: MY_NODE_NAME | ||||
|           valueFrom: | ||||
|             fieldRef: | ||||
|               fieldPath: spec.nodeName | ||||
|       volumes: | ||||
|       - name: vol1 | ||||
|         hostPath: | ||||
|           path: "/mnt/ssds" | ||||
|       - name: vol2 | ||||
|         hostPath: | ||||
|           path: "/mnt/hdds" | ||||
|       serviceAccount: local-storage-admin | ||||
| ``` | ||||
| 
 | ||||
##### Provisioner Bootstrapper
| 
 | ||||
| Manually setting up this DaemonSet spec can be tedious and it requires duplicate specification | ||||
| of the StorageClass -> directory mappings both in the ConfigMap and as hostPath volumes. To | ||||
make it simpler and less error-prone, a bootstrapper application will be provided to generate
| and launch the provisioner DaemonSet based off of the ConfigMap.  It can also create a service | ||||
| account with all the required permissions. | ||||
| 
 | ||||
The bootstrapper accepts the following optional arguments:
| 
 | ||||
| * -image: Name of local volume provisioner image (default | ||||
| "quay.io/external_storage/local-volume-provisioner:latest") | ||||
| * -volume-config: Name of the local volume configuration configmap. The configmap must reside in the same | ||||
| namespace as the bootstrapper. (default "local-volume-default-config") | ||||
* -serviceaccount: Name of the service account for local volume provisioner (default "local-storage-admin")
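
One hypothetical way to run the bootstrapper is as a one-off pod in the same
namespace as the ConfigMap.  The image name and the service account it runs
under are assumptions here (not defined by this document); the flags shown are
the ones listed above.

```
apiVersion: v1
kind: Pod
metadata:
  name: local-volume-provisioner-bootstrap
  namespace: kube-system
spec:
  restartPolicy: Never
  # Assumed: an account that already has the bootstrapper permissions listed below
  serviceAccount: local-storage-bootstrapper
  containers:
  - name: bootstrapper
    # Assumed/illustrative image name
    image: "quay.io/external_storage/local-volume-provisioner-bootstrap:latest"
    args:
    - "-volume-config=local-volume-config"
    - "-serviceaccount=local-storage-admin"
    - "-image=quay.io/external_storage/local-volume-provisioner:latest"
```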
| 
 | ||||
The bootstrapper requires the following permissions:
| 
 | ||||
| * Get/Create/Update ConfigMap | ||||
| * Create ServiceAccount | ||||
| * Create ClusterRoleBindings | ||||
| * Create DaemonSet | ||||
| 
 | ||||
Since the bootstrapper generates the DaemonSet spec, the ConfigMap can be simplified to just specify the
| host directories: | ||||
| 
 | ||||
| ``` | ||||
apiVersion: v1
kind: ConfigMap
| metadata: | ||||
|   name: local-volume-config | ||||
|   namespace: kube-system | ||||
| data: | ||||
|   "local-fast": | | ||||
|     { | ||||
|       "hostDir": "/mnt/ssds", | ||||
|     } | ||||
|   "local-slow": | | ||||
|     { | ||||
|       "hostDir": "/mnt/hdds", | ||||
|     } | ||||
| ``` | ||||
| 
 | ||||
The bootstrapper will update the ConfigMap with the generated `mountDir`.  It generates the `mountDir`
| by stripping off the initial "/" in `hostDir`, replacing the remaining "/" with "~", and adding the | ||||
| prefix path "/mnt/local-storage". | ||||
| 
 | ||||
In the above example, the generated `mountDir` values are `/mnt/local-storage/mnt~ssds` and
`/mnt/local-storage/mnt~hdds`, respectively.
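
For reference, after the bootstrapper fills in the generated `mountDir` values, the
ConfigMap data from the example above would look roughly like this:

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-volume-config
  namespace: kube-system
data:
  "local-fast": |
    {
      "hostDir": "/mnt/ssds",
      "mountDir": "/mnt/local-storage/mnt~ssds"
    }
  "local-slow": |
    {
      "hostDir": "/mnt/hdds",
      "mountDir": "/mnt/local-storage/mnt~hdds"
    }
```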
| 
 | ||||
| #### Use Case Deliverables | ||||
| 
 | ||||
| This alpha phase for Local PV support will provide the following capabilities: | ||||
| 
 | ||||
| * Local directories to be specified as Local PVs with node affinity | ||||
| * Pod using a PVC that is bound to a Local PV will always be scheduled to that node | ||||
| * External static provisioner DaemonSet that discovers local directories, creates, cleans up, | ||||
| and deletes Local PVs | ||||
| 
 | ||||
| #### Limitations | ||||
| 
 | ||||
| However, some use cases will not work: | ||||
| 
 | ||||
* Specifying multiple Local PVCs in a pod.  Most likely, the PVCs will be bound to
Local PVs on different nodes, making the pod unschedulable.
| * Specifying Pod affinity/anti-affinity with Local PVs.  PVC binding does not look at Pod scheduling | ||||
| constraints at all. | ||||
| * Using Local PVs in a highly utilized cluster.  PVC binding does not look at Pod resource requirements | ||||
| and Node resource availability. | ||||
| 
 | ||||
| These issues will be solved in a future release with advanced storage topology scheduling. | ||||
| 
 | ||||
As a workaround, PVCs can be manually prebound to specific Local PVs, which effectively
schedules the Pods using them onto specific nodes, as sketched below.
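
A minimal sketch of such a prebound claim follows; it simply names the target PV
directly via `volumeName` (the claim name, StorageClass, capacity, and PV name are
placeholders):

```
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: example-local-claim-my-node
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-fast
  volumeName: example-local-pv
  resources:
    requests:
      storage: 100Gi
```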
| 
 | ||||
| #### Test Cases | ||||
| 
 | ||||
| ##### API unit tests | ||||
| 
 | ||||
| * LocalVolumeSource cannot be specified without the feature gate | ||||
| * Non-empty PV node affinity is required for LocalVolumeSource | ||||
| * Preferred node affinity is not allowed | ||||
| * Path is required to be non-empty | ||||
* Invalid JSON representation of type NodeAffinity returns an error
| 
 | ||||
| ##### PV node affinity unit tests | ||||
| 
 | ||||
* Nil or empty node affinity evaluates to true for any node
* Node affinity specifying existing node labels evaluates to true
| * Node affinity specifying non-existing node label keys evaluates to false | ||||
| * Node affinity specifying non-existing node label values evaluates to false | ||||
| 
 | ||||
| ##### Local volume plugin unit tests | ||||
| 
 | ||||
| * Plugin can support PersistentVolumeSource | ||||
| * Plugin cannot support VolumeSource | ||||
| * Plugin supports ReadWriteOnce access mode | ||||
| * Plugin does not support remaining access modes | ||||
| * Plugin supports Mounter and Unmounter | ||||
| * Plugin does not support Provisioner, Recycler, Deleter | ||||
| * Plugin supports readonly | ||||
| * Plugin GetVolumeName() returns PV name | ||||
| * Plugin ConstructVolumeSpec() returns PV info | ||||
| * Plugin disallows backsteps in the Path | ||||
| 
 | ||||
| ##### Local volume provisioner unit tests | ||||
| 
 | ||||
| * Directory not in the cache and PV should be created | ||||
| * Directory is in the cache and PV should not be created | ||||
| * Directories created later are discovered and PV is created | ||||
| * Unconfigured directories are ignored | ||||
| * PVs are created with the configured StorageClass | ||||
| * PV name generation hashed correctly using node name, storageclass and filename | ||||
| * PV creation failure should not add directory to cache | ||||
| * Non-directory type should not create a PV | ||||
| * PV is released, PV should be deleted | ||||
| * PV should not be deleted for any other PV phase | ||||
| * PV deletion failure should not remove PV from cache | ||||
| * PV cleanup failure should not delete PV or remove from cache | ||||
| 
 | ||||
| ##### E2E tests | ||||
| 
 | ||||
| * Pod that is bound to a Local PV is scheduled to the correct node | ||||
| and can mount, read, and write | ||||
| * Two pods serially accessing the same Local PV can mount, read, and write | ||||
| * Two pods simultaneously accessing the same Local PV can mount, read, and write | ||||
| * Test both directory-based Local PV, and mount point-based Local PV | ||||
| * Launch local volume provisioner, create some directories under the discovery path, | ||||
| and verify that PVs are created and a Pod can mount, read, and write. | ||||
* After deleting a PVC bound to a PV managed by the local volume provisioner, the provisioner
should clean up the volume and recreate a new PV.
* Pod using a Local PV with a non-existent path fails to mount
| * Pod that sets nodeName to a different node than the PV node affinity cannot schedule. | ||||
| 
 | ||||
| 
 | ||||
| ### Phase 2: 1.9 alpha | ||||
| 
 | ||||
| #### Smarter PV binding | ||||
| 
 | ||||
| The issue of PV binding not taking into account pod scheduling requirements affects any | ||||
| type of volume that imposes topology constraints, such as local storage and zonal disks. | ||||
| 
 | ||||
| Because this problem affects more than just local volumes, it will be treated as a | ||||
| separate feature with a separate proposal.  Once that feature is implemented, then the | ||||
limitations outlined above will be fixed.
| 
 | ||||
| #### Block devices and raw partitions | ||||
| 
 | ||||
Pod access to raw block storage is a new alpha feature in 1.8.  Changes are required in
| the Local volume plugin and provisioner to be able to support raw block devices. | ||||
| 
 | ||||
| Design TBD | ||||
| 
 | ||||
| #### Provisioner redesign for stricter K8s API access control | ||||
| 
 | ||||
| In 1.7, each instance of the provisioner on each node has full permissions to create and | ||||
| delete all PVs in the system.  This is unnecessary and potentially a vulnerability if the | ||||
| node gets compromised. | ||||
| 
 | ||||
| To address this issue, the provisioner will be redesigned into two major components: | ||||
| 
 | ||||
| 1. A central manager pod that handles the creation and deletion of PV objects. | ||||
| This central pod can run on a trusted node and be given PV create/delete permissions. | ||||
2. Worker pods on each node, run as a DaemonSet, that discover and clean up the local
volumes on that node.  These workers do not interact with PV objects; however,
they still require permission to read the `Node.Labels` on their node.
| 
 | ||||
| The central manager will poll each worker for their discovered volumes and create PVs for | ||||
| them.  When a PV is released, then it will send the cleanup request to the worker. | ||||
| 
 | ||||
| Detailed design TBD | ||||