# Local Storage Persistent Volumes
Authors: @msau42, @vishh
This document presents a detailed design for supporting persistent local storage,
as outlined in [Local Storage Overview](local-storage-overview.md).
Supporting all the use cases for persistent local storage will take many releases,
so this document will be extended for each new release as we add more features.
## Goals
* Allow pods to mount any local block or filesystem based volume.
* Allow pods to mount dedicated local disks, or channeled partitions as volumes for
IOPS isolation.
* Allow pods to access local volumes without root privileges.
* Allow pods to access local volumes without needing to understand the storage
layout on every node.
* Persist local volumes and provide data gravity for pods. Any pod
using the local volume will be scheduled to the same node that the local volume
is on.
* Allow pods to release their local volume bindings and lose that volume's data
during failure conditions, such as node, storage or scheduling failures, where
the volume is not accessible for some user-configurable time.
* Allow pods to specify local storage as part of a Deployment or StatefulSet.
* Allow administrators to set up and configure local volumes with simple methods.
* Do not require administrators to manage the local volumes once provisioned
for a node.
## Non-Goals
* Provide data availability for a local volume beyond its local node.
* Support the use of HostPath volumes and Local PVs on the same volume.
## Background
In Kubernetes, there are two main types of storage: remote and local.
Remote storage is typically used with persistent volumes where the data can
persist beyond the lifetime of the pod.
Local storage is typically used with ephemeral volumes where the data only
persists during the lifetime of the pod.
There is increasing demand for using local storage as persistent volumes,
especially for distributed filesystems and databases such as GlusterFS and
Cassandra. The main motivations for using persistent local storage, instead
of persistent remote storage, include:
* Performance: Local SSDs achieve higher IOPS and throughput than many
remote storage solutions.
* Cost: Operational costs may be reduced by leveraging existing local storage,
especially in bare metal environments. Network storage can be expensive to
set up and maintain, and it may not be necessary for certain applications.
## Use Cases
### Distributed filesystems and databases
Many distributed filesystem and database implementations, such as Cassandra and
GlusterFS, utilize the local storage on each node to form a storage cluster.
These systems typically have a replication feature that sends copies of the data
to other nodes in the cluster in order to provide fault tolerance in case of
node failures. Non-distributed, but replicated databases, like MySQL, can also
utilize local storage to store replicas.
The main motivations for using local persistent storage are performance and
cost. Since the application handles data replication and fault tolerance, these
application pods do not need networked storage to provide shared access to data.
In addition, installing a high-performing NAS or SAN solution can be more
expensive, and more complex to configure and maintain than utilizing local
disks, especially if the node was already pre-installed with disks. Datacenter
infrastructure and operational costs can be reduced by increasing storage
utilization.
These distributed systems are generally stateful, infrastructure applications
that provide data services to higher-level applications. They are expected to
run in a cluster with many other applications potentially sharing the same
nodes. Therefore, they expect to have high priority and node resource
guarantees. They typically are deployed using StatefulSets, custom
controllers, or operators.
### Caching
Caching is one of the recommended use cases for ephemeral local storage. The
cached data is backed by persistent storage, so local storage data durability is
not required. However, there is a use case for persistent local storage to
achieve data gravity for large caches. For large caches, if a pod restarts,
rebuilding the cache can take a long time. As an example, rebuilding a 100GB
cache from a hard disk with 150MB/s read throughput can take around 10 minutes.
If the service gets restarted and all the pods have to restart, then performance
and availability can be impacted while the pods are rebuilding. If the cache is
persisted, then cold startup latencies are reduced.
Content-serving applications and producer/consumer workflows commonly utilize
caches for better performance. They are typically deployed using Deployments,
and can run in their own dedicated cluster or share a cluster with other applications.
## Environments
### Baremetal
In a baremetal environment, nodes may be configured with multiple local disks of
varying capacity, speeds and mediums. Mediums include spinning disks (HDDs) and
solid-state drives (SSDs), and capacities of each disk can range from hundreds
of GBs to tens of TBs. Multiple disks may be arranged in JBOD or RAID configurations
to be consumed as persistent storage.
Currently, the methods to use the additional disks are to:
* Configure a distributed filesystem
* Configure a HostPath volume
It is also possible to configure a NAS or SAN for a node. Speeds and
capacities vary widely depending on the solution.
### GCE/GKE
GCE and GKE both have a local SSD feature that can create a VM instance with up
to 8 fixed-size 375GB local SSDs physically attached to the instance host, which
appear as additional disks in the instance. The local SSDs have to be
configured at VM creation time and cannot be dynamically attached to an
instance later. If the VM gets shut down, terminated, or preempted, or the host
encounters a non-recoverable error, then the SSD data will be lost. If the
guest OS reboots, or a live migration occurs, then the SSD data will be
preserved.
### EC2
In EC2, the instance store feature attaches local HDDs or SSDs to a new instance
as additional disks. HDD configurations can go up to 24 x 2TB disks, and SSD
configurations can go up to 8 x 800GB disks or 2 x 2TB disks. Data on the
instance store persists only across instance reboots.
## Limitations of current volumes
The following is an overview of existing volume types in Kubernetes, and how
they cannot completely address the use cases for local persistent storage.
* EmptyDir: A temporary directory for a pod that is created under the kubelet
root directory. The contents are deleted when a pod dies. Limitations:
* Volume lifetime is bound to the pod lifetime. Pod failure is more likely
than node failure, so there can be increased network and storage activity to
recover data via replication and data backups when a replacement pod is started.
* Multiple disks are not supported unless the administrator aggregates them
into a spanned or RAID volume. In this case, all the storage is shared, and
IOPS guarantees cannot be provided.
* There is currently no method of distinguishing between HDDs and SSDs. The
“medium” field could be expanded, but it is not easily generalizable to
arbitrary types of mediums.
* HostPath: A direct mapping to a specified directory on the node. The
directory is not managed by the cluster. Limitations:
* The admin needs to manually set up directory permissions for the volume's users.
* The admin has to manage the volume lifecycle manually and clean up the data and
directories.
* All nodes have to have their local storage provisioned the same way in order to
use the same pod template.
* There can be path collisions if multiple pods that want the same path get
scheduled to the same node.
* If node affinity is specified, then the user has to do the pod scheduling
manually.
* Provider block storage (GCE PD, AWS EBS, etc.): A remote disk that can be
attached to a VM instance. The disk's lifetime is independent of the pod's
lifetime. Limitations:
* Does not meet performance requirements.
[Performance benchmarks on GCE](https://cloud.google.com/compute/docs/disks/performance)
show that local SSD can perform better than SSD persistent disks:
* 16x read IOPS
* 11x write IOPS
* 6.5x read throughput
* 4.5x write throughput
* Networked filesystems (NFS, GlusterFS, etc): A filesystem reachable over the
network that can provide shared access to data. Limitations:
* Requires more configuration and setup, which adds operational burden and
cost.
* Requires a high performance network to achieve equivalent performance as
local disks, especially when compared to high-performance SSDs.
Due to the current limitations in the existing volume types, a new method for
providing persistent local storage should be considered.
## Feature Plan
A detailed implementation plan can be found in the
[Storage SIG planning spreadsheet](https://docs.google.com/spreadsheets/d/1t4z5DYKjX2ZDlkTpCnp18icRAQqOE85C1T1r2gqJVck/view#gid=1566770776).
The following is a high level summary of the goals in each phase.
### Phase 1
* Support Pod, Deployment, and StatefulSet requesting a single local volume
* Support pre-configured, statically partitioned, filesystem-based local volumes
### Phase 2
* Block devices and raw partitions
* Smarter PV binding to consider local storage and pod scheduling constraints,
such as pod affinity/anti-affinity, and requesting multiple local volumes
### Phase 3
* Support common partitioning patterns
* Volume taints and tolerations for unbinding volumes in error conditions
### Phase 4
* Dynamic provisioning
## Design
A high level proposal with user workflows is available in the
[Local Storage Overview](local-storage-overview.md).
This design section will focus on one phase at a time. Each new release will
extend this section.
### Phase 1: 1.7 alpha
#### Local Volume Plugin
A new volume plugin will be introduced to represent logical block partitions and
filesystem mounts that are local to a node. Some examples include whole disks,
disk partitions, RAID volumes, LVM volumes, or even directories in a shared
partition. Multiple Local volumes can be created on a node, and each is
accessed through a local mount point or path that is bind-mounted into the
container. It is only consumable as a PersistentVolumeSource because the PV
interface solves the pod spec portability problem and provides the following:
* Abstracts volume implementation details for the pod and expresses volume
requirements in terms of general concepts, like capacity and class. This allows
for portable configuration, as the pod is not tied to specific volume instances.
* Allows volume management to be independent of the pod lifecycle. The volume can
survive container, pod and node restarts.
* Allows volume classification by StorageClass.
* Is uniquely identifiable within a cluster and is managed from a cluster-wide
view.
There are major changes in PV and pod semantics when using Local volumes
compared to the typical remote storage volumes.
* Since Local volumes are fixed to a node, a pod using that volume has to
always be scheduled on that node.
* Volume availability is tied to the node's availability. If the node is
unavailable, then the volume is also unavailable, which impacts pod
availability.
* The volume's data durability characteristics are determined by the underlying
storage system, and cannot be guaranteed by the plugin. A Local volume
in one environment can provide data durability, but in another environment may
only be ephemeral. As an example, in the GCE/GKE/AWS cloud environments, the
data in directly attached, physical SSDs is immediately deleted when the VM
instance terminates or becomes unavailable.
Due to these differences in behaviors, Local volumes are not suitable for
general purpose use cases, and are only suitable for specific applications that
need storage performance and data gravity, and can tolerate data loss or
unavailability. Applications need to be aware of, and be able to handle these
differences in data durability and availability.
Local volumes are similar to HostPath volumes in the following ways:
* Partitions need to be configured by the storage administrator beforehand.
* Volume is referenced by the path to the partition.
* Provides the same support for IOPS isolation via dedicated underlying partitions.
* Volume is permanently attached to one node.
* Volume can be mounted by multiple pods on the same node.
However, Local volumes will address these current issues with HostPath
volumes:
* Security concerns with allowing a pod to access any path on a node. Local
volumes cannot be consumed directly by a pod. They must be specified as a PV
source, so only users with storage provisioning privileges can determine which
paths on a node are available for consumption.
* Difficulty in permissions setup. Local volumes will support fsGroup so
that the admins do not need to set up the permissions beforehand, tying that
particular volume to a specific user/group. During the mount, the fsGroup
settings will be applied on the path. However, multiple pods
using the same volume should use the same fsGroup.
* Volume lifecycle is not clearly defined, and the volume has to be manually
cleaned up by users. For Local volumes, the PV has a clearly defined
lifecycle. Upon PVC deletion, the PV will be released, and if it has the Delete
reclaim policy, all the contents under the path will be deleted. In the future,
advanced cleanup options, like zeroing can also be specified for a more
comprehensive cleanup.
##### API Changes
All new changes are protected by a new feature gate, `PersistentLocalVolumes`.
A new `LocalVolumeSource` type is added as a `PersistentVolumeSource`. For this
initial phase, the path can only be a mount point or a directory in a shared
filesystem.
```
type LocalVolumeSource struct {
    // The full path to the volume on the node.
    // For alpha, this path must be a directory.
    // Once block as a source is supported, then this path can point to a block device.
    Path string
}

type PersistentVolumeSource struct {
    <snip>

    // Local represents directly-attached storage with node affinity.
    // +optional
    Local *LocalVolumeSource
}
```
The relationship between a Local volume and its node will be expressed using
PersistentVolume node affinity, described in the following section.
Users request Local volumes using PersistentVolumeClaims in the same manner as any
other volume type. The PVC will bind to a matching PV with the appropriate capacity,
AccessMode, and StorageClassName. Then the user specifies that PVC in their
Pod spec. There are no special annotations or fields that need to be set in the Pod
or PVC to distinguish between local and remote storage. It is abstracted by the
StorageClass.
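For illustration, a minimal sketch of a PVC and a Pod consuming a Local PV could look like
the following; the names, StorageClass, and capacity are placeholders and not prescribed by
this design:
```
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: example-local-claim      # placeholder name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-fast   # must match the StorageClass of the intended Local PVs
  resources:
    requests:
      storage: 5Gi
---
kind: Pod
apiVersion: v1
metadata:
  name: example-local-pod
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: local-vol
      mountPath: /data
  volumes:
  - name: local-vol
    persistentVolumeClaim:
      claimName: example-local-claim
```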
#### PersistentVolume Node Affinity
PersistentVolume node affinity is a new concept and is similar to Pod node affinity,
except instead of specifying which nodes a Pod has to be scheduled to, it specifies which nodes
a PersistentVolume can be attached and mounted to, influencing scheduling of Pods that
use local volumes.
For a Pod that uses a PV with node affinity, a new scheduler predicate
will evaluate that node affinity against the node's labels. For this initial phase, the
PV node affinity is only considered by the scheduler for already-bound PVs. It is not
considered during the initial PVC/PV binding, which will be addressed in a future release.
Only the `requiredDuringSchedulingIgnoredDuringExecution` field will be supported.
##### API Changes
For the initial alpha phase, node affinity is expressed as an optional
annotation in the PersistentVolume object.
```
// AlphaStorageNodeAffinityAnnotation defines node affinity policies for a PersistentVolume.
// Value is a string of the json representation of type NodeAffinity
AlphaStorageNodeAffinityAnnotation = "volume.alpha.kubernetes.io/node-affinity"
```
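As a sketch of how this annotation might be used together with the `Local` source (the node
name, path, capacity, and StorageClass shown here are assumptions, not part of the API):
```
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv
  annotations:
    "volume.alpha.kubernetes.io/node-affinity": |
      { "requiredDuringSchedulingIgnoredDuringExecution": {
          "nodeSelectorTerms": [
            { "matchExpressions": [
                { "key": "kubernetes.io/hostname",
                  "operator": "In",
                  "values": ["example-node"] }
            ]}
          ]
        }
      }
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-fast
  local:
    path: /mnt/ssds/disk1
```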
#### Local volume initial configuration
There are countless ways to configure local storage on a node, with different patterns to
follow depending on application requirements and use cases. Some use cases may require
dedicated disks; others may only need small partitions and can share disks.
Instead of forcing a partitioning scheme on storage administrators, the Local volume
is represented by a path, and lets the administrators partition their storage however they
like, with a few minimum requirements:
* The paths to the mount points are always consistent, even across reboots or when storage
is added or removed.
* The paths are backed by a filesystem (block devices or raw partitions are not supported for
the first phase)
* The directories have appropriate permissions for the provisioner to be able to set owners and
clean up the volume.
#### Local volume management
Local PVs are statically created and not dynamically provisioned for the first phase.
To reduce the amount of time an administrator has to spend managing Local volumes,
a Local static provisioner application will be provided to handle common scenarios. For
uncommon scenarios, a specialized provisioner can be written.
The Local static provisioner will be developed in the
[kubernetes-incubator/external-storage](https://github.com/kubernetes-incubator/external-storage)
repository, and will loosely follow the external provisioner design, with a few differences:
* A provisioner instance needs to run on each node and only manage the local storage on its node.
* For phase 1, it does not handle dynamic provisioning. Instead, it performs static provisioning
by discovering available partitions mounted under configurable discovery directories.
The basic design of the provisioner will have two separate handlers: one for PV deletion and
cleanup, and the other for static PV creation. A PersistentVolume informer will be created
and its cache will be used by both handlers.
PV deletion will operate on the Update event. If the PV it provisioned changes to the “Released”
state, and if the reclaim policy is Delete, then it will clean up the volume and delete the PV,
removing it from the cache.
PV creation does not operate on any informer events. Instead, it periodically monitors the discovery
directories, and will create a new PV for each path in the directory that is not in the PV cache. It
sets the "pv.kubernetes.io/provisioned-by" annotation so that it can distinguish which PVs it created.
For phase 1, the allowed discovery file types are directories and mount points. The PV capacity
will be the capacity of the underlying filesystem. Therefore, PVs that are backed by shared
directories will report their capacity as that of the entire filesystem, potentially causing overcommitment.
Separate partitions are recommended for capacity isolation.
The name of the PV needs to be unique across the cluster. The provisioner will hash the node name,
StorageClass name, and base file name in the volume path to generate a unique name.
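As a rough illustration (all names and values below are hypothetical), a PV created by the
provisioner for a discovered mount point could look like this, in addition to the node
affinity annotation described earlier:
```
apiVersion: v1
kind: PersistentVolume
metadata:
  # Hashed from the node name, StorageClass name, and base file name (example value only)
  name: local-pv-4cba3e9f
  annotations:
    # Identifies PVs created by this provisioner; the exact value format is illustrative
    pv.kubernetes.io/provisioned-by: local-volume-provisioner-node-1
spec:
  capacity:
    storage: 368Gi               # capacity of the underlying filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-fast
  local:
    path: /mnt/ssds/disk1        # a discovered mount point under the configured hostDir
```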
##### Packaging
The provisioner is packaged as a container image and will run on each node in the cluster as part of
a DaemonSet. It needs to be run with a user or service account with the following permissions:
* Create/delete/list/get PersistentVolumes - Can use the `system:persistentvolumeprovisioner` ClusterRoleBinding
* Get ConfigMaps - To access user configuration for the provisioner
* Get Nodes - To get the node's UID and labels
These are broader permissions than necessary (a node's access to PVs should be restricted to only
those local to the node). A redesign will be considered in a future release to address this issue.
In addition, it should run with high priority so that it can reliably handle all the local storage
partitions on each node, and with enough permissions to be able to clean up volume contents upon
deletion.
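One way to grant the permissions listed above to the provisioner's service account
(`local-storage-admin` in the examples below) is a dedicated ClusterRole and
ClusterRoleBinding; the following is only a sketch derived from that list, and the role name
is arbitrary:
```
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: local-storage-provisioner        # arbitrary example name
rules:
- apiGroups: [""]
  resources: ["persistentvolumes"]
  verbs: ["get", "list", "create", "delete"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: local-storage-provisioner
subjects:
- kind: ServiceAccount
  name: local-storage-admin
  namespace: kube-system
roleRef:
  kind: ClusterRole
  name: local-storage-provisioner
  apiGroup: rbac.authorization.k8s.io
```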
The provisioner DaemonSet requires the following configuration:
* The node's name set as the MY_NODE_NAME environment variable
* ConfigMap with StorageClass -> discovery directory mappings
* Each mapping in the ConfigMap needs a hostPath volume
* User/service account with all the required permissions
Here is an example ConfigMap:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-volume-config
  namespace: kube-system
data:
  "local-fast": |
    {
      "hostDir": "/mnt/ssds",
      "mountDir": "/local-ssds"
    }
  "local-slow": |
    {
      "hostDir": "/mnt/hdds",
      "mountDir": "/local-hdds"
    }
```
The `hostDir` is the discovery path on the host, and the `mountDir` is the path it is mounted to in
the provisioner container. The `hostDir` is required because the provisioner needs to create Local PVs
with the `Path` based on `hostDir`, not `mountDir`.
The DaemonSet for this example looks like:
```
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: local-storage-provisioner
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        system: local-storage-provisioner
    spec:
      containers:
      - name: provisioner
        image: "gcr.io/google_containers/local-storage-provisioner:v1.0"
        imagePullPolicy: Always
        volumeMounts:
        - name: vol1
          mountPath: "/local-ssds"
        - name: vol2
          mountPath: "/local-hdds"
        env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
      volumes:
      - name: vol1
        hostPath:
          path: "/mnt/ssds"
      - name: vol2
        hostPath:
          path: "/mnt/hdds"
      serviceAccount: local-storage-admin
```
##### Provisioner Bootstrapper
Manually setting up this DaemonSet spec can be tedious, and it requires duplicate specification
of the StorageClass -> directory mappings both in the ConfigMap and as hostPath volumes. To
make it simpler and less error-prone, a bootstrapper application will be provided to generate
and launch the provisioner DaemonSet based off of the ConfigMap. It can also create a service
account with all the required permissions.
The bootstrapper accepts the following optional arguments:
* -image: Name of local volume provisioner image (default
"quay.io/external_storage/local-volume-provisioner:latest")
* -volume-config: Name of the local volume configuration configmap. The configmap must reside in the same
namespace as the bootstrapper. (default "local-volume-default-config")
* -serviceaccount: Name of the service account for local volume provisioner (default "local-storage-admin")
The bootstrapper requires the following permissions:
* Get/Create/Update ConfigMap
* Create ServiceAccount
* Create ClusterRoleBindings
* Create DaemonSet
Since the bootstrapper generates the DaemonSet spec, the ConfigMap can be simplified to just specify the
host directories:
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-volume-config
  namespace: kube-system
data:
  "local-fast": |
    {
      "hostDir": "/mnt/ssds"
    }
  "local-slow": |
    {
      "hostDir": "/mnt/hdds"
    }
```
The bootstrapper will update the ConfigMap with the generated `mountDir`. It generates the `mountDir`
by stripping off the initial "/" in `hostDir`, replacing the remaining "/" with "~", and adding the
prefix path "/mnt/local-storage".
In the above example, the generated `mountDir` values are `/mnt/local-storage/mnt~ssds` and
`/mnt/local-storage/mnt~hdds`, respectively.
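Applying those rules to the example above, the updated ConfigMap would look roughly like this
(only the generated `mountDir` entries are added):
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-volume-config
  namespace: kube-system
data:
  "local-fast": |
    {
      "hostDir": "/mnt/ssds",
      "mountDir": "/mnt/local-storage/mnt~ssds"
    }
  "local-slow": |
    {
      "hostDir": "/mnt/hdds",
      "mountDir": "/mnt/local-storage/mnt~hdds"
    }
```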
#### Use Case Deliverables
This alpha phase for Local PV support will provide the following capabilities:
* Local directories to be specified as Local PVs with node affinity
* Pod using a PVC that is bound to a Local PV will always be scheduled to that node
* External static provisioner DaemonSet that discovers local directories, creates, cleans up,
and deletes Local PVs
#### Limitations
However, some use cases will not work:
* Specifying multiple Local PVCs in a pod. Most likely, the PVCs will be bound to Local PVs on
different nodes, making the pod unschedulable.
* Specifying Pod affinity/anti-affinity with Local PVs. PVC binding does not look at Pod scheduling
constraints at all.
* Using Local PVs in a highly utilized cluster. PVC binding does not look at Pod resource requirements
and Node resource availability.
These issues will be solved in a future release with advanced storage topology scheduling.
As a workaround, PVCs can be manually prebound to Local PVs to essentially manually schedule Pods to
specific nodes.
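For example, a minimal sketch of pre-binding a claim to a specific Local PV via
`spec.volumeName` (the PV name, StorageClass, and capacity are placeholders):
```
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: prebound-local-claim
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: local-fast
  volumeName: example-local-pv   # name of the specific Local PV to bind to
  resources:
    requests:
      storage: 5Gi
```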
#### Test Cases
##### API unit tests
* LocalVolumeSource cannot be specified without the feature gate
* Non-empty PV node affinity is required for LocalVolumeSource
* Preferred node affinity is not allowed
* Path is required to be non-empty
* Invalid json representation of type NodeAffinity returns error
##### PV node affinity unit tests
* Nil or empty node affinity evaluates to true for any node
* Node affinity specifying existing node labels evaluates to true
* Node affinity specifying non-existing node label keys evaluates to false
* Node affinity specifying non-existing node label values evaluates to false
##### Local volume plugin unit tests
* Plugin can support PersistentVolumeSource
* Plugin cannot support VolumeSource
* Plugin supports ReadWriteOnce access mode
* Plugin does not support remaining access modes
* Plugin supports Mounter and Unmounter
* Plugin does not support Provisioner, Recycler, Deleter
* Plugin supports readonly
* Plugin GetVolumeName() returns PV name
* Plugin ConstructVolumeSpec() returns PV info
* Plugin disallows backsteps in the Path
##### Local volume provisioner unit tests
* Directory not in the cache and PV should be created
* Directory is in the cache and PV should not be created
* Directories created later are discovered and PV is created
* Unconfigured directories are ignored
* PVs are created with the configured StorageClass
* PV name generation hashed correctly using node name, storageclass and filename
* PV creation failure should not add directory to cache
* Non-directory type should not create a PV
* PV is released, PV should be deleted
* PV should not be deleted for any other PV phase
* PV deletion failure should not remove PV from cache
* PV cleanup failure should not delete PV or remove from cache
##### E2E tests
* Pod that is bound to a Local PV is scheduled to the correct node
and can mount, read, and write
* Two pods serially accessing the same Local PV can mount, read, and write
* Two pods simultaneously accessing the same Local PV can mount, read, and write
* Test both directory-based Local PV, and mount point-based Local PV
* Launch local volume provisioner, create some directories under the discovery path,
and verify that PVs are created and a Pod can mount, read, and write.
* After destroying a PVC managed by the local volume provisioner, it should clean up
the volume and create a new PV.
* Pod using a Local PV with a non-existent path fails to mount
* Pod that sets nodeName to a different node than the PV node affinity cannot schedule.
### Phase 2: 1.9 alpha
#### Smarter PV binding
The issue of PV binding not taking into account pod scheduling requirements affects any
type of volume that imposes topology constraints, such as local storage and zonal disks.
Because this problem affects more than just local volumes, it will be treated as a
separate feature with a separate proposal. Once that feature is implemented, then the
limitations outlined above will be fixed.
#### Block devices and raw partitions
Pods accessing raw block storage is a new alpha feature in 1.8. Changes are required in
the Local volume plugin and provisioner to be able to support raw block devices.
Design TBD
#### Provisioner redesign for stricter K8s API access control
In 1.7, each instance of the provisioner on each node has full permissions to create and
delete all PVs in the system. This is unnecessary and potentially a vulnerability if the
node gets compromised.
To address this issue, the provisioner will be redesigned into two major components:
1. A central manager pod that handles the creation and deletion of PV objects.
This central pod can run on a trusted node and be given PV create/delete permissions.
2. Worker pods on each node, run as a DaemonSet, that discover and clean up the local
volumes on that node. These workers do not interact with PV objects; however,
they still require permissions to be able to read the `Node.Labels` on their node.
The central manager will poll each worker for their discovered volumes and create PVs for
them. When a PV is released, then it will send the cleanup request to the worker.
Detailed design TBD