Robert Krawitz 2018-09-10 11:36:40 -04:00
parent 8dcee5c894
commit 4a754840c5
1 changed files with 499 additions and 81 deletions


@ -26,43 +26,87 @@ superseded-by:
- KEP-100
---
# Quotas for Ephemeral Storage
## Table of Contents
A table of contents is helpful for quickly jumping to sections of a
KEP and for highlighting any additional information provided beyond
the standard KEP template. [Tools for generating][] a table of
contents from markdown are available.
* [Quotas for Ephemeral Storage](#quotas-for-ephemeral-storage)
* [Table of Contents](#table-of-contents)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [Operation Flow -- Applying a Quota](#operation-flow----applying-a-quota)
* [Operation Flow -- Retrieving Storage Consumption](#operation-flow----retrieving-storage-consumption)
* [Operation Flow -- Removing a Quota.](#operation-flow----removing-a-quota)
* [Operation Notes](#operation-notes)
* [Selecting a Project ID](#selecting-a-project-id)
* [Determine Whether a Project ID Applies To a Directory](#determine-whether-a-project-id-applies-to-a-directory)
* [Return a Project ID To the System](#return-a-project-id-to-the-system)
* [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional)
* [Notes on Implementation](#notes-on-implementation)
* [Notes on Code Changes](#notes-on-code-changes)
* [Testing Strategy](#testing-strategy)
* [Risks and Mitigations](#risks-and-mitigations)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)
* [Drawbacks [optional]](#drawbacks-optional)
* [Alternatives [optional]](#alternatives-optional)
* [Alternative quota-based implementation](#alternative-quota-based-implementation)
* [Alternative loop filesystem-based implementation](#alternative-loop-filesystem-based-implementation)
* [Infrastructure Needed [optional]](#infrastructure-needed-optional)
[Tools for generating]: https://github.com/ekalinin/github-markdown-toc
## Summary
Local storage capacity isolation, aka ephemeral-storage, was
introduced into Kubernetes via
<https://github.com/kubernetes/features/issues/361>. It provides
support for capacity isolation of shared storage between pods, such
that a pod can be limited in its consumption of shared resources and
can be evicted if its consumption of shared storage exceeds that
limit. The limits and requests for shared ephemeral-storage are
similar to those for memory and CPU consumption.
The current mechanism relies on periodically walking each ephemeral
volume (emptydir, logdir, or container writable layer) and summing the
space consumption. This method is slow, can be fooled, and has high
latency (i. e. a pod could consume a lot of storage prior to the
kubelet being aware of its overage and terminating it).
The mechanism proposed here utilizes filesystem project quotas to
provide monitoring of resource consumption and optionally enforcement
of limits. Project quotas, initially in XFS and more recently ported
to ext4fs, offer a kernel-based means of restricting and monitoring
filesystem consumption that can be applied to one or more directories.
## Motivation
The mechanism presently used to monitor storage consumption involves
use of `du` and `find` to periodically gather information about
storage and inode consumption of volumes. This mechanism suffers from
a number of drawbacks:
* It is slow. If a volume contains a large number of files, walking
the directory can take a significant amount of time. There has been
at least one known report of nodes becoming not ready due to volume
metrics: <https://github.com/kubernetes/kubernetes/issues/62917>
* It is possible to conceal a file from the walker by creating it and
removing it while holding an open file descriptor on it. POSIX
behavior is to not remove the file until the last open file
descriptor pointing to it is removed. This has legitimate uses; it
ensures that a temporary file is deleted when the processes using it
exit, and it minimizes the attack surface by not having a file that
can be found by an attacker. The following pod does this; it will
never be caught by the present mechanism:
```yaml
apiVersion: v1
kind: Pod
@ -88,44 +132,85 @@ spec:
- name: a
emptyDir: {}
```
* It is reactive rather than proactive. It does not prevent a pod
from overshooting its limit; at best it catches it after the fact.
On a fast storage medium, such as NVMe, a pod may write 50 GB or
more of data before the housekeeping performed once per minute
catches up to it. If the primary volume is the root partition, this
will completely fill the partition, possibly causing serious
problems elsewhere on the system.
In many environments, these issues may not matter, but shared
multi-tenant environments need these issues addressed.
### Goals
* Primary: improve performance of monitoring by using project quotas
in a non-enforcing way to collect information about storage
utilization.
* Primary: detect storage used by pods that is concealed by deleted
files being held open.
* Primary: this will not interfere with the more common user and group
quotas.
* Stretch: enforce limits on per-volume storage consumption by using
enforced project quotas. Each volume would be given an enforced
quota of the total ephemeral storage limit of the pod.
### Non-Goals
* Enforcing limits on total pod storage consumption by any means, such
that the pod would be hard restricted to the desired storage limit.
## Proposal
This proposal applies project quotas to emptydir volumes on qualifying
filesystems (ext4fs and xfs with project quotas enabled). Project
quotas are applied by selecting an unused project ID (a 32-bit
unsigned integer), setting a limit on space and/or inode consumption,
and attaching the ID to one or more files. By default (and as
utilized herein), if a project ID is attached to a directory, it is
inherited by any files created under that directory.
If we elect to use the quota as enforcing, we impose a quota
consistent with the desired limit. If we elect to use it as
non-enforcing, we impose a large quota that in practice cannot be
exceeded (2^63-1 bytes for XFS, 2^58-1 bytes for ext4fs).
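For illustration, a minimal sketch of how the limit programmed into a
project quota might be chosen; the constant and function names here
are hypothetical, not taken from the implementation.

```go
package quota

const (
	// Effectively unlimited values used for non-enforcing quotas, per
	// the filesystem limits noted above.
	noEnforceXFS    int64 = 1<<63 - 1
	noEnforceExt4fs int64 = 1<<58 - 1
)

// hardLimitBytes returns the byte limit to program for a volume: the pod's
// ephemeral-storage limit when enforcing, or an effectively unlimited value
// (monitoring only) otherwise.
func hardLimitBytes(enforcing bool, desiredLimitBytes, filesystemMax int64) int64 {
	if enforcing && desiredLimitBytes > 0 {
		return desiredLimitBytes
	}
	return filesystemMax
}
```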
### Operation Flow -- Applying a Quota
* Caller (emptydir volume manager or container runtime) creates an
emptydir volume, with an empty directory at a location of its
choice.
* Caller requests that a quota be applied to a directory.
* Determine whether a quota can be imposed on the directory, by asking
each quota provider (one per filesystem type) whether it can apply a
quota to the directory. If no provider claims the directory, an
error status is returned to the caller.
* Select an unused project ID (see [below](#selecting-a-project-id)).
* Set the desired limit on the project ID, in a filesystem-dependent
manner (see [below](#notes-on-implementation)).
* Apply the project ID to the directory in question, in a
filesystem-dependent manner.
An error at any point results in no quota being applied and no change
to the state of the system. The caller in general should not assume a
priori that the attempt will be successful. It could choose to reject
a request if a quota cannot be applied, but at this time it will
simply ignore the error and proceed as today.
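A sketch of this flow, assuming a hypothetical per-filesystem
`applier` interface; none of these names are the actual code, and a
real implementation would also undo partial state on failure.

```go
package quota

import (
	"errors"
	"fmt"
)

// applier is the filesystem-specific part of the flow (one provider per
// filesystem type, e.g. XFS or ext4fs).
type applier interface {
	// CanHandle reports whether this provider can apply a quota to path.
	CanHandle(path string) bool
	// SetLimit programs a byte limit on the given project ID.
	SetLimit(projectID uint32, limitBytes int64) error
	// AttachProject associates the project ID with the directory so that
	// files created beneath it inherit the ID.
	AttachProject(path string, projectID uint32) error
}

var errNotSupported = errors.New("no quota provider claims this directory")

// applyQuota follows the steps above: find a provider, reserve an unused
// project ID, set the limit on it, and attach the ID to the directory.  Any
// error is returned to the caller, which today simply ignores it and
// proceeds as before.
func applyQuota(providers []applier, path string, limitBytes int64,
	reserveID func() (uint32, error)) error {
	for _, p := range providers {
		if !p.CanHandle(path) {
			continue
		}
		id, err := reserveID()
		if err != nil {
			return fmt.Errorf("selecting a project ID: %v", err)
		}
		if err := p.SetLimit(id, limitBytes); err != nil {
			return err
		}
		return p.AttachProject(path, id)
	}
	return errNotSupported
}
```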
### Operation Flow -- Retrieving Storage Consumption
* Caller (kubelet metrics code, cadvisor, container runtime) asks the
quota code to compute the amount of storage used under the
directory.
* Determine whether a quota applies to the directory, in a
filesystem-dependent manner (see [below](#notes-on-implementation)).
* If so, determine how much storage or how many inodes are utilized,
  in a filesystem-dependent manner.
If the quota code is unable to retrieve the consumption, it returns an
error status and it is up to the caller to utilize a fallback
mechanism (such as the directory walk performed today).
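A sketch of that caller-side fallback; `getQuotaUsage` and
`walkDirectory` are hypothetical stand-ins for the quota lookup and
for the du-style walk performed today.

```go
package metrics

// diskUsageBytes asks the quota code first and falls back to walking the
// directory only if the quota code cannot answer efficiently.
func diskUsageBytes(path string,
	getQuotaUsage func(string) (int64, error),
	walkDirectory func(string) (int64, error)) (int64, error) {
	if used, err := getQuotaUsage(path); err == nil {
		// Fast path: the kernel already tracks consumption for the
		// project ID attached to this directory.
		return used, nil
	}
	// Quota unavailable (unsupported filesystem, no quota applied, error):
	// fall back to the existing mechanism.
	return walkDirectory(path)
}
```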
### Operation Flow -- Removing a Quota.
@ -133,97 +218,430 @@ If the quota code is unable to retrieve the consumption, it returns an error sta
* Determine whether a project quota applies to the directory.
* Remove the limit from the project ID associated with the directory.
* Remove the association between the directory and the project ID.
* Return the project ID to the system to allow its use elsewhere (see
  [below](#return-a-project-id-to-the-system)).
* Caller may delete the directory and its contents (normally it will).
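A compact sketch of this flow; `lookupProjectID`, `removeLimit`,
`detachProject`, and `releaseID` are hypothetical helpers
corresponding to the steps above.

```go
package quota

func clearQuota(path string,
	lookupProjectID func(string) (uint32, error),
	removeLimit func(uint32) error,
	detachProject func(string, uint32) error,
	releaseID func(uint32) error) error {
	id, err := lookupProjectID(path) // does a project quota apply here?
	if err != nil {
		return err
	}
	if err := removeLimit(id); err != nil {
		return err
	}
	if err := detachProject(path, id); err != nil {
		return err
	}
	// Return the ID to the pool (/etc/projects, /etc/projid) so it can be
	// used elsewhere; the caller may then delete the directory itself.
	return releaseID(id)
}
```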
### Operation Notes
#### Selecting a Project ID
Project IDs are a shared space within a filesystem. If the same
project ID is assigned to multiple directories, the space consumption
reported by the quota will be the sum of that of all of the
directories. Hence, it is important to ensure that each directory is
assigned a unique project ID (unless it is desired to pool the storage
use of multiple directories).
The canonical mechanism to record persistently that a project ID is
reserved is to store it in the /etc/projid (projid(5)) and/or
/etc/projects (projects(5)) files. However, it is possible to utilize
project IDs without recording them in those files; they exist for
administrative convenience but neither the kernel nor the filesystem
is aware of them. Other ways can be used to determine whether a
project ID is in active use on a given filesystem:
* The quota values (in blocks and/or inodes) assigned to the project
ID are non-zero.
* The storage consumption (in blocks and/or inodes) reported under the
  project ID is non-zero.
The algorithm to be used is as follows:
* Lock this instance of the quota code against re-entrancy.
* open and flock() the /etc/projects and /etc/projid files, so that
  other uses of this code are excluded.
* Start from a high number (the prototype uses 1048577).
* Iterate from there, performing the following tests:
* Is the ID reserved by this instance of the quota code?
* Is the ID present in /etc/projects?
* Is the ID present in /etc/projid?
* Are the quota values and/or consumption reported by the kernel
non-zero? This test is restricted to 128 iterations to ensure
that a bug here or elsewhere does not result in an infinite loop
looking for a quota ID.
* If an ID has been found:
* Add it to an in-memory copy of /etc/projects and /etc/projid so
that any other uses of project quotas do not reuse it.
* Write temporary copies of /etc/projects and /etc/projid that are
flock()ed
* If successful, rename the temporary files appropriately (if
rename of one succeeds but the other fails, we have a problem
that we cannot recover from, and the files may be inconsistent).
* Unlock /etc/projid and /etc/projects.
* Unlock this instance of the quota code.
A minor variation of this is used if we want to reuse an existing
quota ID.
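A simplified sketch of the selection loop, assuming hypothetical
helpers (`idReservedLocally`, `idInProjectFiles`, `idBusyInKernel`)
for the individual checks; the real code also locks /etc/projects and
rewrites both files via flock()ed temporary copies and renames, which
is omitted here.

```go
package quota

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

const (
	firstProjectID  = 1048577 // starting point used by the prototype
	maxKernelProbes = 128     // bound on probing the kernel for in-use IDs
)

// selectProjectID scans upward from firstProjectID for an ID that none of
// the sources described above consider in use.  The caller is expected to
// record the chosen ID in /etc/projects and /etc/projid before the lock is
// dropped.
func selectProjectID(
	idReservedLocally func(uint32) bool,
	idInProjectFiles func(uint32) (bool, error),
	idBusyInKernel func(uint32) (bool, error),
) (uint32, error) {
	// Exclude other users of project quotas while scanning and updating
	// the files (the real code locks /etc/projects as well).
	f, err := os.OpenFile("/etc/projid", os.O_RDWR|os.O_CREATE, 0644)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil {
		return 0, err
	}
	defer unix.Flock(int(f.Fd()), unix.LOCK_UN)

	for i := 0; i < maxKernelProbes; i++ {
		id := uint32(firstProjectID + i)
		if idReservedLocally(id) {
			continue
		}
		if inFiles, err := idInProjectFiles(id); err != nil || inFiles {
			continue
		}
		if busy, err := idBusyInKernel(id); err != nil || busy {
			continue
		}
		return id, nil
	}
	return 0, fmt.Errorf("no free project ID found after %d probes", maxKernelProbes)
}
```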
#### Determine Whether a Project ID Applies To a Directory
It is possible to determine whether a directory has a project ID
applied to it by requesting (via the quotactl(2) system call) the
project ID associated with the directory. While the specifics are
filesystem-dependent, the basic method is the same for at least XFS
and ext4fs.
It is not possible to determine in a constant number of operations the
directory or directories to which a project ID is applied. It is
possible to determine whether a given project ID has been applied to
some existing directory or files (although not which ones); the
reported consumption will be non-zero.
The code records internally the project ID applied to a directory, but
it cannot always rely on this. In particular, if the kubelet has
exited and has been restarted (and hence the quota applying to the
directory should be removed), the map from directory to project ID is
lost. If it cannot find a map entry, it falls back on the approach
discussed above.
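A sketch of that lookup-with-fallback behavior; the struct is
illustrative, and `queryFilesystem` stands in for the
filesystem-dependent quotactl(2) query.

```go
package quota

import "sync"

type projectIDTracker struct {
	mu              sync.Mutex
	byDirectory     map[string]uint32 // recorded when the quota was assigned
	queryFilesystem func(dir string) (uint32, error)
}

// projectIDFor prefers the in-memory record but falls back to asking the
// filesystem which project ID, if any, is attached to the directory (e.g.
// after a kubelet restart, when the map has been lost).
func (t *projectIDTracker) projectIDFor(dir string) (uint32, error) {
	t.mu.Lock()
	id, ok := t.byDirectory[dir]
	t.mu.Unlock()
	if ok {
		return id, nil
	}
	return t.queryFilesystem(dir)
}
```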
#### Return a Project ID To the System
The algorithm used to return a project ID to the system closely
mirrors the one used to select a project ID, except that it releases
the ID rather than reserving one. It performs the same sequence of
locking /etc/projects and /etc/projid, editing copies of the files,
and restoring them.
If the project ID is applied to multiple directories and the code can
determine that, it will not remove the project ID from /etc/projid
until the last reference is removed. While it is not anticipated in
this KEP that this mode of operation will be used, at least initially,
this can be detected even on kubelet restart by looking at the
reference count in /etc/projects.
### Implementation Details/Notes/Constraints [optional]
What are the caveats to the implementation?
What are some important details that didn't come across above?
Go into as much detail as necessary here.
This might be a good place to talk about core concepts and how they relate.
#### Notes on Implementation
The primary new interface defined is the quota interface in
`pkg/volume/util/quota/quota.go`. This defines five operations:
* Does the specified directory support quotas
* Assign a quota to a directory. If a non-empty pod UID is provided,
the quota assigned is that of any other directories under this pod
UID; if an empty pod UID is provided, a unique quota is assigned.
* Retrieve the consumption of the specified directory. If the quota
code cannot handle it efficiently, it returns an error and the
  caller falls back on the existing mechanism.
* Retrieve the inode consumption of the specified directory; same
description as above.
* Remove quota from a directory. If a non-empty pod UID is passed, it
is checked against that recorded in-memory (if any). The quota is
removed from the specified directory. This can be used even if
AssignQuota has not been used; it inspects the directory and removes
the quota from it. This permits stale quotas from an interrupted
kubelet to be cleaned up.
Two implementations are provided: `quota_linux.go` (for Linux) and
`quota_unsupported.go` (for other operating systems). The latter
returns an error for all requests.
As the quota mechanism is intended to support multiple filesystems,
and different filesystems require different low level code for
manipulating quotas, a provider is supplied that finds an appropriate
quota applier implementation for the filesystem in question. The low
level quota applier provides similar operations to the top level quota
code, with two exceptions:
* No operation exists to determine whether a quota can be applied
(that is handled by the provider).
* An additional operation is provided to determine whether a given
quota ID is in use within the filesystem (outside of /etc/projects
and /etc/projid).
The two quota providers in the initial implementation are in
`pkg/volume/util/quota/extfs` and `pkg/volume/util/quota/xfs`. While
some quota operations do require different system calls, a lot of the
code is common, and factored into
`pkg/volume/util/quota/common/quota_linux_common_impl.go`.
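Expressed as a Go interface, the five operations might look roughly
like the following; the exact names and signatures are illustrative
rather than copied from `pkg/volume/util/quota/quota.go`.

```go
package quota

import (
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/types"
)

// Interface mirrors the five operations described above.
type Interface interface {
	// SupportsQuotas reports whether a project quota can be applied to
	// the specified directory.
	SupportsQuotas(path string) (bool, error)

	// AssignQuota applies a quota to the directory.  A non-empty pod UID
	// shares the quota with any other directories under that pod UID; an
	// empty pod UID gets a unique quota.
	AssignQuota(path string, podUID types.UID, limit *resource.Quantity) error

	// GetConsumption returns the storage consumed under the directory, or
	// an error if it cannot be determined efficiently; the caller then
	// falls back on the existing mechanism.
	GetConsumption(path string) (*resource.Quantity, error)

	// GetInodes is the inode counterpart of GetConsumption.
	GetInodes(path string) (*resource.Quantity, error)

	// ClearQuota removes the quota from the directory, checking a
	// non-empty pod UID against the one recorded in memory (if any).  It
	// works even if AssignQuota was never called in this kubelet's
	// lifetime, which lets stale quotas from an interrupted kubelet be
	// cleaned up.
	ClearQuota(path string, podUID types.UID) error
}
```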
#### Notes on Code Changes
The prototype for this project is mostly self-contained within
`pkg/volume/util/quota` and a few changes to
`pkg/volume/empty_dir/empty_dir.go`. However, a few changes were
required elsewhere:
* The operation executor needs to pass the desired size limit to the
volume plugin where appropriate so that the volume plugin can impose
a quota. The limit is passed as 0 (do not use quotas), positive
number (impose an enforcing quota if possible, measured in bytes),
or -1 (impose a non-enforcing quota, if possible) on the volume.
This requires changes to
`pkg/volume/util/operationexecutor/operation_executor.go` (to add
`DesiredSizeLimit` to `VolumeToMount`),
`pkg/kubelet/volumemanager/cache/desired_state_of_world.go`, and
`pkg/kubelet/eviction/helpers.go` (the latter in order to determine
whether the volume is a local ephemeral one).
* The volume manager (in `pkg/volume/volume.go`) changes the
`Mounter.SetUp` and `Mounter.SetUpAt` interfaces to take a new
`MounterArgs` type rather than an `FsGroup` (`*int64`). This is to
allow passing the desired size and pod UID (in the event we choose
to implement quotas shared between multiple volumes; see
[below](#alternative-quota-based-implementation)). This required
small changes to all volume plugins and their tests, but will in the
future allow adding additional data without having to change code
other than that which uses the new information.
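A sketch of the `MounterArgs` change and the limit encoding described
above; the field names and exact types here are illustrative.

```go
package volume

import "k8s.io/apimachinery/pkg/types"

// MounterArgs replaces the bare fsGroup (*int64) argument to SetUp/SetUpAt
// so that additional per-mount data can be added later without touching
// every volume plugin again.
type MounterArgs struct {
	FsGroup *int64 // what SetUp/SetUpAt previously received directly

	// DesiredSizeLimit uses the encoding described above: 0 means do not
	// use quotas, a positive value means an enforcing quota of that many
	// bytes if possible, and -1 means a non-enforcing quota if possible.
	DesiredSizeLimit int64

	// PodUID allows quotas to be shared between multiple volumes of the
	// same pod, should that option be chosen later.
	PodUID types.UID
}

// The Mounter interface would change from
//     SetUpAt(dir string, fsGroup *int64) error
// to
//     SetUpAt(dir string, mounterArgs MounterArgs) error
```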
#### Testing Strategy
The quota code is by and large not very amenable to unit tests. While
there are simple unit tests for parsing the mounts file, and there
could be tests for parsing the projects and projid files, the real
work (and risk) involves interactions with the kernel and with
multiple instances of this code (e. g. in the kubelet and the runtime
manager, particularly under stress). It also requires setup in the
form of a prepared filesystem. It would be better served by
appropriate end-to-end tests.
### Risks and Mitigations
What are the risks of this proposal and how do we mitigate them?
Think broadly.
For example, consider both security and how this will impact the larger Kubernetes ecosystem.
* The SIG raised the possibility of a container being unable to exit
  should we enforce quotas and the quota interfere with writing the
  log. This can be mitigated by either not applying a quota to the
log directory and using the du mechanism, or by applying a separate
non-enforcing quota to the log directory.
As log directories are write-only by the container, and consumption
can be limited by other means (as the log is filtered by the
runtime), I do not consider the ability to write uncapped to the log
to be a serious exposure.
Note in addition that even without quotas it is possible for writes
to fail due to lack of filesystem space, which is effectively (and
in some cases operationally) indistinguishable from exceeding quota,
so even at present code must be able to handle those situations.
* Filesystem quotas may impact performance to an unknown degree.
Information on that is hard to come by in general, and one of the
reasons for using quotas is indeed to improve performance. If this
is a problem in the field, merely turning off quotas (or selectively
disabling project quotas) on the filesystem in question will avoid
the problem. Against the possibility that that cannot be done
(because project quotas are needed for other purposes), we should
provide a way to disable use of quotas altogether via a feature
gate.
A report <https://blog.pythonanywhere.com/110/> notes that an
unclean shutdown on Linux kernel versions between 3.11 and 3.17 can
result in a prolonged downtime while quota information is restored.
Unfortunately, [the link referenced
here](http://oss.sgi.com/pipermail/xfs/2015-March/040879.html) is no
longer available.
* Bugs in the quota code could result in a variety of regression
behavior. For example, if a quota is incorrectly applied it could
result in ability to write no data at all to the volume. This could
be mitigated by use of non-enforcing quotas. XFS in particular
offers the pqnoenforce mount option that makes all quotas
non-enforcing.
We should offer two feature gates, one to enable quotas at all (on
by default) and one to enable enforcing quotas (initially off, but
with intention of enabling in the near future).
## Graduation Criteria
How will we know that this has succeeded? Gathering user feedback is
crucial for building high quality experiences and SIGs have the
important responsibility of setting milestones for stability and
completeness. Hopefully the content previously contained in [umbrella
issues][] will be tracked in the `Graduation Criteria` section.
[umbrella issues]: https://github.com/kubernetes/kubernetes/issues/42752
## Implementation History
Major milestones in the life cycle of a KEP should be tracked in
`Implementation History`. Major milestones might include
- the `Summary` and `Motivation` sections being merged signaling SIG
acceptance
- the `Proposal` section being merged signaling agreement on a
proposed design
- the date implementation started
- the first Kubernetes release where an initial version of the KEP was
available
- the version of Kubernetes where the KEP graduated to general
availability
- when the KEP was retired or superseded
## Drawbacks [optional]
Why should this KEP _not_ be implemented.
* Use of quotas, particularly the less commonly used project quotas,
requires additional action on the part of the administrator. In
particular:
* ext4fs filesystems must be created with additional options that
are not enabled by default:
```
mkfs.ext4 -O quota,project -Q usrquota,grpquota,prjquota _device_
```
* An additional option (`prjquota`) must be applied in /etc/fstab
* If the root filesystem is to be quota-enabled, it must be set in
the grub options.
* Use of project quotas for this purpose will preclude future use
within containers.
## Alternatives [optional]
Similar to the `Drawbacks` section, the `Alternatives` section is used to highlight and record other possible approaches to delivering the value proposed by a KEP.
I have considered two classes of alternatives:
* Alternatives based on quotas, with different implementation
* Alternatives based on loop filesystems without use of quotas
### Alternative quota-based implementation
Within the basic framework of using quotas to monitor and potentially
enforce storage utilization, there are a number of possible options:
* Utilize per-volume non-enforcing quotas to monitor storage (the
first stage of this proposal).
This mostly preserves the current behavior, but with more efficient
determination of storage utilization and the possibility of building
further on it. The one change from current behavior is the ability
to detect space used by deleted files.
* Utilize per-volume enforcing quotas to monitor and enforce storage
(the second stage of this proposal).
This allows partial enforcement of storage limits. As local storage
capacity isolation works at the level of the pod, and we have no
control of user utilization of ephemeral volumes, we would have to
give each volume a quota of the full limit. For example, if a pod
had a limit of 1 MB but had four ephemeral volumes mounted, it would
be possible for storage utilization to reach (at least temporarily)
4MB before being capped.
* Utilize per-pod enforcing user or group quotas to enforce storage
consumption, and per-volume non-enforcing quotas for monitoring.
This would offer the best of both worlds: a fully capped storage
limit combined with efficient reporting. However, it would require
each pod to run under a distinct UID or GID. This may prevent pods
from using setuid or setgid or their variants, and would interfere
with any other use of group or user quotas within Kubernetes.
* Utilize per-pod enforcing quotas to monitor and enforce storage.
This allows for full enforcement of storage limits, at the expense
of being able to efficiently monitor per-volume storage
consumption. As there have already been reports of monitoring
causing trouble, I do not advise this option.
A variant of this would report (1/N) storage for each covered
volume, so with a pod with a 4MiB quota and 1MiB total consumption,
spread across 4 ephemeral volumes, each volume would report a
consumption of 256 KiB. Another variant would change the API to
report statistics for all ephemeral volumes combined. I do not
advise this option.
### Alternative loop filesystem-based implementation
Another way of isolating storage is to utilize filesystems of
pre-determined size, using the loop filesystem facility within Linux.
It is possible to create a file and run mkfs(8) on it, and then to
mount that filesystem on the desired directory. This both limits the
storage available within that directory and enables quick retrieval of
it via statfs(2).
Cleanup of such a filesystem involves unmounting it and removing the
backing file.
The backing file can be created as a sparse file, and the `discard`
option can be used to return unused space to the system, allowing for
thin provisioning.
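For comparison, a sketch of how such a loop-backed volume might be
provisioned and measured, assuming the `mkfs.ext4` and `mount`
binaries are available on the node; paths, sizes, and function names
are illustrative.

```go
package loopfs

import (
	"fmt"
	"os"
	"os/exec"

	"golang.org/x/sys/unix"
)

// setupLoopVolume creates a sparse backing file of the given size, puts an
// ext4 filesystem on it, and loop-mounts it (with discard, so freed blocks
// are returned to the backing file) at dir.
func setupLoopVolume(backingFile, dir string, sizeBytes int64) error {
	f, err := os.Create(backingFile)
	if err != nil {
		return err
	}
	if err := f.Truncate(sizeBytes); err != nil { // sparse: no blocks allocated yet
		f.Close()
		return err
	}
	f.Close()
	if out, err := exec.Command("mkfs.ext4", "-q", backingFile).CombinedOutput(); err != nil {
		return fmt.Errorf("mkfs.ext4: %v: %s", err, out)
	}
	if out, err := exec.Command("mount", "-o", "loop,discard", backingFile, dir).CombinedOutput(); err != nil {
		return fmt.Errorf("mount: %v: %s", err, out)
	}
	return nil
}

// usedBytes retrieves consumption cheaply with statfs(2) instead of walking
// the directory tree.
func usedBytes(dir string) (uint64, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(dir, &st); err != nil {
		return 0, err
	}
	return (st.Blocks - st.Bfree) * uint64(st.Bsize), nil
}
```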
I conducted preliminary investigations into this. While at first it
appeared promising, it turned out to have multiple critical flaws:
* If the filesystem is mounted without `discard`, it can grow to the
full size of the backing file, negating any possibility of thin
provisioning. If the file is created dense in the first place,
there is never any possibility of thin provisioning without use of
`discard`.
If the backing file is created densely, it additionally may require
significant time to create if the ephemeral limit is large.
* If the filesystem is mounted `nosync`, and is sparse, it is possible
for writes to succeed and then fail later with I/O errors when
synced to the backing storage. This will lead to data corruption
that cannot be detected at the time of write.
This can easily be reproduced by e. g. creating a 64MB filesystem
and within it creating a 128MB sparse file and building a filesystem
on it. When that filesystem is in turn mounted, writes to it will
succeed, but I/O errors will be seen in the log and the file will be
incomplete:
```
# mkdir /var/tmp/d1 /var/tmp/d2
# dd if=/dev/zero of=/var/tmp/fs1 bs=4096 count=1 seek=16383
# mkfs.ext4 /var/tmp/fs1
# mount -o nosync -t ext4 /var/tmp/fs1 /var/tmp/d1
# dd if=/dev/zero of=/var/tmp/d1/fs2 bs=4096 count=1 seek=32767
# mkfs.ext4 /var/tmp/d1/fs2
# mount -o nosync -t ext4 /var/tmp/d1/fs2 /var/tmp/d2
# dd if=/dev/zero of=/var/tmp/d2/test bs=4096 count=24576
_...will normally succeed..._
# sync
_...fails with I/O error!..._
```
* If the filesystem is mounted `sync`, all writes to it are
immediately committed to the backing store, and the _dd_ operation
above fails as soon as it fills up _/var/tmp/d1_. However,
performance is drastically slowed, particularly with small writes;
with 1K writes, I observed performance degradation in some cases
exceeding three orders of magnitude.
I performed a test comparing writing 64 MB to a base (partitioned)
filesystem, to a loop filesystem without _sync_, and to a loop
filesystem with _sync_. Total I/O was sufficient to run for at least
5 seconds in each case. All filesystems involved were XFS. Loop
filesystems were 128 MB and dense. Times are in seconds. The
erratic behavior (e. g. the 65536 case) was observed repeatedly,
although the exact amount of time and which I/O sizes were affected
varied. The underlying device was an HP EX920 1TB NVMe SSD.
| I/O Size | Partition | Loop w/o sync | Loop w/sync |
| ---: | ---: | ---: | ---: |
| 1024 | 0.104 | 0.120 | 140.390 |
| 4096 | 0.045 | 0.077 | 21.850 |
| 16384 | 0.045 | 0.067 | 5.550 |
| 65536 | 0.044 | 0.061 | 20.440 |
| 262144 | 0.043 | 0.087 | 0.545 |
| 1048576 | 0.043 | 0.055 | 7.490 |
| 4194304 | 0.043 | 0.053 | 0.587 |
The only potentially viable combination in my view would be a dense
loop filesystem without sync, but that would render any thin
provisioning impossible.
## Infrastructure Needed [optional]
Use this section if you need things from the project/SIG.
Examples include a new subproject, repos requested, github details.
Listing these here allows a SIG to get the process for these resources started right away.
* Decision: who is responsible for quota management of all volume
types (and especially ephemeral volumes of all types). At present,
emptydir volumes are managed by the kubelet and logdirs and writable
layers by either the kubelet or the runtime, depending upon the
choice of runtime. Beyond the specific proposal that the runtime
should manage quotas for volumes it creates, there are broader
issues that I request assistance from the SIG in addressing.
* Location of the quota code. If the quotas for different volume
types are to be managed by different components, each such component
needs access to the quota code. The code is substantial and should
not be copied; it would more appropriately be vendored.