**Note: this is a design doc, which describes features that have not been
completely implemented. User documentation of the current state is
[here](../user-guide/compute-resources.md). The tracking issue for
implementation of this model is [#168](http://issue.k8s.io/168). Currently, both
limits and requests of memory and cpu on containers (not pods) are supported.
"memory" is in bytes and "cpu" is in milli-cores.**

# The Kubernetes resource model

To place pods well, Kubernetes needs to know how big pods are, as well as
the sizes of the nodes onto which they are being placed. The definition of "how
big" is given by the Kubernetes resource model, which is the subject of this
document.

The resource model aims to be:
* simple, for common cases;
* extensible, to accommodate future growth;
* regular, with few special cases; and
* precise, to avoid misunderstandings and promote pod portability.

## The resource model

A Kubernetes _resource_ is something that can be requested by, allocated to, or
consumed by a pod or container. Examples include memory (RAM), CPU, disk-time,
and network bandwidth.

Once resources on a node have been allocated to one pod, they should not be
allocated to another until that pod is removed or exits. This means that
Kubernetes schedulers should ensure that the sum of the resources allocated
(requested and granted) to a node's pods never exceeds the usable capacity of
that node. Testing whether a pod will fit on a node is called _feasibility checking_.

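The feasibility check described above can be sketched as follows. This is a minimal illustration, not the scheduler's actual code: the function name `fits`, the dictionary layout, and the use of integer internal units (milli-CPUs and bytes) are assumptions made for the example.

```python
from typing import Dict, List

# Hypothetical representation: typename -> quantity in integer internal units.
Resources = Dict[str, int]


def fits(pod_request: Resources,
         allocated: List[Resources],
         capacity: Resources) -> bool:
    """Return True if pod_request fits next to the already-allocated requests."""
    for typename, wanted in pod_request.items():
        in_use = sum(r.get(typename, 0) for r in allocated)
        # No over-commitment: the total must never exceed the node's capacity.
        if in_use + wanted > capacity.get(typename, 0):
            return False
    return True


node_capacity = {"cpu": 12000, "memory": 128 * 2**30}   # 12 CPUs, 128Gi
running = [
    {"cpu": 4000, "memory": 64 * 2**30},
    {"cpu": 6000, "memory": 32 * 2**30},
]

small_fits = fits({"cpu": 2000, "memory": 16 * 2**30}, running, node_capacity)
large_fits = fits({"cpu": 2500, "memory": 16 * 2**30}, running, node_capacity)
```

Here the first pod exactly fills the remaining CPU and fits, while the second would push the CPU total past the node's capacity and is rejected.
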
Note that the resource model currently prohibits over-committing resources; we
will want to relax that restriction later.

### Resource types

All resources have a _type_ that is identified by their _typename_ (a string,
e.g., "memory"). Several resource types are predefined by Kubernetes (a full
list is below), although only two will be supported at first: CPU and memory.
Users and system administrators can define their own resource types if they wish
(e.g., Hadoop slots).

A fully-qualified resource typename is constructed from a DNS-style _subdomain_,
followed by a slash `/`, followed by a name.
* The subdomain must conform to [RFC 1123](http://www.ietf.org/rfc/rfc1123.txt)
  (e.g., `kubernetes.io`, `example.com`).
* The name must be no more than 63 characters, consisting of upper- or
  lower-case alphanumeric characters, with the `-`, `_`, and `.` characters
  allowed anywhere except the first or last character.
* As a shorthand, any resource typename that does not start with a subdomain and
  a slash will automatically be prefixed with the built-in Kubernetes _namespace_,
  `kubernetes.io/`, to fully qualify it. This namespace is reserved for
  code in the open source Kubernetes repository; as a result, all user typenames
  MUST be fully qualified, and cannot be created in this namespace.

Some example typenames include `memory` (which will be fully-qualified as
`kubernetes.io/memory`), and `example.com/Shiny_New-Resource.Type`.

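The naming rules above can be sketched as a small validator. This is only an illustration: `qualify` and `DEFAULT_NAMESPACE` are hypothetical names, and the subdomain check is a simplified reading of RFC 1123 (lowercase labels of at most 63 characters, at most 253 characters overall). It does not enforce the rule that users may not create types in the reserved namespace.

```python
import re

DEFAULT_NAMESPACE = "kubernetes.io"  # built-in namespace used for the shorthand

# Name: 1-63 alphanumerics, with '-', '_', '.' allowed except first and last.
_NAME_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9._-]{0,61}[A-Za-z0-9])?")

# Subdomain: dot-separated labels (simplified RFC 1123 check, an assumption).
_LABEL = r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?"
_SUBDOMAIN_RE = re.compile(rf"{_LABEL}(\.{_LABEL})*")


def qualify(typename: str) -> str:
    """Return the fully-qualified form of typename, or raise ValueError."""
    if "/" in typename:
        subdomain, name = typename.split("/", 1)
    else:
        # Shorthand: unqualified names live in the built-in namespace.
        subdomain, name = DEFAULT_NAMESPACE, typename
    if len(subdomain) > 253 or not _SUBDOMAIN_RE.fullmatch(subdomain):
        raise ValueError(f"invalid subdomain: {subdomain!r}")
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid resource name: {name!r}")
    return f"{subdomain}/{name}"
```

With this sketch, `qualify("memory")` yields `kubernetes.io/memory`, and `example.com/Shiny_New-Resource.Type` is returned unchanged.
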
For future reference, note that some resources, such as CPU and network
bandwidth, are _compressible_, which means that their usage can potentially be
throttled in a relatively benign manner. All other resources are
_incompressible_, which means that any attempt to throttle them is likely to
cause grief. This distinction will be important if a Kubernetes implementation
supports over-committing of resources.

### Resource quantities

Initially, all Kubernetes resource types are _quantitative_, and have an
associated _unit_ for quantities of the associated resource (e.g., bytes for
memory, bytes per second for bandwidth, instances for software licences). The
units will always be a resource type's natural base units (e.g., bytes, not MB),
to avoid confusion between binary and decimal multipliers and the underlying
unit multiplier (e.g., is memory measured in MiB, MB, or GB?).

Resource quantities can be added and subtracted: for example, a node has a fixed
quantity of each resource type that can be allocated to pods/containers; once
such an allocation has been made, the allocated resources cannot be made
available to other pods/containers without over-committing the resources.

To make life easier for people, quantities can be represented externally as
unadorned integers, or as fixed-point integers with one of these SI suffixes
(E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi,
Ki). For example, the following represent roughly the same value: 128974848,
"129e6", "129M", "123Mi". Small quantities can be represented directly as
decimals (e.g., 0.3), or using milli-units (e.g., "300m").
* "Externally" means in user interfaces, reports, graphs, and in JSON or YAML
  resource specifications that might be generated or read by people.
* Case is significant: "m" and "M" are not the same, so "k" is not a valid SI
  suffix. There are no power-of-two equivalents for SI suffixes that represent
  multipliers less than 1.
* These conventions only apply to resource quantities, not arbitrary values.

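The suffix rules above can be sketched as a small parser. This is an illustration under stated assumptions rather than the Kubernetes implementation: the name `parse_quantity`, the use of exact rationals, and the accepted number syntax (optional sign, optional fraction, optional lowercase `e` exponent) are all choices made for the example.

```python
import re
from fractions import Fraction

# Multipliers for the suffixes listed in the text; case is significant.
SUFFIXES = {
    "m": Fraction(1, 1000),
    "K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}

# Assumed number syntax: sign, digits, optional fraction, optional exponent.
_NUMBER = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?")


def parse_quantity(text: str) -> Fraction:
    """Parse an external quantity string into an exact rational value."""
    number, multiplier = text, 1
    # Check two-character suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            number, multiplier = text[: -len(suffix)], SUFFIXES[suffix]
            break
    if not _NUMBER.fullmatch(number):
        raise ValueError(f"invalid quantity: {text!r}")
    return Fraction(number) * multiplier
```

Under these assumptions, "123Mi" parses to exactly 128974848, "129M" and "129e6" both parse to 129000000, and "300m" equals "0.3". A lowercase "129k" is rejected because "k" is not in the suffix table.
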
Internally (i.e., everywhere else), Kubernetes will represent resource
quantities as integers so it can avoid problems with rounding errors, and will
not use strings to represent numeric values. To achieve this, quantities that
naturally have fractional parts (e.g., CPU seconds/second) will be scaled to
integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in.
Internal APIs, data structures, and protobufs will use these scaled integer
units. Raw measurement data such as usage may still need to be tracked and
calculated using floating point values, but internally they should be rescaled
to avoid some values being in milli-units and some not.
* Note that reading in a resource quantity and writing it out again may change
  the way its values are represented, and truncate precision (e.g., 1.0001 may
  become 1.000), so comparison and difference operations (e.g., by an updater)
  must be done on the internal representations.
* Avoiding milli-units in external representations has advantages for people
  who will use Kubernetes, but runs the risk of developers forgetting to rescale
  or accidentally using floating-point representations. That seems like the right
  choice. We will try to reduce the risk by providing libraries that automatically
  do the quantization for JSON/YAML inputs.

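The scaling to integer milli-units can be sketched like this. The function name `to_internal` is hypothetical, and truncating toward zero is an assumption chosen to match the "1.0001 may become 1.000" example; the text does not fix a rounding rule.

```python
from decimal import Decimal, ROUND_DOWN


def to_internal(external: str, scale_exponent: int = 3) -> int:
    """Scale an external decimal string to an integer number of internal units.

    scale_exponent=3 means milli-units. Extra precision is truncated
    (an assumption; the design text only says precision may be lost).
    """
    scaled = Decimal(external) * (Decimal(10) ** scale_exponent)
    return int(scaled.to_integral_value(rounding=ROUND_DOWN))


cpu_request = to_internal("2.5")       # 2.5 CPUs -> 2500 milli-CPUs
lossy = to_internal("1.0001")          # truncated to 1000 milli-units
same_after_round_trip = to_internal("1.0001") == to_internal("1.000")
```

Because "1.0001" and "1.000" map to the same internal value, comparing the external strings would report a difference that does not exist internally, which is why comparisons belong on the internal representation.
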
### Resource specifications

Both users and a number of system components, such as schedulers, (horizontal)
auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers
need to reason about resource requirements of workloads, resource capacities of
nodes, and resource usage. Kubernetes distinguishes specifications of *desired state*,
aka the Spec, from representations of *current state*, aka the Status. Resource
requirements and total node capacity fall into the specification category, while
resource usage, characterizations derived from usage (e.g., maximum usage,
histograms), and other resource demand signals (e.g., CPU load) clearly fall
into the status category and are discussed in the Appendix for now.

Resource requirements for a container or pod should have the following form:

```yaml
resourceRequirementSpec:
  request:
    cpu: 2.5
    memory: "40Mi"
  limit:
    cpu: 4.0
    memory: "99Mi"
```

Where:
* _request_ [optional]: the amount of resources being requested, or that were
  requested and have been allocated. Scheduler algorithms will use these
  quantities to test feasibility (whether a pod will fit onto a node).
  If a container (or pod) tries to use more resources than its _request_, any
  associated SLOs are voided: for example, the program it is running may be
  throttled (compressible resource types), or the attempt may be denied. If
  _request_ is omitted for a container, it defaults to _limit_ if that is
  explicitly specified, otherwise to an implementation-defined value; this will
  always be 0 for a user-defined resource type. If _request_ is omitted for a pod,
  it defaults to the sum of the (explicit or implicit) _request_ values for the
  containers it encloses.

* _limit_ [optional]: an upper bound or cap on the maximum amount of resources
  that will be made available to a container or pod; if a container or pod uses
  more resources than its _limit_, it may be terminated. The _limit_ defaults to
  "unbounded"; in practice, this probably means the capacity of an enclosing
  container, pod, or node, but may result in non-deterministic behavior,
  especially for memory.

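The defaulting rules for _request_ can be sketched as follows, for a single resource type. The function names and the `implementation_default` parameter are hypothetical; values are assumed to already be in integer internal units.

```python
from typing import List, Optional


def default_container_request(request: Optional[int],
                              limit: Optional[int],
                              implementation_default: int = 0) -> int:
    """Apply the container-level defaulting described in the text."""
    if request is not None:
        return request
    if limit is not None:
        # An omitted request defaults to an explicitly specified limit.
        return limit
    # Otherwise implementation-defined; always 0 for user-defined types.
    return implementation_default


def default_pod_request(pod_request: Optional[int],
                        container_requests: List[int]) -> int:
    """An omitted pod request defaults to the sum over its containers."""
    if pod_request is not None:
        return pod_request
    return sum(container_requests)


containers = [
    default_container_request(request=2500, limit=4000),   # explicit request
    default_container_request(request=None, limit=4000),   # falls back to limit
    default_container_request(request=None, limit=None),   # falls back to 0
]
pod_cpu = default_pod_request(None, containers)
```
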
Total capacity for a node should have a similar structure:

```yaml
resourceCapacitySpec:
  total:
    cpu: 12
    memory: "128Gi"
```

Where:
* _total_: the total allocatable resources of a node. Initially, the resources
  at a given scope will bound the sum of the resources of its inner scopes.

#### Notes

* It is an error to specify the same resource type more than once in each
  list.

* It is an error for the _request_ or _limit_ values for a pod to be less than
  the sum of the (explicit or defaulted) values for the containers it encloses.
  (We may relax this later.)

* If multiple pods are running on the same node and attempting to use more
  resources than they have requested, the result is implementation-defined. For
  example: unallocated or unused resources might be spread equally across
  claimants, or the assignment might be weighted by the size of the original
  request, or as a function of limits, or priority, or the phase of the moon,
  perhaps modulated by the direction of the tide. Thus, although it's not
  mandatory to provide a _request_, it's probably a good idea. (Note that the
  _request_ could be filled in by an automated system that is observing actual
  usage and/or historical data.)

* Internally, the Kubernetes master can decide the defaulting behavior and the
  kubelet implementation may expect an absolute specification. For example, if
  the master decided that "the default is unbounded" it would pass 2^64 to the
  kubelet.

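The first two notes describe validation rules that can be sketched as below. The function names are hypothetical, a resource list is modeled as (typename, value) pairs so that duplicates can be expressed at all, and values are assumed to be in integer internal units after defaulting.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_duplicates(entries: List[Tuple[str, int]]) -> List[str]:
    """Return typenames that appear more than once in one resource list."""
    counts = Counter(typename for typename, _ in entries)
    return sorted(t for t, n in counts.items() if n > 1)


def undersized_types(pod: Dict[str, int],
                     containers: List[Dict[str, int]]) -> List[str]:
    """Return typenames whose pod-level value is below the container sum."""
    errors = []
    for typename, pod_value in pod.items():
        total = sum(c.get(typename, 0) for c in containers)
        if pod_value < total:
            errors.append(typename)
    return sorted(errors)


dupes = find_duplicates([("cpu", 1000), ("memory", 2**30), ("cpu", 2000)])
too_small = undersized_types({"cpu": 3000}, [{"cpu": 2000}, {"cpu": 2000}])
ok = undersized_types({"cpu": 4000}, [{"cpu": 2000}, {"cpu": 2000}])
```

The same check applies to both _request_ and _limit_; it would be run once for each.
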
## Kubernetes-defined resource types

The following resource types are predefined ("reserved") by Kubernetes in the
`kubernetes.io` namespace, and so cannot be used for user-defined resources.
Note that the syntax of all resource types in the resource spec is deliberately
similar, but some resource types (e.g., CPU) may receive significantly more
support than simply tracking quantities in the schedulers and/or the Kubelet.

### Processor cycles

* Name: `cpu` (or `kubernetes.io/cpu`)
* Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to
  a canonical "Kubernetes CPU")
* Internal representation: milli-KCUs
* Compressible? yes
* Qualities: this is a placeholder for the kind of thing that may be supported
  in the future; see [#147](http://issue.k8s.io/147)
  * [future] `schedulingLatency`: as per lmctfy
  * [future] `cpuConversionFactor`: property of a node: the speed of a CPU
    core on the node's processor divided by the speed of the canonical Kubernetes
    CPU (a floating point value; default = 1.0).

To reduce performance portability problems for pods, and to avoid worst-case
provisioning behavior, the units of CPU will be normalized to a canonical
"Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be
equivalent to a single CPU hyperthreaded core for some recent x86 processor. The
normalization may be implementation-defined, although some reasonable defaults
will be provided in the open-source Kubernetes code.

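As a rough illustration of the normalization, a node's CPU capacity in milli-KCUs could be derived from its core count and the (future) `cpuConversionFactor`. The function name and the choice to round to the nearest integer are assumptions; the text leaves the normalization implementation-defined.

```python
def node_capacity_milli_kcu(cores: int, cpu_conversion_factor: float = 1.0) -> int:
    """Convert physical cores to milli-KCUs using the node's conversion factor.

    cpu_conversion_factor is the node core speed divided by the speed of the
    canonical Kubernetes CPU. Rounding to nearest is an assumption.
    """
    return round(cores * cpu_conversion_factor * 1000)


baseline_node = node_capacity_milli_kcu(12)          # factor 1.0 by default
faster_node = node_capacity_milli_kcu(12, 1.5)       # cores 1.5x the canonical CPU
slower_node = node_capacity_milli_kcu(16, 0.5)       # cores half the canonical CPU
```

Under this sketch, a pod requesting 2 KCU (2000 milli-KCUs) consumes the same share of normalized capacity on any node, regardless of how fast that node's physical cores are.
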
Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will
be allocated; control of aspects like this will be handled by resource
_qualities_ (a future feature).

### Memory

* Name: `memory` (or `kubernetes.io/memory`)
* Units: bytes
* Compressible? no (at least initially)

The precise meaning of "memory" is implementation dependent, but the
basic idea is to rely on the underlying `memcg` mechanisms, support, and
definitions.

Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory
quantities rather than decimal ones: 64 MiB (written "64Mi") rather than 64 MB
(written "64M").

## Resource metadata

A resource type may have an associated read-only ResourceType structure that
contains metadata about the type. For example:

```yaml
resourceTypes:
  "kubernetes.io/memory":
    isCompressible: false
    # ...
  "kubernetes.io/cpu":
    isCompressible: true
    internalScaleExponent: 3
    # ...
  "kubernetes.io/disk-space":  # ...
```

Kubernetes will provide ResourceType metadata for its predefined types. If no
resource metadata can be found for a resource type, Kubernetes will assume that
it is a quantified, incompressible resource that is not specified in
milli-units, and has no default value.

The defined properties are as follows:

| field name | type | contents |
| ---------- | ---- | -------- |
| name | string, required | the typename, as a fully-qualified string (e.g., `kubernetes.io/cpu`) |
| internalScaleExponent | int, default=0 | external values are multiplied by 10 to this power for internal storage (e.g., 3 for milli-units) |
| units | string, required | format: `unit* [per unit+]` (e.g., `second`, `byte per second`). An empty unit field means "dimensionless". |
| isCompressible | bool, default=false | true if the resource type is compressible |
| defaultRequest | string, default=none | in the same format as a user-supplied value |
| _[future]_ quantization | number, default=1 | smallest granularity of allocation: requests may be rounded up to a multiple of this unit; implementation-defined unit (e.g., the page size for RAM). |

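The metadata fields and the fallback for unknown types can be sketched as a small lookup. This is illustrative only: the `ResourceType` dataclass mirrors the table above, but `REGISTRY`, `lookup`, the `units` string shown for CPU, and the empty `units` used in the fallback are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ResourceType:
    """Read-only metadata for one resource type (fields follow the table)."""
    name: str
    units: str
    internal_scale_exponent: int = 0
    is_compressible: bool = False
    default_request: Optional[str] = None


# Hypothetical registry of predefined types; the CPU units string is a guess.
REGISTRY: Dict[str, ResourceType] = {
    "kubernetes.io/cpu": ResourceType(
        name="kubernetes.io/cpu",
        units="KCU second per second",
        internal_scale_exponent=3,
        is_compressible=True,
    ),
    "kubernetes.io/memory": ResourceType(name="kubernetes.io/memory", units="byte"),
}


def lookup(typename: str) -> ResourceType:
    """Return metadata for typename, or the documented fallback assumptions."""
    known = REGISTRY.get(typename)
    if known is not None:
        return known
    # Fallback: quantified, incompressible, not in milli-units, no default.
    # Using empty units here is an assumption, not something the text states.
    return ResourceType(name=typename, units="")


unknown = lookup("example.com/hadoop-slots")
```
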
# Appendix: future extensions

The following are planned future extensions to the resource model, included here
to encourage comments.

## Usage data

Because resource usage and related metrics change continuously, need to be
tracked over time (i.e., historically), can be characterized in a variety of
ways, and are fairly voluminous, we will not include usage in core API objects,
such as [Pods](../user-guide/pods.md) and Nodes, but will provide separate APIs
for accessing and managing that data. Possible representations of usage data
are sketched below, but the representation we'll use is TBD.

Singleton values for observed and predicted future usage will rapidly prove
inadequate, so we will support the following structure for extended usage
information:

```yaml
resourceStatus:
  usage: { cpu: <CPU-info>, memory: <memory-info> }
  maxusage: { cpu: <CPU-info>, memory: <memory-info> }
  predicted: { cpu: <CPU-info>, memory: <memory-info> }
```

where a `<CPU-info>` or `<memory-info>` structure looks like this:

```yaml
mean: <value>       # arithmetic mean
max: <value>        # maximum value
min: <value>        # minimum value
count: <value>      # number of data points
percentiles:        # map from %iles to values
  "10": <10th-percentile-value>
  "50": <median-value>
  "99": <99th-percentile-value>
  "99.9": <99.9th-percentile-value>
  # ...
```

All parts of this structure are optional, although we strongly encourage
including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles.
_[In practice, it will be important to include additional info such as the
length of the time window over which the averages are calculated, the
confidence level, and information-quality metrics such as the number of dropped
or discarded data points.]_

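A summary in the shape of the `<CPU-info>` structure above could be computed from raw samples roughly as follows. The function name `summarize` is hypothetical, and the nearest-rank percentile method is an assumption, since the text does not say how percentiles should be calculated.

```python
import math
from typing import Dict, Sequence, Union

Number = Union[int, float]


def summarize(samples: Sequence[Number],
              percentiles: Sequence[Number] = (50, 90, 95, 99, 99.5, 99.9)
              ) -> Dict[str, object]:
    """Build a usage summary with mean, max, min, count, and percentiles."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    count = len(ordered)

    def nearest_rank(p: Number) -> Number:
        # Nearest-rank method: smallest value with at least p% of data at or below it.
        rank = max(1, math.ceil(p * count / 100))
        return ordered[rank - 1]

    return {
        "mean": sum(ordered) / count,
        "max": ordered[-1],
        "min": ordered[0],
        "count": count,
        "percentiles": {str(p): nearest_rank(p) for p in percentiles},
    }


cpu_info = summarize(list(range(1, 101)))  # 100 samples: 1, 2, ..., 100
```

The default percentile set is the one the text encourages. A real implementation would also need to record the time window and data-quality information mentioned above.
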
## Future resource types

### _[future] Network bandwidth_

* Name: `network-bandwidth` (or `kubernetes.io/network-bandwidth`)
* Units: bytes per second
* Compressible? yes

### _[future] Network operations_

* Name: `network-iops` (or `kubernetes.io/network-iops`)
* Units: operations (messages) per second
* Compressible? yes

### _[future] Storage space_

* Name: `storage-space` (or `kubernetes.io/storage-space`)
* Units: bytes
* Compressible? no

The amount of secondary storage space available to a container. The main target
is local disk drives and SSDs, although this could also be used to qualify
remotely-mounted volumes. Specifying whether a resource is a raw disk, an SSD, a
disk array, or a file system fronting any of these, is left for future work.

### _[future] Storage time_

* Name: `storage-time` (or `kubernetes.io/storage-time`)
* Units: seconds per second of disk time
* Internal representation: milli-units
* Compressible? yes

This is the amount of time a container spends accessing disk, including actuator
and transfer time. A standard disk drive provides 1.0 diskTime seconds per
second.

### _[future] Storage operations_

* Name: `storage-iops` (or `kubernetes.io/storage-iops`)
* Units: operations per second
* Compressible? yes
