Addressing Brian's comments
|
|
@@ -1,3 +1,5 @@
|
|||
# OBSOLETE
|
||||
|
||||
# Templates+Parameterization: Repeatedly instantiating user-customized application topologies.
|
||||
|
||||
## Motivation
|
||||
|
|
@@ -1,43 +0,0 @@
|
|||
# Containerized Mounter with Chroot for Container-Optimized OS
|
||||
|
||||
## Goal
|
||||
|
||||
Due to security and management overhead, the new Container-Optimized OS used by
GKE does not ship certain storage drivers and tools, such as those needed for
nfs and glusterfs mounts. This project takes a containerized mount approach
that packages the mount binaries into a container. The volume plugin executes
the mount inside the container and shares the mount with the host.
|
||||
|
||||
|
||||
## Design
|
||||
|
||||
1. A Docker image with the storage tools (nfs and glusterfs) pre-installed is
   built and uploaded to GCS.
2. During GKE cluster configuration, the Docker image is pulled and installed
   on the cluster node.
3. When an nfs or glusterfs mount is invoked by the kubelet, the kubelet runs
   the mount command inside a container created from the pre-installed Docker
   image, with mount propagation set to "shared". In this way, the mount
   created inside the container is visible to the host node as well.
4. As a special case for NFSv3, an rpcbind process is started before the mount
   command is run.
|
||||
|
||||
## Implementation details
|
||||
|
||||
* In the first version of the containerized mounter, we used rkt fly to
  dynamically start a container during mount. When the mount command finishes,
  the container exits normally and is garbage-collected. In the glusterfs case,
  however, a gluster daemon keeps running after the mount command finishes and
  until the glusterfs volume is unmounted, so the container started for the
  mount continues to run until the glusterfs client finishes. Such a container
  cannot be garbage-collected right away, and multiple containers might be
  running at the same time. Because of shared mount propagation, as more
  containers run, the number of mounts increases significantly and might cause
  a kernel panic. To solve this problem, a chroot approach was proposed and
  implemented.
* In the second version, instead of running a container on the host, the Docker
  container's filesystem is exported as a tar archive and pre-installed on the
  host. The kubelet directory is a shared mount between the host and the
  container's rootfs. When a gluster/nfs mount is issued, a mounter script uses
  chroot to change into the container's rootfs and run the mount, as sketched
  below. This approach is much cleaner, since there is no container lifecycle
  to manage and it avoids having a large number of mounts.
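A rough sketch of the chroot-based flow, assuming the mounter rootfs has been
unpacked to an illustrative host path and the target is an nfs volume (paths,
server, and volume names are placeholders, not values defined by this design):

```
# Illustrative location where the mounter image's rootfs was unpacked on host.
MOUNTER_ROOTFS=/home/kubernetes/containerized_mounter/rootfs

# The kubelet directory is a shared mount, so a mount performed under chroot
# inside the rootfs is also visible to the host.
chroot "${MOUNTER_ROOTFS}" \
    mount -t nfs <server>:/export \
    /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume-name>
```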
|
||||
|
|
@@ -1,240 +1,244 @@
|
|||
Uncategorized (Please Help)
|
||||
high-availability.md
|
||||
control-plane-resilience.md
|
||||
downward_api_resources_limits_requests.md
|
||||
seccomp.md
|
||||
client-package-structure.md
|
||||
service-discovery.md
|
||||
metadata-policy.md
|
||||
containerized-mounter.md~
|
||||
identifiers.md
|
||||
local-cluster-ux.md
|
||||
pod-pid-namespace.md
|
||||
grow-volume-size.md
|
||||
image-provenance.md
|
||||
core-metrics-pipeline.md
|
||||
versioning.md
|
||||
ha_master.md
|
||||
secret-configmap-downwarapi-file-mode.md
|
||||
protobuf.md
|
||||
flakiness-sla.md
|
||||
resources.md
|
||||
initial-resources.md
|
||||
Uncategorized
|
||||
admission_control_event_rate_limit.md
|
||||
create_sheet.py
|
||||
runtime-client-server.md
|
||||
OWNERS
|
||||
namespaces.md
|
||||
cpu-manager.md
|
||||
selinux-enhancements.md
|
||||
sysctl.md
|
||||
create_sheet.py~
|
||||
design_proposal_template.md
|
||||
dir_struct.txt
|
||||
selinux.md
|
||||
templates.md
|
||||
pod-cache.png
|
||||
README.md
|
||||
multi-platform.md
|
||||
pod-lifecycle-event-generator.md
|
||||
secrets.md
|
||||
cri-dockershim-checkpoint.md
|
||||
event_compression.md
|
||||
multi-platform.md
|
||||
owners
|
||||
pleg.png
|
||||
readme.md
|
||||
runtime-client-server.md
|
||||
templates.md~
|
||||
./sig-cli
|
||||
get-describe-apiserver-extensions.md
|
||||
kubectl-create-from-env-file.md
|
||||
kubectl-extension.md
|
||||
kubectl-login.md
|
||||
kubectl_apply_getsetdiff_last_applied_config.md
|
||||
multi-fields-merge-key.md
|
||||
template.md
|
||||
expansion.md
|
||||
kubectl-login.md
|
||||
simple-rolling-update.md
|
||||
OWNERS
|
||||
get-describe-apiserver-extensions.md
|
||||
owners
|
||||
preserve-order-in-strategic-merge-patch.md
|
||||
kubectl-create-from-env-file.md
|
||||
simple-rolling-update.md
|
||||
./network
|
||||
flannel-integration.md
|
||||
service-external-name.md
|
||||
networking.md
|
||||
command_execution_port_forwarding.md
|
||||
network-policy.md
|
||||
external-lb-source-ip-preservation.md
|
||||
flannel-integration.md
|
||||
network-policy.md
|
||||
networking.md
|
||||
selinux-enhancements.md
|
||||
service-discovery.md
|
||||
service-external-name.md
|
||||
./resource-management
|
||||
admission_control_limit_range.md
|
||||
admission_control_resource_quota.md
|
||||
device-plugin-overview.png
|
||||
device-plugin.md
|
||||
device-plugin.png
|
||||
gpu-support.md
|
||||
device-plugin-overview.png
|
||||
hugepages.md
|
||||
resource-quota-scoping.md
|
||||
./testing
|
||||
flakiness-sla.md
|
||||
./autoscaling
|
||||
hpa-v2.md
|
||||
hpa-status-conditions.md
|
||||
horizontal-pod-autoscaler.md
|
||||
hpa-status-conditions.md
|
||||
hpa-v2.md
|
||||
initial-resources.md
|
||||
./architecture
|
||||
architecture.md
|
||||
architecture.dia
|
||||
architecture.png
|
||||
architecture.svg
|
||||
./api-machinery
|
||||
admission_control_extension.md
|
||||
csi-client-structure-proposal.md
|
||||
selector-generation.md
|
||||
pod-safety.md
|
||||
container-init.md
|
||||
resource-quota-scoping.md
|
||||
thirdpartyresources.md
|
||||
aggregated-api-servers.md
|
||||
extending-api.md
|
||||
envvar-configmap.md
|
||||
dynamic-admission-control-configuration.md
|
||||
api-chunking.md
|
||||
garbage-collection.md
|
||||
customresources-validation.md
|
||||
auditing.md
|
||||
apiserver-watch.md
|
||||
admission_control_limit_range.md
|
||||
apiserver-build-in-admission-plugins.md
|
||||
synchronous-garbage-collection.md
|
||||
configmap.md
|
||||
csi-new-client-library-procedure.md
|
||||
pod-preset.md
|
||||
add-new-patchStrategy-to-clear-fields-not-present-in-patch.md
|
||||
api-group.md
|
||||
identifiers.md
|
||||
namespaces.md
|
||||
principles.md
|
||||
./api-machinery
|
||||
add-new-patchstrategy-to-clear-fields-not-present-in-patch.md
|
||||
admission_control.md
|
||||
optional-configmap.md
|
||||
server-get.md
|
||||
admission_control_extension.md
|
||||
aggregated-api-servers.md
|
||||
api-chunking.md
|
||||
api-group.md
|
||||
apiserver-build-in-admission-plugins.md
|
||||
apiserver-count-fix.md
|
||||
admission_control_resource_quota.md
|
||||
apiserver-watch.md
|
||||
auditing.md
|
||||
bulk_watch.md
|
||||
client-package-structure.md
|
||||
controller-ref.md
|
||||
csi-client-structure-proposal.md
|
||||
csi-new-client-library-procedure.md
|
||||
customresources-validation.md
|
||||
dynamic-admission-control-configuration.md
|
||||
extending-api.md
|
||||
garbage-collection.md
|
||||
metadata-policy.md
|
||||
protobuf.md
|
||||
server-get.md
|
||||
synchronous-garbage-collection.md
|
||||
thirdpartyresources.md
|
||||
./node
|
||||
pod-resource-management.md
|
||||
kubelet-tls-bootstrap.md
|
||||
dynamic-kubelet-configuration.md
|
||||
kubelet-hypercontainer-runtime.md
|
||||
all-in-one-volume.md
|
||||
annotations-downward-api.md
|
||||
configmap.md
|
||||
container-init.md
|
||||
container-runtime-interface-v1.md
|
||||
kubelet-authorizer.md
|
||||
cpu-manager.md
|
||||
cri-dockershim-checkpoint.md
|
||||
disk-accounting.md
|
||||
kubelet-systemd.md
|
||||
kubelet-cri-logging.md
|
||||
downward_api_resources_limits_requests.md
|
||||
dynamic-kubelet-configuration.md
|
||||
envvar-configmap.md
|
||||
expansion.md
|
||||
kubelet-auth.md
|
||||
runtimeconfig.md
|
||||
kubelet-authorizer.md
|
||||
kubelet-cri-logging.md
|
||||
kubelet-eviction.md
|
||||
kubelet-hypercontainer-runtime.md
|
||||
kubelet-rkt-runtime.md
|
||||
kubelet-rootfs-distribution.md
|
||||
kubelet-systemd.md
|
||||
node-allocatable.md
|
||||
optional-configmap.md
|
||||
pod-cache.png
|
||||
pod-lifecycle-event-generator.md
|
||||
pod-pid-namespace.md
|
||||
pod-resource-management.md
|
||||
propagation.md
|
||||
resource-qos.md
|
||||
runtime-pod-cache.md
|
||||
kubelet-rootfs-distribution.md
|
||||
kubelet-rkt-runtime.md
|
||||
node-allocatable.md
|
||||
kubelet-eviction.md
|
||||
seccomp.md
|
||||
secret-configmap-downwardapi-file-mode.md
|
||||
selinux.md
|
||||
sysctl.md
|
||||
./service-catalog
|
||||
pod-preset.md
|
||||
./instrumentation
|
||||
core-metrics-pipeline.md
|
||||
custom-metrics-api.md
|
||||
metrics-server.md
|
||||
monitoring_architecture.md
|
||||
monitoring_architecture.png
|
||||
custom-metrics-api.md
|
||||
resource-metrics-api.md
|
||||
performance-related-monitoring.md
|
||||
metrics-server.md
|
||||
resource-metrics-api.md
|
||||
volume_stats_pvc_ref.md
|
||||
./auth
|
||||
security_context.md
|
||||
no-new-privs.md
|
||||
access.md
|
||||
enhance-pluggable-policy.md
|
||||
apparmor.md
|
||||
security-context-constraints.md
|
||||
enhance-pluggable-policy.md
|
||||
image-provenance.md
|
||||
no-new-privs.md
|
||||
pod-security-context.md
|
||||
bulk_watch.md
|
||||
secrets.md
|
||||
security-context-constraints.md
|
||||
security.md
|
||||
security_context.md
|
||||
service_accounts.md
|
||||
./federation
|
||||
federated-replicasets.md
|
||||
ubernetes-design.png
|
||||
ubernetes-cluster-state.png
|
||||
federation-phase-1.md
|
||||
federation-clusterselector.md
|
||||
ubernetes-scheduling.png
|
||||
federation-lite.md
|
||||
federation.md
|
||||
federated-services.md
|
||||
federation-high-level-arch.png
|
||||
control-plane-resilience.md
|
||||
federated-api-servers.md
|
||||
federated-placement-policy.md
|
||||
federated-ingress.md
|
||||
federated-placement-policy.md
|
||||
federated-replicasets.md
|
||||
federated-services.md
|
||||
federation-clusterselector.md
|
||||
federation-high-level-arch.png
|
||||
federation-lite.md
|
||||
federation-phase-1.md
|
||||
federation.md
|
||||
ubernetes-cluster-state.png
|
||||
ubernetes-design.png
|
||||
ubernetes-scheduling.png
|
||||
./scalability
|
||||
Kubemark_architecture.png
|
||||
scalability-testing.md
|
||||
kubemark.md
|
||||
kubemark_architecture.png
|
||||
scalability-testing.md
|
||||
./cluster-lifecycle
|
||||
self-hosted-layers.png
|
||||
self-hosted-kubernetes.md
|
||||
dramatically-simplify-cluster-creation.md
|
||||
bootstrap-discovery.md
|
||||
cluster-deployment.md
|
||||
self-hosted-kubelet.md
|
||||
clustering.md
|
||||
dramatically-simplify-cluster-creation.md
|
||||
ha_master.md
|
||||
high-availability.md
|
||||
kubelet-tls-bootstrap.md
|
||||
local-cluster-ux.md
|
||||
runtimeconfig.md
|
||||
self-hosted-final-cluster.png
|
||||
self-hosted-kubelet.md
|
||||
self-hosted-kubernetes.md
|
||||
self-hosted-layers.png
|
||||
self-hosted-moving-parts.png
|
||||
./cluster-lifecycle/clustering
|
||||
static.png
|
||||
.gitignore
|
||||
Dockerfile
|
||||
static.seqdiag
|
||||
dynamic.seqdiag
|
||||
OWNERS
|
||||
Makefile
|
||||
README.md
|
||||
dockerfile
|
||||
dynamic.png
|
||||
dynamic.seqdiag
|
||||
makefile
|
||||
owners
|
||||
readme.md
|
||||
static.png
|
||||
static.seqdiag
|
||||
./release
|
||||
release-notes.md
|
||||
release-test-signal.md
|
||||
versioning.md
|
||||
./scheduling
|
||||
rescheduling.md
|
||||
rescheduler.md
|
||||
nodeaffinity.md
|
||||
podaffinity.md
|
||||
hugepages.md
|
||||
taint-toleration-dedicated.md
|
||||
multiple-schedulers.md
|
||||
nodeaffinity.md
|
||||
pod-preemption.md
|
||||
pod-priority-api.md
|
||||
taint-node-by-condition.md
|
||||
scheduler_extender.md
|
||||
podaffinity.md
|
||||
rescheduler.md
|
||||
rescheduling-for-critical-pods.md
|
||||
multiple-schedulers.md
|
||||
rescheduling.md
|
||||
resources.md
|
||||
scheduler_extender.md
|
||||
taint-node-by-condition.md
|
||||
taint-toleration-dedicated.md
|
||||
./scheduling/images
|
||||
.gitignore
|
||||
owners
|
||||
preemption_1.png
|
||||
preemption_2.png
|
||||
preemption_3.png
|
||||
preemption_4.png
|
||||
./apps
|
||||
daemonset-update.md
|
||||
cronjob.md
|
||||
annotations-downward-api.md
|
||||
controller-ref.md
|
||||
statefulset-update.md
|
||||
stateful-apps.md
|
||||
deploy.md
|
||||
daemon.md
|
||||
controller_history.md
|
||||
job.md
|
||||
indexed-job.md
|
||||
cronjob.md
|
||||
daemon.md
|
||||
daemonset-update.md
|
||||
deploy.md
|
||||
deployment.md
|
||||
indexed-job.md
|
||||
job.md
|
||||
obsolete_templates.md
|
||||
selector-generation.md
|
||||
stateful-apps.md
|
||||
statefulset-update.md
|
||||
./storage
|
||||
flex-volumes-drivers-psp.md
|
||||
local-storage-overview.md
|
||||
all-in-one-volume.md
|
||||
volume-selectors.md
|
||||
persistent-storage.md
|
||||
volume-metrics.md
|
||||
flexvolume-deployment.md
|
||||
volume-snapshotting.png
|
||||
volume-provisioning.md
|
||||
propagation.md
|
||||
volume-ownership-management.md
|
||||
mount-options.md
|
||||
volumes.md
|
||||
containerized-mounter.md
|
||||
default-storage-class.md
|
||||
volume-snapshotting.md
|
||||
flex-volumes-drivers-psp.md
|
||||
flexvolume-deployment.md
|
||||
grow-volume-size.md
|
||||
local-storage-overview.md
|
||||
mount-options.md
|
||||
persistent-storage.md
|
||||
pod-safety.md
|
||||
volume-hostpath-qualifiers.md
|
||||
volume-metrics.md
|
||||
volume-ownership-management.md
|
||||
volume-provisioning.md
|
||||
volume-selectors.md
|
||||
volume-snapshotting.md
|
||||
volume-snapshotting.png
|
||||
volumes.md
|
||||
./aws
|
||||
aws_under_the_hood.md
|
||||
./images
|
||||
preemption_1.png
|
||||
preemption_3.png
|
||||
.gitignore
|
||||
OWNERS
|
||||
preemption_2.png
|
||||
preemption_4.png
|
||||
./gcp
|
||||
gce-l4-loadbalancer-healthcheck.md
|
||||
containerized-mounter.md
|
||||
./cloud-provider
|
||||
cloudprovider-storage-metrics.md
|
||||
cloud-provider-refactoring.md
|
||||
cloudprovider-storage-metrics.md
|
||||
|
|
|
|||
|
|
|
@@ -0,0 +1,308 @@
|
|||
# HugePages support in Kubernetes
|
||||
|
||||
**Authors**
|
||||
* Derek Carr (@derekwaynecarr)
|
||||
* Seth Jennings (@sjenning)
|
||||
* Piotr Prokop (@PiotrProkop)
|
||||
|
||||
**Status**: In progress
|
||||
|
||||
## Abstract
|
||||
|
||||
A proposal to enable applications running in a Kubernetes cluster to use huge
|
||||
pages.
|
||||
|
||||
A pod may request a number of huge pages. The `scheduler` is able to place the
|
||||
pod on a node that can satisfy that request. The `kubelet` advertises an
|
||||
allocatable number of huge pages to support scheduling decisions. A pod may
|
||||
consume hugepages via `hugetlbfs` or `shmget`. Huge pages are not
|
||||
overcommitted.
|
||||
|
||||
## Motivation
|
||||
|
||||
Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi
|
||||
of memory is equal to 256 pages; 1Gi of memory is 262,144 pages, and so on. CPUs have
|
||||
a built-in memory management unit that manages a list of these pages in
|
||||
hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of
|
||||
virtual-to-physical page mappings. If the virtual address passed in a hardware
|
||||
instruction can be found in the TLB, the mapping can be determined quickly. If
|
||||
not, a TLB miss occurs, and the system falls back to slower, software based
|
||||
address translation. This results in performance issues. Since the size of the
|
||||
TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the
|
||||
page size.
|
||||
|
||||
A huge page is a memory page that is larger than 4Ki. On x86_64 architectures,
|
||||
there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other
|
||||
architectures, but the idea is the same. In order to use huge pages,
|
||||
an application must be written to be aware of them. Transparent huge pages (THP)
|
||||
attempts to automate the management of huge pages without application knowledge,
|
||||
but they have limitations. In particular, they are limited to 2Mi page sizes.
|
||||
THP might lead to performance degradation on nodes with high memory utilization
|
||||
or fragmentation due to defragmenting efforts of THP, which can lock memory
|
||||
pages. For this reason, some applications may be designed to use (or may
recommend using) pre-allocated huge pages instead of THP.
|
||||
|
||||
Managing memory is hard, and unfortunately, there is no one-size fits all
|
||||
solution for all applications.
|
||||
|
||||
## Scope
|
||||
|
||||
This proposal only includes pre-allocated huge pages configured on the node by
|
||||
the administrator at boot time or by manual dynamic allocation. It does not
|
||||
discuss how the cluster could dynamically attempt to allocate huge pages in an
|
||||
attempt to find a fit for a pod pending scheduling. It is anticipated that
|
||||
operators may use a variety of strategies to allocate huge pages, but we do not
|
||||
anticipate the kubelet itself doing the allocation. Allocation of huge pages
|
||||
ideally happens soon after boot time.
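Operators commonly perform that allocation either at boot via kernel
parameters or at runtime via sysctl; a minimal sketch is shown below (the page
count is illustrative, and this step is outside the kubelet's responsibility):

```
# Reserve 512 pages of the default huge page size (2Mi on most x86_64 nodes).
sysctl -w vm.nr_hugepages=512

# Or, more reliably, reserve them at boot via kernel command-line parameters:
#   default_hugepagesz=2M hugepagesz=2M hugepages=512
```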
|
||||
|
||||
This proposal defers issues relating to NUMA.
|
||||
|
||||
## Use Cases
|
||||
|
||||
The class of applications that benefit from huge pages typically have
|
||||
- A large memory working set
|
||||
- A sensitivity to memory access latency
|
||||
|
||||
Example applications include:
|
||||
- database management systems (MySQL, PostgreSQL, MongoDB, Oracle, etc.)
|
||||
- Java applications can back the heap with huge pages using the
|
||||
`-XX:+UseLargePages` and `-XX:LargePageSizeInBytes` options (see the example after this list).
|
||||
- packet processing systems (DPDK)
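For instance, a JVM heap can be backed by 2Mi huge pages roughly as follows
(flag values are illustrative, not part of this proposal):

```
java -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -Xms2g -Xmx2g -jar app.jar
```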
|
||||
|
||||
Applications can generally use huge pages by calling
|
||||
- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB` and use it as anonymous memory
|
||||
- `mmap()` a file backed by `hugetlbfs`
|
||||
- `shmget()` with `SHM_HUGETLB` and use it as a shared memory segment (see Known
|
||||
Issues).
|
||||
|
||||
1. A pod can use huge pages with any of the previously described methods.
|
||||
1. A pod can request huge pages.
|
||||
1. A scheduler can bind pods to nodes that have available huge pages.
|
||||
1. A quota may limit usage of huge pages.
|
||||
1. A limit range may constrain min and max huge page requests.
|
||||
|
||||
## Feature Gate
|
||||
|
||||
The proposal introduces huge pages as an Alpha feature.
|
||||
|
||||
It must be enabled via the `--feature-gates=HugePages=true` flag on pertinent
|
||||
components pending graduation to Beta.
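For example, assuming the kubelet and API server are among the pertinent
components for a given cluster (an assumption, not a definitive list), the flag
would be passed as:

```
kubelet --feature-gates=HugePages=true ...
kube-apiserver --feature-gates=HugePages=true ...
```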
|
||||
|
||||
## Node Specification
|
||||
|
||||
Huge pages cannot be overcommitted on a node.
|
||||
|
||||
A system may support multiple huge page sizes. It is assumed that most nodes
|
||||
will be configured to primarily use the default huge page size as returned via
|
||||
`grep Hugepagesize /proc/meminfo`. This defaults to 2Mi on most Linux systems
|
||||
unless overridden by `default_hugepagesz=1g` in the kernel boot parameters.
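For illustration, the default size and the set of supported sizes can be
inspected on a node as follows (sample output for a typical x86_64 machine;
actual values vary):

```
$ grep Hugepagesize /proc/meminfo
Hugepagesize:       2048 kB
$ ls /sys/kernel/mm/hugepages/
hugepages-1048576kB  hugepages-2048kB
```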
|
||||
|
||||
For each supported huge page size, the node will advertise a resource of the
|
||||
form `hugepages-<hugepagesize>`. On Linux, supported huge page sizes are
|
||||
determined by parsing the `/sys/kernel/mm/hugepages/hugepages-{size}kB`
|
||||
directory on the host. Kubernetes will expose a `hugepages-<hugepagesize>`
|
||||
resource using binary notation form. It will convert `<hugepagesize>` into the
|
||||
most compact binary notation using integer values. For example, if a node
|
||||
supports `hugepages-2048kB`, a resource `hugepages-2Mi` will be shown in node
|
||||
capacity and allocatable values. Operators may set aside pre-allocated huge
|
||||
pages that are not available for user pods similar to normal memory via the
|
||||
`--system-reserved` flag.
|
||||
|
||||
There are a variety of huge page sizes supported across different hardware
|
||||
architectures. It is preferred to have a resource per size in order to better
|
||||
support quota. For example, 1 huge page with size 2Mi is orders of magnitude
|
||||
different than 1 huge page with size 1Gi. We assume gigantic pages are even
|
||||
more precious resources than huge pages.
|
||||
|
||||
Pre-allocated huge pages reduce the amount of allocatable memory on a node. The
|
||||
node will treat pre-allocated huge pages similar to other system reservations
|
||||
and reduce the amount of `memory` it reports using the following formula:
|
||||
|
||||
```
|
||||
[Allocatable] = [Node Capacity] -
|
||||
[Kube-Reserved] -
|
||||
[System-Reserved] -
|
||||
[Pre-Allocated-HugePages * HugePageSize] -
|
||||
[Hard-Eviction-Threshold]
|
||||
```
|
||||
|
||||
The following represents a machine with 10Gi of memory. 1Gi of memory has been
|
||||
reserved as 512 pre-allocated huge pages sized 2Mi. As you can see, the
|
||||
allocatable memory has been reduced to account for the amount of huge pages
|
||||
reserved.
|
||||
|
||||
```
|
||||
apiVersion: v1
|
||||
kind: Node
|
||||
metadata:
|
||||
name: node1
|
||||
...
|
||||
status:
|
||||
capacity:
|
||||
memory: 10Gi
|
||||
hugepages-2Mi: 1Gi
|
||||
allocatable:
|
||||
memory: 9Gi
|
||||
hugepages-2Mi: 1Gi
|
||||
...
|
||||
```
|
||||
|
||||
## Pod Specification
|
||||
|
||||
A pod must make a request to consume pre-allocated huge pages using the resource
|
||||
`hugepages-<hugepagesize>` whose quantity is a positive amount of memory in
|
||||
bytes. The specified amount must align with the `<hugepagesize>`; otherwise,
|
||||
the pod will fail validation. For example, it would be valid to request
|
||||
`hugepages-2Mi: 4Mi`, but invalid to request `hugepages-2Mi: 3Mi`.
|
||||
|
||||
The request and limit for `hugepages-<hugepagesize>` must match. Similar to
|
||||
memory, an application that requests `hugepages-<hugepagesize>` resource is at
|
||||
minimum in the `Burstable` QoS class.
|
||||
|
||||
If a pod consumes huge pages via `shmget`, it must run with a supplemental group
|
||||
that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of
|
||||
this group is outside the scope of this specification.
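As an illustration of the supplemental group requirement, the sketch below
assumes the node's `/proc/sys/vm/hugetlb_shm_group` is set to gid `3000`; the
gid and image are placeholders, not values defined by this proposal.

```
apiVersion: v1
kind: Pod
metadata:
  name: shmget-example
spec:
  securityContext:
    # Assumed gid; must match /proc/sys/vm/hugetlb_shm_group on the node.
    supplementalGroups: [3000]
  containers:
  - name: app
    image: example.com/shm-app   # placeholder image
    resources:
      requests:
        hugepages-2Mi: 128Mi
      limits:
        hugepages-2Mi: 128Mi
```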
|
||||
|
||||
Initially, a pod may not consume multiple huge page sizes in a single pod spec.
|
||||
Attempting to use `hugepages-2Mi` and `hugepages-1Gi` in the same pod spec will
|
||||
fail validation. We believe it is rare for applications to attempt to use
|
||||
multiple huge page sizes. This restriction may be lifted in the future with
|
||||
community presented use cases. Introducing the feature with this restriction
|
||||
limits the exposure of API changes needed when consuming huge pages via volumes.
|
||||
|
||||
In order to consume huge pages backed by the `hugetlbfs` filesystem inside the
|
||||
specified container in the pod, it is helpful to understand the set of mount
|
||||
options used with `hugetlbfs`. For more details, see "Using Huge Pages" here:
|
||||
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
|
||||
|
||||
```
|
||||
mount -t hugetlbfs \
|
||||
-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
|
||||
min_size=<value>,nr_inodes=<value> none /mnt/huge
|
||||
```
|
||||
|
||||
The proposal recommends extending the existing `EmptyDirVolumeSource` to satisfy
|
||||
this use case. A new `medium=HugePages` option would be supported. To write
|
||||
into this volume, the pod must make a request for huge pages. The `pagesize`
|
||||
argument is inferred from the `hugepages-<hugepagesize>` from the resource
|
||||
request. If in the future, multiple huge page sizes are supported in a single
|
||||
pod spec, we may modify the `EmptyDirVolumeSource` to provide an optional page
|
||||
size. The existing `sizeLimit` option for `emptyDir` would restrict usage to
|
||||
the minimum value specified between `sizeLimit` and the sum of huge page limits
|
||||
of all containers in a pod. This keeps the behavior consistent with memory
|
||||
backed `emptyDir` volumes whose usage is ultimately constrained by the pod
|
||||
cgroup sandbox memory settings. The `min_size` option is omitted as it is not
|
||||
necessary. The `nr_inodes` mount option is omitted at this time in the same
|
||||
manner it is omitted with `medium=Memory` when using `tmpfs`.
|
||||
|
||||
The following is a sample pod that is limited to 1Gi huge pages of size 2Mi. It
|
||||
can consume those pages using `shmget()` or via `mmap()` with the specified
|
||||
volume.
|
||||
|
||||
```
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: example
|
||||
spec:
|
||||
containers:
|
||||
...
|
||||
volumeMounts:
|
||||
- mountPath: /hugepages
|
||||
name: hugepage
|
||||
resources:
|
||||
requests:
|
||||
hugepages-2Mi: 1Gi
|
||||
limits:
|
||||
hugepages-2Mi: 1Gi
|
||||
volumes:
|
||||
- name: hugepage
|
||||
emptyDir:
|
||||
medium: HugePages
|
||||
```
|
||||
|
||||
## CRI Updates
|
||||
|
||||
The `LinuxContainerResources` message should be extended to support specifying
|
||||
huge page limits per size. The specification for huge pages should align with
|
||||
opencontainers/runtime-spec.
|
||||
|
||||
see:
|
||||
https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits
|
||||
|
||||
The CRI changes are required before promoting this feature to Beta.
|
||||
|
||||
## Cgroup Enforcement
|
||||
|
||||
To use this feature, the `--cgroups-per-qos` flag must be enabled. In addition, the
|
||||
`hugetlb` cgroup must be mounted.
|
||||
|
||||
The `kubepods` cgroup is bounded by the `Allocatable` value.
|
||||
|
||||
The QoS level cgroups are left unbounded across all huge page pool sizes.
|
||||
|
||||
The pod level cgroup sandbox is configured as follows, where `hugepagesize` is
|
||||
the system supported huge page size(s). If no request is made for huge pages of
|
||||
a particular size, the limit is set to 0 for all supported types on the node.
|
||||
|
||||
```
|
||||
pod<UID>/hugetlb.<hugepagesize>.limit_in_bytes = sum(pod.spec.containers.resources.limits[hugepages-<hugepagesize>])
|
||||
```
|
||||
|
||||
If the container runtime supports specification of huge page limits, the
|
||||
container cgroup sandbox will be configured with the specified limit.
|
||||
|
||||
The `kubelet` will ensure the `hugetlb` has no usage charged to the pod level
|
||||
cgroup sandbox prior to deleting the pod to ensure all resources are reclaimed.
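As an illustration, assuming a cgroup v1 hierarchy mounted under
`/sys/fs/cgroup` and the cgroupfs driver (exact paths depend on the cgroup
driver and the pod's QoS class), the enforced values can be inspected on the
node:

```
# Huge page limit charged to a pod-level cgroup sandbox.
cat /sys/fs/cgroup/hugetlb/kubepods/pod<UID>/hugetlb.2MB.limit_in_bytes

# Current usage; the kubelet expects this to be zero before deleting the pod.
cat /sys/fs/cgroup/hugetlb/kubepods/pod<UID>/hugetlb.2MB.usage_in_bytes
```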
|
||||
|
||||
## Limits and Quota
|
||||
|
||||
The `ResourceQuota` resource will be extended to support accounting for
|
||||
`hugepages-<hugepagesize>` similar to `cpu` and `memory`. The `LimitRange`
|
||||
resource will be extended to define min and max constraints for `hugepages`
|
||||
similar to `cpu` and `memory`.
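A sketch of what these extensions might look like follows; the exact quota and
limit keys are to be settled during implementation and are shown here by
analogy with `cpu` and `memory`:

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: hugepages-quota
spec:
  hard:
    hugepages-2Mi: 4Gi        # assumed quota key
---
apiVersion: v1
kind: LimitRange
metadata:
  name: hugepages-limit-range
spec:
  limits:
  - type: Container
    min:
      hugepages-2Mi: 2Mi
    max:
      hugepages-2Mi: 1Gi
```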
|
||||
|
||||
## Scheduler changes
|
||||
|
||||
The scheduler will need to ensure any huge page request defined in the pod spec
|
||||
can be fulfilled by a candidate node.
|
||||
|
||||
## cAdvisor changes
|
||||
|
||||
cAdvisor will need to be modified to return the number of pre-allocated huge
|
||||
pages per page size on the node. This information will be used to determine capacity and
|
||||
calculate allocatable values on the node.
|
||||
|
||||
## Roadmap
|
||||
|
||||
### Version 1.8
|
||||
|
||||
Initial alpha support for huge pages usage by pods.
|
||||
|
||||
### Version 1.9
|
||||
|
||||
Resource Quota support. Limit Range support. Beta support for huge pages
|
||||
(pending community feedback).
|
||||
|
||||
## Known Issues
|
||||
|
||||
### Huge pages as shared memory
|
||||
|
||||
For the Java use case, the JVM maps the huge pages as a shared memory segment
|
||||
and memlocks them to prevent the system from moving or swapping them out.
|
||||
|
||||
There are several issues here:
|
||||
- The user running the Java app must be a member of the gid set in the
|
||||
`vm.hugetlb_shm_group` sysctl
|
||||
- sysctl `kernel.shmmax` must allow the size of the shared memory segment
|
||||
- The user's memlock ulimits must allow the size of the shared memory segment
|
||||
- `vm.hugetlb_shm_group` is not namespaced.
|
||||
|
||||
### NUMA
|
||||
|
||||
NUMA is complicated. To support NUMA, the node must support cpu pinning,
|
||||
devices, and memory locality. Extending that requirement to huge pages is not
|
||||
much different. It is anticipated that the `kubelet` will provide future NUMA
|
||||
locality guarantees as a feature of QoS. In particular, pods in the
|
||||
`Guaranteed` QoS class are expected to have NUMA locality preferences.
|
||||
|
||||
|