Addressing Brian's comments

@@ -1,3 +1,5 @@
+# OBSOLETE
+
 # Templates+Parameterization: Repeatedly instantiating user-customized application topologies.
 
 ## Motivation

@@ -1,43 +0,0 @@
# Containerized Mounter with Chroot for Container-Optimized OS

## Goal

Due to security and management overhead, the new Container-Optimized OS used by
GKE does not carry certain storage drivers and tools, such as those needed for
nfs and glusterfs. This project takes a containerized-mount approach and
packages the mount binaries into a container. The volume plugin executes the
mount inside the container and shares the mount with the host.

## Design

1. A Docker image with the storage tools (nfs and glusterfs) pre-installed is
built and uploaded to GCS.
2. During GKE cluster configuration, the Docker image is pulled and installed
on the cluster node.
3. When an nfs or glusterfs mount is invoked by the kubelet, it runs the mount
command inside a container created from the pre-installed image, with mount
propagation set to "shared". This way, a mount performed inside the container
is visible to the host node as well.
4. As a special case for NFSv3, an rpcbind process is started before running
the mount command.

## Implementation details

* In the first version of the containerized mounter, we used rkt fly to
dynamically start a container during mount. When the mount command finishes,
the container normally exits and is garbage-collected. In the glusterfs case,
however, a gluster daemon keeps running after the mount command finishes until
the glusterfs volume is unmounted, so the container started for the mount
continues to run until the glusterfs client finishes. The container cannot be
garbage-collected right away, and multiple containers might be running at the
same time. Because of shared mount propagation, the number of mounts increases
significantly as more containers run and might cause a kernel panic. To solve
this problem, a chroot approach was proposed and implemented.
* In the second version, instead of running a container on the host, the Docker
container's filesystem is exported as a tar archive and pre-installed on the
host. The kubelet directory is a shared mount between the host and the inside
of the container's rootfs. When a gluster/nfs mount is issued, a mounter script
uses chroot to change into the container's rootfs and run the mount there. This
approach is very clean since there is no need to manage a container's
lifecycle, and it avoids having a large number of mounts.
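
The chroot flow can be sketched in shell. The image name, rootfs location, and
volume paths below are placeholders for illustration, not the exact paths used
on GKE nodes:

```
# One-time node setup: unpack the mounter image's filesystem onto the host
# (image name and rootfs location are illustrative).
docker export "$(docker create gcr.io/example/nfs-gluster-mounter:latest)" \
  | tar -x -C /opt/containerized-mounter/rootfs

# Make the kubelet directory a shared mount that is visible inside the rootfs.
mkdir -p /opt/containerized-mounter/rootfs/var/lib/kubelet
mount --rbind /var/lib/kubelet /opt/containerized-mounter/rootfs/var/lib/kubelet
mount --make-rshared /opt/containerized-mounter/rootfs/var/lib/kubelet

# At mount time, the mounter script chroots into the rootfs and runs the real
# mount there, e.g. for an NFS volume:
chroot /opt/containerized-mounter/rootfs \
  mount -t nfs nfs-server.example.com:/exports/data \
  /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/data
```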

@@ -1,240 +1,244 @@
Uncategorized
admission_control_event_rate_limit.md
create_sheet.py
create_sheet.py~
design_proposal_template.md
dir_struct.txt
event_compression.md
multi-platform.md
owners
pleg.png
readme.md
runtime-client-server.md
templates.md~
./sig-cli
get-describe-apiserver-extensions.md
kubectl-create-from-env-file.md
kubectl-extension.md
kubectl-login.md
kubectl_apply_getsetdiff_last_applied_config.md
multi-fields-merge-key.md
owners
preserve-order-in-strategic-merge-patch.md
simple-rolling-update.md
./network
command_execution_port_forwarding.md
external-lb-source-ip-preservation.md
flannel-integration.md
network-policy.md
networking.md
selinux-enhancements.md
service-discovery.md
service-external-name.md
./resource-management
admission_control_limit_range.md
admission_control_resource_quota.md
device-plugin-overview.png
device-plugin.md
device-plugin.png
gpu-support.md
hugepages.md
resource-quota-scoping.md
./testing
flakiness-sla.md
./autoscaling
horizontal-pod-autoscaler.md
hpa-status-conditions.md
hpa-v2.md
initial-resources.md
./architecture
architecture.md
architecture.png
architecture.svg
identifiers.md
namespaces.md
principles.md
./api-machinery
add-new-patchstrategy-to-clear-fields-not-present-in-patch.md
admission_control.md
admission_control_extension.md
aggregated-api-servers.md
api-chunking.md
api-group.md
apiserver-build-in-admission-plugins.md
apiserver-count-fix.md
apiserver-watch.md
auditing.md
bulk_watch.md
client-package-structure.md
controller-ref.md
csi-client-structure-proposal.md
csi-new-client-library-procedure.md
customresources-validation.md
dynamic-admission-control-configuration.md
extending-api.md
garbage-collection.md
metadata-policy.md
protobuf.md
server-get.md
synchronous-garbage-collection.md
thirdpartyresources.md
./node
all-in-one-volume.md
annotations-downward-api.md
configmap.md
container-init.md
container-runtime-interface-v1.md
cpu-manager.md
cri-dockershim-checkpoint.md
disk-accounting.md
downward_api_resources_limits_requests.md
dynamic-kubelet-configuration.md
envvar-configmap.md
expansion.md
kubelet-auth.md
kubelet-authorizer.md
kubelet-cri-logging.md
kubelet-eviction.md
kubelet-hypercontainer-runtime.md
kubelet-rkt-runtime.md
kubelet-rootfs-distribution.md
kubelet-systemd.md
node-allocatable.md
optional-configmap.md
pod-cache.png
pod-lifecycle-event-generator.md
pod-pid-namespace.md
pod-resource-management.md
propagation.md
resource-qos.md
runtime-pod-cache.md
seccomp.md
secret-configmap-downwardapi-file-mode.md
selinux.md
sysctl.md
./service-catalog
pod-preset.md
./instrumentation
core-metrics-pipeline.md
custom-metrics-api.md
metrics-server.md
monitoring_architecture.md
monitoring_architecture.png
performance-related-monitoring.md
resource-metrics-api.md
volume_stats_pvc_ref.md
./auth
access.md
apparmor.md
enhance-pluggable-policy.md
image-provenance.md
no-new-privs.md
pod-security-context.md
secrets.md
security-context-constraints.md
security.md
security_context.md
service_accounts.md
./federation
control-plane-resilience.md
federated-api-servers.md
federated-ingress.md
federated-placement-policy.md
federated-replicasets.md
federated-services.md
federation-clusterselector.md
federation-high-level-arch.png
federation-lite.md
federation-phase-1.md
federation.md
ubernetes-cluster-state.png
ubernetes-design.png
ubernetes-scheduling.png
./scalability
kubemark.md
kubemark_architecture.png
scalability-testing.md
./cluster-lifecycle
bootstrap-discovery.md
cluster-deployment.md
clustering.md
dramatically-simplify-cluster-creation.md
ha_master.md
high-availability.md
kubelet-tls-bootstrap.md
local-cluster-ux.md
runtimeconfig.md
self-hosted-final-cluster.png
self-hosted-kubelet.md
self-hosted-kubernetes.md
self-hosted-layers.png
self-hosted-moving-parts.png
./cluster-lifecycle/clustering
.gitignore
dockerfile
dynamic.png
dynamic.seqdiag
makefile
owners
readme.md
static.png
static.seqdiag
./release
release-notes.md
release-test-signal.md
versioning.md
./scheduling
hugepages.md
multiple-schedulers.md
nodeaffinity.md
pod-preemption.md
pod-priority-api.md
podaffinity.md
rescheduler.md
rescheduling-for-critical-pods.md
rescheduling.md
resources.md
scheduler_extender.md
taint-node-by-condition.md
taint-toleration-dedicated.md
./scheduling/images
.gitignore
owners
preemption_1.png
preemption_2.png
preemption_3.png
preemption_4.png
./apps
controller_history.md
cronjob.md
daemon.md
daemonset-update.md
deploy.md
deployment.md
indexed-job.md
job.md
obsolete_templates.md
selector-generation.md
stateful-apps.md
statefulset-update.md
./storage
containerized-mounter.md
default-storage-class.md
flex-volumes-drivers-psp.md
flexvolume-deployment.md
grow-volume-size.md
local-storage-overview.md
mount-options.md
persistent-storage.md
pod-safety.md
volume-hostpath-qualifiers.md
volume-metrics.md
volume-ownership-management.md
volume-provisioning.md
volume-selectors.md
volume-snapshotting.md
volume-snapshotting.png
volumes.md
./aws
aws_under_the_hood.md
./gcp
gce-l4-loadbalancer-healthcheck.md
./cloud-provider
cloud-provider-refactoring.md
cloudprovider-storage-metrics.md

@@ -0,0 +1,308 @@
# HugePages support in Kubernetes

**Authors**
* Derek Carr (@derekwaynecarr)
* Seth Jennings (@sjenning)
* Piotr Prokop (@PiotrProkop)

**Status**: In progress

## Abstract

A proposal to enable applications running in a Kubernetes cluster to use huge
pages.

A pod may request a number of huge pages. The `scheduler` is able to place the
pod on a node that can satisfy that request. The `kubelet` advertises an
allocatable number of huge pages to support scheduling decisions. A pod may
consume huge pages via `hugetlbfs` or `shmget`. Huge pages are not
overcommitted.

## Motivation

Memory is managed in blocks known as pages. On most systems, a page is 4Ki, so
1Mi of memory is equal to 256 pages and 1Gi of memory is 262,144 pages. CPUs
have a built-in memory management unit that manages a list of these pages in
hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of
virtual-to-physical page mappings. If the virtual address passed in a hardware
instruction can be found in the TLB, the mapping can be determined quickly. If
not, a TLB miss occurs, and the system falls back to slower, software-based
address translation, which hurts performance. Since the size of the TLB is
fixed, the only way to reduce the chance of a TLB miss is to increase the page
size.

A huge page is a memory page that is larger than 4Ki. On x86_64 architectures,
there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other
architectures, but the idea is the same. In order to use huge pages, an
application must be written to be aware of them. Transparent huge pages (THP)
attempt to automate the management of huge pages without application knowledge,
but they have limitations. In particular, they are limited to 2Mi page sizes.
THP might also lead to performance degradation on nodes with high memory
utilization or fragmentation, because the defragmenting efforts of THP can lock
memory pages. For this reason, some applications may be designed to use (or
recommend) pre-allocated huge pages instead of THP.

Managing memory is hard, and unfortunately, there is no one-size-fits-all
solution for all applications.

## Scope

This proposal only includes pre-allocated huge pages configured on the node by
the administrator at boot time or by manual dynamic allocation. It does not
discuss how the cluster could dynamically attempt to allocate huge pages in an
attempt to find a fit for a pod pending scheduling. It is anticipated that
operators may use a variety of strategies to allocate huge pages, but we do not
anticipate the kubelet itself doing the allocation. Allocation of huge pages
ideally happens soon after boot time.

This proposal defers issues relating to NUMA.
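
For illustration, pre-allocation of the kind this proposal assumes is typically
done with standard Linux mechanisms; the page counts below are arbitrary
example values:

```
# At boot: reserve 512 default-sized (2Mi) pages via kernel parameters, e.g.
#   hugepages=512
# or reserve gigantic pages:
#   default_hugepagesz=1G hugepagesz=1G hugepages=4

# After boot: adjust the 2Mi pool manually (may fail once memory is fragmented).
sysctl -w vm.nr_hugepages=512

# Inspect the resulting pools.
grep Huge /proc/meminfo
```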

## Use Cases

The class of applications that benefit from huge pages typically have
- a large memory working set
- a sensitivity to memory access latency

Example applications include:
- database management systems (MySQL, PostgreSQL, MongoDB, Oracle, etc.)
- Java applications, which can back the heap with huge pages using the
  `-XX:+UseLargePages` and `-XX:LargePageSizeInBytes` options
- packet processing systems (DPDK)

Applications can generally use huge pages by calling
- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB`, using the mapping as anonymous
  memory
- `mmap()` of a file backed by `hugetlbfs`
- `shmget()` with `SHM_HUGETLB`, using the result as a shared memory segment
  (see Known Issues)

The requirements are:

1. A pod can use huge pages with any of the methods described above.
1. A pod can request huge pages.
1. A scheduler can bind pods to nodes that have available huge pages.
1. A quota may limit usage of huge pages.
1. A limit range may constrain min and max huge page requests.

## Feature Gate

The proposal introduces huge pages as an Alpha feature.

It must be enabled via the `--feature-gates=HugePages=true` flag on pertinent
components pending graduation to Beta.

## Node Specification

Huge pages cannot be overcommitted on a node.

A system may support multiple huge page sizes. It is assumed that most nodes
will be configured to primarily use the default huge page size, as returned by
`grep Hugepagesize /proc/meminfo`. This defaults to 2Mi on most Linux systems
unless overridden by `default_hugepagesz=1g` in the kernel boot parameters.

For each supported huge page size, the node will advertise a resource of the
form `hugepages-<hugepagesize>`. On Linux, supported huge page sizes are
determined by parsing the `/sys/kernel/mm/hugepages/hugepages-{size}kB`
directories on the host. Kubernetes will expose the `hugepages-<hugepagesize>`
resource using binary notation: it will convert `<hugepagesize>` into the most
compact binary notation using integer values. For example, if a node supports
`hugepages-2048kB`, a resource `hugepages-2Mi` will be shown in the node's
capacity and allocatable values. Operators may set aside pre-allocated huge
pages that are not available for user pods, similar to normal memory, via the
`--system-reserved` flag.
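
As a concrete illustration of what the kubelet would parse (output shown for a
hypothetical x86_64 node with both sizes configured):

```
$ ls /sys/kernel/mm/hugepages/
hugepages-1048576kB  hugepages-2048kB

$ grep Hugepagesize /proc/meminfo
Hugepagesize:       2048 kB
```

These would surface as the node resources `hugepages-1Gi` and `hugepages-2Mi`,
respectively.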

There are a variety of huge page sizes supported across different hardware
architectures. It is preferred to have a resource per size in order to better
support quota: 1 huge page of size 2Mi is orders of magnitude different from 1
huge page of size 1Gi. We assume gigantic pages are even more precious
resources than huge pages.

Pre-allocated huge pages reduce the amount of allocatable memory on a node. The
node will treat pre-allocated huge pages similarly to other system reservations
and reduce the amount of `memory` it reports using the following formula:

```
[Allocatable] = [Node Capacity] -
  [Kube-Reserved] -
  [System-Reserved] -
  [Pre-Allocated-HugePages * HugePageSize] -
  [Hard-Eviction-Threshold]
```

The following represents a machine with 10Gi of memory where 1Gi of memory has
been reserved as 512 pre-allocated huge pages of size 2Mi. As you can see, the
allocatable memory has been reduced to account for the amount of huge pages
reserved.

```
apiVersion: v1
kind: Node
metadata:
  name: node1
...
status:
  capacity:
    memory: 10Gi
    hugepages-2Mi: 1Gi
  allocatable:
    memory: 9Gi
    hugepages-2Mi: 1Gi
...
```

## Pod Specification

A pod must request pre-allocated huge pages using the resource
`hugepages-<hugepagesize>`, whose quantity is a positive amount of memory in
bytes. The specified amount must align with `<hugepagesize>`; otherwise, the
pod will fail validation. For example, it would be valid to request
`hugepages-2Mi: 4Mi`, but invalid to request `hugepages-2Mi: 3Mi`.

The request and limit for `hugepages-<hugepagesize>` must match. Similar to
memory, an application that requests the `hugepages-<hugepagesize>` resource is
at minimum in the `Burstable` QoS class.

If a pod consumes huge pages via `shmget`, it must run with a supplemental
group that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration
of this group is outside the scope of this specification.
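
For example, if the node's `hugetlb_shm_group` were gid `3000` (an assumed
value for illustration), the pod would need something like:

```
apiVersion: v1
kind: Pod
metadata:
  name: shm-example
spec:
  securityContext:
    supplementalGroups: [3000]   # must match /proc/sys/vm/hugetlb_shm_group
  containers:
  ...
```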

Initially, a pod may not consume multiple huge page sizes in a single pod spec.
Attempting to use `hugepages-2Mi` and `hugepages-1Gi` in the same pod spec will
fail validation. We believe it is rare for applications to attempt to use
multiple huge page sizes. This restriction may be lifted in the future as the
community presents use cases. Introducing the feature with this restriction
limits the exposure of API changes needed when consuming huge pages via
volumes.

In order to consume huge pages backed by the `hugetlbfs` filesystem inside the
specified container in the pod, it is helpful to understand the set of mount
options used with `hugetlbfs`. For more details, see "Using Huge Pages" in
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt:

```
mount -t hugetlbfs \
  -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
  min_size=<value>,nr_inodes=<value> none /mnt/huge
```

The proposal recommends extending the existing `EmptyDirVolumeSource` to
satisfy this use case. A new `medium=HugePages` option would be supported. To
write into this volume, the pod must make a request for huge pages. The
`pagesize` argument is inferred from the `hugepages-<hugepagesize>` resource
request. If multiple huge page sizes are supported in a single pod spec in the
future, we may modify `EmptyDirVolumeSource` to provide an optional page size.
The existing `sizeLimit` option for `emptyDir` would restrict usage to the
minimum of `sizeLimit` and the sum of the huge page limits of all containers in
the pod. This keeps the behavior consistent with memory-backed `emptyDir`
volumes, whose usage is ultimately constrained by the pod cgroup sandbox memory
settings. The `min_size` option is omitted as it is not necessary. The
`nr_inodes` mount option is omitted at this time, in the same manner that it is
omitted with `medium=Memory` when using `tmpfs`.

The following is a sample pod that is limited to 1Gi of huge pages of size 2Mi.
It can consume those pages using `shmget()` or via `mmap()` with the specified
volume.

```
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  ...
    volumeMounts:
    - mountPath: /hugepages
      name: hugepage
    resources:
      requests:
        hugepages-2Mi: 1Gi
      limits:
        hugepages-2Mi: 1Gi
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
```

## CRI Updates

The `LinuxContainerResources` message should be extended to support specifying
huge page limits per size. The specification for huge pages should align with
opencontainers/runtime-spec; see
https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits

The CRI changes are required before promoting this feature to Beta.
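
For reference, the linked runtime-spec expresses per-size huge page limits
roughly as follows (rendered here as YAML for consistency with the other
examples; the spec itself uses JSON, and the exact CRI field names are not
settled by this proposal):

```
linux:
  resources:
    hugepageLimits:
    - pageSize: 2MB
      limit: 1073741824   # bytes; 1Gi worth of 2Mi pages
```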

## Cgroup Enforcement

To use this feature, `--cgroups-per-qos` must be enabled. In addition, the
`hugetlb` cgroup must be mounted.

The `kubepods` cgroup is bounded by the `Allocatable` value.

The QoS-level cgroups are left unbounded across all huge page pool sizes.

The pod-level cgroup sandbox is configured as follows, where `<hugepagesize>`
is a system-supported huge page size. If no request is made for huge pages of a
particular size, the limit is set to 0 for all supported sizes on the node.

```
pod<UID>/hugetlb.<hugepagesize>.limit_in_bytes = sum(pod.spec.containers.resources.limits[hugepages-<hugepagesize>])
```
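
For the sample pod above, which limits `hugepages-2Mi` to 1Gi, the resulting
pod-level setting on a cgroup v1 node would look roughly like:

```
pod<UID>/hugetlb.2MB.limit_in_bytes = 1073741824
```

Note that the `hugetlb` controller names its files with the kernel's page-size
notation (`2MB`, `1GB`) rather than the Kubernetes resource notation (`2Mi`,
`1Gi`).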

If the container runtime supports specification of huge page limits, the
container cgroup sandbox will be configured with the specified limit.

The `kubelet` will verify that the `hugetlb` controller has no usage charged to
the pod-level cgroup sandbox prior to deleting the pod, so that all resources
are reclaimed.

## Limits and Quota

The `ResourceQuota` resource will be extended to support accounting for
`hugepages-<hugepagesize>` similar to `cpu` and `memory`. The `LimitRange`
resource will be extended to define min and max constraints for `hugepages`
similar to `cpu` and `memory`.
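
A minimal sketch of how that might look, assuming the quota and limit-range
keys simply reuse the node resource name (the exact key names are a detail to
be settled during implementation):

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: hugepages-quota
spec:
  hard:
    hugepages-2Mi: 4Gi      # total 2Mi huge pages requestable in the namespace
---
apiVersion: v1
kind: LimitRange
metadata:
  name: hugepages-limits
spec:
  limits:
  - type: Container
    min:
      hugepages-2Mi: 2Mi
    max:
      hugepages-2Mi: 1Gi
```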

## Scheduler changes

The scheduler will need to ensure that any huge page request defined in the pod
spec can be fulfilled by a candidate node.

## cAdvisor changes

cAdvisor will need to be modified to return the number of pre-allocated huge
pages per page size on the node. This will be used to determine capacity and to
calculate allocatable values on the node.

## Roadmap

### Version 1.8

Initial alpha support for huge page usage by pods.

### Version 1.9

Resource Quota support. Limit Range support. Beta support for huge pages
(pending community feedback).

## Known Issues

### Huge pages as shared memory

For the Java use case, the JVM maps the huge pages as a shared memory segment
and memlocks them to prevent the system from moving or swapping them out.

There are several issues here (a node-side sketch follows the list):
- The user running the Java app must be a member of the gid set in the
  `vm.hugetlb_shm_group` sysctl.
- The sysctl `kernel.shmmax` must allow the size of the shared memory segment.
- The user's memlock ulimit must allow the size of the shared memory segment.
- `vm.hugetlb_shm_group` is not namespaced.
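
A hedged sketch of the node-side settings involved, using arbitrary example
values (gid `3000`, a 1Gi segment); the actual values depend on the workload
and are not prescribed by this proposal:

```
# Allow members of gid 3000 to create SysV shared memory backed by huge pages.
sysctl -w vm.hugetlb_shm_group=3000

# Allow SysV shared memory segments of at least 1Gi.
sysctl -w kernel.shmmax=1073741824

# The invoking user also needs a memlock ulimit that covers the segment,
# e.g. in /etc/security/limits.conf:
#   someuser  hard  memlock  1048576   # value in KiB

# The JVM side would then use flags such as:
#   java -XX:+UseLargePages -XX:LargePageSizeInBytes=2m ...
```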

### NUMA

NUMA is complicated. To support it, the node must account for CPU pinning,
devices, and memory locality. Extending that requirement to huge pages is not
much different. It is anticipated that the `kubelet` will provide future NUMA
locality guarantees as a feature of QoS. In particular, pods in the
`Guaranteed` QoS class are expected to have NUMA locality preferences.