Addressing Brian's comments
|
|
@@ -1,3 +1,5 @@
|
|||
# OBSOLETE
|
||||
|
||||
# Templates+Parameterization: Repeatedly instantiating user-customized application topologies.
|
||||
|
||||
## Motivation
|
||||
|
|
@@ -1,43 +0,0 @@
|
|||
# Containerized Mounter with Chroot for Container-Optimized OS
|
||||
|
||||
## Goal
|
||||
|
||||
Due to security and management overhead, the new Container-Optimized OS used by
GKE does not ship certain storage drivers and tools, such as those needed for
nfs and glusterfs mounts. This project takes a containerized mount approach
that packages the mount binaries into a container. The volume plugin executes
the mount inside the container and shares the mount with the host.
|
||||
|
||||
|
||||
## Design
|
||||
|
||||
1. A Docker image with the storage tools (nfs and glusterfs) pre-installed is
   built and uploaded to GCS.
2. During GKE cluster configuration, the Docker image is pulled and installed
   on the cluster node.
3. When an nfs or glusterfs mount is invoked by the kubelet, the kubelet runs
   the mount command inside a container created from the pre-installed Docker
   image, with mount propagation set to "shared". In this way, the mount
   created inside the container is visible to the host node as well.
4. As a special case for NFSv3, an rpcbind process is started before the mount
   command is run.
|
||||
|
||||
## Implementation details
|
||||
|
||||
* In the first version of the containerized mounter, we used rkt fly to
  dynamically start a container during mount. When the mount command finishes,
  the container exits normally and is garbage-collected. In the glusterfs case,
  however, a gluster daemon keeps running after the mount command finishes and
  until the glusterfs volume is unmounted, so the container started for the
  mount continues to run until the glusterfs client finishes. Such a container
  cannot be garbage-collected right away, and multiple containers might be
  running at the same time. Because of shared mount propagation, as more
  containers run, the number of mounts increases significantly and might cause
  a kernel panic. To solve this problem, a chroot approach was proposed and
  implemented.
* In the second version, instead of running a container on the host, the Docker
  container's filesystem is exported as a tar archive and pre-installed on the
  host. The kubelet directory is a shared mount between the host and the
  container's rootfs. When a gluster/nfs mount is issued, a mounter script uses
  chroot to change into the container's rootfs and run the mount, as sketched
  below. This approach is much cleaner, since there is no container lifecycle
  to manage and it avoids having a large number of mounts.
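A rough sketch of the chroot-based flow, assuming the mounter rootfs has been
unpacked to an illustrative host path and the target is an nfs volume (paths,
server, and volume names are placeholders, not values defined by this design):

```
# Illustrative location where the mounter image's rootfs was unpacked on host.
MOUNTER_ROOTFS=/home/kubernetes/containerized_mounter/rootfs

# The kubelet directory is a shared mount, so a mount performed under chroot
# inside the rootfs is also visible to the host.
chroot "${MOUNTER_ROOTFS}" \
    mount -t nfs <server>:/export \
    /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~nfs/<volume-name>
```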
|
||||
|
|
@@ -1,240 +1,244 @@
|
|||
Uncategorized (Please Help)
|
||||
high-availability.md
|
||||
control-plane-resilience.md
|
||||
downward_api_resources_limits_requests.md
|
||||
seccomp.md
|
||||
client-package-structure.md
|
||||
service-discovery.md
|
||||
metadata-policy.md
|
||||
containerized-mounter.md~
|
||||
identifiers.md
|
||||
local-cluster-ux.md
|
||||
pod-pid-namespace.md
|
||||
grow-volume-size.md
|
||||
image-provenance.md
|
||||
core-metrics-pipeline.md
|
||||
versioning.md
|
||||
ha_master.md
|
||||
secret-configmap-downwarapi-file-mode.md
|
||||
protobuf.md
|
||||
flakiness-sla.md
|
||||
resources.md
|
||||
initial-resources.md
|
||||
Uncategorized
|
||||
admission_control_event_rate_limit.md
|
||||
create_sheet.py
|
||||
runtime-client-server.md
|
||||
OWNERS
|
||||
namespaces.md
|
||||
cpu-manager.md
|
||||
selinux-enhancements.md
|
||||
sysctl.md
|
||||
create_sheet.py~
|
||||
design_proposal_template.md
|
||||
dir_struct.txt
|
||||
selinux.md
|
||||
templates.md
|
||||
pod-cache.png
|
||||
README.md
|
||||
multi-platform.md
|
||||
pod-lifecycle-event-generator.md
|
||||
secrets.md
|
||||
cri-dockershim-checkpoint.md
|
||||
event_compression.md
|
||||
multi-platform.md
|
||||
owners
|
||||
pleg.png
|
||||
readme.md
|
||||
runtime-client-server.md
|
||||
templates.md~
|
||||
./sig-cli
|
||||
get-describe-apiserver-extensions.md
|
||||
kubectl-create-from-env-file.md
|
||||
kubectl-extension.md
|
||||
kubectl-login.md
|
||||
kubectl_apply_getsetdiff_last_applied_config.md
|
||||
multi-fields-merge-key.md
|
||||
template.md
|
||||
expansion.md
|
||||
kubectl-login.md
|
||||
simple-rolling-update.md
|
||||
OWNERS
|
||||
get-describe-apiserver-extensions.md
|
||||
owners
|
||||
preserve-order-in-strategic-merge-patch.md
|
||||
kubectl-create-from-env-file.md
|
||||
simple-rolling-update.md
|
||||
./network
|
||||
flannel-integration.md
|
||||
service-external-name.md
|
||||
networking.md
|
||||
command_execution_port_forwarding.md
|
||||
network-policy.md
|
||||
external-lb-source-ip-preservation.md
|
||||
flannel-integration.md
|
||||
network-policy.md
|
||||
networking.md
|
||||
selinux-enhancements.md
|
||||
service-discovery.md
|
||||
service-external-name.md
|
||||
./resource-management
|
||||
admission_control_limit_range.md
|
||||
admission_control_resource_quota.md
|
||||
device-plugin-overview.png
|
||||
device-plugin.md
|
||||
device-plugin.png
|
||||
gpu-support.md
|
||||
device-plugin-overview.png
|
||||
hugepages.md
|
||||
resource-quota-scoping.md
|
||||
./testing
|
||||
flakiness-sla.md
|
||||
./autoscaling
|
||||
hpa-v2.md
|
||||
hpa-status-conditions.md
|
||||
horizontal-pod-autoscaler.md
|
||||
hpa-status-conditions.md
|
||||
hpa-v2.md
|
||||
initial-resources.md
|
||||
./architecture
|
||||
architecture.md
|
||||
architecture.dia
|
||||
architecture.png
|
||||
architecture.svg
|
||||
./api-machinery
|
||||
admission_control_extension.md
|
||||
csi-client-structure-proposal.md
|
||||
selector-generation.md
|
||||
pod-safety.md
|
||||
container-init.md
|
||||
resource-quota-scoping.md
|
||||
thirdpartyresources.md
|
||||
aggregated-api-servers.md
|
||||
extending-api.md
|
||||
envvar-configmap.md
|
||||
dynamic-admission-control-configuration.md
|
||||
api-chunking.md
|
||||
garbage-collection.md
|
||||
customresources-validation.md
|
||||
auditing.md
|
||||
apiserver-watch.md
|
||||
admission_control_limit_range.md
|
||||
apiserver-build-in-admission-plugins.md
|
||||
synchronous-garbage-collection.md
|
||||
configmap.md
|
||||
csi-new-client-library-procedure.md
|
||||
pod-preset.md
|
||||
add-new-patchStrategy-to-clear-fields-not-present-in-patch.md
|
||||
api-group.md
|
||||
identifiers.md
|
||||
namespaces.md
|
||||
principles.md
|
||||
./api-machinery
|
||||
add-new-patchstrategy-to-clear-fields-not-present-in-patch.md
|
||||
admission_control.md
|
||||
optional-configmap.md
|
||||
server-get.md
|
||||
admission_control_extension.md
|
||||
aggregated-api-servers.md
|
||||
api-chunking.md
|
||||
api-group.md
|
||||
apiserver-build-in-admission-plugins.md
|
||||
apiserver-count-fix.md
|
||||
admission_control_resource_quota.md
|
||||
apiserver-watch.md
|
||||
auditing.md
|
||||
bulk_watch.md
|
||||
client-package-structure.md
|
||||
controller-ref.md
|
||||
csi-client-structure-proposal.md
|
||||
csi-new-client-library-procedure.md
|
||||
customresources-validation.md
|
||||
dynamic-admission-control-configuration.md
|
||||
extending-api.md
|
||||
garbage-collection.md
|
||||
metadata-policy.md
|
||||
protobuf.md
|
||||
server-get.md
|
||||
synchronous-garbage-collection.md
|
||||
thirdpartyresources.md
|
||||
./node
|
||||
pod-resource-management.md
|
||||
kubelet-tls-bootstrap.md
|
||||
dynamic-kubelet-configuration.md
|
||||
kubelet-hypercontainer-runtime.md
|
||||
all-in-one-volume.md
|
||||
annotations-downward-api.md
|
||||
configmap.md
|
||||
container-init.md
|
||||
container-runtime-interface-v1.md
|
||||
kubelet-authorizer.md
|
||||
cpu-manager.md
|
||||
cri-dockershim-checkpoint.md
|
||||
disk-accounting.md
|
||||
kubelet-systemd.md
|
||||
kubelet-cri-logging.md
|
||||
downward_api_resources_limits_requests.md
|
||||
dynamic-kubelet-configuration.md
|
||||
envvar-configmap.md
|
||||
expansion.md
|
||||
kubelet-auth.md
|
||||
runtimeconfig.md
|
||||
kubelet-authorizer.md
|
||||
kubelet-cri-logging.md
|
||||
kubelet-eviction.md
|
||||
kubelet-hypercontainer-runtime.md
|
||||
kubelet-rkt-runtime.md
|
||||
kubelet-rootfs-distribution.md
|
||||
kubelet-systemd.md
|
||||
node-allocatable.md
|
||||
optional-configmap.md
|
||||
pod-cache.png
|
||||
pod-lifecycle-event-generator.md
|
||||
pod-pid-namespace.md
|
||||
pod-resource-management.md
|
||||
propagation.md
|
||||
resource-qos.md
|
||||
runtime-pod-cache.md
|
||||
kubelet-rootfs-distribution.md
|
||||
kubelet-rkt-runtime.md
|
||||
node-allocatable.md
|
||||
kubelet-eviction.md
|
||||
seccomp.md
|
||||
secret-configmap-downwardapi-file-mode.md
|
||||
selinux.md
|
||||
sysctl.md
|
||||
./service-catalog
|
||||
pod-preset.md
|
||||
./instrumentation
|
||||
core-metrics-pipeline.md
|
||||
custom-metrics-api.md
|
||||
metrics-server.md
|
||||
monitoring_architecture.md
|
||||
monitoring_architecture.png
|
||||
custom-metrics-api.md
|
||||
resource-metrics-api.md
|
||||
performance-related-monitoring.md
|
||||
metrics-server.md
|
||||
resource-metrics-api.md
|
||||
volume_stats_pvc_ref.md
|
||||
./auth
|
||||
security_context.md
|
||||
no-new-privs.md
|
||||
access.md
|
||||
enhance-pluggable-policy.md
|
||||
apparmor.md
|
||||
security-context-constraints.md
|
||||
enhance-pluggable-policy.md
|
||||
image-provenance.md
|
||||
no-new-privs.md
|
||||
pod-security-context.md
|
||||
bulk_watch.md
|
||||
secrets.md
|
||||
security-context-constraints.md
|
||||
security.md
|
||||
security_context.md
|
||||
service_accounts.md
|
||||
./federation
|
||||
federated-replicasets.md
|
||||
ubernetes-design.png
|
||||
ubernetes-cluster-state.png
|
||||
federation-phase-1.md
|
||||
federation-clusterselector.md
|
||||
ubernetes-scheduling.png
|
||||
federation-lite.md
|
||||
federation.md
|
||||
federated-services.md
|
||||
federation-high-level-arch.png
|
||||
control-plane-resilience.md
|
||||
federated-api-servers.md
|
||||
federated-placement-policy.md
|
||||
federated-ingress.md
|
||||
federated-placement-policy.md
|
||||
federated-replicasets.md
|
||||
federated-services.md
|
||||
federation-clusterselector.md
|
||||
federation-high-level-arch.png
|
||||
federation-lite.md
|
||||
federation-phase-1.md
|
||||
federation.md
|
||||
ubernetes-cluster-state.png
|
||||
ubernetes-design.png
|
||||
ubernetes-scheduling.png
|
||||
./scalability
|
||||
Kubemark_architecture.png
|
||||
scalability-testing.md
|
||||
kubemark.md
|
||||
kubemark_architecture.png
|
||||
scalability-testing.md
|
||||
./cluster-lifecycle
|
||||
self-hosted-layers.png
|
||||
self-hosted-kubernetes.md
|
||||
dramatically-simplify-cluster-creation.md
|
||||
bootstrap-discovery.md
|
||||
cluster-deployment.md
|
||||
self-hosted-kubelet.md
|
||||
clustering.md
|
||||
dramatically-simplify-cluster-creation.md
|
||||
ha_master.md
|
||||
high-availability.md
|
||||
kubelet-tls-bootstrap.md
|
||||
local-cluster-ux.md
|
||||
runtimeconfig.md
|
||||
self-hosted-final-cluster.png
|
||||
self-hosted-kubelet.md
|
||||
self-hosted-kubernetes.md
|
||||
self-hosted-layers.png
|
||||
self-hosted-moving-parts.png
|
||||
./cluster-lifecycle/clustering
|
||||
static.png
|
||||
.gitignore
|
||||
Dockerfile
|
||||
static.seqdiag
|
||||
dynamic.seqdiag
|
||||
OWNERS
|
||||
Makefile
|
||||
README.md
|
||||
dockerfile
|
||||
dynamic.png
|
||||
dynamic.seqdiag
|
||||
makefile
|
||||
owners
|
||||
readme.md
|
||||
static.png
|
||||
static.seqdiag
|
||||
./release
|
||||
release-notes.md
|
||||
release-test-signal.md
|
||||
versioning.md
|
||||
./scheduling
|
||||
rescheduling.md
|
||||
rescheduler.md
|
||||
nodeaffinity.md
|
||||
podaffinity.md
|
||||
hugepages.md
|
||||
taint-toleration-dedicated.md
|
||||
multiple-schedulers.md
|
||||
nodeaffinity.md
|
||||
pod-preemption.md
|
||||
pod-priority-api.md
|
||||
taint-node-by-condition.md
|
||||
scheduler_extender.md
|
||||
podaffinity.md
|
||||
rescheduler.md
|
||||
rescheduling-for-critical-pods.md
|
||||
multiple-schedulers.md
|
||||
rescheduling.md
|
||||
resources.md
|
||||
scheduler_extender.md
|
||||
taint-node-by-condition.md
|
||||
taint-toleration-dedicated.md
|
||||
./scheduling/images
|
||||
.gitignore
|
||||
owners
|
||||
preemption_1.png
|
||||
preemption_2.png
|
||||
preemption_3.png
|
||||
preemption_4.png
|
||||
./apps
|
||||
daemonset-update.md
|
||||
cronjob.md
|
||||
annotations-downward-api.md
|
||||
controller-ref.md
|
||||
statefulset-update.md
|
||||
stateful-apps.md
|
||||
deploy.md
|
||||
daemon.md
|
||||
controller_history.md
|
||||
job.md
|
||||
indexed-job.md
|
||||
cronjob.md
|
||||
daemon.md
|
||||
daemonset-update.md
|
||||
deploy.md
|
||||
deployment.md
|
||||
indexed-job.md
|
||||
job.md
|
||||
obsolete_templates.md
|
||||
selector-generation.md
|
||||
stateful-apps.md
|
||||
statefulset-update.md
|
||||
./storage
|
||||
flex-volumes-drivers-psp.md
|
||||
local-storage-overview.md
|
||||
all-in-one-volume.md
|
||||
volume-selectors.md
|
||||
persistent-storage.md
|
||||
volume-metrics.md
|
||||
flexvolume-deployment.md
|
||||
volume-snapshotting.png
|
||||
volume-provisioning.md
|
||||
propagation.md
|
||||
volume-ownership-management.md
|
||||
mount-options.md
|
||||
volumes.md
|
||||
containerized-mounter.md
|
||||
default-storage-class.md
|
||||
volume-snapshotting.md
|
||||
flex-volumes-drivers-psp.md
|
||||
flexvolume-deployment.md
|
||||
grow-volume-size.md
|
||||
local-storage-overview.md
|
||||
mount-options.md
|
||||
persistent-storage.md
|
||||
pod-safety.md
|
||||
volume-hostpath-qualifiers.md
|
||||
volume-metrics.md
|
||||
volume-ownership-management.md
|
||||
volume-provisioning.md
|
||||
volume-selectors.md
|
||||
volume-snapshotting.md
|
||||
volume-snapshotting.png
|
||||
volumes.md
|
||||
./aws
|
||||
aws_under_the_hood.md
|
||||
./images
|
||||
preemption_1.png
|
||||
preemption_3.png
|
||||
.gitignore
|
||||
OWNERS
|
||||
preemption_2.png
|
||||
preemption_4.png
|
||||
./gcp
|
||||
gce-l4-loadbalancer-healthcheck.md
|
||||
containerized-mounter.md
|
||||
./cloud-provider
|
||||
cloudprovider-storage-metrics.md
|
||||
cloud-provider-refactoring.md
|
||||
cloudprovider-storage-metrics.md
|
||||
|
|
|
|||
|
|
|
@@ -0,0 +1,308 @@
|
|||
# HugePages support in Kubernetes
|
||||
|
||||
**Authors**
|
||||
* Derek Carr (@derekwaynecarr)
|
||||
* Seth Jennings (@sjenning)
|
||||
* Piotr Prokop (@PiotrProkop)
|
||||
|
||||
**Status**: In progress
|
||||
|
||||
## Abstract
|
||||
|
||||
A proposal to enable applications running in a Kubernetes cluster to use huge
|
||||
pages.
|
||||
|
||||
A pod may request a number of huge pages. The `scheduler` is able to place the
|
||||
pod on a node that can satisfy that request. The `kubelet` advertises an
|
||||
allocatable number of huge pages to support scheduling decisions. A pod may
|
||||
consume hugepages via `hugetlbfs` or `shmget`. Huge pages are not
|
||||
overcommitted.
|
||||
|
||||
## Motivation
|
||||
|
||||
Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi
|
||||
of memory is equal to 256 pages; 1Gi of memory is 262,144 pages, and so on. CPUs have
|
||||
a built-in memory management unit that manages a list of these pages in
|
||||
hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of
|
||||
virtual-to-physical page mappings. If the virtual address passed in a hardware
|
||||
instruction can be found in the TLB, the mapping can be determined quickly. If
|
||||
not, a TLB miss occurs, and the system falls back to slower, software based
|
||||
address translation. This results in performance issues. Since the size of the
|
||||
TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the
|
||||
page size.
|
||||
|
||||
A huge page is a memory page that is larger than 4Ki. On x86_64 architectures,
|
||||
there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other
|
||||
architectures, but the idea is the same. In order to use huge pages,
|
||||
an application must be written to be aware of them. Transparent huge pages (THP)
|
||||
attempts to automate the management of huge pages without application knowledge,
|
||||
but they have limitations. In particular, they are limited to 2Mi page sizes.
|
||||
THP might lead to performance degradation on nodes with high memory utilization
|
||||
or fragmentation due to defragmenting efforts of THP, which can lock memory
|
||||
pages. For this reason, some applications may be designed to use (or may
recommend using) pre-allocated huge pages instead of THP.
|
||||
|
||||
Managing memory is hard, and unfortunately, there is no one-size fits all
|
||||
solution for all applications.
|
||||
|
||||
## Scope
|
||||
|
||||
This proposal only includes pre-allocated huge pages configured on the node by
|
||||
the administrator at boot time or by manual dynamic allocation. It does not
|
||||
discuss how the cluster could dynamically attempt to allocate huge pages in an
|
||||
attempt to find a fit for a pod pending scheduling. It is anticipated that
|
||||
operators may use a variety of strategies to allocate huge pages, but we do not
|
||||
anticipate the kubelet itself doing the allocation. Allocation of huge pages
|
||||
ideally happens soon after boot time.
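Operators commonly perform that allocation either at boot via kernel
parameters or at runtime via sysctl; a minimal sketch is shown below (the page
count is illustrative, and this step is outside the kubelet's responsibility):

```
# Reserve 512 pages of the default huge page size (2Mi on most x86_64 nodes).
sysctl -w vm.nr_hugepages=512

# Or, more reliably, reserve them at boot via kernel command-line parameters:
#   default_hugepagesz=2M hugepagesz=2M hugepages=512
```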
|
||||
|
||||
This proposal defers issues relating to NUMA.
|
||||
|
||||
## Use Cases
|
||||
|
||||
The class of applications that benefit from huge pages typically have
|
||||
- A large memory working set
|
||||
- A sensitivity to memory access latency
|
||||
|
||||
Example applications include:
|
||||
- database management systems (MySQL, PostgreSQL, MongoDB, Oracle, etc.)
|
||||
- Java applications can back the heap with huge pages using the
|
||||
`-XX:+UseLargePages` and `-XX:LargePageSizeInBytes` options (see the example after this list).
|
||||
- packet processing systems (DPDK)
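For instance, a JVM heap can be backed by 2Mi huge pages roughly as follows
(flag values are illustrative, not part of this proposal):

```
java -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -Xms2g -Xmx2g -jar app.jar
```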
|
||||
|
||||
Applications can generally use huge pages by calling
|
||||
- `mmap()` with `MAP_ANONYMOUS | MAP_HUGETLB` and use it as anonymous memory
|
||||
- `mmap()` a file backed by `hugetlbfs`
|
||||
- `shmget()` with `SHM_HUGETLB` and use it as a shared memory segment (see Known
|
||||
Issues).
|
||||
|
||||
1. A pod can use huge pages with any of the previously described methods.
|
||||
1. A pod can request huge pages.
|
||||
1. A scheduler can bind pods to nodes that have available huge pages.
|
||||
1. A quota may limit usage of huge pages.
|
||||
1. A limit range may constrain min and max huge page requests.
|
||||
|
||||
## Feature Gate
|
||||
|
||||
The proposal introduces huge pages as an Alpha feature.
|
||||
|
||||
It must be enabled via the `--feature-gates=HugePages=true` flag on pertinent
|
||||
components pending graduation to Beta.
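For example, assuming the kubelet and API server are among the pertinent
components for a given cluster (an assumption, not a definitive list), the flag
would be passed as:

```
kubelet --feature-gates=HugePages=true ...
kube-apiserver --feature-gates=HugePages=true ...
```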
|
||||
|
||||
## Node Specification
|
||||
|
||||
Huge pages cannot be overcommitted on a node.
|
||||
|
||||
A system may support multiple huge page sizes. It is assumed that most nodes
|
||||
will be configured to primarily use the default huge page size as returned via
|
||||
`grep Hugepagesize /proc/meminfo`. This defaults to 2Mi on most Linux systems
|
||||
unless overridden by `default_hugepagesz=1g` in the kernel boot parameters.
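For illustration, the default size and the set of supported sizes can be
inspected on a node as follows (sample output for a typical x86_64 machine;
actual values vary):

```
$ grep Hugepagesize /proc/meminfo
Hugepagesize:       2048 kB
$ ls /sys/kernel/mm/hugepages/
hugepages-1048576kB  hugepages-2048kB
```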
|
||||
|
||||
For each supported huge page size, the node will advertise a resource of the
|
||||
form `hugepages-<hugepagesize>`. On Linux, supported huge page sizes are
|
||||
determined by parsing the `/sys/kernel/mm/hugepages/hugepages-{size}kB`
|
||||
directory on the host. Kubernetes will expose a `hugepages-<hugepagesize>`
|
||||
resource using binary notation form. It will convert `<hugepagesize>` into the
|
||||
most compact binary notation using integer values. For example, if a node
|
||||
supports `hugepages-2048kB`, a resource `hugepages-2Mi` will be shown in node
|
||||
capacity and allocatable values. Operators may set aside pre-allocated huge
|
||||
pages that are not available for user pods similar to normal memory via the
|
||||
`--system-reserved` flag.
|
||||
|
||||
There are a variety of huge page sizes supported across different hardware
|
||||
architectures. It is preferred to have a resource per size in order to better
|
||||
support quota. For example, 1 huge page with size 2Mi is orders of magnitude
|
||||
different than 1 huge page with size 1Gi. We assume gigantic pages are even
|
||||
more precious resources than huge pages.
|
||||
|
||||
Pre-allocated huge pages reduce the amount of allocatable memory on a node. The
|
||||
node will treat pre-allocated huge pages similar to other system reservations
|
||||
and reduce the amount of `memory` it reports using the following formula:
|
||||
|
||||
```
|
||||
[Allocatable] = [Node Capacity] -
|
||||
[Kube-Reserved] -
|
||||
[System-Reserved] -
|
||||
[Pre-Allocated-HugePages * HugePageSize] -
|
||||
[Hard-Eviction-Threshold]
|
||||
```
|
||||
|
||||
The following represents a machine with 10Gi of memory. 1Gi of memory has been
|
||||
reserved as 512 pre-allocated huge pages sized 2Mi. As you can see, the
|
||||
allocatable memory has been reduced to account for the amount of huge pages
|
||||
reserved.
|
||||
|
||||
```
|
||||
apiVersion: v1
|
||||
kind: Node
|
||||
metadata:
|
||||
name: node1
|
||||
...
|
||||
status:
|
||||
capacity:
|
||||
memory: 10Gi
|
||||
hugepages-2Mi: 1Gi
|
||||
allocatable:
|
||||
memory: 9Gi
|
||||
hugepages-2Mi: 1Gi
|
||||
...
|
||||
```
|
||||
|
||||
## Pod Specification
|
||||
|
||||
A pod must make a request to consume pre-allocated huge pages using the resource
|
||||
`hugepages-<hugepagesize>` whose quantity is a positive amount of memory in
|
||||
bytes. The specified amount must align with the `<hugepagesize>`; otherwise,
|
||||
the pod will fail validation. For example, it would be valid to request
|
||||
`hugepages-2Mi: 4Mi`, but invalid to request `hugepages-2Mi: 3Mi`.
|
||||
|
||||
The request and limit for `hugepages-<hugepagesize>` must match. Similar to
|
||||
memory, an application that requests `hugepages-<hugepagesize>` resource is at
|
||||
minimum in the `Burstable` QoS class.
|
||||
|
||||
If a pod consumes huge pages via `shmget`, it must run with a supplemental group
|
||||
that matches `/proc/sys/vm/hugetlb_shm_group` on the node. Configuration of
|
||||
this group is outside the scope of this specification.
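As an illustration of the supplemental group requirement, the sketch below
assumes the node's `/proc/sys/vm/hugetlb_shm_group` is set to gid `3000`; the
gid and image are placeholders, not values defined by this proposal.

```
apiVersion: v1
kind: Pod
metadata:
  name: shmget-example
spec:
  securityContext:
    # Assumed gid; must match /proc/sys/vm/hugetlb_shm_group on the node.
    supplementalGroups: [3000]
  containers:
  - name: app
    image: example.com/shm-app   # placeholder image
    resources:
      requests:
        hugepages-2Mi: 128Mi
      limits:
        hugepages-2Mi: 128Mi
```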
|
||||
|
||||
Initially, a pod may not consume multiple huge page sizes in a single pod spec.
|
||||
Attempting to use `hugepages-2Mi` and `hugepages-1Gi` in the same pod spec will
|
||||
fail validation. We believe it is rare for applications to attempt to use
|
||||
multiple huge page sizes. This restriction may be lifted in the future with
|
||||
community presented use cases. Introducing the feature with this restriction
|
||||
limits the exposure of API changes needed when consuming huge pages via volumes.
|
||||
|
||||
In order to consume huge pages backed by the `hugetlbfs` filesystem inside the
|
||||
specified container in the pod, it is helpful to understand the set of mount
|
||||
options used with `hugetlbfs`. For more details, see "Using Huge Pages" here:
|
||||
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
|
||||
|
||||
```
|
||||
mount -t hugetlbfs \
|
||||
-o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
|
||||
min_size=<value>,nr_inodes=<value> none /mnt/huge
|
||||
```
|
||||
|
||||
The proposal recommends extending the existing `EmptyDirVolumeSource` to satisfy
|
||||
this use case. A new `medium=HugePages` option would be supported. To write
|
||||
into this volume, the pod must make a request for huge pages. The `pagesize`
|
||||
argument is inferred from the `hugepages-<hugepagesize>` from the resource
|
||||
request. If in the future, multiple huge page sizes are supported in a single
|
||||
pod spec, we may modify the `EmptyDirVolumeSource` to provide an optional page
|
||||
size. The existing `sizeLimit` option for `emptyDir` would restrict usage to
|
||||
the minimum value specified between `sizeLimit` and the sum of huge page limits
|
||||
of all containers in a pod. This keeps the behavior consistent with memory
|
||||
backed `emptyDir` volumes whose usage is ultimately constrained by the pod
|
||||
cgroup sandbox memory settings. The `min_size` option is omitted as it is not
|
||||
necessary. The `nr_inodes` mount option is omitted at this time in the same
|
||||
manner it is omitted with `medium=Memory` when using `tmpfs`.
|
||||
|
||||
The following is a sample pod that is limited to 1Gi huge pages of size 2Mi. It
|
||||
can consume those pages using `shmget()` or via `mmap()` with the specified
|
||||
volume.
|
||||
|
||||
```
|
||||
apiVersion: v1
|
||||
kind: Pod
|
||||
metadata:
|
||||
name: example
|
||||
spec:
|
||||
containers:
|
||||
...
|
||||
volumeMounts:
|
||||
- mountPath: /hugepages
|
||||
name: hugepage
|
||||
resources:
|
||||
requests:
|
||||
hugepages-2Mi: 1Gi
|
||||
limits:
|
||||
hugepages-2Mi: 1Gi
|
||||
volumes:
|
||||
- name: hugepage
|
||||
emptyDir:
|
||||
medium: HugePages
|
||||
```
|
||||
|
||||
## CRI Updates
|
||||
|
||||
The `LinuxContainerResources` message should be extended to support specifying
|
||||
huge page limits per size. The specification for huge pages should align with
|
||||
opencontainers/runtime-spec.
|
||||
|
||||
see:
|
||||
https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#huge-page-limits
|
||||
|
||||
The CRI changes are required before promoting this feature to Beta.
|
||||
|
||||
## Cgroup Enforcement
|
||||
|
||||
To use this feature, the `--cgroups-per-qos` flag must be enabled. In addition, the
|
||||
`hugetlb` cgroup must be mounted.
|
||||
|
||||
The `kubepods` cgroup is bounded by the `Allocatable` value.
|
||||
|
||||
The QoS level cgroups are left unbounded across all huge page pool sizes.
|
||||
|
||||
The pod level cgroup sandbox is configured as follows, where `hugepagesize` is
|
||||
the system supported huge page size(s). If no request is made for huge pages of
|
||||
a particular size, the limit is set to 0 for all supported types on the node.
|
||||
|
||||
```
|
||||
pod<UID>/hugetlb.<hugepagesize>.limit_in_bytes = sum(pod.spec.containers.resources.limits[hugepages-<hugepagesize>])
|
||||
```
|
||||
|
||||
If the container runtime supports specification of huge page limits, the
|
||||
container cgroup sandbox will be configured with the specified limit.
|
||||
|
||||
The `kubelet` will ensure the `hugetlb` has no usage charged to the pod level
|
||||
cgroup sandbox prior to deleting the pod to ensure all resources are reclaimed.
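As an illustration, assuming a cgroup v1 hierarchy mounted under
`/sys/fs/cgroup` and the cgroupfs driver (exact paths depend on the cgroup
driver and the pod's QoS class), the enforced values can be inspected on the
node:

```
# Huge page limit charged to a pod-level cgroup sandbox.
cat /sys/fs/cgroup/hugetlb/kubepods/pod<UID>/hugetlb.2MB.limit_in_bytes

# Current usage; the kubelet expects this to be zero before deleting the pod.
cat /sys/fs/cgroup/hugetlb/kubepods/pod<UID>/hugetlb.2MB.usage_in_bytes
```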
|
||||
|
||||
## Limits and Quota
|
||||
|
||||
The `ResourceQuota` resource will be extended to support accounting for
|
||||
`hugepages-<hugepagesize>` similar to `cpu` and `memory`. The `LimitRange`
|
||||
resource will be extended to define min and max constraints for `hugepages`
|
||||
similar to `cpu` and `memory`.
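A sketch of what these extensions might look like follows; the exact quota and
limit keys are to be settled during implementation and are shown here by
analogy with `cpu` and `memory`:

```
apiVersion: v1
kind: ResourceQuota
metadata:
  name: hugepages-quota
spec:
  hard:
    hugepages-2Mi: 4Gi        # assumed quota key
---
apiVersion: v1
kind: LimitRange
metadata:
  name: hugepages-limit-range
spec:
  limits:
  - type: Container
    min:
      hugepages-2Mi: 2Mi
    max:
      hugepages-2Mi: 1Gi
```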
|
||||
|
||||
## Scheduler changes
|
||||
|
||||
The scheduler will need to ensure any huge page request defined in the pod spec
|
||||
can be fulfilled by a candidate node.
|
||||
|
||||
## cAdvisor changes
|
||||
|
||||
cAdvisor will need to be modified to return the number of pre-allocated huge
|
||||
pages per page size on the node. This information will be used to determine capacity and
|
||||
calculate allocatable values on the node.
|
||||
|
||||
## Roadmap
|
||||
|
||||
### Version 1.8
|
||||
|
||||
Initial alpha support for huge pages usage by pods.
|
||||
|
||||
### Version 1.9
|
||||
|
||||
Resource Quota support. Limit Range support. Beta support for huge pages
|
||||
(pending community feedback).
|
||||
|
||||
## Known Issues
|
||||
|
||||
### Huge pages as shared memory
|
||||
|
||||
For the Java use case, the JVM maps the huge pages as a shared memory segment
|
||||
and memlocks them to prevent the system from moving or swapping them out.
|
||||
|
||||
There are several issues here:
|
||||
- The user running the Java app must be a member of the gid set in the
|
||||
`vm.hugetlb_shm_group` sysctl
|
||||
- sysctl `kernel.shmmax` must allow the size of the shared memory segment
|
||||
- The user's memlock ulimits must allow the size of the shared memory segment
|
||||
- `vm.hugetlb_shm_group` is not namespaced.
|
||||
|
||||
### NUMA
|
||||
|
||||
NUMA is complicated. To support NUMA, the node must support cpu pinning,
|
||||
devices, and memory locality. Extending that requirement to huge pages is not
|
||||
much different. It is anticipated that the `kubelet` will provide future NUMA
|
||||
locality guarantees as a feature of QoS. In particular, pods in the
|
||||
`Guaranteed` QoS class are expected to have NUMA locality preferences.
|
||||
|
||||
|