Proof of concept for import script; various FARs

johndmulhausen 2016-02-14 14:08:16 -08:00
parent 42c2226bef
commit e37603f8bd
68 changed files with 527 additions and 6976 deletions

@ -0,0 +1,17 @@
v1.1/examples
v1.1/docs/design
v1.1/docs/man
v1.1/docs/proposals
v1.1/docs/api-reference
v1.1/docs/user-guide/kubectl
v1.1/docs/admin/kube-apiserver.md
v1.1/docs/admin/kube-controller-manager.md
v1.1/docs/admin/kube-proxy.md
v1.1/docs/admin/kube-scheduler.md
v1.1/docs/admin/kubelet.md
v1.0/docs/user-guide/kubectl
v1.0/docs/admin/kube-apiserver.md
v1.0/docs/admin/kube-controller-manager.md
v1.0/docs/admin/kube-proxy.md
v1.0/docs/admin/kube-scheduler.md
v1.0/docs/admin/kubelet.md

@ -1,7 +1,14 @@
git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes
git checkout gh-pages
cd ..
rm -rf v1.1/examples
mv kubernetes/_v1.1/examples v1.1/
rm -rf kubernetes
#git clone https://github.com/kubernetes/kubernetes.git k8s
#cd k8s
#git checkout gh-pages
#cd ..
while read line || [[ -n ${line} ]]; do
CLEARPATH=${line}
K8SSOURCE='k8s/_'${line}
DESTINATION=${line%/*}
echo "rm -rf ${CLEARPATH}"
echo "mv ${K8SSOURCE} ${DESTINATION}"
done <source_files_from_main_k8s_repo.txt
#rm -rf k8s
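
The new loop only echoes the commands it would run. Below is a minimal sketch of the same path derivation, wrapped in a function so it can be checked on a single entry. The function name `plan_import` is hypothetical, and the `k8s/_` prefix assumes, as the script does, that the gh-pages checkout lives in a directory named `k8s`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): print the rm/mv commands
# for one entry of source_files_from_main_k8s_repo.txt without running them.
plan_import() {
  local line=$1
  local clearpath=$line            # path to clear in this repository
  local k8ssource="k8s/_${line}"   # same path inside the gh-pages checkout
  local destination=${line%/*}     # drop the last path component
  echo "rm -rf ${clearpath}"
  echo "mv ${k8ssource} ${destination}"
}

plan_import "v1.1/docs/admin/kubelet.md"
```

The `${line%/*}` expansion strips the shortest trailing `/...` match, so a file path resolves to its parent directory and `v1.1/examples` resolves to `v1.1`, which matches the destination used by the original hard-coded `mv`.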

@ -1,9 +1,6 @@
---
title: "Limit Range"
---
Limit Range
========================================
By default, pods run with unbounded CPU and memory limits. This means that any pod in the
system will be able to consume as much CPU and memory as is available on the node that executes the pod.
@ -31,39 +28,44 @@ apply default resource limits to pods in the absence of an end-user specified va
See [LimitRange design doc](../../design/admission_control_limit_range) for more information. For a detailed description of the Kubernetes resource model, see [Resources](/{{page.version}}/docs/user-guide/compute-resources)
Step 0: Prerequisites
-----------------------------------------
## Step 0: Prerequisites
This example requires a running Kubernetes cluster. See the [Getting Started guides](/{{page.version}}/docs/getting-started-guides/) for how to get started.
Change to the `<kubernetes>` directory if you're not already there.
Step 1: Create a namespace
-----------------------------------------
## Step 1: Create a namespace
This example will work in a custom namespace to demonstrate the concepts involved.
Let's create a new namespace called limit-example:
{% highlight console %}
$ kubectl create -f docs/admin/limitrange/namespace.yaml
namespace "limit-example" created
$ kubectl get namespaces
NAME LABELS STATUS AGE
default <none> Active 5m
limit-example <none> Active 53s
{% endhighlight %}
Step 2: Apply a limit to the namespace
-----------------------------------------
## Step 2: Apply a limit to the namespace
Let's create a simple limit in our namespace.
{% highlight console %}
$ kubectl create -f docs/admin/limitrange/limits.yaml --namespace=limit-example
limitrange "mylimits" created
{% endhighlight %}
Let's describe the limits that we have imposed in our namespace.
{% highlight console %}
$ kubectl describe limits mylimits --namespace=limit-example
Name: mylimits
Namespace: limit-example
@ -73,6 +75,7 @@ Pod cpu 200m 2 - - -
Pod memory 6Mi 1Gi - - -
Container cpu 100m 2 200m 300m -
Container memory 3Mi 1Gi 100Mi 200Mi -
{% endhighlight %}
In this scenario, we have said the following:
@ -89,8 +92,8 @@ set by *defaultRequest* in file `limits.yaml` (200m CPU and 100Mi memory).
memory limits must be <= 1Gi; the sum of all containers CPU requests must be >= 200m and the sum of all
containers CPU limits must be <= 2.
Step 3: Enforcing limits at point of creation
-----------------------------------------
## Step 3: Enforcing limits at point of creation
The limits enumerated in a namespace are only enforced when a pod is created or updated in
the cluster. If you change the limits to a different value range, it does not affect pods that
were previously created in a namespace.
@ -102,15 +105,18 @@ Let's first spin up a replication controller that creates a single container pod
how default values are applied to each pod.
{% highlight console %}
$ kubectl run nginx --image=nginx --replicas=1 --namespace=limit-example
replicationcontroller "nginx" created
$ kubectl get pods --namespace=limit-example
NAME READY STATUS RESTARTS AGE
nginx-aq0mf 1/1 Running 0 35s
$ kubectl get pods nginx-aq0mf --namespace=limit-example -o yaml | grep resources -C 8
{% endhighlight %}
{% highlight yaml %}
resourceVersion: "127"
selfLink: /api/v1/namespaces/limit-example/pods/nginx-aq0mf
uid: 51be42a7-7156-11e5-9921-286ed488f785
@ -128,6 +134,7 @@ spec:
memory: 100Mi
terminationMessagePath: /dev/termination-log
volumeMounts:
{% endhighlight %}
Note that our nginx container has picked up the namespace default cpu and memory resource *limits* and *requests*.
@ -135,19 +142,24 @@ Note that our nginx container has picked up the namespace default cpu and memory
Let's create a pod that exceeds our allowed limits by giving it a container that requests 3 CPU cores.
{% highlight console %}
$ kubectl create -f docs/admin/limitrange/invalid-pod.yaml --namespace=limit-example
Error from server: error when creating "docs/admin/limitrange/invalid-pod.yaml": Pod "invalid-pod" is forbidden: [Maximum cpu usage per Pod is 2, but limit is 3., Maximum cpu usage per Container is 2, but limit is 3.]
{% endhighlight %}
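
The rejection above comes down to a range comparison. Here is a rough sketch of that comparison in shell, assuming CPU quantities are either whole cores (`2`) or millicores (`200m`). The function names are hypothetical, fractional values such as `0.5` are not handled, and this is an illustration of the arithmetic, not the API server's admission code.

```shell
#!/usr/bin/env bash
# Convert a CPU quantity ("200m" or "2") to millicores.
to_millicores() {
  case $1 in
    *m) echo "${1%m}" ;;            # already in millicores
    *)  echo $(( $1 * 1000 )) ;;    # whole cores
  esac
}

# Succeed when the value lies within [min, max].
cpu_within_range() {
  local value min max
  value=$(to_millicores "$1")
  min=$(to_millicores "$2")
  max=$(to_millicores "$3")
  [ "$value" -ge "$min" ] && [ "$value" -le "$max" ]
}

# Container cpu: Min 100m, Max 2 (from the describe output earlier).
if cpu_within_range 3 100m 2; then
  echo "allowed"
else
  echo "forbidden: cpu limit 3 exceeds the per-container maximum of 2"
fi
```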
Let's create a pod that falls within the allowed limit boundaries.
{% highlight console %}
$ kubectl create -f docs/admin/limitrange/valid-pod.yaml --namespace=limit-example
pod "valid-pod" created
$ kubectl get pods valid-pod --namespace=limit-example -o yaml | grep -C 6 resources
{% endhighlight %}
{% highlight yaml %}
uid: 162a12aa-7157-11e5-9921-286ed488f785
spec:
containers:
@ -161,6 +173,7 @@ spec:
requests:
cpu: "1"
memory: 512Mi
{% endhighlight %}
Note that this pod specifies explicit resource *limits* and *requests* so it did not pick up the namespace
@ -170,27 +183,31 @@ Note: The *limits* for CPU resource are not enforced in the default Kubernetes s
that runs the container unless the administrator deploys the kubelet with the following flag:
```
$ kubelet --help
Usage of kubelet
....
--cpu-cfs-quota[=false]: Enable CPU CFS quota enforcement for containers that specify CPU limits
$ kubelet --cpu-cfs-quota=true ...
```
Step 4: Cleanup
----------------------------
## Step 4: Cleanup
To remove the resources used by this example, you can just delete the limit-example namespace.
{% highlight console %}
$ kubectl delete namespace limit-example
namespace "limit-example" deleted
$ kubectl get namespaces
NAME LABELS STATUS AGE
default <none> Active 20m
{% endhighlight %}
Summary
----------------------------
## Summary
Cluster operators that want to restrict the amount of resources a single container or pod may consume
are able to define allowable ranges per Kubernetes namespace. In the absence of any explicit assignments,
the Kubernetes system is able to apply default resource *limits* and *requests* if desired in order to

@ -1,9 +1,6 @@
---
title: "Limit Range"
---
Limit Range
========================================
By default, pods run with unbounded CPU and memory limits. This means that any pod in the
system will be able to consume as much CPU and memory as is available on the node that executes the pod.
@ -31,39 +28,44 @@ apply default resource limits to pods in the absence of an end-user specified va
See [LimitRange design doc](../../design/admission_control_limit_range) for more information. For a detailed description of the Kubernetes resource model, see [Resources](/{{page.version}}/docs/user-guide/compute-resources)
Step 0: Prerequisites
-----------------------------------------
## Step 0: Prerequisites
This example requires a running Kubernetes cluster. See the [Getting Started guides](/{{page.version}}/docs/getting-started-guides/) for how to get started.
Change to the `<kubernetes>` directory if you're not already there.
Step 1: Create a namespace
-----------------------------------------
## Step 1: Create a namespace
This example will work in a custom namespace to demonstrate the concepts involved.
Let's create a new namespace called limit-example:
{% highlight console %}
$ kubectl create -f docs/admin/limitrange/namespace.yaml
namespace "limit-example" created
$ kubectl get namespaces
NAME LABELS STATUS AGE
default <none> Active 5m
limit-example <none> Active 53s
{% endhighlight %}
Step 2: Apply a limit to the namespace
-----------------------------------------
## Step 2: Apply a limit to the namespace
Let's create a simple limit in our namespace.
{% highlight console %}
$ kubectl create -f docs/admin/limitrange/limits.yaml --namespace=limit-example
limitrange "mylimits" created
{% endhighlight %}
Let's describe the limits that we have imposed in our namespace.
{% highlight console %}
$ kubectl describe limits mylimits --namespace=limit-example
Name: mylimits
Namespace: limit-example
@ -73,6 +75,7 @@ Pod cpu 200m 2 - - -
Pod memory 6Mi 1Gi - - -
Container cpu 100m 2 200m 300m -
Container memory 3Mi 1Gi 100Mi 200Mi -
{% endhighlight %}
In this scenario, we have said the following:
@ -89,8 +92,8 @@ set by *defaultRequest* in file `limits.yaml` (200m CPU and 100Mi memory).
memory limits must be <= 1Gi; the sum of all containers CPU requests must be >= 200m and the sum of all
containers CPU limits must be <= 2.
Step 3: Enforcing limits at point of creation
-----------------------------------------
## Step 3: Enforcing limits at point of creation
The limits enumerated in a namespace are only enforced when a pod is created or updated in
the cluster. If you change the limits to a different value range, it does not affect pods that
were previously created in a namespace.
@ -102,15 +105,18 @@ Let's first spin up a replication controller that creates a single container pod
how default values are applied to each pod.
{% highlight console %}
$ kubectl run nginx --image=nginx --replicas=1 --namespace=limit-example
replicationcontroller "nginx" created
$ kubectl get pods --namespace=limit-example
NAME READY STATUS RESTARTS AGE
nginx-aq0mf 1/1 Running 0 35s
$ kubectl get pods nginx-aq0mf --namespace=limit-example -o yaml | grep resources -C 8
{% endhighlight %}
{% highlight yaml %}
resourceVersion: "127"
selfLink: /api/v1/namespaces/limit-example/pods/nginx-aq0mf
uid: 51be42a7-7156-11e5-9921-286ed488f785
@ -128,6 +134,7 @@ spec:
memory: 100Mi
terminationMessagePath: /dev/termination-log
volumeMounts:
{% endhighlight %}
Note that our nginx container has picked up the namespace default cpu and memory resource *limits* and *requests*.
@ -135,19 +142,24 @@ Note that our nginx container has picked up the namespace default cpu and memory
Let's create a pod that exceeds our allowed limits by giving it a container that requests 3 CPU cores.
{% highlight console %}
$ kubectl create -f docs/admin/limitrange/invalid-pod.yaml --namespace=limit-example
Error from server: error when creating "docs/admin/limitrange/invalid-pod.yaml": Pod "invalid-pod" is forbidden: [Maximum cpu usage per Pod is 2, but limit is 3., Maximum cpu usage per Container is 2, but limit is 3.]
{% endhighlight %}
Let's create a pod that falls within the allowed limit boundaries.
{% highlight console %}
$ kubectl create -f docs/admin/limitrange/valid-pod.yaml --namespace=limit-example
pod "valid-pod" created
$ kubectl get pods valid-pod --namespace=limit-example -o yaml | grep -C 6 resources
{% endhighlight %}
{% highlight yaml %}
uid: 162a12aa-7157-11e5-9921-286ed488f785
spec:
containers:
@ -161,6 +173,7 @@ spec:
requests:
cpu: "1"
memory: 512Mi
{% endhighlight %}
Note that this pod specifies explicit resource *limits* and *requests* so it did not pick up the namespace
@ -170,27 +183,31 @@ Note: The *limits* for CPU resource are not enforced in the default Kubernetes s
that runs the container unless the administrator deploys the kubelet with the following flag:
```
$ kubelet --help
Usage of kubelet
....
--cpu-cfs-quota[=false]: Enable CPU CFS quota enforcement for containers that specify CPU limits
$ kubelet --cpu-cfs-quota=true ...
```
Step 4: Cleanup
----------------------------
## Step 4: Cleanup
To remove the resources used by this example, you can just delete the limit-example namespace.
{% highlight console %}
$ kubectl delete namespace limit-example
namespace "limit-example" deleted
$ kubectl get namespaces
NAME LABELS STATUS AGE
default <none> Active 20m
{% endhighlight %}
Summary
----------------------------
## Summary
Cluster operators that want to restrict the amount of resources a single container or pod may consume
are able to define allowable ranges per Kubernetes namespace. In the absence of any explicit assignments,
the Kubernetes system is able to apply default resource *limits* and *requests* if desired in order to

@ -1,32 +1,31 @@
---
title: "Resource Quota"
---
Resource Quota
========================================
This example demonstrates how [resource quota](../../admin/admission-controllers.html#resourcequota) and
[LimitRanger](../../admin/admission-controllers.html#limitranger) can be applied to a Kubernetes namespace.
See [ResourceQuota design doc](../../design/admission_control_resource_quota) for more information.
This example assumes you have a functional Kubernetes setup.
Step 1: Create a namespace
-----------------------------------------
## Step 1: Create a namespace
This example will work in a custom namespace to demonstrate the concepts involved.
Let's create a new namespace called quota-example:
{% highlight console %}
$ kubectl create -f docs/admin/resourcequota/namespace.yaml
namespace "quota-example" created
$ kubectl get namespaces
NAME LABELS STATUS AGE
default <none> Active 2m
quota-example <none> Active 39s
{% endhighlight %}
Step 2: Apply a quota to the namespace
-----------------------------------------
## Step 2: Apply a quota to the namespace
By default, a pod will run with unbounded CPU and memory requests/limits. This means that any pod in the
system will be able to consume as much CPU and memory as is available on the node that executes the pod.
@ -39,8 +38,10 @@ checks the total resource *requests*, not resource *limits* of all containers/po
Let's create a simple quota in our namespace:
{% highlight console %}
$ kubectl create -f docs/admin/resourcequota/quota.yaml --namespace=quota-example
resourcequota "quota" created
{% endhighlight %}
Once your quota is applied to a namespace, the system will restrict any creation of content
@ -50,6 +51,7 @@ You can describe your current quota usage to see what resources are being consum
namespace.
{% highlight console %}
$ kubectl describe quota quota --namespace=quota-example
Name: quota
Namespace: quota-example
@ -63,10 +65,11 @@ replicationcontrollers 0 20
resourcequotas 1 1
secrets 1 10
services 0 5
{% endhighlight %}
Step 3: Applying default resource requests and limits
-----------------------------------------
## Step 3: Applying default resource requests and limits
Pod authors rarely specify resource requests and limits for their pods.
Since we applied a quota to our project, let's see what happens when an end-user creates a pod that has unbounded
@ -75,20 +78,25 @@ cpu and memory by creating an nginx container.
To demonstrate, let's create a replication controller that runs nginx:
{% highlight console %}
$ kubectl run nginx --image=nginx --replicas=1 --namespace=quota-example
replicationcontroller "nginx" created
{% endhighlight %}
Now let's look at the pods that were created.
{% highlight console %}
$ kubectl get pods --namespace=quota-example
NAME READY STATUS RESTARTS AGE
{% endhighlight %}
What happened? I have no pods! Let's describe the replication controller to get a view of what is happening.
{% highlight console %}
$ kubectl describe rc nginx --namespace=quota-example
Name: nginx
Namespace: quota-example
@ -101,6 +109,7 @@ No volumes.
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
42s 11s 3 {replication-controller } FailedCreate Error creating: Pod "nginx-" is forbidden: Must make a non-zero request for memory since it is tracked by quota.
{% endhighlight %}
The Kubernetes API server is rejecting the replication controller's requests to create a pod because our pods
@ -109,6 +118,7 @@ do not specify any memory usage *request*.
So let's set some default values for the amount of cpu and memory a pod can consume:
{% highlight console %}
$ kubectl create -f docs/admin/resourcequota/limits.yaml --namespace=quota-example
limitrange "limits" created
$ kubectl describe limits limits --namespace=quota-example
@ -118,6 +128,7 @@ Type Resource Min Max Request Limit Limit/Request
---- -------- --- --- ------- ----- -------------
Container memory - - 256Mi 512Mi -
Container cpu - - 100m 200m -
{% endhighlight %}
Now any time a pod is created in this namespace, if it has not specified any resource request/limit, the default
@ -127,14 +138,17 @@ Now that we have applied default resource *request* for our namespace, our repli
create its pods.
{% highlight console %}
$ kubectl get pods --namespace=quota-example
NAME READY STATUS RESTARTS AGE
nginx-fca65 1/1 Running 0 1m
{% endhighlight %}
And if we print out our quota usage in the namespace:
{% highlight console %}
$ kubectl describe quota quota --namespace=quota-example
Name: quota
Namespace: quota-example
@ -148,13 +162,14 @@ replicationcontrollers 1 20
resourcequotas 1 1
secrets 1 10
services 0 5
{% endhighlight %}
You can now see the pod that was created is consuming explicit amounts of resources (specified by resource *request*),
and the usage is being tracked by the Kubernetes system properly.
Summary
----------------------------
## Summary
Actions that consume node resources for cpu and memory can be subject to hard quota limits defined
by the namespace quota. The resource consumption is measured by resource *request* in pod specification.

@ -1,32 +1,31 @@
---
title: "Resource Quota"
---
Resource Quota
========================================
This example demonstrates how [resource quota](../../admin/admission-controllers.html#resourcequota) and
[LimitRanger](../../admin/admission-controllers.html#limitranger) can be applied to a Kubernetes namespace.
See [ResourceQuota design doc](../../design/admission_control_resource_quota) for more information.
This example assumes you have a functional Kubernetes setup.
Step 1: Create a namespace
-----------------------------------------
## Step 1: Create a namespace
This example will work in a custom namespace to demonstrate the concepts involved.
Let's create a new namespace called quota-example:
{% highlight console %}
$ kubectl create -f docs/admin/resourcequota/namespace.yaml
namespace "quota-example" created
$ kubectl get namespaces
NAME LABELS STATUS AGE
default <none> Active 2m
quota-example <none> Active 39s
{% endhighlight %}
Step 2: Apply a quota to the namespace
-----------------------------------------
## Step 2: Apply a quota to the namespace
By default, a pod will run with unbounded CPU and memory requests/limits. This means that any pod in the
system will be able to consume as much CPU and memory as is available on the node that executes the pod.
@ -39,8 +38,10 @@ checks the total resource *requests*, not resource *limits* of all containers/po
Let's create a simple quota in our namespace:
{% highlight console %}
$ kubectl create -f docs/admin/resourcequota/quota.yaml --namespace=quota-example
resourcequota "quota" created
{% endhighlight %}
Once your quota is applied to a namespace, the system will restrict any creation of content
@ -50,6 +51,7 @@ You can describe your current quota usage to see what resources are being consum
namespace.
{% highlight console %}
$ kubectl describe quota quota --namespace=quota-example
Name: quota
Namespace: quota-example
@ -63,10 +65,11 @@ replicationcontrollers 0 20
resourcequotas 1 1
secrets 1 10
services 0 5
{% endhighlight %}
Step 3: Applying default resource requests and limits
-----------------------------------------
## Step 3: Applying default resource requests and limits
Pod authors rarely specify resource requests and limits for their pods.
Since we applied a quota to our project, let's see what happens when an end-user creates a pod that has unbounded
@ -75,20 +78,25 @@ cpu and memory by creating an nginx container.
To demonstrate, let's create a replication controller that runs nginx:
{% highlight console %}
$ kubectl run nginx --image=nginx --replicas=1 --namespace=quota-example
replicationcontroller "nginx" created
{% endhighlight %}
Now let's look at the pods that were created.
{% highlight console %}
$ kubectl get pods --namespace=quota-example
NAME READY STATUS RESTARTS AGE
{% endhighlight %}
What happened? I have no pods! Let's describe the replication controller to get a view of what is happening.
{% highlight console %}
$ kubectl describe rc nginx --namespace=quota-example
Name: nginx
Namespace: quota-example
@ -101,6 +109,7 @@ No volumes.
Events:
FirstSeen LastSeen Count From SubobjectPath Reason Message
42s 11s 3 {replication-controller } FailedCreate Error creating: Pod "nginx-" is forbidden: Must make a non-zero request for memory since it is tracked by quota.
{% endhighlight %}
The Kubernetes API server is rejecting the replication controller's requests to create a pod because our pods
@ -109,6 +118,7 @@ do not specify any memory usage *request*.
So let's set some default values for the amount of cpu and memory a pod can consume:
{% highlight console %}
$ kubectl create -f docs/admin/resourcequota/limits.yaml --namespace=quota-example
limitrange "limits" created
$ kubectl describe limits limits --namespace=quota-example
@ -118,6 +128,7 @@ Type Resource Min Max Request Limit Limit/Request
---- -------- --- --- ------- ----- -------------
Container memory - - 256Mi 512Mi -
Container cpu - - 100m 200m -
{% endhighlight %}
Now any time a pod is created in this namespace, if it has not specified any resource request/limit, the default
@ -127,14 +138,17 @@ Now that we have applied default resource *request* for our namespace, our repli
create its pods.
{% highlight console %}
$ kubectl get pods --namespace=quota-example
NAME READY STATUS RESTARTS AGE
nginx-fca65 1/1 Running 0 1m
{% endhighlight %}
And if we print out our quota usage in the namespace:
{% highlight console %}
$ kubectl describe quota quota --namespace=quota-example
Name: quota
Namespace: quota-example
@ -148,13 +162,14 @@ replicationcontrollers 1 20
resourcequotas 1 1
secrets 1 10
services 0 5
{% endhighlight %}
You can now see the pod that was created is consuming explicit amounts of resources (specified by resource *request*),
and the usage is being tracked by the Kubernetes system properly.
Summary
----------------------------
## Summary
Actions that consume node resources for cpu and memory can be subject to hard quota limits defined
by the namespace quota. The resource consumption is measured by resource *request* in pod specification.

@ -1,21 +0,0 @@
---
title: "Kubernetes Design Overview"
---
Kubernetes is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications.
Kubernetes establishes robust declarative primitives for maintaining the desired state requested by the user. We see these primitives as the main value added by Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and replicating containers require active controllers, not just imperative orchestration.
Kubernetes is primarily targeted at applications composed of multiple containers, such as elastic, distributed micro-services. It is also designed to facilitate migration of non-containerized application stacks to Kubernetes. It therefore includes abstractions for grouping containers in both loosely coupled and tightly coupled formations, and provides ways for containers to find and communicate with each other in relatively familiar ways.
Kubernetes enables users to ask a cluster to run a set of containers. The system automatically chooses hosts to run those containers on. While Kubernetes's scheduler is currently very simple, we expect it to grow in sophistication over time. Scheduling is a policy-rich, topology-aware, workload-specific function that significantly impacts availability, performance, and capacity. The scheduler needs to take into account individual and collective resource requirements, quality of service requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, deadlines, and so on. Workload-specific requirements will be exposed through the API as necessary.
Kubernetes is intended to run on a number of cloud providers, as well as on physical hosts.
A single Kubernetes cluster is not intended to span multiple availability zones. Instead, we recommend building a higher-level layer to replicate complete deployments of highly available applications across multiple zones (see [the multi-cluster doc](../admin/multi-cluster) and [cluster federation proposal](../proposals/federation) for more details).
Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS platform and toolkit. Therefore, architecturally, we want Kubernetes to be built as a collection of pluggable components and layers, with the ability to use alternative schedulers, controllers, storage systems, and distribution mechanisms, and we're evolving its current code in that direction. Furthermore, we want others to be able to extend Kubernetes functionality, such as with higher-level PaaS functionality or multi-cluster layers, without modification of core Kubernetes source. Therefore, its API isn't just (or even necessarily mainly) targeted at end users, but at tool and extension developers. Its APIs are intended to serve as the foundation for an open ecosystem of tools, automation systems, and higher-level API layers. Consequently, there are no "internal" inter-component APIs. All APIs are visible and available, including the APIs used by the scheduler, the node controller, the replication-controller manager, Kubelet's API, etc. There's no glass to break -- in order to handle more complex use cases, one can just access the lower-level APIs in a fully transparent, composable manner.
For more about the Kubernetes architecture, see [architecture](architecture).

@ -1,259 +0,0 @@
---
title: "K8s Identity and Access Management Sketch"
---
This document suggests a direction for identity and access management in the Kubernetes system.
## Background
High level goals are:
- Have a plan for how identity, authentication, and authorization will fit in to the API.
- Have a plan for partitioning resources within a cluster between independent organizational units.
- Ease integration with existing enterprise and hosted scenarios.
### Actors
Each of these can act as normal users or attackers.
- External Users: People who are accessing applications running on K8s (e.g. a web site served by webserver running in a container on K8s), but who do not have K8s API access.
- K8s Users: People who access the K8s API (e.g. create K8s API objects like Pods).
- K8s Project Admins: People who manage access for some K8s Users
- K8s Cluster Admins: People who control the machines, networks, or binaries that make up a K8s cluster.
- K8s Admin means K8s Cluster Admins and K8s Project Admins taken together.
### Threats
Both intentional attacks and accidental use of privilege are concerns.
For both cases it may be useful to think about these categories differently:
- Application Path - attack by sending network messages from the internet to the IP/port of any application running on K8s. May exploit weakness in application or misconfiguration of K8s.
- K8s API Path - attack by sending network messages to any K8s API endpoint.
- Insider Path - attack on K8s system components. Attacker may have privileged access to networks, machines or K8s software and data. Software errors in K8s system components and administrator error are some types of threat in this category.
This document is primarily concerned with K8s API paths, and secondarily with Insider paths. The Application path also needs to be secure, but is not the focus of this document.
### Assets to protect
External User assets:
- Personal information like private messages, or images uploaded by External Users.
- web server logs.
K8s User assets:
- External User assets of each K8s User.
- things private to the K8s app, like:
- credentials for accessing other services (docker private repos, storage services, facebook, etc)
- SSL certificates for web servers
- proprietary data and code
K8s Cluster assets:
- Assets of each K8s User.
- Machine Certificates or secrets.
- The value of K8s cluster computing resources (cpu, memory, etc).
This document is primarily about protecting K8s User assets and K8s cluster assets from other K8s Users and K8s Project and Cluster Admins.
### Usage environments
Cluster in Small organization:
- K8s Admins may be the same people as K8s Users.
- few K8s Admins.
- prefer ease of use to fine-grained access control/precise accounting, etc.
- Product requirement that it be easy for potential K8s Cluster Admin to try out setting up a simple cluster.
Cluster in Large organization:
- K8s Admins typically distinct people from K8s Users. May need to divide K8s Cluster Admin access by roles.
- K8s Users need to be protected from each other.
- Auditing of K8s User and K8s Admin actions important.
- flexible accurate usage accounting and resource controls important.
- Lots of automated access to APIs.
- Need to integrate with existing enterprise directory, authentication, accounting, auditing, and security policy infrastructure.
Org-run cluster:
- organization that runs K8s master components is same as the org that runs apps on K8s.
- Nodes may be on-premises VMs or physical machines; Cloud VMs; or a mix.
Hosted cluster:
- Offering K8s API as a service, or offering a PaaS or SaaS built on K8s.
- May already offer web services, and need to integrate with existing customer account concept, and existing authentication, accounting, auditing, and security policy infrastructure.
- May want to leverage K8s User accounts and accounting to manage their User accounts (not a priority to support this use case.)
- Precise and accurate accounting of resources needed. Resource controls needed for hard limits (Users given limited slice of data) and soft limits (Users can grow up to some limit and then be expanded).
K8s ecosystem services:
- There may be companies that want to offer their existing services (Build, CI, A/B-test, release automation, etc) for use with K8s. There should be some story for this case.
Pod configs should be largely portable between Org-run and hosted configurations.
# Design
Related discussion:
- http://issue.k8s.io/442
- http://issue.k8s.io/443
This doc describes two security profiles:
- Simple profile: like single-user mode. Make it easy to evaluate K8s without lots of configuring accounts and policies. Protects from unauthorized users, but does not partition authorized users.
- Enterprise profile: Provide mechanisms needed for large numbers of users. Defense in depth. Should integrate with existing enterprise security infrastructure.
K8s distribution should include templates of config, and documentation, for simple and enterprise profiles. System should be flexible enough for knowledgeable users to create intermediate profiles, but K8s developers should only reason about those two Profiles, not a matrix.
Features in this doc are divided into "Initial Feature", and "Improvements". Initial features would be candidates for version 1.00.
## Identity
### userAccount
K8s will have a `userAccount` API object.
- `userAccount` has a UID which is immutable. This is used to associate users with objects and to record actions in audit logs.
- `userAccount` has a name which is a string and human readable and unique among userAccounts. It is used to refer to users in Policies, to ensure that the Policies are human readable. It can be changed only when there are no Policy objects or other objects which refer to that name. An email address is a suggested format for this field.
- `userAccount` is not related to the unix username of processes in Pods created by that userAccount.
- `userAccount` API objects can have labels.
The system may associate one or more Authentication Methods with a
`userAccount` (but they are not formally part of the userAccount object.)
In a simple deployment, the authentication method for a
user might be an authentication token which is verified by a K8s server. In a
more complex deployment, the authentication might be delegated to
another system which is trusted by the K8s API to authenticate users, but where
the authentication details are unknown to K8s.
Initial Features:
- there is no superuser `userAccount`
- `userAccount` objects are statically populated in the K8s API store by reading a config file. Only a K8s Cluster Admin can do this.
- `userAccount` can have a default `namespace`. If API call does not specify a `namespace`, the default `namespace` for that caller is assumed.
- `userAccount` is global. A single human with access to multiple namespaces is recommended to only have one userAccount.
Improvements:
- Make `userAccount` part of a separate API group from core K8s objects like `pod`. Facilitates plugging in alternate Access Management.
Simple Profile:
- single `userAccount`, used by all K8s Users and Project Admins. One access token shared by all.
Enterprise Profile:
- every human user has own `userAccount`.
- `userAccount`s have labels that indicate both membership in groups, and ability to act in certain roles.
- each service using the API has own `userAccount` too. (e.g. `scheduler`, `repcontroller`)
- automated jobs to denormalize the LDAP group info into the local system list of users in the K8s userAccount file.
### Unix accounts
A `userAccount` is not a Unix user account. The fact that a pod is started by a `userAccount` does not mean that the processes in that pod's containers run as a Unix user with a corresponding name or identity.
Initially:
- The unix accounts available in a container, and used by the processes running in a container are those that are provided by the combination of the base operating system and the Docker manifest.
- Kubernetes doesn't enforce any relation between `userAccount` and unix accounts.
Improvements:
- Kubelet allocates disjoint blocks of root-namespace uids for each container. This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572)
  - requires docker to integrate user namespace support, and deciding what getpwnam() does for these uids.
- any features that help users avoid use of privileged containers (http://issue.k8s.io/391)
### Namespaces
K8s will have a `namespace` API object. It is similar to a Google Compute Engine `project`. It provides a namespace for objects created by a group of people co-operating together, preventing name collisions with non-cooperating groups. It also serves as a reference point for authorization policies.
Namespaces are described in [namespaces.md](namespaces).
In the Enterprise Profile:
- a `userAccount` may have permission to access several `namespace`s.
In the Simple Profile:
- There is a single `namespace` used by the single user.
Namespaces versus userAccount vs Labels:
- `userAccount`s are intended for audit logging (both name and UID should be logged), and to define who has access to `namespace`s.
- `labels` (see [docs/user-guide/labels.md](/{{page.version}}/docs/user-guide/labels)) should be used to distinguish pods, users, and other objects that cooperate towards a common goal but are different in some way, such as version, or responsibilities.
- `namespace`s prevent name collisions between uncoordinated groups of people, and provide a place to attach common policies for co-operating groups of people.
## Authentication
Goals for K8s authentication:
- Include a built-in authentication system with no configuration required to use in single-user mode, and little configuration required to add several user accounts, and no https proxy required.
- Allow for authentication to be handled by a system external to Kubernetes, to allow integration with existing enterprise authorization systems. The Kubernetes namespace itself should avoid taking contributions of multiple authorization schemes. Instead, a trusted proxy in front of the apiserver can be used to authenticate users.
  - For organizations whose security requirements only allow FIPS compliant implementations (e.g. apache) for authentication.
  - So the proxy can terminate SSL, and isolate the CA-signed certificate from less trusted, higher-touch APIserver.
  - For organizations that already have existing SaaS web services (e.g. storage, VMs) and want a common authentication portal.
- Avoid mixing authentication and authorization, so that authorization policies can be centrally managed, and to allow changes in authentication methods without affecting authorization code.
Initially:
- Tokens used to authenticate a user.
- Long lived tokens identify a particular `userAccount`.
- Administrator utility generates tokens at cluster setup.
- OAuth2.0 Bearer tokens protocol, http://tools.ietf.org/html/rfc6750
- No scopes for tokens. Authorization happens in the API server
- Tokens dynamically generated by apiserver to identify pods which are making API calls.
- Tokens checked in a module of the APIserver.
- Authentication in apiserver can be disabled by flag, to allow testing without authorization enabled, and to allow use of an authenticating proxy. In this mode, a query parameter or header added by the proxy will identify the caller.
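
The token check described in this list could be sketched roughly as follows. This is only an illustration, not the apiserver's module: `tokenTable`, `authenticate`, and the sample token are hypothetical names, and the only detail taken from the referenced RFC 6750 is the `Bearer <token>` header format (matched here as an exact prefix for simplicity).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// tokenTable is a hypothetical static mapping from long-lived bearer tokens
// to userAccount names, such as an administrator utility might generate at
// cluster setup.
type tokenTable map[string]string

// authenticate extracts a bearer token from an Authorization header value
// and returns the userAccount name it identifies. It makes no authorization
// decision; tokens carry no scopes in this sketch.
func authenticate(tokens tokenTable, authHeader string) (string, error) {
	const prefix = "Bearer "
	if !strings.HasPrefix(authHeader, prefix) {
		return "", errors.New("missing bearer token")
	}
	token := strings.TrimSpace(strings.TrimPrefix(authHeader, prefix))
	user, ok := tokens[token]
	if !ok {
		return "", errors.New("unknown token")
	}
	return user, nil
}

func main() {
	tokens := tokenTable{"s3cr3t": "alice@example.com"}
	user, err := authenticate(tokens, "Bearer s3cr3t")
	fmt.Println(user, err)
}
```

Keeping the lookup separate from any policy check mirrors the goal above of not mixing authentication with authorization.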
Improvements:
- Refresh of tokens.
- SSH keys to access inside containers.
To be considered for subsequent versions:
- Fuller use of OAuth (http://tools.ietf.org/html/rfc6749)
- Scoped tokens.
- Tokens that are bound to the channel between the client and the api server
  - http://www.ietf.org/proceedings/90/slides/slides-90-uta-0.pdf
  - http://www.browserauth.net
## Authorization
K8s authorization should:
- Allow for a range of maturity levels, from single-user for those test driving the system, to integration with existing enterprise authorization systems.
- Allow for centralized management of users and policies. In some organizations, this will mean that the definition of users and access policies needs to reside on a system other than k8s and encompass other web services (such as a storage service).
- Allow processes running in K8s Pods to take on identity, and to allow narrow scoping of permissions for those identities in order to limit damage from software faults.
- Have Authorization Policies exposed as API objects so that a single config file can create or delete Pods, Replication Controllers, Services, and the identities and policies for those Pods and Replication Controllers.
- Be separate as much as practical from Authentication, to allow Authentication methods to change over time and space, without impacting Authorization policies.
K8s will implement a relatively simple
[Attribute-Based Access Control](http://en.wikipedia.org/wiki/Attribute_Based_Access_Control) model.
The model will be described in more detail in a forthcoming document. The model will
- Be less complex than XACML
- Be easily recognizable to those familiar with Amazon IAM Policies.
- Have a subset/aliases/defaults which allow it to be used in a way comfortable to those users more familiar with Role-Based Access Control.
Authorization policy is set by creating a set of Policy objects.
The API Server will be the Enforcement Point for Policy. For each API call that it receives, it will construct the Attributes needed to evaluate the policy (what user is making the call, what resource they are accessing, what they are trying to do to that resource, etc) and pass those attributes to a Decision Point. The Decision Point code evaluates the Attributes against all the Policies and allows or denies the API call. The system will be modular enough that the Decision Point code can either be linked into the APIserver binary, or be another service that the apiserver calls for each Decision (with appropriate time-limited caching as needed for performance).
Some Policy objects may be applicable only to a single namespace; K8s Project Admins would be able to create those as needed. Other Policy objects may be applicable to all namespaces; a K8s Cluster Admin might create those in order to authorize a new type of controller to be used by all namespaces, or to make a K8s User into a K8s Project Admin.
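
To make the Enforcement Point / Decision Point split concrete, here is a small sketch. The document defers the actual policy model to a forthcoming design, so everything here is an assumption: the `Attributes` and `Policy` field names, the empty-field-as-wildcard rule, and the "allow if any policy matches" semantics are all hypothetical.

```go
package main

import "fmt"

// Attributes describes one API call, as the enforcement point (the API
// server) would construct it before asking for a decision.
type Attributes struct {
	User      string
	Namespace string
	Resource  string
	Verb      string
}

// Policy is a hypothetical, simplified policy object. An empty field acts
// as a wildcard, so a policy with an empty Namespace applies to all
// namespaces.
type Policy struct {
	User      string
	Namespace string
	Resource  string
	Verb      string
}

// matches reports whether a policy field accepts the given value.
func matches(pattern, value string) bool {
	return pattern == "" || pattern == value
}

// decide plays the role of the decision point: it evaluates the attributes
// against all policies and allows the call if any policy matches.
func decide(policies []Policy, a Attributes) bool {
	for _, p := range policies {
		if matches(p.User, a.User) && matches(p.Namespace, a.Namespace) &&
			matches(p.Resource, a.Resource) && matches(p.Verb, a.Verb) {
			return true
		}
	}
	return false
}

func main() {
	policies := []Policy{
		{User: "alice", Namespace: "dev"}, // namespace-scoped policy
		{Resource: "pods", Verb: "get"},   // policy for all namespaces
	}
	fmt.Println(decide(policies, Attributes{"alice", "dev", "services", "create"}))
	fmt.Println(decide(policies, Attributes{"bob", "dev", "services", "create"}))
}
```

Because `decide` takes only plain data, the same function could be linked into the server binary or sit behind a remote call, which is the modularity the paragraph above asks for.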
## Accounting
The API should have a `quota` concept (see http://issue.k8s.io/442). A quota object relates a namespace (and optionally a label selector) to a maximum quantity of resources that may be used (see [resources design doc](resources)).
Initially:
- a `quota` object is immutable.
- for hosted K8s systems that do billing, Project is recommended level for billing accounts.
- Every object that consumes resources should have a `namespace` so that Resource usage stats are roll-up-able to `namespace`.
- K8s Cluster Admin sets quota objects by writing a config file.
Improvements:
- allow one namespace to charge the quota for one or more other namespaces. This would be controlled by a policy which allows changing a billing_namespace= label on an object.
- allow quota to be set by namespace owners for (namespace x label) combinations (e.g. let "webserver" namespace use 100 cores, but to prevent accidents, don't allow "webserver" namespace and "instance=test" use more than 10 cores).
- tools to help write consistent quota config files based on number of nodes, historical namespace usages, QoS needs, etc.
- way for K8s Cluster Admin to incrementally adjust Quota objects.
Simple profile:
- a single `namespace` with infinite resource limits.
Enterprise profile:
- multiple namespaces each with their own limits.
Issues:
- need for locking or "eventual consistency" when multiple apiserver goroutines are accessing the object store and handling pod creations.
## Audit Logging
API actions can be logged.
Initial implementation:
- All API calls logged to nginx logs.
Improvements:
- API server does logging instead.
- Policies to drop logging for high rate trusted API calls, or by users performing audit or other sensitive functions.

View File

@ -1,88 +0,0 @@
---
title: "Kubernetes Proposal - Admission Control"
---
**Related PR:**
| Topic | Link |
| ----- | ---- |
| Separate validation from RESTStorage | http://issue.k8s.io/2977 |
## Background
High level goals:
* Enable an easy-to-use mechanism to provide admission control to cluster
* Enable a provider to support multiple admission control strategies or author their own
* Ensure any rejected request can propagate errors back to the caller explaining why the request failed
Authorization via policy is focused on answering if a user is authorized to perform an action.
Admission Control is focused on if the system will accept an authorized action.
Kubernetes may choose to dismiss an authorized action based on any number of admission control strategies.
This proposal documents the basic design, and describes how any number of admission control plug-ins could be injected.
Implementation of specific admission control strategies is handled in separate documents.
## kube-apiserver
The kube-apiserver takes the following OPTIONAL arguments to enable admission control
| Option | Behavior |
| ------ | -------- |
| admission-control | Comma-delimited, ordered list of admission control choices to invoke prior to modifying or deleting an object. |
| admission-control-config-file | File with admission control configuration parameters to boot-strap plug-in. |
An **AdmissionControl** plug-in is an implementation of the following interface:
{% highlight go %}
package admission

// Attributes is an interface used by a plug-in to make an admission decision on an individual request.
type Attributes interface {
	GetNamespace() string
	GetKind() string
	GetOperation() string
	GetObject() runtime.Object
}

// Interface is an abstract, pluggable interface for Admission Control decisions.
type Interface interface {
	// Admit makes an admission decision based on the request attributes
	// An error is returned if it denies the request.
	Admit(a Attributes) (err error)
}
{% endhighlight %}
A **plug-in** must be compiled with the binary, and is registered as an available option by providing a name, and implementation
of admission.Interface.
{% highlight go %}
func init() {
	admission.RegisterPlugin("AlwaysDeny", func(client client.Interface, config io.Reader) (admission.Interface, error) {
		return NewAlwaysDeny(), nil
	})
}
{% endhighlight %}
Invocation of admission control is handled by the **APIServer** and not individual **RESTStorage** implementations.
This design assumes that **Issue 297** is adopted, and as a consequence, the general framework of the APIServer request/response flow will ensure the following:
1. Incoming request
2. Authenticate user
3. Authorize user
4. If operation=create|update|delete|connect, then admission.Admit(requestAttributes)
   - invoke each admission.Interface object in sequence
5. Case on the operation:
   - If operation=create|update, then validate(object) and persist
   - If operation=delete, delete the object
   - If operation=connect, exec
If at any step, there is an error, the request is canceled.
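
Step 4 of the flow above, invoking each plug-in in sequence and stopping at the first error, might look roughly like this. It is a self-contained sketch under stated assumptions: `Attributes` is trimmed to plain fields instead of the getter interface, and `chain`, `alwaysAdmit`, `alwaysDeny`, and `needsAdmission` are hypothetical names rather than the real `admission` package.

```go
package main

import (
	"errors"
	"fmt"
)

// Attributes is a trimmed-down stand-in for admission.Attributes.
type Attributes struct {
	Namespace string
	Kind      string
	Operation string
}

// Interface has the same shape as admission.Interface in the proposal.
type Interface interface {
	Admit(a Attributes) error
}

// alwaysAdmit accepts every request.
type alwaysAdmit struct{}

func (alwaysAdmit) Admit(Attributes) error { return nil }

// alwaysDeny rejects every request with an explanatory error.
type alwaysDeny struct{}

func (alwaysDeny) Admit(Attributes) error {
	return errors.New("admission control is denying all modifications")
}

// chain runs an ordered list of plug-ins; the first error cancels the request.
type chain []Interface

func (c chain) Admit(a Attributes) error {
	for _, plugin := range c {
		if err := plugin.Admit(a); err != nil {
			return err
		}
	}
	return nil
}

// needsAdmission reports whether step 4 applies to the given operation.
func needsAdmission(op string) bool {
	switch op {
	case "create", "update", "delete", "connect":
		return true
	}
	return false
}

func main() {
	a := Attributes{Namespace: "default", Kind: "Pod", Operation: "create"}
	fmt.Println(chain{alwaysAdmit{}}.Admit(a))
	fmt.Println(chain{alwaysAdmit{}, alwaysDeny{}}.Admit(a))
}
```

Since `chain` itself satisfies `Interface`, the ordered list from the `admission-control` option can be treated as a single plug-in by the caller.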

View File

@ -1,201 +0,0 @@
---
title: "Admission control plugin: LimitRanger"
---
## Background
This document proposes a system for enforcing resource requirements constraints as part of admission control.
## Use cases
1. Ability to enumerate resource requirement constraints per namespace
2. Ability to enumerate min/max resource constraints for a pod
3. Ability to enumerate min/max resource constraints for a container
4. Ability to specify default resource limits for a container
5. Ability to specify default resource requests for a container
6. Ability to enforce a ratio between request and limit for a resource.
## Data Model
The **LimitRange** resource is scoped to a **Namespace**.
### Type
{% highlight go %}
// LimitType is a type of object that is limited
type LimitType string

const (
	// Limit that applies to all pods in a namespace
	LimitTypePod LimitType = "Pod"
	// Limit that applies to all containers in a namespace
	LimitTypeContainer LimitType = "Container"
)

// LimitRangeItem defines a min/max usage limit for any resource that matches on kind.
type LimitRangeItem struct {
	// Type of resource that this limit applies to.
	Type LimitType `json:"type,omitempty"`
	// Max usage constraints on this kind by resource name.
	Max ResourceList `json:"max,omitempty"`
	// Min usage constraints on this kind by resource name.
	Min ResourceList `json:"min,omitempty"`
	// Default resource requirement limit value by resource name if resource limit is omitted.
	Default ResourceList `json:"default,omitempty"`
	// DefaultRequest is the default resource requirement request value by resource name if resource request is omitted.
	DefaultRequest ResourceList `json:"defaultRequest,omitempty"`
	// MaxLimitRequestRatio if specified, the named resource must have a request and limit that are both non-zero where limit divided by request is less than or equal to the enumerated value; this represents the max burst for the named resource.
	MaxLimitRequestRatio ResourceList `json:"maxLimitRequestRatio,omitempty"`
}

// LimitRangeSpec defines a min/max usage limit for resources that match on kind.
type LimitRangeSpec struct {
	// Limits is the list of LimitRangeItem objects that are enforced.
	Limits []LimitRangeItem `json:"limits"`
}

// LimitRange sets resource usage limits for each kind of resource in a Namespace.
type LimitRange struct {
	TypeMeta `json:",inline"`
	// Standard object's metadata.
	// More info: http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#metadata
	ObjectMeta `json:"metadata,omitempty"`

	// Spec defines the limits enforced.
	// More info: http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#spec-and-status
	Spec LimitRangeSpec `json:"spec,omitempty"`
}

// LimitRangeList is a list of LimitRange items.
type LimitRangeList struct {
	TypeMeta `json:",inline"`
	// Standard list metadata.
	// More info: http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#types-kinds
	ListMeta `json:"metadata,omitempty"`

	// Items is a list of LimitRange objects.
	// More info: http://releases.k8s.io/release-1.1/docs/design/admission_control_limit_range.md
	Items []LimitRange `json:"items"`
}
{% endhighlight %}
### Validation
Validation of a **LimitRange** enforces that for a given named resource the following rules apply:
Min (if specified) <= DefaultRequest (if specified) <= Default (if specified) <= Max (if specified)
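
The ordering rule above could be checked along these lines. This is a sketch only: it assumes each quantity has already been converted to one integer unit (for example millicores), uses `nil` to mean "not specified", and the names `item`, `validate`, and `ptr` are hypothetical.

```go
package main

import "fmt"

// item holds the optional values for one named resource, all in the same
// integer unit (for example millicores). A nil pointer means "not specified".
type item struct {
	Min, DefaultRequest, Default, Max *int64
}

// validate enforces Min <= DefaultRequest <= Default <= Max, skipping any
// value that is not specified. Comparing each specified value with the
// previous specified one is enough, because <= is transitive.
func validate(it item) error {
	names := []string{"Min", "DefaultRequest", "Default", "Max"}
	values := []*int64{it.Min, it.DefaultRequest, it.Default, it.Max}
	prev := -1 // index of the last specified value seen so far
	for i, v := range values {
		if v == nil {
			continue
		}
		if prev >= 0 && *values[prev] > *v {
			return fmt.Errorf("%s (%d) must be <= %s (%d)",
				names[prev], *values[prev], names[i], *v)
		}
		prev = i
	}
	return nil
}

// ptr is a small helper for building optional values.
func ptr(v int64) *int64 { return &v }

func main() {
	ok := item{Min: ptr(100), DefaultRequest: ptr(250), Default: ptr(500), Max: ptr(1000)}
	bad := item{Min: ptr(100), Default: ptr(2000), Max: ptr(1000)}
	fmt.Println(validate(ok))
	fmt.Println(validate(bad))
}
```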
### Default Value Behavior
The following default value behaviors are applied to a LimitRange for a given named resource.
```
if LimitRangeItem.Default[resourceName] is undefined
  if LimitRangeItem.Max[resourceName] is defined
    LimitRangeItem.Default[resourceName] = LimitRangeItem.Max[resourceName]
```
```
if LimitRangeItem.DefaultRequest[resourceName] is undefined
  if LimitRangeItem.Default[resourceName] is defined
    LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Default[resourceName]
  else if LimitRangeItem.Min[resourceName] is defined
    LimitRangeItem.DefaultRequest[resourceName] = LimitRangeItem.Min[resourceName]
```
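
The two defaulting rules can be transcribed into Go as a rough sketch. It assumes quantities are kept as plain strings in maps, that the `Default` rule runs before the `DefaultRequest` rule, and that `limitRangeItem` and `applyDefaults` are hypothetical stand-ins for the real types.

```go
package main

import "fmt"

// limitRangeItem is a simplified stand-in for LimitRangeItem, with
// quantities kept as strings keyed by resource name.
type limitRangeItem struct {
	Min, Max, Default, DefaultRequest map[string]string
}

// applyDefaults fills in Default and DefaultRequest for one resource:
// Default falls back to Max; DefaultRequest falls back to Default, then Min.
func applyDefaults(it *limitRangeItem, resourceName string) {
	if it.Default == nil {
		it.Default = map[string]string{}
	}
	if it.DefaultRequest == nil {
		it.DefaultRequest = map[string]string{}
	}
	if _, ok := it.Default[resourceName]; !ok {
		if maxVal, ok := it.Max[resourceName]; ok {
			it.Default[resourceName] = maxVal
		}
	}
	if _, ok := it.DefaultRequest[resourceName]; !ok {
		if def, ok := it.Default[resourceName]; ok {
			it.DefaultRequest[resourceName] = def
		} else if minVal, ok := it.Min[resourceName]; ok {
			it.DefaultRequest[resourceName] = minVal
		}
	}
}

func main() {
	it := &limitRangeItem{
		Min: map[string]string{"cpu": "100m", "memory": "250Mi"},
		Max: map[string]string{"cpu": "1"},
	}
	applyDefaults(it, "cpu")    // Default and DefaultRequest both come from Max
	applyDefaults(it, "memory") // no Max, so only DefaultRequest is set, from Min
	fmt.Println(it.Default, it.DefaultRequest)
}
```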
## AdmissionControl plugin: LimitRanger
The **LimitRanger** plug-in introspects all incoming pod requests and evaluates the constraints defined on a LimitRange.
If a constraint is not specified for an enumerated resource, it is not enforced or tracked.
To enable the plug-in and support for LimitRange, the kube-apiserver must be configured as follows:
{% highlight console %}
$ kube-apiserver --admission-control=LimitRanger
{% endhighlight %}
### Enforcement of constraints
**Type: Container**
Supported Resources:
1. memory
2. cpu
Supported Constraints:
Per container, the following must hold true
| Constraint | Behavior |
| ---------- | -------- |
| Min | Min <= Request (required) <= Limit (optional) |
| Max | Limit (required) <= Max |
| LimitRequestRatio | ( Limit (required, non-zero) / Request (required, non-zero) ) <= LimitRequestRatio |
Supported Defaults:
1. Default - if the named resource has no enumerated value, the Limit is equal to the Default
2. DefaultRequest - if the named resource has no enumerated value, the Request is equal to the DefaultRequest
**Type: Pod**
Supported Resources:
1. memory
2. cpu
Supported Constraints:
Across all containers in pod, the following must hold true
| Constraint | Behavior |
| ---------- | -------- |
| Min | Min <= Request (required) <= Limit (optional) |
| Max | Limit (required) <= Max |
| LimitRequestRatio | ( Limit (required, non-zero) / Request (non-zero) ) <= LimitRequestRatio |
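
A minimal sketch of these checks for a single resource follows. It assumes values in one integer unit (for example millicores) with zero meaning "not set", and it reads the ratio constraint the way the `MaxLimitRequestRatio` field comment states it: limit divided by request must not exceed the configured value. The names `constraints` and `check` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// constraints holds the per-resource settings in one integer unit (for
// example millicores). A zero value means the constraint is not specified.
type constraints struct {
	Min                  int64
	Max                  int64
	MaxLimitRequestRatio int64
}

// check validates one request/limit pair; zero means "not set".
func check(c constraints, request, limit int64) error {
	// Min: Min <= Request (required) <= Limit (optional)
	if c.Min > 0 && request < c.Min {
		return errors.New("request is missing or below Min")
	}
	if limit > 0 && request > limit {
		return errors.New("request exceeds limit")
	}
	// Max: Limit (required) <= Max
	if c.Max > 0 && (limit == 0 || limit > c.Max) {
		return errors.New("limit is missing or above Max")
	}
	// Ratio: Limit / Request <= MaxLimitRequestRatio, both non-zero.
	// Multiplying instead of dividing avoids integer truncation.
	if c.MaxLimitRequestRatio > 0 {
		if request == 0 || limit == 0 {
			return errors.New("ratio check needs a non-zero request and limit")
		}
		if limit > c.MaxLimitRequestRatio*request {
			return errors.New("limit/request exceeds MaxLimitRequestRatio")
		}
	}
	return nil
}

func main() {
	c := constraints{Min: 100, Max: 1000, MaxLimitRequestRatio: 4}
	fmt.Println(check(c, 250, 500)) // within all constraints
	fmt.Println(check(c, 100, 500)) // ratio 5 is above the allowed 4
}
```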
## Run-time configuration
The default ```LimitRange``` that is applied via Salt configuration will be updated as follows:
```
apiVersion: "v1"
kind: "LimitRange"
metadata:
  name: "limits"
  namespace: default
spec:
  limits:
    - type: "Container"
      defaultRequest:
        cpu: "100m"
```
## Example
An example LimitRange configuration:
| Type | Resource | Min | Max | Default | DefaultRequest | LimitRequestRatio |
| ---- | -------- | --- | --- | ------- | -------------- | ----------------- |
| Container | cpu | .1 | 1 | 500m | 250m | 4 |
| Container | memory | 250Mi | 1Gi | 500Mi | 250Mi | |
Assuming an incoming container that specified no incoming resource requirements,
the following would happen.
1. The incoming container cpu would request 250m with a limit of 500m.
2. The incoming container memory would request 250Mi with a limit of 500Mi
3. If the container is later resized, its cpu would be constrained to between .1 and 1 and the ratio of limit to request could not exceed 4.

View File

@ -1,201 +0,0 @@
---
title: "Admission control plugin: ResourceQuota"
---
## Background
This document describes a system for enforcing hard resource usage limits per namespace as part of admission control.
## Use cases
1. Ability to enumerate resource usage limits per namespace.
2. Ability to monitor resource usage for tracked resources.
3. Ability to reject resource usage exceeding hard quotas.
## Data Model
The **ResourceQuota** object is scoped to a **Namespace**.
{% highlight go %}
// The following identify resource constants for Kubernetes object types
const (
	// Pods, number
	ResourcePods ResourceName = "pods"
	// Services, number
	ResourceServices ResourceName = "services"
	// ReplicationControllers, number
	ResourceReplicationControllers ResourceName = "replicationcontrollers"
	// ResourceQuotas, number
	ResourceQuotas ResourceName = "resourcequotas"
	// ResourceSecrets, number
	ResourceSecrets ResourceName = "secrets"
	// ResourcePersistentVolumeClaims, number
	ResourcePersistentVolumeClaims ResourceName = "persistentvolumeclaims"
)

// ResourceQuotaSpec defines the desired hard limits to enforce for Quota
type ResourceQuotaSpec struct {
	// Hard is the set of desired hard limits for each named resource
	Hard ResourceList `json:"hard,omitempty" description:"hard is the set of desired hard limits for each named resource; see http://releases.k8s.io/release-1.1/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
}

// ResourceQuotaStatus defines the enforced hard limits and observed use
type ResourceQuotaStatus struct {
	// Hard is the set of enforced hard limits for each named resource
	Hard ResourceList `json:"hard,omitempty" description:"hard is the set of enforced hard limits for each named resource; see http://releases.k8s.io/release-1.1/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
	// Used is the current observed total usage of the resource in the namespace
	Used ResourceList `json:"used,omitempty" description:"used is the current observed total usage of the resource in the namespace"`
}

// ResourceQuota sets aggregate quota restrictions enforced per namespace
type ResourceQuota struct {
	TypeMeta   `json:",inline"`
	ObjectMeta `json:"metadata,omitempty" description:"standard object metadata; see http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#metadata"`

	// Spec defines the desired quota
	Spec ResourceQuotaSpec `json:"spec,omitempty" description:"spec defines the desired quota; http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#spec-and-status"`

	// Status defines the actual enforced quota and its current usage
	Status ResourceQuotaStatus `json:"status,omitempty" description:"status defines the actual enforced quota and current usage; http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#spec-and-status"`
}

// ResourceQuotaList is a list of ResourceQuota items
type ResourceQuotaList struct {
	TypeMeta `json:",inline"`
	ListMeta `json:"metadata,omitempty" description:"standard list metadata; see http://releases.k8s.io/release-1.1/docs/devel/api-conventions.md#metadata"`

	// Items is a list of ResourceQuota objects
	Items []ResourceQuota `json:"items" description:"items is a list of ResourceQuota objects; see http://releases.k8s.io/release-1.1/docs/design/admission_control_resource_quota.md#admissioncontrol-plugin-resourcequota"`
}
{% endhighlight %}
## Quota Tracked Resources
The following resources are supported by the quota system.
| Resource | Description |
| ------------ | ----------- |
| cpu | Total requested cpu usage |
| memory | Total requested memory usage |
| pods | Total number of active pods where phase is pending or active. |
| services | Total number of services |
| replicationcontrollers | Total number of replication controllers |
| resourcequotas | Total number of resource quotas |
| secrets | Total number of secrets |
| persistentvolumeclaims | Total number of persistent volume claims |
If a third-party wants to track additional resources, it must follow the resource naming conventions prescribed
by Kubernetes. This means the resource must have a fully-qualified name (e.g. mycompany.org/shinynewresource).
## Resource Requirements: Requests vs Limits
If a resource supports the ability to distinguish between a request and a limit for a resource,
the quota tracking system will only cost the request value against the quota usage. If a resource
is tracked by quota, and no request value is provided, the associated entity is rejected as part of admission.
For an example, consider the following scenarios relative to tracking quota on CPU:
| Pod | Container | Request CPU | Limit CPU | Result |
| --- | --------- | ----------- | --------- | ------ |
| X | C1 | 100m | 500m | The quota usage is incremented 100m |
| Y | C2 | 100m | none | The quota usage is incremented 100m |
| Y | C2 | none | 500m | The quota usage is incremented 500m since request will default to limit |
| Z | C3 | none | none | The pod is rejected since it does not enumerate a request. |
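
The four scenarios in the table reduce to a small rule, sketched here with CPU in millicores and zero meaning "none". The function name `charge` is hypothetical; the behavior follows the table rows.

```go
package main

import (
	"errors"
	"fmt"
)

// charge returns the amount of CPU (in millicores) counted against quota
// for one container. A value of 0 stands for "none" in the table.
func charge(requestMilli, limitMilli int64) (int64, error) {
	switch {
	case requestMilli > 0:
		// A request is given: only the request is charged.
		return requestMilli, nil
	case limitMilli > 0:
		// No request, but a limit: the request defaults to the limit.
		return limitMilli, nil
	default:
		// Neither is given: the pod is rejected at admission.
		return 0, errors.New("no request enumerated for a tracked resource")
	}
}

func main() {
	fmt.Println(charge(100, 500))
	fmt.Println(charge(100, 0))
	fmt.Println(charge(0, 500))
	fmt.Println(charge(0, 0))
}
```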
The rationale for accounting for the requested amount of a resource versus the limit is the belief
that a user should only be charged for what they are scheduled against in the cluster. In addition,
attempting to track usage against actual usage, where request < actual < limit, is considered highly
volatile.
As a consequence of this decision, the user is able to spread its usage of a resource across multiple tiers
of service. Let's demonstrate this via an example with a 4 cpu quota.
The quota may be allocated as follows:
| Pod | Container | Request CPU | Limit CPU | Tier | Quota Usage |
| --- | --------- | ----------- | --------- | ---- | ----------- |
| X | C1 | 1 | 4 | Burstable | 1 |
| Y | C2 | 2 | 2 | Guaranteed | 2 |
| Z | C3 | 1 | 3 | Burstable | 1 |
It is possible that the pods may consume 9 cpu over a given time period, depending on the available cpu of the nodes
that held pods X and Z, but since we scheduled X and Z relative to the request, we only track the requested
value against their allocated quota. If one wants to restrict the ratio between the request and limit,
it is encouraged that the user define a **LimitRange** with **LimitRequestRatio** to control burst out behavior.
This would in effect, let an administrator keep the difference between request and limit more in line with
tracked usage if desired.
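
The arithmetic behind the 4 cpu quota example can be spelled out in a few lines. The `container` type and `totals` helper are hypothetical; the numbers are the ones from the allocation table.

```go
package main

import "fmt"

// container records the CPU request and limit (in whole cpus) of the single
// container in each pod from the allocation table.
type container struct {
	pod     string
	request int64
	limit   int64
}

// totals sums requests (what quota is charged) and limits (the most the
// pods could consume if their nodes have spare cpu).
func totals(cs []container) (requests, limits int64) {
	for _, c := range cs {
		requests += c.request
		limits += c.limit
	}
	return requests, limits
}

func main() {
	cs := []container{{"X", 1, 4}, {"Y", 2, 2}, {"Z", 1, 3}}
	requests, limits := totals(cs)
	fmt.Printf("charged against quota: %d cpu, possible burst: %d cpu\n", requests, limits)
}
```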
## Status API
A REST API endpoint to update the status section of the **ResourceQuota** is exposed. It requires an atomic compare-and-swap
in order to keep resource usage tracking consistent.
## Resource Quota Controller
A resource quota controller monitors observed usage for tracked resources in the **Namespace**.
If there is observed difference between the current usage stats versus the current **ResourceQuota.Status**, the controller
posts an update of the currently observed usage metrics to the **ResourceQuota** via the /status endpoint.
The resource quota controller is the only component capable of monitoring and recording usage updates after a DELETE operation
since admission control is incapable of guaranteeing a DELETE request actually succeeded.
## AdmissionControl plugin: ResourceQuota
The **ResourceQuota** plug-in introspects all incoming admission requests.
To enable the plug-in and support for ResourceQuota, the kube-apiserver must be configured as follows:
```
$ kube-apiserver --admission-control=ResourceQuota
```
It makes decisions by evaluating the incoming object against all defined **ResourceQuota.Status.Hard** resource limits in the request
namespace. If acceptance of the resource would cause the total usage of a named resource to exceed its hard limit, the request is denied.
If the incoming request does not cause the total usage to exceed any of the enumerated hard resource limits, the plug-in will post a
**ResourceQuota.Status** document to the server to atomically update the observed usage based on the previously read
**ResourceQuota.ResourceVersion**. This keeps incremental usage atomically consistent, but does introduce a bottleneck (intentionally)
into the system.
To optimize system performance, it is encouraged that all resource quotas are tracked on the same **ResourceQuota** document in a **Namespace**. As a result, it is also encouraged to cap the total number of individual quotas that are tracked in the **Namespace**
at 1 in the **ResourceQuota** document.
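
The read, check, and compare-and-swap sequence the plug-in performs can be illustrated with an in-memory stand-in. This is a sketch under assumptions: `store`, `quota`, `admit`, and the integer `ResourceVersion` are hypothetical, and a real client would talk to the /status endpoint and decide for itself how to handle a conflict.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errConflict = errors.New("resourceVersion conflict")
	errExceeded = errors.New("quota exceeded")
)

// quota is an in-memory stand-in for a ResourceQuota document.
type quota struct {
	ResourceVersion int64
	Hard            map[string]int64
	Used            map[string]int64
}

// store is a hypothetical object store offering compare-and-swap updates.
type store struct {
	mu sync.Mutex
	q  quota
}

// read returns a snapshot of the quota with a private copy of Used.
func (s *store) read() quota {
	s.mu.Lock()
	defer s.mu.Unlock()
	used := make(map[string]int64, len(s.q.Used))
	for k, v := range s.q.Used {
		used[k] = v
	}
	return quota{ResourceVersion: s.q.ResourceVersion, Hard: s.q.Hard, Used: used}
}

// updateStatus writes the new usage only if nobody else has updated the
// document since readVersion was read.
func (s *store) updateStatus(readVersion int64, used map[string]int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.q.ResourceVersion != readVersion {
		return errConflict
	}
	s.q.Used = used
	s.q.ResourceVersion++
	return nil
}

// admit charges delta units of a resource, denying the request if the hard
// limit would be exceeded. A conflict is returned to the caller unchanged.
func admit(s *store, resource string, delta int64) error {
	q := s.read()
	q.Used[resource] += delta
	if hard, ok := q.Hard[resource]; ok && q.Used[resource] > hard {
		return errExceeded
	}
	return s.updateStatus(q.ResourceVersion, q.Used)
}

func main() {
	s := &store{q: quota{Hard: map[string]int64{"pods": 2}, Used: map[string]int64{}}}
	fmt.Println(admit(s, "pods", 1), admit(s, "pods", 1), admit(s, "pods", 1))
}
```

Every admitted request serializes on the one document's version, which is the intentional bottleneck mentioned above.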
## kubectl
kubectl is modified to support the **ResourceQuota** resource.
`kubectl describe` provides a human-readable output of quota.
For example,
{% highlight console %}
$ kubectl create -f docs/admin/resourcequota/namespace.yaml
namespace "quota-example" created
$ kubectl create -f docs/admin/resourcequota/quota.yaml --namespace=quota-example
resourcequota "quota" created
$ kubectl describe quota quota --namespace=quota-example
Name: quota
Namespace: quota-example
Resource Used Hard
-------- ---- ----
cpu 0 20
memory 0 1Gi
persistentvolumeclaims 0 10
pods 0 10
replicationcontrollers 0 20
resourcequotas 1 1
secrets 1 10
services 0 5
{% endhighlight %}
## More information
See [resource quota document](../admin/resource-quota) and the [example of Resource Quota](../admin/resourcequota/) for more information.

Binary file not shown.

View File

@ -1,49 +0,0 @@
---
title: "Kubernetes architecture"
---
A running Kubernetes cluster contains node agents (`kubelet`) and master components (APIs, scheduler, etc), on top of a distributed storage solution. This diagram shows our desired eventual state, though we're still working on a few things, like making `kubelet` itself (all our components, really) run within containers, and making the scheduler 100% pluggable.
![Architecture Diagram](architecture.png?raw=true "Architecture overview")
## The Kubernetes Node
When looking at the architecture of the system, we'll break it down to services that run on the worker node and services that compose the cluster-level control plane.
The Kubernetes node has the services necessary to run application containers and be managed from the master systems.
Each node runs Docker, of course. Docker takes care of the details of downloading images and running containers.
### `kubelet`
The `kubelet` manages [pods](../user-guide/pods) and their containers, their images, their volumes, etc.
### `kube-proxy`
Each node also runs a simple network proxy and load balancer (see the [services FAQ](https://github.com/kubernetes/kubernetes/wiki/Services-FAQ) for more details). This reflects `services` (see [the services doc](../user-guide/services) for more details) as defined in the Kubernetes API on each node and can do simple TCP and UDP stream forwarding (round robin) across a set of backends.
Service endpoints are currently found via [DNS](../admin/dns) or through environment variables (both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) and Kubernetes `{FOO}_SERVICE_HOST` and `{FOO}_SERVICE_PORT` variables are supported). These variables resolve to ports managed by the service proxy.
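
A client inside a pod could resolve a service from those environment variables roughly as follows. This is a sketch: `serviceAddr` is a hypothetical helper, `redis-master` is just a sample service name, and the upper-casing with dashes turned into underscores is assumed to be how `{FOO}` is derived from the service name.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

// serviceAddr builds a host:port address for a service from the
// {FOO}_SERVICE_HOST and {FOO}_SERVICE_PORT environment variables.
// The getenv parameter makes the lookup easy to test; pass os.Getenv
// in normal use.
func serviceAddr(getenv func(string) string, service string) (string, error) {
	// Assumed naming rule: upper-case the name, replace dashes with underscores.
	prefix := strings.ToUpper(strings.ReplaceAll(service, "-", "_"))
	host := getenv(prefix + "_SERVICE_HOST")
	port := getenv(prefix + "_SERVICE_PORT")
	if host == "" || port == "" {
		return "", fmt.Errorf("no environment variables found for service %q", service)
	}
	return net.JoinHostPort(host, port), nil
}

func main() {
	addr, err := serviceAddr(os.Getenv, "redis-master")
	fmt.Println(addr, err)
}
```

The environment variables are fixed when the container starts, so a service created afterwards would only be discoverable through DNS.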
## The Kubernetes Control Plane
The Kubernetes control plane is split into a set of components. Currently they all run on a single _master_ node, but that is expected to change soon in order to support high-availability clusters. These components work together to provide a unified view of the cluster.
### `etcd`
All persistent master state is stored in an instance of `etcd`. This provides a great way to store configuration data reliably. With `watch` support, coordinating components can be notified very quickly of changes.
### Kubernetes API Server
The apiserver serves up the [Kubernetes API](../api). It is intended to be a CRUD-y server, with most/all business logic implemented in separate components or in plug-ins. It mainly processes REST operations, validates them, and updates the corresponding objects in `etcd` (and eventually other stores).
### Scheduler
The scheduler binds unscheduled pods to nodes via the `/binding` API. The scheduler is pluggable, and we expect to support multiple cluster schedulers and even user-provided schedulers in the future.
### Kubernetes Controller Manager Server
All other cluster-level functions are currently performed by the Controller Manager. For instance, `Endpoints` objects are created and updated by the endpoints controller, and nodes are discovered, managed, and monitored by the node controller. These could eventually be split into separate components to make them independently pluggable.
The [`replicationcontroller`](../user-guide/replication-controller) is a mechanism that is layered on top of the simple [`pod`](../user-guide/pods) API. We eventually plan to port it to a generic plug-in mechanism, once one is implemented.

Binary file not shown.


File diff suppressed because it is too large Load Diff


View File

@ -1,64 +0,0 @@
---
title: "Clustering in Kubernetes"
---
## Overview
The term "clustering" refers to the process of having all members of the Kubernetes cluster find and trust each other. There are multiple different ways to achieve clustering with different security and usability profiles. This document attempts to lay out the user experiences for clustering that Kubernetes aims to address.
Once a cluster is established, the following is true:
1. **Master -> Node** The master needs to know which nodes can take work and what their current status is with respect to capacity.
    1. **Location** The master knows the name and location of all of the nodes in the cluster.
        * For the purposes of this doc, location and name should be enough information so that the master can open a TCP connection to the Node. Most probably we will make this either an IP address or a DNS name. It is going to be important to be consistent here (master must be able to reach kubelet on that DNS name) so that we can verify certificates appropriately.
    2. **Target AuthN** A way to securely talk to the kubelet on that node. Currently we call out to the kubelet over HTTP. This should be over HTTPS and the master should know what CA to trust for that node.
    3. **Caller AuthN/Z** This would be the master verifying itself (and permissions) when calling the node. Currently, this is only used to collect statistics as authorization isn't critical. This may change in the future though.
2. **Node -> Master** The nodes currently talk to the master to know which pods have been assigned to them and to publish events.
    1. **Location** The nodes must know where the master is.
    2. **Target AuthN** Since the master is assigning work to the nodes, it is critical that they verify whom they are talking to.
    3. **Caller AuthN/Z** The nodes publish events and so must be authenticated to the master. Ideally this authentication is specific to each node so that authorization can be narrowly scoped. The details of the work to run (including things like environment variables) might be considered sensitive and should be locked down also.
**Note:** While the description here refers to a singular Master, in the future we should enable multiple Masters operating in an HA mode. While the "Master" is currently the combination of the API Server, Scheduler and Controller Manager, we will restrict ourselves to thinking about the main API and policy engine -- the API Server.
## Current Implementation
A central authority (generally the master) is responsible for determining the set of machines which are members of the cluster. Calls to create and remove worker nodes in the cluster are restricted to this single authority, and any other requests to add or remove worker nodes are rejected (1.i).
Communication from the master to nodes is currently over HTTP and is not secured or authenticated in any way (1.ii, 1.iii).
The location of the master is communicated out of band to the nodes. For GCE, this is done via Salt. Other cluster instructions/scripts use other methods. (2.i)
Currently most communication from the node to the master is over HTTP. When it is done over HTTPS there is currently no verification of the cert of the master (2.ii).
Currently, the node/kubelet is authenticated to the master via a token shared across all nodes. This token is distributed out of band (using Salt for GCE) and is optional. If it is not present then the kubelet is unable to publish events to the master. (2.iii)
Our current mix of out of band communication doesn't meet all of our needs from a security point of view and is difficult to set up and configure.
## Proposed Solution
The proposed solution will provide a range of options for setting up and maintaining a secure Kubernetes cluster. We want to allow for both centrally controlled systems (leveraging pre-existing trust and configuration systems) and more ad-hoc automagic systems that are incredibly easy to set up.
The building blocks of an easier solution:
* **Move to TLS** We will move to using TLS for all intra-cluster communication. We will explicitly identify the trust chain (the set of trusted CAs) as opposed to trusting the system CAs. We will also use client certificates for all AuthN.
* [optional] **API driven CA** Optionally, we will run a CA in the master that will mint certificates for the nodes/kubelets. There will be pluggable policies that will automatically approve certificate requests here as appropriate.
    * **CA approval policy** This is a pluggable policy object that can automatically approve CA signing requests. Stock policies will include `always-reject`, `queue` and `insecure-always-approve`. With `queue` there would be an API for evaluating and accepting/rejecting requests. Cloud providers could implement a policy here that verifies other out of band information and automatically approves/rejects based on other external factors.
* **Scoped Kubelet Accounts** These accounts are per-node and (optionally) give a node permission to register itself.
    * To start with, we'd have the kubelets generate a cert/account in the form of `kubelet:<host>`. Initially, we would hard code policy such that we give that particular account appropriate permissions. Over time, we can make the policy engine more generic.
* [optional] **Bootstrap API endpoint** This is a helper service hosted outside of the Kubernetes cluster that helps with initial discovery of the master.
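The pluggable CA approval policy could be sketched roughly as follows. This is only an illustration of the three stock policy names listed above; the class names, the `evaluate`/`approve`/`reject` methods, and the `SigningRequest` type are assumptions invented for this example and do not correspond to an existing Kubernetes API.

```python
from dataclasses import dataclass

APPROVED, REJECTED, PENDING = "approved", "rejected", "pending"


@dataclass
class SigningRequest:
    """Hypothetical stand-in for a kubelet's certificate signing request."""
    node: str
    unsigned_cert: str


class AlwaysReject:
    def evaluate(self, request):
        return REJECTED


class InsecureAlwaysApprove:
    def evaluate(self, request):
        return APPROVED


class Queue:
    """Holds requests until an admin accepts or rejects them."""

    def __init__(self):
        self.pending = {}

    def evaluate(self, request):
        self.pending[request.node] = request
        return PENDING

    def approve(self, node):
        self.pending.pop(node)  # raises KeyError if nothing is queued
        return APPROVED

    def reject(self, node):
        self.pending.pop(node)
        return REJECTED


# Policy names as given in the design; the mapping itself is illustrative.
POLICIES = {
    "always-reject": AlwaysReject,
    "queue": Queue,
    "insecure-always-approve": InsecureAlwaysApprove,
}

policy = POLICIES["queue"]()
request = SigningRequest(node="kubelet:node-1", unsigned_cert="<csr bytes>")
print(policy.evaluate(request))       # request is parked for review
print(policy.approve("kubelet:node-1"))
```

A cloud-provider policy would be another class with the same `evaluate` method that consults out-of-band information before deciding.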
### Static Clustering
In this sequence diagram there is an out-of-band admin entity that creates all certificates and distributes them. It also makes sure that the kubelets know where to find the master. This provides a lot of control but is more difficult to set up, as a lot of information must be communicated outside of Kubernetes.
![Static Sequence Diagram](clustering/static.png)
### Dynamic Clustering
This diagram shows dynamic clustering using the bootstrap API endpoint. That API endpoint is used both to find the location of the master and to communicate the root CA for the master.
This flow has the admin manually approving the kubelet signing requests. This is the `queue` policy defined above. This manual intervention could be replaced by code that can verify the signing requests via other means.
![Dynamic Sequence Diagram](clustering/dynamic.png)

View File

@ -1 +0,0 @@
DroidSansMono.ttf

View File

@ -1,12 +0,0 @@
FROM debian:jessie
RUN apt-get update
RUN apt-get -qy install python-seqdiag make curl
WORKDIR /diagrams
RUN curl -sLo DroidSansMono.ttf https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/DroidSansMono.ttf
ADD . /diagrams
CMD bash -c 'make >/dev/stderr && tar cf - *.png'

View File

@ -1,29 +0,0 @@
FONT := DroidSansMono.ttf
PNGS := $(patsubst %.seqdiag,%.png,$(wildcard *.seqdiag))
.PHONY: all
all: $(PNGS)
.PHONY: watch
watch:
	fswatch *.seqdiag | xargs -n 1 sh -c "make || true"
$(FONT):
	curl -sLo $@ https://googlefontdirectory.googlecode.com/hg/apache/droidsansmono/$(FONT)
%.png: %.seqdiag $(FONT)
	seqdiag --no-transparency -a -f '$(FONT)' $<
# Build the stuff via a docker image
.PHONY: docker
docker:
	docker build -t clustering-seqdiag .
	docker run --rm clustering-seqdiag | tar xvf -
docker-clean:
	docker rmi clustering-seqdiag || true
	docker images -q --filter "dangling=true" | xargs docker rmi
fix-clock-skew:
	boot2docker ssh sudo date -u -D "%Y%m%d%H%M.%S" --set "$(shell date -u +%Y%m%d%H%M.%S)"

View File

@ -1,34 +0,0 @@
---
title: "Building with Docker"
---
This directory contains diagrams for the clustering design doc.
This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index). Assuming you have a non-borked python install, this should be installable with
{% highlight sh %}
pip install seqdiag
{% endhighlight %}
Just call `make` to regenerate the diagrams.
## Building with Docker
If you are on a Mac or your pip install is messed up, you can easily build with docker.
{% highlight sh %}
make docker
{% endhighlight %}
The first run will be slow but things should be fast after that.
To clean up the docker containers that are created (and other cruft that is left around) you can run `make docker-clean`.
If you are using boot2docker and get warnings about clock skew (or if things aren't building for some reason) then you can fix that up with `make fix-clock-skew`.
## Automatically rebuild on file changes
If you have the fswatch utility installed, you can have it monitor the file system and automatically rebuild when files have changed. Just do a `make watch`.

Binary file not shown.


View File

@ -1,24 +0,0 @@
seqdiag {
activation = none;
user[label = "Admin User"];
bootstrap[label = "Bootstrap API\nEndpoint"];
master;
kubelet[stacked];
user -> bootstrap [label="createCluster", return="cluster ID"];
user <-- bootstrap [label="returns\n- bootstrap-cluster-uri"];
user ->> master [label="start\n- bootstrap-cluster-uri"];
master => bootstrap [label="setMaster\n- master-location\n- master-ca"];
user ->> kubelet [label="start\n- bootstrap-cluster-uri"];
kubelet => bootstrap [label="get-master", return="returns\n- master-location\n- master-ca"];
kubelet ->> master [label="signCert\n- unsigned-kubelet-cert", return="returns\n- kubelet-cert"];
user => master [label="getSignRequests"];
user => master [label="approveSignRequests"];
kubelet <<-- master [label="returns\n- kubelet-cert"];
kubelet => master [label="register\n- kubelet-location"]
}

View File

@ -1,34 +0,0 @@
---
title: "Building with Docker"
---
This directory contains diagrams for the clustering design doc.
This depends on the `seqdiag` [utility](http://blockdiag.com/en/seqdiag/index). Assuming you have a non-borked python install, this should be installable with
{% highlight sh %}
pip install seqdiag
{% endhighlight %}
Just call `make` to regenerate the diagrams.
## Building with Docker
If you are on a Mac or your pip install is messed up, you can easily build with docker.
{% highlight sh %}
make docker
{% endhighlight %}
The first run will be slow but things should be fast after that.
To clean up the docker containers that are created (and other cruft that is left around) you can run `make docker-clean`.
If you are using boot2docker and get warnings about clock skew (or if things aren't building for some reason) then you can fix that up with `make fix-clock-skew`.
## Automatically rebuild on file changes
If you have the fswatch utility installed, you can have it monitor the file system and automatically rebuild when files have changed. Just do a `make watch`.

Binary file not shown.


View File

@ -1,16 +0,0 @@
seqdiag {
activation = none;
admin[label = "Manual Admin"];
ca[label = "Manual CA"]
master;
kubelet[stacked];
admin => ca [label="create\n- master-cert"];
admin ->> master [label="start\n- ca-root\n- master-cert"];
admin => ca [label="create\n- kubelet-cert"];
admin ->> kubelet [label="start\n- ca-root\n- kubelet-cert\n- master-location"];
kubelet => master [label="register\n- kubelet-location"];
}

View File

@ -1,150 +0,0 @@
---
title: "Container Command Execution & Port Forwarding in Kubernetes"
---
## Abstract
This describes an approach for providing support for:
- executing commands in containers, with stdin/stdout/stderr streams attached
- port forwarding to containers
## Background
There are several related issues/PRs:
- [Support attach](http://issue.k8s.io/1521)
- [Real container ssh](http://issue.k8s.io/1513)
- [Provide easy debug network access to services](http://issue.k8s.io/1863)
- [OpenShift container command execution proposal](https://github.com/openshift/origin/pull/576)
## Motivation
Users and administrators are accustomed to being able to access their systems
via SSH to run remote commands, get shell access, and do port forwarding.
Supporting SSH to containers in Kubernetes is a difficult task. You must
specify a "user" and a hostname to make an SSH connection, and `sshd` requires
real users (resolvable by NSS and PAM). Because a container belongs to a pod,
and the pod belongs to a namespace, you need to specify namespace/pod/container
to uniquely identify the target container. Unfortunately, a
namespace/pod/container is not a real user as far as SSH is concerned. Also,
most Linux systems limit user names to 32 characters, which is unlikely to be
large enough to contain namespace/pod/container. We could devise some scheme to
map each namespace/pod/container to a 32-character user name, adding entries to
`/etc/passwd` (or LDAP, etc.) and keeping those entries fully in sync all the
time. Alternatively, we could write custom NSS and PAM modules that allow the
host to resolve a namespace/pod/container to a user without needing to keep
files or LDAP in sync.
As an alternative to SSH, we are using a multiplexed streaming protocol that
runs on top of HTTP. There are no requirements about users being real users,
nor is there any limitation on user name length, as the protocol is under our
control. The only downside is that standard tooling that expects to use SSH
won't be able to work with this mechanism, unless adapters can be written.
## Constraints and Assumptions
- SSH support is not currently in scope
- CGroup confinement is ultimately desired, but implementing that support is not currently in scope
- SELinux confinement is ultimately desired, but implementing that support is not currently in scope
## Use Cases
- As a user of a Kubernetes cluster, I want to run arbitrary commands in a container, attaching my local stdin/stdout/stderr to the container
- As a user of a Kubernetes cluster, I want to be able to connect to local ports on my computer and have them forwarded to ports in the container
## Process Flow
### Remote Command Execution Flow
1. The client connects to the Kubernetes Master to initiate a remote command execution
request
2. The Master proxies the request to the Kubelet where the container lives
3. The Kubelet executes nsenter + the requested command and streams stdin/stdout/stderr back and forth between the client and the container
### Port Forwarding Flow
1. The client connects to the Kubernetes Master to initiate a port forwarding
request
2. The Master proxies the request to the Kubelet where the container lives
3. The client listens on each specified local port, awaiting local connections
4. The client connects to one of the local listening ports
5. The client notifies the Kubelet of the new connection
6. The Kubelet executes nsenter + socat and streams data back and forth between the client and the port in the container
## Design Considerations
### Streaming Protocol
The current multiplexed streaming protocol used is SPDY. This is not the
long-term desire, however. As soon as there is viable support for HTTP/2 in Go,
we will switch to that.
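To make the idea of a multiplexed stream concrete, here is a toy framing scheme that carries several logical streams (stdin/stdout/stderr) over one byte stream. It is deliberately not SPDY or HTTP/2: the 5-byte header layout, the stream ids, and the function names are assumptions chosen for illustration only.

```python
import struct
from collections import defaultdict

# Toy frame header: 1-byte stream id, 4-byte big-endian payload length.
# This layout is invented for the example and is NOT the SPDY wire format.
HEADER = struct.Struct("!BI")

STDIN, STDOUT, STDERR = 0, 1, 2


def encode_frame(stream_id, payload):
    """Prefix a payload with its stream id and length."""
    return HEADER.pack(stream_id, len(payload)) + payload


def decode_frames(data):
    """Split a byte string of back-to-back frames into per-stream data."""
    streams = defaultdict(bytes)
    offset = 0
    while offset < len(data):
        stream_id, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        streams[stream_id] += data[offset:offset + length]
        offset += length
    return dict(streams)


# Interleaved stdout and stderr frames share one connection.
wire = (
    encode_frame(STDOUT, b"hello ")
    + encode_frame(STDERR, b"warn\n")
    + encode_frame(STDOUT, b"world\n")
)
print(decode_frames(wire))
```

Because each frame names its stream, the receiver can reassemble stdout and stderr independently even though their bytes arrive interleaved.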
### Master as First Level Proxy
Clients should not be allowed to communicate directly with the Kubelet for
security reasons. Therefore, the Master is currently the only suggested entry
point to be used for remote command execution and port forwarding. This is not
necessarily desirable, as it means that all remote command execution and port
forwarding traffic must travel through the Master, potentially impacting other
API requests.
In the future, it might make more sense to retrieve an authorization token from
the Master, and then use that token to initiate a remote command execution or
port forwarding request with a load balanced proxy service dedicated to this
functionality. This would keep the streaming traffic out of the Master.
### Kubelet as Backend Proxy
The kubelet is currently responsible for handling remote command execution and
port forwarding requests. Just like with the Master described above, this means
that all remote command execution and port forwarding streaming traffic must
travel through the Kubelet, which could result in a degraded ability to service
other requests.
In the future, it might make more sense to use a separate service on the node.
Alternatively, we could possibly inject a process into the container that only
listens for a single request, expose that process's listening port on the node,
and then issue a redirect to the client such that it would connect to the first
level proxy, which would then proxy directly to the injected process's exposed
port. This would minimize the amount of proxying that takes place.
### Scalability
There are at least 2 different ways to execute a command in a container:
`docker exec` and `nsenter`. While `docker exec` might seem like an easier and
more obvious choice, it has some drawbacks.
#### `docker exec`
We could expose `docker exec` (i.e. have Docker listen on an exposed TCP port
on the node), but this would require proxying from the edge and securing the
Docker API. `docker exec` calls go through the Docker daemon, meaning that all
stdin/stdout/stderr traffic is proxied through the Daemon, adding an extra hop.
Additionally, you can't isolate 1 malicious `docker exec` call from normal
usage, meaning an attacker could initiate a denial of service or other attack
and take down the Docker daemon, or the node itself.
We expect remote command execution and port forwarding requests to be long
running and/or high bandwidth operations, and routing all the streaming data
through the Docker daemon feels like a bottleneck we can avoid.
#### `nsenter`
The implementation currently uses `nsenter` to run commands in containers,
joining the appropriate container namespaces. `nsenter` runs directly on the
node and is not proxied through any single daemon process.
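A minimal sketch of how an `nsenter` command line for this purpose might be assembled is shown below. The helper `nsenter_argv` and the choice of namespaces to join are assumptions for illustration; the document does not specify the exact flags the Kubelet passes. The long options used (`--target`, `--mount`, `--uts`, `--ipc`, `--net`, `--pid`) are standard util-linux `nsenter` options.

```python
def nsenter_argv(target_pid, command):
    """Build an argv that runs `command` inside the namespaces of `target_pid`.

    `target_pid` is assumed to be the host PID of a process already running
    in the container. This is a hypothetical helper, not Kubelet code.
    """
    if not command:
        raise ValueError("command must not be empty")
    return [
        "nsenter",
        "--target", str(target_pid),
        "--mount", "--uts", "--ipc", "--net", "--pid",
        "--",  # everything after this is the command to execute
    ] + list(command)


# The PID below is made up for the example.
print(nsenter_argv(4242, ["ls", "-l", "/"]))
```

The resulting list could be handed to `subprocess.Popen` with pipes attached, which is where the stdin/stdout/stderr streaming described earlier would connect; running it requires root privileges on the node.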
### Security
Authentication and authorization haven't specifically been tested yet with this
functionality. We need to make sure that users are not allowed to execute
remote commands or do port forwarding to containers they aren't allowed to
access.
Additional work is required to ensure that multiple command execution or port forwarding connections from different clients are not able to see each other's data. This can most likely be achieved via SELinux labeling and unique process contexts.

View File

@ -1,125 +0,0 @@
---
title: "DaemonSet in Kubernetes"
---
**Author**: Ananya Kumar (@AnanyaKumar)
**Status**: Implemented.
This document presents the design of the Kubernetes DaemonSet, describes use cases, and gives an overview of the code.
## Motivation
Many users have requested a way to run a daemon on every node in a Kubernetes cluster, or on a certain set of nodes in a cluster. This is essential for use cases such as building a sharded datastore, or running a logger on every node. In comes the DaemonSet, a way to conveniently create and manage daemon-like workloads in Kubernetes.
## Use Cases
The DaemonSet can be used for user-specified system services, cluster-level applications with strong node ties, and Kubernetes node services. Below are example use cases in each category.
### User-Specified System Services
Logging: Some users want a way to collect statistics about nodes in a cluster and send those logs to an external database. For example, system administrators might want to know if their machines are performing as expected, if they need to add more machines to the cluster, or if they should switch cloud providers. The DaemonSet can be used to run a data collection service (for example fluentd) on every node and send the data to a service like ElasticSearch for analysis.
### Cluster-Level Applications
Datastore: Users might want to implement a sharded datastore in their cluster. A few nodes in the cluster, labeled 'app=datastore', might be responsible for storing data shards, and pods running on these nodes might serve data. This architecture requires a way to bind pods to specific nodes, so it cannot be achieved using a Replication Controller. A DaemonSet is a convenient way to implement such a datastore.
For other uses, see the related [feature request](https://issues.k8s.io/1518).
## Functionality
The DaemonSet supports standard API features:
- create
    - The spec for DaemonSets has a pod template field.
    - Using the pod's nodeSelector field, DaemonSets can be restricted to operate over nodes that have a certain label. For example, suppose that in a cluster some nodes are labeled 'app=database'. You can use a DaemonSet to launch a datastore pod on exactly those nodes labeled 'app=database'.
    - Using the pod's nodeName field, DaemonSets can be restricted to operate on a specified node.
    - The PodTemplateSpec used by the DaemonSet is the same as the PodTemplateSpec used by the Replication Controller.
    - The initial implementation will not guarantee that DaemonSet pods are created on nodes before other pods.
    - The initial implementation of DaemonSet does not guarantee that DaemonSet pods show up on nodes (for example because of resource limitations of the node), but makes a best effort to launch DaemonSet pods (like Replication Controllers do with pods). Subsequent revisions might ensure that DaemonSet pods show up on nodes, preempting other pods if necessary.
    - The DaemonSet controller adds an annotation "kubernetes.io/created-by: \<json API object reference\>"
    - YAML example:
{% highlight yaml %}
apiVersion: v1
kind: DaemonSet
metadata:
  labels:
    app: datastore
  name: datastore
spec:
  template:
    metadata:
      labels:
        app: datastore-shard
    spec:
      nodeSelector:
        app: datastore-node
      containers:
      - name: datastore-shard
        image: kubernetes/sharded
        ports:
        - containerPort: 9042
          name: main
{% endhighlight %}
- commands that get info
    - get (e.g. kubectl get daemonsets)
    - describe
- Modifiers
    - delete (if --cascade=true, then first the client turns down all the pods controlled by the DaemonSet (by setting the nodeSelector to a uuid pair that is unlikely to be set on any node); then it deletes the DaemonSet; then it deletes the pods)
    - label
    - annotate
    - update operations like patch and replace (only allowed on the selector and on the nodeSelector and nodeName of the pod template)
- DaemonSets have labels, so you could, for example, list all DaemonSets with certain labels (the same way you would for a Replication Controller).
- In general, for all the supported features like get, describe, update, etc., the DaemonSet works in a similar way to the Replication Controller. However, note that the DaemonSet and the Replication Controller are different constructs.
### Persisting Pods
- Ordinary liveness probes specified in the pod template work to keep pods created by a DaemonSet running.
- If a daemon pod is killed or stopped, the DaemonSet will create a new replica of the daemon pod on the node.
### Cluster Mutations
- When a new node is added to the cluster, the DaemonSet controller starts daemon pods on the node for DaemonSets whose pod template nodeSelectors match the node's labels.
- Suppose the user launches a DaemonSet that runs a logging daemon on all nodes labeled 'logger=fluentd'. If the user then adds the 'logger=fluentd' label to a node (that did not initially have the label), the logging daemon will launch on the node. Additionally, if a user removes the label from a node, the logging daemon on that node will be killed.
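The label-driven behavior described in this section can be sketched as a small reconciliation function: given a nodeSelector and the current node labels, work out where daemon pods should be started and where they should be stopped. The function names and the node names below are assumptions for illustration and are not taken from the DaemonSet controller's code.

```python
def selector_matches(node_selector, node_labels):
    """True if every key/value pair in the selector is present on the node."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


def reconcile(node_selector, nodes, running_on):
    """Return (nodes_to_start, nodes_to_stop) for one DaemonSet.

    nodes:      mapping of node name -> label dict
    running_on: set of node names that currently run the daemon pod
    """
    wanted = {
        name for name, labels in nodes.items()
        if selector_matches(node_selector, labels)
    }
    return sorted(wanted - running_on), sorted(running_on - wanted)


# Hypothetical cluster state: node-b lost its label, node-c just gained it.
nodes = {
    "node-a": {"logger": "fluentd"},
    "node-b": {},
    "node-c": {"logger": "fluentd", "zone": "x"},
}
to_start, to_stop = reconcile({"logger": "fluentd"}, nodes, {"node-a", "node-b"})
print(to_start, to_stop)
```

Running the same function whenever a node is added or relabeled gives the behavior described above: pods appear on newly matching nodes and are removed from nodes that no longer match.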
## Alternatives Considered
We considered several alternatives that were deemed inferior to the approach of creating a new DaemonSet abstraction.
One alternative is to include the daemon in the machine image. In this case it would run outside of Kubernetes proper, and thus not be monitored, health checked, usable as a service endpoint, easily upgradable, etc.
A related alternative is to package daemons as static pods. This would address most of the problems described above, but they would still not be easily upgradable, and more generally could not be managed through the API server interface.
A third alternative is to generalize the Replication Controller. We would do something like: if you set the `replicas` field of the ReplicationControllerSpec to -1, then it means "run exactly one replica on every node matching the nodeSelector in the pod template." The ReplicationController would pretend `replicas` had been set to some large number -- larger than the largest number of nodes ever expected in the cluster -- and would use some anti-affinity mechanism to ensure that no more than one Pod from the ReplicationController runs on any given node. There are two downsides to this approach. First, there would always be a large number of Pending pods in the scheduler (these will be scheduled onto new machines when they are added to the cluster). The second downside is more philosophical: DaemonSet and the Replication Controller are very different concepts. We believe that having small, targeted controllers for distinct purposes makes Kubernetes easier to understand and use, compared to having larger multi-functional controllers (see ["Convert ReplicationController to a plugin"](http://issues.k8s.io/3058) for some discussion of this topic).
## Design
#### Client
- Add support for DaemonSet commands to kubectl and the client. Client code was added to client/unversioned. The main files in Kubectl that were modified are kubectl/describe.go and kubectl/stop.go, since for other calls like Get, Create, and Update, the client simply forwards the request to the backend via the REST API.
#### Apiserver
- Accept, parse, validate client commands
- REST API calls are handled in registry/daemon
- In particular, the api server will add the object to etcd
- DaemonManager listens for updates to etcd (using Framework.informer)
- API objects for DaemonSet were created in expapi/v1/types.go and expapi/v1/register.go
- Validation code is in expapi/validation
#### Daemon Manager
- Creates new DaemonSets when requested. Launches the corresponding daemon pod on all nodes with labels matching the new DaemonSet's selector.
- Listens for addition of new nodes to the cluster, by setting up a framework.NewInformer that watches for the creation of Node API objects. When a new node is added, the daemon manager will loop through each DaemonSet. If the label of the node matches the selector of the DaemonSet, then the daemon manager will create the corresponding daemon pod in the new node.
- The daemon manager creates a pod on a node by sending a command to the API server, requesting that a pod be bound to the node (the node will be specified via its hostname)
#### Kubelet
- Does not need to be modified, but health checking will occur for the daemon pods and revive the pods if they are killed (we set the pod restartPolicy to Always). We reject DaemonSet objects with pod templates that don't have restartPolicy set to Always.
## Open Issues
- Should work similarly to [Deployment](http://issues.k8s.io/1743).

View File

@ -1,89 +0,0 @@
---
title: "Kubernetes Event Compression"
---
This document captures the design of event compression.
## Background
Kubernetes components can get into a state where they generate tons of events which are identical except for the timestamp. For example, when pulling a non-existing image, Kubelet will repeatedly generate `image_not_existing` and `container_is_waiting` events until upstream components correct the image. When this happens, the spam from the repeated events makes the entire event mechanism useless. It also appears to cause memory pressure in etcd (see [#3853](http://issue.k8s.io/3853)).
## Proposal
Each binary that generates events (for example, `kubelet`) should keep track of previously generated events so that it can collapse recurring events into a single event instead of creating a new instance for each new event.
Event compression should be best effort (not guaranteed). Meaning, in the worst case, `n` identical (minus timestamp) events may still result in `n` event entries.
## Design
Instead of a single Timestamp, each event object [contains](http://releases.k8s.io/release-1.1/pkg/api/types.go#L1111) the following fields:
* `FirstTimestamp unversioned.Time`
    * The date/time of the first occurrence of the event.
* `LastTimestamp unversioned.Time`
    * The date/time of the most recent occurrence of the event.
    * On first occurrence, this is equal to the FirstTimestamp.
* `Count int`
    * The number of occurrences of this event between FirstTimestamp and LastTimestamp.
    * On first occurrence, this is 1.
Each binary that generates events:
* Maintains a historical record of previously generated events:
    * Implemented with ["Least Recently Used Cache"](https://github.com/golang/groupcache/blob/master/lru/lru.go) in [`pkg/client/record/events_cache.go`](https://releases.k8s.io/release-1.1/pkg/client/record/events_cache.go).
        * The key in the cache is generated from the event object minus timestamps/count/transient fields. Specifically, the following event fields are used to construct a unique key for an event:
            * `event.Source.Component`
            * `event.Source.Host`
            * `event.InvolvedObject.Kind`
            * `event.InvolvedObject.Namespace`
            * `event.InvolvedObject.Name`
            * `event.InvolvedObject.UID`
            * `event.InvolvedObject.APIVersion`
            * `event.Reason`
            * `event.Message`
        * The LRU cache is capped at 4096 events. That means if a component (e.g. kubelet) runs for a long period of time and generates tons of unique events, the previously generated events cache will not grow unchecked in memory. Instead, after 4096 unique events are generated, the oldest events are evicted from the cache.
* When an event is generated, the previously generated events cache is checked (see [`pkg/client/unversioned/record/event.go`](http://releases.k8s.io/release-1.1/pkg/client/unversioned/record/event.go)).
    * If the key for the new event matches the key for a previously generated event (meaning all of the above fields match between the new event and some previously generated event), then the event is considered to be a duplicate and the existing event entry is updated in etcd:
        * The new PUT (update) event API is called to update the existing event entry in etcd with the new last seen timestamp and count.
        * The event is also updated in the previously generated events cache with an incremented count, updated last seen timestamp, name, and new resource version (all required to issue a future event update).
    * If the key for the new event does not match the key for any previously generated event (meaning at least one of the above fields differs between the new event and every previously generated event), then the event is considered to be new/unique and a new event entry is created in etcd:
        * The usual POST/create event API is called to create a new event entry in etcd.
        * An entry for the event is also added to the previously generated events cache.
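The caching logic above can be sketched in a few lines. This is an illustrative model only, written with Python's `OrderedDict` rather than the groupcache LRU the implementation uses; the `EventCache` class, its `observe` method, the flat field names in `KEY_FIELDS`, and the sample event values are assumptions made for the example. The returned action stands in for the PUT (update) or POST (create) call to the API server.

```python
from collections import OrderedDict

# Flattened stand-ins for event.Source.Component, event.Source.Host,
# event.InvolvedObject.{Kind,Namespace,Name,UID,APIVersion},
# event.Reason and event.Message.
KEY_FIELDS = (
    "source_component", "source_host", "kind", "namespace",
    "name", "uid", "api_version", "reason", "message",
)


class EventCache:
    """LRU cache of previously generated events, capped at `capacity`."""

    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.entries = OrderedDict()

    def observe(self, event, timestamp):
        """Record an event; return ("update" | "create", cache entry)."""
        key = tuple(event[f] for f in KEY_FIELDS)
        entry = self.entries.get(key)
        if entry is not None:
            # Duplicate: bump the count and last-seen time (would be a PUT).
            entry["count"] += 1
            entry["last_timestamp"] = timestamp
            self.entries.move_to_end(key)
            return "update", entry
        # New event: would be a POST, then remembered in the cache.
        entry = {"first_timestamp": timestamp,
                 "last_timestamp": timestamp,
                 "count": 1}
        self.entries[key] = entry
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return "create", entry


event = {
    "source_component": "scheduler", "source_host": "",
    "kind": "Pod", "namespace": "default", "name": "skydns-ls6k1",
    "uid": "hypothetical-uid", "api_version": "v1",
    "reason": "failedScheduling",
    "message": "Error scheduling: no nodes available to schedule pods",
}
cache = EventCache()
actions = [cache.observe(event, t)[0]
           for t in ("01:13:05", "01:13:07", "01:13:10", "01:13:12")]
print(actions)
print(cache.observe(event, "01:13:12")[1]["count"])
```

Because the cache lives in process memory, a restart or an eviction resets the count and the next occurrence is created as a fresh entry.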
## Issues/Risks
* Compression is not guaranteed, because each component keeps track of event history in memory
    * An application restart causes event history to be cleared, meaning event history is not preserved across application restarts and compression will not occur across component restarts.
    * Because an LRU cache is used to keep track of previously generated events, if too many unique events are generated, old events will be evicted from the cache, so events will only be compressed until they age out of the events cache, at which point any new instance of the event will cause a new entry to be created in etcd.
## Example
Sample kubectl output
{% highlight console %}
FIRSTSEEN LASTSEEN COUNT NAME KIND SUBOBJECT REASON SOURCE MESSAGE
Thu, 12 Feb 2015 01:13:02 +0000 Thu, 12 Feb 2015 01:13:02 +0000 1 kubernetes-minion-4.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Starting kubelet.
Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-1.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-1.c.saad-dev-vms.internal} Starting kubelet.
Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-3.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-3.c.saad-dev-vms.internal} Starting kubelet.
Thu, 12 Feb 2015 01:13:09 +0000 Thu, 12 Feb 2015 01:13:09 +0000 1 kubernetes-minion-2.c.saad-dev-vms.internal Minion starting {kubelet kubernetes-minion-2.c.saad-dev-vms.internal} Starting kubelet.
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-influx-grafana-controller-0133o Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 elasticsearch-logging-controller-fplln Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 kibana-logging-controller-gziey Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 skydns-ls6k1 Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
Thu, 12 Feb 2015 01:13:05 +0000 Thu, 12 Feb 2015 01:13:12 +0000 4 monitoring-heapster-controller-oh43e Pod failedScheduling {scheduler } Error scheduling: no nodes available to schedule pods
Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey BoundPod implicitly required container POD pulled {kubelet kubernetes-minion-4.c.saad-dev-vms.internal} Successfully pulled image "kubernetes/pause:latest"
Thu, 12 Feb 2015 01:13:20 +0000 Thu, 12 Feb 2015 01:13:20 +0000 1 kibana-logging-controller-gziey Pod scheduled {scheduler } Successfully assigned kibana-logging-controller-gziey to kubernetes-minion-4.c.saad-dev-vms.internal
{% endhighlight %}
This demonstrates what would have been 20 separate entries (indicating scheduling failure) collapsed/compressed down to 5 entries.
## Related Pull Requests/Issues
* Issue [#4073](http://issue.k8s.io/4073): Compress duplicate events
* PR [#4157](http://issue.k8s.io/4157): Add "Update Event" to Kubernetes API
* PR [#4206](http://issue.k8s.io/4206): Modify Event struct to allow compressing multiple recurring events in to a single event
* PR [#4306](http://issue.k8s.io/4306): Compress recurring events in to a single event to optimize etcd storage
* PR [#4444](http://pr.k8s.io/4444): Switch events history to use LRU cache instead of map

View File

@ -1,402 +0,0 @@
---
title: "Variable expansion in pod command, args, and env"
---
## Abstract
A proposal for the expansion of environment variables using a simple `$(var)` syntax.
## Motivation
It is extremely common for users to need to compose environment variables or pass arguments to
their commands using the values of environment variables. Kubernetes should provide a facility for
the 80% cases in order to decrease coupling and the use of workarounds.
## Goals
1. Define the syntax format
2. Define the scoping and ordering of substitutions
3. Define the behavior for unmatched variables
4. Define the behavior for unexpected/malformed input
## Constraints and Assumptions
* This design should describe the simplest possible syntax to accomplish the use-cases
* Expansion syntax will not support more complicated shell-like behaviors such as default values
(viz: `$(VARIABLE_NAME:"default")`), inline substitution, etc.
## Use Cases
1. As a user, I want to compose new environment variables for a container using a substitution
syntax to reference other variables in the container's environment and service environment
variables
1. As a user, I want to substitute environment variables into a container's command
1. As a user, I want to do the above without requiring the container's image to have a shell
1. As a user, I want to be able to specify a default value for a service variable which may
not exist
1. As a user, I want to see an event associated with the pod if an expansion fails (ie, references
variable names that cannot be expanded)
### Use Case: Composition of environment variables
Currently, containers are injected with docker-style environment variables for the services in
their pod's namespace. There are several variables for each service, but users routinely need
to compose URLs based on these variables because there is not a variable for the exact format
they need. Users should be able to build new environment variables with the exact format they need.
Eventually, it should also be possible to turn off the automatic injection of the docker-style
variables into pods and let the users consume the exact information they need via the downward API
and composition.
#### Expanding expanded variables
It should be possible to reference a variable which is itself the result of an expansion, if the
referenced variable is declared in the container's environment prior to the one referencing it.
Put another way -- a container's environment is expanded in order, and expanded variables are
available to subsequent expansions.
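A minimal sketch of in-order expansion, assuming a simplified matcher: the regular expression below handles only well-formed `$(NAME)` references, ignores the `$$` escape, and assumes variable names consist of letters, digits, and underscores. `envVar` and `expandEnv` are hypothetical names used for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// envVar mirrors a name/value pair from a container's env list.
type envVar struct{ Name, Value string }

// ref matches a well-formed $(NAME) reference (simplified; no $$ escape).
var ref = regexp.MustCompile(`\$\(([A-Za-z_][A-Za-z0-9_]*)\)`)

// expandEnv expands the list in declaration order, so each variable can
// reference any variable declared before it, including expanded ones.
func expandEnv(env []envVar) []envVar {
	seen := map[string]string{}
	out := make([]envVar, 0, len(env))
	for _, e := range env {
		value := ref.ReplaceAllStringFunc(e.Value, func(m string) string {
			// m has the form "$(NAME)"; strip the syntax to get NAME.
			if v, ok := seen[m[2:len(m)-1]]; ok {
				return v
			}
			return m // not declared yet: leave the reference verbatim
		})
		seen[e.Name] = value
		out = append(out, envVar{Name: e.Name, Value: value})
	}
	return out
}

func main() {
	env := []envVar{
		{Name: "HOST", Value: "gitserver"},
		{Name: "ADDR", Value: "$(HOST):8080"},
		{Name: "URL", Value: "http://$(ADDR)/$(LATER)"},
		{Name: "LATER", Value: "too-late"},
	}
	for _, e := range expandEnv(env) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

Here `URL` picks up the already expanded `ADDR`, while the reference to `LATER` stays unexpanded because `LATER` is declared after `URL`.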
### Use Case: Variable expansion in command
Users frequently need to pass the values of environment variables to a container's command.
Currently, Kubernetes does not perform any expansion of variables. The workaround is to invoke a
shell in the container's command and have the shell perform the substitution, or to write a wrapper
script that sets up the environment and runs the command. This has a number of drawbacks:
1. Solutions that require a shell are unfriendly to images that do not contain a shell
2. Wrapper scripts make it harder to use images as base images
3. Wrapper scripts increase coupling to Kubernetes
Users should be able to do the 80% case of variable expansion in command without writing a wrapper
script or adding a shell invocation to their containers' commands.
### Use Case: Images without shells
The current workaround for variable expansion in a container's command requires the container's
image to have a shell. This is unfriendly to images that do not contain a shell (`scratch` images,
for example). Users should be able to perform the other use-cases in this design without regard to
the content of their images.
### Use Case: See an event for incomplete expansions
It is possible that a container with incorrect variable values or command line may continue to run
for a long period of time, and that the end-user would have no visual or obvious warning of the
incorrect configuration. If the kubelet creates an event when an expansion references a variable
that cannot be expanded, it will help users quickly detect problems with expansions.
## Design Considerations
### What features should be supported?
In order to limit complexity, we want to provide the right amount of functionality so that the 80%
cases can be realized and nothing more. We felt that the essentials boiled down to:
1. Ability to perform direct expansion of variables in a string
2. Ability to specify default values via a prioritized mapping function but without support for
defaults as a syntax-level feature
### What should the syntax be?
The exact syntax for variable expansion has a large impact on how users perceive and relate to the
feature. We considered implementing a very restrictive subset of the shell `${var}` syntax. This
syntax is an attractive option on some level, because many people are familiar with it. However,
this syntax also has a large number of lesser known features such as the ability to provide
default values for unset variables, perform inline substitution, etc.
In the interest of preventing conflation of the expansion feature in Kubernetes with the shell
feature, we chose a different syntax similar to the one in Makefiles, `$(var)`. We also chose not
to support the bare `$var` format, since it is not required to implement the required use-cases.
Nested references, ie, variable expansion within variable names, are not supported.
#### How should unmatched references be treated?
Ideally, it should be extremely clear when a variable reference couldn't be expanded. We decided
the best experience for unmatched variable references would be to have the entire reference, syntax
included, show up in the output. As an example, if the reference `$(VARIABLE_NAME)` cannot be
expanded, then `$(VARIABLE_NAME)` should be present in the output.
#### Escaping the operator
Although the `$(var)` syntax does overlap with the `$(command)` form of command substitution
supported by many shells, because unexpanded variables are present verbatim in the output, we
expect this will not present a problem to many users. If there is a collision between a variable
name and command substitution syntax, the syntax can be escaped with the form `$$(VARIABLE_NAME)`,
which will evaluate to `$(VARIABLE_NAME)` whether `VARIABLE_NAME` can be expanded or not.
## Design
This design encompasses the variable expansion syntax and specification and the changes needed to
incorporate the expansion feature into the container's environment and command.
### Syntax and expansion mechanics
This section describes the expansion syntax, evaluation of variable values, and how unexpected or
malformed inputs are handled.
#### Syntax
The inputs to the expansion feature are:
1. A utf-8 string (the input string) which may contain variable references
2. A function (the mapping function) that maps the name of a variable to the variable's value, of
type `func(string) string`
Variable references in the input string are indicated exclusively with the syntax
`$(<variable-name>)`. The syntax tokens are:
- `$`: the operator
- `(`: the reference opener
- `)`: the reference closer
The operator has no meaning unless accompanied by the reference opener and closer tokens. The
operator can be escaped using `$$`. One literal `$` will be emitted for each `$$` in the input.
The reference opener and closer characters have no meaning when not part of a variable reference.
If a variable reference is malformed, viz: `$(VARIABLE_NAME` without a closing expression, the
operator and expression opening characters are treated as ordinary characters without special
meanings.
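The rules in this section can be sketched as a single pass over the input. This is an illustrative sketch of the proposed behavior, not the code that ships in Kubernetes; the names `expand` and `mappingFor` are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// mappingFor returns a mapping function over vars. Unknown names come back
// wrapped in the reference syntax so they appear verbatim in the output.
func mappingFor(vars map[string]string) func(string) string {
	return func(name string) string {
		if v, ok := vars[name]; ok {
			return v
		}
		return "$(" + name + ")"
	}
}

// expand replaces $(NAME) references in input using mapping.
func expand(input string, mapping func(string) string) string {
	var out strings.Builder
	for i := 0; i < len(input); i++ {
		if input[i] != '$' || i+1 == len(input) {
			out.WriteByte(input[i]) // ordinary character, or a trailing "$"
			continue
		}
		switch input[i+1] {
		case '$':
			// Escaped operator: "$$" emits one literal "$".
			out.WriteByte('$')
			i++
		case '(':
			end := strings.IndexByte(input[i+2:], ')')
			if end < 0 {
				// Malformed reference without a closer: "$(" is ordinary text.
				out.WriteString("$(")
				i++
			} else {
				out.WriteString(mapping(input[i+2 : i+2+end]))
				i += 2 + end // move to the closer; the loop steps past it
			}
		default:
			// An operator without an opener has no special meaning.
			out.WriteByte('$')
		}
	}
	return out.String()
}

func main() {
	m := mappingFor(map[string]string{"VAR_A": "A", "VAR_B": "B"})
	inputs := []string{"$(VAR_A)-$(VAR_B)", "$$(VAR_A)", "$(VAR_DNE)", "$(VAR_A", "$VAR_A"}
	for _, in := range inputs {
		fmt.Printf("%q -> %q\n", in, expand(in, m))
	}
}
```

Values returned by the mapping function are written out as-is and are not scanned again, so a value that itself contains `$(...)` is not expanded a second time.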
#### Scope and ordering of substitutions
The scope in which variable references are expanded is defined by the mapping function. Within the
mapping function, any arbitrary strategy may be used to determine the value of a variable name.
The most basic implementation of a mapping function is to use a `map[string]string` to look up the
value of a variable.
In order to support default values for variables like service variables presented by the kubelet,
which may not be bound because the service that provides them does not yet exist, there should be a
mapping function that uses a list of `map[string]string` like:
{% highlight go %}
func MakeMappingFunc(maps ...map[string]string) func(string) string {
	return func(input string) string {
		for _, context := range maps {
			val, ok := context[input]
			if ok {
				return val
			}
		}

		return ""
	}
}

// elsewhere
containerEnv := map[string]string{
	"FOO":           "BAR",
	"ZOO":           "ZAB",
	"SERVICE2_HOST": "some-host",
}

serviceEnv := map[string]string{
	"SERVICE_HOST": "another-host",
	"SERVICE_PORT": "8083",
}

// single-map variation
mapping := MakeMappingFunc(containerEnv)

// default variables not found in serviceEnv
mappingWithDefaults := MakeMappingFunc(serviceEnv, containerEnv)
{% endhighlight %}
### Implementation changes
The necessary changes to implement this functionality are:
1. Add a new interface, `ObjectEventRecorder`, which is like the `EventRecorder` interface, but
scoped to a single object, and a function that returns an `ObjectEventRecorder` given an
`ObjectReference` and an `EventRecorder`
2. Introduce `third_party/golang/expansion` package that provides:
1. An `Expand(string, func(string) string) string` function
2. A `MappingFuncFor(ObjectEventRecorder, ...map[string]string) func(string) string` function
3. Make the kubelet expand environment correctly
4. Make the kubelet expand command correctly
#### Event Recording
In order to provide an event when an expansion references undefined variables, the mapping function
must be able to create an event. In order to facilitate this, we should create a new interface in
the `api/client/record` package which is similar to `EventRecorder`, but scoped to a single object:
{% highlight go %}
// ObjectEventRecorder knows how to record events about a single object.
type ObjectEventRecorder interface {
// Event constructs an event from the given information and puts it in the queue for sending.
// 'reason' is the reason this event is generated. 'reason' should be short and unique; it will
// be used to automate handling of events, so imagine people writing switch statements to
// handle them. You want to make that easy.
// 'message' is intended to be human readable.
//
// The resulting event will be created in the same namespace as the reference object.
Event(reason, message string)
// Eventf is just like Event, but with Sprintf for the message field.
Eventf(reason, messageFmt string, args ...interface{})
// PastEventf is just like Eventf, but with an option to specify the event's 'timestamp' field.
PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{})
}
{% endhighlight %}
There should also be a function that can construct an `ObjectEventRecorder` from a `runtime.Object`
and an `EventRecorder`:
{% highlight go %}
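type objectRecorderImpl struct {
	object   runtime.Object
	recorder EventRecorder
}

func (r *objectRecorderImpl) Event(reason, message string) {
	r.recorder.Event(r.object, reason, message)
}

// Eventf and PastEventf are needed to satisfy ObjectEventRecorder. They assume
// that EventRecorder's methods take the object as their first argument, as Event does.
func (r *objectRecorderImpl) Eventf(reason, messageFmt string, args ...interface{}) {
	r.recorder.Eventf(r.object, reason, messageFmt, args...)
}

func (r *objectRecorderImpl) PastEventf(timestamp unversioned.Time, reason, messageFmt string, args ...interface{}) {
	r.recorder.PastEventf(r.object, timestamp, reason, messageFmt, args...)
}

func ObjectEventRecorderFor(object runtime.Object, recorder EventRecorder) ObjectEventRecorder {
	return &objectRecorderImpl{object, recorder}
}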
{% endhighlight %}
#### Expansion package
The expansion package should provide two functions:
{% highlight go %}
// MappingFuncFor returns a mapping function for use with Expand that
// implements the expansion semantics defined in the expansion spec; it
// returns the input string wrapped in the expansion syntax if no mapping
// for the input is found. If no expansion is found for a key, an event
// is raised on the given recorder.
func MappingFuncFor(recorder record.ObjectEventRecorder, context ...map[string]string) func(string) string {
// ...
}
// Expand replaces variable references in the input string according to
// the expansion spec using the given mapping function to resolve the
// values of variables.
func Expand(input string, mapping func(string) string) string {
// ...
}
{% endhighlight %}
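One way the mapping function might behave is sketched below. It is a self-contained illustration rather than the proposed package: `objectEventRecorder` is a local stand-in with only the method the sketch needs, `memoryRecorder` is a hypothetical recorder that collects events in memory, and the `missingExpansion` reason string is an assumption.

```go
package main

import "fmt"

// objectEventRecorder is a local stand-in for the proposed interface.
type objectEventRecorder interface {
	Eventf(reason, messageFmt string, args ...interface{})
}

// memoryRecorder is a hypothetical recorder that keeps events in memory.
type memoryRecorder struct{ events []string }

func (r *memoryRecorder) Eventf(reason, messageFmt string, args ...interface{}) {
	r.events = append(r.events, reason+": "+fmt.Sprintf(messageFmt, args...))
}

// mappingFuncFor searches the maps in priority order. When no map contains
// the name, it records an event and returns the name wrapped in the
// expansion syntax, so the reference shows up verbatim in the output.
func mappingFuncFor(recorder objectEventRecorder, context ...map[string]string) func(string) string {
	return func(name string) string {
		for _, vars := range context {
			if val, ok := vars[name]; ok {
				return val
			}
		}
		// "missingExpansion" is an illustrative reason, not a real constant.
		recorder.Eventf("missingExpansion", "unable to expand variable %q", name)
		return "$(" + name + ")"
	}
}

func main() {
	rec := &memoryRecorder{}
	mapping := mappingFuncFor(rec,
		map[string]string{"SERVICE_HOST": "another-host"},
		map[string]string{"SERVICE_HOST": "fallback-host", "FOO": "BAR"},
	)
	fmt.Println(mapping("SERVICE_HOST"), mapping("FOO"), mapping("MISSING"))
	fmt.Println(rec.events)
}
```

Earlier maps take priority over later ones, which is how default values can be layered without adding defaults to the syntax itself.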
#### Kubelet changes
The Kubelet should be made to correctly expand variable references in a container's environment,
command, and args. Changes will need to be made to:
1. The `makeEnvironmentVariables` function in the kubelet; this is used by
`GenerateRunContainerOptions`, which is used by both the docker and rkt container runtimes
2. The docker manager `setEntrypointAndCommand` func has to be changed to perform variable
expansion
3. The rkt runtime should be made to support expansion in command and args when support for it is
implemented
### Examples
#### Inputs and outputs
These examples are in the context of the mapping:
| Name | Value |
|-------------|------------|
| `VAR_A` | `"A"` |
| `VAR_B` | `"B"` |
| `VAR_C` | `"C"` |
| `VAR_REF` | `$(VAR_A)` |
| `VAR_EMPTY` | `""` |
No other variables are defined.
| Input | Result |
|--------------------------------|----------------------------|
| `"$(VAR_A)"` | `"A"` |
| `"___$(VAR_B)___"` | `"___B___"` |
| `"___$(VAR_C)"` | `"___C"` |
| `"$(VAR_A)-$(VAR_A)"` | `"A-A"` |
| `"$(VAR_A)-1"` | `"A-1"` |
| `"$(VAR_A)_$(VAR_B)_$(VAR_C)"` | `"A_B_C"` |
| `"$$(VAR_B)_$(VAR_A)"` | `"$(VAR_B)_A"` |
| `"$$(VAR_A)_$$(VAR_B)"` | `"$(VAR_A)_$(VAR_B)"` |
| `"f000-$$VAR_A"` | `"f000-$VAR_A"` |
| `"foo\\$(VAR_C)bar"` | `"foo\Cbar"` |
| `"foo\\\\$(VAR_C)bar"` | `"foo\\Cbar"` |
| `"foo\\\\\\\\$(VAR_A)bar"` | `"foo\\\\Abar"` |
| `"$(VAR_A$(VAR_B))"` | `"$(VAR_A$(VAR_B))"` |
| `"$(VAR_A$(VAR_B)"` | `"$(VAR_A$(VAR_B)"` |
| `"$(VAR_REF)"` | `"$(VAR_A)"` |
| `"%%$(VAR_REF)--$(VAR_REF)%%"` | `"%%$(VAR_A)--$(VAR_A)%%"` |
| `"foo$(VAR_EMPTY)bar"` | `"foobar"` |
| `"foo$(VAR_Awhoops!"` | `"foo$(VAR_Awhoops!"` |
| `"f00__(VAR_A)__"` | `"f00__(VAR_A)__"` |
| `"$?_boo_$!"` | `"$?_boo_$!"` |
| `"$VAR_A"` | `"$VAR_A"` |
| `"$(VAR_DNE)"` | `"$(VAR_DNE)"` |
| `"$$$$$$(BIG_MONEY)"` | `"$$$(BIG_MONEY)"` |
| `"$$$$$$(VAR_A)"` | `"$$$(VAR_A)"` |
| `"$$$$$$$(GOOD_ODDS)"` | `"$$$$(GOOD_ODDS)"` |
| `"$$$$$$$(VAR_A)"` | `"$$$A"` |
| `"$VAR_A)"` | `"$VAR_A)"` |
| `"${VAR_A}"` | `"${VAR_A}"` |
| `"$(VAR_B)_______$(A"` | `"B_______$(A"` |
| `"$(VAR_C)_______$("` | `"C_______$("` |
| `"$(VAR_A)foobarzab$"` | `"Afoobarzab$"` |
| `"foo-\\$(VAR_A"` | `"foo-\$(VAR_A"` |
| `"--$($($($($--"` | `"--$($($($($--"` |
| `"$($($($($--foo$("` | `"$($($($($--foo$("` |
| `"foo0--$($($($("` | `"foo0--$($($($("` |
| `"$(foo$$var)"` | `"$(foo$$var)"` |
#### In a pod: building a URL
Notice the `$(var)` syntax.
{% highlight yaml %}
apiVersion: v1
kind: Pod
metadata:
name: expansion-pod
spec:
containers:
- name: test-container
image: gcr.io/google_containers/busybox
command: [ "/bin/sh", "-c", "env" ]
env:
- name: PUBLIC_URL
value: "http://$(GITSERVER_SERVICE_HOST):$(GITSERVER_SERVICE_PORT)"
restartPolicy: Never
{% endhighlight %}
#### In a pod: building a URL using downward API
{% highlight yaml %}
apiVersion: v1
kind: Pod
metadata:
name: expansion-pod
spec:
containers:
- name: test-container
image: gcr.io/google_containers/busybox
command: [ "/bin/sh", "-c", "env" ]
env:
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: "metadata.namespace"
- name: PUBLIC_URL
value: "http://gitserver.$(POD_NAMESPACE):$(SERVICE_PORT)"
restartPolicy: Never
{% endhighlight %}

View File

@ -1,204 +0,0 @@
---
title: "Adding custom resources to the Kubernetes API server"
---
This document describes the design for implementing the storage of custom API types in the Kubernetes API Server.
## Resource Model
### The ThirdPartyResource
The `ThirdPartyResource` resource describes the multiple versions of a custom resource that the user wants to add
to the Kubernetes API. `ThirdPartyResource` is a non-namespaced resource; attempting to place it in a namespace
will return an error.
Each `ThirdPartyResource` resource has the following:
* Standard Kubernetes object metadata.
* ResourceKind - The kind of the resources described by this third party resource.
* Description - A free text description of the resource.
* APIGroup - An API group that this resource should be placed into.
* Versions - One or more `Version` objects.
### The `Version` Object
The `Version` object describes a single concrete version of a custom resource. The `Version` object currently
only specifies:
* The `Name` of the version.
* The `APIGroup` this version should belong to.
## Expectations about third party objects
Every object that is added to a third-party Kubernetes object store is expected to contain Kubernetes
compatible [object metadata](../devel/api-conventions.html#metadata). This requirement enables the
Kubernetes API server to provide the following features:
* Filtering lists of objects via LabelQueries
* `resourceVersion`-based optimistic concurrency via compare-and-swap
* Versioned storage
* Event recording
* Integration with basic `kubectl` command line tooling.
* Watch for resource changes.
The `Kind` for an instance of a third-party object (e.g. `CronTab`) is expected to be
programmatically convertible to the name of the resource using
the conversion below. Kinds are expected to be of the form `<CamelCaseKind>`, and the
`APIVersion` for the object is expected to be `<domain-name>/<api-group>/<api-version>`,
for example `example.com/stable/v1`.
`domain-name` is expected to be a fully qualified domain name.
`CamelCaseKind` is the specific type name.
To convert this into the `metadata.name` for the `ThirdPartyResource` resource instance,
the `CamelCaseKind` is converted to lowercase, with a '-' inserted before each capital letter
except the first (`camel-case-kind`), and the `<domain-name>` is appended verbatim after a '.'.
The first character of the kind is assumed to be
capitalized. In pseudo code:
{% highlight go %}
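// kindToName converts "CamelCaseKind" to "camel-case-kind".
// It needs the "strings" and "unicode" packages.
func kindToName(kindName string) string {
	var result strings.Builder
	for ix, r := range kindName {
		// No '-' before the first character, which is assumed to be capitalized.
		if unicode.IsUpper(r) && ix > 0 {
			result.WriteByte('-')
		}
		result.WriteRune(unicode.ToLower(r))
	}
	return result.String()
}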
{% endhighlight %}
As a concrete example, the resource named `camel-case-kind.example.com` defines resources of Kind `CamelCaseKind`, in
the APIGroup with the prefix `example.com/...`.
The reason for this is to enable rapid lookup of a `ThirdPartyResource` object given the kind information.
This is also the reason why `ThirdPartyResource` is not namespaced.
## Usage
When a user creates a new `ThirdPartyResource`, the Kubernetes API Server reacts by creating a new, namespaced
RESTful resource path. For now, non-namespaced objects are not supported. As with existing built-in objects,
deleting a namespace deletes all third-party resources in that namespace.
For example, if a user creates:
{% highlight yaml %}
metadata:
name: cron-tab.example.com
apiVersion: extensions/v1beta1
kind: ThirdPartyResource
description: "A specification of a Pod to run on a cron style schedule"
versions:
- name: stable/v1
- name: experimental/v2
{% endhighlight %}
Then the API server will program in two new RESTful resource paths:
* `/thirdparty/example.com/stable/v1/namespaces/<namespace>/crontabs/...`
* `/thirdparty/example.com/experimental/v2/namespaces/<namespace>/crontabs/...`
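How the two paths follow from the resource definition can be sketched as below. This is an illustration under assumptions, not the API server's logic: `resourcePaths` is a hypothetical helper, the resource name is assumed to have the form `<kind-with-dashes>.<domain>`, and the plural path segment is assumed to be the kind in lowercase without dashes plus an `s`.

```go
package main

import (
	"fmt"
	"strings"
)

// resourcePaths derives one path prefix per version for a ThirdPartyResource
// name such as "cron-tab.example.com". It returns nil if the name has no domain.
func resourcePaths(name string, versions []string) []string {
	parts := strings.SplitN(name, ".", 2)
	if len(parts) != 2 {
		return nil
	}
	// Assumed pluralization: "cron-tab" -> "crontabs".
	plural := strings.ReplaceAll(parts[0], "-", "") + "s"
	paths := make([]string, 0, len(versions))
	for _, v := range versions {
		paths = append(paths,
			fmt.Sprintf("/thirdparty/%s/%s/namespaces/<namespace>/%s", parts[1], v, plural))
	}
	return paths
}

func main() {
	for _, p := range resourcePaths("cron-tab.example.com", []string{"stable/v1", "experimental/v2"}) {
		fmt.Println(p)
	}
}
```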
Now that this schema has been created, a user can `POST`:
{% highlight json %}
{
"metadata": {
"name": "my-new-cron-object"
},
"apiVersion": "example.com/stable/v1",
"kind": "CronTab",
"cronSpec": "* * * * /5",
"image": "my-awesome-chron-image"
}
{% endhighlight %}
to: `/third-party/example.com/stable/v1/namespaces/default/crontabs/my-new-cron-object`
and the corresponding data will be stored in etcd by the API server, so that when the user issues:
```
GET /third-party/example.com/stable/v1/namespaces/default/crontabs/my-new-cron-object
```
they will get back the same data, but with additional Kubernetes metadata
(e.g. `resourceVersion`, `createdTimestamp`) filled in.
Likewise, to list all resources, a user can issue:
```
GET /third-party/example.com/stable/v1/namespaces/default/crontabs
```
and get back:
{% highlight json %}
{
"apiVersion": "example.com/stable/v1",
"kind": "CronTabList",
"items": [
{
"metadata": {
"name": "my-new-cron-object"
},
"apiVersion": "example.com/stable/v1",
"kind": "CronTab",
"cronSpec": "* * * * /5",
"image": "my-awesome-chron-image"
}
]
}
{% endhighlight %}
Because all objects are expected to contain standard Kubernetes metadata fields, these
list operations can also use `Label` queries to filter requests down to specific subsets.
Likewise, clients can use watch endpoints to watch for changes to stored objects.
## Storage
In order to store custom user data in a versioned fashion inside of etcd, we need to also introduce a
`Codec`-compatible object for persistent storage in etcd. This object is `ThirdPartyResourceData` and it contains:
* Standard API Metadata
* `Data`: The raw JSON data for this custom object.
### Storage key specification
Each custom object stored by the API server needs a custom key in storage, as described below.
#### Definitions
* `resource-namespace` : the namespace of the particular resource that is being stored
* `resource-name`: the name of the particular resource being stored
* `third-party-resource-namespace`: the namespace of the `ThirdPartyResource` resource that represents the type for the specific instance being stored.
* `third-party-resource-name`: the name of the `ThirdPartyResource` resource that represents the type for the specific instance being stored.
#### Key
Given the definitions above, the key for a specific third-party object is:
```
${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/${resource-name}
```
Thus, listing a third-party resource can be achieved by listing the directory:
```
${standard-k8s-prefix}/third-party-resources/${third-party-resource-namespace}/${third-party-resource-name}/${resource-namespace}/
```
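The key layout above can be sketched with a small helper. This is a minimal sketch, assuming `/registry` as the standard prefix and `default` as a placeholder namespace; `storageKey` and `listPrefix` are hypothetical names. Note that `path.Join` drops empty elements, so every segment is assumed to be non-empty.

```go
package main

import (
	"fmt"
	"path"
)

// storageKey builds the storage key for one third-party object.
func storageKey(prefix, tprNamespace, tprName, resourceNamespace, resourceName string) string {
	return path.Join(prefix, "third-party-resources", tprNamespace, tprName, resourceNamespace, resourceName)
}

// listPrefix builds the directory that holds all objects of one third-party
// resource type within a namespace.
func listPrefix(prefix, tprNamespace, tprName, resourceNamespace string) string {
	return path.Join(prefix, "third-party-resources", tprNamespace, tprName, resourceNamespace) + "/"
}

func main() {
	// "/registry" and "default" are assumed values for illustration.
	fmt.Println(storageKey("/registry", "default", "cron-tab.example.com", "default", "my-new-cron-object"))
	fmt.Println(listPrefix("/registry", "default", "cron-tab.example.com", "default"))
}
```

Because every object key starts with the list prefix, listing a resource type in a namespace is a single directory read.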

View File

@ -1,246 +0,0 @@
---
title: "Horizontal Pod Autoscaling"
---
## Preface
This document briefly describes the design of the horizontal autoscaler for pods.
The autoscaler (implemented as a Kubernetes API resource and controller) is responsible for dynamically controlling
the number of replicas of some collection (e.g. the pods of a ReplicationController) to meet some objective(s),
for example a target per-pod CPU utilization.
This design supersedes [autoscaling.md](http://releases.k8s.io/release-1.1/docs/proposals/autoscaling.md).
## Overview
The resource usage of a serving application usually varies over time: sometimes the demand for the application rises,
and sometimes it drops.
In Kubernetes version 1.0, a user can only manually set the number of serving pods.
Our aim is to provide a mechanism for the automatic adjustment of the number of pods based on CPU utilization statistics
(a future version will allow autoscaling based on other resources/metrics).
## Scale Subresource
In Kubernetes version 1.1, we are introducing the Scale subresource and implementing horizontal autoscaling of pods based on it.
The Scale subresource is supported for replication controllers and deployments.
The Scale subresource is a virtual resource (it does not correspond to an object stored in etcd).
It is only present in the API as an interface that a controller (in this case the HorizontalPodAutoscaler) can use to dynamically scale
the number of replicas controlled by some other API object (currently ReplicationController and Deployment) and to learn the current number of replicas.
Scale is a subresource of the API object that it serves as the interface for.
The Scale subresource is useful because whenever we introduce another type we want to autoscale, we just need to implement the Scale subresource for it.
The wider discussion regarding Scale took place in [#1629](https://github.com/kubernetes/kubernetes/issues/1629).
The Scale subresource is exposed in the API for a replication controller or deployment under the following paths:
`apis/extensions/v1beta1/replicationcontrollers/myrc/scale`
`apis/extensions/v1beta1/deployments/mydeployment/scale`
It has the following structure:
{% highlight go %}
// represents a scaling request for a resource.
type Scale struct {
unversioned.TypeMeta
api.ObjectMeta
// defines the behavior of the scale.
Spec ScaleSpec
// current status of the scale.
Status ScaleStatus
}
// describes the attributes of a scale subresource
type ScaleSpec struct {
// desired number of instances for the scaled object.
Replicas int `json:"replicas,omitempty"`
}
// represents the current status of a scale subresource.
type ScaleStatus struct {
// actual number of observed instances of the scaled object.
Replicas int `json:"replicas"`
// label query over pods that should match the replicas count.
Selector map[string]string `json:"selector,omitempty"`
}
{% endhighlight %}
Writing to `ScaleSpec.Replicas` resizes the replication controller/deployment associated with
the given Scale subresource.
`ScaleStatus.Replicas` reports how many pods are currently running in the replication controller/deployment,
and `ScaleStatus.Selector` returns the selector for the pods.
## HorizontalPodAutoscaler Object
In Kubernetes version 1.1, we are introducing the HorizontalPodAutoscaler object. It is accessible under:
`apis/extensions/v1beta1/horizontalpodautoscalers/myautoscaler`
It has the following structure:
{% highlight go %}
// configuration of a horizontal pod autoscaler.
type HorizontalPodAutoscaler struct {
unversioned.TypeMeta
api.ObjectMeta
// behavior of autoscaler.
Spec HorizontalPodAutoscalerSpec
// current information about the autoscaler.
Status HorizontalPodAutoscalerStatus
}
// specification of a horizontal pod autoscaler.
type HorizontalPodAutoscalerSpec struct {
// reference to Scale subresource; horizontal pod autoscaler will learn the current resource
// consumption from its status, and will set the desired number of pods by modifying its spec.
ScaleRef SubresourceReference
// lower limit for the number of pods that can be set by the autoscaler, default 1.
MinReplicas *int
// upper limit for the number of pods that can be set by the autoscaler.
// It cannot be smaller than MinReplicas.
MaxReplicas int
// target average CPU utilization (represented as a percentage of requested CPU) over all the pods;
// if not specified it defaults to the target CPU utilization at 80% of the requested resources.
CPUUtilization *CPUTargetUtilization
}
type CPUTargetUtilization struct {
// fraction of the requested CPU that should be utilized/used,
// e.g. 70 means that 70% of the requested CPU should be in use.
TargetPercentage int
}
// current status of a horizontal pod autoscaler
type HorizontalPodAutoscalerStatus struct {
// most recent generation observed by this autoscaler.
ObservedGeneration *int64
// last time the HorizontalPodAutoscaler scaled the number of pods;
// used by the autoscaler to control how often the number of pods is changed.
LastScaleTime *unversioned.Time
// current number of replicas of pods managed by this autoscaler.
CurrentReplicas int
// desired number of replicas of pods managed by this autoscaler.
DesiredReplicas int
// current average CPU utilization over all pods, represented as a percentage of requested CPU,
// e.g. 70 means that an average pod is using now 70% of its requested CPU.
CurrentCPUUtilizationPercentage *int
}
{% endhighlight %}
`ScaleRef` is a reference to the Scale subresource.
`MinReplicas`, `MaxReplicas` and `CPUUtilization` define autoscaler configuration.
We are also introducing HorizontalPodAutoscalerList object to enable listing all autoscalers in a namespace:
{% highlight go %}
// list of horizontal pod autoscaler objects.
type HorizontalPodAutoscalerList struct {
unversioned.TypeMeta
unversioned.ListMeta
// list of horizontal pod autoscaler objects.
Items []HorizontalPodAutoscaler
}
{% endhighlight %}
## Autoscaling Algorithm
The autoscaler is implemented as a control loop. It periodically queries the pods described by `Status.Selector` of the Scale subresource, and collects their CPU utilization.
Then, it compares the arithmetic mean of the pods' CPU utilization with the target defined in `Spec.CPUUtilization`,
and adjusts the replicas of the Scale if needed to match the target
(preserving condition: MinReplicas <= Replicas <= MaxReplicas).
The period of the autoscaler is controlled by the `--horizontal-pod-autoscaler-sync-period` flag of the controller manager.
The default value is 30 seconds.
CPU utilization is the recent CPU usage of a pod (average across the last 1 minute) divided by the CPU requested by the pod.
In Kubernetes version 1.1, CPU usage is taken directly from Heapster.
In the future, there will be an API on the master for this purpose
(see [#11951](https://github.com/kubernetes/kubernetes/issues/11951)).
The target number of pods is calculated from the following formula:
```
TargetNumOfPods = ceil(sum(CurrentPodsCPUUtilization) / Target)
```
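As a worked example of the formula, the following sketch computes the target and clamps it to the `MinReplicas`/`MaxReplicas` bounds mentioned earlier. The function name `targetNumOfPods` and the sample values are assumptions chosen for illustration; utilization and target are both expressed as percentages of requested CPU.

```go
package main

import (
	"fmt"
	"math"
)

// targetNumOfPods returns ceil(sum(utilization) / target), clamped to the
// range [minReplicas, maxReplicas].
func targetNumOfPods(utilization []float64, target float64, minReplicas, maxReplicas int) int {
	sum := 0.0
	for _, u := range utilization {
		sum += u
	}
	n := int(math.Ceil(sum / target))
	if n < minReplicas {
		return minReplicas
	}
	if n > maxReplicas {
		return maxReplicas
	}
	return n
}

func main() {
	// Three pods at 90%, 80%, and 100% of requested CPU with a 50% target:
	// ceil(270 / 50) = ceil(5.4) = 6 pods.
	fmt.Println(targetNumOfPods([]float64{90, 80, 100}, 50, 1, 10))
	// The same load with MaxReplicas = 4 is capped at 4 pods.
	fmt.Println(targetNumOfPods([]float64{90, 80, 100}, 50, 1, 4))
}
```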
Starting and stopping pods may introduce noise to the metric (for instance, starting may temporarily increase CPU).
So, after each action, the autoscaler should wait some time for reliable data.
Scale-up can only happen if there was no rescaling within the last 3 minutes.
Scale-down will wait for 5 minutes from the last rescaling.
Moreover, any scaling will only be made if `avg(CurrentPodsConsumption) / Target` drops below 0.9 or increases above 1.1 (10% tolerance).
This approach has two benefits:
* Autoscaler works in a conservative way.
If new user load appears, it is important for us to rapidly increase the number of pods,
so that user requests will not be rejected.
Lowering the number of pods is not that urgent.
* Autoscaler avoids thrashing, i.e. it prevents rapid execution of conflicting decisions if the load is not stable.
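The tolerance and waiting rules can be combined into a single guard, sketched below. This is an illustration of the rules as stated in this section, not the controller's code; `shouldRescale` and the constant names are hypothetical, and the window boundaries are treated as inclusive by assumption.

```go
package main

import (
	"fmt"
	"time"
)

const (
	tolerance       = 0.1             // 10% band around the target
	upscaleWindow   = 3 * time.Minute // minimum wait before scaling up
	downscaleWindow = 5 * time.Minute // minimum wait before scaling down
)

// shouldRescale reports whether the autoscaler may act now. ratio is
// avg(CurrentPodsConsumption) / Target, and sinceLastScale is the time
// elapsed since the last rescaling.
func shouldRescale(ratio float64, sinceLastScale time.Duration) bool {
	switch {
	case ratio > 1+tolerance:
		return sinceLastScale >= upscaleWindow
	case ratio < 1-tolerance:
		return sinceLastScale >= downscaleWindow
	default:
		return false // within tolerance: leave the replica count alone
	}
}

func main() {
	fmt.Println(shouldRescale(1.5, 4*time.Minute))   // overloaded, past the scale-up window
	fmt.Println(shouldRescale(0.5, 4*time.Minute))   // underloaded, still inside the scale-down window
	fmt.Println(shouldRescale(1.05, 10*time.Minute)) // within the 10% tolerance
}
```

The shorter scale-up window reflects the first benefit above: adding pods under load is more urgent than removing them.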
## Relative vs. absolute metrics
We chose values of the target metric to be relative (e.g. 90% of requested CPU resource) rather than absolute (e.g. 0.6 core) for the following reason.
If we chose an absolute metric, the user would need to guarantee that the target is lower than the request.
Otherwise, overloaded pods may not be able to consume more than the autoscaler's absolute target utilization,
thereby preventing the autoscaler from seeing high enough utilization to trigger it to scale up.
This may be especially troublesome when the user changes the requested resources for a pod,
because they would also need to change the autoscaler utilization threshold.
Therefore, we decided to choose a relative metric.
For the user, it is enough to set it to a value smaller than 100%, and further changes of requested resources will not invalidate it.
## Support in kubectl
To make manipulation of HorizontalPodAutoscaler objects simpler, we added support for
creating/updating/deleting/listing of HorizontalPodAutoscaler to kubectl.
In addition, in the future, we are planning to add kubectl support for the following use-cases:
* When creating a replication controller or deployment with `kubectl create [-f]`, there should be
a possibility to specify an additional autoscaler object.
(This should work out-of-the-box when creation of autoscaler is supported by kubectl as we may include
multiple objects in the same config file).
* *[future]* When running an image with `kubectl run`, there should be an additional option to create
an autoscaler for it.
* *[future]* We will add a new command `kubectl autoscale` that will allow for easy creation of an autoscaler object
for already existing replication controller/deployment.
## Next steps
We list here some features that are not supported in Kubernetes version 1.1.
However, we want to keep them in mind, as they will most probably be needed in future.
Our design is in general compatible with them.
* *[future]* **Autoscale pods based on metrics different than CPU** (e.g. memory, network traffic, qps).
This includes scaling based on a custom/application metric.
* *[future]* **Autoscale pods based on an aggregate metric.**
Autoscaler, instead of computing the average of a target metric across pods, will use a single, external metric (e.g. qps metric from load balancer).
The metric will be aggregated while the target will remain per-pod
(e.g. when observing 100 qps on load balancer while the target is 20 qps per pod, autoscaler will set the number of replicas to 5).
* *[future]* **Autoscale pods based on multiple metrics.**
If the target numbers of pods for different metrics are different, choose the largest target number of pods.
* *[future]* **Scale the number of pods starting from 0.**
All pods can be turned-off, and then turned-on when there is a demand for them.
When a request to service with no pods arrives, kube-proxy will generate an event for autoscaler
to create a new pod.
Discussed in [#3247](https://github.com/kubernetes/kubernetes/issues/3247).
* *[future]* **When scaling down, make a more educated decision about which pods to kill.**
E.g.: if two or more pods from the same replication controller are on the same node, kill one of them.
Discussed in [#4301](https://github.com/kubernetes/kubernetes/issues/4301).

View File

@ -1,96 +0,0 @@
---
title: "Identifiers and Names in Kubernetes"
---
A summarization of the goals and recommendations for identifiers in Kubernetes. Described in [GitHub issue #199](http://issue.k8s.io/199).
## Definitions
UID
: A non-empty, opaque, system-generated value guaranteed to be unique in time and space; intended to distinguish between historical occurrences of similar entities.
Name
: A non-empty string guaranteed to be unique within a given scope at a particular time; used in resource URLs; provided by clients at creation time and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish distinct entities, and reference particular entities across operations.
[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) label (DNS_LABEL)
: An alphanumeric (a-z, and 0-9) string, with a maximum length of 63 characters, with the '-' character allowed anywhere except the first or last character, suitable for use as a hostname or segment in a domain name
[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) subdomain (DNS_SUBDOMAIN)
: One or more lowercase rfc1035/rfc1123 labels separated by '.' with a maximum length of 253 characters
[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) universally unique identifier (UUID)
: A 128 bit generated value that is extremely unlikely to collide across time and space and requires no central coordination
[rfc6335](https://tools.ietf.org/rfc/rfc6335.txt) port name (IANA_SVC_NAME)
: An alphanumeric (a-z, and 0-9) string, with a maximum length of 15 characters, with the '-' character allowed anywhere except the first or the last character or adjacent to another '-' character; it must contain at least one (a-z) character
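As an illustration of the DNS_LABEL and DNS_SUBDOMAIN definitions above, here is a minimal Go sketch of the two checks. The function names `isDNSLabel` and `isDNSSubdomain` and the regular expression are assumptions made for this example; they are not taken from the Kubernetes source.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// dnsLabelRE accepts lowercase alphanumerics, with '-' allowed anywhere
// except the first or last character.
var dnsLabelRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// isDNSLabel reports whether s fits the DNS_LABEL definition (max 63 chars).
func isDNSLabel(s string) bool {
	return len(s) <= 63 && dnsLabelRE.MatchString(s)
}

// isDNSSubdomain reports whether s is one or more DNS labels separated
// by '.', with a total length of at most 253 characters.
func isDNSSubdomain(s string) bool {
	if len(s) == 0 || len(s) > 253 {
		return false
	}
	for _, label := range strings.Split(s, ".") {
		if !isDNSLabel(label) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isDNSLabel("backend-x4eb1"))      // valid label
	fmt.Println(isDNSLabel("-backend"))           // leading '-' is rejected
	fmt.Println(isDNSSubdomain("guestbook.user")) // two valid labels
}
```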
## Objectives for names and UIDs
1. Uniquely identify (via a UID) an object across space and time
2. Uniquely name (via a name) an object across space
3. Provide human-friendly names in API operations and/or configuration files
4. Allow idempotent creation of API resources (#148) and enforcement of space-uniqueness of singleton objects
5. Allow DNS names to be automatically generated for some objects
## General design
1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must be specified. Name must be non-empty and unique within the apiserver. This enables idempotent and space-unique creation operations. Parts of the system (e.g. replication controller) may join strings (e.g. a base name and a random suffix) to create a unique Name. For situations where generating a name is impractical, some or all objects may support a param to auto-generate a name. Generating random names will defeat idempotency.
* Examples: "guestbook.user", "backend-x4eb1"
2. When an object is created via an API, a Namespace string (a DNS_SUBDOMAIN? format TBD via #1114) may be specified. Depending on the API receiver, namespaces might be validated (e.g. apiserver might ensure that the namespace actually exists). If a namespace is not specified, one will be assigned by the API receiver. This assignment policy might vary across API receivers (e.g. apiserver might have a default, kubelet might generate something semi-random).
* Example: "api.k8s.example.com"
3. Upon acceptance of an object via an API, the object is assigned a UID (a UUID). UID must be non-empty and unique across space and time.
* Example: "01234567-89ab-cdef-0123-456789abcdef"
## Case study: Scheduling a pod
Pods can be placed onto a particular node in a number of ways. This case
study demonstrates how the above design can be applied to satisfy the
objectives.
### A pod scheduled by a user through the apiserver
1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver.
2. The apiserver validates the input.
1. A default Namespace is assigned.
2. The pod name must be space-unique within the Namespace.
3. Each container within the pod has a name which must be space-unique within the pod.
3. The pod is accepted.
1. A new UID is assigned.
4. The pod is bound to a node.
1. The kubelet on the node is passed the pod's UID, Namespace, and Name.
5. Kubelet validates the input.
6. Kubelet runs the pod.
1. Each container is started up with enough metadata to distinguish the pod from whence it came.
2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
* This may correspond to Docker's container ID.
### A pod placed by a config file on the node
1. A config file is stored on the node, containing a pod with UID="", Namespace="", and Name="cadvisor".
2. Kubelet validates the input.
1. Since UID is not provided, kubelet generates one.
2. Since Namespace is not provided, kubelet generates one.
1. The generated namespace should be deterministic and cluster-unique for the source, such as a hash of the hostname and file path.
* E.g. Namespace="file-f4231812554558a718a01ca942782d81"
3. Kubelet runs the pod.
1. Each container is started up with enough metadata to distinguish the pod from whence it came.
2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
1. This may correspond to Docker's container ID.
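The deterministic namespace mentioned for file-sourced pods could be derived along the following lines. This is a sketch under stated assumptions: the document only says "such as a hash of the hostname and file path", so the choice of MD5, the `:` separator, and the function name `fileSourceNamespace` are all hypothetical.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// fileSourceNamespace is a hypothetical helper: it hashes the hostname and
// the config file path so that the same source always yields the same
// namespace, while different sources are very unlikely to collide.
func fileSourceNamespace(hostname, path string) string {
	sum := md5.Sum([]byte(hostname + ":" + path))
	return fmt.Sprintf("file-%x", sum)
}

func main() {
	// The hostname and path below are made-up example inputs.
	fmt.Println(fileSourceNamespace("node-1.example.com", "/etc/kubernetes/manifests/cadvisor.json"))
}
```

Because the result depends only on its inputs, restarting the kubelet does not change the namespace of a pod read from the same file.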

View File

@ -1,21 +0,0 @@
---
title: "Kubernetes Design Overview"
---
Kubernetes is a system for managing containerized applications across multiple hosts, providing basic mechanisms for deployment, maintenance, and scaling of applications.
Kubernetes establishes robust declarative primitives for maintaining the desired state requested by the user. We see these primitives as the main value added by Kubernetes. Self-healing mechanisms, such as auto-restarting, re-scheduling, and replicating containers require active controllers, not just imperative orchestration.
Kubernetes is primarily targeted at applications composed of multiple containers, such as elastic, distributed micro-services. It is also designed to facilitate migration of non-containerized application stacks to Kubernetes. It therefore includes abstractions for grouping containers in both loosely coupled and tightly coupled formations, and provides ways for containers to find and communicate with each other in relatively familiar ways.
Kubernetes enables users to ask a cluster to run a set of containers. The system automatically chooses hosts to run those containers on. While Kubernetes's scheduler is currently very simple, we expect it to grow in sophistication over time. Scheduling is a policy-rich, topology-aware, workload-specific function that significantly impacts availability, performance, and capacity. The scheduler needs to take into account individual and collective resource requirements, quality of service requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, deadlines, and so on. Workload-specific requirements will be exposed through the API as necessary.
Kubernetes is intended to run on a number of cloud providers, as well as on physical hosts.
A single Kubernetes cluster is not intended to span multiple availability zones. Instead, we recommend building a higher-level layer to replicate complete deployments of highly available applications across multiple zones (see [the multi-cluster doc](../admin/multi-cluster) and [cluster federation proposal](../proposals/federation) for more details).
Finally, Kubernetes aspires to be an extensible, pluggable, building-block OSS platform and toolkit. Therefore, architecturally, we want Kubernetes to be built as a collection of pluggable components and layers, with the ability to use alternative schedulers, controllers, storage systems, and distribution mechanisms, and we're evolving its current code in that direction. Furthermore, we want others to be able to extend Kubernetes functionality, such as with higher-level PaaS functionality or multi-cluster layers, without modification of core Kubernetes source. Therefore, its API isn't just (or even necessarily mainly) targeted at end users, but at tool and extension developers. Its APIs are intended to serve as the foundation for an open ecosystem of tools, automation systems, and higher-level API layers. Consequently, there are no "internal" inter-component APIs. All APIs are visible and available, including the APIs used by the scheduler, the node controller, the replication-controller manager, Kubelet's API, etc. There's no glass to break -- in order to handle more complex use cases, one can just access the lower-level APIs in a fully transparent, composable manner.
For more about the Kubernetes architecture, see [architecture](architecture).

View File

@ -1,353 +0,0 @@
---
title: "Namespaces"
---
## Abstract
A Namespace is a mechanism to partition resources created by users into
a logically named group.
## Motivation
A single cluster should be able to satisfy the needs of multiple user communities.
Each user community wants to be able to work in isolation from other communities.
Each user community has its own:
1. resources (pods, services, replication controllers, etc.)
2. policies (who can or cannot perform actions in their community)
3. constraints (this community is allowed this much quota, etc.)
A cluster operator may create a Namespace for each unique user community.
The Namespace provides a unique scope for:
1. named resources (to avoid basic naming collisions)
2. delegated management authority to trusted users
3. ability to limit community resource consumption
## Use cases
1. As a cluster operator, I want to support multiple user communities on a single cluster.
2. As a cluster operator, I want to delegate authority to partitions of the cluster to trusted users
in those communities.
3. As a cluster operator, I want to limit the amount of resources each community can consume in order
to limit the impact to other communities using the cluster.
4. As a cluster user, I want to interact with resources that are pertinent to my user community in
isolation of what other user communities are doing on the cluster.
## Design
### Data Model
A *Namespace* defines a logically named group for multiple *Kind*s of resources.
{% highlight go %}
type Namespace struct {
TypeMeta `json:",inline"`
ObjectMeta `json:"metadata,omitempty"`
Spec NamespaceSpec `json:"spec,omitempty"`
Status NamespaceStatus `json:"status,omitempty"`
}
{% endhighlight %}
A *Namespace* name is a DNS compatible label.
A *Namespace* must exist prior to associating content with it.
A *Namespace* must not be deleted if there is content associated with it.
To associate a resource with a *Namespace* the following conditions must be satisfied:
1. The resource's *Kind* must be registered as having *RESTScopeNamespace* with the server
2. The resource's *TypeMeta.Namespace* field must have a value that references an existing *Namespace*
The *Name* of a resource associated with a *Namespace* is unique to that *Kind* in that *Namespace*.
It is intended to be used in resource URLs; provided by clients at creation time, and encouraged to be
human friendly; intended to facilitate idempotent creation, space-uniqueness of singleton objects,
distinguish distinct entities, and reference particular entities across operations.
### Authorization
A *Namespace* provides an authorization scope for accessing content associated with the *Namespace*.
See [Authorization plugins](../admin/authorization)
### Limit Resource Consumption
A *Namespace* provides a scope to limit resource consumption.
A *LimitRange* defines min/max constraints on the amount of resources a single entity can consume in
a *Namespace*.
See [Admission control: Limit Range](admission_control_limit_range)
A *ResourceQuota* tracks aggregate usage of resources in the *Namespace* and allows cluster operators
to define *Hard* resource usage limits that a *Namespace* may consume.
See [Admission control: Resource Quota](admission_control_resource_quota)
### Finalizers
Upon creation of a *Namespace*, the creator may provide a list of *Finalizer* objects.
{% highlight go %}
type FinalizerName string
// These are internal finalizers to Kubernetes, must be qualified name unless defined here
const (
FinalizerKubernetes FinalizerName = "kubernetes"
)
// NamespaceSpec describes the attributes on a Namespace
type NamespaceSpec struct {
// Finalizers is an opaque list of values that must be empty to permanently remove object from storage
Finalizers []FinalizerName
}
{% endhighlight %}
A *FinalizerName* is a qualified name.
The API Server enforces that a *Namespace* can be deleted from storage if and only if
its *Namespace.Spec.Finalizers* is empty.
A *finalize* operation is the only mechanism to modify the *Namespace.Spec.Finalizers* field post creation.
Each *Namespace* created has *kubernetes* as an item in its list of initial *Namespace.Spec.Finalizers*
set by default.
### Phases
A *Namespace* may exist in the following phases.
{% highlight go %}
type NamespacePhase string
const (
NamespaceActive NamespacePhase = "Active"
NamespaceTerminating NamespacePhase = "Terminating"
)
type NamespaceStatus struct {
...
Phase NamespacePhase
}
{% endhighlight %}
A *Namespace* is in the **Active** phase if it does not have an *ObjectMeta.DeletionTimestamp*.
A *Namespace* is in the **Terminating** phase if it has an *ObjectMeta.DeletionTimestamp*.
**Active**
Upon creation, a *Namespace* goes in the *Active* phase. This means that content may be associated with
a namespace, and all normal interactions with the namespace are allowed to occur in the cluster.
If a DELETE request occurs for a *Namespace*, the *Namespace.ObjectMeta.DeletionTimestamp* is set
to the current server time. A *namespace controller* observes the change, and sets the *Namespace.Status.Phase*
to *Terminating*.
**Terminating**
A *namespace controller* watches for *Namespace* objects that have a *Namespace.ObjectMeta.DeletionTimestamp*
value set in order to know when to initiate graceful termination of the content associated with the *Namespace* that
is known to the cluster.
The *namespace controller* enumerates each known resource type in that namespace and deletes it one by one.
Admission control blocks creation of new resources in that namespace in order to prevent a race-condition
where the controller could believe all of a given resource type had been deleted from the namespace,
when in fact some other rogue client agent had created new objects. Using admission control in this
scenario means the registry implementations for the individual objects do not need to take the Namespace life-cycle into account.
Once all objects known to the *namespace controller* have been deleted, the *namespace controller*
executes a *finalize* operation on the namespace that removes the *kubernetes* value from
the *Namespace.Spec.Finalizers* list.
If the *namespace controller* sees a *Namespace* whose *ObjectMeta.DeletionTimestamp* is set, and
whose *Namespace.Spec.Finalizers* list is empty, it will signal the server to permanently remove
the *Namespace* from storage by sending a final DELETE action to the API server.
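The interplay of phases and finalizers described above can be sketched with simplified stand-in types. The `namespaceState` struct, the helper names, and the `example.com/agent` finalizer are hypothetical and exist only for this illustration; the real API objects carry a deletion timestamp rather than a boolean.

```go
package main

import "fmt"

type FinalizerName string

const FinalizerKubernetes FinalizerName = "kubernetes"

// namespaceState is a simplified stand-in for the Namespace API object.
type namespaceState struct {
	Deleting   bool // true once a DeletionTimestamp has been set
	Finalizers []FinalizerName
}

// phaseOf derives the phase from the presence of a deletion timestamp.
func phaseOf(ns namespaceState) string {
	if ns.Deleting {
		return "Terminating"
	}
	return "Active"
}

// finalize returns a copy of ns with one finalizer removed.
func finalize(ns namespaceState, done FinalizerName) namespaceState {
	remaining := make([]FinalizerName, 0, len(ns.Finalizers))
	for _, f := range ns.Finalizers {
		if f != done {
			remaining = append(remaining, f)
		}
	}
	ns.Finalizers = remaining
	return ns
}

// removable reports whether the final DELETE may be sent to the API server.
func removable(ns namespaceState) bool {
	return ns.Deleting && len(ns.Finalizers) == 0
}

func main() {
	ns := namespaceState{Finalizers: []FinalizerName{"example.com/agent", FinalizerKubernetes}}
	ns.Deleting = true // a DELETE request sets the deletion timestamp
	ns = finalize(ns, FinalizerKubernetes)
	fmt.Println(phaseOf(ns), removable(ns)) // another finalizer still blocks removal
	ns = finalize(ns, "example.com/agent")
	fmt.Println(phaseOf(ns), removable(ns)) // list is empty, removal may proceed
}
```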
### REST API
To interact with the Namespace API:
| Action | HTTP Verb | Path | Description |
| ------ | --------- | ---- | ----------- |
| CREATE | POST | /api/{version}/namespaces | Create a namespace |
| LIST | GET | /api/{version}/namespaces | List all namespaces |
| UPDATE | PUT | /api/{version}/namespaces/{namespace} | Update namespace {namespace} |
| DELETE | DELETE | /api/{version}/namespaces/{namespace} | Delete namespace {namespace} |
| FINALIZE | POST | /api/{version}/namespaces/{namespace}/finalize | Finalize namespace {namespace} |
| WATCH | GET | /api/{version}/watch/namespaces | Watch all namespaces |
This specification reserves the name *finalize* as a sub-resource to namespace.
As a consequence, it is invalid to have a *resourceType* managed by a namespace whose kind is *finalize*.
To interact with content associated with a Namespace:
| Action | HTTP Verb | Path | Description |
| ---- | ---- | ---- | ---- |
| CREATE | POST | /api/{version}/namespaces/{namespace}/{resourceType}/ | Create instance of {resourceType} in namespace {namespace} |
| GET | GET | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Get instance of {resourceType} in namespace {namespace} with {name} |
| UPDATE | PUT | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Update instance of {resourceType} in namespace {namespace} with {name} |
| DELETE | DELETE | /api/{version}/namespaces/{namespace}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {namespace} with {name} |
| LIST | GET | /api/{version}/namespaces/{namespace}/{resourceType} | List instances of {resourceType} in namespace {namespace} |
| WATCH | GET | /api/{version}/watch/namespaces/{namespace}/{resourceType} | Watch for changes to a {resourceType} in namespace {namespace} |
| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces |
| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces |
The API server verifies the *Namespace* on resource creation matches the *{namespace}* on the path.
The API server will associate a resource with a *Namespace* if not populated by the end-user based on the *Namespace* context
of the incoming request. If the *Namespace* of the resource being created or updated does not match the *Namespace* on the request,
then the API server will reject the request.
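The defaulting and rejection rule above amounts to a small check. A minimal sketch, assuming a hypothetical helper `resolveNamespace` that receives the namespace from the request path and the one set on the object:

```go
package main

import "fmt"

// resolveNamespace is a hypothetical helper: an empty object namespace is
// filled in from the request context, and a mismatch is rejected.
func resolveNamespace(requestNS, objectNS string) (string, error) {
	if objectNS == "" {
		return requestNS, nil
	}
	if objectNS != requestNS {
		return "", fmt.Errorf("namespace %q on the object does not match namespace %q on the request", objectNS, requestNS)
	}
	return objectNS, nil
}

func main() {
	ns, err := resolveNamespace("development", "")
	fmt.Println(ns, err) // the object inherits the request's namespace

	_, err = resolveNamespace("development", "production")
	fmt.Println(err) // the mismatch is reported as an error
}
```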
### Storage
A namespace provides a unique identifier space and therefore must be in the storage path of a resource.
In etcd, we want to continue to support efficient WATCH across namespaces.
Resources that persist content in etcd will have storage paths as follows:
/{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name}
This enables consumers to WATCH /registry/{resourceType} for changes across all namespaces for a particular {resourceType}.
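To make the key layout concrete, the following sketch builds a storage path and the corresponding cross-namespace watch prefix. The helper names are hypothetical, and `registry` is used as the storage prefix only because the sentence above mentions `/registry/{resourceType}`.

```go
package main

import (
	"fmt"
	"path"
)

// storageKey is a hypothetical helper that assembles
// /{k8s_storage_prefix}/{resourceType}/{resource.Namespace}/{resource.Name}.
func storageKey(prefix, resourceType, namespace, name string) string {
	return path.Join("/", prefix, resourceType, namespace, name)
}

// watchPrefix returns the key under which every namespace's objects of a
// given resource type are stored.
func watchPrefix(prefix, resourceType string) string {
	return path.Join("/", prefix, resourceType)
}

func main() {
	fmt.Println(storageKey("registry", "pods", "development", "guestbook"))
	fmt.Println(watchPrefix("registry", "pods"))
}
```

Placing the resource type before the namespace is what lets a single watch on the prefix cover all namespaces.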
### Kubelet
The kubelet will register pods it sources from a file or http source with a namespace associated with the
*cluster-id*.
### Example: OpenShift Origin managing a Kubernetes Namespace
In this example, we demonstrate how the design allows for agents built on-top of
Kubernetes that manage their own set of resource types associated with a *Namespace*
to take part in Namespace termination.
OpenShift creates a Namespace in Kubernetes
{% highlight json %}
{
"apiVersion":"v1",
"kind": "Namespace",
"metadata": {
"name": "development",
"labels": {
"name": "development"
}
},
"spec": {
"finalizers": ["openshift.com/origin", "kubernetes"]
},
"status": {
"phase": "Active"
}
}
{% endhighlight %}
OpenShift then goes and creates a set of resources (pods, services, etc) associated
with the "development" namespace. It also creates its own set of resources in its
own storage associated with the "development" namespace unknown to Kubernetes.
A user deletes the Namespace in Kubernetes, and the Namespace now has the following state:
{% highlight json %}
{
"apiVersion":"v1",
"kind": "Namespace",
"metadata": {
"name": "development",
"deletionTimestamp": "...",
"labels": {
"name": "development"
}
},
"spec": {
"finalizers": ["openshift.com/origin", "kubernetes"]
},
"status": {
"phase": "Terminating"
}
}
{% endhighlight %}
The Kubernetes *namespace controller* observes the namespace has a *deletionTimestamp*
and begins to terminate all of the content in the namespace that it knows about. Upon
success, it executes a *finalize* action that modifies the *Namespace* by
removing *kubernetes* from the list of finalizers:
{% highlight json %}
{
"apiVersion":"v1",
"kind": "Namespace",
"metadata": {
"name": "development",
"deletionTimestamp": "...",
"labels": {
"name": "development"
}
},
"spec": {
"finalizers": ["openshift.com/origin"]
},
"status": {
"phase": "Terminating"
}
}
{% endhighlight %}
OpenShift Origin has its own *namespace controller* that is observing cluster state, and
it observes the same namespace had a *deletionTimestamp* assigned to it. It too will go
and purge resources from its own storage that it manages associated with that namespace.
Upon completion, it executes a *finalize* action and removes the reference to "openshift.com/origin"
from the list of finalizers.
This results in the following state:
{% highlight json %}
{
"apiVersion":"v1",
"kind": "Namespace",
"metadata": {
"name": "development",
"deletionTimestamp": "...",
"labels": {
"name": "development"
}
},
"spec": {
"finalizers": []
},
"status": {
"phase": "Terminating"
}
}
{% endhighlight %}
At this point, the Kubernetes *namespace controller* in its sync loop will see that the namespace
has a deletion timestamp and that its list of finalizers is empty. As a result, it knows all
content associated with that namespace has been purged. It performs a final DELETE action
to remove that Namespace from the storage.
At this point, all content associated with that Namespace, and the Namespace itself, are gone.

View File

@ -1,182 +0,0 @@
---
title: "Networking"
---
There are 4 distinct networking problems to solve:
1. Highly-coupled container-to-container communications
2. Pod-to-Pod communications
3. Pod-to-Service communications
4. External-to-internal communications
## Model and motivation
Kubernetes deviates from the default Docker networking model (though as of
Docker 1.8 their network plugins are getting closer). The goal is for each pod
to have an IP in a flat shared networking namespace that has full communication
with other physical computers and containers across the network. IP-per-pod
creates a clean, backward-compatible model where pods can be treated much like
VMs or physical hosts from the perspectives of port allocation, networking,
naming, service discovery, load balancing, application configuration, and
migration.
Dynamic port allocation, on the other hand, requires supporting both static
ports (e.g., for externally accessible services) and dynamically allocated
ports, requires partitioning centrally allocated and locally acquired dynamic
ports, complicates scheduling (since ports are a scarce resource), is
inconvenient for users, complicates application configuration, is plagued by
port conflicts and reuse and exhaustion, requires non-standard approaches to
naming (e.g. consul or etcd rather than DNS), requires proxies and/or
redirection for programs using standard naming/addressing mechanisms (e.g. web
browsers), requires watching and cache invalidation for address/port changes
for instances in addition to watching group membership changes, and obstructs
container/pod migration (e.g. using CRIU). NAT introduces additional complexity
by fragmenting the addressing space, which breaks self-registration mechanisms,
among other problems.
## Container to container
All containers within a pod behave as if they are on the same host with regard
to networking. They can all reach each other's ports on localhost. This offers
simplicity (static ports known a priori), security (ports bound to localhost
are visible within the pod but never outside it), and performance. This also
reduces friction for applications moving from the world of uncontainerized apps
on physical or virtual hosts. People running application stacks together on
the same host have already figured out how to make ports not conflict and have
arranged for clients to find them.
The approach does reduce isolation between containers within a pod &mdash;
ports could conflict, and there can be no container-private ports, but these
seem to be relatively minor issues with plausible future workarounds. Besides,
the premise of pods is that containers within a pod share some resources
(volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation.
Additionally, the user can control what containers belong to the same pod
whereas, in general, they don't control what pods land together on a host.
## Pod to pod
Because every pod gets a "real" (not machine-private) IP address, pods can
communicate without proxies or translations. The pod can use well-known port
numbers and can avoid the use of higher-level service discovery systems like
DNS-SD, Consul, or Etcd.
When any container calls ioctl(SIOCGIFADDR) (get the address of an interface),
it sees the same IP that any peer container would see them coming from &mdash;
each pod has its own IP address that other pods can know. By making IP addresses
and ports the same both inside and outside the pods, we create a NAT-less, flat
address space. Running "ip addr show" should work as expected. This would enable
all existing naming/discovery mechanisms to work out of the box, including
self-registration mechanisms and applications that distribute IP addresses. We
should be optimizing for inter-pod network communication. Within a pod,
containers are more likely to use communication through volumes (e.g., tmpfs) or
IPC.
This is different from the standard Docker model. In that model, each container
gets an IP in the 172-dot space and would only see that 172-dot address from
SIOCGIFADDR. If these containers connect to another container the peer would see
the connection coming from a different IP than the container itself knows. In short
&mdash; you can never self-register anything from a container, because a
container cannot be reached on its private IP.
An alternative we considered was an additional layer of addressing: pod-centric
IP per container. Each container would have its own local IP address, visible
only within that pod. This would perhaps make it easier for containerized
applications to move from physical/virtual hosts to pods, but would be more
complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS)
and to reason about, due to the additional layer of address translation, and
would break self-registration and IP distribution mechanisms.
Like Docker, ports can still be published to the host node's interface(s), but
the need for this is radically diminished.
## Implementation
For the Google Compute Engine cluster configuration scripts, we use [advanced
routing rules](https://developers.google.com/compute/docs/networking#routing)
and ip-forwarding-enabled VMs so that each VM has an extra 256 IP addresses that
get routed to it. This is in addition to the 'main' IP address assigned to the
VM that is NAT-ed for Internet access. The container bridge (called `cbr0` to
differentiate it from `docker0`) is set up outside of Docker proper.
Example of GCE's advanced routing rules:
{% highlight sh %}
gcloud compute routes add "${MINION_NAMES[$i]}" \
--project "${PROJECT}" \
--destination-range "${MINION_IP_RANGES[$i]}" \
--network "${NETWORK}" \
--next-hop-instance "${MINION_NAMES[$i]}" \
--next-hop-instance-zone "${ZONE}" &
{% endhighlight %}
GCE itself does not know anything about these IPs, though. This means that when
a pod tries to egress beyond GCE's project the packets must be SNAT'ed
(masqueraded) to the VM's IP, which GCE recognizes and allows.
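Giving each VM 256 extra addresses corresponds to routing one /24 block per node. The Go sketch below shows one way such per-node ranges could be carved out of a larger range. It is an illustration only: the function name `nodePodRange` and the `10.244.0.0/16` cluster range are assumptions, and the GCE scripts referenced above may compute their ranges differently.

```go
package main

import (
	"fmt"
	"net"
)

// nodePodRange is a hypothetical helper that returns the i-th /24 block
// (256 addresses) inside the given IPv4 cluster range.
func nodePodRange(clusterCIDR string, i int) (string, error) {
	_, ipnet, err := net.ParseCIDR(clusterCIDR)
	if err != nil {
		return "", err
	}
	ones, bits := ipnet.Mask.Size()
	if bits != 32 || ones > 24 {
		return "", fmt.Errorf("need an IPv4 range of /24 or larger, got %s", clusterCIDR)
	}
	if i < 0 || i >= 1<<uint(24-ones) {
		return "", fmt.Errorf("node index %d does not fit in %s", i, clusterCIDR)
	}
	base := ipnet.IP.To4()
	n := uint32(base[0])<<24 | uint32(base[1])<<16 | uint32(base[2])<<8 | uint32(base[3])
	n += uint32(i) << 8 // step forward by i blocks of 256 addresses
	return fmt.Sprintf("%d.%d.%d.0/24", byte(n>>24), byte(n>>16), byte(n>>8)), nil
}

func main() {
	for i := 0; i < 3; i++ {
		r, err := nodePodRange("10.244.0.0/16", i)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("node %d -> %s\n", i, r)
	}
}
```

Each resulting range would then be the destination of one route pointing at the corresponding VM.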
### Other implementations
With the primary aim of providing the IP-per-pod model, other implementations exist
to serve the same purpose outside of GCE.
- [OpenVSwitch with GRE/VxLAN](../admin/ovs-networking)
- [Flannel](https://github.com/coreos/flannel#flannel)
- [L2 networks](http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/)
("With Linux Bridge devices" section)
- [Weave](https://github.com/zettio/weave) is yet another way to build an
overlay network, primarily aiming at Docker integration.
- [Calico](https://github.com/Metaswitch/calico) uses BGP to enable real
container IPs.
## Pod to service
The [service](../user-guide/services) abstraction provides a way to group pods under a
common access policy (e.g. load-balanced). The implementation of this creates a
virtual IP which clients can access and which is transparently proxied to the
pods in a Service. Each node runs a kube-proxy process which programs
`iptables` rules to trap access to service IPs and redirect them to the correct
backends. This provides a highly-available load-balancing solution with low
performance overhead by balancing client traffic from a node on that same node.
## External to internal
So far the discussion has been about how to access a pod or service from within
the cluster. Accessing a pod from outside the cluster is a bit more tricky. We
want to offer highly-available, high-performance load balancing to target
Kubernetes Services. Most public cloud providers are simply not flexible enough
yet.
The way this is generally implemented is to set up external load balancers (e.g.
GCE's ForwardingRules or AWS's ELB) which target all nodes in a cluster. When
traffic arrives at a node it is recognized as being part of a particular Service
and routed to an appropriate backend Pod. This does mean that some traffic will
get double-bounced on the network. Once cloud providers have better offerings
we can take advantage of those.
## Challenges and future work
### Docker API
Right now, `docker inspect` doesn't show the networking configuration of the
containers, since they derive it from another container. That information should
be exposed somehow.
### External IP assignment
We want to be able to assign IP addresses externally from Docker
[#6743](https://github.com/dotcloud/docker/issues/6743) so that we don't need
to statically allocate fixed-size IP ranges to each node, so that IP addresses
can be made stable across pod infra container restarts
([#2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate
pod migration. Right now, if the pod infra container dies, all the user
containers must be stopped and restarted because the netns of the pod infra
container will change on restart, and any subsequent user container restart
will join that new netns, thereby not being able to see its peers.
Additionally, a change in IP address would encounter DNS caching/TTL problems.
External IP assignment would also simplify DNS support (see below).
### IPv6
IPv6 would also be a nice option, but we can't depend on it yet. Docker support is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974), [Docker issue #6923](https://github.com/dotcloud/docker/issues/6923), [Docker issue #6975](https://github.com/dotcloud/docker/issues/6975). Additionally, direct IPv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-)

View File

@ -1,222 +0,0 @@
---
title: "Persistent Storage"
---
This document proposes a model for managing persistent, cluster-scoped storage for applications requiring long lived data.
### tl;dr
Two new API kinds:
A `PersistentVolume` (PV) is a storage resource provisioned by an administrator. It is analogous to a node. See [Persistent Volume Guide](../user-guide/persistent-volumes/) for how to use it.
A `PersistentVolumeClaim` (PVC) is a user's request for a persistent volume to use in a pod. It is analogous to a pod.
One new system component:
`PersistentVolumeClaimBinder` is a singleton running in master that watches all PersistentVolumeClaims in the system and binds them to the closest matching available PersistentVolume. The volume manager watches the API for newly created volumes to manage.
One new volume:
`PersistentVolumeClaimVolumeSource` references the user's PVC in the same namespace. This volume finds the bound PV and mounts that volume for the pod. A `PersistentVolumeClaimVolumeSource` is, essentially, a wrapper around another type of volume that is owned by someone else (the system).
Kubernetes makes no guarantees at runtime that the underlying storage exists or is available. High availability is left to the storage provider.
### Goals
* Allow administrators to describe available storage
* Allow pod authors to discover and request persistent volumes to use with pods
* Enforce security through access control lists and securing storage to the same namespace as the pod volume
* Enforce quotas through admission control
* Enforce scheduler rules by resource counting
* Ensure developers can rely on storage being available without being closely bound to a particular disk, server, network, or storage device.
#### Describe available storage
Cluster administrators use the API to manage *PersistentVolumes*. A custom store `NewPersistentVolumeOrderedIndex` will index volumes by access modes and sort by storage capacity. The `PersistentVolumeClaimBinder` watches for new claims for storage and binds them to an available volume by matching the volume's characteristics (AccessModes and storage size) to the user's request.
PVs are system objects and, thus, have no namespace.
Many means of dynamic provisioning will eventually be implemented for various storage types.
##### PersistentVolume API
| Action | HTTP Verb | Path | Description |
| ---- | ---- | ---- | ---- |
| CREATE | POST | /api/{version}/persistentvolumes/ | Create instance of PersistentVolume |
| GET | GET | /api/{version}/persistentvolumes/{name} | Get instance of PersistentVolume with {name} |
| UPDATE | PUT | /api/{version}/persistentvolumes/{name} | Update instance of PersistentVolume with {name} |
| DELETE | DELETE | /api/{version}/persistentvolumes/{name} | Delete instance of PersistentVolume with {name} |
| LIST | GET | /api/{version}/persistentvolumes | List instances of PersistentVolume |
| WATCH | GET | /api/{version}/watch/persistentvolumes | Watch for changes to a PersistentVolume |
#### Request Storage
Kubernetes users request persistent storage for their pod by creating a ```PersistentVolumeClaim```. Their request for storage is described by their requirements for resources and mount capabilities.
Requests for volumes are bound to available volumes by the volume manager, if a suitable match is found. Requests for resources can go unfulfilled.
Users attach their claim to their pod using a new ```PersistentVolumeClaimVolumeSource``` volume source.
##### PersistentVolumeClaim API
| Action | HTTP Verb | Path | Description |
| ---- | ---- | ---- | ---- |
| CREATE | POST | /api/{version}/namespaces/{ns}/persistentvolumeclaims/ | Create instance of PersistentVolumeClaim in namespace {ns} |
| GET | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Get instance of PersistentVolumeClaim in namespace {ns} with {name} |
| UPDATE | PUT | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Update instance of PersistentVolumeClaim in namespace {ns} with {name} |
| DELETE | DELETE | /api/{version}/namespaces/{ns}/persistentvolumeclaims/{name} | Delete instance of PersistentVolumeClaim in namespace {ns} with {name} |
| LIST | GET | /api/{version}/namespaces/{ns}/persistentvolumeclaims | List instances of PersistentVolumeClaim in namespace {ns} |
| WATCH | GET | /api/{version}/watch/namespaces/{ns}/persistentvolumeclaims | Watch for changes to PersistentVolumeClaim in namespace {ns} |
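The two path tables differ mainly in scope: PersistentVolume paths are cluster-wide, while PersistentVolumeClaim paths sit under a namespace. A minimal sketch of that difference follows. The helper names `pv_path` and `pvc_path` are hypothetical and are not part of any Kubernetes client library.

```python
def pv_path(version, name=None):
    """Build a PersistentVolume path; PVs are cluster-scoped, so no namespace."""
    base = f"/api/{version}/persistentvolumes"
    return f"{base}/{name}" if name else base


def pvc_path(version, namespace, name=None):
    """Build a PersistentVolumeClaim path; PVCs always live in a namespace."""
    base = f"/api/{version}/namespaces/{namespace}/persistentvolumeclaims"
    return f"{base}/{name}" if name else base


# Illustrative values only; "default" is an assumed namespace name.
print(pv_path("v1", "pv0001"))
print(pvc_path("v1", "default", "myclaim-1"))
```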
#### Scheduling constraints
Scheduling constraints are to be handled similarly to pod resource constraints. Pods will need to be annotated or decorated with the number of resources they require on a node. Similarly, a node will need to list how many it has used or has available.
TBD
#### Events
The implementation of persistent storage will not require events to communicate to the user the state of their claim. The CLI for bound claims contains a reference to the backing persistent volume. This is always present in the API and CLI, making an event to communicate the same unnecessary.
Events that communicate the state of a mounted volume are left to the volume plugins.
### Example
#### Admin provisions storage
An administrator provisions storage by posting PVs to the API. Various ways to automate this task can be scripted. Dynamic provisioning is a future feature that can maintain levels of PVs.
{% highlight yaml %}
POST:

kind: PersistentVolume
apiVersion: v1
metadata:
  name: pv0001
spec:
  capacity:
    storage: 10
  persistentDisk:
    pdName: "abc123"
    fsType: "ext4"
{% endhighlight %}
{% highlight console %}
$ kubectl get pv
NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON
pv0001 map[] 10737418240 RWO Pending
{% endhighlight %}
#### Users request storage
A user requests storage by posting a PVC to the API. Their request contains the AccessModes they wish their volume to have and the minimum size needed.
The user must be within a namespace to create PVCs.
{% highlight yaml %}
POST:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim-1
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3
{% endhighlight %}
{% highlight console %}
$ kubectl get pvc
NAME LABELS STATUS VOLUME
myclaim-1 map[] pending
{% endhighlight %}
#### Matching and binding
The ```PersistentVolumeClaimBinder``` attempts to find an available volume that most closely matches the user's request. If one exists, they are bound by putting a reference on the PV to the PVC. Requests can go unfulfilled if a suitable match is not found.
{% highlight console %}
$ kubectl get pv
NAME LABELS CAPACITY ACCESSMODES STATUS CLAIM REASON
pv0001 map[] 10737418240 RWO Bound myclaim-1 / f4b3d283-c0ef-11e4-8be4-80e6500a981e
$ kubectl get pvc
NAME LABELS STATUS VOLUME
myclaim-1 map[] Bound b16e91d6-c0ef-11e4-8be4-80e6500a981e
{% endhighlight %}
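The matching step can be sketched as follows. This is an illustration only, not the binder's actual code: it assumes that "most closely matches" means the smallest unbound volume whose access modes cover the request and whose capacity is at least the requested size. The `Volume` and `Claim` classes and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Volume:
    name: str
    capacity: int                 # bytes
    access_modes: frozenset
    claim: Optional[str] = None   # name of the bound claim, if any


@dataclass
class Claim:
    name: str
    requested: int                # minimum size needed, in bytes
    access_modes: frozenset


def find_best_match(claim, volumes):
    """Return the smallest unbound volume that satisfies the claim, or None."""
    candidates = [
        v for v in volumes
        if v.claim is None
        and claim.access_modes <= v.access_modes
        and v.capacity >= claim.requested
    ]
    return min(candidates, key=lambda v: v.capacity, default=None)


def bind(claim, volumes):
    """Bind by putting a reference to the claim on the chosen volume."""
    volume = find_best_match(claim, volumes)
    if volume is not None:
        volume.claim = claim.name
    return volume  # None means the request goes unfulfilled for now
```

Choosing the smallest sufficient volume keeps larger volumes free for larger claims, which is one reason to sort the index by capacity.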
#### Claim usage
The claim holder can use their claim as a volume. The ```PersistentVolumeClaimVolumeSource``` knows to fetch the PV backing the claim and mount its volume for a pod.
The claim holder owns the claim and its data for as long as the claim exists. The pod using the claim can be deleted, but the claim remains in the user's namespace. It can be used again and again by many pods.
{% highlight yaml %}
POST:

kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
    - image: nginx
      name: myfrontend
      volumeMounts:
        - mountPath: "/var/www/html"
          name: mypd
  volumes:
    - name: mypd
      source:
        persistentVolumeClaim:
          accessMode: ReadWriteOnce
          claimRef:
            name: myclaim-1
{% endhighlight %}
#### Releasing a claim and Recycling a volume
When a claim holder is finished with their data, they can delete their claim.
{% highlight console %}
$ kubectl delete pvc myclaim-1
{% endhighlight %}
The ```PersistentVolumeClaimBinder``` will reconcile this by removing the claim reference from the PV and changing the PV's status to 'Released'.
Admins can script the recycling of released volumes. Future dynamic provisioners will understand how a volume should be recycled.

View File

@ -1,59 +0,0 @@
---
title: "Design Principles"
---
Principles to follow when extending Kubernetes.
## API
See also the [API conventions](../devel/api-conventions).
* All APIs should be declarative.
* API objects should be complementary and composable, not opaque wrappers.
* The control plane should be transparent -- there are no hidden internal APIs.
* The cost of API operations should be proportional to the number of objects intentionally operated upon. Therefore, common filtered lookups must be indexed. Beware of patterns of multiple API calls that would incur quadratic behavior.
* Object status must be 100% reconstructable by observation. Any history kept must be just an optimization and not required for correct operation.
* Cluster-wide invariants are difficult to enforce correctly. Try not to add them. If you must have them, don't enforce them atomically in master components; that is contention-prone and doesn't provide a recovery path in the case of a bug allowing the invariant to be violated. Instead, provide a series of checks to reduce the probability of a violation, and make every component involved able to recover from an invariant violation.
* Low-level APIs should be designed for control by higher-level systems. Higher-level APIs should be intent-oriented (think SLOs) rather than implementation-oriented (think control knobs).
## Control logic
* Functionality must be *level-based*, meaning the system must operate correctly given the desired state and the current/observed state, regardless of how many intermediate state updates may have been missed. Edge-triggered behavior must be just an optimization.
* Assume an open world: continually verify assumptions and gracefully adapt to external events and/or actors. Example: we allow users to kill pods under control of a replication controller; it just replaces them.
* Do not define comprehensive state machines for objects with behaviors associated with state transitions and/or "assumed" states that cannot be ascertained by observation.
* Don't assume a component's decisions will not be overridden or rejected, or that the component will always understand why. For example, etcd may reject writes. Kubelet may reject pods. The scheduler may not be able to schedule pods. Retry, but back off and/or make alternative decisions.
* Components should be self-healing. For example, if you must keep some state (e.g., a cache), the content needs to be periodically refreshed, so that if an item does get erroneously stored or a deletion event is missed, it will soon be fixed, ideally on timescales that are shorter than what will attract attention from humans.
* Component behavior should degrade gracefully. Prioritize actions so that the most important activities can continue to function even when overloaded and/or in states of partial failure.
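As a rough illustration of the level-based principle, the sketch below derives its actions only from the desired state and the currently observed state, so it gives the same answer no matter how many intermediate updates were missed. The function name `reconcile` and the action tuples are hypothetical; this is not the replication controller's actual code.

```python
def reconcile(desired_replicas, observed_pods):
    """Return the actions that move the observed state toward the desired state.

    observed_pods is an iterable of pod names seen right now. No event
    history is consulted, which is what makes the logic level-based.
    """
    running = sorted(observed_pods)
    diff = desired_replicas - len(running)
    if diff > 0:
        # Too few pods: create the missing ones.
        return [("create", None)] * diff
    if diff < 0:
        # Too many pods: delete the surplus (first names in sorted order).
        return [("delete", name) for name in running[:-diff]]
    return []


# A user killed two of three pods; the next pass simply replaces them.
print(reconcile(3, {"pod-a"}))
```

Running the function again after its actions have been applied yields an empty list, so repeated passes are harmless.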
## Architecture
* Only the apiserver should communicate with etcd/store, and not other components (scheduler, kubelet, etc.).
* Compromising a single node shouldn't compromise the cluster.
* Components should continue to do what they were last told in the absence of new instructions (e.g., due to network partition or component outage).
* All components should keep all relevant state in memory all the time. The apiserver should write through to etcd/store, other components should write through to the apiserver, and they should watch for updates made by other clients.
* Watch is preferred over polling.
## Extensibility
TODO: pluggability
## Bootstrapping
* [Self-hosting](http://issue.k8s.io/246) of all components is a goal.
* Minimize the number of dependencies, particularly those required for steady-state operation.
* Stratify the dependencies that remain via principled layering.
* Break any circular dependencies by converting hard dependencies to soft dependencies.
* Also accept data that would normally come from other components from another source, such as local files, which can be manually populated at bootstrap time and then continuously updated once those other components are available.
* State should be rediscoverable and/or reconstructable.
* Make it easy to run temporary, bootstrap instances of all components in order to create the runtime state needed to run the components in the steady state; use a lock (master election for distributed components, file lock for local components like Kubelet) to coordinate handoff. We call this technique "pivoting".
* Have a solution to restart dead components. For distributed components, replication works well. For local components such as Kubelet, a process manager or even a simple shell loop works.
## Availability
TODO
## General principles
* [Eric Raymond's 17 UNIX rules](https://en.wikipedia.org/wiki/Unix_philosophy#Eric_Raymond.E2.80.99s_17_Unix_Rules)

View File

@ -1,245 +0,0 @@
---
title: "The Kubernetes resource model"
---
**Note: this is a design doc, which describes features that have not been completely implemented.
User documentation of the current state is [here](../user-guide/compute-resources). The tracking issue for
implementation of this model is
[#168](http://issue.k8s.io/168). Currently, both limits and requests of memory and
cpu on containers (not pods) are supported. "memory" is in bytes and "cpu" is in
milli-cores.**
# The Kubernetes resource model
To do good pod placement, Kubernetes needs to know how big pods are, as well as the sizes of the nodes onto which they are being placed. The definition of "how big" is given by the Kubernetes resource model &mdash; the subject of this document.
The resource model aims to be:
* simple, for common cases;
* extensible, to accommodate future growth;
* regular, with few special cases; and
* precise, to avoid misunderstandings and promote pod portability.
## The resource model
A Kubernetes _resource_ is something that can be requested by, allocated to, or consumed by a pod or container. Examples include memory (RAM), CPU, disk-time, and network bandwidth.
Once resources on a node have been allocated to one pod, they should not be allocated to another until that pod is removed or exits. This means that Kubernetes schedulers should ensure that the sum of the resources allocated (requested and granted) to its pods never exceeds the usable capacity of the node. Testing whether a pod will fit on a node is called _feasibility checking_.
Note that the resource model currently prohibits over-committing resources; we will want to relax that restriction later.
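Feasibility checking can be sketched as a simple sum-and-compare per resource type. The function name `fits` and the dictionary layout are assumptions made for illustration; quantities are plain integers in internal units (for example milli-CPUs and bytes).

```python
def fits(node_capacity, allocated, pod_request):
    """Return True if pod_request fits on the node without over-committing.

    node_capacity: dict of resource type -> usable capacity
    allocated:     list of dicts, one per pod already placed on the node
    pod_request:   dict of resource type -> requested quantity
    """
    for rtype, wanted in pod_request.items():
        used = sum(pod.get(rtype, 0) for pod in allocated)
        if used + wanted > node_capacity.get(rtype, 0):
            return False
    return True


capacity = {"cpu": 4000, "memory": 8 * 2**30}
placed = [{"cpu": 2500, "memory": 2**30}]
print(fits(capacity, placed, {"cpu": 1500, "memory": 2**30}))
print(fits(capacity, placed, {"cpu": 1501}))
```

A resource type the node does not list is treated as having zero capacity, so a pod requesting it never fits.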
### Resource types
All resources have a _type_ that is identified by their _typename_ (a string, e.g., "memory"). Several resource types are predefined by Kubernetes (a full list is below), although only two will be supported at first: CPU and memory. Users and system administrators can define their own resource types if they wish (e.g., Hadoop slots).
A fully-qualified resource typename is constructed from a DNS-style _subdomain_, followed by a slash `/`, followed by a name.
* The subdomain must conform to [RFC 1123](http://www.ietf.org/rfc/rfc1123.txt) (e.g., `kubernetes.io`, `example.com`).
* The name must be not more than 63 characters, consisting of upper- or lower-case alphanumeric characters, with the `-`, `_`, and `.` characters allowed anywhere except the first or last character.
* As a shorthand, any resource typename that does not start with a subdomain and a slash will automatically be prefixed with the built-in Kubernetes _namespace_, `kubernetes.io/` in order to fully-qualify it. This namespace is reserved for code in the open source Kubernetes repository; as a result, all user typenames MUST be fully qualified, and cannot be created in this namespace.
Some example typenames include `memory` (which will be fully-qualified as `kubernetes.io/memory`), and `example.com/Shiny_New-Resource.Type`.
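The naming rules above can be sketched as a small helper. The function name `qualify` is hypothetical, and the sketch only checks the name part; validating the subdomain against RFC 1123 is left out for brevity.

```python
import re

DEFAULT_NAMESPACE = "kubernetes.io"

# 1 to 63 characters: alphanumeric first and last, with '-', '_', '.'
# allowed anywhere in between.
_NAME_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9._-]{0,61}[A-Za-z0-9])?")


def qualify(typename):
    """Return the fully-qualified form of a resource typename."""
    if "/" in typename:
        subdomain, name = typename.split("/", 1)
    else:
        # Shorthand: unqualified names fall into the built-in namespace.
        subdomain, name = DEFAULT_NAMESPACE, typename
    if not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid resource name: {name!r}")
    return f"{subdomain}/{name}"


print(qualify("memory"))
print(qualify("example.com/Shiny_New-Resource.Type"))
```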
For future reference, note that some resources, such as CPU and network bandwidth, are _compressible_, which means that their usage can potentially be throttled in a relatively benign manner. All other resources are _incompressible_, which means that any attempt to throttle them is likely to cause grief. This distinction will be important if a Kubernetes implementation supports over-committing of resources.
### Resource quantities
Initially, all Kubernetes resource types are _quantitative_, and have an associated _unit_ for quantities of the associated resource (e.g., bytes for memory, bytes per second for bandwidth, instances for software licences). The units will always be a resource type's natural base units (e.g., bytes, not MB), to avoid confusion between binary and decimal multipliers and the underlying unit multiplier (e.g., is memory measured in MiB, MB, or GB?).
Resource quantities can be added and subtracted: for example, a node has a fixed quantity of each resource type that can be allocated to pods/containers; once such an allocation has been made, the allocated resources cannot be made available to other pods/containers without over-committing the resources.
To make life easier for people, quantities can be represented externally as unadorned integers, or as fixed-point integers with one of these SI suffixes (E, P, T, G, M, K, m) or their power-of-two equivalents (Ei, Pi, Ti, Gi, Mi, Ki). For example, the following represent roughly the same value: 128974848, "129e6", "129M", "123Mi". Small quantities can be represented directly as decimals (e.g., 0.3), or using milli-units (e.g., "300m").
* "Externally" means in user interfaces, reports, graphs, and in JSON or YAML resource specifications that might be generated or read by people.
* Case is significant: "m" and "M" are not the same, so "k" is not a valid SI suffix. There are no power-of-two equivalents for SI suffixes that represent multipliers less than 1.
* These conventions only apply to resource quantities, not arbitrary values.
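To make the suffix rules concrete, here is a minimal parsing sketch. The function name `parse_quantity` is hypothetical and this is not the Kubernetes implementation; it uses exact rational arithmetic so that "300m" and 0.3 compare equal.

```python
from fractions import Fraction

# Case-sensitive suffix tables, as described above.
DECIMAL = {"m": Fraction(1, 1000), "K": 10**3, "M": 10**6, "G": 10**9,
           "T": 10**12, "P": 10**15, "E": 10**18}
BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
          "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}


def parse_quantity(text):
    """Parse an external quantity such as 128974848, "129M", or "300m"."""
    text = str(text)
    for table in (BINARY, DECIMAL):
        for suffix, multiplier in table.items():
            if text.endswith(suffix):
                return Fraction(text[:-len(suffix)]) * multiplier
    # No suffix: a plain integer, decimal, or exponent form such as "129e6".
    return Fraction(text)


print(parse_quantity("123Mi"))   # 123 * 2**20 = 128974848
print(parse_quantity("129M"))    # 129000000, roughly the same value
```

Because case is significant, an input such as "5k" matches no suffix and is rejected with a `ValueError`.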
Internally (i.e., everywhere else), Kubernetes will represent resource quantities as integers so it can avoid problems with rounding errors, and will not use strings to represent numeric values. To achieve this, quantities that naturally have fractional parts (e.g., CPU seconds/second) will be scaled to integral numbers of milli-units (e.g., milli-CPUs) as soon as they are read in. Internal APIs, data structures, and protobufs will use these scaled integer units. Raw measurement data such as usage may still need to be tracked and calculated using floating point values, but internally they should be rescaled to avoid some values being in milli-units and some not.
* Note that reading in a resource quantity and writing it out again may change the way its values are represented, and truncate precision (e.g., 1.0001 may become 1.000), so comparison and difference operations (e.g., by an updater) must be done on the internal representations.
* Avoiding milli-units in external representations has advantages for people who will use Kubernetes, but runs the risk of developers forgetting to rescale or accidentally using floating-point representations. That seems like the right choice. We will try to reduce the risk by providing libraries that automatically do the quantization for JSON/YAML inputs.
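The scaling to internal integer units can be sketched as below. The function name `to_internal` is hypothetical, and rounding toward zero is an assumption chosen to mirror the "truncate precision" wording above; a real implementation might round differently.

```python
from decimal import Decimal, ROUND_DOWN


def to_internal(external, scale_exponent=3):
    """Scale an external value by 10**scale_exponent and drop any remainder.

    With scale_exponent=3 the result is in milli-units (e.g., milli-CPUs).
    """
    scaled = Decimal(str(external)).scaleb(scale_exponent)
    return int(scaled.to_integral_value(rounding=ROUND_DOWN))


print(to_internal("2.5"))     # 2.5 CPUs -> 2500 milli-CPUs
print(to_internal("1.0001"))  # precision beyond milli-units is lost -> 1000
```

Going through `Decimal(str(...))` rather than multiplying a float keeps values such as 0.3 from picking up binary rounding noise before they are truncated.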
### Resource specifications
Both users and a number of system components, such as schedulers, (horizontal) auto-scalers, (vertical) auto-sizers, load balancers, and worker-pool managers need to reason about resource requirements of workloads, resource capacities of nodes, and resource usage. Kubernetes divides specifications of *desired state*, aka the Spec, and representations of *current state*, aka the Status. Resource requirements and total node capacity fall into the specification category, while resource usage, characterizations derived from usage (e.g., maximum usage, histograms), and other resource demand signals (e.g., CPU load) clearly fall into the status category and are discussed in the Appendix for now.
Resource requirements for a container or pod should have the following form:
{% highlight yaml %}
resourceRequirementSpec: [
  request:   [ cpu: 2.5, memory: "40Mi" ],
  limit:     [ cpu: 4.0, memory: "99Mi" ],
]
{% endhighlight %}
Where:
* _request_ [optional]: the amount of resources being requested, or that were requested and have been allocated. Scheduler algorithms will use these quantities to test feasibility (whether a pod will fit onto a node). If a container (or pod) tries to use more resources than its _request_, any associated SLOs are voided &mdash; e.g., the program it is running may be throttled (compressible resource types), or the attempt may be denied. If _request_ is omitted for a container, it defaults to _limit_ if that is explicitly specified, otherwise to an implementation-defined value; this will always be 0 for a user-defined resource type. If _request_ is omitted for a pod, it defaults to the sum of the (explicit or implicit) _request_ values for the containers it encloses.
* _limit_ [optional]: an upper bound or cap on the maximum amount of resources that will be made available to a container or pod; if a container or pod uses more resources than its _limit_, it may be terminated. The _limit_ defaults to "unbounded"; in practice, this probably means the capacity of an enclosing container, pod, or node, but may result in non-deterministic behavior, especially for memory.
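The defaulting rules for _request_ can be sketched as follows. The function names and the dictionary layout are hypothetical, and `implementation_default` stands in for the implementation-defined value, which the text says is always 0 for user-defined resource types.

```python
def effective_request(container, resource, implementation_default=0):
    """Resolve a container's request for one resource type.

    Order: explicit request, then explicit limit, then the
    implementation-defined default.
    """
    request = container.get("request", {})
    limit = container.get("limit", {})
    if resource in request:
        return request[resource]
    if resource in limit:
        return limit[resource]
    return implementation_default


def pod_request(containers, resource, implementation_default=0):
    """A pod's omitted request defaults to the sum over its containers."""
    return sum(effective_request(c, resource, implementation_default)
               for c in containers)


containers = [
    {"request": {"cpu": 2500}, "limit": {"cpu": 4000}},  # explicit request
    {"limit": {"cpu": 1000}},                            # falls back to limit
    {},                                                  # falls back to default
]
print(pod_request(containers, "cpu"))
```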
Total capacity for a node should have a similar structure:
{% highlight yaml %}
resourceCapacitySpec: [
  total:     [ cpu: 12,  memory: "128Gi" ]
]
{% endhighlight %}
Where:
* _total_: the total allocatable resources of a node. Initially, the resources at a given scope will bound the resources of the sum of inner scopes.
#### Notes
* It is an error to specify the same resource type more than once in each list.
* It is an error for the _request_ or _limit_ values for a pod to be less than the sum of the (explicit or defaulted) values for the containers it encloses. (We may relax this later.)
* If multiple pods are running on the same node and attempting to use more resources than they have requested, the result is implementation-defined. For example: unallocated or unused resources might be spread equally across claimants, or the assignment might be weighted by the size of the original request, or as a function of limits, or priority, or the phase of the moon, perhaps modulated by the direction of the tide. Thus, although it's not mandatory to provide a _request_, it's probably a good idea. (Note that the _request_ could be filled in by an automated system that is observing actual usage and/or historical data.)
* Internally, the Kubernetes master can decide the defaulting behavior and the kubelet implementation may expect an absolute specification. For example, if the master decided that "the default is unbounded" it would pass 2^64 to the kubelet.
## Kubernetes-defined resource types
The following resource types are predefined ("reserved") by Kubernetes in the `kubernetes.io` namespace, and so cannot be used for user-defined resources. Note that the syntax of all resource types in the resource spec is deliberately similar, but some resource types (e.g., CPU) may receive significantly more support than simply tracking quantities in the schedulers and/or the Kubelet.
### Processor cycles
* Name: `cpu` (or `kubernetes.io/cpu`)
* Units: Kubernetes Compute Unit seconds/second (i.e., CPU cores normalized to a canonical "Kubernetes CPU")
* Internal representation: milli-KCUs
* Compressible? yes
* Qualities: this is a placeholder for the kind of thing that may be supported in the future &mdash; see [#147](http://issue.k8s.io/147)
* [future] `schedulingLatency`: as per lmctfy
* [future] `cpuConversionFactor`: property of a node: the speed of a CPU core on the node's processor divided by the speed of the canonical Kubernetes CPU (a floating point value; default = 1.0).
To reduce performance portability problems for pods, and to avoid worst-case provisioning behavior, the units of CPU will be normalized to a canonical "Kubernetes Compute Unit" (KCU, pronounced ˈko͝oko͞o), which will roughly be equivalent to a single CPU hyperthreaded core for some recent x86 processor. The normalization may be implementation-defined, although some reasonable defaults will be provided in the open-source Kubernetes code.
Note that requesting 2 KCU won't guarantee that precisely 2 physical cores will be allocated &mdash; control of aspects like this will be handled by resource _qualities_ (a future feature).
### Memory
* Name: `memory` (or `kubernetes.io/memory`)
* Units: bytes
* Compressible? no (at least initially)
The precise meaning of "memory" is implementation dependent, but the basic idea is to rely on the underlying `memcg` mechanisms, support, and definitions.
Note that most people will want to use power-of-two suffixes (Mi, Gi) for memory quantities
rather than decimal ones: "64MiB" rather than "64MB".
## Resource metadata
A resource type may have an associated read-only ResourceType structure, that contains metadata about the type. For example:
{% highlight yaml %}
resourceTypes: [
  "kubernetes.io/memory": [
    isCompressible: false, ...
  ]
  "kubernetes.io/cpu": [
    isCompressible: true,
    internalScaleExponent: 3, ...
  ]
  "kubernetes.io/disk-space": [ ... ]
]
{% endhighlight %}
Kubernetes will provide ResourceType metadata for its predefined types. If no resource metadata can be found for a resource type, Kubernetes will assume that it is a quantified, incompressible resource that is not specified in milli-units, and has no default value.
The defined properties are as follows:
| field name | type | contents |
| ---------- | ---- | -------- |
| name | string, required | the typename, as a fully-qualified string (e.g., `kubernetes.io/cpu`) |
| internalScaleExponent | int, default=0 | external values are multiplied by 10 to this power for internal storage (e.g., 3 for milli-units) |
| units | string, required | format: `unit* [per unit+]` (e.g., `second`, `byte per second`). An empty unit field means "dimensionless". |
| isCompressible | bool, default=false | true if the resource type is compressible |
| defaultRequest | string, default=none | in the same format as a user-supplied value |
| _[future]_ quantization | number, default=1 | smallest granularity of allocation: requests may be rounded up to a multiple of this unit; implementation-defined unit (e.g., the page size for RAM). |
# Appendix: future extensions
The following are planned future extensions to the resource model, included here to encourage comments.
## Usage data
Because resource usage and related metrics change continuously, need to be tracked over time (i.e., historically), can be characterized in a variety of ways, and are fairly voluminous, we will not include usage in core API objects, such as [Pods](../user-guide/pods) and Nodes, but will provide separate APIs for accessing and managing that data. See the Appendix for possible representations of usage data, but the representation we'll use is TBD.
Singleton values for observed and predicted future usage will rapidly prove inadequate, so we will support the following structure for extended usage information:
{% highlight yaml %}
resourceStatus: [
  usage:     [ cpu: <CPU-info>, memory: <memory-info> ],
  maxusage:  [ cpu: <CPU-info>, memory: <memory-info> ],
  predicted: [ cpu: <CPU-info>, memory: <memory-info> ],
]
{% endhighlight %}
where a `<CPU-info>` or `<memory-info>` structure looks like this:
{% highlight yaml %}
{
  mean: <value>     # arithmetic mean
  max: <value>      # maximum value
  min: <value>      # minimum value
  count: <value>    # number of data points
  percentiles: [    # map from %iles to values
    "10": <10th-percentile-value>,
    "50": <median-value>,
    "99": <99th-percentile-value>,
    "99.9": <99.9th-percentile-value>,
    ...
  ]
}
{% endhighlight %}
All parts of this structure are optional, although we strongly encourage including quantities for 50, 90, 95, 99, 99.5, and 99.9 percentiles. _[In practice, it will be important to include additional info such as the length of the time window over which the averages are calculated, the confidence level, and information-quality metrics such as the number of dropped or discarded data points.]_
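One way to produce such a structure from raw samples is sketched below. The function name `summarize` is hypothetical, and the nearest-rank percentile method is an assumption; the proposal does not say how percentiles should be computed.

```python
import math


def summarize(samples, percentiles=(50, 90, 95, 99, 99.5, 99.9)):
    """Build a <CPU-info>/<memory-info>-style summary from raw samples."""
    ordered = sorted(samples)
    count = len(ordered)
    if count == 0:
        raise ValueError("no samples to summarize")

    def nearest_rank(p):
        # Smallest sample such that at least p percent of samples are <= it.
        rank = max(1, math.ceil(p * count / 100))
        return ordered[rank - 1]

    return {
        "mean": sum(ordered) / count,
        "max": ordered[-1],
        "min": ordered[0],
        "count": count,
        "percentiles": {str(p): nearest_rank(p) for p in percentiles},
    }


info = summarize(range(1, 101))
print(info["mean"], info["min"], info["max"], info["percentiles"]["50"])
```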
## Future resource types
### _[future] Network bandwidth_
* Name: "network-bandwidth" (or `kubernetes.io/network-bandwidth`)
* Units: bytes per second
* Compressible? yes
### _[future] Network operations_
* Name: "network-iops" (or `kubernetes.io/network-iops`)
* Units: operations (messages) per second
* Compressible? yes
### _[future] Storage space_
* Name: "storage-space" (or `kubernetes.io/storage-space`)
* Units: bytes
* Compressible? no
The amount of secondary storage space available to a container. The main target is local disk drives and SSDs, although this could also be used to qualify remotely-mounted volumes. Specifying whether a resource is a raw disk, an SSD, a disk array, or a file system fronting any of these, is left for future work.
### _[future] Storage time_
* Name: "storage-time" (or `kubernetes.io/storage-time`)
* Units: seconds per second of disk time
* Internal representation: milli-units
* Compressible? yes
This is the amount of time a container spends accessing disk, including actuator and transfer time. A standard disk drive provides 1.0 diskTime seconds per second.
### _[future] Storage operations_
* Name: "storage-iops" (or `kubernetes.io/storage-iops`)
* Units: operations per second
* Compressible? yes

View File

@ -1,593 +0,0 @@
---
title: "Abstract"
---
A proposal for the distribution of [secrets](../user-guide/secrets) (passwords, keys, etc) to the Kubelet and to
containers inside Kubernetes using a custom [volume](../user-guide/volumes.html#secrets) type. See the [secrets example](../user-guide/secrets/) for more information.
## Motivation
Secrets are needed in containers to access internal resources like the Kubernetes master or
external resources such as git repositories, databases, etc. Users may also want behaviors in the
kubelet that depend on secret data (credentials for image pull from a docker registry) associated
with pods.
Goals of this design:
1. Describe a secret resource
2. Define the various challenges attendant to managing secrets on the node
3. Define a mechanism for consuming secrets in containers without modification
## Constraints and Assumptions
* This design does not prescribe a method for storing secrets; storage of secrets should be
pluggable to accommodate different use-cases
* Encryption of secret data and node security are orthogonal concerns
* It is assumed that node and master are secure and that compromising their security could also
compromise secrets:
* If a node is compromised, the only secrets that could potentially be exposed should be the
secrets belonging to containers scheduled onto it
* If the master is compromised, all secrets in the cluster may be exposed
* Secret rotation is an orthogonal concern, but it should be facilitated by this proposal
* A user who can consume a secret in a container can know the value of the secret; secrets must
be provisioned judiciously
## Use Cases
1. As a user, I want to store secret artifacts for my applications and consume them securely in
containers, so that I can keep the configuration for my applications separate from the images
that use them:
1. As a cluster operator, I want to allow a pod to access the Kubernetes master using a custom
`.kubeconfig` file, so that I can securely reach the master
2. As a cluster operator, I want to allow a pod to access a Docker registry using credentials
from a `.dockercfg` file, so that containers can push images
3. As a cluster operator, I want to allow a pod to access a git repository using SSH keys,
so that I can push to and fetch from the repository
2. As a user, I want to allow containers to consume supplemental information about services such
as username and password which should be kept secret, so that I can share secrets about a
service amongst the containers in my application securely
3. As a user, I want to associate a pod with a `ServiceAccount` that consumes a secret and have
the kubelet implement some reserved behaviors based on the types of secrets the service account
consumes:
1. Use credentials for a docker registry to pull the pod's docker image
2. Present Kubernetes auth token to the pod or transparently decorate traffic between the pod
and master service
4. As a user, I want to be able to indicate that a secret expires and for that secret's value to
be rotated once it expires, so that the system can help me follow good practices
### Use-Case: Configuration artifacts
Many configuration files contain secrets intermixed with other configuration information. For
example, a user's application may contain a properties file that contains database credentials,
SaaS API tokens, etc. Users should be able to consume configuration artifacts in their containers
and be able to control the path on the container's filesystems where the artifact will be
presented.
### Use-Case: Metadata about services
Most pieces of information about how to use a service are secrets. For example, a service that
provides a MySQL database needs to provide the username, password, and database name to consumers
so that they can authenticate and use the correct database. Containers in pods consuming the MySQL
service would also consume the secrets associated with the MySQL service.
### Use-Case: Secrets associated with service accounts
[Service Accounts](service_accounts) are proposed as a
mechanism to decouple capabilities and security contexts from individual human users. A
`ServiceAccount` contains references to some number of secrets. A `Pod` can specify that it is
associated with a `ServiceAccount`. Secrets should have a `Type` field to allow the Kubelet and
other system components to take action based on the secret's type.
#### Example: service account consumes auth token secret
As an example, the service account proposal discusses service accounts consuming secrets which
contain Kubernetes auth tokens. When a Kubelet starts a pod associated with a service account
which consumes this type of secret, the Kubelet may take a number of actions:
1. Expose the secret in a `.kubernetes_auth` file in a well-known location in the container's
file system
2. Configure that node's `kube-proxy` to decorate HTTP requests from that pod to the
`kubernetes-master` service with the auth token, e.g. by adding a header to the request
(see the [LOAS Daemon](http://issue.k8s.io/2209) proposal)
#### Example: service account consumes docker registry credentials
Another example use case is where a pod is associated with a secret containing docker registry
credentials. The Kubelet could use these credentials for the docker pull to retrieve the image.
### Use-Case: Secret expiry and rotation
Rotation is considered a good practice for many types of secret data. It should be possible to
express that a secret has an expiry date; this would make it possible to implement a system
component that could regenerate expired secrets. As an example, consider a component that rotates
expired secrets. The rotator could periodically regenerate the values for expired secrets of
common types and update their expiry dates.
## Deferral: Consuming secrets as environment variables
Some images will expect to receive configuration items as environment variables instead of files.
We should consider what the best way to allow this is; there are a few different options:
1. Force the user to adapt files into environment variables. Users can store secrets that need to
be presented as environment variables in a format that is easy to consume from a shell:
$ cat /etc/secrets/my-secret.txt
export MY_SECRET_ENV=MY_SECRET_VALUE
The user could `source` the file at `/etc/secrets/my-secret.txt` prior to executing the command for
the image, either inline in the command or in an init script.
2. Give secrets an attribute that allows users to express the intent that the platform should
generate the above syntax in the file used to present a secret. The user could consume these
files in the same manner as the above option.
3. Give secrets attributes that allow the user to express that the secret should be presented to
the container as an environment variable. The container's environment would contain the
desired values and the software in the container could use them without any accommodation in the
command or setup script.
For our initial work, we will treat all secrets as files to narrow the problem space. There will
be a future proposal that handles exposing secrets as environment variables.
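For images that are not started from a shell, the first option could also be handled in the application itself. The following is a minimal sketch, assuming the secret file uses the simple `export NAME=VALUE` form shown above with no quoting or variable expansion; `ParseExports` is a hypothetical helper name.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// ParseExports reads lines of the form `export NAME=VALUE` (the `export`
// prefix is optional) and returns them as a map. Blank lines and lines
// starting with '#' are skipped. Shell quoting is NOT interpreted.
func ParseExports(content string) map[string]string {
	env := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		line = strings.TrimPrefix(line, "export ")
		parts := strings.SplitN(line, "=", 2)
		if len(parts) != 2 || parts[0] == "" {
			continue // not a NAME=VALUE line
		}
		env[strings.TrimSpace(parts[0])] = parts[1]
	}
	return env
}

func main() {
	content := "export MY_SECRET_ENV=MY_SECRET_VALUE\n"
	fmt.Println(ParseExports(content)["MY_SECRET_ENV"])
}
```

The resulting map could then be applied with `os.Setenv` before the main program logic runs.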
## Flow analysis of secret data with respect to the API server
There are two fundamentally different use-cases for access to secrets:
1. CRUD operations on secrets by their owners
2. Read-only access to the secrets needed for a particular node by the kubelet
### Use-Case: CRUD operations by owners
In use cases for CRUD operations, the user experience for secrets should be no different than for
other API resources.
#### Data store backing the REST API
The data store backing the REST API should be pluggable because different cluster operators will
have different preferences for the central store of secret data. Some possibilities for storage:
1. An etcd collection alongside the storage for other API resources
2. A collocated [HSM](http://en.wikipedia.org/wiki/Hardware_security_module)
3. A secrets server like [Vault](https://www.vaultproject.io/) or [Keywhiz](https://square.github.io/keywhiz/)
4. An external datastore such as an external etcd, RDBMS, etc.
#### Size limit for secrets
There should be a size limit for secrets in order to:
1. Prevent DoS attacks against the API server
2. Allow kubelet implementations that prevent secret data from touching the node's filesystem
The size limit should satisfy the following conditions:
1. Large enough to store common artifact types (encryption keypairs, certificates, small
configuration files)
2. Small enough to avoid large impact on node resource consumption (storage, RAM for tmpfs, etc)
To begin discussion, we propose an initial value for this size limit of **1MB**.
#### Other limitations on secrets
Defining a policy for limitations on how a secret may be referenced by another API resource and how
constraints should be applied throughout the cluster is tricky due to the number of variables
involved:
1. Should there be a maximum number of secrets a pod can reference via a volume?
2. Should there be a maximum number of secrets a service account can reference?
3. Should there be a total maximum number of secrets a pod can reference via its own spec and its
associated service account?
4. Should there be a total size limit on the amount of secret data consumed by a pod?
5. How will cluster operators want to be able to configure these limits?
6. How will these limits impact API server validations?
7. How will these limits affect scheduling?
For now, we will not implement validations around these limits. Cluster operators will decide how
much node storage is allocated to secrets. It will be the operator's responsibility to ensure that
the allocated storage is sufficient for the workload scheduled onto a node.
For now, kubelets will only attach secrets to api-sourced pods, and not file- or http-sourced
ones. Attaching secrets to file- or http-sourced pods would:
- confuse the secrets admission controller in the case of mirror pods.
- create an apiserver-liveness dependency -- avoiding this dependency is a main reason to use non-api-sourced pods.
### Use-Case: Kubelet read of secrets for node
The use-case where the kubelet reads secrets has several additional requirements:
1. Kubelets should only be able to receive secret data which is required by pods scheduled onto
the kubelet's node
2. Kubelets should have read-only access to secret data
3. Secret data should not be transmitted over the wire insecurely
4. Kubelets must ensure pods do not have access to each other's secrets
#### Read of secret data by the Kubelet
The Kubelet should only be allowed to read secrets which are consumed by pods scheduled onto that
Kubelet's node and their associated service accounts. Authorization of the Kubelet to read this
data would be delegated to an authorization plugin and associated policy rule.
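The policy rule described above can be sketched as a simple predicate. This is an illustration only, not the authorization plugin interface: the `Pod` type is a simplified stand-in, and `SecretNames` is assumed to already include secrets reached through the pod's service account.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for the API object.
type Pod struct {
	Namespace string
	NodeName  string
	// SecretNames lists secrets the pod consumes, directly or through
	// its service account (an assumption made for this sketch).
	SecretNames []string
}

// KubeletMayReadSecret reports whether the kubelet on nodeName may read the
// named secret: true only if some pod scheduled onto that node, in the same
// namespace as the secret, consumes it.
func KubeletMayReadSecret(nodeName, namespace, secretName string, pods []Pod) bool {
	for _, p := range pods {
		if p.NodeName != nodeName || p.Namespace != namespace {
			continue
		}
		for _, name := range p.SecretNames {
			if name == secretName {
				return true
			}
		}
	}
	return false
}

func main() {
	pods := []Pod{{Namespace: "example", NodeName: "node-1", SecretNames: []string{"ssh-key-secret"}}}
	fmt.Println(KubeletMayReadSecret("node-1", "example", "ssh-key-secret", pods))
	fmt.Println(KubeletMayReadSecret("node-2", "example", "ssh-key-secret", pods))
}
```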
#### Secret data on the node: data at rest
Consideration must be given to whether secret data should be allowed to be at rest on the node:
1. If secret data is not allowed to be at rest, the size of secret data becomes another draw on
the node's RAM - should it affect scheduling?
2. If secret data is allowed to be at rest, should it be encrypted?
1. If so, how should this be done?
2. If not, what threats exist? What types of secret are appropriate to store this way?
For the sake of limiting complexity, we propose that initially secret data should not be allowed
to be at rest on a node; secret data should be stored on a node-level tmpfs filesystem. This
filesystem can be subdivided into directories for use by the kubelet and by the volume plugin.
#### Secret data on the node: resource consumption
The Kubelet will be responsible for creating the per-node tmpfs file system for secret storage.
It is hard to make a prescriptive declaration about how much storage is appropriate to reserve for
secrets because different installations will vary widely in available resources, desired pod to
node density, overcommit policy, and other operation dimensions. That being the case, we propose
for simplicity that the amount of secret storage be controlled by a new parameter to the kubelet
with a default value of **64MB**. It is the cluster operator's responsibility to handle choosing
the right storage size for their installation and configuring their Kubelets correctly.
Configuring each Kubelet is not the ideal story for operator experience; it is more intuitive that
the cluster-wide storage size be readable from a central configuration store like the one proposed
in [#1553](http://issue.k8s.io/1553). When such a store
exists, the Kubelet could be modified to read this configuration item from the store.
When the Kubelet is modified to advertise node resources (as proposed in
[#4441](http://issue.k8s.io/4441)), the capacity calculation
for available memory should factor in the potential size of the node-level tmpfs in order to avoid
memory overcommit on the node.
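As a rough illustration of that capacity adjustment, the sketch below subtracts the full tmpfs size from the node's memory before advertising it. The function name and the choice to reserve the whole tmpfs size up front are assumptions; only the 64MB default comes from this proposal.

```go
package main

import "fmt"

// DefaultSecretStorageMB is the default proposed in the text above.
const DefaultSecretStorageMB = 64

// AdvertisedMemoryMB returns the memory a node could advertise after
// reserving room for a completely full secret tmpfs. It never returns a
// negative value.
func AdvertisedMemoryMB(totalMB, secretStorageMB int64) int64 {
	if secretStorageMB >= totalMB {
		return 0
	}
	return totalMB - secretStorageMB
}

func main() {
	// A node with 4096MB of RAM and the default secret storage size.
	fmt.Println(AdvertisedMemoryMB(4096, DefaultSecretStorageMB))
}
```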
#### Secret data on the node: isolation
Every pod will have a [security context](security_context).
Secret data on the node should be isolated according to the security context of the container. The
Kubelet volume plugin API will be changed so that a volume plugin receives the security context of
a volume along with the volume spec. This will allow volume plugins to implement setting the
security context of volumes they manage.
## Community work
Several proposals / upstream patches are notable as background for this proposal:
1. [Docker vault proposal](https://github.com/docker/docker/issues/10310)
2. [Specification for image/container standardization based on volumes](https://github.com/docker/docker/issues/9277)
3. [Kubernetes service account proposal](service_accounts)
4. [Secrets proposal for docker (1)](https://github.com/docker/docker/pull/6075)
5. [Secrets proposal for docker (2)](https://github.com/docker/docker/pull/6697)
## Proposed Design
We propose a new `Secret` resource which is mounted into containers with a new volume type. Secret
volumes will be handled by a volume plugin that does the actual work of fetching the secret and
storing it. Secrets contain multiple pieces of data that are presented as different files within
the secret volume (example: SSH key pair).
In order to remove the burden from the end user in specifying every file that a secret consists of,
it should be possible to mount all files provided by a secret with a single `VolumeMount` entry
in the container specification.
### Secret API Resource
A new resource for secrets will be added to the API:
{% highlight go %}
type Secret struct {
TypeMeta
ObjectMeta
// Data contains the secret data. Each key must be a valid DNS_SUBDOMAIN.
// The serialized form of the secret data is a base64 encoded string,
// representing the arbitrary (possibly non-string) data value here.
Data map[string][]byte `json:"data,omitempty"`
// Used to facilitate programmatic handling of secret data.
Type SecretType `json:"type,omitempty"`
}
type SecretType string
const (
SecretTypeOpaque SecretType = "Opaque" // Opaque (arbitrary data; default)
SecretTypeServiceAccountToken SecretType = "kubernetes.io/service-account-token" // Kubernetes auth token
SecretTypeDockercfg SecretType = "kubernetes.io/dockercfg" // Docker registry auth
// FUTURE: other type values
)
const MaxSecretSize = 1 * 1024 * 1024
{% endhighlight %}
A Secret can declare a type in order to provide type information to system components that work
with secrets. The default type is `Opaque`, which represents arbitrary user-owned data.
Secrets are validated against `MaxSecretSize`. The keys in the `Data` field must be valid DNS
subdomains.
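These two checks could be sketched as follows. This is not the actual API server validation code: the regular expression is a common RFC 1123 subdomain pattern, and measuring size as the sum of the value lengths is an assumption about how `MaxSecretSize` is applied.

```go
package main

import (
	"fmt"
	"regexp"
)

const MaxSecretSize = 1 * 1024 * 1024

// dnsSubdomain matches lowercase alphanumeric labels, optionally containing
// '-', joined by '.'. Each label must start and end with an alphanumeric.
var dnsSubdomain = regexp.MustCompile(
	`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// ValidateSecretData checks every key and the total size of the values.
func ValidateSecretData(data map[string][]byte) error {
	total := 0
	for key, value := range data {
		if len(key) > 253 || !dnsSubdomain.MatchString(key) {
			return fmt.Errorf("key %q is not a valid DNS subdomain", key)
		}
		total += len(value)
	}
	if total > MaxSecretSize {
		return fmt.Errorf("secret data is %d bytes, limit is %d", total, MaxSecretSize)
	}
	return nil
}

func main() {
	ok := map[string][]byte{"id-rsa": []byte("key"), "id-rsa.pub": []byte("pub")}
	fmt.Println(ValidateSecretData(ok))
	fmt.Println(ValidateSecretData(map[string][]byte{"ID_RSA": []byte("key")}))
}
```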
A new REST API and registry interface will be added to accompany the `Secret` resource. The
default implementation of the registry will store `Secret` information in etcd. Future registry
implementations could store the `TypeMeta` and `ObjectMeta` fields in etcd and store the secret
data in another data store entirely, or store the whole object in another data store.
#### Other validations related to secrets
Initially there will be no validations for the number of secrets a pod references, or the number of
secrets that can be associated with a service account. These may be added in the future as the
finer points of secrets and resource allocation are fleshed out.
### Secret Volume Source
A new `SecretSource` type of volume source will be added to the `VolumeSource` struct in the
API:
{% highlight go %}
type VolumeSource struct {
// Other fields omitted
// SecretSource represents a secret that should be presented in a volume
SecretSource *SecretSource `json:"secret"`
}
type SecretSource struct {
Target ObjectReference
}
{% endhighlight %}
Secret volume sources are validated to ensure that the specified object reference actually points
to an object of type `Secret`.
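A minimal sketch of that validation, assuming a simplified `ObjectReference` with only the fields needed here; the function name and error messages are hypothetical. It checks the declared kind of the reference and does not confirm that the referenced object exists.

```go
package main

import "fmt"

// ObjectReference is a simplified stand-in for the API type.
type ObjectReference struct {
	Kind      string
	Namespace string
	Name      string
}

// SecretSource mirrors the struct proposed above.
type SecretSource struct {
	Target ObjectReference
}

// ValidateSecretSource rejects sources that do not name an object of kind Secret.
func ValidateSecretSource(s *SecretSource) error {
	if s == nil {
		return fmt.Errorf("secret source is required")
	}
	if s.Target.Kind != "Secret" {
		return fmt.Errorf("target kind must be Secret, got %q", s.Target.Kind)
	}
	if s.Target.Name == "" {
		return fmt.Errorf("target name is required")
	}
	return nil
}

func main() {
	good := &SecretSource{Target: ObjectReference{Kind: "Secret", Namespace: "example", Name: "ssh-key-secret"}}
	bad := &SecretSource{Target: ObjectReference{Kind: "Pod", Namespace: "example", Name: "secret-test-pod"}}
	fmt.Println(ValidateSecretSource(good))
	fmt.Println(ValidateSecretSource(bad))
}
```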
In the future, the `SecretSource` will be extended to allow:
1. Fine-grained control over which pieces of secret data are exposed in the volume
2. The paths and filenames for how secret data are exposed
### Secret Volume Plugin
A new Kubelet volume plugin will be added to handle volumes with a secret source. This plugin will
require access to the API server to retrieve secret data and therefore the volume `Host` interface
will have to change to expose a client interface:
{% highlight go %}
type Host interface {
// Other methods omitted
// GetKubeClient returns a client interface
GetKubeClient() client.Interface
}
{% endhighlight %}
The secret volume plugin will be responsible for:
1. Returning a `volume.Builder` implementation from `NewBuilder` that:
1. Retrieves the secret data for the volume from the API server
2. Places the secret data onto the container's filesystem
3. Sets the correct security attributes for the volume based on the pod's `SecurityContext`
2. Returning a `volume.Cleaner` implementation from `NewCleaner` that cleans the volume from the
container's filesystem
### Kubelet: Node-level secret storage
The Kubelet must be modified to accept a new parameter for the secret storage size and to create
a tmpfs file system of that size to store secret data. Rough accounting of specific changes:
1. The Kubelet should have a new field added called `secretStorageSize`; units are megabytes
2. `NewMainKubelet` should accept a value for secret storage size
3. The Kubelet server should have a new flag added for secret storage size
4. The Kubelet's `setupDataDirs` method should be changed to create the secret storage
### Kubelet: New behaviors for secrets associated with service accounts
For use-cases where the Kubelet's behavior is affected by the secrets associated with a pod's
`ServiceAccount`, the Kubelet will need to be changed. For example, if secrets of type
`docker-reg-auth` affect how the pod's images are pulled, the Kubelet will need to be changed
to accommodate this. Subsequent proposals can address this on a type-by-type basis.
## Examples
For clarity, let's examine some detailed examples of some common use-cases in terms of the
suggested changes. All of these examples are assumed to be created in a namespace called
`example`.
### Use-Case: Pod with ssh keys
To create a pod that uses an ssh key stored as a secret, we first need to create a secret:
{% highlight json %}
{
"kind": "Secret",
"apiVersion": "v1",
"metadata": {
"name": "ssh-key-secret"
},
"data": {
"id-rsa": "dmFsdWUtMg0KDQo=",
"id-rsa.pub": "dmFsdWUtMQ0K"
}
}
{% endhighlight %}
**Note:** The serialized JSON and YAML values of secret data are encoded as
base64 strings. Newlines are not valid within these strings and must be
omitted.
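The encoding can be reproduced with Go's standard `encoding/base64` package. The sketch below uses the sample values from the secret above; the helper names are hypothetical. Decoding shows that these particular samples contain trailing carriage-return and line-feed bytes inside the secret value itself, which is distinct from newlines in the base64 string.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// EncodeSecretValue returns the single-line base64 form used in the
// serialized JSON and YAML.
func EncodeSecretValue(raw []byte) string {
	return base64.StdEncoding.EncodeToString(raw)
}

// DecodeSecretValue reverses EncodeSecretValue.
func DecodeSecretValue(encoded string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(encoded)
}

func main() {
	fmt.Println(EncodeSecretValue([]byte("value-1\r\n"))) // dmFsdWUtMQ0K

	decoded, err := DecodeSecretValue("dmFsdWUtMg0KDQo=")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", decoded) // "value-2\r\n\r\n"
}
```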
Now we can create a pod which references the secret with the ssh key and consumes it in a volume:
{% highlight json %}
{
"kind": "Pod",
"apiVersion": "v1",
"metadata": {
"name": "secret-test-pod",
"labels": {
"name": "secret-test"
}
},
"spec": {
"volumes": [
{
"name": "secret-volume",
"secret": {
"secretName": "ssh-key-secret"
}
}
],
"containers": [
{
"name": "ssh-test-container",
"image": "mySshImage",
"volumeMounts": [
{
"name": "secret-volume",
"readOnly": true,
"mountPath": "/etc/secret-volume"
}
]
}
]
}
}
{% endhighlight %}
When the container's command runs, the pieces of the key will be available in:
/etc/secret-volume/id-rsa.pub
/etc/secret-volume/id-rsa
The container is then free to use the secret data to establish an ssh connection.
### Use-Case: Pods with prod / test credentials
This example illustrates a pod which consumes a secret containing prod
credentials and another pod which consumes a secret with test environment
credentials.
The secrets:
{% highlight json %}
{
"apiVersion": "v1",
"kind": "List",
"items":
[{
"kind": "Secret",
"apiVersion": "v1",
"metadata": {
"name": "prod-db-secret"
},
"data": {
"password": "dmFsdWUtMg0KDQo=",
"username": "dmFsdWUtMQ0K"
}
},
{
"kind": "Secret",
"apiVersion": "v1",
"metadata": {
"name": "test-db-secret"
},
"data": {
"password": "dmFsdWUtMg0KDQo=",
"username": "dmFsdWUtMQ0K"
}
}]
}
{% endhighlight %}
The pods:
{% highlight json %}
{
"apiVersion": "v1",
"kind": "List",
"items":
[{
"kind": "Pod",
"apiVersion": "v1",
"metadata": {
"name": "prod-db-client-pod",
"labels": {
"name": "prod-db-client"
}
},
"spec": {
"volumes": [
{
"name": "secret-volume",
"secret": {
"secretName": "prod-db-secret"
}
}
],
"containers": [
{
"name": "db-client-container",
"image": "myClientImage",
"volumeMounts": [
{
"name": "secret-volume",
"readOnly": true,
"mountPath": "/etc/secret-volume"
}
]
}
]
}
},
{
"kind": "Pod",
"apiVersion": "v1",
"metadata": {
"name": "test-db-client-pod",
"labels": {
"name": "test-db-client"
}
},
"spec": {
"volumes": [
{
"name": "secret-volume",
"secret": {
"secretName": "test-db-secret"
}
}
],
"containers": [
{
"name": "db-client-container",
"image": "myClientImage",
"volumeMounts": [
{
"name": "secret-volume",
"readOnly": true,
"mountPath": "/etc/secret-volume"
}
]
}
]
}
}]
}
{% endhighlight %}
The specs for the two pods differ only in the value of the object referred to by the secret volume
source. Both containers will have the following files present on their filesystems:
/etc/secret-volume/username
/etc/secret-volume/password

View File

@ -1,121 +0,0 @@
---
title: "Security in Kubernetes"
---
Kubernetes should define a reasonable set of security best practices that allows processes to be isolated from each other, from the cluster infrastructure, and which preserves important boundaries between those who manage the cluster, and those who use the cluster.
While Kubernetes today is not primarily a multi-tenant system, the long term evolution of Kubernetes will increasingly rely on proper boundaries between users and administrators. The code running on the cluster must be appropriately isolated and secured to prevent malicious parties from affecting the entire cluster.
## High Level Goals
1. Ensure a clear isolation between the container and the underlying host it runs on
2. Limit the ability of the container to negatively impact the infrastructure or other containers
3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) - ensure components are only authorized to perform the actions they need, and limit the scope of a compromise by limiting the capabilities of individual components
4. Reduce the number of systems that have to be hardened and secured by defining clear boundaries between components
5. Allow users of the system to be cleanly separated from administrators
6. Allow administrative functions to be delegated to users where necessary
7. Allow applications to be run on the cluster that have "secret" data (keys, certs, passwords) which is properly abstracted from "public" data.
## Use cases
### Roles
We define "user" as a unique identity accessing the Kubernetes API server, which may be a human or an automated process. Human users fall into the following categories:
1. k8s admin - administers a Kubernetes cluster and has access to the underlying components of the system
2. k8s project administrator - administrates the security of a small subset of the cluster
3. k8s developer - launches pods on a Kubernetes cluster and consumes cluster resources
Automated process users fall into the following categories:
1. k8s container user - a user that processes running inside a container (on the cluster) can use to access other cluster resources independent of the human users attached to a project
2. k8s infrastructure user - the user that Kubernetes infrastructure components use to perform cluster functions with clearly defined roles
### Description of roles
* Developers:
* write pod specs.
* make some of their own images, and use some "community" docker images
* know which pods need to talk to which other pods
* decide which pods should share files with other pods, and which should not.
* reason about application level security, such as containing the effects of a local-file-read exploit in a webserver pod.
* do not often reason about operating system or organizational security.
* are not necessarily comfortable reasoning about the security properties of a system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc.
* Project Admins:
* allocate identity and roles within a namespace
* reason about organizational security within a namespace
* don't give a developer permissions that are not needed for their role.
* protect files on shared storage from unnecessary cross-team access
* are less focused on application security
* Administrators:
* are less focused on application security and more focused on operating system security.
* protect the node from bad actors in containers, and properly-configured innocent containers from bad actors in other containers.
* are comfortable reasoning about the security properties of a system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc.
* decide who can use which Linux Capabilities, run privileged containers, use hostPath, etc.
* e.g. a team that manages Ceph or a mysql server might be trusted to have raw access to storage devices in some organizations, but teams that develop the applications at higher layers would not.
## Proposed Design
A pod runs in a *security context* under a *service account* that is defined by an administrator or project administrator, and the *secrets* a pod has access to are limited by that *service account*.
1. The API should authenticate and authorize user actions [authn and authz](access)
2. All infrastructure components (kubelets, kube-proxies, controllers, scheduler) should have an infrastructure user that they can authenticate with and be authorized to perform only the functions they require against the API.
3. Most infrastructure components should use the API as a way of exchanging data and changing the system, and only the API should have access to the underlying data store (etcd)
4. When containers run on the cluster and need to talk to other containers or the API server, they should be identified and authorized clearly as an autonomous process via a [service account](service_accounts)
1. If the user who started a long-lived process is removed from access to the cluster, the process should be able to continue without interruption
2. If the users who started processes are removed from the cluster, administrators may wish to terminate their processes in bulk
3. When containers run with a service account, the user that created / triggered the service account behavior must be associated with the container's action
5. When container processes run on the cluster, they should run in a [security context](security_context) that isolates those processes via Linux user security, user namespaces, and permissions.
1. Administrators should be able to configure the cluster to automatically confine all container processes as a non-root, randomly assigned UID
2. Administrators should be able to ensure that container processes within the same namespace are all assigned the same unix user UID
3. Administrators should be able to limit which developers and project administrators have access to higher privilege actions
4. Project administrators should be able to run pods within a namespace under different security contexts, and developers must be able to specify which of the available security contexts they may use
5. Developers should be able to run their own images or images from the community and expect those images to run correctly
6. Developers may need to ensure their images work within higher security requirements specified by administrators
7. When available, Linux kernel user namespaces can be used to ensure 5.2 and 5.4 are met.
8. When application developers want to share filesystem data via distributed filesystems, the Unix user ids on those filesystems must be consistent across different container processes
6. Developers should be able to define [secrets](secrets) that are automatically added to the containers when pods are run
1. Secrets are files injected into the container whose values should not be displayed within a pod. Examples:
1. An SSH private key for git cloning remote data
2. A client certificate for accessing a remote system
3. A private key and certificate for a web server
4. A .kubeconfig file with embedded cert / token data for accessing the Kubernetes master
5. A .dockercfg file for pulling images from a protected registry
2. Developers should be able to define the pod spec so that a secret lands in a specific location
3. Project administrators should be able to limit developers within a namespace from viewing or modifying secrets (anyone who can launch an arbitrary pod can view secrets)
4. Secrets are generally not copied from one namespace to another when a developer's application definitions are copied
### Related design discussion
* [Authorization and authentication](access)
* [Secret distribution via files](http://pr.k8s.io/2030)
* [Docker secrets](https://github.com/docker/docker/pull/6697)
* [Docker vault](https://github.com/docker/docker/issues/10310)
* [Service Accounts](service_accounts)
* [Secret volumes](http://pr.k8s.io/4126)
## Specific Design Points
### TODO: authorization, authentication
### Isolate the data store from the nodes and supporting infrastructure
Access to the central data store (etcd) in Kubernetes allows an attacker to run arbitrary containers on hosts, to gain access to any protected information stored in either volumes or in pods (such as access tokens or shared secrets provided as environment variables), to intercept and redirect traffic from running services by inserting middlemen, or to simply delete the entire history of the cluster.
As a general principle, access to the central data store should be restricted to the components that need full control over the system and which can apply appropriate authorization and authentication of change requests. In the future, etcd may offer granular access control, but that granularity will require an administrator to understand the schema of the data to properly apply security. An administrator must be able to properly secure Kubernetes at a policy level, rather than at an implementation level, and schema changes over time should not risk unintended security leaks.
Both the Kubelet and Kube Proxy need information related to their specific roles - for the Kubelet, the set of pods it should be running, and for the Proxy, the set of services and endpoints to load balance. The Kubelet also needs to provide information about running pods and historical termination data. The access pattern for both Kubelet and Proxy to load their configuration is an efficient "wait for changes" request over HTTP. It should be possible to limit the Kubelet and Proxy to only access the information they need to perform their roles and no more.
The controller manager for Replication Controllers and other future controllers act on behalf of a user via delegation to perform automated maintenance on Kubernetes resources. Their ability to access or modify resource state should be strictly limited to their intended duties and they should be prevented from accessing information not pertinent to their role. For example, a replication controller needs only to create a copy of a known pod configuration, to determine the running state of an existing pod, or to delete an existing pod that it created - it does not need to know the contents or current state of a pod, nor have access to any data in the pod's attached volumes.
The Kubernetes pod scheduler is responsible for reading data from the pod to fit it onto a node in the cluster. At a minimum, it needs access to view the ID of a pod (to craft the binding), its current state, any resource information necessary to identify placement, and other data relevant to concerns like anti-affinity, zone or region preference, or custom logic. It does not need the ability to modify pods or see other resources, only to create bindings. It should not need the ability to delete bindings unless the scheduler takes control of relocating components on failed hosts (which could be implemented by a separate component that can delete bindings but not create them). The scheduler may need read access to user or project-container information to determine preferential location (underspecified at this time).

View File

@ -1,170 +0,0 @@
---
title: "Security Contexts"
---
## Abstract
A security context is a set of constraints that are applied to a container in order to achieve the following goals (from [security design](security)):
1. Ensure a clear isolation between the container and the underlying host it runs on
2. Limit the ability of the container to negatively impact the infrastructure or other containers
## Background
The problem of securing containers in Kubernetes has come up [before](http://issue.k8s.io/398) and the potential problems with container security are [well known](http://opensource.com/business/14/7/docker-security-selinux). Although it is not possible to completely isolate Docker containers from their hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304) make it possible to greatly reduce the attack surface.
## Motivation
### Container isolation
In order to improve container isolation from host and other containers running on the host, containers should only be
granted the access they need to perform their work. To this end it should be possible to take advantage of Docker
features such as the ability to [add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration)
to the container process.
Support for user namespaces has recently been [merged](https://github.com/docker/libcontainer/pull/304) into Docker's libcontainer project and should soon surface in Docker itself. It will make it possible to assign a range of unprivileged uids and gids from the host to each container, improving the isolation between host and container and between containers.
### External integration with shared storage
In order to support external integration with shared storage, processes running in a Kubernetes cluster
should be able to be uniquely identified by their Unix UID, such that a chain of ownership can be established.
Processes in pods will need to have consistent UID/GID/SELinux category labels in order to access shared disks.
## Constraints and Assumptions
* It is out of the scope of this document to prescribe a specific set
of constraints to isolate containers from their host. Different use cases need different
settings.
* The concept of a security context should not be tied to a particular security mechanism or platform
(e.g. SELinux, AppArmor)
* Applying a different security context to a scope (namespace or pod) requires a solution such as the one proposed for
[service accounts](service_accounts).
## Use Cases
In order of increasing complexity, following are example use cases that would
be addressed with security contexts:
1. Kubernetes is used to run a single cloud application. In order to protect
   nodes from containers:
   * All containers run as a single non-root user
   * Privileged containers are disabled
   * All containers run with a particular MCS label
   * Kernel capabilities like CHOWN and MKNOD are removed from containers
2. Just like case #1, except that I have more than one application running on
   the Kubernetes cluster.
   * Each application is run in its own namespace to avoid name collisions
   * For each application a different uid and MCS label is used
3. Kubernetes is used as the base for a PaaS with
   multiple projects, each project represented by a namespace.
   * Each namespace is associated with a range of uids/gids on the node that
     are mapped to uids/gids on containers using Linux user namespaces.
   * Certain pods in each namespace have special privileges to perform system
     actions such as talking back to the server for deployment, running Docker
     builds, etc.
   * External NFS storage is assigned to each namespace and permissions set
     using the range of uids/gids assigned to that namespace.
## Proposed Design
### Overview
A *security context* consists of a set of constraints that determine how a container
is secured before getting created and run. A security context resides on the container and represents the runtime parameters that will
be used to create and run the container via container APIs. A *security context provider* is passed to the Kubelet so it can have a chance
to mutate Docker API calls in order to apply the security context.
It is recommended that this design be implemented in two phases:
1. Implement the security context provider extension point in the Kubelet
so that a default security context can be applied on container run and creation.
2. Implement a security context structure that is part of a service account. The
default context provider can then be used to apply a security context based
on the service account associated with the pod.
### Security Context Provider
The Kubelet will have an interface that points to a `SecurityContextProvider`. The `SecurityContextProvider` is invoked before creating and running a given container:
{% highlight go %}
type SecurityContextProvider interface {
    // ModifyContainerConfig is called before the Docker createContainer call.
    // The security context provider can make changes to the Config with which
    // the container is created.
    // An error is returned if it's not possible to secure the container as
    // requested with a security context.
    ModifyContainerConfig(pod *api.Pod, container *api.Container, config *docker.Config)
    // ModifyHostConfig is called before the Docker runContainer call.
    // The security context provider can make changes to the HostConfig, affecting
    // security options, whether the container is privileged, volume binds, etc.
    // An error is returned if it's not possible to secure the container as requested
    // with a security context.
    ModifyHostConfig(pod *api.Pod, container *api.Container, hostConfig *docker.HostConfig)
}
{% endhighlight %}
If the value of the SecurityContextProvider field on the Kubelet is nil, the kubelet will create and run the container as it does today.
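As a rough illustration of the extension point, the sketch below shows what a very small provider might do. It is not the Kubelet's implementation: `ContainerConfig`, `HostConfig`, and `simpleProvider` are hypothetical stand-ins for the `api` and `docker` types named in the interface, and the pod argument is omitted for brevity.

```go
package main

import (
	"fmt"
	"strconv"
)

// Stand-in types for illustration only. The design itself refers to
// api.Container, docker.Config and docker.HostConfig.
type SecurityContext struct {
	Privileged *bool
	RunAsUser  *int64
}

type Container struct {
	Name            string
	SecurityContext *SecurityContext
}

type ContainerConfig struct {
	User string
}

type HostConfig struct {
	Privileged bool
}

// simpleProvider is a hypothetical provider that copies settings from the
// container's security context into the runtime configuration.
type simpleProvider struct{}

// ModifyContainerConfig sets the user the container process runs as,
// leaving the config untouched when no UID was requested.
func (simpleProvider) ModifyContainerConfig(c *Container, cfg *ContainerConfig) {
	if c.SecurityContext == nil || c.SecurityContext.RunAsUser == nil {
		return
	}
	cfg.User = strconv.FormatInt(*c.SecurityContext.RunAsUser, 10)
}

// ModifyHostConfig applies the privileged flag, leaving the host config
// untouched when the flag was not set.
func (simpleProvider) ModifyHostConfig(c *Container, hc *HostConfig) {
	if c.SecurityContext == nil || c.SecurityContext.Privileged == nil {
		return
	}
	hc.Privileged = *c.SecurityContext.Privileged
}

func main() {
	uid := int64(1001)
	privileged := false
	c := &Container{
		Name:            "web",
		SecurityContext: &SecurityContext{Privileged: &privileged, RunAsUser: &uid},
	}
	cfg, hc := &ContainerConfig{}, &HostConfig{Privileged: true}
	p := simpleProvider{}
	p.ModifyContainerConfig(c, cfg)
	p.ModifyHostConfig(c, hc)
	fmt.Println(cfg.User, hc.Privileged)
}
```

Using pointer fields in the stand-in `SecurityContext` mirrors the structure proposed in this document: a nil field means "not specified", so the provider can distinguish an explicit `false` from an unset value.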
### Security Context
A security context resides on the container and represents the runtime parameters that will
be used to create and run the container via container APIs. Following is an example of an initial implementation:
{% highlight go %}
type Container struct {
    // ... other fields omitted ...
    // Optional: SecurityContext defines the security options the container should be run with
    SecurityContext *SecurityContext
}
// SecurityContext holds security configuration that will be applied to a container. SecurityContext
// contains duplication of some existing fields from the Container resource. These duplicate fields
// will be populated based on the Container configuration if they are not set. Defining them on
// both the Container AND the SecurityContext will result in an error.
type SecurityContext struct {
    // Capabilities are the capabilities to add/drop when running the container
    Capabilities *Capabilities
    // Run the container in privileged mode
    Privileged *bool
    // SELinuxOptions are the labels to be applied to the container
    // and volumes
    SELinuxOptions *SELinuxOptions
    // RunAsUser is the UID to run the entrypoint of the container process.
    RunAsUser *int64
}
// SELinuxOptions are the labels to be applied to the container.
type SELinuxOptions struct {
    // SELinux user label
    User string
    // SELinux role label
    Role string
    // SELinux type label
    Type string
    // SELinux level label.
    Level string
}
{% endhighlight %}
### Admission
It is up to an admission plugin to determine if the security context is acceptable or not. At the
time of writing, the admission control plugin for security contexts will only allow a context that
has defined capabilities or privileged. Contexts that attempt to define a UID or SELinux options
will be denied by default. In the future the admission plugin will base this decision upon
configurable policies that reside within the [service account](http://pr.k8s.io/2297).
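The default rule described above can be sketched as a small check. This is an illustration under stated assumptions rather than the actual admission plugin: the `admit` function is hypothetical, the `Capabilities` fields (`Add`, `Drop`) are assumed because this document does not define that type, and admitting a nil context is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in types mirroring the SecurityContext structure in this document.
// The Add/Drop fields on Capabilities are an assumption.
type Capabilities struct {
	Add  []string
	Drop []string
}

type SELinuxOptions struct {
	User, Role, Type, Level string
}

type SecurityContext struct {
	Capabilities   *Capabilities
	Privileged     *bool
	SELinuxOptions *SELinuxOptions
	RunAsUser      *int64
}

// admit is a hypothetical default check: contexts that only set
// capabilities or the privileged flag pass, while contexts that set a UID
// or SELinux options are denied.
func admit(sc *SecurityContext) error {
	if sc == nil {
		return nil // assumption: no context means nothing to reject
	}
	if sc.RunAsUser != nil {
		return errors.New("security context may not set RunAsUser")
	}
	if sc.SELinuxOptions != nil {
		return errors.New("security context may not set SELinuxOptions")
	}
	return nil
}

func main() {
	uid := int64(0)
	allowed := &SecurityContext{Capabilities: &Capabilities{Drop: []string{"MKNOD"}}}
	denied := &SecurityContext{RunAsUser: &uid}
	fmt.Println(admit(allowed))
	fmt.Println(admit(denied))
}
```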

View File

@ -1,173 +0,0 @@
---
title: "Service Accounts"
---
## Motivation
Processes in Pods may need to call the Kubernetes API. For example:
- scheduler
- replication controller
- node controller
- a map-reduce type framework which has a controller that then tries to make a dynamically determined number of workers and watch them
- continuous build and push system
- monitoring system
They also may interact with services other than the Kubernetes API, such as:
- an image repository, such as docker -- both when the images are pulled to start the containers, and for writing
images in the case of pods that generate images.
- accessing other cloud services, such as blob storage, in the context of a large, integrated, cloud offering (hosted
or private).
- accessing files in an NFS volume attached to the pod
## Design Overview
A service account binds together several things:
- a *name*, understood by users, and perhaps by peripheral systems, for an identity
- a *principal* that can be authenticated and [authorized](../admin/authorization)
- a [security context](security_context), which defines the Linux Capabilities, User IDs, Groups IDs, and other
capabilities and controls on interaction with the file system and OS.
- a set of [secrets](secrets), which a container may use to
access various networked resources.
## Design Discussion
A new object Kind is added:
{% highlight go %}
type ServiceAccount struct {
    TypeMeta `json:",inline" yaml:",inline"`
    ObjectMeta `json:"metadata,omitempty" yaml:"metadata,omitempty"`
    username string
    securityContext ObjectReference // (reference to a securityContext object)
    secrets []ObjectReference // (references to secret objects)
}
{% endhighlight %}
The name ServiceAccount is chosen because it is widely used already (e.g. by Kerberos and LDAP)
to refer to this type of account. Note that it has no relation to Kubernetes Service objects.
The ServiceAccount object does not include any information that could not be defined separately:
- username can be defined however users are defined.
- securityContext and secrets are only referenced and are created using the REST API.
The purpose of the serviceAccount object is twofold:
- to bind usernames to securityContexts and secrets, so that the username can be used to refer succinctly
in contexts where explicitly naming securityContexts and secrets would be inconvenient
- to provide an interface to simplify allocation of new securityContexts and secrets.
These features are explained later.
### Names
From the standpoint of the Kubernetes API, a `user` is any principal which can authenticate to Kubernetes API.
This includes a human running `kubectl` on her desktop and a container in a Pod on a Node making API calls.
There is already a notion of a username in Kubernetes, which is populated into a request context after authentication.
However, there is no API object representing a user. While this may evolve, it is expected that in mature installations,
the canonical storage of user identifiers will be handled by a system external to Kubernetes.
Kubernetes does not dictate how to divide up the space of user identifier strings. User names can be
simple Unix-style short usernames, (e.g. `alice`), or may be qualified to allow for federated identity (
`alice@example.com` vs `alice@example.org`.) Naming convention may distinguish service accounts from user
accounts (e.g. `alice@example.com` vs `build-service-account-a3b7f0@foo-namespace.service-accounts.example.com`),
but Kubernetes does not require this.
Kubernetes also does not require that there be a distinction between human and Pod users. It will be possible
to set up a cluster where Alice the human talks to the Kubernetes API as username `alice` and starts pods that
also talk to the API as user `alice` and write files to NFS as user `alice`. But, this is not recommended.
Instead, it is recommended that Pods and Humans have distinct identities, and reference implementations will
make this distinction.
The distinction is useful for a number of reasons:
- the requirements for humans and automated processes are different:
  - Humans need a wide range of capabilities to do their daily activities. Automated processes often have more narrowly-defined activities.
  - Humans may better tolerate the exceptional conditions created by expiration of a token. Remembering to handle
    this in a program is more annoying. So, either long-lasting credentials or automated rotation of credentials is
    needed.
  - A Human typically keeps credentials on a machine that is not part of the cluster and so not subject to automatic
    management. A VM with a role/service-account can have its credentials automatically managed.
- the identity of a Pod cannot in general be mapped to a single human.
  - If policy allows, it may be created by one human, and then updated by another, and another, until its behavior cannot be attributed to a single human.
**TODO**: consider getting rid of separate serviceAccount object and just rolling its parts into the SecurityContext or
Pod Object.
The `secrets` field is a list of references to /secret objects that a process started as that service account should
have access to in order to assert that role.
The secrets are not inline with the serviceAccount object. This way, most or all users can have permission to `GET /serviceAccounts` so they can remind themselves
what serviceAccounts are available for use.
Nothing will prevent creation of a serviceAccount with two secrets of type `SecretTypeKubernetesAuth`, or secrets of two
different types. Kubelet and client libraries will have some behavior, TBD, to handle the case of multiple secrets of a
given type (pick first or provide all and try each in order, etc).
When a serviceAccount and a matching secret exist, then a `User.Info` for the serviceAccount and a `BearerToken` from the secret
are added to the map of tokens used by the authentication process in the apiserver, and similarly for other types. (We
might have some types that do not do anything on apiserver but just get pushed to the kubelet.)
### Pods
The `PodSpec` is extended to have a `Pods.Spec.ServiceAccountUsername` field. If this is unset, then a
default value is chosen. If it is set, then the corresponding value of `Pods.Spec.SecurityContext` is set by the
Service Account Finalizer (see below).
TBD: how policy limits which users can make pods with which service accounts.
### Authorization
Kubernetes API Authorization Policies refer to users. Pods created with a `Pods.Spec.ServiceAccountUsername` typically
get a `Secret` which allows them to authenticate to the Kubernetes APIserver as a particular user. So any
policy that is desired can be applied to them.
A higher level workflow is needed to coordinate creation of serviceAccounts, secrets and relevant policy objects.
Users are free to extend Kubernetes to put this business logic wherever is convenient for them, though the
Service Account Finalizer is one place where this can happen (see below).
### Kubelet
The kubelet will treat as "not ready to run" (needing a finalizer to act on it) any Pod which has an empty
SecurityContext.
The kubelet will set a default, restrictive, security context for any pods created from non-Apiserver config
sources (http, file).
Kubelet watches apiserver for secrets which are needed by pods bound to it.
**TODO**: how to only let kubelet see secrets it needs to know.
### The service account finalizer
There are several ways to use Pods with SecurityContexts and Secrets.
One way is to explicitly specify the securityContext and all secrets of a Pod when the pod is initially created,
like this:
**TODO**: example of pod with explicit refs.
Another way is with the *Service Account Finalizer*, a plugin process which is optional, and which handles
business logic around service accounts.
The Service Account Finalizer watches Pods, Namespaces, and ServiceAccount definitions.
First, if it finds pods which have a `Pod.Spec.ServiceAccountUsername` but no `Pod.Spec.SecurityContext` set,
then it copies in the referenced securityContext and secrets references for the corresponding `serviceAccount`.
Second, if ServiceAccount definitions change, it may take some actions.
**TODO**: decide what actions it takes when a serviceAccount definition changes. Does it stop pods, or just
allow someone to list ones that are out of spec? In general, people may want to customize this?
Third, if a new namespace is created, it may create a new serviceAccount for that namespace. This may include
a new username (e.g. `NAMESPACE-default-service-account@serviceaccounts.$CLUSTERID.kubernetes.io`), a new
securityContext, a newly generated secret to authenticate that serviceAccount to the Kubernetes API, and default
policies for that service account.
**TODO**: more concrete example. What are typical default permissions for default service account (e.g. readonly access
to services in the same namespace and read-write access to events in that namespace?)
Finally, it may provide an interface to automate creation of new serviceAccounts. In that case, the user may want
to GET serviceAccounts to see what has been created.
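The first finalizer behavior (copying the referenced securityContext and secrets into a pod that names a service account but has no security context) might look roughly like the sketch below. The types are simplified stand-ins that use plain strings in place of `ObjectReference`, and `finalize` is a hypothetical name.

```go
package main

import "fmt"

// ServiceAccount is a simplified stand-in: references are plain names here
// rather than ObjectReference values.
type ServiceAccount struct {
	Username        string
	SecurityContext string   // name of the referenced securityContext object
	Secrets         []string // names of the referenced secret objects
}

// PodSpec is a simplified stand-in holding only the fields the finalizer
// reads or writes.
type PodSpec struct {
	ServiceAccountUsername string
	SecurityContext        string
	Secrets                []string
}

// finalize copies the service account's references into the pod spec when
// the pod names a service account but has no security context yet. It
// reports whether the spec was changed.
func finalize(spec *PodSpec, accounts map[string]ServiceAccount) bool {
	if spec.ServiceAccountUsername == "" || spec.SecurityContext != "" {
		return false
	}
	sa, ok := accounts[spec.ServiceAccountUsername]
	if !ok {
		return false // unknown account: leave the pod "not ready to run"
	}
	spec.SecurityContext = sa.SecurityContext
	spec.Secrets = append(spec.Secrets, sa.Secrets...)
	return true
}

func main() {
	accounts := map[string]ServiceAccount{
		"build-bot": {
			Username:        "build-bot",
			SecurityContext: "restricted",
			Secrets:         []string{"build-bot-token"},
		},
	}
	spec := &PodSpec{ServiceAccountUsername: "build-bot"}
	changed := finalize(spec, accounts)
	fmt.Println(changed, spec.SecurityContext, spec.Secrets)
}
```

Pods that already carry a security context are left alone, which matches the explicit-specification path described earlier in this section.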

View File

@ -1,105 +0,0 @@
---
title: "Simple rolling update"
---
This is a lightweight design document for simple [rolling update](../user-guide/kubectl/kubectl_rolling-update) in `kubectl`.
Complete execution flow can be found [here](#execution-details). See the [example of rolling update](../user-guide/update-demo/) for more information.
### Lightweight rollout
Assume that we have a current replication controller named `foo` and it is running image `image:v1`
`kubectl rolling-update foo [foo-v2] --image=myimage:v2`
If the user doesn't specify a name for the 'next' replication controller, then the 'next' replication controller is renamed to
the name of the original replication controller.
Obviously there is a race here, where if you kill the client between delete foo, and creating the new version of 'foo' you might be surprised about what is there, but I think that's ok.
See [Recovery](#recovery) below
If the user does specify a name for the 'next' replication controller, then the 'next' replication controller is retained with its existing name,
and the old 'foo' replication controller is deleted. For the purposes of the rollout, we add a unique-ifying label `kubernetes.io/deployment` to both the `foo` and `foo-next` replication controllers.
The value of that label is the hash of the complete JSON representation of the `foo-next` or `foo` replication controller. The name of this label can be overridden by the user with the `--deployment-label-key` flag.
#### Recovery
If a rollout fails or is terminated in the middle, it is important that the user be able to resume the roll out.
To facilitate recovery in the case of a crash of the updating process itself, we add the following annotations to each replication controller in the `kubernetes.io/` annotation namespace:
* `desired-replicas` The desired number of replicas for this replication controller (either N or zero)
* `update-partner` A pointer to the replication controller resource that is the other half of this update (syntax `<name>`; the namespace is assumed to be identical to the namespace of this replication controller.)
Recovery is achieved by issuing the same command again:
{% highlight sh %}
kubectl rolling-update foo [foo-v2] --image=myimage:v2
{% endhighlight %}
Whenever the rolling update command executes, the kubectl client looks for replication controllers called `foo` and `foo-next`. If they exist, an attempt is
made to roll `foo` to `foo-next`. If `foo-next` does not exist, then it is created, and the rollout is a new rollout. If `foo` doesn't exist, then
it is assumed that the rollout is nearly completed, and `foo-next` is renamed to `foo`. Details of the execution flow are given below.
### Aborting a rollout
Abort is assumed to want to reverse a rollout in progress.
`kubectl rolling-update foo [foo-v2] --rollback`
This is really just syntactic sugar for:
`kubectl rolling-update foo-v2 foo`
With the added detail that it moves the `desired-replicas` annotation from `foo-v2` to `foo`
### Execution Details
For the purposes of this example, assume that we are rolling from `foo` to `foo-next` where the only change is an image update from `v1` to `v2`
If the user doesn't specify a `foo-next` name, then it is discovered from the `update-partner` annotation on `foo`. If that annotation doesn't exist,
then `foo-next` is synthesized using the pattern `<controller-name>-<hash-of-next-controller-JSON>`
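A minimal sketch of the name synthesis follows. The document does not say which hash function is used, so FNV-1a from the Go standard library is an assumption here, as are the `controller` type and the `nextName` helper.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// controller is a tiny stand-in for a replication controller; only the
// fields needed for the example are present.
type controller struct {
	Name  string `json:"name"`
	Image string `json:"image"`
}

// nextName builds "<controller-name>-<hash-of-next-controller-JSON>".
// FNV-1a is an arbitrary choice made for this sketch.
func nextName(current string, next controller) (string, error) {
	data, err := json.Marshal(next)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(data) // hash.Hash writes never return an error
	return fmt.Sprintf("%s-%x", current, h.Sum32()), nil
}

func main() {
	name, err := nextName("foo", controller{Image: "myimage:v2"})
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```

Because the name is derived from the content of the next controller, re-running the same command yields the same `foo-next` name, which is what makes resuming an interrupted rollout possible.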
#### Initialization
* If `foo` and `foo-next` do not exist:
  * Exit, and indicate an error to the user, that the specified controller doesn't exist.
* If `foo` exists, but `foo-next` does not:
  * Create `foo-next`, populate it with the `v2` image, and set `desired-replicas` to `foo.Spec.Replicas`
  * Goto Rollout
* If `foo-next` exists, but `foo` does not:
  * Assume that we are in the rename phase.
  * Goto Rename
* If both `foo` and `foo-next` exist:
  * Assume that we are in a partial rollout
  * If `foo-next` is missing the `desired-replicas` annotation
    * Populate the `desired-replicas` annotation to `foo-next` using the current size of `foo`
  * Goto Rollout
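The four initialization cases reduce to a decision on two booleans. The sketch below is illustrative only; the `phase` type and `initialPhase` function are hypothetical names, and the side effects (creating `foo-next`, populating annotations) are left as comments.

```go
package main

import "fmt"

// phase names the step that initialization hands off to.
type phase string

const (
	phaseError   phase = "error"
	phaseRollout phase = "rollout"
	phaseRename  phase = "rename"
)

// initialPhase maps the existence of foo and foo-next to the next step.
func initialPhase(fooExists, nextExists bool) phase {
	switch {
	case !fooExists && !nextExists:
		return phaseError // the specified controller doesn't exist
	case fooExists && !nextExists:
		return phaseRollout // new rollout: create foo-next first
	case !fooExists && nextExists:
		return phaseRename // rollout finished, rename still pending
	default:
		return phaseRollout // partial rollout: resume it
	}
}

func main() {
	fmt.Println(initialPhase(true, false))
	fmt.Println(initialPhase(false, true))
}
```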
#### Rollout
* While size of `foo-next` < `desired-replicas` annotation on `foo-next`
  * increase size of `foo-next`
  * if size of `foo` > 0
    * decrease size of `foo`
* Goto Rename
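The rollout loop can be simulated on plain replica counts, as in the sketch below. This is a simplification: real code would resize replication controllers through the API and wait for pods between steps, and the `rollout` function is a hypothetical name.

```go
package main

import "fmt"

// rollout grows foo-next one replica at a time until it reaches the
// desired size, shrinking foo by one on each step while it still has
// replicas. It returns the final sizes of foo and foo-next.
func rollout(fooSize, nextSize, desired int) (int, int) {
	for nextSize < desired {
		nextSize++
		if fooSize > 0 {
			fooSize--
		}
	}
	return fooSize, nextSize
}

func main() {
	foo, next := rollout(3, 0, 3)
	fmt.Println(foo, next) // prints "0 3"
}
```

When a rollout is resumed with `foo-next` already partly scaled up, `foo` can finish the loop with replicas remaining; the Rename step then deletes `foo` regardless.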
#### Rename
* delete `foo`
* create `foo` that is identical to `foo-next`
* delete `foo-next`
#### Abort
* If `foo-next` doesn't exist
  * Exit and indicate to the user that they may want to simply do a new rollout with the old version
* If `foo` doesn't exist
  * Exit and indicate not found to the user
* Otherwise, `foo-next` and `foo` both exist
  * Set `desired-replicas` annotation on `foo` to match the annotation on `foo-next`
  * Goto Rollout with `foo` and `foo-next` trading places.

View File

@ -1,56 +0,0 @@
---
title: "Kubernetes API and Release Versioning"
---
Legend:
* **Kube &lt;major&gt;.&lt;minor&gt;.&lt;patch&gt;** refers to the version of Kubernetes that is released. This versions all components: apiserver, kubelet, kubectl, etc.
* **API vX[betaY]** refers to the version of the HTTP API.
## Release Timeline
### Minor version scheme and timeline
* Kube 1.0.0, 1.0.1 -- DONE!
* Kube 1.0.X (X>1): Standard operating procedure. We patch the release-1.0 branch as needed and increment the patch number.
* Kube 1.1.0-alpha.X: Released roughly every two weeks by cutting from HEAD. No cherrypick releases. If there is a critical bugfix, a new release from HEAD can be created ahead of schedule.
* Kube 1.1.0-beta: When HEAD is feature-complete, we will cut the release-1.1.0 branch 2 weeks prior to the desired 1.1.0 date and only merge PRs essential to 1.1. This cut will be marked as 1.1.0-beta, and HEAD will be revved to 1.2.0-alpha.0.
* Kube 1.1.0: Final release, cut from the release-1.1.0 branch cut two weeks prior. Should occur between 3 and 4 months after 1.0. 1.1.1-beta will be tagged at the same time on the same branch.
### Major version timeline
There is no mandated timeline for major versions. They only occur when we need to start the clock on deprecating features. A given major version should be the latest major version for at least one year from its original release date.
## Release versions as related to API versions
Here is an example major release cycle:
* **Kube 1.0 should have API v1 without v1beta\* API versions**
* The last version of Kube before 1.0 (e.g. 0.14 or whatever it is) will have the stable v1 API. This enables you to migrate all your objects off of the beta API versions of the API and allows us to remove those beta API versions in Kube 1.0 with no effect. There will be tooling to help you detect and migrate any v1beta\* data versions or calls to v1 before you do the upgrade.
* **Kube 1.x may have API v2beta***
* The first incarnation of a new (backwards-incompatible) API in HEAD is v2beta1. By default this will be unregistered in apiserver, so it can change freely. Once it is available by default in apiserver (which may not happen for several minor releases), it cannot change ever again because we serialize objects in versioned form, and we always need to be able to deserialize any objects that are saved in etcd, even between alpha versions. If further changes to v2beta1 need to be made, v2beta2 is created, and so on, in subsequent 1.x versions.
* **Kube 1.y (where y is the last version of the 1.x series) must have final API v2**
* Before Kube 2.0 is cut, API v2 must be released in 1.x. This enables two things: (1) users can upgrade to API v2 when running Kube 1.x and then switch over to Kube 2.x transparently, and (2) in the Kube 2.0 release itself we can cleanup and remove all API v2beta\* versions because no one should have v2beta\* objects left in their database. As mentioned above, tooling will exist to make sure there are no calls or references to a given API version anywhere inside someone's kube installation before someone upgrades.
* Kube 2.0 must include the v1 API, but Kube 3.0 must include the v2 API only. It *may* include the v1 API as well if the burden is not high - this will be determined on a per-major-version basis.
## Rationale for API v2 being complete before v2.0's release
It may seem a bit strange to complete the v2 API before v2.0 is released, but *adding* a v2 API is not a breaking change. *Removing* the v2beta\* APIs *is* a breaking change, which is what necessitates the major version bump. There are other ways to do this, but having the major release be the fresh start of that release's API without the baggage of its beta versions seems most intuitive out of the available options.
# Patches
Patch releases are intended for critical bug fixes to the latest minor version, such as addressing security vulnerabilities, fixes to problems affecting a large number of users, severe problems with no workaround, and blockers for products based on Kubernetes.
They should not contain miscellaneous feature additions or improvements, and especially no incompatibilities should be introduced between patch versions of the same minor version (or even major version).
Dependencies, such as Docker or Etcd, should also not be changed unless absolutely necessary, and also just to fix critical bugs (so, at most patch version changes, not new major nor minor versions).
# Upgrades
* Users can upgrade from any Kube 1.x release to any other Kube 1.x release as a rolling upgrade across their cluster. (Rolling upgrade means being able to upgrade the master first, then one node at a time. See #4855 for details.)
* No hard breaking changes over version boundaries.
  * For example, if a user is at Kube 1.x, we may require them to upgrade to Kube 1.x+y before upgrading to Kube 2.x. In other words, an upgrade across major versions (e.g. Kube 1.x to Kube 2.x) should effectively be a no-op and as graceful as an upgrade from Kube 1.x to Kube 1.x+1. But you can require someone to go from 1.x to 1.x+y before they go to 2.x.
There is a separate question of how to track the capabilities of a kubelet to facilitate rolling upgrades. That is not addressed here.

View File

@ -1,10 +1,6 @@
---
title: "API Conventions"
---
API Conventions
===============
Updated: 9/20/2015
*This document is oriented at users who want a deeper understanding of the Kubernetes
@ -125,12 +121,14 @@ Objects that contain both spec and status should not contain additional top-leve
The `FooCondition` type for some resource type `Foo` may include a subset of the following fields, but must contain at least `type` and `status` fields:
{% highlight go %}
Type FooConditionType `json:"type" description:"type of Foo condition"`
Status ConditionStatus `json:"status" description:"status of the condition, one of True, False, Unknown"`
LastHeartbeatTime unversioned.Time `json:"lastHeartbeatTime,omitempty" description:"last time we got an update on a given condition"`
LastTransitionTime unversioned.Time `json:"lastTransitionTime,omitempty" description:"last time the condition transit from one status to another"`
Reason string `json:"reason,omitempty" description:"one-word CamelCase reason for the condition's last transition"`
Message string `json:"message,omitempty" description:"human-readable message indicating details about last transition"`
{% endhighlight %}
Additional fields may be added in the future.
@ -168,17 +166,21 @@ Discussed in [#2004](http://issue.k8s.io/2004) and elsewhere. There are no maps
For example:
{% highlight yaml %}
ports:
  - name: www
    containerPort: 80
{% endhighlight %}
vs.
{% highlight yaml %}
ports:
  www:
    containerPort: 80
{% endhighlight %}
This rule maintains the invariant that all JSON/YAML keys are fields in API objects. The only exceptions are pure maps in the API (currently, labels, selectors, annotations, data), as opposed to sets of subobjects.
@ -235,20 +237,24 @@ The API supports three different PATCH operations, determined by their correspon
In the standard JSON merge patch, JSON objects are always merged but lists are always replaced. Often that isn't what we want. Let's say we start with the following Pod:
{% highlight yaml %}
spec:
  containers:
    - name: nginx
      image: nginx-1.0
{% endhighlight %}
...and we POST that to the server (as JSON). Then let's say we want to *add* a container to this Pod.
{% highlight yaml %}
PATCH /api/v1/namespaces/default/pods/pod-name
spec:
  containers:
    - name: log-tailer
      image: log-tailer-1.0
{% endhighlight %}
If we were to use standard Merge Patch, the entire container list would be replaced with the single log-tailer container. However, our intent is for the container lists to merge together based on the `name` field.
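To make the intended behavior concrete, here is a small sketch of merging two container lists on the `name` key. It illustrates the idea only and is not the API server's strategic merge patch code; the `Container` type and `mergeByName` function are hypothetical.

```go
package main

import "fmt"

// Container is a minimal stand-in with just a merge key and one field.
type Container struct {
	Name  string
	Image string
}

// mergeByName merges patch into a copy of current, using Name as the merge
// key: entries with a matching name are updated in place, and entries with
// a new name are appended.
func mergeByName(current, patch []Container) []Container {
	merged := make([]Container, len(current))
	copy(merged, current)
	index := make(map[string]int, len(merged))
	for i, c := range merged {
		index[c.Name] = i
	}
	for _, p := range patch {
		if i, ok := index[p.Name]; ok {
			if p.Image != "" {
				merged[i].Image = p.Image // only overwrite fields the patch sets
			}
			continue
		}
		index[p.Name] = len(merged)
		merged = append(merged, p)
	}
	return merged
}

func main() {
	current := []Container{{Name: "nginx", Image: "nginx-1.0"}}
	patch := []Container{{Name: "log-tailer", Image: "log-tailer-1.0"}}
	fmt.Println(mergeByName(current, patch))
}
```

A plain JSON merge patch would instead return only the `log-tailer` entry, because lists are replaced wholesale.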
@ -264,20 +270,24 @@ Strategic Merge Patch also supports special operations as listed below.
To override the container list to be strictly replaced, regardless of the default:
{% highlight yaml %}
containers:
  - name: nginx
    image: nginx-1.0
  - $patch: replace   # any further $patch operations nested in this list will be ignored
{% endhighlight %}
To delete an element of a list that should be merged:
{% highlight yaml %}
containers:
  - name: nginx
    image: nginx-1.0
  - $patch: delete
    name: log-tailer  # merge key and value goes here
{% endhighlight %}
### Map Operations
@ -285,19 +295,23 @@ containers:
To indicate that a map should not be merged and instead should be taken literally:
{% highlight yaml %}
$patch: replace # recursive and applies to all fields of the map it's in
containers:
  - name: nginx
    image: nginx-1.0
{% endhighlight %}
To delete a field of a map:
{% highlight yaml %}
name: nginx
image: nginx-1.0
labels:
  live: null  # set the value of the map key to null
{% endhighlight %}
@ -358,10 +372,12 @@ The only way for a client to know the expected value of resourceVersion is to ha
In the case of a conflict, the correct client action at this point is to GET the resource again, apply the changes afresh, and try submitting again. This mechanism can be used to prevent races like the following:
```
Client #1                Client #2
GET Foo                  GET Foo
Set Foo.Bar = "one"      Set Foo.Baz = "two"
PUT Foo                  PUT Foo
```
When these sequences occur in parallel, either the change to Foo.Bar or the change to Foo.Baz can be lost.
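The GET, modify, PUT, retry cycle can be sketched with an in-memory store. This is a simplification under stated assumptions: the `store` type, `errConflict`, and `updateWithRetry` are hypothetical, and `ResourceVersion` is modeled as an integer only to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the conflict the server reports when the
// submitted resourceVersion is stale.
var errConflict = errors.New("conflict: resourceVersion is stale")

// Foo carries a version plus the two fields from the race above.
type Foo struct {
	ResourceVersion int
	Bar, Baz        string
}

// store is an in-memory stand-in for the API server.
type store struct{ current Foo }

func (s *store) Get() Foo { return s.current }

// Put accepts the write only if it was based on the current version.
func (s *store) Put(f Foo) error {
	if f.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	f.ResourceVersion++
	s.current = f
	return nil
}

// updateWithRetry re-reads the resource and reapplies the change after a
// conflict, giving up after a few attempts.
func updateWithRetry(s *store, mutate func(*Foo)) error {
	for attempt := 0; attempt < 5; attempt++ {
		f := s.Get()
		mutate(&f)
		err := s.Put(f)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
	}
	return errConflict
}

func main() {
	s := &store{current: Foo{ResourceVersion: 1}}
	// Both clients read the same version before either writes.
	a, b := s.Get(), s.Get()
	a.Bar = "one"
	b.Baz = "two"
	fmt.Println(s.Put(a)) // client #1 writes first
	fmt.Println(s.Put(b)) // client #2 is rejected as stale
	// Client #2 re-reads, reapplies its change, and resubmits.
	err := updateWithRetry(s, func(f *Foo) { f.Baz = "two" })
	fmt.Println(err, s.Get().Bar, s.Get().Baz)
}
```

With the version check in place, the stale PUT is rejected instead of silently overwriting `Foo.Bar`, and the retry preserves both changes.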
@ -486,6 +502,7 @@ The status object is encoded as JSON and provided as the body of the response.
**Example:**
{% highlight console %}
$ curl -v -k -H "Authorization: Bearer WhCDvq4VPpYhrcfmF6ei7V9qlbqTubUc" https://10.240.122.184:443/api/v1/namespaces/default/pods/grafana
> GET /api/v1/namespaces/default/pods/grafana HTTP/1.1
@ -513,6 +530,7 @@ $ curl -v -k -H "Authorization: Bearer WhCDvq4VPpYhrcfmF6ei7V9qlbqTubUc" https:/
},
"code": 404
}
{% endhighlight %}
`status` field contains one of two possible values:

View File

@ -1,10 +1,6 @@
---
title: "Instrumenting Kubernetes with a new metric"
---
Instrumenting Kubernetes with a new metric
===================
The following is a step-by-step guide for adding a new metric to the Kubernetes code base.
We use the Prometheus monitoring system's golang client library for instrumenting our code. Once you've picked out a file that you want to add a metric to, you should:

View File

@ -1,10 +1,6 @@
---
title: "GitHub Issues for the Kubernetes Project"
---
GitHub Issues for the Kubernetes Project
========================================
A quick overview of how we will review and prioritize incoming issues at https://github.com/kubernetes/kubernetes/issues
Priorities

View File

@ -2,10 +2,6 @@
title: "Kubectl Conventions"
---
Kubectl Conventions
===================
Updated: 8/27/2015
{% include pagetoc.html %}

View File

@ -1,10 +1,6 @@
---
title: "Logging Conventions"
---
Logging Conventions
===================
The following are conventions for which glog levels to use. [glog](http://godoc.org/github.com/golang/glog) is globally preferred to [log](http://golang.org/pkg/log/) for better runtime control.
* glog.Errorf() - Always an error

View File

@ -1,14 +1,9 @@
---
title: "Pull Request Process"
---
Pull Request Process
====================
An overview of how we will manage old or out-of-date pull requests.
Process
-------
## Process
We will close any pull requests older than two weeks.
@ -19,8 +14,7 @@ We want to limit the total number of PRs in flight to:
* Remove old PRs that would be difficult to rebase as the underlying code has changed over time
* Encourage code velocity
Life of a Pull Request
----------------------
## Life of a Pull Request
Unless in the last few weeks of a milestone when we need to reduce churn and stabilize, we aim to be always accepting pull requests.
@ -33,8 +27,7 @@ There are several requirements for the submit queue to work:
Additionally, for infrequent or new contributors, we require the on call to apply the "ok-to-merge" label manually. This is gated by the [whitelist](https://github.com/kubernetes/contrib/tree/master/submit-queue/whitelist.txt).
Automation
----------
## Automation
We use a variety of automation to manage pull requests. This automation is described in detail
[elsewhere.](automation)

View File

@ -2,9 +2,7 @@
title: "Getting started on Microsoft Azure"
---
Getting started on Microsoft Azure
----------------------------------
## Getting started on Microsoft Azure
Check out the [coreos azure getting started guide](/{{page.version}}/docs/getting-started-guides/coreos/azure/README)

View File

@ -2,9 +2,6 @@
title: "Kubernetes on Azure with CoreOS and Weave"
---
Kubernetes on Azure with CoreOS and [Weave](http://weave.works)
---------------------------------------------------------------
{% include pagetoc.html %}
- [Introduction](#introduction)
@ -18,7 +15,10 @@ Kubernetes on Azure with CoreOS and [Weave](http://weave.works)
## Introduction
In this guide I will demonstrate how to deploy a Kubernetes cluster to Azure cloud. You will be using CoreOS with Weave, which implements simple and secure networking, in a transparent, yet robust way. The purpose of this guide is to provide an out-of-the-box implementation that can ultimately be taken into production with little change. It will demonstrate how to provision a dedicated Kubernetes master and etcd nodes, and show how to scale the cluster with ease.
In this guide I will demonstrate how to deploy a Kubernetes cluster to Azure cloud. You will be using CoreOS with [Weave](http://weave.works),
which implements simple and secure networking, in a transparent, yet robust way. The purpose of this guide is to provide an out-of-the-box
implementation that can ultimately be taken into production with little change. It will demonstrate how to provision a dedicated Kubernetes
master and etcd nodes, and show how to scale the cluster with ease.
### Prerequisites
@ -29,8 +29,10 @@ In this guide I will demonstrate how to deploy a Kubernetes cluster to Azure clo
To get started, you need to check out the code:
{% highlight sh %}
git clone https://github.com/kubernetes/kubernetes
cd kubernetes/docs/getting-started-guides/coreos/azure/
{% endhighlight %}
You will need to have [Node.js installed](http://nodejs.org/download/) on your machine. If you have previously used Azure CLI, you should have it already.
@ -38,34 +40,44 @@ You will need to have [Node.js installed](http://nodejs.org/download/) on you ma
First, you need to install some of the dependencies with
{% highlight sh %}
npm install
{% endhighlight %}
Now, all you need to do is:
{% highlight sh %}
./azure-login.js -u <your_username>
./create-kubernetes-cluster.js
{% endhighlight %}
This script will provision a cluster suitable for production use, with a ring of 3 dedicated etcd nodes, plus 1 Kubernetes master and 2 Kubernetes nodes. The `kube-00` VM will be the master; your workloads are only to be deployed on the nodes, `kube-01` and `kube-02`. Initially, all VMs are single-core, to ensure a user of the free tier can reproduce it without paying extra. I will show how to add more, bigger VMs later.
This script will provision a cluster suitable for production use, with a ring of 3 dedicated etcd nodes, plus 1 Kubernetes master and 2 Kubernetes nodes.
The `kube-00` VM will be the master; your workloads are only to be deployed on the nodes, `kube-01` and `kube-02`. Initially, all VMs are single-core, to
ensure a user of the free tier can reproduce it without paying extra. I will show how to add more, bigger VMs later.
![VMs in Azure](initial_cluster.png)
Once the creation of Azure VMs has finished, you should see the following:
{% highlight console %}
...
azure_wrapper/info: Saved SSH config, you can use it like so: `ssh -F ./output/kube_1c1496016083b4_ssh_conf <hostname>`
azure_wrapper/info: The hosts in this deployment are:
[ 'etcd-00', 'etcd-01', 'etcd-02', 'kube-00', 'kube-01', 'kube-02' ]
azure_wrapper/info: Saved state into `./output/kube_1c1496016083b4_deployment.yml`
{% endhighlight %}
Let's login to the master node like so:
{% highlight sh %}
ssh -F ./output/kube_1c1496016083b4_ssh_conf kube-00
{% endhighlight %}
> Note: config file name will be different, make sure to use the one you see.
@ -73,10 +85,12 @@ ssh -F ./output/kube_1c1496016083b4_ssh_conf kube-00
Check there are 2 nodes in the cluster:
{% highlight console %}
core@kube-00 ~ $ kubectl get nodes
NAME LABELS STATUS
kube-01 kubernetes.io/hostname=kube-01 Ready
kube-02 kubernetes.io/hostname=kube-02 Ready
{% endhighlight %}
## Deploying the workload
@ -84,13 +98,17 @@ kube-02 kubernetes.io/hostname=kube-02 Ready
Let's follow the Guestbook example now:
{% highlight sh %}
kubectl create -f ~/guestbook-example
{% endhighlight %}
You need to wait for the pods to get deployed, run the following and wait for `STATUS` to change from `Pending` to `Running`.
{% highlight sh %}
kubectl get pods --watch
{% endhighlight %}
> Note: most of the time will be spent downloading Docker container images on each of the nodes.
@ -98,6 +116,7 @@ kubectl get pods --watch
Eventually you should see:
{% highlight console %}
NAME READY STATUS RESTARTS AGE
frontend-0a9xi 1/1 Running 0 4m
frontend-4wahe 1/1 Running 0 4m
@ -105,6 +124,7 @@ frontend-6l36j 1/1 Running 0 4m
redis-master-talmr 1/1 Running 0 4m
redis-slave-12zfd 1/1 Running 0 4m
redis-slave-3nbce 1/1 Running 0 4m
{% endhighlight %}
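The "wait until every pod is `Running`" check can also be scripted. The following is a minimal sketch, assuming the column layout shown above (STATUS in the third column); `all_running` is a hypothetical helper name and the sample text is made up for illustration.

```shell
# all_running: read `kubectl get pods`-style output on stdin and succeed
# only if every row after the header has STATUS (third column) "Running".
all_running() {
  awk 'NR > 1 && $3 != "Running" { bad = 1 } END { exit bad }'
}

# Hypothetical sample output; in practice this would come from kubectl.
sample='NAME                 READY     STATUS    RESTARTS   AGE
frontend-0a9xi       1/1       Running   0          4m
redis-master-talmr   1/1       Pending   0          4m'

if printf '%s\n' "$sample" | all_running; then
  echo "all pods are Running"
else
  echo "still waiting for some pods"
fi
```

On a live cluster, a loop such as `until kubectl get pods | all_running; do sleep 5; done` would be one way to block until the deployment settles. Note that the helper also succeeds when the output contains only the header row.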
## Scaling
@ -116,12 +136,15 @@ You will need to open another terminal window on your machine and go to the same
First, let's set the size of new VMs:
{% highlight sh %}
export AZ_VM_SIZE=Large
{% endhighlight %}
Now, run the scale script with the state file of the previous deployment and the number of nodes to add:
{% highlight console %}
core@kube-00 ~ $ ./scale-kubernetes-cluster.js ./output/kube_1c1496016083b4_deployment.yml 2
...
azure_wrapper/info: Saved SSH config, you can use it like so: `ssh -F ./output/kube_8f984af944f572_ssh_conf <hostname>`
@ -135,6 +158,7 @@ azure_wrapper/info: The hosts in this deployment are:
'kube-03',
'kube-04' ]
azure_wrapper/info: Saved state into `./output/kube_8f984af944f572_deployment.yml`
{% endhighlight %}
> Note: this step has created new files in `./output`.
@ -142,12 +166,14 @@ azure_wrapper/info: Saved state into `./output/kube_8f984af944f572_deployment.ym
Back on `kube-00`:
{% highlight console %}
core@kube-00 ~ $ kubectl get nodes
NAME LABELS STATUS
kube-01 kubernetes.io/hostname=kube-01 Ready
kube-02 kubernetes.io/hostname=kube-02 Ready
kube-03 kubernetes.io/hostname=kube-03 Ready
kube-04 kubernetes.io/hostname=kube-04 Ready
{% endhighlight %}
You can see that two more nodes joined happily. Let's scale the number of Guestbook instances now.
@ -155,42 +181,50 @@ You can see that two more nodes joined happily. Let's scale the number of Guestb
First, double-check how many replication controllers there are:
{% highlight console %}
core@kube-00 ~ $ kubectl get rc
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
frontend php-redis kubernetes/example-guestbook-php-redis:v2 name=frontend 3
redis-master master redis name=redis-master 1
redis-slave worker kubernetes/redis-slave:v2 name=redis-slave 2
{% endhighlight %}
As there are 4 nodes, let's scale proportionally:
{% highlight console %}
core@kube-00 ~ $ kubectl scale --replicas=4 rc redis-slave
scaled
core@kube-00 ~ $ kubectl scale --replicas=4 rc frontend
scaled
{% endhighlight %}
Check what you have now:
{% highlight console %}
core@kube-00 ~ $ kubectl get rc
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS
frontend php-redis kubernetes/example-guestbook-php-redis:v2 name=frontend 4
redis-master master redis name=redis-master 1
redis-slave worker kubernetes/redis-slave:v2 name=redis-slave 4
{% endhighlight %}
You will now have more instances of front-end Guestbook apps and Redis slaves; and, if you look up all pods labeled `name=frontend`, you should see one running on each node.
{% highlight console %}
core@kube-00 ~/guestbook-example $ kubectl get pods -l name=frontend
NAME READY STATUS RESTARTS AGE
frontend-0a9xi 1/1 Running 0 22m
frontend-4wahe 1/1 Running 0 22m
frontend-6l36j 1/1 Running 0 22m
frontend-z9oxo 1/1 Running 0 41s
{% endhighlight %}
## Exposing the app to the outside world
@ -198,6 +232,7 @@ frontend-z9oxo 1/1 Running 0 41s
There is no native Azure load-balancer support in Kubernetes 1.0; however, here is how you can expose the Guestbook app to the Internet.
```
./expose_guestbook_app_port.sh ./output/kube_1c1496016083b4_ssh_conf
Guestbook app is on port 31605, will map it to port 80 on kube-00
info: Executing command vm endpoint create
@ -213,6 +248,7 @@ data: Protcol : tcp
data: Virtual IP Address : 137.117.156.164
data: Direct server return : Disabled
info: vm endpoint show command OK
```
You then should be able to access it from anywhere via the Azure virtual IP for `kube-00` displayed above, i.e. `http://137.117.156.164/` in my case.
@ -228,7 +264,9 @@ You should probably try deploy other [example apps](../../../../examples/) or wr
If you don't want to worry about the Azure bill, you can tear down the cluster. It's easy to redeploy it, as you can see.
{% highlight sh %}
./destroy-cluster.js ./output/kube_8f984af944f572_deployment.yml
{% endhighlight %}
> Note: make sure to use the _latest state file_, as after scaling there is a new one.

View File

@ -1,10 +1,6 @@
---
title: "Bare Metal CoreOS with Kubernetes and Project Calico"
---
Bare Metal CoreOS with Kubernetes and Project Calico
------------------------------------------
This guide explains how to deploy a bare-metal Kubernetes cluster on CoreOS using [Calico networking](http://www.projectcalico.org).
Specifically, this guide will have you do the following:
@ -35,9 +31,11 @@ In the next few steps you will be asked to configure these files and host them o
To get the Kubernetes source, clone the GitHub repo, and build the binaries.
```
git clone https://github.com/kubernetes/kubernetes.git
cd kubernetes
./build/release.sh
```
Once the binaries are built, host the entire `<kubernetes>/_output/dockerized/bin/<OS>/<ARCHITECTURE>/` folder on an accessible HTTP server so they can be accessed by the cloud-config. You'll point your cloud-config files at this HTTP server later.
@ -47,7 +45,9 @@ Once the binaries are built, host the entire `<kubernetes>/_output/dockerized/bi
Let's download the CoreOS bootable ISO. We'll use this image to boot and install CoreOS on each server.
```
wget http://stable.release.core-os.net/amd64-usr/current/coreos_production_iso_image.iso
```
You can also download the ISO from the [CoreOS website](https://coreos.com/docs/running-coreos/platforms/iso/).
@ -59,8 +59,10 @@ Once you've downloaded the image, use it to boot your Kubernetes Master server.
Let's get the master-config.yaml and fill in the necessary variables. Run the following commands on your HTTP server to get the cloud-config files.
```
git clone https://github.com/Metaswitch/calico-kubernetes-demo.git
cd calico-kubernetes-demo/coreos
```
You'll need to replace the following variables in the `master-config.yaml` file to match your deployment.
@ -74,7 +76,9 @@ Host the modified `master-config.yaml` file and pull it on to your Kubernetes Ma
The CoreOS bootable ISO comes with a tool called `coreos-install` which will allow us to install CoreOS to disk and configure the install using cloud-config. The following command will download and install stable CoreOS, using the master-config.yaml file for configuration.
```
sudo coreos-install -d /dev/sda -C stable -c master-config.yaml
```
Once complete, eject the bootable ISO and restart the server. When it comes back up, you should have SSH access as the `core` user using the public key provided in the master-config.yaml file.
@ -99,19 +103,25 @@ You'll need to replace the following variables in the `node-config.yaml` file to
Host the modified `node-config.yaml` file and pull it on to your Kubernetes node.
```
wget http://<http_server_ip>/node-config.yaml
```
Install and configure CoreOS on the node using the following command.
```
sudo coreos-install -d /dev/sda -C stable -c node-config.yaml
```
Once complete, restart the server. When it comes back up, you should have SSH access as the `core` user using the public key provided in the `node-config.yaml` file. It will take some time for the node to be fully configured. Once fully configured, you can check that the node is running with the following command on the Kubernetes master.
```
/home/core/kubectl get nodes
```
## Testing the Cluster

View File

@ -1,7 +1,8 @@
---
title: "Bare Metal CoreOS with Kubernetes (OFFLINE)"
---
Deploy a CoreOS running Kubernetes environment. This particular guild is made to help those in an OFFLINE system, wither for testing a POC before the real deal, or you are restricted to be totally offline for your applications.
Deploy a CoreOS running Kubernetes environment. This particular guide is made to help those in an OFFLINE system,
whether for testing a POC before the real deal, or because you are restricted to being totally offline for your applications.
{% include pagetoc.html %}

View File

@ -9,7 +9,8 @@ You need two or more Fedora 22 droplets on Digital Ocean with [Private Networkin
## Overview
This guide will walk you through the process of getting a Kubernetes Fedora cluster running on Digital Ocean with networking powered by Calico. It will cover the installation and configuration of the following systemd processes on the following hosts:
This guide will walk you through the process of getting a Kubernetes Fedora cluster running on Digital Ocean with networking powered by Calico.
It will cover the installation and configuration of the following systemd processes on the following hosts:
Kubernetes Master:
- `kube-apiserver`
@ -32,13 +33,16 @@ For this demo, we will be setting up one Master and one Node with the following
| kube-master |10.134.251.56|
| kube-node-1 |10.134.251.55|
This guide is scalable to multiple nodes provided you [configure interface-cbr0 with its own subnet on each Node](#configure-the-virtual-interface---cbr0) and [add an entry to /etc/hosts for each host](#setup-communication-between-hosts).
This guide is scalable to multiple nodes provided you [configure interface-cbr0 with its own subnet on each Node](#configure-the-virtual-interface---cbr0)
and [add an entry to /etc/hosts for each host](#setup-communication-between-hosts).
Ensure you substitute the IP Addresses and Hostnames used in this guide with ones in your own setup.
## Setup Communication Between Hosts
Digital Ocean private networking configures a private network on eth1 for each host. To simplify communication between the hosts, we will add an entry to /etc/hosts so that all hosts in the cluster can hostname-resolve one another to this interface. **It is important that the hostname resolves to this interface instead of eth0, as all Kubernetes and Calico services will be running on it.**
Digital Ocean private networking configures a private network on eth1 for each host. To simplify communication between the hosts, we will add an entry to /etc/hosts
so that all hosts in the cluster can hostname-resolve one another to this interface. **It is important that the hostname resolves to this interface instead of eth0, as
all Kubernetes and Calico services will be running on it.**
```
@ -177,7 +181,8 @@ systemctl start calico-node.service
### Configure the Virtual Interface - cbr0
By default, docker will create and run on a virtual interface called `docker0`. This interface is automatically assigned the address range 172.17.42.1/16. In order to set our own address range, we will create a new virtual interface called `cbr0` and then start docker on it.
By default, docker will create and run on a virtual interface called `docker0`. This interface is automatically assigned the address range 172.17.42.1/16.
In order to set our own address range, we will create a new virtual interface called `cbr0` and then start docker on it.
* Add a virtual interface by creating `/etc/sysconfig/network-scripts/ifcfg-cbr0`:
@ -192,7 +197,8 @@ BOOTPROTO=static
```
>**Note for Multi-Node Clusters:** Each node should be assigned an IP address on a unique subnet. In this example, node-1 is using 192.168.1.1/24, so node-2 should be assigned another pool on the 192.168.x.0/24 subnet, e.g. 192.168.2.1/24.
>**Note for Multi-Node Clusters:** Each node should be assigned an IP address on a unique subnet. In this example, node-1 is using 192.168.1.1/24,
so node-2 should be assigned another pool on the 192.168.x.0/24 subnet, e.g. 192.168.2.1/24.
* Ensure that your system has bridge-utils installed. Then, restart the networking daemon to activate the new interface
@ -274,7 +280,9 @@ systemctl start calico-node.service
* Configure the IP Address Pool
Most Kubernetes application deployments will require communication between Pods and the kube-apiserver on Master. On a standard Digital Ocean Private Network, requests sent from Pods to the kube-apiserver will not be returned as the networking fabric will drop response packets destined for any 192.168.0.0/16 address. To resolve this, you can have calicoctl add a masquerade rule to all outgoing traffic on the node:
Most Kubernetes application deployments will require communication between Pods and the kube-apiserver on Master. On a standard Digital
Ocean Private Network, requests sent from Pods to the kube-apiserver will not be returned as the networking fabric will drop response packets
destined for any 192.168.0.0/16 address. To resolve this, you can have calicoctl add a masquerade rule to all outgoing traffic on the node:
```
@ -303,7 +311,8 @@ KUBE_MASTER="--master=http://kube-master:8080"
* Edit `/etc/kubernetes/kubelet`
We'll pass in an extra parameter - `--network-plugin=calico` to tell the Kubelet to use the Calico networking plugin. Additionally, we'll add two environment variables that will be used by the Calico networking plugin.
We'll pass in an extra parameter - `--network-plugin=calico` to tell the Kubelet to use the Calico networking plugin. Additionally, we'll add two
environment variables that will be used by the Calico networking plugin.
```

View File

@ -1,62 +1,75 @@
---
title: "Getting started on Fedora"
---
Getting started on [Fedora](http://fedoraproject.org)
-----------------------------------------------------
{% include pagetoc.html %}
- [Prerequisites](#prerequisites)
- [Instructions](#instructions)
## Prerequisites
1. You need 2 or more machines with Fedora installed.
## Instructions
This is a getting started guide for Fedora. It is a manual configuration so you understand all the underlying packages / services / ports, etc...
This is a getting started guide for [Fedora](http://fedoraproject.org). It is a manual configuration so you understand all the underlying packages / services / ports, etc...
This guide will only get ONE node (previously minion) working. Multiple nodes require a functional [networking configuration](../../admin/networking) done outside of Kubernetes, although the additional Kubernetes configuration requirements should be obvious.
This guide will only get ONE node (previously minion) working. Multiple nodes require a functional [networking configuration](../../admin/networking)
done outside of Kubernetes, although the additional Kubernetes configuration requirements should be obvious.
The Kubernetes package provides a few services: kube-apiserver, kube-scheduler, kube-controller-manager, kubelet, kube-proxy. These services are managed by systemd and the configuration resides in a central location: /etc/kubernetes. We will break the services up between the hosts. The first host, fed-master, will be the Kubernetes master. This host will run the kube-apiserver, kube-controller-manager, and kube-scheduler. In addition, the master will also run _etcd_ (not needed if _etcd_ runs on a different host but this guide assumes that _etcd_ and Kubernetes master run on the same host). The remaining host, fed-node will be the node and run kubelet, proxy and docker.
The Kubernetes package provides a few services: kube-apiserver, kube-scheduler, kube-controller-manager, kubelet, kube-proxy. These
services are managed by systemd and the configuration resides in a central location: /etc/kubernetes. We will break the services up
between the hosts. The first host, fed-master, will be the Kubernetes master. This host will run the kube-apiserver, kube-controller-manager,
and kube-scheduler. In addition, the master will also run _etcd_ (not needed if _etcd_ runs on a different host but this guide assumes
that _etcd_ and Kubernetes master run on the same host). The remaining host, fed-node will be the node and run kubelet, proxy and docker.
**System Information:**
Hosts:
```
fed-master = 192.168.121.9
fed-node = 192.168.121.65
```
**Prepare the hosts:**
* Install Kubernetes on all hosts - fed-{master,node}. This will also pull in docker. Also install etcd on fed-master. This guide has been tested with kubernetes-0.18 and beyond.
* The [--enablerepo=updates-testing](https://fedoraproject.org/wiki/QA:Updates_Testing) directive in the yum command below will ensure that the most recent Kubernetes version that is scheduled for pre-release will be installed. This should be a more recent version than the Fedora "stable" release for Kubernetes that you would get without adding the directive.
* If you want the very latest Kubernetes release [you can download and yum install the RPM directly from Fedora Koji](http://koji.fedoraproject.org/koji/packageinfo?packageID=19202) instead of using the yum install command below.
* Install Kubernetes on all hosts - fed-{master,node}. This will also pull in docker. Also install etcd on fed-master.
This guide has been tested with kubernetes-0.18 and beyond.
* The [--enablerepo=updates-testing](https://fedoraproject.org/wiki/QA:Updates_Testing) directive in the yum
command below will ensure that the most recent Kubernetes version that is scheduled for pre-release will
be installed. This should be a more recent version than the Fedora "stable" release for Kubernetes that you
would get without adding the directive.
* If you want the very latest Kubernetes release [you can download and yum install the RPM directly from
Fedora Koji](http://koji.fedoraproject.org/koji/packageinfo?packageID=19202) instead of using the yum
install command below.
{% highlight sh %}
yum -y install --enablerepo=updates-testing kubernetes
{% endhighlight %}
* Install etcd and iptables
{% highlight sh %}
yum -y install etcd iptables
{% endhighlight %}
* Add master and node to /etc/hosts on all machines (not needed if hostnames already in DNS). Make sure that communication works between fed-master and fed-node by using a utility such as ping.
{% highlight sh %}
echo "192.168.121.9 fed-master
192.168.121.65 fed-node" >> /etc/hosts
{% endhighlight %}
* Edit /etc/kubernetes/config which will be the same on all hosts (master and node) to contain:
{% highlight sh %}
# Comma separated list of nodes in the etcd cluster
KUBE_MASTER="--master=http://fed-master:8080"
@ -68,20 +81,25 @@ KUBE_LOG_LEVEL="--v=0"
# Should this cluster be allowed to run privileged docker containers
KUBE_ALLOW_PRIV="--allow-privileged=false"
{% endhighlight %}
* Disable the firewall on both the master and node, as docker does not play well with other firewall rule managers. Please note that iptables-services does not exist on a default Fedora server install.
{% highlight sh %}
systemctl disable iptables-services firewalld
systemctl stop iptables-services firewalld
{% endhighlight %}
**Configure the Kubernetes services on the master.**
* Edit /etc/kubernetes/apiserver to appear as such. The service-cluster-ip-range IP addresses must be an unused block of addresses, not used anywhere else. They do not need to be routed or assigned to anything.
* Edit /etc/kubernetes/apiserver to appear as such. The service-cluster-ip-range IP addresses must be an unused block of addresses, not used anywhere else.
They do not need to be routed or assigned to anything.
{% highlight sh %}
# The address on the local server to listen to.
KUBE_API_ADDRESS="--address=0.0.0.0"
@ -93,30 +111,37 @@ KUBE_SERVICE_ADDRESSES="--service-cluster-ip-range=10.254.0.0/16"
# Add your own!
KUBE_API_ARGS=""
{% endhighlight %}
* Edit /etc/etcd/etcd.conf so that etcd listens on all IP addresses instead of only 127.0.0.1; otherwise, you will get an error like "connection refused". Note that Fedora 22 uses etcd 2.0. One of the changes in etcd 2.0 is that it now uses ports 2379 and 2380 (as opposed to etcd 0.4.6, which used 4001 and 7001).
{% highlight sh %}
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:4001"
{% endhighlight %}
* Create /var/run/kubernetes on master:
{% highlight sh %}
mkdir /var/run/kubernetes
chown kube:kube /var/run/kubernetes
chmod 750 /var/run/kubernetes
{% endhighlight %}
* Start the appropriate services on master:
{% highlight sh %}
for SERVICES in etcd kube-apiserver kube-controller-manager kube-scheduler; do
systemctl restart $SERVICES
systemctl enable $SERVICES
systemctl status $SERVICES
done
{% endhighlight %}
* Addition of nodes:
@ -124,6 +149,7 @@ done
* Create the following node.json file on the Kubernetes master node:
{% highlight json %}
{
"apiVersion": "v1",
"kind": "Node",
@ -135,16 +161,19 @@ done
"externalID": "fed-node"
}
}
{% endhighlight %}
Now create a node object internally in your Kubernetes cluster by running:
{% highlight console %}
$ kubectl create -f ./node.json
$ kubectl get nodes
NAME LABELS STATUS
fed-node name=fed-node-label Unknown
{% endhighlight %}
Please note that in the above, it only creates a representation for the node
@ -160,6 +189,7 @@ a Kubernetes node (fed-node) below.
* Edit /etc/kubernetes/kubelet to appear as such:
{% highlight sh %}
###
# Kubernetes kubelet (node) config
@ -174,24 +204,29 @@ KUBELET_API_SERVER="--api-servers=http://fed-master:8080"
# Add your own!
#KUBELET_ARGS=""
{% endhighlight %}
* Start the appropriate services on the node (fed-node).
{% highlight sh %}
for SERVICES in kube-proxy kubelet docker; do
systemctl restart $SERVICES
systemctl enable $SERVICES
systemctl status $SERVICES
done
{% endhighlight %}
* Check that the cluster can now see fed-node from fed-master, and that its status changes to _Ready_.
{% highlight console %}
kubectl get nodes
NAME LABELS STATUS
fed-node name=fed-node-label Ready
{% endhighlight %}
* Deletion of nodes:
@ -199,7 +234,9 @@ fed-node name=fed-node-label Ready
To delete _fed-node_ from your Kubernetes cluster, one should run the following on fed-master (Please do not do it, it is just for information):
{% highlight sh %}
kubectl delete -f ./node.json
{% endhighlight %}
*You should be finished!*

View File

@ -1,6 +1,8 @@
---
title: "Containers with Kubernetes"
---
{% include pagetoc.html %}
## Containers and commands
So far the Pods we've seen have all used the `image` field to indicate what process Kubernetes

View File

@ -70,8 +70,3 @@ drwxrwxrwt 3 0 0 180 Aug 24 13:03 ..
{% endhighlight %}
The file `labels` is stored in a temporary directory (`..2015_08_24_13_03_44259413923` in the example above) which is symlinked to by `..downwardapi`. Symlinks for annotations and labels in `/etc` point to files containing the actual metadata through the `..downwardapi` indirection.  This structure allows for dynamic atomic refresh of the metadata: updates are written to a new temporary directory, and the `..downwardapi` symlink is updated atomically using `rename(2)`.
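The write-then-swap pattern described above can be illustrated with a small shell sketch. It is not the kubelet's implementation: the `publish` helper, the `..snapshot_N` directory names, and the temporary directory standing in for the volume are all assumptions made for illustration, and `mv -T` requires GNU coreutils.

```shell
vol=$(mktemp -d)   # hypothetical stand-in for the volume directory
n=0

# publish LABELS: write LABELS into a fresh directory, then atomically
# repoint the ..downwardapi symlink at it.
publish() {
  n=$((n + 1))
  mkdir "$vol/..snapshot_$n"
  printf '%s\n' "$1" > "$vol/..snapshot_$n/labels"
  ln -s "..snapshot_$n" "$vol/..downwardapi_tmp"
  # mv -T (GNU coreutils) renames the new symlink over the old one, so a
  # reader sees either the old target or the new one, never a missing link.
  mv -T "$vol/..downwardapi_tmp" "$vol/..downwardapi"
}

publish 'zone="us-east"'
ln -s ..downwardapi/labels "$vol/labels"   # stable user-facing path
publish 'zone="us-west"'

cat "$vol/labels"
```

The user-facing `labels` symlink is never touched after it is created; only the `..downwardapi` link moves, which is why every file behind it appears to update at once.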

View File

@ -70,8 +70,3 @@ drwxrwxrwt 3 0 0 180 Aug 24 13:03 ..
{% endhighlight %}
The file `labels` is stored in a temporary directory (`..2015_08_24_13_03_44259413923` in the example above) which is symlinked to by `..downwardapi`. Symlinks for annotations and labels in `/etc` point to files containing the actual metadata through the `..downwardapi` indirection.  This structure allows for dynamic atomic refresh of the metadata: updates are written to a new temporary directory, and the `..downwardapi` symlink is updated atomically using `rename(2)`.

View File

@ -1,9 +1,6 @@
---
title: "Environment Guide Example"
---
Environment Guide Example
=========================
This example demonstrates running pods, replication controllers, and
services. It shows two types of pods: frontend and backend, with
services on top of both. Accessing the frontend pod will return
@ -15,29 +12,28 @@ is [here](/{{page.version}}/docs/user-guide/container-environment).
![Diagram](diagram.png)
Prerequisites
-------------
## Prerequisites
This example assumes that you have a Kubernetes cluster installed and
running, and that you have installed the `kubectl` command line tool
somewhere in your path. Please see the [getting
started](/{{page.version}}/docs/getting-started-guides/) for installation instructions
for your platform.
Optional: Build your own containers
-----------------------------------
### Optional: Build your own containers
The code for the containers is under
[containers/](containers/)
Get everything running
----------------------
## Get everything running
kubectl create -f ./backend-rc.yaml
kubectl create -f ./backend-srv.yaml
kubectl create -f ./show-rc.yaml
kubectl create -f ./show-srv.yaml
Query the service
-----------------
## Query the service
Use `kubectl describe service show-srv` to determine the public IP of
your service.
@ -49,6 +45,7 @@ Run `curl <public ip>:80` to query the service. You should get
something like this back:
```
Pod Name: show-rc-xxu6i
Pod Namespace: default
USER_VAR: important information
@ -68,6 +65,7 @@ Response from backend
Backend Container
Backend Pod Name: backend-rc-6qiya
Backend Namespace: default
```
First the frontend pod's information is printed. The pod name and
@ -87,10 +85,7 @@ frontend pods are always contacting the backend through the backend
service. This results in a different backend pod servicing each
request as well.
Cleanup
-------
## Cleanup
kubectl delete rc,service -l type=show-type
kubectl delete rc,service -l type=backend-type

View File

@ -1,25 +1,22 @@
---
title: "Building"
---
Building
--------
For each container, the build steps are the same. The examples below
are for the `show` container. Replace `show` with `backend` for the
backend container.
Google Container Registry ([GCR](https://cloud.google.com/tools/container-registry/))
---
## Google Container Registry ([GCR](https://cloud.google.com/tools/container-registry/))
docker build -t gcr.io/<project-name>/show .
gcloud docker push gcr.io/<project-name>/show
Docker Hub
----------
## Docker Hub
docker build -t <username>/show .
docker push <username>/show
Change Pod Definitions
----------------------
## Change Pod Definitions
Edit both `show-rc.yaml` and `backend-rc.yaml` and replace the
specified `image:` with the one that you built.
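If you prefer to script this edit, the sketch below rewrites the `image:` value in a manifest's text. It is an illustration under stated assumptions: `replace_image` is a hypothetical helper, the manifest excerpt and image names are invented, and the real `show-rc.yaml` may be laid out differently.

```python
import re


def replace_image(manifest_text: str, new_image: str) -> str:
    """Return manifest_text with every `image:` value replaced by new_image."""
    # Match `image:` at any indentation, optionally written as a list item
    pattern = re.compile(r"^([ \t]*(?:-[ \t]+)?image:[ \t]*).*$", flags=re.MULTILINE)
    # A function replacement avoids escape handling in new_image
    return pattern.sub(lambda m: m.group(1) + new_image, manifest_text)


# Hypothetical excerpt of a replication controller manifest.
sample = """\
spec:
  containers:
  - name: show
    image: gcr.io/example-project/show:original
    ports:
    - containerPort: 8080
"""

updated = replace_image(sample, "gcr.io/my-project/show")
print(updated)
```

A text substitution keeps the rest of the file byte-for-byte unchanged, including comments and indentation. It replaces every `image:` line it finds, so check the result if a manifest defines more than one container.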

View File

@ -2,24 +2,24 @@
title: "Building"
---
Building
--------
## Building
For each container, the build steps are the same. The examples below
are for the `show` container. Replace `show` with `backend` for the
backend container.
Google Container Registry ([GCR](https://cloud.google.com/tools/container-registry/))
---
## Google Container Registry ([GCR](https://cloud.google.com/tools/container-registry/))
docker build -t gcr.io/<project-name>/show .
gcloud docker push gcr.io/<project-name>/show
Docker Hub
----------
## Docker Hub
docker build -t <username>/show .
docker push <username>/show
Change Pod Definitions
----------------------
## Change Pod Definitions
Edit both `show-rc.yaml` and `backend-rc.yaml` and replace the
specified `image:` with the one that you built.

View File

@ -1,9 +1,6 @@
---
title: "Environment Guide Example"
---
Environment Guide Example
=========================
This example demonstrates running pods, replication controllers, and
services. It shows two types of pods: frontend and backend, with
services on top of both. Accessing the frontend pod will return
@ -15,29 +12,28 @@ is [here](/{{page.version}}/docs/user-guide/container-environment).
![Diagram](diagram.png)
Prerequisites
-------------
## Prerequisites
This example assumes that you have a Kubernetes cluster installed and
running, and that you have installed the `kubectl` command line tool
somewhere in your path. Please see the [getting started
guides](/{{page.version}}/docs/getting-started-guides/) for installation instructions
for your platform.
Optional: Build your own containers
-----------------------------------
## Optional: Build your own containers
The code for the containers is under
[containers/](containers/)
Get everything running
----------------------
## Get everything running
kubectl create -f ./backend-rc.yaml
kubectl create -f ./backend-srv.yaml
kubectl create -f ./show-rc.yaml
kubectl create -f ./show-srv.yaml
Query the service
-----------------
## Query the service
Use `kubectl describe service show-srv` to determine the public IP of
your service.
@ -49,6 +45,7 @@ Run `curl <public ip>:80` to query the service. You should get
something like this back:
```
Pod Name: show-rc-xxu6i
Pod Namespace: default
USER_VAR: important information
@ -68,6 +65,7 @@ Response from backend
Backend Container
Backend Pod Name: backend-rc-6qiya
Backend Namespace: default
```
First the frontend pod's information is printed. The pod name and
@ -87,8 +85,8 @@ frontend pods are always contacting the backend through the backend
service. This results in a different backend pod servicing each
request as well.
Cleanup
-------
## Cleanup
kubectl delete rc,service -l type=show-type
kubectl delete rc,service -l type=backend-type

View File

@ -289,6 +289,3 @@ You can expose a Service in multiple ways that don't directly involve the Ingres
* Use [Service.Type=NodePort](https://github.com/kubernetes/kubernetes/blob/release-1.0/docs/user-guide/services.md#type-nodeport)
* Use a [Port Proxy](https://github.com/kubernetes/contrib/tree/master/for-demos/proxy-to-service)
* Deploy the [Service loadbalancer](https://github.com/kubernetes/contrib/tree/master/service-loadbalancer). This allows you to share a single IP among multiple Services and achieve more advanced loadbalancing through Service Annotations.

View File

@ -64,8 +64,3 @@ text | the plain text | kind is {.kind} | kind is List
[,] | union operator | {.items[*]['metadata.name', 'status.capacity']} | 127.0.0.1 127.0.0.2 map[cpu:4] map[cpu:8]
?() | filter | {.users[?(@.name=="e2e")].user.password} | secret
range, end | iterate list | {range .items[*]}[{.metadata.name}, {.status.capacity}] {end} | [127.0.0.1, map[cpu:4]] [127.0.0.2, map[cpu:8]]
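
To make the filter, `range`, and union rows more concrete, the sketch below computes the same selections over plain Python data. It is not how `kubectl` evaluates JSONPath; the sample data is hypothetical and merely shaped to match the results shown in the table.

```python
# Hypothetical sample data, shaped to match the results column of the table.
nodes = {
    "kind": "List",
    "items": [
        {"metadata": {"name": "127.0.0.1"}, "status": {"capacity": {"cpu": "4"}}},
        {"metadata": {"name": "127.0.0.2"}, "status": {"capacity": {"cpu": "8"}}},
    ],
}
users = [
    {"name": "myself", "user": {}},
    {"name": "e2e", "user": {"username": "admin", "password": "secret"}},
]

# {.users[?(@.name=="e2e")].user.password}: keep only entries passing the filter
passwords = [u["user"]["password"] for u in users if u["name"] == "e2e"]

# {range .items[*]}[{.metadata.name}, {.status.capacity}] {end}: one pair per item
pairs = [
    (item["metadata"]["name"], item["status"]["capacity"])
    for item in nodes["items"]
]

# {.items[*]['metadata.name', 'status.capacity']}: all names, then all capacities
union = [item["metadata"]["name"] for item in nodes["items"]] + [
    item["status"]["capacity"] for item in nodes["items"]
]

print(passwords)
print(pairs)
print(union)
```

The contrast between `pairs` and `union` mirrors the table: `range` emits the fields item by item, while the union operator emits every value of the first field before any value of the second.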

View File

@ -271,6 +271,3 @@ Use the following set of examples to help you familiarize yourself with running
## Next steps
Start using the [kubectl](kubectl/kubectl) commands.

View File

@ -1,10 +1,6 @@
---
title: "Resource Quota"
---
Resource Quota
========================================
This page has moved to [Resource Quota](../../admin/resourcequota/README).

View File

@ -1,10 +1,6 @@
---
title: "Resource Quota"
---
Resource Quota
========================================
This page has moved to [Resource Quota](../../admin/resourcequota/README).