Separated user, dev, and design docs.
Renamed: logging.md -> devel/logging.md
Renamed: access.md -> design/access.md
Renamed: identifiers.md -> design/identifiers.md
Renamed: labels.md -> design/labels.md
Renamed: namespaces.md -> design/namespaces.md
Renamed: security.md -> design/security.md
Renamed: networking.md -> design/networking.md
Added abbreviated, user-focused documents in place of most moved docs.
Added docs/README.md explaining how the docs are organized.
Added short, user-oriented documentation on labels.
Added a glossary.
Fixed up some links.
@ -0,0 +1,248 @@
|
|||
# K8s Identity and Access Management Sketch
|
||||
|
||||
This document suggests a direction for identity and access management in the Kubernetes system.
|
||||
|
||||
|
||||
## Background
|
||||
|
||||
High level goals are:
|
||||
- Have a plan for how identity, authentication, and authorization will fit in to the API.
|
||||
- Have a plan for partitioning resources within a cluster between independent organizational units.
|
||||
- Ease integration with existing enterprise and hosted scenarios.
|
||||
|
||||
### Actors
|
||||
Each of these can act as normal users or attackers.
|
||||
- External Users: People who are accessing applications running on K8s (e.g. a web site served by a webserver running in a container on K8s), but who do not have K8s API access.
|
||||
- K8s Users: People who access the K8s API (e.g. create K8s API objects like Pods).
|
||||
- K8s Project Admins: People who manage access for some K8s Users
|
||||
- K8s Cluster Admins: People who control the machines, networks, or binaries that comprise a K8s cluster.
|
||||
- K8s Admin means K8s Cluster Admins and K8s Project Admins taken together.
|
||||
|
||||
### Threats
|
||||
Both intentional attacks and accidental use of privilege are concerns.
|
||||
|
||||
For both cases it may be useful to think about these categories differently:
|
||||
- Application Path - attack by sending network messages from the internet to the IP/port of any application running on K8s. May exploit weakness in application or misconfiguration of K8s.
|
||||
- K8s API Path - attack by sending network messages to any K8s API endpoint.
|
||||
- Insider Path - attack on K8s system components. Attacker may have privileged access to networks, machines or K8s software and data. Software errors in K8s system components and administrator error are some types of threat in this category.
|
||||
|
||||
This document is primarily concerned with the K8s API path, and secondarily with the Insider path. The Application path also needs to be secure, but is not the focus of this document.
|
||||
|
||||
### Assets to protect
|
||||
|
||||
External User assets:
|
||||
- Personal information like private messages, or images uploaded by External Users
|
||||
- web server logs
|
||||
|
||||
K8s User assets:
|
||||
- External User assets of each K8s User
|
||||
- things private to the K8s app, like:
|
||||
- credentials for accessing other services (docker private repos, storage services, facebook, etc)
|
||||
- SSL certificates for web servers
|
||||
- proprietary data and code
|
||||
|
||||
K8s Cluster assets:
|
||||
- Assets of each K8s User
|
||||
- Machine Certificates or secrets.
|
||||
- The value of K8s cluster computing resources (cpu, memory, etc).
|
||||
|
||||
This document is primarily about protecting K8s User assets and K8s cluster assets from other K8s Users and K8s Project and Cluster Admins.
|
||||
|
||||
### Usage environments
|
||||
Cluster in Small organization:
|
||||
- K8s Admins may be the same people as K8s Users.
|
||||
- few K8s Admins.
|
||||
- prefer ease of use to fine-grained access control/precise accounting, etc.
|
||||
- Product requirement that it be easy for potential K8s Cluster Admin to try out setting up a simple cluster.
|
||||
|
||||
Cluster in Large organization:
|
||||
- K8s Admins typically distinct people from K8s Users. May need to divide K8s Cluster Admin access by roles.
|
||||
- K8s Users need to be protected from each other.
|
||||
- Auditing of K8s User and K8s Admin actions important.
|
||||
- flexible accurate usage accounting and resource controls important.
|
||||
- Lots of automated access to APIs.
|
||||
- Need to integrate with existing enterprise directory, authentication, accounting, auditing, and security policy infrastructure.
|
||||
|
||||
Org-run cluster:
|
||||
- the organization that runs the K8s master components is the same as the org that runs apps on K8s.
|
||||
- Minions may be on-premises VMs or physical machines; Cloud VMs; or a mix.
|
||||
|
||||
Hosted cluster:
|
||||
- Offering the K8s API as a service, or offering a PaaS or SaaS built on K8s
|
||||
- May already offer web services, and need to integrate with existing customer account concept, and existing authentication, accounting, auditing, and security policy infrastructure.
|
||||
- May want to leverage K8s User accounts and accounting to manage their User accounts (not a priority to support this use case.)
|
||||
- Precise and accurate accounting of resources needed. Resource controls needed for hard limits (Users given limited slice of data) and soft limits (Users can grow up to some limit and then be expanded).
|
||||
|
||||
K8s ecosystem services:
|
||||
- There may be companies that want to offer their existing services (Build, CI, A/B-test, release automation, etc) for use with K8s. There should be some story for this case.
|
||||
|
||||
Pods configs should be largely portable between Org-run and hosted configurations.
|
||||
|
||||
|
||||
# Design
|
||||
Related discussion:
|
||||
- https://github.com/GoogleCloudPlatform/kubernetes/issues/442
|
||||
- https://github.com/GoogleCloudPlatform/kubernetes/issues/443
|
||||
|
||||
This doc describes two security profiles:
|
||||
- Simple profile: like single-user mode. Make it easy to evaluate K8s without lots of configuring accounts and policies. Protects from unauthorized users, but does not partition authorized users.
|
||||
- Enterprise profile: Provide mechanisms needed for large numbers of users. Defense in depth. Should integrate with existing enterprise security infrastructure.
|
||||
|
||||
K8s distribution should include templates of config, and documentation, for simple and enterprise profiles. System should be flexible enough for knowledgeable users to create intermediate profiles, but K8s developers should only reason about those two Profiles, not a matrix.
|
||||
|
||||
Features in this doc are divided into "Initial Feature", and "Improvements". Initial features would be candidates for version 1.00.
|
||||
|
||||
## Identity
|
||||
### userAccount
|
||||
K8s will have a `userAccount` API object.
|
||||
- `userAccount` has a UID which is immutable. This is used to associate users with objects and to record actions in audit logs.
|
||||
- `userAccount` has a name which is a string and human readable and unique among userAccounts. It is used to refer to users in Policies, to ensure that the Policies are human readable. It can be changed only when there are no Policy objects or other objects which refer to that name. An email address is a suggested format for this field.
|
||||
- `userAccount` is not related to the unix username of processes in Pods created by that userAccount.
|
||||
- `userAccount` API objects can have labels
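
A minimal sketch of what such an object might look like is below. The field names and types here are illustrative assumptions, not a settled API; they simply restate the attributes listed above (plus the default namespace discussed under Initial Features).

```
// UserAccount identifies a human or automated caller of the K8s API.
// Sketch only -- field names are assumptions, not the actual K8s API.
type UserAccount struct {
	UID              string            `json:"uid"`              // immutable; used in audit logs and object associations
	Name             string            `json:"name"`             // unique, human readable, e.g. an email address
	Labels           map[string]string `json:"labels,omitempty"`  // e.g. group membership, roles
	DefaultNamespace string            `json:"defaultNamespace,omitempty"` // assumed by API calls that omit a namespace
}
```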
|
||||
|
||||
The system may associate one or more Authentication Methods with a `userAccount` (but they are not formally part of the userAccount object). In a simple deployment, the authentication method for a user might be an authentication token which is verified by a K8s server. In a more complex deployment, the authentication might be delegated to another system which is trusted by the K8s API to authenticate users, but where the authentication details are unknown to K8s.
|
||||
|
||||
Initial Features:
|
||||
- there is no superuser `userAccount`
|
||||
- `userAccount` objects are statically populated in the K8s API store by reading a config file. Only a K8s Cluster Admin can do this.
|
||||
- `userAccount` can have a default `namespace`. If API call does not specify a `namespace`, the default `namespace` for that caller is assumed.
|
||||
- `userAccount` is global. A single human who has access to multiple namespaces should have only one userAccount.
|
||||
|
||||
Improvements:
|
||||
- Make `userAccount` part of a separate API group from core K8s objects like `pod`. Facilitates plugging in alternate Access Management.
|
||||
|
||||
Simple Profile:
|
||||
- single `userAccount`, used by all K8s Users and Project Admins. One access token shared by all.
|
||||
|
||||
Enterprise Profile:
|
||||
- every human user has own `userAccount`.
|
||||
- `userAccount`s have labels that indicate both membership in groups, and ability to act in certain roles.
|
||||
- each service using the API has own `userAccount` too. (e.g. `scheduler`, `repcontroller`)
|
||||
- automated jobs to denormalize the LDAP group info into the list of users in the K8s userAccount file.
|
||||
|
||||
### Unix accounts
|
||||
A `userAccount` is not a Unix user account. The fact that a pod is started by a `userAccount` does not mean that the processes in that pod's containers run as a Unix user with a corresponding name or identity.
|
||||
|
||||
Initially:
|
||||
- The unix accounts available in a container, and used by the processes running in it, are those provided by the combination of the base operating system and the Docker manifest.
|
||||
- Kubernetes doesn't enforce any relation between `userAccount` and unix accounts.
|
||||
|
||||
Improvements:
|
||||
- Kubelet allocates disjoint blocks of root-namespace uids for each container. This may provide some defense-in-depth against container escapes. (https://github.com/docker/docker/pull/4572)
|
||||
- requires docker to integrate user namespace support, and deciding what getpwnam() does for these uids.
|
||||
- any features that help users avoid use of privileged containers (https://github.com/GoogleCloudPlatform/kubernetes/issues/391)
|
||||
|
||||
### Namespaces
|
||||
K8s will have a `namespace` API object. It is similar to a Google Compute Engine `project`. It provides a namespace for objects created by a group of people co-operating together, preventing name collisions with non-cooperating groups. It also serves as a reference point for authorization policies.
|
||||
|
||||
Namespaces are described in [namespaces.md](namespaces.md).
|
||||
|
||||
In the Enterprise Profile:
|
||||
- a `userAccount` may have permission to access several `namespace`s.
|
||||
|
||||
In the Simple Profile:
|
||||
- There is a single `namespace` used by the single user.
|
||||
|
||||
Namespaces versus userAccount vs Labels:
|
||||
- `userAccount`s are intended for audit logging (both name and UID should be logged), and to define who has access to `namespace`s.
|
||||
- `labels` (see [docs/labels.md](labels.md)) should be used to distinguish pods, users, and other objects that cooperate towards a common goal but are different in some way, such as version, or responsibilities.
|
||||
- `namespace`s prevent name collisions between uncoordinated groups of people, and provide a place to attach common policies for co-operating groups of people.
|
||||
|
||||
|
||||
## Authentication
|
||||
|
||||
Goals for K8s authentication:
|
||||
- Include a built-in authentication system that requires no configuration to use in single-user mode, little configuration to add several user accounts, and no https proxy.
|
||||
- Allow for authentication to be handled by a system external to Kubernetes, to allow integration with existing enterprise authentication systems. The kubernetes namespace itself should avoid taking contributions of multiple authentication schemes. Instead, a trusted proxy in front of the apiserver can be used to authenticate users.
|
||||
- For organizations whose security requirements only allow FIPS compliant implementations (e.g. apache) for authentication.
|
||||
- So the proxy can terminate SSL, and isolate the CA-signed certificate from less trusted, higher-touch APIserver.
|
||||
- For organizations that already have existing SaaS web services (e.g. storage, VMs) and want a common authentication portal.
|
||||
- Avoid mixing authentication and authorization, so that authorization policies can be centrally managed, and to allow changes in authentication methods without affecting authorization code.
|
||||
|
||||
Initially:
|
||||
- Tokens used to authenticate a user.
|
||||
- Long lived tokens identify a particular `userAccount`.
|
||||
- Administrator utility generates tokens at cluster setup.
|
||||
- OAuth2.0 Bearer tokens protocol, http://tools.ietf.org/html/rfc6750
|
||||
- No scopes for tokens. Authorization happens in the API server
|
||||
- Tokens dynamically generated by apiserver to identify pods which are making API calls.
|
||||
- Tokens checked in a module of the APIserver.
|
||||
- Authentication in the apiserver can be disabled by flag, to allow testing without authentication enabled, and to allow use of an authenticating proxy. In this mode, a query parameter or header added by the proxy will identify the caller.
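
As a concrete illustration of the RFC 6750 bearer-token scheme mentioned above, a client request might look like the following sketch. The host, API path, and token value are placeholders, not actual cluster values.

```
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Placeholder apiserver URL; the path format is illustrative only.
	req, err := http.NewRequest("GET", "https://apiserver.example.com/api/v1beta1/pods", nil)
	if err != nil {
		panic(err)
	}
	// Long-lived token generated by the administrator utility at cluster setup.
	req.Header.Set("Authorization", "Bearer 0123456789abcdef")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```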
|
||||
|
||||
Improvements:
|
||||
- Refresh of tokens.
|
||||
- SSH keys to access inside containers.
|
||||
|
||||
To be considered for subsequent versions:
|
||||
- Fuller use of OAuth (http://tools.ietf.org/html/rfc6749)
|
||||
- Scoped tokens.
|
||||
- Tokens that are bound to the channel between the client and the api server
|
||||
- http://www.ietf.org/proceedings/90/slides/slides-90-uta-0.pdf
|
||||
- http://www.browserauth.net
|
||||
|
||||
|
||||
## Authorization
|
||||
|
||||
K8s authorization should:
|
||||
- Allow for a range of maturity levels, from single-user for those test-driving the system, to integration with existing enterprise authorization systems.
|
||||
- Allow for centralized management of users and policies. In some organizations, this will mean that the definition of users and access policies needs to reside on a system other than k8s and encompass other web services (such as a storage service).
|
||||
- Allow processes running in K8s Pods to take on identity, and to allow narrow scoping of permissions for those identities in order to limit damage from software faults.
|
||||
- Have Authorization Policies exposed as API objects so that a single config file can create or delete Pods, Controllers, Services, and the identities and policies for those Pods and Controllers.
|
||||
- Be separate as much as practical from Authentication, to allow Authentication methods to change over time and space, without impacting Authorization policies.
|
||||
|
||||
K8s will implement a relatively simple [Attribute-Based Access Control](http://en.wikipedia.org/wiki/Attribute_Based_Access_Control) model. The model will be described in more detail in a forthcoming document. The model will:
|
||||
- Be less complex than XACML
|
||||
- Be easily recognizable to those familiar with Amazon IAM Policies.
|
||||
- Have a subset/aliases/defaults which allow it to be used in a way comfortable to those users more familiar with Role-Based Access Control.
|
||||
|
||||
Authorization policy is set by creating a set of Policy objects.
|
||||
|
||||
The API Server will be the Enforcement Point for Policy. For each API call that it receives, it will construct the Attributes needed to evaluate the policy (what user is making the call, what resource they are accessing, what they are trying to do to that resource, etc.) and pass those attributes to a Decision Point. The Decision Point code evaluates the Attributes against all the Policies and allows or denies the API call. The system will be modular enough that the Decision Point code can either be linked into the APIserver binary, or be another service that the apiserver calls for each Decision (with appropriate time-limited caching as needed for performance).
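
A sketch of the Enforcement Point / Decision Point split described above is shown below. The attribute and policy fields, and the wildcard matching, are illustrative assumptions, not the forthcoming policy schema.

```
// Attributes are constructed by the apiserver (the Enforcement Point) for each API call.
type Attributes struct {
	User      string // caller's userAccount name
	Namespace string // namespace the request is scoped to, if any
	Resource  string // e.g. "pods"
	Action    string // e.g. "get", "create", "delete"
}

// Policy allows a user (or "*") to perform an action on a resource in a namespace.
// This is a hypothetical shape, not the actual Policy API object.
type Policy struct {
	User      string
	Namespace string
	Resource  string
	Action    string
}

func match(pattern, value string) bool { return pattern == "*" || pattern == value }

// Authorize is the Decision Point: it allows the call if any policy matches.
// It could be linked into the apiserver binary or run as a separate service.
func Authorize(a Attributes, policies []Policy) bool {
	for _, p := range policies {
		if match(p.User, a.User) && match(p.Namespace, a.Namespace) &&
			match(p.Resource, a.Resource) && match(p.Action, a.Action) {
			return true
		}
	}
	return false
}
```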
|
||||
|
||||
Some Policy objects may be applicable only to a single namespace; K8s Project Admins would be able to create those as needed. Other Policy objects may be applicable to all namespaces; a K8s Cluster Admin might create those in order to authorize a new type of controller to be used by all namespaces, or to make a K8s User into a K8s Project Admin.
|
||||
|
||||
|
||||
## Accounting
|
||||
|
||||
The API should have a `quota` concept (see https://github.com/GoogleCloudPlatform/kubernetes/issues/442). A quota object relates a namespace (and optionally a label selector) to a maximum quantity of resources that may be used (see [resources.md](resources.md)).
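
A minimal sketch of such a quota object follows. The field names and resource keys are assumptions for illustration, not a settled API; it only restates the relation described above between a namespace, an optional label selector, and maximum resource quantities.

```
// Quota relates a namespace (and optionally a label selector) to a maximum
// quantity of resources that may be used. Sketch only; field names are assumptions.
type Quota struct {
	Namespace string            `json:"namespace"`
	Selector  map[string]string `json:"selector,omitempty"` // optional label selector
	// Maximum allowed usage, e.g. {"cpu": "100", "memory": "512Gi"}.
	Limits map[string]string `json:"limits"`
}
```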
|
||||
|
||||
Initially:
|
||||
- a `quota` object is immutable.
|
||||
- for hosted K8s systems that do billing, Project is the recommended level for billing accounts.
|
||||
- Every object that consumes resources should have a `namespace` so that resource usage stats can be rolled up per `namespace`.
|
||||
- K8s Cluster Admin sets quota objects by writing a config file.
|
||||
|
||||
Improvements:
|
||||
- allow one namespace to charge the quota for one or more other namespaces. This would be controlled by a policy which allows changing a billing_namespace= label on an object.
|
||||
- allow quota to be set by namespace owners for (namespace x label) combinations (e.g. let the "webserver" namespace use 100 cores, but to prevent accidents, don't allow "webserver" namespace and "instance=test" to use more than 10 cores).
|
||||
- tools to help write consistent quota config files based on number of minions, historical namespace usages, QoS needs, etc.
|
||||
- way for K8s Cluster Admin to incrementally adjust Quota objects.
|
||||
|
||||
Simple profile:
|
||||
- a single `namespace` with infinite resource limits.
|
||||
|
||||
Enterprise profile:
|
||||
- multiple namespaces each with their own limits.
|
||||
|
||||
Issues:
|
||||
- need for locking or "eventual consistency" when multiple apiserver goroutines are accessing the object store and handling pod creations.
|
||||
|
||||
|
||||
## Audit Logging
|
||||
|
||||
API actions can be logged.
|
||||
|
||||
Initial implementation:
|
||||
- All API calls logged to nginx logs.
|
||||
|
||||
Improvements:
|
||||
- API server does logging instead.
|
||||
- Policies to drop logging for high rate trusted API calls, or by users performing audit or other sensitive functions.
|
|
@ -0,0 +1,90 @@
|
|||
# Identifiers and Names in Kubernetes
|
||||
|
||||
A summary of the goals and recommendations for identifiers in Kubernetes. Described in [GitHub issue #199](https://github.com/GoogleCloudPlatform/kubernetes/issues/199).
|
||||
|
||||
|
||||
## Definitions
|
||||
|
||||
UID
|
||||
: A non-empty, opaque, system-generated value guaranteed to be unique in time and space; intended to distinguish between historical occurrences of similar entities.
|
||||
|
||||
Name
|
||||
: A non-empty string guaranteed to be unique within a given scope at a particular time; used in resource URLs; provided by clients at creation time and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish distinct entities, and reference particular entities across operations.
|
||||
|
||||
[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) label (DNS_LABEL)
|
||||
: An alphanumeric (a-z, A-Z, and 0-9) string, with a maximum length of 63 characters, with the '-' character allowed anywhere except the first or last character, suitable for use as a hostname or segment in a domain name
|
||||
|
||||
[rfc1035](http://www.ietf.org/rfc/rfc1035.txt)/[rfc1123](http://www.ietf.org/rfc/rfc1123.txt) subdomain (DNS_SUBDOMAIN)
|
||||
: One or more rfc1035/rfc1123 labels separated by '.' with a maximum length of 253 characters
|
||||
|
||||
[rfc4122](http://www.ietf.org/rfc/rfc4122.txt) universally unique identifier (UUID)
|
||||
: A 128 bit generated value that is extremely unlikely to collide across time and space and requires no central coordination
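
The following sketch encodes only the DNS_LABEL and DNS_SUBDOMAIN constraints stated in the definitions above; the exact validation rules Kubernetes enforces may differ, and the function names are illustrative.

```
package validation

import (
	"regexp"
	"strings"
)

// Alphanumeric, with '-' allowed anywhere except the first or last character.
var dnsLabelRe = regexp.MustCompile(`^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$`)

// IsDNSLabel checks the DNS_LABEL definition: at most 63 characters.
func IsDNSLabel(s string) bool {
	return len(s) <= 63 && dnsLabelRe.MatchString(s)
}

// IsDNSSubdomain checks the DNS_SUBDOMAIN definition: one or more labels
// separated by '.', at most 253 characters overall.
func IsDNSSubdomain(s string) bool {
	if len(s) == 0 || len(s) > 253 {
		return false
	}
	for _, label := range strings.Split(s, ".") {
		if !IsDNSLabel(label) {
			return false
		}
	}
	return true
}
```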
|
||||
|
||||
|
||||
## Objectives for names and UIDs
|
||||
|
||||
1. Uniquely identify (via a UID) an object across space and time
|
||||
|
||||
2. Uniquely name (via a name) an object across space
|
||||
|
||||
3. Provide human-friendly names in API operations and/or configuration files
|
||||
|
||||
4. Allow idempotent creation of API resources (#148) and enforcement of space-uniqueness of singleton objects
|
||||
|
||||
5. Allow DNS names to be automatically generated for some objects
|
||||
|
||||
|
||||
## General design
|
||||
|
||||
1. When an object is created via an API, a Name string (a DNS_SUBDOMAIN) must be specified. Name must be non-empty and unique within the apiserver. This enables idempotent and space-unique creation operations. Parts of the system (e.g. replication controller) may join strings (e.g. a base name and a random suffix) to create a unique Name (see the sketch after this list). For situations where generating a name is impractical, some or all objects may support a param to auto-generate a name. Generating random names will defeat idempotency.
|
||||
* Examples: "guestbook.user", "backend-x4eb1"
|
||||
|
||||
2. When an object is created via an api, a Namespace string (a DNS_SUBDOMAIN? format TBD via #1114) may be specified. Depending on the API receiver, namespaces might be validated (e.g. apiserver might ensure that the namespace actually exists). If a namespace is not specified, one will be assigned by the API receiver. This assignment policy might vary across API receivers (e.g. apiserver might have a default, kubelet might generate something semi-random).
|
||||
* Example: "api.k8s.example.com"
|
||||
|
||||
3. Upon acceptance of an object via an API, the object is assigned a UID (a UUID). UID must be non-empty and unique across space and time.
|
||||
* Example: "01234567-89ab-cdef-0123-456789abcdef"
|
||||
|
||||
|
||||
## Case study: Scheduling a pod
|
||||
|
||||
Pods can be placed onto a particular node in a number of ways. This case study demonstrates how the above design can be applied to satisfy the objectives.
|
||||
|
||||
### A pod scheduled by a user through the apiserver
|
||||
|
||||
1. A user submits a pod with Namespace="" and Name="guestbook" to the apiserver.
|
||||
|
||||
2. The apiserver validates the input.
|
||||
1. A default Namespace is assigned.
|
||||
2. The pod name must be space-unique within the Namespace.
|
||||
3. Each container within the pod has a name which must be space-unique within the pod.
|
||||
|
||||
3. The pod is accepted.
|
||||
1. A new UID is assigned.
|
||||
|
||||
4. The pod is bound to a node.
|
||||
1. The kubelet on the node is passed the pod's UID, Namespace, and Name.
|
||||
|
||||
5. Kubelet validates the input.
|
||||
|
||||
6. Kubelet runs the pod.
|
||||
1. Each container is started up with enough metadata to distinguish the pod from whence it came.
|
||||
2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
|
||||
* This may correspond to Docker's container ID.
|
||||
|
||||
### A pod placed by a config file on the node
|
||||
|
||||
1. A config file is stored on the node, containing a pod with UID="", Namespace="", and Name="cadvisor".
|
||||
|
||||
2. Kubelet validates the input.
|
||||
1. Since UID is not provided, kubelet generates one.
|
||||
2. Since Namespace is not provided, kubelet generates one.
|
||||
1. The generated namespace should be deterministic and cluster-unique for the source, such as a hash of the hostname and file path (see the sketch after this list).
|
||||
* E.g. Namespace="file-f4231812554558a718a01ca942782d81"
|
||||
|
||||
3. Kubelet runs the pod.
|
||||
1. Each container is started up with enough metadata to distinguish the pod from whence it came.
|
||||
2. Each attempt to run a container is assigned a UID (a string) that is unique across time.
|
||||
1. This may correspond to Docker's container ID.
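
One way to derive the deterministic, cluster-unique namespace described in step 2.2 above is sketched below. The use of MD5 and the exact input format are assumptions for illustration; any stable hash of the hostname and file path would satisfy the requirement.

```
package kubelet

import (
	"crypto/md5"
	"fmt"
)

// namespaceForFile derives a deterministic namespace for a file-sourced pod.
func namespaceForFile(hostname, path string) string {
	sum := md5.Sum([]byte(hostname + ":" + path))
	// e.g. "file-f4231812554558a718a01ca942782d81"
	return fmt.Sprintf("file-%x", sum)
}
```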
|
|
@ -0,0 +1,68 @@
|
|||
# Labels
|
||||
|
||||
_Labels_ are key/value pairs identifying client/user-defined attributes (and non-primitive system-generated attributes) of API objects, which are stored and returned as part of the [metadata of those objects](api-conventions.md). Labels can be used to organize and to select subsets of objects according to these attributes.
|
||||
|
||||
Each object can have a set of key/value labels set on it, with at most one label with a particular key.
|
||||
```
|
||||
"labels": {
|
||||
"key1" : "value1",
|
||||
"key2" : "value2"
|
||||
}
|
||||
```
|
||||
|
||||
Unlike [names and UIDs](identifiers.md), labels do not provide uniqueness. In general, we expect many objects to carry the same label(s).
|
||||
|
||||
Via a _label selector_, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes.
|
||||
|
||||
Label selectors may also be used to associate policies with sets of objects.
|
||||
|
||||
We also [plan](https://github.com/GoogleCloudPlatform/kubernetes/issues/560) to make labels available inside pods and [lifecycle hooks](container-environment.md).
|
||||
|
||||
[Namespacing of label keys](https://github.com/GoogleCloudPlatform/kubernetes/issues/1491) is under discussion.
|
||||
|
||||
Valid labels follow a slightly modified RFC952 format: 24 characters or less, all lowercase, beginning with an alphabetic character, with dashes (-) allowed in between, and ending with an alphanumeric character.
|
||||
|
||||
## Motivation
|
||||
|
||||
Service deployments and batch processing pipelines are often multi-dimensional entities (e.g., multiple partitions or deployments, multiple release tracks, multiple tiers, multiple micro-services per tier). Management often requires cross-cutting operations, which breaks encapsulation of strictly hierarchical representations, especially rigid hierarchies determined by the infrastructure rather than by users. Labels enable users to map their own organizational structures onto system objects in a loosely coupled fashion, without requiring clients to store these mappings.
|
||||
|
||||
## Label selectors
|
||||
|
||||
Label selectors permit very simple filtering by label keys and values. The simplicity of label selectors is deliberate. It is intended to facilitate transparency for humans, easy set overlap detection, efficient indexing, and reverse-indexing (i.e., finding all label selectors matching an object's labels - https://github.com/GoogleCloudPlatform/kubernetes/issues/1348).
|
||||
|
||||
Currently the system supports selection by exact match of a map of keys and values. Matching objects must have all of the specified labels (both keys and values), though they may have additional labels as well.
|
||||
|
||||
We are in the process of extending the label selection specification (see [selector.go](../blob/master/pkg/labels/selector.go) and https://github.com/GoogleCloudPlatform/kubernetes/issues/341) to support conjunctions of requirements of the following forms:
|
||||
```
|
||||
key1 in (value11, value12, ...)
|
||||
key1 not in (value11, value12, ...)
|
||||
key1 exists
|
||||
```
|
||||
|
||||
LIST and WATCH operations may specify label selectors to filter the sets of objects returned using a query parameter: `?labels=key1%3Dvalue1,key2%3Dvalue2,...`. We may extend such filtering to DELETE operations in the future.
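
The sketch below illustrates the exact-match semantics and the `labels` query parameter described above. The helper names are illustrative, not an actual client API; note that standard query encoding may also percent-encode the commas, which decodes to the same selector string.

```
package labels

import (
	"net/url"
	"strings"
)

// Matches reports whether an object's labels satisfy a selector: every
// key/value in the selector must be present on the object, which may also
// carry additional labels.
func Matches(selector, objectLabels map[string]string) bool {
	for k, v := range selector {
		if objectLabels[k] != v {
			return false
		}
	}
	return true
}

// Query builds the "labels=key1%3Dvalue1,key2%3Dvalue2,..." parameter for
// LIST and WATCH operations. Map iteration order is not deterministic, which
// is fine for a selector.
func Query(selector map[string]string) string {
	pairs := make([]string, 0, len(selector))
	for k, v := range selector {
		pairs = append(pairs, k+"="+v)
	}
	q := url.Values{}
	q.Set("labels", strings.Join(pairs, ","))
	return q.Encode()
}
```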
|
||||
|
||||
Kubernetes also currently supports two objects that use label selectors to keep track of their members, `service`s and `replicationController`s:
|
||||
- `service`: A [service](services.md) is a configuration unit for the proxies that run on every worker node. It is named and points to one or more pods.
|
||||
- `replicationController`: A [replication controller](replication-controller.md) ensures that a specified number of pod "replicas" are running at any one time. If there are too many, it'll kill some. If there are too few, it'll start more.
|
||||
|
||||
The set of pods that a `service` targets is defined with a label selector. Similarly, the population of pods that a `replicationController` is monitoring is also defined with a label selector.
|
||||
|
||||
For management convenience and consistency, `services` and `replicationControllers` may themselves have labels and would generally carry the labels their corresponding pods have in common.
|
||||
|
||||
In the future, label selectors will be used to identify other types of distributed service workers, such as worker pool members or peers in a distributed application.
|
||||
|
||||
Individual labels are used to specify identifying metadata, and to convey the semantic purposes/roles of pods or containers. Examples of typical pod label keys include `service`, `environment` (e.g., with values `dev`, `qa`, or `production`), `tier` (e.g., with values `frontend` or `backend`), and `track` (e.g., with values `daily` or `weekly`), but you are free to develop your own conventions.
|
||||
|
||||
Sets identified by labels and label selectors could be overlapping (think Venn diagrams). For instance, a service might target all pods with `tier in (frontend), environment in (prod)`. Now say you have 10 replicated pods that make up this tier. But you want to be able to 'canary' a new version of this component. You could set up a `replicationController` (with `replicas` set to 9) for the bulk of the replicas with labels `tier=frontend, environment=prod, track=stable` and another `replicationController` (with `replicas` set to 1) for the canary with labels `tier=frontend, environment=prod, track=canary`. Now the service is covering both the canary and non-canary pods. But you can mess with the `replicationControllers` separately to test things out, monitor the results, etc.
|
||||
|
||||
Note that the superset described in the previous example is also heterogeneous. In long-lived, highly available, horizontally scaled, distributed, continuously evolving service applications, heterogeneity is inevitable, due to canaries, incremental rollouts, live reconfiguration, simultaneous updates and auto-scaling, hardware upgrades, and so on.
|
||||
|
||||
Pods (and other objects) may belong to multiple sets simultaneously, which enables representation of service substructure and/or superstructure. In particular, labels are intended to facilitate the creation of non-hierarchical, multi-dimensional deployment structures. They are useful for a variety of management purposes (e.g., configuration, deployment) and for application introspection and analysis (e.g., logging, monitoring, alerting, analytics). Without the ability to form sets by intersecting labels, many implicitly related, overlapping flat sets would need to be created, for each subset and/or superset desired, which would lose semantic information and be difficult to keep consistent. Purely hierarchically nested sets wouldn't readily support slicing sets across different dimensions.
|
||||
|
||||
Pods may be removed from these sets by changing their labels. This flexibility may be used to remove pods from service for debugging, data recovery, etc.
|
||||
|
||||
Since labels can be set at pod creation time, no separate set add/remove operations are necessary, which makes them easier to use than manual set management. Additionally, since labels are directly attached to pods and label selectors are fairly simple, it's easy for users and for clients and tools to determine what sets they belong to (i.e., they are reversible). OTOH, with sets formed by just explicitly enumerating members, one would (conceptually) need to search all sets to determine which ones a pod belonged to.
|
||||
|
||||
## Labels vs. annotations
|
||||
|
||||
We'll eventually index and reverse-index labels for efficient queries and watches, use them to sort and group in UIs and CLIs, etc. We don't want to pollute labels with non-identifying, especially large and/or structured, data. Non-identifying information should be recorded using [annotations](annotations.md).
|
|
@ -0,0 +1,193 @@
|
|||
# Kubernetes Proposal - Namespaces
|
||||
|
||||
**Related PR:**
|
||||
|
||||
| Topic | Link |
|
||||
| ---- | ---- |
|
||||
| Identifiers.md | https://github.com/GoogleCloudPlatform/kubernetes/pull/1216 |
|
||||
| Access.md | https://github.com/GoogleCloudPlatform/kubernetes/pull/891 |
|
||||
| Indexing | https://github.com/GoogleCloudPlatform/kubernetes/pull/1183 |
|
||||
| Cluster Subdivision | https://github.com/GoogleCloudPlatform/kubernetes/issues/442 |
|
||||
|
||||
## Background
|
||||
|
||||
High level goals:
|
||||
|
||||
* Enable an easy-to-use mechanism to logically scope Kubernetes resources
|
||||
* Ensure extension resources to Kubernetes can share the same logical scope as core Kubernetes resources
|
||||
* Ensure it aligns with access control proposal
|
||||
* Ensure the system scales (log n) with increasing numbers of scopes
|
||||
|
||||
## Use cases
|
||||
|
||||
Actors:
|
||||
|
||||
1. k8s admin - administers a kubernetes cluster
2. k8s service - a k8s daemon that operates on behalf of another user (i.e. controller-manager)
3. k8s policy manager - enforces policies imposed on the k8s cluster
4. k8s user - uses a kubernetes cluster to schedule pods
|
||||
|
||||
User stories:
|
||||
|
||||
1. Ability to set immutable namespace to k8s resources
|
||||
2. Ability to list k8s resource scoped to a namespace
|
||||
3. Restrict a namespace identifier to a DNS-compatible string to support compound naming conventions
|
||||
4. Ability for a k8s policy manager to enforce a k8s user's access to a set of namespaces
|
||||
5. Ability to set/unset a default namespace for use by kubecfg client
|
||||
6. Ability for a k8s service to monitor resource changes across namespaces
|
||||
7. Ability for a k8s service to list resources across namespaces
|
||||
|
||||
## Proposed Design
|
||||
|
||||
### Model Changes
|
||||
|
||||
Introduce a new attribute *Namespace* for each resource that must be scoped in a Kubernetes cluster.
|
||||
|
||||
A *Namespace* is a DNS compatible subdomain.
|
||||
|
||||
```
|
||||
// TypeMeta is shared by all objects sent to, or returned from the client
|
||||
type TypeMeta struct {
|
||||
Kind string `json:"kind,omitempty" yaml:"kind,omitempty"`
|
||||
Uid string `json:"uid,omitempty" yaml:"uid,omitempty"`
|
||||
CreationTimestamp util.Time `json:"creationTimestamp,omitempty" yaml:"creationTimestamp,omitempty"`
|
||||
SelfLink string `json:"selfLink,omitempty" yaml:"selfLink,omitempty"`
|
||||
ResourceVersion uint64 `json:"resourceVersion,omitempty" yaml:"resourceVersion,omitempty"`
|
||||
APIVersion string `json:"apiVersion,omitempty" yaml:"apiVersion,omitempty"`
|
||||
Namespace string `json:"namespace,omitempty" yaml:"namespace,omitempty"`
|
||||
Name string `json:"name,omitempty" yaml:"name,omitempty"`
|
||||
}
|
||||
```
|
||||
|
||||
An identifier, *UID*, is unique across time and space, and is intended to distinguish between historical occurrences of similar entities.
|
||||
|
||||
A *Name* is unique within a given *Namespace* at a particular time, used in resource URLs; provided by clients at creation time and encouraged to be human friendly; intended to facilitate creation idempotence and space-uniqueness of singleton objects, distinguish distinct entities, and reference particular entities across operations.
|
||||
|
||||
As of this writing, the following resources MUST have a *Namespace* and *Name*
|
||||
|
||||
* pod
|
||||
* service
|
||||
* replicationController
|
||||
* endpoint
|
||||
|
||||
A *policy* MAY be associated with a *Namespace*.
|
||||
|
||||
If a *policy* has an associated *Namespace*, the resource paths it enforces are scoped to a particular *Namespace*.
|
||||
|
||||
## k8s API server
|
||||
|
||||
In support of namespace isolation, the Kubernetes API server will address resources by the following conventions:
|
||||
|
||||
The typical actors for the following requests are the k8s user or the k8s service.
|
||||
|
||||
| Action | HTTP Verb | Path | Description |
|
||||
| ---- | ---- | ---- | ---- |
|
||||
| CREATE | POST | /api/{version}/ns/{ns}/{resourceType}/ | Create instance of {resourceType} in namespace {ns} |
|
||||
| GET | GET | /api/{version}/ns/{ns}/{resourceType}/{name} | Get instance of {resourceType} in namespace {ns} with {name} |
|
||||
| UPDATE | PUT | /api/{version}/ns/{ns}/{resourceType}/{name} | Update instance of {resourceType} in namespace {ns} with {name} |
|
||||
| DELETE | DELETE | /api/{version}/ns/{ns}/{resourceType}/{name} | Delete instance of {resourceType} in namespace {ns} with {name} |
|
||||
| LIST | GET | /api/{version}/ns/{ns}/{resourceType} | List instances of {resourceType} in namespace {ns} |
|
||||
| WATCH | GET | /api/{version}/watch/ns/{ns}/{resourceType} | Watch for changes to a {resourceType} in namespace {ns} |
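
The following helper illustrates how a client might compose the namespace-scoped paths in the table above. The version string and the example values are placeholders, and the helper itself is a sketch, not an actual client library function.

```
package client

import "fmt"

// resourcePath builds /api/{version}/ns/{ns}/{resourceType}[/{name}].
func resourcePath(version, ns, resourceType, name string) string {
	p := fmt.Sprintf("/api/%s/ns/%s/%s", version, ns, resourceType)
	if name != "" {
		p += "/" + name
	}
	return p
}

// Example (placeholder version):
//   resourcePath("v1beta1", "ns1", "pods", "guestbook")
//   => "/api/v1beta1/ns/ns1/pods/guestbook"
```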
|
||||
|
||||
The typical actor for the following requests are the k8s service or k8s admin as enforced by k8s Policy.
|
||||
|
||||
| Action | HTTP Verb | Path | Description |
|
||||
| ---- | ---- | ---- | ---- |
|
||||
| WATCH | GET | /api/{version}/watch/{resourceType} | Watch for changes to a {resourceType} across all namespaces |
|
||||
| LIST | GET | /api/{version}/list/{resourceType} | List instances of {resourceType} across all namespaces |
|
||||
|
||||
The legacy API patterns for k8s are an alias to interacting with the *default* namespace as follows.
|
||||
|
||||
| Action | HTTP Verb | Path | Description |
|
||||
| ---- | ---- | ---- | ---- |
|
||||
| CREATE | POST | /api/{version}/{resourceType}/ | Create instance of {resourceType} in namespace *default* |
|
||||
| GET | GET | /api/{version}/{resourceType}/{name} | Get instance of {resourceType} in namespace *default* |
|
||||
| UPDATE | PUT | /api/{version}/{resourceType}/{name} | Update instance of {resourceType} in namespace *default* |
|
||||
| DELETE | DELETE | /api/{version}/{resourceType}/{name} | Delete instance of {resourceType} in namespace *default* |
|
||||
|
||||
The k8s API server verifies that the *Namespace* on resource creation matches the *{ns}* on the path.
|
||||
|
||||
The k8s API server will enable efficient mechanisms to filter model resources based on the *Namespace*. This may require the creation of an index on *Namespace* that could support query by namespace with optional label selectors.
|
||||
|
||||
The k8s API server will associate a resource with a *Namespace* if it is not populated by the end-user, based on the *Namespace* context of the incoming request. If the *Namespace* of the resource being created or updated does not match the *Namespace* on the request, then the k8s API server will reject the request.
|
||||
|
||||
TODO: Update to discuss k8s api server proxy patterns
|
||||
|
||||
## k8s storage
|
||||
|
||||
A namespace provides a unique identifier space and therefore must be in the storage path of a resource.
|
||||
|
||||
In etcd, we want to continue to support efficient WATCH across namespaces.
|
||||
|
||||
Resources that persist content in etcd will have storage paths as follows:
|
||||
|
||||
/registry/{resourceType}/{resource.Namespace}/{resource.Name}
|
||||
|
||||
This enables a k8s service to WATCH /registry/{resourceType} for changes to a particular {resourceType} across all namespaces.
|
||||
|
||||
Upon scheduling a pod to a particular host, the pod's namespace must be in the key path as follows:
|
||||
|
||||
/host/{host}/pod/{pod.Namespace}/{pod.Name}
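
The sketch below simply composes the two storage paths described above; the helper names and example values are assumptions for illustration.

```
package registry

import "fmt"

// resourceKey builds the registry key for a resource,
// e.g. resourceKey("pods", "ns1", "guestbook") => "/registry/pods/ns1/guestbook".
func resourceKey(resourceType, namespace, name string) string {
	return fmt.Sprintf("/registry/%s/%s/%s", resourceType, namespace, name)
}

// boundPodKey builds the key for a pod bound to a host,
// e.g. boundPodKey("node-1", "ns1", "guestbook") => "/host/node-1/pod/ns1/guestbook".
func boundPodKey(host, namespace, name string) string {
	return fmt.Sprintf("/host/%s/pod/%s/%s", host, namespace, name)
}
```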
|
||||
|
||||
## k8s Authorization service
|
||||
|
||||
This design assumes the existence of an authorization service that filters incoming requests to the k8s API Server in order to enforce user authorization to a particular k8s resource. It performs this action by associating the *subject* of a request with a *policy* for the associated HTTP path and verb. This design encodes the *namespace* in the resource path in order to enable external policy servers to function by resource path alone. If a request is made by an identity that is not allowed by policy to access the resource, the request is terminated. Otherwise, it is forwarded to the apiserver.
|
||||
|
||||
## k8s controller-manager
|
||||
|
||||
The controller-manager will provision pods in the same namespace as the associated replicationController.
|
||||
|
||||
## k8s Kubelet
|
||||
|
||||
There is no major change to the kubelet introduced by this proposal.
|
||||
|
||||
### kubecfg client
|
||||
|
||||
kubecfg supports the following:
|
||||
|
||||
```
|
||||
kubecfg [OPTIONS] ns {namespace}
|
||||
```
|
||||
|
||||
To set a namespace to use across multiple operations:
|
||||
|
||||
```
|
||||
$ kubecfg ns ns1
|
||||
```
|
||||
|
||||
To view the current namespace:
|
||||
|
||||
```
|
||||
$ kubecfg ns
|
||||
Using namespace ns1
|
||||
```
|
||||
|
||||
To reset to the default namespace:
|
||||
|
||||
```
|
||||
$ kubecfg ns default
|
||||
```
|
||||
|
||||
In addition, each kubecfg request may explicitly specify a namespace for the operation via the following OPTION
|
||||
|
||||
--ns
|
||||
|
||||
When loading resource files specified by the -c OPTION, the kubecfg client will ensure the namespace is set in the message body to match the client-specified default.
|
||||
|
||||
If no default namespace is applied, the client will assume the following default namespace:
|
||||
|
||||
* default
|
||||
|
||||
The kubecfg client would store default namespace information in the same manner it caches authentication information today, as a file on the user's file system.
|
||||
|
|
@ -0,0 +1,107 @@
|
|||
# Networking
|
||||
|
||||
## Model and motivation
|
||||
|
||||
Kubernetes deviates from the default Docker networking model. The goal is for each pod to have an IP in a flat shared networking namespace that has full communication with other physical computers and containers across the network. IP-per-pod creates a clean, backward-compatible model where pods can be treated much like VMs or physical hosts from the perspectives of port allocation, networking, naming, service discovery, load balancing, application configuration, and migration.
|
||||
|
||||
OTOH, dynamic port allocation requires supporting both static ports (e.g., for externally accessible services) and dynamically allocated ports, requires partitioning centrally allocated and locally acquired dynamic ports, complicates scheduling (since ports are a scarce resource), is inconvenient for users, complicates application configuration, is plagued by port conflicts and reuse and exhaustion, requires non-standard approaches to naming (e.g., etcd rather than DNS), requires proxies and/or redirection for programs using standard naming/addressing mechanisms (e.g., web browsers), requires watching and cache invalidation for address/port changes for instances in addition to watching group membership changes, and obstructs container/pod migration (e.g., using CRIU). NAT introduces additional complexity by fragmenting the addressing space, which breaks self-registration mechanisms, among other problems.
|
||||
|
||||
With the IP-per-pod model, all user containers within a pod behave as if they are on the same host with regard to networking. They can all reach each other’s ports on localhost. Ports are published to the host interface in the normal Docker way. All containers in all pods can talk to all other containers in all other pods by their 10-dot addresses.
|
||||
|
||||
In addition to avoiding the aforementioned problems with dynamic port allocation, this approach reduces friction for applications moving from the world of uncontainerized apps on physical or virtual hosts to containers within pods. People running application stacks together on the same host have already figured out how to make ports not conflict (e.g., by configuring them through environment variables) and have arranged for clients to find them.
|
||||
|
||||
The approach does reduce isolation between containers within a pod -- ports could conflict, and there couldn't be private ports across containers within a pod, but applications requiring their own port spaces could just run as separate pods and processes requiring private communication could run within the same container. Besides, the premise of pods is that containers within a pod share some resources (volumes, cpu, ram, etc.) and therefore expect and tolerate reduced isolation. Additionally, the user can control what containers belong to the same pod whereas, in general, they don't control what pods land together on a host.
|
||||
|
||||
When any container calls SIOCGIFADDR, it sees the IP that any peer container would see them coming from -- each pod has its own IP address that other pods can know. By making IP addresses and ports the same within and outside the containers and pods, we create a NAT-less, flat address space. "ip addr show" should work as expected. This would enable all existing naming/discovery mechanisms to work out of the box, including self-registration mechanisms and applications that distribute IP addresses. (We should test that with etcd and perhaps one other option, such as Eureka (used by Acme Air) or Consul.) We should be optimizing for inter-pod network communication. Within a pod, containers are more likely to use communication through volumes (e.g., tmpfs) or IPC.
|
||||
|
||||
This is different from the standard Docker model. In that mode, each container gets an IP in the 172-dot space and would only see that 172-dot address from SIOCGIFADDR. If these containers connect to another container, the peer sees the connection coming from a different IP than the one the container itself knows. In short, you can never self-register anything from a container, because a container cannot be reached on its private IP.
|
||||
|
||||
An alternative we considered was an additional layer of addressing: pod-centric IP per container. Each container would have its own local IP address, visible only within that pod. This would perhaps make it easier for containerized applications to move from physical/virtual hosts to pods, but would be more complex to implement (e.g., requiring a bridge per pod, split-horizon/VP DNS) and to reason about, due to the additional layer of address translation, and would break self-registration and IP distribution mechanisms.
|
||||
|
||||
## Current implementation
|
||||
|
||||
For the Google Compute Engine cluster configuration scripts, [advanced routing](https://developers.google.com/compute/docs/networking#routing) is set up so that each VM has an extra 256 IP addresses that get routed to it. This is in addition to the 'main' IP address assigned to the VM that is NAT-ed for Internet access. The networking bridge (called `cbr0` to differentiate it from `docker0`) is set up outside of Docker proper and only does NAT for egress network traffic that isn't aimed at the virtual network.
|
||||
|
||||
Ports mapped in from the 'main IP' (and hence the internet if the right firewall rules are set up) are proxied in user mode by Docker. In the future, this should be done with `iptables` by either the Kubelet or Docker: [Issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15).
|
||||
|
||||
We start Docker with:
|
||||
DOCKER_OPTS="--bridge cbr0 --iptables=false"
|
||||
|
||||
We set up this bridge on each node with SaltStack, in [container_bridge.py](cluster/saltbase/salt/_states/container_bridge.py).
|
||||
|
||||
    cbr0:
      container_bridge.ensure:
        - cidr: {{ grains['cbr-cidr'] }}
      ...
    grains:
      roles:
        - kubernetes-pool
      cbr-cidr: $MINION_IP_RANGE
|
||||
|
||||
We make these addresses routable in GCE:
|
||||
|
||||
    gcutil addroute ${MINION_NAMES[$i]} ${MINION_IP_RANGES[$i]} \
      --norespect_terminal_width \
      --project ${PROJECT} \
      --network ${NETWORK} \
      --next_hop_instance ${ZONE}/instances/${MINION_NAMES[$i]} &
|
||||
|
||||
The minion IP ranges are /24s in the 10-dot space.
|
||||
|
||||
GCE itself does not know anything about these IPs, though.
|
||||
|
||||
These are not externally routable, though, so containers that need to communicate with the outside world need to use host networking. An external IP set up to forward to the VM will only forward to the VM's primary IP (which is assigned to no pod). So we use docker's -p flag to map published ports to the main interface. This has the side effect of disallowing two pods from exposing the same port. (More discussion on this in [Issue #390](https://github.com/GoogleCloudPlatform/kubernetes/issues/390).)
|
||||
|
||||
We create a container to use for the pod network namespace -- a single loopback device and a single veth device. All the user's containers get their network namespaces from this pod networking container.
|
||||
|
||||
Docker allocates IP addresses from a bridge we create on each node, using its “container” networking mode.
|
||||
|
||||
1. Create a normal (in the networking sense) container which uses a minimal image and runs a command that blocks forever. This is not a user-defined container, and gets a special well-known name.
|
||||
- creates a new network namespace (netns) and loopback device
|
||||
- creates a new pair of veth devices and binds them to the netns
|
||||
- auto-assigns an IP from docker’s IP range
|
||||
|
||||
2. Create the user containers and specify the name of the network container as their “net” argument. Docker finds the PID of the command running in the network container and attaches to the netns of that PID.
|
||||
|
||||
### Other networking implementation examples
|
||||
Other implementations exist that serve the same purpose (providing the IP-per-pod model) outside of GCE.
|
||||
- [OpenVSwitch with GRE/VxLAN](../ovs-networking.md)
|
||||
- [Flannel](https://github.com/coreos/flannel#flannel)
|
||||
|
||||
## Challenges and future work
|
||||
|
||||
### Docker API
|
||||
|
||||
Right now, docker inspect doesn't show the networking configuration of the containers, since they derive it from another container. That information should be exposed somehow.
|
||||
|
||||
### External IP assignment
|
||||
|
||||
We want to be able to assign IP addresses externally from Docker ([Docker issue #6743](https://github.com/dotcloud/docker/issues/6743)) so that we don't need to statically allocate fixed-size IP ranges to each node, so that IP addresses can be made stable across network container restarts ([Docker issue #2801](https://github.com/dotcloud/docker/issues/2801)), and to facilitate pod migration. Right now, if the network container dies, all the user containers must be stopped and restarted because the netns of the network container will change on restart, and any subsequent user container restart will join that new netns, thereby not being able to see its peers. Additionally, a change in IP address would encounter DNS caching/TTL problems. External IP assignment would also simplify DNS support (see below).
|
||||
|
||||
### Naming, discovery, and load balancing
|
||||
|
||||
In addition to enabling self-registration with 3rd-party discovery mechanisms, we'd like to set up DDNS automatically ([Issue #146](https://github.com/GoogleCloudPlatform/kubernetes/issues/146)). hostname, $HOSTNAME, etc. should return a name for the pod ([Issue #298](https://github.com/GoogleCloudPlatform/kubernetes/issues/298)), and gethostbyname should be able to resolve names of other pods. Probably we need to set up a DNS resolver to do the latter ([Docker issue #2267](https://github.com/dotcloud/docker/issues/2267)), so that we don't need to keep /etc/hosts files up to date dynamically.
|
||||
|
||||
[Service](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md) endpoints are currently found through environment variables. Both [Docker-links-compatible](https://docs.docker.com/userguide/dockerlinks/) variables and kubernetes-specific variables ({NAME}_SERVICE_HOST and {NAME}_SERVICE_PORT) are supported, and resolve to ports opened by the service proxy. We don't actually use [the Docker ambassador pattern](https://docs.docker.com/articles/ambassador_pattern_linking/) to link containers because we don't require applications to identify all clients at configuration time, yet. While services today are managed by the service proxy, this is an implementation detail that applications should not rely on. Clients should instead use the [service portal IP](https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/services.md) (which the above environment variables will resolve to). However, a flat service namespace doesn't scale and environment variables don't permit dynamic updates, which complicates service deployment by imposing implicit ordering constraints. We intend to register each service portal IP in DNS, and for that to become the preferred resolution protocol.
|
||||
|
||||
We'd also like to accommodate other load-balancing solutions (e.g., HAProxy), non-load-balanced services ([Issue #260](https://github.com/GoogleCloudPlatform/kubernetes/issues/260)), and other types of groups (worker pools, etc.). Providing the ability to Watch a label selector applied to pod addresses would enable efficient monitoring of group membership, which could be directly consumed or synced with a discovery mechanism. Event hooks ([Issue #140](https://github.com/GoogleCloudPlatform/kubernetes/issues/140)) for join/leave events would probably make this even easier.
|
||||
|
||||
### External routability
|
||||
|
||||
We want traffic between containers to use the pod IP addresses across nodes. Say we have Node A with a container IP space of 10.244.1.0/24 and Node B with a container IP space of 10.244.2.0/24. And we have Container A1 at 10.244.1.1 and Container B1 at 10.244.2.1. We want Container A1 to talk to Container B1 directly with no NAT. B1 should see the "source" in the IP packets of 10.244.1.1 -- not the "primary" host IP for Node A. That means that we want to turn off NAT for traffic between containers (and also between VMs and containers).
|
||||
|
||||
We'd also like to make pods directly routable from the external internet. However, we can't yet support the extra container IPs that we've provisioned talking to the internet directly. So, we don't map external IPs to the container IPs. Instead, we solve that problem by having traffic that isn't to the internal network (! 10.0.0.0/8) get NATed through the primary host IP address so that it can get 1:1 NATed by the GCE networking when talking to the internet. Similarly, incoming traffic from the internet has to get NATed/proxied through the host IP.
|
||||
|
||||
So we end up with 3 cases:
|
||||
|
||||
1. Container -> Container or Container <-> VM. These should use 10. addresses directly and there should be no NAT.
|
||||
|
||||
2. Container -> Internet. These have to get mapped to the primary host IP so that GCE knows how to egress that traffic. There is actually 2 layers of NAT here: Container IP -> Internal Host IP -> External Host IP. The first level happens in the guest with IP tables and the second happens as part of GCE networking. The first one (Container IP -> internal host IP) does dynamic port allocation while the second maps ports 1:1.
|
||||
|
||||
3. Internet -> Container. This also has to go through the primary host IP and also has 2 levels of NAT, ideally. However, the path currently is a proxy with (External Host IP -> Internal Host IP -> Docker) -> (Docker -> Container IP). Once [issue #15](https://github.com/GoogleCloudPlatform/kubernetes/issues/15) is closed, it should be External Host IP -> Internal Host IP -> Container IP. But to get that second arrow we have to set up the port forwarding iptables rules per mapped port.
|
||||
|
||||
Another approach could be to create a new host interface alias for each pod, if we had a way to route an external IP to it. This would eliminate the scheduling constraints resulting from using the host's IP address.
|
||||
|
||||
### IPv6
|
||||
|
||||
IPv6 would be a nice option, also, but we can't depend on it yet. Docker support is in progress: [Docker issue #2974](https://github.com/dotcloud/docker/issues/2974), [Docker issue #6923](https://github.com/dotcloud/docker/issues/6923), [Docker issue #6975](https://github.com/dotcloud/docker/issues/6975). Additionally, direct ipv6 assignment to instances doesn't appear to be supported by major cloud providers (e.g., AWS EC2, GCE) yet. We'd happily take pull requests from people running Kubernetes on bare metal, though. :-)
|
|
@ -0,0 +1,26 @@
|
|||
# Security in Kubernetes
|
||||
|
||||
General design principles and guidelines related to security of containers, APIs, and infrastructure in Kubernetes.
|
||||
|
||||
|
||||
## Objectives
|
||||
|
||||
1. Ensure a clear isolation between a container and the underlying host it runs on
|
||||
2. Limit the ability of the container to negatively impact the infrastructure or other containers
|
||||
3. [Principle of Least Privilege](http://en.wikipedia.org/wiki/Principle_of_least_privilege) - ensure components are only authorized to perform the actions they need, and limit the scope of a compromise by limiting the capabilities of individual components
|
||||
4. Reduce the number of systems that have to be hardened and secured by defining clear boundaries between components
|
||||
|
||||
|
||||
## Design Points
|
||||
|
||||
### Isolate the data store from the minions and supporting infrastructure
|
||||
|
||||
Access to the central data store (etcd) in Kubernetes allows an attacker to run arbitrary containers on hosts, to gain access to any protected information stored in either volumes or in pods (such as access tokens or shared secrets provided as environment variables), to intercept and redirect traffic from running services by inserting middlemen, or to simply delete the entire history of the cluster.
|
||||
|
||||
As a general principle, access to the central data store should be restricted to the components that need full control over the system and which can apply appropriate authorization and authentication of change requests. In the future, etcd may offer granular access control, but that granularity will require an administrator to understand the schema of the data to properly apply security. An administrator must be able to properly secure Kubernetes at a policy level, rather than at an implementation level, and schema changes over time should not risk unintended security leaks.
|
||||
|
||||
Both the Kubelet and Kube Proxy need information related to their specific roles - for the Kubelet, the set of pods it should be running, and for the Proxy, the set of services and endpoints to load balance. The Kubelet also needs to provide information about running pods and historical termination data. The access pattern for both Kubelet and Proxy to load their configuration is an efficient "wait for changes" request over HTTP. It should be possible to limit the Kubelet and Proxy to only access the information they need to perform their roles and no more.
|
||||
|
||||
The controller manager for Replication Controllers and other future controllers act on behalf of a user via delegation to perform automated maintenance on Kubernetes resources. Their ability to access or modify resource state should be strictly limited to their intended duties and they should be prevented from accessing information not pertinent to their role. For example, a replication controller needs only to create a copy of a known pod configuration, to determine the running state of an existing pod, or to delete an existing pod that it created - it does not need to know the contents or current state of a pod, nor have access to any data in the pod's attached volumes.
|
||||
|
||||
The Kubernetes pod scheduler is responsible for reading data from the pod to fit it onto a minion in the cluster. At a minimum, it needs access to view the ID of a pod (to craft the binding), its current state, any resource information necessary to identify placement, and other data relevant to concerns like anti-affinity, zone or region preference, or custom logic. It does not need the ability to modify pods or see other resources, only to create bindings. It should not need the ability to delete bindings unless the scheduler takes control of relocating components on failed hosts (which could be implemented by a separate component that can delete bindings but not create them). The scheduler may need read access to user or project-container information to determine preferential location (underspecified at this time).
|