# Kubernetes Design and Architecture

A much more detailed and updated [Architectural
Roadmap](../../devel/architectural-roadmap.md) is also available.

## Overview

Kubernetes is a production-grade, open-source infrastructure for the deployment, scaling,
management, and composition of application containers across clusters of hosts, inspired
by [previous work at Google](https://research.google.com/pubs/pub44843.html). Kubernetes
is more than just a “container orchestrator”. It aims to eliminate the burden of orchestrating
physical/virtual compute, network, and storage infrastructure, and enable application operators
and developers to focus entirely on container-centric primitives for self-service operation.
Kubernetes also provides a stable, portable foundation (a platform) for building customized
workflows and higher-level automation.

Kubernetes is primarily targeted at applications composed of multiple containers. It therefore
groups containers using *pods* and *labels* into tightly coupled and loosely coupled formations
for easy management and discovery.
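
To make the pod-and-label grouping concrete, here is a minimal Go sketch using client-go (assuming a recent release; the `app=guestbook,tier=frontend` labels, image name, and `default` namespace are placeholders). It creates a labeled pod and then discovers the loosely coupled group it belongs to with a label selector.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (path is an assumption).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// A pod whose labels place it in the (hypothetical) "guestbook" app, "frontend" tier.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "frontend-0",
			Labels: map[string]string{"app": "guestbook", "tier": "frontend"},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{Name: "web", Image: "example.com/guestbook-frontend:1.0"},
			},
		},
	}
	if _, err := client.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Discover the loosely coupled group: every pod carrying these labels.
	pods, err := client.CoreV1().Pods("default").List(context.TODO(),
		metav1.ListOptions{LabelSelector: "app=guestbook,tier=frontend"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Println(p.Name)
	}
}
```

Controllers, services, and ad hoc tooling all use the same label-selector mechanism, which is what keeps the grouping loose: membership is a query, not a fixed list.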
## Scope

Kubernetes is a [platform for deploying and managing containers](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/).
Kubernetes provides a container runtime, container
orchestration, container-centric infrastructure orchestration, self-healing mechanisms such as health checking and re-scheduling, and service discovery and load balancing.

Kubernetes aspires to be an extensible, pluggable, building-block OSS
platform and toolkit. Therefore, architecturally, we want Kubernetes to be built
as a collection of pluggable components and layers, with the ability to use
alternative schedulers, controllers, storage systems, and distribution
mechanisms, and we're evolving its current code in that direction. Furthermore,
we want others to be able to extend Kubernetes functionality, such as with
higher-level PaaS functionality or multi-cluster layers, without modification of
core Kubernetes source. Therefore, its API isn't just (or even necessarily
mainly) targeted at end users, but at tool and extension developers. Its APIs
are intended to serve as the foundation for an open ecosystem of tools,
automation systems, and higher-level API layers. Consequently, there are no
"internal" inter-component APIs. All APIs are visible and available, including
the APIs used by the scheduler, the node controller, the replication-controller
manager, Kubelet's API, etc. There's no glass to break -- in order to handle
more complex use cases, one can just access the lower-level APIs in a fully
transparent, composable manner.
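
Because there are no internal inter-component APIs, the same discovery endpoints the built-in components rely on are open to any client. A minimal client-go sketch (the package and function names are ours, not part of any Kubernetes library):

```go
package apisurface

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
)

// PrintAPISurface lists every API group/version and the resources it serves --
// the same surface the scheduler, controllers, and kubelet program against.
func PrintAPISurface(client kubernetes.Interface) error {
	groups, err := client.Discovery().ServerGroups()
	if err != nil {
		return err
	}
	for _, g := range groups.Groups {
		gv := g.PreferredVersion.GroupVersion
		resources, err := client.Discovery().ServerResourcesForGroupVersion(gv)
		if err != nil {
			return err
		}
		for _, r := range resources.APIResources {
			fmt.Printf("%s %s\n", gv, r.Name)
		}
	}
	return nil
}
```

Higher-level layers and extensions typically start from exactly this discovery information.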
## Goals

The project is committed to the following (aspirational) [design ideals](principles.md):
* _Portable_. Kubernetes runs everywhere -- public cloud, private cloud, bare metal, laptop --
with consistent behavior so that applications and tools are portable throughout the ecosystem
as well as between development and production environments.
* _General-purpose_. Kubernetes should run all major categories of workloads to enable you to run
all of your workloads on a single infrastructure, stateless and stateful, microservices and
monoliths, services and batch, greenfield and legacy.
* _Meet users partway_. Kubernetes doesn’t just cater to purely greenfield cloud-native
applications, nor does it meet all users where they are. It focuses on deployment and management
of microservices and cloud-native applications, but provides some mechanisms to facilitate
migration of monolithic and legacy applications.
* _Flexible_. Kubernetes functionality can be consumed a la carte and (in most cases) Kubernetes
does not prevent you from using your own solutions in lieu of built-in functionality.
* _Extensible_. Kubernetes enables you to integrate it into your environment and to add the
additional capabilities you need, by exposing the same interfaces used by built-in
functionality.
* _Automatable_. Kubernetes aims to dramatically reduce the burden of manual operations. It
supports both declarative control by specifying users’ desired intent via its API, as well as
imperative control to support higher-level orchestration and automation. The declarative
approach is key to the system’s self-healing and autonomic capabilities (see the sketch after this list).
* _Advance the state of the art_. While Kubernetes intends to support non-cloud-native
applications, it also aspires to advance the cloud-native and DevOps state of the art, such as
in the [participation of applications in their own management](https://kubernetes.io/blog/2016/09/cloud-native-application-interfaces/).
However, in doing
so, we strive not to force applications to lock themselves into Kubernetes APIs, which is, for
example, why we prefer configuration over convention in the [downward API](https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/#the-downward-api).
Additionally, Kubernetes is not bound by
the lowest common denominator of systems upon which it depends, such as container runtimes and
cloud providers. An example where we pushed the envelope of what was achievable was in its
[IP per Pod networking model](https://kubernetes.io/docs/concepts/cluster-administration/networking/#kubernetes-model).
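
As a sketch of the declarative style referenced under _Automatable_ (assuming a recent client-go; the Deployment name, labels, and image are hypothetical): the client records only the desired intent, and controllers continuously converge the actual state toward it.

```go
package main

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	replicas := int32(3) // the declared intent: "keep three copies running"
	labels := map[string]string{"app": "guestbook"}

	deploy := &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{Name: "guestbook"},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{
						{Name: "web", Image: "example.com/guestbook:1.0"},
					},
				},
			},
		},
	}

	// The API call only records desired state; controllers do the rest,
	// recreating pods that die and rolling out image changes over time.
	if _, err := client.AppsV1().Deployments("default").Create(context.TODO(), deploy, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

Imperative operations on the same objects (e.g., scaling) remain available for higher-level orchestration and automation.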
## Architecture

A running Kubernetes cluster contains node agents (kubelet) and a cluster control plane (AKA
*master*), with cluster state backed by a distributed storage system
([etcd](https://github.com/coreos/etcd)).

### Cluster control plane (AKA *master*)

The Kubernetes [control plane](https://en.wikipedia.org/wiki/Control_plane) is split
into a set of components, which can all run on a single *master* node, or can be replicated
in order to support high-availability clusters, or can even be run on Kubernetes itself (AKA
[self-hosted](../cluster-lifecycle/self-hosted-kubernetes.md#what-is-self-hosted)).

Kubernetes provides a REST API supporting primarily CRUD operations on (mostly) persistent resources, which
serve as the hub of its control plane. Kubernetes’s API provides IaaS-like
container-centric primitives such as [Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod/),
[Services](https://kubernetes.io/docs/concepts/services-networking/service/), and
[Ingress](https://kubernetes.io/docs/concepts/services-networking/ingress/), and also lifecycle APIs to support orchestration
(self-healing, scaling, updates, termination) of common types of workloads, such as
[ReplicaSet](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/) (simple fungible/stateless app manager),
[Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) (orchestrates updates of
stateless apps), [Job](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/) (batch),
[CronJob](https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/) (cron),
[DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) (cluster services), and
[StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) (stateful apps).
We deliberately decoupled service naming/discovery and load balancing from application
implementation, since the latter is diverse and open-ended.

Both user clients and components containing asynchronous controllers interact with the same API resources,
which serve as coordination points, common intermediate representation, and shared state. Most resources
contain metadata, including [labels](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) and
[annotations](https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/), fully elaborated desired state (spec),
including default values, and observed state (status).
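
A small sketch of that shape, reading one of the pods from the earlier example (names are hypothetical): metadata, declared spec, and observed status live side by side on the same object.

```go
package inspect

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ShowSharedState prints the three facets every client and controller sees:
// metadata (labels/annotations), desired state (spec), and observed state (status).
func ShowSharedState(client kubernetes.Interface) error {
	pod, err := client.CoreV1().Pods("default").Get(context.TODO(), "frontend-0", metav1.GetOptions{})
	if err != nil {
		return err
	}
	fmt.Println("labels:     ", pod.ObjectMeta.Labels)
	fmt.Println("annotations:", pod.ObjectMeta.Annotations)
	fmt.Println("node (spec):", pod.Spec.NodeName) // filled in by the scheduler via binding
	fmt.Println("phase:      ", pod.Status.Phase)  // reported back by the kubelet
	return nil
}
```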
Controllers work continuously to drive the actual state towards the desired state, while reporting back the currently observed state for users and for other controllers.

While the controllers are level-based (as described [here](http://gengnosis.blogspot.com/2007/01/level-triggered-and-edge-triggered.html) and [here](https://hackernoon.com/level-triggering-and-reconciliation-in-kubernetes-1f17fe30333d))
to maximize fault
tolerance, they typically `watch` for changes to relevant resources in order to minimize reaction
latency and redundant work. This enables decentralized and decoupled
[choreography-like](https://en.wikipedia.org/wiki/Service_choreography) coordination without a
message bus.
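
A minimal sketch of that pattern with client-go (namespace and error handling are simplified): the controller lists to obtain the full current level of the world, then watches from that point to react quickly, and could rebuild its view from a fresh list at any time without missing work.

```go
package reconcile

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Run lists the current state, reconciles everything once, then watches for
// changes so it can react with low latency.
func Run(ctx context.Context, client kubernetes.Interface) error {
	pods, err := client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		reconcile(&pods.Items[i]) // drive actual state toward desired state
	}

	w, err := client.CoreV1().Pods("default").Watch(ctx, metav1.ListOptions{
		ResourceVersion: pods.ResourceVersion, // continue from the listed level
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		if pod, ok := ev.Object.(*corev1.Pod); ok {
			reconcile(pod)
		}
	}
	return nil
}

func reconcile(pod *corev1.Pod) {
	// Placeholder for real logic: compare pod.Spec (desired) with pod.Status
	// (observed) and issue API calls to close the gap.
	fmt.Println("reconciling", pod.Name)
}
```

Real controllers use client-go's shared informer machinery, which layers caching and periodic re-listing on top of exactly this list-plus-watch pattern.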
#### API Server

The [API server](https://kubernetes.io/docs/admin/kube-apiserver/) serves up the
[Kubernetes API](https://kubernetes.io/docs/concepts/overview/kubernetes-api/). It is intended to be a relatively simple
server, with most/all business logic implemented in separate components or in plug-ins. It mainly
processes REST operations, validates them, and updates the corresponding objects in `etcd` (and
perhaps eventually other stores). Note that, for a number of reasons, Kubernetes deliberately does
not support atomic transactions across multiple resources.

Kubernetes cannot function without this basic API machinery, which includes:
* REST semantics, watch, durability and consistency guarantees, API versioning, defaulting, and
validation
* Built-in admission-control semantics, synchronous admission-control hooks, and asynchronous
resource initialization
* API registration and discovery
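
In place of cross-resource transactions, writers rely on per-object optimistic concurrency: every object carries a resourceVersion, and a stale update is rejected with a conflict that the client simply retries. A small client-go sketch (the annotation key and namespace are hypothetical):

```go
package update

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// Annotate updates a single object safely: each attempt re-reads the latest
// resourceVersion, and the API server rejects the write if someone else
// modified the object in between, triggering a retry.
func Annotate(ctx context.Context, client kubernetes.Interface, name, key, value string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := client.CoreV1().Pods("default").Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations[key] = value
		_, err = client.CoreV1().Pods("default").Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```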
Additionally, the API server acts as the gateway to the cluster. By definition, the API server
must be accessible by clients from outside the cluster, whereas the nodes, and certainly
containers, may not be. Clients authenticate the API server and also use it as a bastion and
proxy/tunnel to nodes and pods (and services).

#### Cluster state store

All persistent cluster state is stored in an instance of `etcd`. This provides a way to store
configuration data reliably. With `watch` support, coordinating components can be notified very
quickly of changes.
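
A sketch of that notification path using the etcd v3 client (the endpoint and key prefix are assumptions, and in a real cluster only the API server talks to etcd directly, so this is illustrative rather than a recommended access pattern):

```go
package statestore

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// WatchPods subscribes to changes under a key prefix -- the primitive the API
// server builds its own watch semantics on.
func WatchPods(ctx context.Context) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"}, // assumed local etcd member
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	// "/registry/pods/" is the default prefix used by kube-apiserver (assumption).
	for resp := range cli.Watch(ctx, "/registry/pods/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			fmt.Printf("%s %s\n", ev.Type, ev.Kv.Key)
		}
	}
	return nil
}
```

Every other component consumes the equivalent `watch` through the Kubernetes API rather than reading etcd itself.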
#### Controller-Manager Server

Most other cluster-level functions are currently performed by a separate process, called the
[Controller Manager](https://kubernetes.io/docs/admin/kube-controller-manager/). It performs
both lifecycle functions (e.g., namespace creation and lifecycle, event garbage collection,
terminated-pod garbage collection, cascading-deletion garbage collection, node garbage collection)
and API business logic (e.g., scaling of pods controlled by a
[ReplicaSet](https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/)).

These controllers collectively form the application management and composition layer, providing self-healing, scaling, application lifecycle management, service discovery, routing, and service binding and provisioning.

These functions may eventually be split into separate components to make them more easily
extended or replaced.
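
To make the "API business logic" concrete, here is a deliberately simplified sketch of ReplicaSet-style scaling (not the actual controller code): compare the declared replica count with the pods that currently match the selector, then create or delete the difference.

```go
package scaling

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// SyncReplicas is a toy version of the ReplicaSet controller's core loop:
// observe how many pods match the selector, then create or delete pods until
// the observed count equals the desired count.
func SyncReplicas(ctx context.Context, client kubernetes.Interface,
	namespace, selector string, desired int, template *corev1.Pod) error {

	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return err
	}

	switch observed := len(pods.Items); {
	case observed < desired:
		for i := observed; i < desired; i++ {
			p := template.DeepCopy()
			p.Name = ""
			p.GenerateName = "replica-" // let the API server pick a unique suffix
			if _, err := client.CoreV1().Pods(namespace).Create(ctx, p, metav1.CreateOptions{}); err != nil {
				return err
			}
		}
	case observed > desired:
		for _, p := range pods.Items[desired:] {
			if err := client.CoreV1().Pods(namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}
```

The real controller is far more careful (victim selection, expectations, owner references), but the shape -- observe, diff, act -- is the same, and it is what makes the system self-healing.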
#### Scheduler

Kubernetes enables users to ask a cluster to run a set of containers. The scheduler
component automatically chooses hosts to run those containers on.

The scheduler watches for unscheduled pods and binds them to nodes via the `/binding` pod
subresource API, according to the availability of the requested resources, quality of service
requirements, affinity and anti-affinity specifications, and other constraints.
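
A sketch of that flow with client-go (the trivial "pick the first node" policy stands in for real filtering and scoring):

```go
package minisched

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ScheduleOne finds one unscheduled pod and binds it to a node through the
// same /binding subresource the default scheduler uses.
func ScheduleOne(ctx context.Context, client kubernetes.Interface) error {
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=", // pods not yet assigned to a node
	})
	if err != nil || len(pods.Items) == 0 {
		return err
	}
	pod := pods.Items[0]

	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil || len(nodes.Items) == 0 {
		return fmt.Errorf("no nodes available: %v", err)
	}

	binding := &corev1.Binding{
		ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		Target:     corev1.ObjectReference{Kind: "Node", Name: nodes.Items[0].Name},
	}
	return client.CoreV1().Pods(pod.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
}
```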
Kubernetes supports user-provided schedulers and multiple concurrent cluster schedulers,
using the shared-state approach pioneered by
[Omega](https://research.google.com/pubs/pub41684.html). In addition to the disadvantages of
pessimistic concurrency described by the Omega paper,
[two-level scheduling models](https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/Mesos-A-Platform-for-Fine-Grained-Resource-Sharing-in-the-Data-Center.pdf) that hide information from the upper-level
schedulers need to implement, in the lower-level scheduler, all of the features required by
every upper-level scheduler in order to ensure that their scheduling requests can be satisfied by
the available resources.

### The Kubernetes Node

The Kubernetes node has the services necessary to run application containers and
be managed from the master systems.
#### Kubelet

The most important and most prominent controller in Kubernetes is the Kubelet, which is the
primary implementer of the Pod and Node APIs that drive the container execution layer. Without
these APIs, Kubernetes would just be a CRUD-oriented REST application framework backed by a
key-value store (and perhaps the API machinery will eventually be spun out as an independent
project).

Kubernetes executes isolated application containers as its default, native mode of execution, as
opposed to processes and traditional operating-system packages. Not only are application
containers isolated from each other, but they are also isolated from the hosts on which they
execute, which is critical to decoupling management of individual applications from each other and
from management of the underlying cluster physical/virtual infrastructure.

Kubernetes provides [Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod/) that can host multiple
containers and storage volumes as its fundamental execution primitive in order to facilitate
packaging a single application per container, decoupling deployment-time concerns from build-time
concerns, and migration from physical/virtual machines. The Pod primitive is key to gleaning the
[primary benefits](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/#why-containers) of deployment on modern
cloud platforms, such as Kubernetes.
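
A sketch of that primitive (image names and the shared path are hypothetical): two containers packaged and built separately, but deployed and scheduled together and sharing an ephemeral volume.

```go
package podspec

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// SidecarPod builds a pod in which an application container and a log-shipping
// sidecar are co-scheduled on one node and share an emptyDir volume.
func SidecarPod() *corev1.Pod {
	shared := corev1.VolumeMount{Name: "logs", MountPath: "/var/log/app"}
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "web-with-logger",
			Labels: map[string]string{"app": "web"},
		},
		Spec: corev1.PodSpec{
			Volumes: []corev1.Volume{
				{Name: "logs", VolumeSource: corev1.VolumeSource{
					EmptyDir: &corev1.EmptyDirVolumeSource{},
				}},
			},
			Containers: []corev1.Container{
				{Name: "web", Image: "example.com/web:1.0", VolumeMounts: []corev1.VolumeMount{shared}},
				{Name: "log-shipper", Image: "example.com/shipper:1.0", VolumeMounts: []corev1.VolumeMount{shared}},
			},
		},
	}
}
```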
API admission control may reject pods or add additional scheduling constraints to them, but
Kubelet is the final arbiter of what pods can and cannot run on a given node, not the schedulers
or DaemonSets.

Kubelet also currently links in the [cAdvisor](https://github.com/google/cadvisor) resource monitoring
agent.

#### Container runtime

Each node runs a container runtime, which is responsible for downloading images and running containers.

Kubelet does not link in the base container runtime. Instead, we're defining a
[Container Runtime Interface](/contributors/devel/container-runtime-interface.md) to control the
underlying runtime and facilitate pluggability of that layer.
This decoupling is needed in order to maintain clear component boundaries, facilitate testing, and facilitate pluggability.
Runtimes supported today, either upstream or by forks, include at least docker (for Linux and Windows),
[rkt](https://github.com/rkt/rkt),
[cri-o](https://github.com/kubernetes-incubator/cri-o), and [frakti](https://github.com/kubernetes/frakti).
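
To give a feel for that boundary, here is a hand-trimmed Go sketch of a small subset of the operations the CRI covers; the real interface is a gRPC service generated from the CRI protobuf definitions, with much richer request and response types, not this hand-written one.

```go
package cri

import "context"

// RuntimeOperations is an illustrative subset of what the Container Runtime
// Interface asks of a runtime: manage pod sandboxes, manage containers inside
// them, and pull the images they need.
type RuntimeOperations interface {
	// Pod-level sandbox lifecycle (network namespace, cgroup parent, ...).
	RunPodSandbox(ctx context.Context, podName, namespace string) (sandboxID string, err error)
	StopPodSandbox(ctx context.Context, sandboxID string) error

	// Container lifecycle within a sandbox.
	CreateContainer(ctx context.Context, sandboxID, name, image string) (containerID string, err error)
	StartContainer(ctx context.Context, containerID string) error
	StopContainer(ctx context.Context, containerID string) error

	// Image management.
	PullImage(ctx context.Context, image string) error
}
```

Any runtime that implements the real CRI service can slot in underneath the kubelet without changes to the rest of the system.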
#### Kube Proxy

The [service](https://kubernetes.io/docs/concepts/services-networking/service/) abstraction provides a way to
group pods under a common access policy (e.g., load-balanced). The implementation of this creates
a virtual IP which clients can access and which is transparently proxied to the pods in a Service.
Each node runs a [kube-proxy](https://kubernetes.io/docs/admin/kube-proxy/) process which programs
`iptables` rules to trap access to service IPs and redirect them to the correct backends. This provides a highly-available load-balancing solution with low performance overhead by balancing
client traffic from a node on that same node.

Service endpoints are found primarily via [DNS](https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/).
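
A sketch tying these pieces together (the service name, ports, and labels are hypothetical): the Service selects the frontend pods by label, the control plane assigns it a virtual IP, and kube-proxy programs each node so clients can reach it.

```go
package svc

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// FrontendService groups the pods labeled app=guestbook,tier=frontend behind
// one stable virtual IP and port.
func FrontendService() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "frontend"},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"app": "guestbook", "tier": "frontend"},
			Ports: []corev1.ServicePort{
				{Port: 80, TargetPort: intstr.FromInt(8080)}, // VIP port -> container port
			},
		},
	}
}
```

Once created in the `default` namespace, in-cluster clients would typically resolve it as `frontend.default.svc.cluster.local` (assuming the standard cluster DNS suffix), while kube-proxy on each node forwards the virtual IP to healthy backends.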
### Add-ons and other dependencies

A number of components, called [*add-ons*](https://git.k8s.io/kubernetes/cluster/addons), typically run on Kubernetes
itself:
* [DNS](https://git.k8s.io/kubernetes/cluster/addons/dns)
* [Ingress controller](https://github.com/kubernetes/ingress-gce)
* [Heapster](https://github.com/kubernetes/heapster/) (resource monitoring)
* [Dashboard](https://github.com/kubernetes/dashboard/) (GUI)

### Federation

A single Kubernetes cluster may span multiple availability zones.

However, for the highest availability, we recommend using [cluster federation](../multicluster/federation.md).