
---
assignees:
- bprashanth
- enisoc
- erictune
- foxish
- janetkuo
- kow3ns
- smarterclayton
---

{% capture overview %} StatefulSets are a beta feature in 1.5. This feature replaces the PetSets feature from 1.4. Users of PetSets are referred to the 1.5 Upgrade Guide for further information on how to upgrade existing PetSets to StatefulSets.

A StatefulSet is a controller that provides a unique identity to its Pods, along with guarantees about the ordering of deployment and scaling. {% endcapture %}

{% capture body %}

## When to Use a StatefulSet

StatefulSets are valuable for applications that require one or more of the following:

* Stable, unique network identifiers.
* Stable, persistent storage.
* Ordered, graceful deployment and scaling.
* Ordered, graceful deletion and termination.

In the above, "stable" means persistent across Pod (re)scheduling. If an application doesn't require any of these guarantees, and if it is feasible to do so, it should be deployed as a set of stateless replicas, which are generally easier to manage.

## Limitations

* StatefulSet is a beta resource, not available in any Kubernetes release prior to 1.5.
* As with all alpha/beta resources, it can be disabled through the `--runtime-config` option passed to the apiserver.
* The storage for a given Pod must either be provisioned by a Persistent Volume Provisioner based on the requested storage class, or pre-provisioned by an admin.
* Deleting and/or scaling a StatefulSet down will not delete the volumes associated with the StatefulSet. This is done to ensure data safety; your data is more valuable than an automatic purge of all related StatefulSet resources.
* StatefulSets currently require a Headless Service to be responsible for the network identity of the Pods. The user is responsible for creating this Service.
* Updating an existing StatefulSet is currently a manual process, meaning you either need to deploy a new StatefulSet with the new image version, or orphan Pods one by one, update their image, and join them back to the cluster.

## Components

The example below demonstrates the components of a StatefulSet.

* A Headless Service, named `nginx`, is used to control the network domain.
* The StatefulSet, named `web`, has a Spec that indicates that 3 replicas of the nginx container will be launched in unique Pods.
* The `volumeClaimTemplates` will provide stable storage using Persistent Volumes provisioned by a Persistent Volume Provisioner.
```yaml
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: gcr.io/google_containers/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
```
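
For illustration, assuming the manifest above is saved locally as `web.yaml` (a hypothetical filename), the Service and StatefulSet can be created with `kubectl`, and the ordered creation of the Pods can be observed with a watch:

```shell
# Create the Headless Service and the StatefulSet.
kubectl create -f web.yaml

# Watch the Pods come up; they are created one at a time, in ordinal order.
kubectl get pods -l app=nginx --watch
```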

## Pod Identity

StatefulSet Pods have a unique identity that is composed of an ordinal, a stable network identity, and stable storage. The identity sticks to the Pod, regardless of which node it's (re)scheduled on.

### Ordinal Index

For a StatefulSet with N replicas, each Pod in the StatefulSet will be assigned an integer ordinal, in the range [0,N), that is unique over the Set.

### Stable Network ID

The hostname of a Pod in a StatefulSet is derived from the name of the StatefulSet and the ordinal of the Pod. The pattern for the constructed hostname is `$(statefulset name)-$(ordinal)`. The example above will create three Pods named `web-0`, `web-1`, `web-2`.

A StatefulSet can use a Headless Service to control the domain of its Pods. The domain managed by this Service takes the form `$(service name).$(namespace).svc.cluster.local`, where "cluster.local" is the cluster domain. As each Pod is created, it gets a matching DNS subdomain, taking the form `$(podname).$(governing service domain)`, where the governing service is defined by the `serviceName` field on the StatefulSet.

Here are some examples of choices for Cluster Domain, Service name, and StatefulSet name, and how they affect the DNS names of the StatefulSet's Pods.

| Cluster Domain | Service (ns/name) | StatefulSet (ns/name) | StatefulSet Domain | Pod DNS | Pod Hostname |
|----------------|-------------------|-----------------------|--------------------|---------|--------------|
| cluster.local | default/nginx | default/web | nginx.default.svc.cluster.local | web-{0..N-1}.nginx.default.svc.cluster.local | web-{0..N-1} |
| cluster.local | foo/nginx | foo/web | nginx.foo.svc.cluster.local | web-{0..N-1}.nginx.foo.svc.cluster.local | web-{0..N-1} |
| kube.local | foo/nginx | foo/web | nginx.foo.svc.kube.local | web-{0..N-1}.nginx.foo.svc.kube.local | web-{0..N-1} |

Note that Cluster Domain will be set to `cluster.local` unless otherwise configured.
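
As a quick way to see these DNS names in action (a sketch, not part of the example above; the `dns-test` Pod name is arbitrary), you can run a temporary Pod with a shell and query the governing Service from inside the cluster:

```shell
# Start a throwaway busybox Pod with an interactive shell; --rm deletes it on exit.
kubectl run -i --tty dns-test --image=busybox --restart=Never --rm -- /bin/sh

# From inside that shell, resolve a StatefulSet Pod by its stable DNS name.
nslookup web-0.nginx.default.svc.cluster.local
```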

### Stable Storage

Persistent Volumes, one for each Volume Claim Template, are created based on the `volumeClaimTemplates` field of the StatefulSet. In the example above, each Pod will receive a single Persistent Volume with 1 GiB of provisioned storage (no storage class is specified, so the requested storage class is the cluster default). When a Pod is (re)scheduled onto a node, its `volumeMounts` mount the Persistent Volumes associated with its Persistent Volume Claims. Note that the Persistent Volumes associated with the Pods' Persistent Volume Claims are not deleted when the Pods or the StatefulSet are deleted. This must be done manually.
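
Concretely, one Persistent Volume Claim is created per Pod, named `$(volume claim template name)-$(pod name)`, so the example above yields the claims `www-web-0`, `www-web-1`, and `www-web-2`. As a sketch, they can be listed with:

```shell
# List the claims created from the volumeClaimTemplates, one per Pod:
# www-web-0, www-web-1, www-web-2.
kubectl get pvc
```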

## Deployment and Scaling Guarantees

* For a StatefulSet with N replicas, when Pods are being deployed, they are created sequentially, in order from {0..N-1}.
* When Pods are being deleted, they are terminated in reverse order, from {N-1..0}.
* Before a scaling operation is applied to a Pod, all of its predecessors must be Running and Ready.
* Before a Pod is terminated, all of its successors must be completely shut down.

When the `web` example above is created, three Pods will be deployed in the order `web-0`, `web-1`, `web-2`. `web-1` will not be deployed before `web-0` is Running and Ready, and `web-2` will not be deployed until `web-1` is Running and Ready. If `web-0` fails after `web-1` is Running and Ready, but before `web-2` is launched, `web-2` will not be launched until `web-0` is successfully relaunched and becomes Running and Ready.

If a user were to scale the deployed example by patching the StatefulSet such that `replicas=1`, `web-2` would be terminated first. `web-1` would not be terminated until `web-2` is fully shut down and deleted. If `web-0` were to fail after `web-2` has been terminated and is completely shut down, but prior to `web-1`'s termination, `web-1` would not be terminated until `web-0` is Running and Ready.
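
As a sketch of such a scale-down (one of several ways to set `replicas`; the merge patch below assumes `kubectl` access to the cluster):

```shell
# Scale the StatefulSet down to one replica by patching spec.replicas.
# Pods are removed in reverse ordinal order: web-2 first, then web-1.
kubectl patch statefulset web -p '{"spec":{"replicas":1}}'
```

{% endcapture %} {% include templates/concept.md %}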