# DaemonSet Updates

**Authors**: @madhusudancs, @lukaszo, @janetkuo

**Status**: Proposal

## Abstract

A proposal for adding the update feature to `DaemonSet`. This feature will be
implemented on the server side (in the `DaemonSet` API).

Users can already update a `DaemonSet` today (Kubernetes release 1.5), but the
update does not affect its pods until those pods are killed. In this proposal,
we plan to add a "RollingUpdate" strategy which allows a DaemonSet to roll out
its changes to its pods.

## Requirements

In this proposal, we design DaemonSet updates based on the following requirements:

- Users can trigger a rolling update of a DaemonSet at a controlled speed, which
  is achieved by:
  - Only a certain number of DaemonSet pods can be down at the same time during
    an update
  - A DaemonSet pod needs to be ready for a specific amount of time before it's
    considered up
- Users can monitor the status of a DaemonSet update (e.g. the number of pods
  that are updated and healthy)
- A broken DaemonSet update should not continue, but one can still update the
  DaemonSet again to fix it
- Users should be able to update a DaemonSet even during an ongoing DaemonSet
  upgrade -- in other words, rollover (e.g. update the DaemonSet to fix a broken
  DaemonSet update)

Here are some potential requirements that haven't been covered by this proposal:

- Users should be able to view the history of previous DaemonSet updates
- Users can figure out the revision of a DaemonSet's pod (e.g. which version is
  this DaemonSet pod?)
- DaemonSet should provide an at-most-one guarantee per node (i.e. at most one
  pod from a DaemonSet can exist on a node at any time)
- Uptime is critical for each pod of a DaemonSet during an upgrade (e.g. the time
  from a DaemonSet pod being killed to its being recreated and healthy should be
  < 5s)
- Each DaemonSet pod can still fit on the node after being updated
- Some DaemonSets require the node to be drained before the DaemonSet's pod on it
  is updated (e.g. logging daemons)
- DaemonSet pods are implicitly given higher priority than non-daemon pods
- DaemonSets can only be operated by admins (i.e. people who manage nodes)
  - This is required if we allow DaemonSet controllers to drain, cordon, and
    uncordon nodes, evict pods, or allow DaemonSet pods to have higher priority

## Implementation

### API Object

To enable DaemonSet upgrades, the `DaemonSet`-related API objects will have the
following changes:

```go
type DaemonSetUpdateStrategy struct {
	// Type of daemon set update. Can be "RollingUpdate" or "OnDelete".
	// Default is OnDelete.
	// +optional
	Type DaemonSetUpdateStrategyType

	// Rolling update config params. Present only if DaemonSetUpdateStrategy =
	// RollingUpdate.
	//---
	// TODO: Update this to follow our convention for oneOf, whatever we decide it
	// to be. Same as Deployment `strategy.rollingUpdate`.
	// See https://github.com/kubernetes/kubernetes/issues/35345
	// +optional
	RollingUpdate *RollingUpdateDaemonSet
}

type DaemonSetUpdateStrategyType string

const (
	// Replace the old daemons with new ones using a rolling update, i.e.
	// replace them on each node one after the other.
	RollingUpdateDaemonSetStrategyType DaemonSetUpdateStrategyType = "RollingUpdate"

	// Replace the old daemons only when they are killed.
	OnDeleteDaemonSetStrategyType DaemonSetUpdateStrategyType = "OnDelete"
)

// Spec to control the desired behavior of daemon set rolling update.
type RollingUpdateDaemonSet struct {
	// The maximum number of DaemonSet pods that can be unavailable during
	// the update. Value can be an absolute number (ex: 5) or a percentage of
	// the total number of DaemonSet pods at the start of the update (ex: 10%).
	// The absolute number is calculated from the percentage by rounding up.
	// This must be greater than 0.
	// Default value is 1.
	// Example: when this is set to 30%, 30% of the currently running DaemonSet
	// pods can be stopped for an update at any given time. The update starts
	// by stopping at most 30% of the currently running DaemonSet pods and then
	// brings up new DaemonSet pods in their place. Once the new pods are ready,
	// it then proceeds onto other DaemonSet pods, thus ensuring that at least
	// 70% of the original number of DaemonSet pods are available at all times
	// during the update.
	// +optional
	MaxUnavailable intstr.IntOrString
}

// DaemonSetSpec is the specification of a daemon set.
type DaemonSetSpec struct {
	// Note: Existing fields, including Selector and Template, are omitted in
	// this proposal.

	// Update strategy to replace existing DaemonSet pods with new pods.
	// +optional
	UpdateStrategy DaemonSetUpdateStrategy `json:"updateStrategy,omitempty"`

	// Minimum number of seconds for which a newly created DaemonSet pod should
	// be ready without any of its containers crashing, for it to be considered
	// available. Defaults to 0 (pod will be considered available as soon as it
	// is ready).
	// +optional
	MinReadySeconds int32 `json:"minReadySeconds,omitempty"`

	// DEPRECATED.
	// A sequence number representing a specific generation of the template.
	// Populated by the system. Can be set at creation time. Read-only otherwise.
	// +optional
	TemplateGeneration int64 `json:"templateGeneration,omitempty"`

	// The number of old histories to retain to allow rollback.
	// This is a pointer to distinguish between explicit zero and not specified.
	// Defaults to 10.
	RevisionHistoryLimit *int32 `json:"revisionHistoryLimit,omitempty"`
}

// DaemonSetStatus represents the current status of a daemon set.
type DaemonSetStatus struct {
	// Note: Existing fields, including CurrentNumberScheduled, NumberMisscheduled,
	// DesiredNumberScheduled, NumberReady, and ObservedGeneration are omitted in
	// this proposal.

	// UpdatedNumberScheduled is the total number of nodes that are running the
	// updated daemon pod.
	// +optional
	UpdatedNumberScheduled int32 `json:"updatedNumberScheduled"`

	// NumberAvailable is the number of nodes that should be running the
	// daemon pod and have one or more of the daemon pod running and
	// available (ready for at least minReadySeconds).
	// +optional
	NumberAvailable int32 `json:"numberAvailable"`

	// NumberUnavailable is the number of nodes that should be running the
	// daemon pod and have none of the daemon pod running and available
	// (ready for at least minReadySeconds).
	// +optional
	NumberUnavailable int32 `json:"numberUnavailable"`

	// Count of hash collisions for the DaemonSet. The DaemonSet controller
	// uses this field as a collision avoidance mechanism when it needs to
	// create the name for the newest ControllerRevision.
	// +optional
	CollisionCount *int64 `json:"collisionCount,omitempty"`
}

const (
	// DEPRECATED: DefaultDaemonSetUniqueLabelKey is used instead.
	// DaemonSetTemplateGenerationKey is the key of the label that is added
	// to daemon set pods to distinguish between old and new pods
	// during a DaemonSet template update.
	DaemonSetTemplateGenerationKey string = "pod-template-generation"

	// DefaultDaemonSetUniqueLabelKey is the default label key that is added
	// to existing DaemonSet pods to distinguish between old and new
	// DaemonSet pods during DaemonSet template updates.
	DefaultDaemonSetUniqueLabelKey string = "daemonset-controller-hash"
)
```
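
To make the new fields concrete, here is a minimal sketch of a spec that uses
the rolling update strategy. It assumes the types above compile in the same
package; `intstr.FromString` and `intstr.FromInt` come from
`k8s.io/apimachinery/pkg/util/intstr`, and the function name is made up for
this example.

```go
import "k8s.io/apimachinery/pkg/util/intstr"

// newRollingUpdateSpec builds a DaemonSetSpec (per the types above) that
// keeps at most 10% of daemon pods unavailable at a time and requires a
// pod to stay ready for 30 seconds before it counts as available.
func newRollingUpdateSpec() DaemonSetSpec {
	return DaemonSetSpec{
		UpdateStrategy: DaemonSetUpdateStrategy{
			Type: RollingUpdateDaemonSetStrategyType,
			RollingUpdate: &RollingUpdateDaemonSet{
				// Either an absolute number (intstr.FromInt(1)) or a
				// percentage; percentages are rounded up to an absolute
				// number at update time.
				MaxUnavailable: intstr.FromString("10%"),
			},
		},
		MinReadySeconds: 30,
	}
}
```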

### Controller

#### DaemonSet Controller

The DaemonSet controller will make DaemonSet updates happen. It will watch
DaemonSets on the apiserver.

The DaemonSet controller manages [`ControllerRevisions`](controller_history.md)
for DaemonSet revision introspection and rollback. A `ControllerRevision` is
referred to as a "history" throughout the rest of this proposal.
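
As a rough illustration of what one history record could contain, the sketch
below builds a `ControllerRevision` for a DaemonSet target state. The helper
name `newHistory` and its parameters are made up for this example; the naming
and labeling rules follow the steps listed below, and the constant comes from
the API section above.

```go
import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// newHistory builds a history record for one DaemonSet target state. The
// name is derived from the pod template hash (with collision resolution,
// described below), and the label lets the controller map pods back to the
// history that produced them.
func newHistory(dsName, templateHash string, revision int64, templateJSON []byte, ownerRef metav1.OwnerReference) *appsv1.ControllerRevision {
	return &appsv1.ControllerRevision{
		ObjectMeta: metav1.ObjectMeta{
			Name:            dsName + "-" + templateHash,
			Labels:          map[string]string{DefaultDaemonSetUniqueLabelKey: templateHash},
			OwnerReferences: []metav1.OwnerReference{ownerRef},
		},
		// Data holds the serialized .spec.template this revision represents.
		Data:     runtime.RawExtension{Raw: templateJSON},
		Revision: revision,
	}
}
```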

For each pending DaemonSet update, it will:

1. Reconstruct DaemonSet history:
   - List the existing DaemonSet histories controlled by this DaemonSet.
   - Find the history of the DaemonSet's current target state, and create one
     if not found:
     - The `.name` of this history will be unique, generated from the pod
       template hash with hash collision resolution. If history creation fails:
       - If it failed because of a name collision:
         - Compare that history with the DaemonSet's current target state:
           - If they're the same, we've already created the history.
           - Otherwise, bump the DaemonSet's `.status.collisionCount` by 1,
             exit, and retry in the next sync loop.
       - Otherwise, exit and retry in the next sync loop.
     - The history will be labeled with `DefaultDaemonSetUniqueLabelKey`.
     - The DaemonSet controller will add a ControllerRef in the history's
       `.ownerReferences`.
   - The current history should have the largest `.revision` number amongst all
     existing histories. Update `.revision` if it doesn't (e.g. after a rollback).
   - If more than one current history is found, remove the duplicates and
     relabel their pods' `DefaultDaemonSetUniqueLabelKey`.
1. Sync nodes:
   - Find all nodes that should run the pods created by this DaemonSet.
   - Create daemon pods on nodes that should have those pods running but don't
     yet. Conversely, delete running daemon pods that shouldn't be running on
     their nodes.
   - Label new pods with the current `.spec.templateGeneration` and the
     `DefaultDaemonSetUniqueLabelKey` value of the current history when
     creating them.
1. Check `DaemonSetUpdateStrategy`:
   - If `OnDelete`: do nothing.
   - If `RollingUpdate`:
     - For all pods owned by this DaemonSet:
       - If its `pod-template-generation` label value equals the DaemonSet's
         `.spec.templateGeneration`, it's a new pod (don't compare
         `DefaultDaemonSetUniqueLabelKey`, for backward compatibility).
         - Add the `DefaultDaemonSetUniqueLabelKey` label to the new pod based
           on the current history, if the pod doesn't have this label set yet.
       - Otherwise, if the value doesn't match, or the pod doesn't have a
         `pod-template-generation` label, check its
         `DefaultDaemonSetUniqueLabelKey` label:
         - If the value matches any history's `DefaultDaemonSetUniqueLabelKey`
           label, it's a pod generated from that history.
           - If that history matches the current target state of the DaemonSet,
             it's a new pod.
           - Otherwise, it's an old pod.
         - Otherwise, if the pod doesn't have a `DefaultDaemonSetUniqueLabelKey`
           label, or no matching history is found, it's an old pod.
     - If old pods are found, compare `MaxUnavailable` with the DaemonSet's
       `.status.numberUnavailable` to see how many old daemon pods can be
       killed. Then kill those pods, unhealthy ones (failed, pending, not
       ready) first (see the sketch after this list).
1. Clean up old histories based on `.spec.revisionHistoryLimit`:
   - Always keep live histories and the current history.
1. Clean up and update DaemonSet status:
   - `.status.numberAvailable` = the total number of DaemonSet pods that have
     been `Ready` for at least `MinReadySeconds`
   - `.status.numberUnavailable` = `.status.desiredNumberScheduled` -
     `.status.numberAvailable`
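
The following sketch makes the classification and kill-budget rules of step 3
concrete. The helper names (`isNewPod`, `killBudget`) and their parameters are
illustrative, not the controller's actual code; it assumes the constants from
the API section above and the `intstr` helpers from
`k8s.io/apimachinery/pkg/util/intstr`.

```go
import (
	"strconv"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// isNewPod reports whether a daemon pod matches the DaemonSet's current
// target state, following the classification rules in step 3 above.
func isNewPod(pod *v1.Pod, templateGeneration int64, currentHash string, historyHashes map[string]bool) bool {
	// Backward compatibility: a matching pod-template-generation label by
	// itself marks the pod as new; the hash label is not consulted.
	if pod.Labels[DaemonSetTemplateGenerationKey] == strconv.FormatInt(templateGeneration, 10) {
		return true
	}
	// Otherwise fall back to the history hash label: the pod is new only if
	// it was generated from the current history.
	hash, ok := pod.Labels[DefaultDaemonSetUniqueLabelKey]
	if !ok || !historyHashes[hash] {
		return false // no label, or no matching history: an old pod
	}
	return hash == currentHash
}

// killBudget returns how many additional old daemon pods may be deleted
// right now: maxUnavailable (resolved against the desired pod count,
// rounding percentages up) minus the pods that are already unavailable.
func killBudget(maxUnavailable intstr.IntOrString, desired, numberUnavailable int32) (int, error) {
	allowed, err := intstr.GetValueFromIntOrPercent(&maxUnavailable, int(desired), true)
	if err != nil {
		return 0, err
	}
	if budget := allowed - int(numberUnavailable); budget > 0 {
		return budget, nil
	}
	return 0, nil
}
```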

If the DaemonSet controller crashes during an update, it can still recover.

#### API Server

In the DaemonSet strategy (pkg/registry/extensions/daemonset/strategy.go#PrepareForUpdate),
increase the DaemonSet's `.spec.templateGeneration` by 1 if any change is made
to the DaemonSet's `.spec.template`.
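
A minimal sketch of that rule (the function name and parameters are
illustrative, not the registry's actual `PrepareForUpdate` hook signature):

```go
import "reflect"

// bumpTemplateGeneration applies the apiserver rule described above:
// .spec.templateGeneration increases by exactly one whenever the pod
// template changes in an update, and is left untouched otherwise.
// oldTemplate and newTemplate stand in for the DaemonSet's .spec.template
// (a field omitted from the types in this proposal).
func bumpTemplateGeneration(oldTemplate, newTemplate interface{}, oldGeneration int64) int64 {
	if !reflect.DeepEqual(oldTemplate, newTemplate) {
		return oldGeneration + 1
	}
	return oldGeneration
}
```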

This was originally implemented in 1.6, and kept in 1.7 for backward
compatibility.

### kubectl

#### kubectl rollout

Users can use `kubectl rollout` to manage DaemonSet updates:

- `kubectl rollout status daemonset/<DaemonSet-Name>`: see the DaemonSet
  upgrade status
- `kubectl rollout history daemonset/<DaemonSet-Name>`: view the history of
  DaemonSet updates
- `kubectl rollout undo daemonset/<DaemonSet-Name>`: roll back a DaemonSet

## Updating DaemonSets mid-way

Users can update a DaemonSet again before its ongoing rollout completes. In
this case, the existing rollout will stop, and a rollout of the new template
will begin.

## Deleting DaemonSets

Deleting a DaemonSet (with cascading deletion) will delete all of its pods and
histories.

## DaemonSet Strategies

`DaemonSetStrategy` specifies how new daemon pods should replace existing ones.
To begin with, we will support two types:

* On delete: Do nothing until existing daemon pods are killed (for backward
  compatibility).
  - Other alternative names: No-op, External
* Rolling update: Gradually kill existing daemon pods while creating new ones
  in their place.

## Tests

- Updating a RollingUpdate DaemonSet will trigger updates to its daemon pods.
- Updating an OnDelete DaemonSet will not trigger updates until its pods are
  killed.
- Users can use node labels to choose which nodes a DaemonSet should target.
  DaemonSet updates only affect pods on those nodes.
  - For example, some nodes may be running manifest pods, and other nodes will
    be running daemon pods
- DaemonSets can be updated while already being updated (i.e. rollover updates)
- A broken rollout can be rolled back (by applying the old config)
- If a daemon pod can no longer fit on the node after a rolling update, users
  can manually evict or delete other pods on the node to make room for the
  daemon pod, and the DaemonSet rollout will eventually succeed (the DaemonSet
  controller will recreate the failed daemon pod if it can't be scheduled)

## Future Plans

In the future, we may:

- Implement at-most-one and/or at-least-one guarantees for DaemonSets (i.e. at
  most/at least one pod from a DaemonSet can exist on a node at any time)
  - At-most-one would use a deterministic name for the pod (e.g. use the node
    name as the daemon pod name suffix)
- Support use cases where uptime is critical for each pod of a DaemonSet during
  an upgrade
  - One approach is to use dummy pods to pre-pull images to reduce downtime
- Support use cases where each DaemonSet pod can still fit on the node after
  being updated (unless it becomes larger than the node). Some possible
  approaches include:
  - Make DaemonSet pods (daemons) have higher priority than non-daemons, and
    let the kubelet evict pods with lower priority to make room for higher
    priority ones
  - The DaemonSet controller will evict pods when daemons can't fit on the node
  - The DaemonSet controller will cordon the node before upgrading the daemon
    on it, and uncordon the node once it's done
- Support use cases that require the node to be drained before the daemons on
  it can be updated (e.g. logging daemons)
  - The DaemonSet controller will drain the node before upgrading the daemon on
    it, and uncordon the node once it's done
- Make DaemonSets admin-only resources (admin = people who manage nodes). Some
  possible approaches include:
  - Remove the namespace from DaemonSets (DaemonSets become node-level resources)
  - Modify the RBAC bootstrap policy to make DaemonSets admin-only
  - Delegation or impersonation
- Support more DaemonSet update strategies
- Allow a user-defined DaemonSet unique label key
- Support pausing a DaemonSet rolling update
- Support auto-rollback of DaemonSets

### API

Implement a subresource for DaemonSet history (`daemonsets/foo/history`) that
summarizes the information in the histories.

Implement a subresource for DaemonSet rollback (`daemonsets/foo/rollback`) that
triggers a DaemonSet rollback.

### Tests

- DaemonSet should support an at-most-one-daemon-pod-per-node guarantee.
  - Adding or deleting nodes won't break that.
- Users should be able to specify acceptable downtime for their daemon pods,
  and DaemonSet updates should respect that.