Update docs for v1.0 (#16)

Signed-off-by: FillZpp <FillZpp.pub@gmail.com>
This commit is contained in:
Siyu Wang 2021-12-14 11:29:58 +08:00 committed by GitHub
parent b2b58908a8
commit a7a830462b
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
62 changed files with 7989 additions and 280 deletions

View File

@ -0,0 +1,181 @@
---
slug: openkruise-1.0
title: OpenKruise v1.0, Reaching New Peaks of Application Automation
authors: [FillZpp]
tags: [release]
---
OpenKruise, a cloud-native project for application automation and a CNCF sandbox project, has released version 1.0.
[OpenKruise](https://openkruise.io) is an extended component suite for Kubernetes that mainly focuses on application automation, such as deployment, upgrade, ops and availability protection. Most features provided by OpenKruise are built primarily on CRD extensions, and they can work in pure Kubernetes clusters without any other dependencies.
![openkruise-features|center|450x400](/img/blog/2021-12-13-release-1.0/features-en.png)
Overall, OpenKruise currently provides features in these areas:
- **Application workloads**: Enhanced deployment and upgrade strategies for stateless/stateful/daemon applications, such as in-place update and canary/flowing upgrade.
- **Sidecar container management**: Supports defining sidecar containers alone, which means OpenKruise can inject sidecar containers, upgrade them without affecting application containers, and even hot-upgrade them.
- **Enhanced operations**: Such as restarting containers in place, pre-downloading images on specific nodes, keeping the launch priority of containers in a Pod, and distributing one resource to multiple namespaces.
- **Application availability protection**: Protects the availability of applications deployed in Kubernetes.
## What's new?
### 1. In-place update for environments
*Author: [@FillZpp](https://github.com/FillZpp)*
OpenKruise has supported **in-place update** since its early versions, mostly for workloads like CloneSet and Advanced StatefulSet. Compared with recreating Pods during an upgrade, in-place update only modifies the fields of existing Pods.
![inplace-update-comparation|center|450x400](/img/docs/core-concepts/inplace-update-comparation.png)
As the picture above shows, only the `image` field of the Pod is modified during an in-place update. So that:
- The additional cost of *scheduling*, *allocating IP*, and *allocating and mounting volumes* is avoided.
- Image pulling is faster, because most image layers pulled for the old image can be reused and only a few new layers have to be pulled.
- When a container is being updated in place, the other containers in the Pod are not affected and keep running.
However, OpenKruise could previously only in-place update the `image` field of a Pod and had to recreate Pods if other fields needed to change. More and more users have hoped that OpenKruise could in-place update more fields, such as `env` -- which is hard to implement, because it is limited by kube-apiserver.
After continuous effort, OpenKruise finally supports in-place update of environment variables via the Downward API since v1.0. Take the CloneSet YAML below as an example: the user puts the configuration in an annotation and defines an env that references it. After that, they only need to modify the annotation value when changing the configuration, and Kruise will restart all containers in the Pod whose env comes from that annotation, so that the new configuration takes effect.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
...
spec:
replicas: 1
template:
metadata:
annotations:
app-config: "... the real env value ..."
spec:
containers:
- name: app
env:
- name: APP_CONFIG
valueFrom:
fieldRef:
fieldPath: metadata.annotations['app-config']
updateStrategy:
type: InPlaceIfPossible
```
*At the same time, we have removed the `imageID` limit for in-place update, which means you can now in-place update to a new image that has the same imageID as the old one.*
For more details, please read the [documentation](/docs/core-concepts/inplace-update).
### 2. Distribute resources over multiple namespaces
*Author: [@veophi](https://github.com/veophi)*
For scenarios where namespace-scoped resources such as Secret and ConfigMap need to be distributed or synchronized to different namespaces, native Kubernetes currently only supports manual, one-by-one distribution and synchronization by users, which is very inconvenient.
Typical examples:
- When users want to use the imagePullSecrets capability of SidecarSet, they must repeatedly create the corresponding Secrets in the relevant namespaces and ensure the correctness and consistency of these Secret configurations.
- When users want to configure some common environment variables, they probably need to distribute ConfigMaps to multiple namespaces, and subsequent modifications of these ConfigMaps might require synchronization among these namespaces.
Therefore, for these scenarios that require resource distribution and **continuous synchronization across namespaces**, we provide a tool, namely **ResourceDistribution**, to do this automatically.
Currently, ResourceDistribution supports two kinds of resources: **Secret** and **ConfigMap**.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
name: sample
spec:
resource:
apiVersion: v1
kind: ConfigMap
metadata:
name: game-demo
data:
...
targets:
namespaceLabelSelector:
...
# or includedNamespaces, excludedNamespaces
```
As you can see, ResourceDistribution is a **cluster-scoped CRD** that is mainly composed of two fields: **`resource`** and **`targets`**.
- `resource` is a **complete** and **correct** resource structure in YAML style.
- `targets` indicates the target namespaces that the resource should be distributed into.
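As a more complete, purely illustrative sketch, the ResourceDistribution below distributes a ConfigMap into every namespace carrying an assumed `group: game` label and keeps it synchronized there; the ConfigMap payload and the label are made up for demonstration:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
  name: sample-configmap-distribution   # illustrative name
spec:
  resource:
    # a complete, valid ConfigMap definition to be distributed
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: game-demo
    data:
      game.properties: |
        enemy.types=aliens,monsters
        player.maximum-lives=5
  targets:
    # distribute into all namespaces matching this (assumed) label
    namespaceLabelSelector:
      matchLabels:
        group: game
```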
For more details, please read the [documentation](/docs/user-manuals/resourcedistribution).
### 3. Container launch priority
*Author: [@Concurrensee](https://github.com/Concurrensee)*
Containers in the same Pod may depend on each other, which means the application in one container runs depending on another container. For example:
1. Container A has to start first. Container B can start only if A is already running.
2. Container B has to exit first. Container A can stop only if B has already exited.
Currently, the start and stop sequences of containers are controlled by the Kubelet.
Kubernetes used to have a KEP that planned to add a type field to containers to identify the priority of start and stop. However, it was rejected because sig-node thought it would bring too large a change to the existing code.
So OpenKruise provides a feature named **Container Launch Priority**, which helps users control the start sequence of containers in a Pod.
1. Users only have to put the annotation `apps.kruise.io/container-launch-priority: Ordered` on a Pod, and Kruise will ensure all containers in this Pod are started in the order of the `pod.spec.containers` list.
2. If you want to customize the launch sequence, you can add a `KRUISE_CONTAINER_PRIORITY` environment variable to the container, as shown in the example below. The valid range of the value is `[-2147483647, 2147483647]`. A container with a higher priority is guaranteed to start before containers with lower priority.
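For instance, here is a minimal Pod sketch (names and images are illustrative) where the `sidecar` container is guaranteed to start before `main` because it has a higher priority:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: launch-priority-demo   # illustrative name
spec:
  containers:
  - name: sidecar
    image: nginx:alpine
    env:
    # a larger value means an earlier start
    - name: KRUISE_CONTAINER_PRIORITY
      value: "1"
  - name: main
    image: nginx:alpine
    env:
    - name: KRUISE_CONTAINER_PRIORITY
      value: "0"
```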
For more details, please read the [documentation](/docs/user-manuals/containerlaunchpriority).
### 4. `kubectl-kruise` command-line tool
*Author: [@hantmac](https://github.com/hantmac)*
OpenKruise has provided SDKs such as `kruise-api` and `client-java` for several programming languages, which can be imported into users' projects. On the other hand, some users also need to operate the workload resources from the command line in test environments.
However, the `rollout` and `set image` commands of the original `kubectl` only work for built-in workloads, such as Deployment and StatefulSet.
So OpenKruise now provides a command-line tool named `kubectl-kruise`, which is a standard `kubectl` plugin and works with the OpenKruise workload types.
```bash
# rollout undo cloneset
$ kubectl kruise rollout undo cloneset/nginx
# rollout status advanced statefulset
$ kubectl kruise rollout status statefulsets.apps.kruise.io/sts-demo
# set image of a cloneset
$ kubectl kruise set image cloneset/nginx busybox=busybox nginx=nginx:1.9.1
```
For more details, please read the [documentation](/docs/cli-tool/kubectl-plugin).
### 5. Other changes
**CloneSet:**
- Add `maxUnavailable` field in `scaleStrategy` to support rate limiting of scaling up.
- Mark a revision as stable when all Pods have been updated to it, without waiting for all Pods to be ready.
**WorkloadSpread:**
- Manage the Pods that were created before the WorkloadSpread.
- Optimize the update and retry logic for webhook injection.
**Advanced DaemonSet:**
- Support in-place update of Daemon Pods.
- Support a progressive annotation to control whether Pod creation should be limited by partition.
**SidecarSet:**
- Fix SidecarSet so that inactive Pods are filtered out.
- Add `SourceContainerNameFrom` and `EnvNames` fields in `transferenv` to make the container name flexible and the list shorter.
**PodUnavailableBudget:**
- Add a no-pub-protection annotation to skip validation for specific Pods.
- The PodUnavailableBudget controller now watches for changes to workload replicas.
**NodeImage:**
- Add `--nodeimage-creation-delay` flag to delay NodeImage creation until after a Node becomes ready.
**UnitedDeployment:**
- Fix Pods getting a zero-length NodeSelectorTerms when the UnitedDeployment NodeSelectorTerms is nil.
**Other optimization:**
- kruise-daemon lists and watches Pods using protobuf.
- Expose cache resync as a command-line argument and default it to 0 in the chart values.
- Fix HTTP checker reloading after webhook certs are updated.
- Generate CRDs with the original controller-tools and markers.

View File

@ -6,13 +6,13 @@ title: Kubectl Plugin
## Install
1. You can simply download the binary from the [releases](https://github.com/openkruise/kruise-tools/releases) page. Currently `linux` and `darwin`(OS X) with `x86_64` and `arm64` are provided. If you are using some other systems or architectures, you have to download the source code and execute `make build` to build the binary.
1. You can simply download the binary from the [releases](https://github.com/openkruise/kruise-tools/releases) page. Currently `linux`, `darwin`(OS X), `windows` with `x86_64` and `arm64` are provided. If you are using some other systems or architectures, you have to download the source code and execute `make build` to build the binary.
2. Make it executable, rename and move it to system PATH.
2. Extract and move it to system PATH.
```bash
$ chmod +x kubectl-kruise_darwin_amd64
$ mv kubectl-kruise_darwin_amd64 /usr/local/bin/kubectl-kruise
$ tar xvf kubectl-kruise-darwin-amd64.tar.gz
$ mv darwin-amd64/kubectl-kruise /usr/local/bin/
```
3. Then you can use it with `kubectl-kruise` or `kubectl kruise`.

View File

@ -16,7 +16,7 @@ $ helm repo add openkruise https://openkruise.github.io/charts/
$ helm repo update
# Install the latest version.
$ helm install kruise openkruise/kruise --version 1.0.0-beta.0
$ helm install kruise openkruise/kruise --version 1.0.0
```
*If you want to install the stable version, read [doc](/docs/installation).*
@ -31,7 +31,7 @@ $ helm repo add openkruise https://openkruise.github.io/charts/
$ helm repo update
# Upgrade the latest version.
$ helm upgrade kruise openkruise/kruise --version 1.0.0-beta.0
$ helm upgrade kruise openkruise/kruise --version 1.0.0 [--force]
```
Note that:
@ -41,6 +41,7 @@ Note that:
2. If you want to drop the chart parameters you configured for the old release or set some new parameters,
it is recommended to add `--reset-values` flag in `helm upgrade` command.
Otherwise you should use `--reuse-values` flag to reuse the last release's values.
3. If you are **upgrading Kruise from 0.x to 1.x**, you must add `--force` to the upgrade command. Otherwise, it is an optional flag.
## Optional: download charts manually
@ -67,7 +68,7 @@ The following table lists the configurable parameters of the chart and their def
| `manager.log.level` | Log level that kruise-manager printed | `4` |
| `manager.replicas` | Replicas of kruise-controller-manager deployment | `2` |
| `manager.image.repository` | Repository for kruise-manager image | `openkruise/kruise-manager` |
| `manager.image.tag` | Tag for kruise-manager image | `v1.0.0-beta.0` |
| `manager.image.tag` | Tag for kruise-manager image | `v1.0.0` |
| `manager.resources.limits.cpu` | CPU resource limit of kruise-manager container | `100m` |
| `manager.resources.limits.memory` | Memory resource limit of kruise-manager container | `256Mi` |
| `manager.resources.requests.cpu` | CPU resource request of kruise-manager container | `100m` |

View File

@ -32,7 +32,7 @@ const darkCodeTheme = require('prism-react-renderer/themes/dracula');
showLastUpdateAuthor: true,
showLastUpdateTime: true,
includeCurrentVersion: true,
lastVersion: 'v0.10',
lastVersion: 'v1.0',
},
blog: {
showReadingTime: true,

View File

@ -0,0 +1,184 @@
---
slug: openkruise-1.0
title: OpenKruise v1.0:云原生应用自动化达到新的高峰
authors: [FillZpp]
tags: [release]
---
云原生应用自动化管理套件、CNCF Sandbox 项目 -- OpenKruise近期发布了 v1.0 大版本。
[OpenKruise](https://openkruise.io) 是针对 Kubernetes 的增强能力套件,聚焦于云原生应用的部署、升级、运维、稳定性防护等领域。所有的功能都通过 CRD 等标准方式扩展,可以适用于 1.16 以上版本的任意 Kubernetes 集群。单条 helm 命令即可完成 Kruise 的一键部署,无需更多配置。
![openkruise-features|center|450x400](/img/blog/2021-12-13-release-1.0/features-zh.png)
总的来看,目前 OpenKruise 提供的能力分为几个领域:
- **应用工作负载**:面向无状态、有状态、daemon 等多种类型应用的高级部署发布策略,例如原地升级、灰度流式发布等。
- **Sidecar 容器管理**:支持独立定义 sidecar 容器,完成动态注入、独立原地升级、热升级等功能。
- **增强运维能力**:包括容器原地重启、镜像预拉取、容器启动顺序保障等。
- **应用分区管理**:管理应用在多个分区(可用区、不同机型等)上的部署比例、顺序、优先级等。
- **应用安全防护**:帮助应用在 Kubernetes 之上获得更高的安全性保障与可用性防护。
## 版本解析
在 v1.0 大版本中OpenKruise 带来了多种新的特性,同时也对不少已有功能做了增强与优化。
首先要说的是,从 v1.0 开始 OpenKruise 将 CRD/WebhookConfiguration 等资源配置的版本从 `v1beta1` 升级到 `v1`,因此可以**支持 Kubernetes v1.22 及以上版本的集群,但同时也要求 Kubernetes 的版本不能低于 v1.16**。
以下对 v1.0 的部分功能做简要介绍,详细的 ChangeLog 列表请查看 OpenKruise Github 上的 release 说明以及官网文档。
### 1. 支持环境变量原地升级
*Author: [@FillZpp](https://github.com/FillZpp)*
OpenKruise 从早期版本开始就支持了 “原地升级” 功能,主要应用于 CloneSet 与 Advanced StatefulSet 两种工作负载上。简单来说,原地升级使得应用在升级的过程中,不需要删除、新建 Pod 对象,而是通过对 Pod 中容器配置的修改来达到升级的目的。
![inplace-update-comparation|center|450x400](/img/docs/core-concepts/inplace-update-comparation.png)
如上图所示,原地升级过程中只修改了 Pod 中的字段,因此:
1. 可以避免如 *调度*、*分配 IP*、*分配、挂载盘* 等额外的操作和代价。
2. 更快的镜像拉取,因为可以复用已有旧镜像的大部分 layer 层,只需要拉取新镜像变化的一些 layer。
3. 当一个容器在原地升级时Pod 的网络、挂载盘、以及 Pod 中的其他容器不会受到影响,仍然维持运行。
然而OpenKruise 过去只能对 Pod 中 image 字段的更新做原地升级,对于其他字段仍然只能采用与 Deployment 相似的重建升级。一直以来,我们收到很多用户反馈,希望支持对 env 等更多字段的原地升级 -- 由于受到 kube-apiserver 的限制,这是很难做到的。
经过我们的不懈努力OpenKruise 终于在 v1.0 版本中,通过 Downward API 的方式支持了 env 环境变量的原地升级。例如对以下 CloneSet YAML用户将配置定义在 annotation 中并关联到对应 env 中。后续在修改配置时,只需要更新 annotation value 中的值Kruise 就会对 Pod 中所有 env 里引用了这个 annotation 的容器触发原地重建,从而生效这个新的 value 配置。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
...
spec:
replicas: 1
template:
metadata:
annotations:
app-config: "... the real env value ..."
spec:
containers:
- name: app
env:
- name: APP_CONFIG
valueFrom:
fieldRef:
fieldPath: metadata.annotations['app-config']
updateStrategy:
type: InPlaceIfPossible
```
*与此同时,我们在这个版本中也去除了过去对镜像原地升级的`imageID`限制即支持相同imageID的两个镜像替换升级。*
具体使用方式请参考[文档](/docs/core-concepts/inplace-update)。
### 2. 配置跨命名空间分发
*Author: [@veophi](https://github.com/veophi)*
在对 Secret、ConfigMap 等 namespace-scoped 资源进行跨 namespace 分发及同步的场景中,原生 kubernetes 目前只支持用户 one-by-one 地进行手动分发与同步,十分地不方便。
典型的案例有:
- 当用户需要使用 SidecarSet 的 imagePullSecrets 能力时,要先重复地在相关 namespaces 中创建对应的 Secret并且需要确保这些 Secret 配置的正确性和一致性。
- 当用户想要采用 ConfigMap 来配置一些**通用**的环境变量时,往往需要在多个 namespaces 做 ConfigMap 的下发,并且后续的修改往往也要求多 namespaces 之间保持同步。
因此,面对这些需要跨 namespaces 进行资源分发和**多次同步**的场景我们期望一种更便捷的分发和同步工具来自动化地去做这件事为此我们设计并实现了一个新的CRD --- **ResourceDistribution**
ResourceDistribution 目前支持 **Secret****ConfigMap** 两类资源的分发和同步。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
name: sample
spec:
resource:
apiVersion: v1
kind: ConfigMap
metadata:
name: game-demo
data:
...
targets:
namespaceLabelSelector:
...
# or includedNamespaces, excludedNamespaces
```
如上述 YAML 所示ResourceDistribution是一类 **cluster-scoped** 的 CRD其主要由 **`resource`** 和 **`targets`** 两个字段构成,其中 **`resource`** 字段用于描述用户所要分发的资源,**`targets`** 字段用于描述用户所要分发的目标命名空间。
具体使用方式请参考[文档](/docs/user-manuals/resourcedistribution)。
### 3. 容器启动顺序控制
*Author: [@Concurrensee](https://github.com/Concurrensee)*
对于 Kubernetes 的一个 Pod其中的多个容器可能存在依赖关系比如 容器B 中应用进程的运行依赖于 容器A 中的应用。因此,存在控制多个容器之间启动和退出顺序的需求:
- 容器A 先启动,启动成功后才可以启动 容器B
- 容器B 先退出,退出完成后才可以停止 容器A
通常来说 Pod 容器的启动和退出顺序是由 Kubelet 管理的。Kubernetes 曾经有一个 KEP 计划在 container 中增加一个 type 字段来标识不同类型容器的启停优先级。但是由于 sig-node 考虑到对现有代码架构的改动太大,目前这个 KEP 已经被拒绝了。
因此OpenKruise 在 v1.0 中提供了名为 **Container Launch Priority** 的功能,用于控制一个 Pod 中多个容器的强制启动顺序:
1. 对于任意一个 Pod 对象,只需要在 annotations 中定义 `apps.kruise.io/container-launch-priority: Ordered`,则 Kruise 会按照 Pod 中 `containers` 容器列表的顺序来保证其中容器的串行启动。
2. 如果要自定义 `containers` 中多个容器的启动顺序,则在容器 env 中添加 `KRUISE_CONTAINER_PRIORITY` 环境变量value 值是范围在 `[-2147483647, 2147483647]` 的整数。一个容器的 priority 值越大,会保证越先启动。
具体使用方式请参考[文档](/docs/user-manuals/containerlaunchpriority)。
### 4. `kubectl-kruise` 命令行工具
*Author: [@hantmac](https://github.com/hantmac)*
过去 OpenKruise 是通过 kruise-api、client-java 等仓库提供了 Go、Java 等语言的 Kruise API 定义以及客户端封装,可供用户在自己的应用程序中引入使用。但仍然有不少用户在测试环境下需要灵活地用命令行操作 workload 资源。
然而原生 `kubectl` 工具提供的 `rollout`、`set image` 等命令只能适用于原生的 workload 类型,如 Deployment、StatefulSet并不能识别 OpenKruise 中扩展的 workload 类型。
因此OpenKruise 最新提供了 `kubectl-kruise` 命令行工具,它是 `kubectl` 的标准插件,提供了许多适用于 OpenKruise workload 的功能。
```bash
# rollout undo cloneset
$ kubectl kruise rollout undo cloneset/nginx
# rollout status advanced statefulset
$ kubectl kruise rollout status statefulsets.apps.kruise.io/sts-demo
# set image of a cloneset
$ kubectl kruise set image cloneset/nginx busybox=busybox nginx=nginx:1.9.1
```
具体使用方式请参考[文档](/docs/cli-tool/kubectl-plugin)。
### 5. 其余部分功能改进与优化
**CloneSet:**
- 通过 `scaleStrategy.maxUnavailable` 策略支持流式扩容
- Stable revision 判断逻辑变化,当所有 Pod 版本与 updateRevision 一致时则标记为 currentRevision
**WorkloadSpread:**
- 支持接管存量 Pod 到匹配的 subset 分组中
- 优化 webhook 在 Pod 注入时的更新与重试逻辑
**Advanced DaemonSet:**
- 支持对 Daemon Pod 做原地升级
- 引入 progressive annotation 来选择是否按 partition 限制 Pod 创建
**SidecarSet:**
- 解决 SidecarSet 过滤屏蔽 inactive Pod 的问题
- 在 `transferenv` 中新增 `SourceContainerNameFrom` 和 `EnvNames` 字段,来解决 container name 不一致与大量 env 情况下的冗余问题
**PodUnavailableBudget:**
- 新增 “跳过保护” 标识
- PodUnavailableBudget controller 关注 workload 工作负载的 replicas 变化
**NodeImage:**
- 加入 `--nodeimage-creation-delay` 参数,并默认等待新增 Node ready 一段时间后同步创建 NodeImage
**UnitedDeployment:**
- 解决 `NodeSelectorTerms` 为 nil 情况下 Pod `NodeSelectorTerms` 长度为 0 的问题
**Other optimization:**
- kruise-daemon 采用 protobuf 协议操作 Pod 资源
- 暴露 cache resync 为命令行参数,并在 chart 中设置默认值为 0
- 解决 certs 更新时的 http checker 刷新问题
- 去除对 forked controller-tools 的依赖,改为使用原生 controller-tools 配合 markers 注解

View File

@ -6,13 +6,13 @@ title: Kubectl Plugin
## 安装
1. 你可以从 [releases](https://github.com/openkruise/kruise-tools/releases) 页面下载二进制文件,目前提供 `linux`、`darwin`OS X系统 `x86_64`、`arm64` 架构。如果你在使用其他的操作系统或架构,需要下载 [kruise-tools](https://github.com/openkruise/kruise-tools) 源码并通过 `make build` 打包。
1. 你可以从 [releases](https://github.com/openkruise/kruise-tools/releases) 页面下载二进制文件,目前提供 `linux`、`darwin`OS X、`windows` 系统以及 `x86_64`、`arm64` 架构。如果你在使用其他的操作系统或架构,需要下载 [kruise-tools](https://github.com/openkruise/kruise-tools) 源码并通过 `make build` 打包。
2. 添加可执行权限,重命名并移动到系统 PATH 路径中。
2. 解压缩,并移动到系统 PATH 路径中。
```bash
$ chmod +x kubectl-kruise_darwin_amd64
$ mv kubectl-kruise_darwin_amd64 /usr/local/bin/kubectl-kruise
$ tar xvf kubectl-kruise-darwin-amd64.tar.gz
$ mv darwin-amd64/kubectl-kruise /usr/local/bin/
```
3. 之后即可通过 `kubectl-kruise` 或 `kubectl kruise` 命令来使用。

View File

@ -16,7 +16,7 @@ $ helm repo add openkruise https://openkruise.github.io/charts/
$ helm repo update
# Install the latest version.
$ helm install kruise openkruise/kruise --version 1.0.0-beta.0
$ helm install kruise openkruise/kruise --version 1.0.0
```
*如果你希望安装稳定版本,阅读[文档](/docs/installation)。*
@ -31,13 +31,14 @@ $ helm repo add openkruise https://openkruise.github.io/charts/
$ helm repo update
# Upgrade the latest version.
$ helm upgrade kruise openkruise/kruise --version 1.0.0-beta.0
$ helm upgrade kruise openkruise/kruise --version 1.0.0 [--force]
```
注意:
1. 在升级之前,**必须** 先阅读 [Change Log](https://github.com/openkruise/kruise/blob/master/CHANGELOG.md) ,确保你已经了解新版本的不兼容变化。
2. 如果你要重置之前旧版本上用的参数或者配置一些新参数,建议在 `helm upgrade` 命令里加上 `--reset-values`
3. 如果你在将 Kruise 从 0.x 升级到 1.x 版本,你需要为 upgrade 命令添加 `--force` 参数,其他情况下这个参数是可选的。
## 可选的:手工下载 charts 包
@ -62,7 +63,7 @@ $ helm install/upgrade kruise /PATH/TO/CHART
| `manager.log.level` | kruise-manager 日志输出级别 | `4` |
| `manager.replicas` | kruise-manager 的期望副本数 | `2` |
| `manager.image.repository` | kruise-manager/kruise-daemon 镜像仓库 | `openkruise/kruise-manager` |
| `manager.image.tag` | kruise-manager/kruise-daemon 镜像版本 | `1.0.0-beta.0` |
| `manager.image.tag` | kruise-manager/kruise-daemon 镜像版本 | `1.0.0` |
| `manager.resources.limits.cpu` | kruise-manager 的 limit CPU 资源 | `100m` |
| `manager.resources.limits.memory` | kruise-manager 的 limit memory 资源 | `256Mi` |
| `manager.resources.requests.cpu` | kruise-manager 的 request CPU 资源 | `100m` |

View File

@ -0,0 +1,42 @@
{
"version.label": {
"message": "v1.0",
"description": "The label for version v1.0"
},
"sidebar.docs.category.Getting Started": {
"message": "快速开始"
},
"sidebar.docs.category.Core Concepts": {
"message": "核心概念"
},
"sidebar.docs.category.User Manuals": {
"message": "用户手册"
},
"sidebar.docs.category.Typical Workloads": {
"message": "通用工作负载"
},
"sidebar.docs.category.Job Workloads": {
"message": "任务工作负载"
},
"sidebar.docs.category.Sidecar container Management": {
"message": "Sidecar容器管理"
},
"sidebar.docs.category.Multi-domain Management": {
"message": "多区域管理"
},
"sidebar.docs.category.Enhanced Operations": {
"message": "增强运维能力"
},
"sidebar.docs.category.Application Protection": {
"message": "应用安全防护"
},
"sidebar.docs.category.Reference": {
"message": "参考"
},
"sidebar.docs.category.Best Practices": {
"message": "最佳实践"
},
"sidebar.docs.category.Developer Manuals": {
"message": "开发者手册"
}
}

View File

@ -0,0 +1,27 @@
---
title: HPA configuration
---
Kruise 中的 Workload比如 CloneSet、Advanced StatefulSet、UnitedDeployment都实现了 scale subresource。
这表示它们都可以适配 HorizontalPodAutoscaler、PodDisruptionBudget 等原生操作。
### 例子
只需要将 CloneSet 的类型、名字写入 `scaleTargetRef` 即可:
```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
# ...
spec:
scaleTargetRef:
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
name: your-cloneset-name
```
注意:
1. HPA 的 namespace 需要和你的 CloneSet 相同。
2. `scaleTargetRef` 中的 `apiVersion` 需要和你的 workload 中的相同,比如 `apps.kruise.io/v1alpha1` 或 `apps.kruise.io/v1beta1`
对于 Advanced StatefulSet 这种存在多个版本的 workload它取决于你所使用的版本。
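下面是一个更完整的参考示例(其中 HPA 名称、副本范围和目标 CPU 使用率均为示意值,实际请按需调整):
```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: sample-hpa            # 示例名称
  namespace: default          # 需与 CloneSet 所在 namespace 相同
spec:
  scaleTargetRef:
    apiVersion: apps.kruise.io/v1alpha1
    kind: CloneSet
    name: your-cloneset-name
  minReplicas: 2              # 示意值
  maxReplicas: 10             # 示意值
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # 示意值:目标 CPU 使用率
```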

View File

@ -0,0 +1,95 @@
---
title: Kubectl Plugin
---
[Kruise-tools](https://github.com/openkruise/kruise-tools) 为 Kruise 的功能提供了一系列命令行工具,包括 `kubectl-kruise`,它是 `kubectl` 的标准插件。
## 安装
1. 你可以从 [releases](https://github.com/openkruise/kruise-tools/releases) 页面下载二进制文件,目前提供 `linux`、`darwin`OS X、`windows` 系统以及 `x86_64`、`arm64` 架构。如果你在使用其他的操作系统或架构,需要下载 [kruise-tools](https://github.com/openkruise/kruise-tools) 源码并通过 `make build` 打包。
2. 解压缩,并移动到系统 PATH 路径中。
```bash
$ tar xvf kubectl-kruise-darwin-amd64.tar.gz
$ mv darwin-amd64/kubectl-kruise /usr/local/bin/
```
3. 之后即可通过 `kubectl-kruise` 或 `kubectl kruise` 命令来使用。
```bash
$ kubectl-kruise --help
# or
$ kubectl kruise --help
```
## Usage
### expose
根据一个 workload如 Deployment、CloneSet、Service 或 Pod来生成一个新的 service 对象。
```bash
$ kubectl kruise expose cloneset nginx --port=80 --target-port=8000
```
### scale
为 workload如 Deployment、ReplicaSet、CloneSet 或 Advanced StatefulSet设置新的副本数。
```bash
$ kubectl kruise scale --replicas=3 cloneset nginx
```
它的效果与 `kubectl scale --replicas=3 cloneset nginx` 相同,即原生 `kubectl scale` 也适用。
### rollout
可用的子命令: `history`, `pause`, `restart`, `resume`, `status`, `undo`.
```bash
$ kubectl kruise rollout undo cloneset/nginx
# built-in statefulsets
$ kubectl kruise rollout status statefulsets/sts1
# kruise statefulsets
$ kubectl kruise rollout status statefulsets.apps.kruise.io/sts2
```
### set
可用的子命令: `env`, `image`, `resources`, `selector`, `serviceaccount`, `subject`.
```bash
$ kubectl kruise set env cloneset/nginx STORAGE_DIR=/local
$ kubectl kruise set image cloneset/nginx busybox=busybox nginx=nginx:1.9.1
```
### migrate
目前支持从 Deployment 迁移到 CloneSet。
```bash
# Create an empty CloneSet from an existing Deployment.
$ kubectl kruise migrate CloneSet --from Deployment -n default --dst-name deployment-name --create
# Create a same replicas CloneSet from an existing Deployment.
$ kubectl kruise migrate CloneSet --from Deployment -n default --dst-name deployment-name --create --copy
# Migrate replicas from an existing Deployment to an existing CloneSet.
$ kubectl-kruise migrate CloneSet --from Deployment -n default --src-name cloneset-name --dst-name deployment-name --replicas 10 --max-surge=2
```
### scaledown
对 cloneset 指定 pod 缩容。
```bash
# Scale down 2 with selective pods
$ kubectl kruise scaledown cloneset/nginx --pods pod-a,pod-b
```
它会将 cloneset 设置 **replicas=replicas-2**,并删除指定的两个 pod。

View File

@ -0,0 +1,88 @@
---
title: 系统架构
---
OpenKruise 的整体架构如下:
![alt](/img/docs/core-concepts/architecture.png)
## API
所有 OpenKruise 的功能都是通过 **Kubernetes API** 来提供, 比如:
- 新的 CRD 定义,比如
```shell script
$ kubectl get crd | grep kruise.io
advancedcronjobs.apps.kruise.io 2021-09-16T06:02:36Z
broadcastjobs.apps.kruise.io 2021-09-16T06:02:36Z
clonesets.apps.kruise.io 2021-09-16T06:02:36Z
containerrecreaterequests.apps.kruise.io 2021-09-16T06:02:36Z
daemonsets.apps.kruise.io 2021-09-16T06:02:36Z
imagepulljobs.apps.kruise.io 2021-09-16T06:02:36Z
nodeimages.apps.kruise.io 2021-09-16T06:02:36Z
podunavailablebudgets.policy.kruise.io 2021-09-16T06:02:36Z
resourcedistributions.apps.kruise.io 2021-09-16T06:02:36Z
sidecarsets.apps.kruise.io 2021-09-16T06:02:36Z
statefulsets.apps.kruise.io 2021-09-16T06:02:36Z
uniteddeployments.apps.kruise.io 2021-09-16T06:02:37Z
workloadspreads.apps.kruise.io 2021-09-16T06:02:37Z
# ...
```
- 资源对象中的特定标识labels, annotations, envs 等),比如
```yaml
apiVersion: v1
kind: Namespace
metadata:
labels:
# 保护这个 namespace 下的 Pod 不被整个 ns 级联删除
policy.kruise.io/delete-protection: Cascading
```
## Manager
Kruise-manager 是一个运行 controller 和 webhook 的中心组件,它通过 Deployment 部署在 `kruise-system` 命名空间中。
```bash
$ kubectl get deploy -n kruise-system
NAME READY UP-TO-DATE AVAILABLE AGE
kruise-controller-manager 2/2 2 2 4h6m
$ kubectl get pod -n kruise-system -l control-plane=controller-manager
NAME READY STATUS RESTARTS AGE
kruise-controller-manager-68dc6d87cc-k9vg8 1/1 Running 0 4h6m
kruise-controller-manager-68dc6d87cc-w7x82 1/1 Running 0 4h6m
```
<!-- It can be deployed as multiple replicas with Deployment, but only one of them could become leader and start working, others will keep retrying to acquire the lock. -->
逻辑上来说,如 cloneset-controller/sidecarset-controller 这些的 controller 都是独立运行的。不过为了减少复杂度,它们都被打包在一个独立的二进制文件、并运行在 `kruise-controller-manager-xxx` 这个 Pod 中。
除了 controller 之外,`kruise-controller-manager-xxx` 中还包含了针对 Kruise CRD 以及 Pod 资源的 admission webhook。Kruise-manager 会创建一些 webhook configurations 来配置哪些资源需要感知处理、以及提供一个 Service 来给 kube-apiserver 调用。
```bash
$ kubectl get svc -n kruise-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kruise-webhook-service ClusterIP 172.24.9.234 <none> 443/TCP 4h9m
```
上述的 `kruise-webhook-service` 非常重要,是提供给 kube-apiserver 调用的。
## Daemon
这是从 Kruise v0.8.0 版本开始提供的一个新的 daemon 组件。
它通过 DaemonSet 部署到每个 Node 节点上,提供镜像预热、容器重启等功能。
```bash
$ kubectl get pod -n kruise-system -l control-plane=daemon
NAME READY STATUS RESTARTS AGE
kruise-daemon-6hw6d 1/1 Running 0 4h7m
kruise-daemon-d7xr4 1/1 Running 0 4h7m
kruise-daemon-dqp8z 1/1 Running 0 4h7m
kruise-daemon-dv96r 1/1 Running 0 4h7m
kruise-daemon-q7594 1/1 Running 0 4h7m
kruise-daemon-vnsbw 1/1 Running 0 4h7m
```

View File

@ -0,0 +1,92 @@
---
title: 原地升级
---
原地升级是 OpenKruise 提供的核心功能之一。
目前支持原地升级的 Workload
- [CloneSet](/docs/user-manuals/cloneset)
- [Advanced StatefulSet](/docs/user-manuals/advancedstatefulset)
- [Advanced DaemonSet](/docs/user-manuals/advanceddaemonset)
- [SidecarSet](/docs/user-manuals/sidecarset)
目前 `CloneSet`、`Advanced StatefulSet`、`Advanced DaemonSet` 复用了同一个代码包 [`./pkg/util/inplaceupdate`](https://github.com/openkruise/kruise/tree/master/pkg/util/inplaceupdate),并且有类似的原地升级行为。在本文中,我们会介绍它的用法和工作流程。
注意,`SidecarSet` 的原地升级流程和其他 workloads 不太一样,比如它在升级 Pod 之前并不会把 Pod 设置为 not-ready 状态。因此,下文中讨论的内容并不完全适用于 `SidecarSet`
## 什么是原地升级
当我们要升级一个存量 Pod 中的镜像时,这是 *重建升级* 与 *原地升级* 的区别:
![alt](/img/docs/core-concepts/inplace-update-comparation.png)
**重建升级**时我们要删除旧 Pod、创建新 Pod
- Pod 名字和 uid 发生变化,因为它们是完全不同的两个 Pod 对象(比如 Deployment 升级)
- Pod 名字可能不变、但 uid 变化,因为它们是不同的 Pod 对象,只是复用了同一个名字(比如 StatefulSet 升级)
- Pod 所在 Node 名字发生变化,因为新 Pod 很大可能性是不会调度到之前所在的 Node 节点的
- Pod IP 发生变化,因为新 Pod 很大可能性是不会被分配到之前的 IP 地址的
但是对于**原地升级**,我们仍然复用同一个 Pod 对象,只是修改它里面的字段。因此:
- 可以避免如 *调度*、*分配 IP*、*分配、挂载盘* 等额外的操作和代价
- 更快的镜像拉取,因为可以复用已有旧镜像的大部分 layer 层,只需要拉取新镜像变化的一些 layer
- 当一个容器在原地升级时Pod 中的其他容器不会受到影响,仍然维持运行
## 理解 *InPlaceIfPossible*
这种 Kruise workload 的升级类型名为 `InPlaceIfPossible`,它意味着 Kruise 会尽量对 Pod 采取原地升级,如果不能则退化到重建升级。
以下的改动会被允许执行原地升级:
1. 更新 workload 中的 `spec.template.metadata.*`,比如 labels/annotationsKruise 只会将 metadata 中的改动更新到存量 Pod 上。
2. 更新 workload 中的 `spec.template.spec.containers[x].image`Kruise 会原地升级 Pod 中这些容器的镜像,而不会重建整个 Pod。
3. **从 Kruise v1.0 版本开始(包括 v1.0 alpha/beta**,更新 `spec.template.metadata.labels/annotations` 并且 container 中有配置 env from 这些改动的 labels/annotationsKruise 会原地升级这些容器来生效新的 env 值。
否则,其他字段的改动,比如 `spec.template.spec.containers[x].env``spec.template.spec.containers[x].resources`,都是会回退为重建升级。
例如对下述 CloneSet YAML
1. 修改 `app-image:v1` 镜像,会触发原地升级。
2. 修改 annotations 中 `app-config` 的 value 内容,会触发原地升级(参考下文[使用要求](#使用要求))。
3. 同时修改上述两个字段,会在原地升级中同时更新镜像和环境变量。
4. 直接修改 env 中 `APP_NAME` 的 value 内容或者新增 env 等其他操作,会触发 Pod 重建升级。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
...
spec:
replicas: 1
template:
metadata:
annotations:
app-config: "... the real env value ..."
spec:
containers:
- name: app
image: app-image:v1
env:
- name: APP_CONFIG
valueFrom:
fieldRef:
fieldPath: metadata.annotations['app-config']
- name: APP_NAME
value: xxx
updateStrategy:
type: InPlaceIfPossible
```
## 工作流程总览
可以在下图中看到原地升级的整体工作流程(*你可能需要右击在新标签页中打开*
![alt](/img/docs/core-concepts/inplace-update-workflow.png)
## 使用要求
如果要使用 env from metadata 原地升级能力,你需要在安装或升级 Kruise chart 的时候打开 `kruise-daemon`(默认打开)和 `InPlaceUpdateEnvFromMetadata` 两个 feature-gate。
注意,如果你有一些 virtual-kubelet 类型的 Node 节点kruise-daemon 可能是无法在上面运行的,因此也无法使用 env from metadata 原地升级。
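下面是一个示意性的开启方式(假设通过 chart 的 `featureGates` 参数配置,等价于在 helm 命令中使用 `--set featureGates=...`,具体参数含义以安装文档为准):
```yaml
# 示意性的 Helm values 片段
# KruiseDaemon 默认即为 true这里仅为明确展示
featureGates: "KruiseDaemon=true,InPlaceUpdateEnvFromMetadata=true"
```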

View File

@ -0,0 +1,161 @@
---
title: Golang client
---
如果要在一个 Golang 项目中对 OpenKruise 的资源做 create/get/update/delete 这些操作、或者通过 informer 做 list-watch你需要一个支持 OpenKruise 的 client。
你需要在你的项目中引入 [kruise-api](https://github.com/openkruise/kruise-api) 仓库,它包含了 Kruise 的 schema 定义以及 clientset 等工具。
**不要**直接引入整个 [kruise](https://github.com/openkruise/kruise) 仓库作为依赖。
## 使用方式
首先,在你的 `go.mod` 中引入 `kruise-api` 依赖 (版本号最好和你安装的 Kruise 版本相同):
```
require github.com/openkruise/kruise-api v0.10.0
```
| Kubernetes Version in your Project | Import Kruise-api < v0.10 | Import Kruise-api >= v0.10 |
| ---------------------------------- | ---------------------------- | ---------------------------- |
| < 1.18 | v0.x.y (x <= 9) | v0.x.y-legacy (x >= 10) |
| >= 1.18 | v0.x.y-1.18 (7 <= x <= 9) | v0.x.y (x >= 10) |
然后,有两种方式在你的代码中使用 `kruise-api`:直接使用 或 通过 `controller-runtime` 使用。
如果你的项目是通过 [kubebuilder](https://github.com/kubernetes-sigs/kubebuilder) 或 [operator-sdk](https://github.com/operator-framework/operator-sdk) 生成的,
那么建议你通过 `controller-runtime` 使用。否则,你可以直接使用。
### 直接使用
1. New Kruise client using your rest config:
```go
import kruiseclientset "github.com/openkruise/kruise-api/client/clientset/versioned"
// cfg is the rest config defined in client-go, you should get it using kubeconfig or serviceaccount
kruiseClient := kruiseclientset.NewForConfigOrDie(cfg)
```
2. Get/List Kruise resources:
```go
cloneSet, err := kruiseClient.AppsV1alpha1().CloneSets(namespace).Get(name, metav1.GetOptions{})
cloneSetList, err := kruiseClient.AppsV1alpha1().CloneSets(namespace).List(metav1.ListOptions{})
```
3. Create/Update Kruise resources:
```go
import kruiseappsv1alpha1 "github.com/openkruise/kruise-api/apps/v1alpha1"
cloneSet := kruiseappsv1alpha1.CloneSet{
// ...
}
err = kruiseClient.AppsV1alpha1().CloneSets(namespace).Create(&cloneSet, metav1.CreateOptions{})
```
```go
// Get first
cloneSet, err := kruiseClient.AppsV1alpha1().CloneSets(namespace).Get(name, metav1.GetOptions{})
if err != nil {
return err
}
// Modify object, such as replicas or template
cloneSet.Spec.Replicas = utilpointer.Int32Ptr(5)
// Update
// This might get conflict, should retry it
if err = kruiseClient.AppsV1alpha1().CloneSets(namespace).Update(&cloneSet, metav1.UpdateOptions{}); err != nil {
return err
}
```
4. Watch Kruise resources:
```go
import kruiseinformer "github.com/openkruise/kruise-api/client/informers/externalversions"
kruiseInformerFactory := kruiseinformer.NewSharedInformerFactory(kruiseClient, 0)
kruiseInformerFactory.Apps().V1alpha1().CloneSets().Informer().AddEventHandler(...)
kruiseInformerFactory.Start(...)
```
### 通过 controller-runtime 使用
1. Add kruise apis into the scheme in your `main.go`
```go
import kruiseapi "github.com/openkruise/kruise-api"
// ...
_ = kruiseapi.AddToScheme(scheme)
```
2. New client
This is needed when use controller-runtime client directly.
If your project is generated by [kubebuilder](https://github.com/kubernetes-sigs/kubebuilder) or [operator-sdk](https://github.com/operator-framework/operator-sdk),
you should get the client from `mgr.GetClient()` instead of the example below.
```go
import "sigs.k8s.io/controller-runtime/pkg/client"
apiClient, err := client.New(c, client.Options{Scheme: scheme})
```
3. Get/List
```go
import (
kruiseappsv1alpha1 "github.com/openkruise/kruise-api/apps/v1alpha1"
"sigs.k8s.io/controller-runtime/pkg/client"
)
cloneSet := kruiseappsv1alpha1.CloneSet{}
err = apiClient.Get(context.TODO(), types.NamespacedName{Namespace: namespace, Name: name}, &cloneSet)
cloneSetList := kruiseappsv1alpha1.CloneSetList{}
err = apiClient.List(context.TODO(), &cloneSetList, client.InNamespace(instance.Namespace))
```
4. Create/Update/Delete
Create a new CloneSet:
```go
import kruiseappsv1alpha1 "github.com/openkruise/kruise-api/apps/v1alpha1"
cloneSet := kruiseappsv1alpha1.CloneSet{
// ...
}
err = apiClient.Create(context.TODO(), &cloneSet)
```
Update an existing CloneSet:
```go
import kruiseappsv1alpha1 "github.com/openkruise/kruise-api/apps/v1alpha1"
// Get first
cloneSet := kruiseappsv1alpha1.CloneSet{}
if err = apiClient.Get(context.TODO(), types.NamespacedName{Namespace: namespace, Name: name}, &cloneSet); err != nil {
return err
}
// Modify object, such as replicas or template
cloneSet.Spec.Replicas = utilpointer.Int32Ptr(5)
// Update
// This might get conflict, should retry it
if err = apiClient.Update(context.TODO(), &cloneSet); err != nil {
return err
}
```
5. List watch and informer
If your project is generated by [kubebuilder](https://github.com/kubernetes-sigs/kubebuilder) or [operator-sdk](https://github.com/operator-framework/operator-sdk) and get the client from `mgr.GetClient()`,
then methods like `Get`/`List` have already queried from informer instead of apiserver.

View File

@ -0,0 +1,11 @@
---
title: Java client
---
目前我们有一个 [client-java](https://github.com/openkruise/client-java) 仓库提供了 Kruise 资源的 schema 定义。
不过,目前已经不太推荐使用这个仓库,我们强烈建议你使用 [Golang Client](./go-client)。
如果你需要使用 client-java要注意
1. 它的 schema 定义可能会落后于最新的 Kruise 版本,我们不会为它生成每个 release 版本。
2. 这个包没有上传到官方的 maven 仓库中,你需要手动下载这个项目并打包为 jar 包使用。

View File

@ -0,0 +1,8 @@
---
title: Other languages
---
目前Kruise 没有提供除 Golang 和 Java 之外其他语言的 SDK事实上我们也只推荐你使用 [Golang Client](./go-client),它能保证最新的版本和稳定性。
如果你要使用其他编程语言比如 Python你只能使用它们的官方 K8s client 比如 [kubernetes-client/python](https://github.com/kubernetes-client/python)。
通常来说,它们都会提供一些让你操作任意 CR 自定义资源的方法。

View File

@ -0,0 +1,3 @@
---
title: FAQ
---

View File

@ -0,0 +1,143 @@
---
title: 安装
---
从 v1.0.0 (alpha/beta) 开始OpenKruise 要求在 **Kubernetes >= 1.16** 以上版本的集群中安装和使用。
## 通过 helm 安装
建议采用 helm v3.1+ 来安装 Kruisehelm 是一个简单的命令行工具可以从 [这里](https://github.com/helm/helm/releases) 获取。
```bash
# Firstly add openkruise charts repository if you haven't do this.
$ helm repo add openkruise https://openkruise.github.io/charts/
# [Optional]
$ helm repo update
# Install the latest version.
$ helm install kruise openkruise/kruise --version 1.0.0
```
*如果你希望安装稳定版本,阅读[文档](/docs/installation)。*
## 通过 helm 升级
```bash
# Firstly add openkruise charts repository if you haven't do this.
$ helm repo add openkruise https://openkruise.github.io/charts/
# [Optional]
$ helm repo update
# Upgrade the latest version.
$ helm upgrade kruise openkruise/kruise --version 1.0.0 [--force]
```
注意:
1. 在升级之前,**必须** 先阅读 [Change Log](https://github.com/openkruise/kruise/blob/master/CHANGELOG.md) ,确保你已经了解新版本的不兼容变化。
2. 如果你要重置之前旧版本上用的参数或者配置一些新参数,建议在 `helm upgrade` 命令里加上 `--reset-values`
3. 如果你在将 Kruise 从 0.x 升级到 1.x 版本,你需要为 upgrade 命令添加 `--force` 参数,其他情况下这个参数是可选的。
## 可选的:手工下载 charts 包
如果你在生产环境无法连接到 `https://openkruise.github.io/charts/`,可以先在[这里](https://github.com/openkruise/charts/releases)手工下载 chart 包,再用它安装或更新到集群中。
```bash
$ helm install/upgrade kruise /PATH/TO/CHART
```
## 可选项
注意直接安装 chart 会使用默认的 template values你也可以根据你的集群情况指定一些特殊配置比如修改 resources 限制或者配置 feature-gates。
### 可选: chart 安装参数
下表展示了 chart 所有可配置的参数和它们的默认值:
| Parameter | Description | Default |
| ----------------------------------------- | ------------------------------------------------------------ | ----------------------------- |
| `featureGates` | 可配置的 feature gates 参数,空表示按默认开关处理 | `` |
| `installation.namespace` | kruise 安装到的 namespace一般不建议修改 | `kruise-system` |
| `manager.log.level` | kruise-manager 日志输出级别 | `4` |
| `manager.replicas` | kruise-manager 的期望副本数 | `2` |
| `manager.image.repository` | kruise-manager/kruise-daemon 镜像仓库 | `openkruise/kruise-manager` |
| `manager.image.tag` | kruise-manager/kruise-daemon 镜像版本 | `1.0.0` |
| `manager.resources.limits.cpu` | kruise-manager 的 limit CPU 资源 | `100m` |
| `manager.resources.limits.memory` | kruise-manager 的 limit memory 资源 | `256Mi` |
| `manager.resources.requests.cpu` | kruise-manager 的 request CPU 资源 | `100m` |
| `manager.resources.requests.memory` | kruise-manager 的 request memory 资源 | `256Mi` |
| `manager.metrics.port` | metrics 服务的监听端口 | `8080` |
| `manager.webhook.port` | webhook 服务的监听端口 | `9443` |
| `manager.nodeAffinity` | kruise-manager 部署的 node affinity 亲和性 | `{}` |
| `manager.nodeSelector` | kruise-manager 部署的 node selector 亲和性 | `{}` |
| `manager.tolerations` | kruise-manager 部署的 tolerations | `[]` |
| `daemon.log.level` | kruise-daemon 日志输出级别 | `4` |
| `daemon.port` | kruise-daemon 的 metrics/healthz 服务监听端口 | `10221` |
| `daemon.resources.limits.cpu` | kruise-daemon 的 limit CPU 资源 | `50m` |
| `daemon.resources.limits.memory` | kruise-daemon 的 limit memory 资源 | `128Mi` |
| `daemon.resources.requests.cpu` | kruise-daemon 的 request CPU 资源 | `0` |
| `daemon.resources.requests.memory` | kruise-daemon 的 request memory 资源 | `0` |
| `daemon.affinity` | kruise-daemon 部署的 affinity 亲和性 (可以排除一些 node 不部署 daemon) | `{}` |
| `daemon.socketLocation` | Node 节点上 CRI socket 文件所在目录 | `/var/run` |
| `webhookConfiguration.failurePolicy.pods` | Pod webhook 的失败策略 | `Ignore` |
| `webhookConfiguration.timeoutSeconds` | 所有 Kruise webhook 的调用超时时间 | `30` |
| `crds.managed` | 是否安装 Kruise CRD (如果关闭,则 chart 不会安装任何 CRD) | `true` |
| `manager.resyncPeriod` | kruise-manager 中 informer 的 resync 周期,默认不做 resync | `0` |
| `manager.hostNetwork` | kruise-manager pod 是否采用 hostnetwork 网络 | `false` |
这些参数可以通过 `--set key=value[,key=value]` 参数在 `helm install` 或 `helm upgrade` 命令中生效。
### 可选: feature-gate
Feature-gate 控制了 Kruise 中一些有影响性的功能:
| Name | Description | Default | Side effect (if closed) |
| ---------------------- | ------------------------------------------------------------ | ------- | ----------------------------------------- |
| `PodWebhook` | 启用对于 Pod **创建** 的 webhook (不建议关闭) | `true` | SidecarSet/KruisePodReadinessGate 不可用 |
| `KruiseDaemon` | 启用 `kruise-daemon` DaemonSet (不建议关闭) | `true` | 镜像预热/容器重启 不可用 |
| `DaemonWatchingPod` | 每个 `kruise-daemon` 会 watch 与自己同节点的 pod (不建议关闭) | `true` | 同 imageID 的原地升级,以及支持 env from labels/annotation 原地升级 不可用 |
| `CloneSetShortHash` | 启用 CloneSet controller 只在 pod label 中设置短 hash 值 | `false` | CloneSet 名字不能超过 54 个字符(默认行为) |
| `KruisePodReadinessGate` | 启用 Kruise webhook 将 'KruisePodReady' readiness-gate 在所有 Pod 创建时注入 | `false` | 只会注入到 Kruise workloads 创建的 Pod 中 |
| `PreDownloadImageForInPlaceUpdate` | 启用 CloneSet 自动为原地升级的过程创建 ImagePullJob 来预热镜像 | `false` | 原地升级无镜像提前预热 |
| `CloneSetPartitionRollback` | 启用如果 partition 被调大, CloneSet controller 会回滚 Pod 到 currentRevision 老版本 | `false` | CloneSet 只会正向发布 Pod 到 updateRevision |
| `ResourcesDeletionProtection` | 资源删除防护 | `false` | 资源删除无保护 |
| `TemplateNoDefaults` | 是否取消对 workload 中 pod/pvc template 的默认值注入 | `false` | 如果该功能已经开启,则不应再关闭 |
| `PodUnavailableBudgetDeleteGate` | 启用 PodUnavailableBudget 保护 pod 删除、驱逐 | `false` | 不防护 pod 删除、驱逐 |
| `PodUnavailableBudgetUpdateGate` | 启用 PodUnavailableBudget 保护 pod 原地升级 | `false` | 不防护 pod 原地升级 |
| `WorkloadSpread` | 启用 WorkloadSpread 管理应用多分区弹性与拓扑部署 | `false` | 不支持 WorkloadSpread |
| `InPlaceUpdateEnvFromMetadata` | 启用 Kruise 原地升级容器当它存在 env from 的 labels/annotations 发生了变化 | `false` | 容器中只有 image 能够原地升级 |
如果你要配置 feature-gate只要在安装或升级时配置参数即可比如
```bash
$ helm install kruise https://... --set featureGates="ResourcesDeletionProtection=true\,PreDownloadImageForInPlaceUpdate=true"
```
如果你希望打开所有 feature-gate 功能,配置参数 `featureGates=AllAlpha=true`
### 可选: 中国本地镜像
如果你在中国、并且很难从官方 DockerHub 上拉镜像,那么你可以使用托管在阿里云上的镜像仓库:
```bash
$ helm install kruise https://... --set manager.image.repository=openkruise-registry.cn-hangzhou.cr.aliyuncs.com/openkruise/kruise-manager
```
## 最佳实践
### k3s 安装参数
通常来说 k3s 有着与默认 `/var/run` 不同的 runtime socket 路径。所以你需要将 `daemon.socketLocation` 参数设置为你的 k3s 节点上真实的路径(比如 `/run/k3s``/var/run/k3s/`)。
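例如(假设你的 k3s 节点上 CRI socket 位于 `/run/k3s` 目录下),一个示意性的 values 配置片段如下:
```yaml
# 示意性的 Helm values 片段,对应上文参数表中的 daemon.socketLocation
daemon:
  socketLocation: /run/k3s
```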
## 卸载
注意:卸载会导致所有 Kruise 下的资源都会删除掉,包括 webhook configurations, services, namespace, CRDs, CR instances 以及所有 Kruise workload 下的 Pod。 请务必谨慎操作!
卸载使用 helm chart 安装的 Kruise
```bash
$ helm uninstall kruise
release "kruise" uninstalled
```

View File

@ -0,0 +1,69 @@
---
title: OpenKruise 简介
slug: /
---
# OpenKruise 是什么
欢迎来到 OpenKruise 的世界!
OpenKruise 是一个基于 Kubernetes 的扩展套件,主要聚焦于云原生应用的自动化,比如*部署、发布、运维以及可用性防护*。
OpenKruise 提供的绝大部分能力都是基于 CRD 扩展来定义,它们不存在于任何外部依赖,可以运行在任意纯净的 Kubernetes 集群中。
## 核心能力
- **增强版本的 Workloads**
OpenKruise 包含了一系列增强版本的 Workloads工作负载比如 CloneSet、Advanced StatefulSet、Advanced DaemonSet、BroadcastJob 等。
它们不仅支持类似于 Kubernetes 原生 Workloads 的基础功能,还提供了如原地升级、可配置的扩缩容/发布策略、并发操作等。
其中,原地升级是一种升级应用容器镜像甚至环境变量的全新方式。它只会用新的镜像重建 Pod 中的特定容器,整个 Pod 以及其中的其他容器都不会被影响。因此它带来了更快的发布速度,以及避免了对其他 Scheduler、CNI、CSI 等组件的负面影响。
- **应用的旁路管理**
OpenKruise 提供了多种通过旁路管理应用 sidecar 容器、多区域部署的方式,“旁路” 意味着你可以不需要修改应用的 Workloads 来实现它们。
比如SidecarSet 能帮助你在所有匹配的 Pod 创建的时候都注入特定的 sidecar 容器,甚至可以原地升级已经注入的 sidecar 容器镜像、并且对 Pod 中其他容器不造成影响。
而 WorkloadSpread 可以约束无状态 Workload 扩容出来 Pod 的区域分布,赋予单一 workload 的多区域和弹性部署的能力。
- **高可用性防护**
OpenKruise 在为应用的高可用性防护方面也做出了很多努力。
目前它可以保护你的 Kubernetes 资源不受级联删除机制的干扰,包括 CRD、Namespace、以及几乎全部的 Workloads 类型资源。
相比于 Kubernetes 原生的 PDB 只提供针对 Pod Eviction 的防护PodUnavailableBudget 能够防护 Pod Deletion、Eviction、Update 等许多种 voluntary disruption 场景。
- **高级的应用运维能力**
OpenKruise 也提供了很多高级的运维能力来帮助你更好地管理应用。
你可以通过 ImagePullJob 来在任意范围的节点上预先拉取某些镜像,或者指定某个 Pod 中的一个或多个容器被原地重启。
## 关系对比
### OpenKruise vs. Kubernetes
简单来说OpenKruise 对于 Kubernetes 是一个辅助扩展角色。
Kubernetes 自身已经提供了一些应用部署管理的功能,比如一些[基础工作负载](https://kubernetes.io/docs/concepts/workloads/)。
但对于大规模应用与集群的场景,这些基础功能是远远不够的。
OpenKruise 可以被很容易地安装到任意 Kubernetes 集群中,它弥补了 Kubernetes 在应用部署、升级、防护、运维 等领域的不足。
### OpenKruise vs. Platform-as-a-Service (PaaS)
OpenKruise **不是**一个 PaaS 平台,并且也**不会**提供任何 PaaS 层的能力。
它是一个 Kubernetes 的标准扩展套件,目前包括 `kruise-manager``kruise-daemon` 两个组件。
PaaS 平台可以通过使用 OpenKruise 提供的这些扩展功能,来使得应用部署、管理流程更加强大与高效。
## What's Next
接下来,我们推荐你:
- 开始 [安装使用 OpenKruise](./installation).
- 了解 OpenKruise 的 [系统架构](core-concepts/architecture).

View File

@ -0,0 +1,54 @@
---
title: AdvancedCronJob
---
AdvancedCronJob 是对于原生 CronJob 的扩展版本。
后者根据用户设置的 schedule 规则,周期性创建 Job 执行任务,而 AdvancedCronJob 的 template 支持多种不同的 job 资源:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: AdvancedCronJob
spec:
template:
# Option 1: use jobTemplate, which is equivalent to original CronJob
jobTemplate:
# ...
# Option 2: use broadcastJobTemplate, which will create a BroadcastJob object when cron schedule triggers
broadcastJobTemplate:
# ...
# Options 3(future): ...
```
- jobTemplate与原生 CronJob 一样创建 Job 执行任务
- broadcastJobTemplate周期性创建 [BroadcastJob](./broadcastjob) 执行任务
![AdvancedCronjob](/img/docs/user-manuals/advancedcronjob.png)
## 用例
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: AdvancedCronJob
metadata:
name: acj-test
spec:
schedule: "*/1 * * * *"
template:
broadcastJobTemplate:
spec:
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
completionPolicy:
type: Always
ttlSecondsAfterFinished: 30
```
上述 YAML 定义了一个 AdvancedCronJob每分钟创建一个 BroadcastJob 对象,这个 BroadcastJob 会在所有节点上运行一个 job 任务。

View File

@ -0,0 +1,149 @@
---
title: Advanced DaemonSet
---
这个控制器在原生 [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/) 基础上增强了发布能力,比如 灰度分批、按 Node label 选择、暂停、热升级等。
注意 `Advanced DaemonSet` 是一个 CRDkind 名字也是 `DaemonSet`,但是 apiVersion 是 `apps.kruise.io/v1alpha1`
这个 CRD 的所有默认字段、默认行为与原生 DaemonSet 完全一致,除此之外还提供了一些 optional 字段来扩展增强的策略。
因此,用户从原生 `DaemonSet` 迁移到 `Advanced DaemonSet`,只需要把 `apiVersion` 修改后提交即可:
```yaml
- apiVersion: apps/v1
+ apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
metadata:
name: sample-ds
spec:
#...
```
## 增强策略
在 RollingUpdateDaemonSet 中我们新增了以下字段:
```go
const (
+ // StandardRollingUpdateType replace the old daemons by new ones using rolling update i.e replace them on each node one after the other.
+ // this is the default type for RollingUpdate.
+ StandardRollingUpdateType RollingUpdateType = "Standard"
+ // SurgingRollingUpdateType replaces the old daemons by new ones using rolling update i.e replace them on each node one
+ // after the other, creating the new pod and then killing the old one.
+ SurgingRollingUpdateType RollingUpdateType = "Surging"
)
// Spec to control the desired behavior of daemon set rolling update.
type RollingUpdateDaemonSet struct {
+ // Type is to specify which kind of rollingUpdate.
+ Type RollingUpdateType `json:"rollingUpdateType,omitempty" protobuf:"bytes,1,opt,name=rollingUpdateType"`
// ...
MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty" protobuf:"bytes,2,opt,name=maxUnavailable"`
+ // A label query over nodes that are managed by the daemon set RollingUpdate.
+ // Must match in order to be controlled.
+ // It must match the node's labels.
+ Selector *metav1.LabelSelector `json:"selector,omitempty" protobuf:"bytes,3,opt,name=selector"`
+ // The number of DaemonSet pods remained to be old version.
+ // Default value is 0.
+ // Maximum value is status.DesiredNumberScheduled, which means no pod will be updated.
+ // +optional
+ Partition *int32 `json:"partition,omitempty" protobuf:"varint,4,opt,name=partition"`
+ // Indicates that the daemon set is paused and will not be processed by the
+ // daemon set controller.
+ // +optional
+ Paused *bool `json:"paused,omitempty" protobuf:"varint,5,opt,name=paused"`
+ // ...
+ MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty" protobuf:"bytes,7,opt,name=maxSurge"`
}
```
### 升级方式
Advanced DaemonSet 在 `spec.updateStrategy.rollingUpdate` 中有一个 `rollingUpdateType` 字段,标识了如何进行滚动升级:
- `Standard`: 对于每个 node控制器会先删除旧的 daemon Pod再创建一个新 Pod和原生 DaemonSet 行为一致。
- `Surging`: 对于每个 node控制器会先创建一个新 Pod等它 ready 之后再删除老 Pod。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
# ...
updateStrategy:
type: RollingUpdate
rollingUpdate:
rollingUpdateType: Standard
```
### Selector 标签选择升级
这个策略支持用户通过配置 node 标签的 selector来指定灰度升级某些特定类型 node 上的 Pod。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
# ...
updateStrategy:
type: RollingUpdate
rollingUpdate:
selector:
matchLabels:
nodeType: canary
```
### 分批灰度升级
Partition 的语义是 **保留旧版本 Pod 的数量**,默认为 `0`
如果在发布过程中设置了 `partition`,则控制器只会将 `(status.DesiredNumberScheduled - partition)` 数量的 Pod 更新到最新版本。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
# ...
updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 10
```
### 热升级
MaxSurge 是 DaemonSet Pod 最多能超出期望数量的 Pod 数,只有在 `rollingUpdateType=Surging` 的时候会生效。
MaxSurge 可以设置为绝对值或者一个百分比,控制器针对百分比会基于 status.desiredNumberScheduled 做计算并向上取整,默认值为 1。
比如当设置为 30% 时,最多有总数的 30% 的 node 上会同时有 2 个 Pod 在运行。
当新 Pod 变为 available 之后控制器会下线老 Pod然后开始更新下一个 node在整个过程中所有正常 Pod 数量不会超过总 node 数量的 130%。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
# ...
updateStrategy:
rollingUpdate:
rollingUpdateType: Surging
maxSurge: 30%
```
### 暂停升级
用户可以通过设置 paused 为 true 暂停发布,不过控制器还是会做 replicas 数量管理:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
# ...
updateStrategy:
rollingUpdate:
paused: true
```

View File

@ -0,0 +1,239 @@
---
title: Advanced StatefulSet
---
这个控制器在原生 [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/) 基础上增强了发布能力,比如 maxUnavailable 并行发布、原地升级等。
注意 `Advanced StatefulSet` 是一个 CRDkind 名字也是 `StatefulSet`,但是 apiVersion 是 `apps.kruise.io/v1beta1`
这个 CRD 的所有默认字段、默认行为与原生 StatefulSet 完全一致,除此之外还提供了一些 optional 字段来扩展增强的策略。
因此,用户从原生 `StatefulSet` 迁移到 `Advanced StatefulSet`,只需要把 `apiVersion` 修改后提交即可:
```yaml
- apiVersion: apps/v1
+ apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
metadata:
name: sample
spec:
#...
```
注意从 Kruise 0.7.0 开始Advanced StatefulSet 版本升级到了 `v1beta1`,并与 `v1alpha1` 兼容。对于低于 v0.7.0 版本的 Kruise只能使用 `v1alpha1`
## MaxUnavailable 最大不可用
Advanced StatefulSet 在 `RollingUpdateStatefulSetStrategy` 中新增了 `maxUnavailable` 策略来支持并行 Pod 发布,它会保证发布过程中最多有多少个 Pod 处于不可用状态。注意,`maxUnavailable` 只能配合 podManagementPolicy 为 `Parallel` 来使用。
这个策略的效果和 `Deployment` 中的类似,但是可能会导致发布过程中的 order 顺序不能严格保证。
如果不配置 `maxUnavailable`,它的默认值为 1也就是和原生 `StatefulSet` 一样只能 one by one 串行发布 Pod即使把 podManagementPolicy 配置为 `Parallel` 也是这样。
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
podManagementPolicy: Parallel
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 20%
```
比如说,一个 Advanced StatefulSet 下面有 P0 到 P4 五个 Pod并且应用能容忍 3 个副本不可用。
当我们把 StatefulSet 里的 Pod 升级版本的时候,可以通过以下步骤来做:
1. 设置 `maxUnavailable=3`
2. (可选) 如果需要灰度升级,设置 `partition=4`。Partition 默认的意思是 order 大于等于这个数值的 Pod 才会更新,在这里就只会更新 P4即使我们设置了 `maxUnavailable=3`
3. 在 P4 升级完成后,把 `partition` 调整为 0。此时控制器会同时升级 P1、P2、P3 三个 Pod。注意如果是原生 `StatefulSet`,只能串行升级 P3、P2、P1。
4. 一旦这三个 Pod 中有一个升级完成了,控制器会立即开始升级 P0。
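上述第 1、2 步灰度阶段对应的配置示意如下(字段即上文介绍的 `maxUnavailable` 与 `partition`,数值对应 P0~P4 的例子):
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 5
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 3
      partition: 4   # 先只灰度 P4完成后再把 partition 调整为 0
```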
## 原地升级
Advanced StatefulSet 增加了 `podUpdatePolicy` 来允许用户指定重建升级还是原地升级。
- `ReCreate`: 控制器会删除旧 Pod 和它的 PVC然后用新版本重新创建出来。
- `InPlaceIfPossible`: 控制器会优先尝试原地升级 Pod如果不行再采用重建升级。具体参考下方阅读文档。
- `InPlaceOnly`: 控制器只允许采用原地升级。因此,用户只能修改上一条中的限制字段,如果尝试修改其他字段会被 Kruise 拒绝。
**请阅读[该文档](../core-concepts/inplace-update)了解更多原地升级的细节。**
我们还在原地升级中提供了 **graceful period** 选项,作为优雅原地升级的策略。用户如果配置了 `gracePeriodSeconds` 这个字段,控制器在原地升级的过程中会先把 Pod status 改为 not-ready然后等一段时间`gracePeriodSeconds`),最后再去修改 Pod spec 中的镜像版本。
这样,就为 endpoints-controller 这些控制器留出了充足的时间来将 Pod 从 endpoints 端点列表中去除。
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
podManagementPolicy: Parallel
updateStrategy:
type: RollingUpdate
rollingUpdate:
podUpdatePolicy: InPlaceIfPossible
inPlaceUpdateStrategy:
gracePeriodSeconds: 10
```
**更重要的是**,如果使用 `InPlaceIfPossible``InPlaceOnly` 策略,必须要增加一个 `InPlaceUpdateReady` readinessGate用来在原地升级的时候控制器将 Pod 设置为 NotReady。
一个完整的原地升级 StatefulSet 例子如下:
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
metadata:
name: sample
spec:
replicas: 3
serviceName: fake-service
selector:
matchLabels:
app: sample
template:
metadata:
labels:
app: sample
spec:
readinessGates:
# A new condition that ensures the pod remains at NotReady state while the in-place update is happening
- conditionType: InPlaceUpdateReady
containers:
- name: main
image: nginx:alpine
podManagementPolicy: Parallel # allow parallel updates, works together with maxUnavailable
updateStrategy:
type: RollingUpdate
rollingUpdate:
# Do in-place update if possible, currently only image update is supported for in-place update
podUpdatePolicy: InPlaceIfPossible
# Allow parallel updates with max number of unavailable instances equals to 2
maxUnavailable: 2
```
## 升级顺序
Advanced StatefulSet 在 `spec.updateStrategy.rollingUpdate` 下面新增了 `unorderedUpdate` 结构,提供给不按 order 顺序的升级策略。
如果 `unorderedUpdate` 不为空,所有 Pod 的发布顺序就不一定会按照 order 顺序了。注意,`unorderedUpdate` 只能配合 Parallel podManagementPolicy 使用。
目前,`unorderedUpdate` 下面只包含 `priorityStrategy` 一个优先级策略。
### 优先级策略
这个策略定义了控制器计算 Pod 发布优先级的规则,所有需要更新的 Pod 都会通过这个优先级规则计算后排序。
目前 `priority` 可以通过 weight(权重) 和 order(序号) 两种方式来指定。
- `weight`: Pod 优先级是由所有 weights 列表中的 term 来计算 match selector 得出。如下:
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
updateStrategy:
rollingUpdate:
unorderedUpdate:
priorityStrategy:
weightPriority:
- weight: 50
matchSelector:
matchLabels:
test-key: foo
- weight: 30
matchSelector:
matchLabels:
test-key: bar
```
- `order`: Pod 优先级是由 orderKey 的 value 决定,这里要求对应的 value 的结尾能解析为 int 值。比如 value "5" 的优先级是 5value "sts-10" 的优先级是 10。
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
updateStrategy:
rollingUpdate:
unorderedUpdate:
priorityStrategy:
orderPriority:
- orderedKey: some-label-key
```
## 发布暂停
用户可以通过设置 paused 为 true 暂停发布,不过控制器还是会做 replicas 数量管理:
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
updateStrategy:
rollingUpdate:
paused: true
```
## 原地升级自动预热
**FEATURE STATE:** Kruise v0.10.0
如果你在[安装或升级 Kruise](../installation#optional-feature-gate) 的时候启用了 `PreDownloadImageForInPlaceUpdate` feature-gate
Advanced StatefulSet 控制器会自动在所有旧版本 pod 所在 node 节点上预热你正在灰度发布的新版本镜像。 这对于应用发布加速很有帮助。
默认情况下 Advanced StatefulSet 每个新镜像预热时的并发度都是 `1`,也就是一个个节点拉镜像。
如果需要调整,你可以在 annotation 上设置并发度:
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
metadata:
annotations:
apps.kruise.io/image-predownload-parallelism: "5"
```
注意,为了避免大部分不必要的镜像拉取,目前只针对 replicas > 3 的 Advanced StatefulSet 做自动预热。
## 序号保留(跳过)
从 Advanced StatefulSet 的 v1beta1 版本开始Kruise >= v0.7.0),支持序号保留功能。
通过在 `reserveOrdinals` 字段中写入需要保留的序号Advanced StatefulSet 会自动跳过创建这些序号的 Pod。如果 Pod 已经存在,则会被删除。
注意,`spec.replicas` 是期望运行的 Pod 数量,`spec.reserveOrdinals` 是要跳过的序号。
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
replicas: 4
reserveOrdinals:
- 1
```
对于一个 `replicas=4, reserveOrdinals=[1]` 的 Advanced StatefulSet实际运行的 Pod 序号为 `[0,2,3,4]`
- 如果要把 Pod-3 做迁移并保留序号,则把 `3` 追加到 `reserveOrdinals` 列表中。控制器会把 Pod-3 删除并创建 Pod-5此时运行中 Pod 为 `[0,2,4,5]`)。
- 如果只想删除 Pod-3则把 `3` 追加到 `reserveOrdinals` 列表并同时把 `replicas` 减一修改为 `3`。控制器会把 Pod-3 删除(此时运行中 Pod 为 `[0,2,4]`)。
## 流式扩容
**FEATURE STATE:** Kruise v0.10.0
为了避免在一个新 Advanced StatefulSet 创建后有大量失败的 pod 被创建出来,从 Kruise `v0.10.0` 版本开始引入了在 scale strategy 中的 `maxUnavailable` 策略。
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
replicas: 100
scaleStrategy:
maxUnavailable: 10% # percentage or absolute number
```
当这个字段被设置之后Advanced StatefulSet 会保证创建 pod 之后不可用 pod 数量不超过这个限制值。
比如说,上面这个 StatefulSet 一开始只会一次性创建 10 个 pod。在此之后每当一个 pod 变为 running、ready 状态后,才会再创建一个新 pod 出来。
注意,这个功能只允许在 podManagementPolicy 是 `Parallel` 的 StatefulSet 中使用。

View File

@ -0,0 +1,129 @@
---
title: BroadcastJob
---
这个控制器将 Pod 分发到集群中每个 node 上,类似于 [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/)
但是 BroadcastJob 管理的 Pod 并不是长期运行的 daemon 服务,而是类似于 [Job](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/) 的任务类型 Pod。
最终在每个 node 上的 Pod 都执行完成退出后BroadcastJob 和这些 Pod 并不会占用集群资源。
这个控制器非常有利于做升级基础软件、巡检等过一段时间需要在整个集群中跑一次的工作。
此外BroadcastJob 还可以维持每个 node 跑成功一个 Pod 任务。如果采取这种模式,当后续集群中新增 node 时 BroadcastJob 也会分发 Pod 任务上去执行。
## Spec 定义
### Template
`Template` 描述了 Pod 模板,用于创建任务 Pod。
注意,由于是任务类型的 Pod其中的 restart policy 只能设置为 `Never``OnFailure`,不允许设为 `Always`
### Parallelism
`Parallelism` 指定了最多能允许多少个 Pod 同时在执行任务,默认不做限制。
比如,一个集群里有 10 个 node、并设置了 `Parallelism` 为 3那么 BroadcastJob 会保证同时只会有 3 个 node 上的 Pod 在执行。每当一个 Pod 执行完成BroadcastJob 才会创建一个新 Pod 执行。
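例如,下面是一个示意性的 BroadcastJob其中 `parallelism: 3` 限制同一时刻最多只有 3 个节点上的 Pod 在执行任务(镜像与命令仅作示例):
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
  name: broadcastjob-parallelism   # 示例名称
spec:
  parallelism: 3                   # 同时最多 3 个 Pod 在执行
  template:
    spec:
      containers:
      - name: check
        image: busybox
        command: ["sh", "-c", "echo node check done"]
      restartPolicy: Never
  completionPolicy:
    type: Always
```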
### CompletionPolicy
`CompletionPolicy` 支持指定 BroadcastJob 控制器的 reconciling 行为,可以设置为 `Always``Never`
#### Always
`Always` 策略意味着 job 最终会完成,不管是执行成功还是失败了。在 `Always` 策略下还可以设置以下参数:
- `ActiveDeadlineSeconds`:指定一个超时时间,如果 BroadcastJob 开始运行超过了这个时间,所有还在跑着的 job 都会被停止、并标记为失败。
- `BackoffLimit`:指定一个重试次数,超过这个次数后才标记 job 失败默认没有限制。目前Pod 实际的重试次数是看 Pod status 中上报所有容器的 [ContainerStatus.RestartCount](https://github.com/kruiseio/kruise/blob/d61c12451d6a662736c4cfc48682fa75c73adcbc/vendor/k8s.io/api/core/v1/types.go#L2314) 重启次数。如果这个重启次数超过了 `BackoffLimit`,这个 job 就会被标记为失败、并把运行的 Pod 删除掉。
- `TTLSecondsAfterFinished` 限制了 BroadcastJob 在完成之后的存活时间,默认没有限制。比如设置了 `TTLSecondsAfterFinished` 为 10s那么当 job 结束后超过了 10s控制器就会把 job 和下面的所有 Pod 删掉。
#### Never
`Never` 策略意味着 BroadcastJob 永远都不会结束(标记为 Succeeded 或 Failed),即使当前 job 下面的 Pod 都已经执行成功了。
这也意味着 `ActiveDeadlineSeconds`、`BackoffLimit`、`TTLSecondsAfterFinished` 这三个参数是不能使用的。
比如说,用户希望对集群中每个 node 都下发一个配置,包括后续新增的 node 都需要做,那么就可以创建一个 `Never` 策略的 BroadcastJob。
## 例子
### 监控 BroadcastJob status
在一个单 node 集群中创建一个 BroadcastJob,执行 `kubectl get bcj`(BroadcastJob 的 short name),看到以下状态:
```shell
NAME DESIRED ACTIVE SUCCEEDED FAILED
broadcastjob-sample 1 0 1 0
```
- `Desired`: 期望的 Pod 数量(等同于当前集群中匹配的 node 数量)
- `Active`: 运行中的 Pod 数量
- `Succeeded`: 执行成功的 Pod 数量
- `Failed`: 执行失败的 Pod 数量
### ttlSecondsAfterFinished
创建 BroadcastJob 配置 `ttlSecondsAfterFinished` 为 30。
这个 job 会在执行结束后 30s 被删除。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
name: broadcastjob-ttl
spec:
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
completionPolicy:
type: Always
ttlSecondsAfterFinished: 30
```
### activeDeadlineSeconds
创建 BroadcastJob 配置 `activeDeadlineSeconds` 为 10。
这个 job 会在运行超过 10s 之后被标记为失败,并把下面还在运行的 Pod 删除掉。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
name: broadcastjob-active-deadline
spec:
template:
spec:
containers:
- name: sleep
image: busybox
command: ["sleep", "50000"]
restartPolicy: Never
completionPolicy:
type: Always
activeDeadlineSeconds: 10
```
### completionPolicy
创建 BroadcastJob 配置 `completionPolicy``Never`
这个 job 会持续运行即使当前所有 node 上的 Pod 都执行完成了。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
name: broadcastjob-never-complete
spec:
template:
spec:
containers:
- name: sleep
image: busybox
command: ["sleep", "5"]
restartPolicy: Never
completionPolicy:
type: Never
```

---
title: CloneSet
---
CloneSet 控制器提供了高效管理无状态应用的能力,它可以对标原生的 `Deployment`,但 `CloneSet` 提供了很多增强功能。
按照 Kruise 的[命名规范](/blog/workload-classification-guidance)CloneSet 是一个直接管理 Pod 的 **Set** 类型 workload。
一个简单的 CloneSet yaml 文件如下:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
labels:
app: sample
name: sample
spec:
replicas: 5
selector:
matchLabels:
app: sample
template:
metadata:
labels:
app: sample
spec:
containers:
- name: nginx
image: nginx:alpine
```
## 扩缩容功能
### 支持 PVC 模板
CloneSet 允许用户配置 PVC 模板 `volumeClaimTemplates`,用来给每个 Pod 生成独享的 PVC这是 `Deployment` 所不支持的。
如果用户没有指定这个模板CloneSet 会创建不带 PVC 的 Pod。
一些注意点:
- 每个被自动创建的 PVC 会有一个 ownerReference 指向 CloneSet因此 CloneSet 被删除时,它创建的所有 Pod 和 PVC 都会被删除。
- 每个被 CloneSet 创建的 Pod 和 PVC都会带一个 `apps.kruise.io/cloneset-instance-id: xxx` 的 label。关联的 Pod 和 PVC 会有相同的 **instance-id**,且它们的名字后缀都是这个 **instance-id**
- 如果一个 Pod 被 CloneSet controller 缩容删除时,这个 Pod 关联的 PVC 都会被一起删掉。
- 如果一个 Pod 被外部直接调用删除或驱逐时,这个 Pod 关联的 PVC 还都存在;并且 CloneSet controller 发现数量不足重新扩容时,新扩出来的 Pod 会复用原 Pod 的 **instance-id** 并关联原来的 PVC。
- 当 Pod 被**重建升级**时,关联的 PVC 会跟随 Pod 一起被删除、新建。
- 当 Pod 被**原地升级**时,关联的 PVC 会持续使用。
以下是一个带有 PVC 模板的例子:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
labels:
app: sample
name: sample-data
spec:
replicas: 5
selector:
matchLabels:
app: sample
template:
metadata:
labels:
app: sample
spec:
containers:
- name: nginx
image: nginx:alpine
volumeMounts:
- name: data-vol
mountPath: /usr/share/nginx/html
volumeClaimTemplates:
- metadata:
name: data-vol
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 20Gi
```
### 指定 Pod 缩容
当一个 CloneSet 被缩容时,有时候用户需要指定一些 Pod 来删除。这对于 `StatefulSet` 或者 `Deployment` 来说是无法实现的,因为 `StatefulSet` 要根据序号来删除 Pod`Deployment`/`ReplicaSet` 目前只能根据控制器里定义的排序来删除。
CloneSet 允许用户在缩小 `replicas` 数量的同时,指定想要删除的 Pod 名字。参考下面这个例子:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
replicas: 4
scaleStrategy:
podsToDelete:
- sample-9m4hp
```
当控制器收到上面这个 CloneSet 更新之后,会确保 replicas 数量为 4。如果 `podsToDelete` 列表里写了一些 Pod 名字,控制器会优先删除这些 Pod。
对于已经被删除的 Pod控制器会自动从 `podsToDelete` 列表中清理掉。
如果你只把 Pod 名字加到 `podsToDelete`,但没有修改 `replicas` 数量,那么控制器会先把指定的 Pod 删掉,然后再扩一个新的 Pod。
另一种直接删除 Pod 的方式是在要删除的 Pod 上打 `apps.kruise.io/specified-delete: true` 标签。
相比于手动直接删除 Pod,使用 `podsToDelete` 或 `apps.kruise.io/specified-delete: true` 方式会有 CloneSet 的 `maxUnavailable`/`maxSurge` 来保护删除,
并且会触发 `PreparingDelete` 生命周期 hook (见下文)。
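下面是一个示意的命令(Pod 名字沿用上文例子),演示通过打标签的方式指定删除:

```bash
# 给要删除的 Pod 打上指定删除标签,由 CloneSet 控制器在 maxUnavailable/maxSurge 保护下删除并补齐副本
kubectl label pod sample-9m4hp apps.kruise.io/specified-delete=true
```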
### 缩容顺序
1. 未调度 < 已调度
2. PodPending < PodUnknown < PodRunning
3. Not ready < ready
4. [较小 pod-deletion cost < 较大 pod-deletion cost](#pod-deletion-cost)
5. [较大打散权重 < 较小](#deletion-by-spread-constraints)
6. 处于 Ready 时间较短 < 较长
7. 容器重启次数较多 < 较少
8. 创建时间较短 < 较长
#### Pod deletion cost
**FEATURE STATE:** Kruise v0.9.0
[controller.kubernetes.io/pod-deletion-cost](https://kubernetes.io/docs/reference/labels-annotations-taints/#pod-deletion-cost)
是从 Kubernetes 1.21 版本后加入的 annotation,Deployment/ReplicaSet 在缩容时会参考这个 cost 数值来排序。
CloneSet 从 Kruise v0.9.0 版本后也同样支持了这个功能。
用户可以把这个 annotation 配置到 pod 上,值的范围在 [-2147483647, 2147483647]。
它表示这个 pod 相较于同个 CloneSet 下其他 pod 的 "删除代价",代价越小的 pod 删除优先级相对越高。
没有设置这个 annotation 的 pod 默认 deletion cost 是 0。
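下面是一个示意的 Pod 片段(数值仅为示例),演示如何设置该 annotation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    # 值越小,缩容时越优先被删除
    controller.kubernetes.io/pod-deletion-cost: "-100"
```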
#### Deletion by Spread Constraints
**FEATURE STATE:** Kruise v0.10.0
原始 proposal设计文档在[这里](https://github.com/openkruise/kruise/blob/master/docs/proposals/20210624-cloneset-scaledown-topology-spread.md)。
目前,CloneSet 支持 **按同节点打散** 和 **按 [pod topology spread constraints](https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/) 打散**。
如果在 CloneSet template 中存在 Pod Topology Spread Constraints 规则定义,则 controller 在这个 CloneSet 缩容的时候会根据 spread constraints 规则来打散并选择要删除的 pod。
否则controller 默认情况下是按同节点打散来选择要缩容的 pod。
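下面是一个示意的 template 片段(标签与取值为假设),当定义了这样的 spread constraints 时,缩容就会按该规则打散选择要删除的 pod:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  # ...
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: sample
```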
### 短 hash
**FEATURE STATE:** Kruise v0.9.0
默认情况下CloneSet 在 Pod label 中设置的 `controller-revision-hash` 值为 ControllerRevision 的完整名字,比如
```yaml
apiVersion: v1
kind: Pod
metadata:
labels:
controller-revision-hash: demo-cloneset-956df7994
```
它是通过 CloneSet 名字和 ControllerRevision hash 值拼接而成。
通常 hash 值长度为 8~10 个字符,而 Kubernetes 中的 label 值不能超过 63 个字符。
因此 CloneSet 的名字一般是不能超过 52 个字符的。
因此 `CloneSetShortHash` 这个新的 feature-gate 被引入。
如果它被打开CloneSet 会将 `controller-revision-hash` 的值只设置为 hash 值,比如 `956df7994`,因此 CloneSet 名字则不会有任何限制了。
不用担心,即使打开了 `CloneSetShortHash`CloneSet 仍然会识别和管理过去存量的 revision label 为完整格式的 Pod。
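开启方式与其他 feature-gate 一致,下面是一个示意的 helm 命令(chart 地址仅为占位):

```bash
$ helm install kruise https://... --set featureGates="CloneSetShortHash=true"
```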
### 流式扩容
**FEATURE STATE:** Kruise v1.0.0
CloneSet **扩容**时可以指定 `ScaleStrategy.MaxUnavailable` 来限制扩容的步长,以达到服务应用影响最小化的目的。
它可以设置为一个**绝对值**或者**百分比**,如果不填,默认值为 `nil`,即表示不设限制。
该字段可以配合 `Spec.MinReadySeconds` 字段使用, 例如:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
minReadySeconds: 60
scaleStrategy:
maxUnavailable: 1
```
上述配置能达到的效果是:在扩容时,只有当上一个扩容出的 Pod 已经 Ready 超过一分钟后CloneSet 才会执行创建下一个 Pod 的操作。
## 升级功能
### 升级类型
CloneSet 提供了 3 种升级方式,默认为 `ReCreate`
- `ReCreate`: 控制器会删除旧 Pod 和它的 PVC然后用新版本重新创建出来。
- `InPlaceIfPossible`: 控制器会优先尝试原地升级 Pod如果不行再采用重建升级。具体参考下方阅读文档。
- `InPlaceOnly`: 控制器只允许采用原地升级。因此,用户只能修改支持原地升级的字段(例如 image),如果尝试修改其他字段会被 Kruise 拒绝。
**请阅读[该文档](../core-concepts/inplace-update)了解更多原地升级的细节。**
我们还在原地升级中提供了 **graceful period** 选项,作为优雅原地升级的策略。用户如果配置了 `gracePeriodSeconds` 这个字段,控制器在原地升级的过程中会先把 Pod status 改为 not-ready,然后等一段时间(`gracePeriodSeconds`),最后再去修改 Pod spec 中的镜像版本。
这样,就为 endpoints-controller 这些控制器留出了充足的时间来将 Pod 从 endpoints 端点列表中去除。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
type: InPlaceIfPossible
inPlaceUpdateStrategy:
gracePeriodSeconds: 10
```
### Template 和 revision
`spec.template` 中定义了当前 CloneSet 中最新的 Pod 模板。
控制器会为每次更新过的 `spec.template` 计算一个 revision hash 值,比如针对开头的 CloneSet 例子,
控制器会为 template 计算出 revision hash 为 `sample-d4d4fb5bd` 并上报到 CloneSet status 中。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
generation: 1
# ...
spec:
replicas: 5
# ...
status:
observedGeneration: 1
readyReplicas: 5
replicas: 5
currentRevision: sample-d4d4fb5bd
updateRevision: sample-d4d4fb5bd
updatedReadyReplicas: 5
updatedReplicas: 5
# ...
```
这里是对 CloneSet status 中的字段说明:
- `status.replicas`: Pod 总数
- `status.readyReplicas`: **ready** Pod 数量
- `status.availableReplicas`: **ready and available** Pod 数量 (满足 `minReadySeconds`)
- `status.currentRevision`: 最近一次全量 Pod 推平版本的 revision hash 值
- `status.updateRevision`: 最新版本的 revision hash 值
- `status.updatedReplicas`: 最新版本的 Pod 数量
- `status.updatedReadyReplicas`: 最新版本的 **ready** Pod 数量
### Partition 分批灰度
Partition 的语义是 **保留旧版本 Pod 的数量或百分比**,默认为 `0`。这里的 `partition` 不表示任何 `order` 序号。
如果在发布过程中设置了 `partition`:
- 如果是数字,控制器会将 `(replicas - partition)` 数量的 Pod 更新到最新版本。
- 如果是百分比,控制器会将 `(replicas * (100% - partition))` 数量的 Pod 更新到最新版本。
比如,我们将 CloneSet 例子的 image 更新为 `nginx:mainline` 并且设置 `partition=3`。过了一会,查到的 CloneSet 如下:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
# ...
generation: 2
spec:
replicas: 5
template:
metadata:
labels:
app: sample
spec:
containers:
- image: nginx:mainline
imagePullPolicy: Always
name: nginx
updateStrategy:
partition: 3
# ...
status:
observedGeneration: 2
readyReplicas: 5
replicas: 5
currentRevision: sample-d4d4fb5bd
updateRevision: sample-56dfb978d4
updatedReadyReplicas: 2
updatedReplicas: 2
```
注意 `status.updateRevision` 已经更新为 `sample-56dfb978d4` 新的值。
由于我们设置了 `partition=3`,控制器只升级了 2 个 Pod。
```bash
$ kubectl get pod -L controller-revision-hash
NAME READY STATUS RESTARTS AGE CONTROLLER-REVISION-HASH
sample-chvnr 1/1 Running 0 6m46s sample-d4d4fb5bd
sample-j6c4s 1/1 Running 0 6m46s sample-d4d4fb5bd
sample-ns85c 1/1 Running 0 6m46s sample-d4d4fb5bd
sample-jnjdp 1/1 Running 0 10s sample-56dfb978d4
sample-qqglp 1/1 Running 0 18s sample-56dfb978d4
```
### 通过 partition 回滚
**FEATURE STATE:** Kruise v0.9.0
默认情况下,`partition` 只控制 Pod 更新到 `status.updateRevision` 新版本。
也就是说以上面这个 CloneSet 来看,当 `partition` 从 5 调整为 3 时,CloneSet 会升级 2 个 Pod 到 `status.updateRevision` 版本。
而当把 `partition` 从 3 改回 5 时,CloneSet 不会做任何事情。
但是如果你启用了 `CloneSetPartitionRollback` 这个 feature-gate,
上面这个场景下 CloneSet 会把 2 个 `status.updateRevision` 版本的 Pod 重新回滚为 `status.currentRevision` 版本。
### MaxUnavailable 最大不可用数量
MaxUnavailable 是 CloneSet 限制下属最多不可用的 Pod 数量。
它可以设置为一个**绝对值**或者**百分比**,如果不填 Kruise 会设置为默认值 `20%`
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
maxUnavailable: 20%
```
从 Kruise `v0.9.0` 版本开始,`maxUnavailable` 不仅会保护发布,也会对 Pod 指定删除生效。
也就是说用户通过 `podsToDelete` 或 `apps.kruise.io/specified-delete: true` 来指定一个 Pod 期望删除,
CloneSet 只会在当前不可用 Pod 数量(相对于 replicas 总数)小于 `maxUnavailable` 的时候才执行删除。
### MaxSurge 最大弹性数量
MaxSurge 是 CloneSet 控制最多能扩出来超过 `replicas` 的 Pod 数量。
它可以设置为一个**绝对值**或者**百分比**,如果不填 Kruise 会设置为默认值 `0`
如果发布的时候设置了 maxSurge,控制器会先多扩出来 `maxSurge` 数量的 Pod(此时 Pod 总数为 `(replicas+maxSurge)`),然后再开始发布存量的 Pod。
然后,当新版本 Pod 数量已经满足 `partition` 要求之后,控制器会再把多余的 `maxSurge` 数量的 Pod 删除掉,保证最终的 Pod 数量符合 `replicas`。
要说明的是,maxSurge 不允许配合 `InPlaceOnly` 更新模式使用。
另外,如果是与 `InPlaceIfPossible` 策略配合使用,控制器会先扩出来 `maxSurge` 数量的 Pod再对存量 Pod 做原地升级。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
maxSurge: 3
```
从 Kruise `v0.9.0` 版本开始,`maxSurge` 不仅会保护发布,也会对 Pod 指定删除生效。
也就是说用户通过 `podsToDelete` 或 `apps.kruise.io/specified-delete: true` 来指定一个 Pod 期望删除,
CloneSet 有可能会先创建一个新 Pod、等待它 ready 之后、再删除旧 Pod。这取决于当时的 `maxUnavailable` 和实际不可用 Pod 数量。
比如:
- 对于一个 CloneSet `maxUnavailable=2, maxSurge=1` 且有一个 `pod-a` 处于不可用状态,
如果你对另一个 `pod-b` 打标 `apps.kruise.io/specified-delete: true` 或将它的名字加入 `podsToDelete`
那么 CloneSet 会立即删除它,然后创建一个新 Pod。
- 对于一个 CloneSet `maxUnavailable=1, maxSurge=1` 且有一个 `pod-a` 处于不可用状态,
如果你对另一个 `pod-b` 打标 `apps.kruise.io/specified-delete: true` 或将它的名字加入 `podsToDelete`
那么 CloneSet 会先新建一个 Pod、等待它 ready最后再删除 `pod-b`
- 对于一个 CloneSet `maxUnavailable=1, maxSurge=1` 且有一个 `pod-a` 处于不可用状态,
如果你对这个 `pod-a` 打标 `apps.kruise.io/specified-delete: true` 或将它的名字加入 `podsToDelete`
那么 CloneSet 会立即删除它,然后创建一个新 Pod。
- ...
### 升级顺序
当控制器选择 Pod 做升级时,默认是有一套根据 Pod phase/conditions 的排序逻辑:
**unscheduled < scheduled, pending < unknown < running, not-ready < ready**
在此之外CloneSet 也提供了增强的 `priority`(优先级) 和 `scatter`(打散) 策略来允许用户自定义发布顺序。
#### 优先级策略
这个策略定义了控制器计算 Pod 发布优先级的规则,所有需要更新的 Pod 都会通过这个优先级规则计算后排序。
目前 `priority` 可以通过 weight(权重) 和 order(序号) 两种方式来指定。
- `weight`: Pod 优先级是由所有 weights 列表中的 term 来计算 match selector 得出。如下:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
priorityStrategy:
weightPriority:
- weight: 50
matchSelector:
matchLabels:
test-key: foo
- weight: 30
matchSelector:
matchLabels:
test-key: bar
```
- `order`: Pod 优先级是由 orderKey 的 value 决定,这里要求对应的 value 的结尾能解析为 int 值。比如 value "5" 的优先级是 5value "sts-10" 的优先级是 10。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
priorityStrategy:
orderPriority:
- orderedKey: some-label-key
```
#### 打散策略
这个策略定义了如何将一类 Pod 打散到整个发布过程中。
比如,针对一个 `replicas=10` 的 CloneSet,我们在 3 个 Pod 中添加了 `foo=bar` 标签、并设置对应的 scatter 策略,那么在发布的时候这 3 个 Pod 会排在第 1、6、10 个发布。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
scatterStrategy:
- key: foo
value: bar
```
注意:
- 尽管 `priority``scatter` 策略可以一起设置,但我们强烈推荐同时只用其中一个。
- 如果使用了 `scatter` 策略,我们强烈建议只配置一个 term key-value。否则实际的打散发布顺序可能会不太好理解。
最后要说明的是,上述发布顺序策略都要求用户自行为特定 Pod 打上相应标签,CloneSet 本身并不提供这种打标能力。
### 发布暂停
用户可以通过设置 paused 为 true 暂停发布,不过控制器还是会做 replicas 数量管理:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
paused: true
```
### 原地升级自动预热
**FEATURE STATE:** Kruise v0.9.0
如果你在[安装或升级 Kruise](../installation#optional-feature-gate) 的时候启用了 `PreDownloadImageForInPlaceUpdate` feature-gate,
CloneSet 控制器会自动在所有旧版本 pod 所在 node 节点上预热你正在灰度发布的新版本镜像。 这对于应用发布加速很有帮助。
默认情况下 CloneSet 每个新镜像预热时的并发度都是 `1`,也就是一个个节点拉镜像。
如果需要调整,你可以在 CloneSet annotation 上设置并发度:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
annotations:
apps.kruise.io/image-predownload-parallelism: "5"
```
注意,为了避免大部分不必要的镜像拉取,目前只针对 replicas > 3 的 CloneSet 做自动预热。
## 生命周期钩子
每个 CloneSet 管理的 Pod 会有明确所处的状态,在 Pod label 中的 `lifecycle.apps.kruise.io/state` 标记:
- Normal:正常状态
- PreparingUpdate:准备原地升级
- Updating:原地升级中
- Updated:原地升级完成
- PreparingDelete:准备删除
而生命周期钩子,则是通过在上述状态流转中卡点,来实现原地升级前后、删除前的自定义操作(比如开关流量、告警等)。
```golang
type LifecycleStateType string
// Lifecycle contains the hooks for Pod lifecycle.
type Lifecycle struct {
// PreDelete is the hook before Pod to be deleted.
PreDelete *LifecycleHook `json:"preDelete,omitempty"`
// InPlaceUpdate is the hook before Pod to update and after Pod has been updated.
InPlaceUpdate *LifecycleHook `json:"inPlaceUpdate,omitempty"`
}
type LifecycleHook struct {
LabelsHandler map[string]string `json:"labelsHandler,omitempty"`
FinalizersHandler []string `json:"finalizersHandler,omitempty"`
}
```
示例:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# 通过 finalizer 定义 hook
lifecycle:
preDelete:
finalizersHandler:
- example.io/unready-blocker
inPlaceUpdate:
finalizersHandler:
- example.io/unready-blocker
# 或者也可以通过 label 定义
lifecycle:
inPlaceUpdate:
labelsHandler:
example.io/block-unready: "true"
```
### 流转示意
![Lifecycle circulation](/img/docs/user-manuals/cloneset-lifecycle.png)
- 当 CloneSet 删除一个 Pod包括正常缩容和重建升级
- 如果没有定义 lifecycle hook 或者 Pod 不符合 preDelete 条件,则直接删除
- 否则,先只将 Pod 状态改为 `PreparingDelete`。等用户 controller 完成任务去掉 label/finalizer、Pod 不符合 preDelete 条件后kruise 才执行 Pod 删除
- 注意:`PreparingDelete` 状态的 Pod 处于删除阶段,不会被升级
- 当 CloneSet 原地升级一个 Pod 时:
- 升级之前,如果定义了 lifecycle hook 且 Pod 符合 inPlaceUpdate 条件,则将 Pod 状态改为 `PreparingUpdate`
- 等用户 controller 完成任务去掉 label/finalizer、Pod 不符合 inPlaceUpdate 条件后kruise 将 Pod 状态改为 `Updating` 并开始升级
- 升级完成后,如果定义了 lifecycle hook 且 Pod 不符合 inPlaceUpdate 条件,将 Pod 状态改为 `Updated`
- 等用户 controller 完成任务加上 label/finalizer、Pod 符合 inPlaceUpdate 条件后kruise 将 Pod 状态改为 `Normal` 并判断为升级成功
关于从 `PreparingDelete` 回到 `Normal` 状态,从设计上是支持的(通过撤销指定删除),但我们一般不建议这种用法。由于 `PreparingDelete` 状态的 Pod 不会被升级,当回到 `Normal` 状态后可能立即再进入发布阶段,对于用户处理 hook 是一个难题。
### 用户 controller 逻辑示例
按上述例子,可以定义:
- `example.io/unready-blocker` finalizer 作为 hook
- `example.io/initialing` annotation 作为初始化标记
在 CloneSet template 模板里带上这个字段:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
template:
metadata:
annotations:
example.io/initialing: "true"
finalizers:
- example.io/unready-blocker
# ...
lifecycle:
preDelete:
finalizersHandler:
- example.io/unready-blocker
inPlaceUpdate:
finalizersHandler:
- example.io/unready-blocker
```
而后用户 controller 的逻辑如下:
- 对于 `Normal` 状态的 Pod如果 annotation 中有 `example.io/initialing: true` 并且 Pod status 中的 ready condition 为 True则接入流量、去除这个 annotation
- 对于 `PreparingDelete``PreparingUpdate` 状态的 Pod切走流量并去除 `example.io/unready-blocker` finalizer
- 对于 `Updated` 状态的 Pod接入流量并打上 `example.io/unready-blocker` finalizer
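下面是一个基于 controller-runtime 的极简 Go 示意(并非官方实现;`switchTraffic` 等函数为假设存在的业务逻辑),用来说明上述用户 controller 的处理流程:

```go
package samplecontroller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const (
	stateLabel       = "lifecycle.apps.kruise.io/state"
	initialingAnno   = "example.io/initialing"
	blockerFinalizer = "example.io/unready-blocker"
)

// PodLifecycleReconciler 仅为示意:假设流量的接入/摘除由 switchTraffic 完成。
type PodLifecycleReconciler struct {
	client.Client
}

func (r *PodLifecycleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	pod := &corev1.Pod{}
	if err := r.Get(ctx, req.NamespacedName, pod); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	switch pod.Labels[stateLabel] {
	case "Normal":
		// 初始化:Pod ready 后接入流量,并去掉 initialing annotation
		if pod.Annotations[initialingAnno] == "true" && isPodReady(pod) {
			switchTraffic(pod, true)
			delete(pod.Annotations, initialingAnno)
			return ctrl.Result{}, r.Update(ctx, pod)
		}
	case "PreparingDelete", "PreparingUpdate":
		// 删除/原地升级之前:切走流量,再去掉 finalizer,Kruise 才会继续执行
		switchTraffic(pod, false)
		if controllerutil.ContainsFinalizer(pod, blockerFinalizer) {
			controllerutil.RemoveFinalizer(pod, blockerFinalizer)
			return ctrl.Result{}, r.Update(ctx, pod)
		}
	case "Updated":
		// 原地升级完成:接入流量并重新打上 finalizer,Kruise 随后会把状态置回 Normal
		switchTraffic(pod, true)
		if !controllerutil.ContainsFinalizer(pod, blockerFinalizer) {
			controllerutil.AddFinalizer(pod, blockerFinalizer)
			return ctrl.Result{}, r.Update(ctx, pod)
		}
	}
	return ctrl.Result{}, nil
}

func isPodReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// switchTraffic 为假设存在的业务函数,例如调用注册中心或网关 API 完成流量开关。
func switchTraffic(pod *corev1.Pod, enable bool) {}
```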

---
title: Container Launch Priority
---
**FEATURE STATE:** Kruise v1.0.0
Container Launch Priority 提供了控制一个 Pod 中容器启动顺序的方法。
> 通常来说 Pod 容器的启动和退出顺序是由 Kubelet 管理的。Kubernetes 曾经有一个 [KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/753-sidecar-containers) 计划在 container 中增加一个 type 字段来标识不同类型容器的启停优先级。
> 但是由于[sig-node考虑到对现有代码架构的改动太大](https://github.com/kubernetes/enhancements/issues/753#issuecomment-713471597),它已经被拒绝了。
注意,这个功能作用在 Pod 对象上,不管它的 owner 是什么类型的,因此可以适用于 Deployment、CloneSet 以及其他的 workload 种类。
## 用法
### 按照 container 顺序启动
只需要在 Pod 中定义一个 annotation 即可:
```yaml
apiVersion: v1
kind: Pod
annotations:
apps.kruise.io/container-launch-priority: Ordered
spec:
containers:
- name: sidecar
# ...
- name: main
# ...
```
Kruise 会保证前面的容器sidecar会在后面容器main之前启动。
### 按自定义顺序启动
需要在 Pod container 中添加 `KRUISE_CONTAINER_PRIORITY` 环境变量:
```yaml
apiVersion: v1
kind: Pod
spec:
containers:
- name: main
# ...
- name: sidecar
env:
- name: KRUISE_CONTAINER_PRIORITY
value: "1"
# ...
```
1. 值的范围在 `[-2147483647, 2147483647]`,不写默认是 `0`
2. 权重高的容器,会保证在权重低的容器之前启动。
3. 相同权重的容器不保证启动顺序。
## 使用要求
使用 ContainerLaunchPriority 功能需要打开 `PodWebhook` feature-gate(默认就是打开的,除非显式关闭)。
## 实现细节
Kruise webhook 会处理所有 Pod 创建的请求。
当 webhook 发现 Pod 中带有 `apps.kruise.io/container-launch-priority` annotation 或是 `KRUISE_CONTAINER_PRIORITY` 环境变量,则在它的每个容器中注入 `KRUISE_CONTAINER_BARRIER` 环境变量。
`KRUISE_CONTAINER_BARRIER` 环境变量是 value from 一个名为 `{pod-name}-barrier` 的 ConfigMap,key 与这个容器的权重相对应。比如:
```yaml
apiVersion: v1
kind: Pod
spec:
containers:
- name: main
# ...
env:
- name: KRUISE_CONTAINER_BARRIER
valueFrom:
configMapKeyRef:
name: {pod-name}-barrier
key: "p_0"
- name: sidecar
env:
- name: KRUISE_CONTAINER_PRIORITY
value: "1"
- name: KRUISE_CONTAINER_BARRIER
valueFrom:
configMapKeyRef:
name: {pod-name}-barrier
key: "p_1"
# ...
```
然后 Kruise controller 会创建一个空的 ConfigMap,并按照权重顺序以及 Pod 中容器的启动状态,逐渐将 key 加入到 ConfigMap 中。
以上面的例子来看,controller 会先加入 `p_1` key,等待 sidecar 容器启动成功后,再加入 `p_0` key 来允许 Kubelet 启动 main 容器。
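以下是一个示意的 ConfigMap(仅用于说明,key 对应的 value 内容并非真实实现的约定),展示 sidecar 启动成功、main 容器被放行后的样子:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: {pod-name}-barrier    # 由 Kruise controller 在 Pod 同 namespace 下创建
data:
  p_1: "true"    # sidecar(权重 1)的 key 先被加入
  p_0: "true"    # sidecar 启动成功后再加入,kubelet 随即可以启动 main 容器
```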
另外,在 Pod 启动的过程中,用 kubectl 可能会看到 Pod 处于 `CreateContainerConfigError` 状态,这是由于 Kubelet 没有找到部分容器的 ConfigMap key 导致的,在全部容器启动完成后会消失。

---
title: Container Restart
---
**FEATURE STATE:** Kruise v0.9.0
ContainerRecreateRequest 可以帮助用户**重启/重建**存量 Pod 中一个或多个容器。
和 Kruise 提供的原地升级类似,当一个容器重建的时候,Pod 中的其他容器还保持正常运行。重建完成后,Pod 中除了该容器的 restartCount 增加以外不会有什么其他变化。
注意,之前临时写到旧容器 **rootfs** 中的文件会丢失,但是 volume mount 挂载卷中的数据都还存在。
这个功能依赖于 `kruise-daemon` 组件来停止 Pod 容器。
如果 `KruiseDaemon` feature-gate 被关闭了,ContainerRecreateRequest 也将无法使用。
## 使用方法
### 提交请求
为要重建容器的 Pod 提交一个 `ContainerRecreateRequest` 自定义资源(缩写 `CRR`):
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
metadata:
namespace: pod-namespace
name: xxx
spec:
podName: pod-name
containers: # 要重建的容器名字列表,至少要有 1 个
- name: app
- name: sidecar
strategy:
failurePolicy: Fail # 'Fail' 或 'Ignore',表示一旦有某个容器停止或重建失败, CRR 立即结束
orderedRecreate: false # 'true' 表示要等前一个容器重建完成了,再开始重建下一个
terminationGracePeriodSeconds: 30 # 等待容器优雅退出的时间,不填默认用 Pod 中定义的
unreadyGracePeriodSeconds: 3 # 在重建之前先把 Pod 设为 not ready并等待这段时间后再开始执行重建
minStartedSeconds: 10 # 重建后新容器至少保持运行这段时间,才认为该容器重建成功
activeDeadlineSeconds: 300 # 如果 CRR 执行超过这个时间,则直接标记为结束(未结束的容器标记为失败)
ttlSecondsAfterFinished: 1800 # CRR 结束后,过了这段时间自动被删除掉
```
*所有 `strategy` 中的字段、以及 `spec` 中的 `activeDeadlineSeconds`/`ttlSecondsAfterFinished` 都是可选的。*
1. 一般来说,列表中的容器会一个个被停止,但可能同时在被重建和启动,除非 `orderedRecreate` 被设置为 `true`
2. `unreadyGracePeriodSeconds` 功能依赖于 `KruisePodReadinessGate` 这个 feature-gate 要打开,后者会在每个 Pod 创建的时候注入一个 readinessGate。
否则,默认只会给 Kruise workload 创建的 Pod 注入 readinessGate也就是说只有这些 Pod 才能在 CRR 重建时使用 `unreadyGracePeriodSeconds`
```bash
# for commandline you can
$ kubectl get containerrecreaterequest -n pod-namespace
# or just short name
$ kubectl get crr -n pod-namespace
```
### 检查状态
CRR status 如下:
```yaml
status:
completionTime: "2021-03-22T11:53:48Z"
containerRecreateStates:
- name: app
phase: Succeeded
- name: sidecar
phase: Succeeded
phase: Completed
```
`status.phase` 包括:
- `Pending`: CRR 等待被执行
- `Recreating`: CRR 正在被执行
- `Completed`: CRR 已经执行完成,完成时间在 `status.completionTime` 字段可见
注意,`status.phase=Completed` 只表示 CRR 完成,并不代表 CRR 中声明的容器都重建成功了,因此还需要检查 `status.containerRecreateStates` 中的信息。
`status.containerRecreateStates[x].phase` 包括:
- `Pending`: container 等待被重建
- `Recreating`: container 正在被重建
- `Failed`: container 重建失败,此时 `status.containerRecreateStates[x].message` 应有错误信息
- `Succeeded`: container 重建成功
**因此,当 CRR 结束了,只有上述 container 状态是 `Succeeded` phase 的才表示重建成功了。**
## 实现介绍
当用户创建了一个 CRRKruise webhook 会把当时容器的 containerID/restartCount 记录到 `spec.containers[x].statusContext` 之中。
**kruise-daemon** 执行的过程中,如果它发现实际容器当前的 containerID 与 `statusContext` 不一致或 restartCount 已经变大,
则认为容器已经被重建成功了(比如可能发生了一次原地升级)。
![ContainerRecreateRequest](/img/docs/user-manuals/containerrecreaterequest.png)
一般情况下,**kruise-daemon** 会执行 preStop hook 后把容器停掉,然后 **kubelet** 感知到容器退出,则会新建一个容器并启动。
最后 **kruise-daemon** 看到新容器已经启动成功超过 `minStartedSeconds` 时间后,会上报这个容器的 phase 状态为 `Succeeded`
如果容器重建和原地升级操作同时触发了:
- 如果 **Kubelet** 根据原地升级要求已经停止或重建了容器,**kruise-daemon** 会判断容器重建已经完成。
- 如果 **kruise-daemon** 先停了容器,**Kubelet** 会继续执行原地升级,即创建一个新版本容器并启动。
如果针对一个 Pod 提交了多个 ContainerRecreateRequest 资源,会按时间先后一个个执行。

---
title: Deletion Protection
---
**FEATURE STATE:** Kruise v0.9.0
该功能提供了一个安全策略,用来在 Kubernetes 级联删除的机制下保护用户的资源和应用可用性。
## 使用方式
首先,需要在[安装或升级 Kruise](../installation#optional-feature-gate) 的时候启用 `ResourcesDeletionProtection` feature-gate。
然后,用户可以给一些特定资源对象加上 `policy.kruise.io/delete-protection` 标签,值可以是:
- `Always`: 这个对象禁止被删除,除非上述 label 被去掉
- `Cascading`: 这个对象如果还有可用的下属资源,则禁止被删除
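下面是一个示意的命令(namespace 名字仅为示例),为一个 Namespace 加上级联删除防护:

```bash
# 仅当 namespace 下已经没有正常运行的 Pod 时才允许删除
kubectl label namespace ns-demo policy.kruise.io/delete-protection=Cascading
```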
目前支持的资源类型、以及 cascading 级联关系如下:
| Kind | Group | Version | **Cascading** judgement |
| --------------------------- | ---------------------- | ------------------ | -----------------------------------
| `Namespace` | core | v1 | namespace 下是否还有正常的 Pod |
| `CustomResourceDefinition` | apiextensions.k8s.io | v1beta1, v1 | CRD 下是否还有存量的 CR |
| `Deployment` | apps | v1 | replicas 是否为 0 |
| `StatefulSet` | apps | v1 | replicas 是否为 0 |
| `ReplicaSet` | apps | v1 | replicas 是否为 0 |
| `CloneSet` | apps.kruise.io | v1alpha1 | replicas 是否为 0 |
| `StatefulSet` | apps.kruise.io | v1alpha1, v1beta1 | replicas 是否为 0 |
| `UnitedDeployment` | apps.kruise.io | v1alpha1 | replicas 是否为 0 |
## 风险
通过 [webhook configuration](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#webhook-configuration) 的 `objectSelector` 字段,
Kruise webhook 只会拦截处理带有 `policy.kruise.io/delete-protection` 标签的 `Namespace/CustomResourceDefinition/Deployment/StatefulSet/ReplicaSet` 资源。
因此,如果所有 kruise-manager pod 都挂了或者处于异常的状态kube-apiserver 调用 deletion webhook 失败,
只有带有 `policy.kruise.io/delete-protection` 标签的上述资源才会暂时无法删除。

---
title: ImagePullJob
---
NodeImage 和 ImagePullJob 是从 Kruise v0.8.0 版本开始提供的 CRD。
Kruise 会自动为每个 Node 创建一个 NodeImage它包含了哪些镜像需要在这个 Node 上做预热。
用户能创建 ImagePullJob 对象,来指定一个镜像要在哪些 Node 上做预热。
![Image Pulling](/img/docs/user-manuals/imagepulling.png)
注意NodeImage 是一个**偏底层的 API**,一般只在你要明确在某一个节点上做一次预热的时候才使用,否则你都应该**使用 ImagePullJob 来指定某个镜像在一批节点上做预热**。
## ImagePullJob (high-level)
ImagePullJob 是一个 **namespaced-scope** 的资源。
API 定义: https://github.com/openkruise/kruise/blob/master/apis/apps/v1alpha1/imagepulljob_types.go
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
metadata:
name: job-with-always
spec:
image: nginx:1.9.1 # [required] 完整的镜像名 name:tag
  parallelism: 10 # [optional] 最大并发拉取的节点数量, 默认为 1
selector: # [optional] 指定节点的 名字列表 或 标签选择器 (只能设置其中一种)
names:
- node-1
- node-2
matchLabels:
node-type: xxx
# podSelector: # [optional] pod label 选择器来在这些 pod 所在节点上拉取镜像, 与 selector 不能同时设置.
# pod-label: xxx
completionPolicy:
type: Always # [optional] 默认为 Always
    activeDeadlineSeconds: 1200 # [optional] 无默认值, 只对 Always 类型生效
    ttlSecondsAfterFinished: 300 # [optional] 无默认值, 只对 Always 类型生效
pullPolicy: # [optional] 默认 backoffLimit=3, timeoutSeconds=600
backoffLimit: 3
timeoutSeconds: 300
```
你可以在 `selector` 字段中指定节点的 名字列表 或 标签选择器 **(只能设置其中一种)**,如果没有设置 `selector` 则会选择所有节点做预热。
或者你可以配置 `podSelector` 来在这些 pod 所在节点上拉取镜像,`podSelector` 与 `selector` 不能同时设置。
同时ImagePullJob 有两种 completionPolicy 类型:
- `Always` 表示这个 job 是一次性预热,不管成功、失败都会结束
- `activeDeadlineSeconds`: 整个 job 的 deadline 结束时间
- `ttlSecondsAfterFinished`: 结束后超过这个时间,自动清理删除 job
- `Never` 表示这个 job 是长期运行、不会结束,并且会每天都会在匹配的节点上重新预热一次指定的镜像
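下面是一个示意的 `Never` 类型例子(名字仅为示例),它会长期在匹配的节点上每天重新预热一次镜像:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
metadata:
  name: job-with-never
spec:
  image: nginx:1.9.1
  parallelism: 10
  selector:
    matchLabels:
      node-type: xxx
  completionPolicy:
    type: Never   # 长期运行、不会结束,每天在匹配的节点上重新预热一次
```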
### 配置 secrets
如果这个镜像来自一个私有仓库,你可能需要配置一些 secret
```yaml
# ...
spec:
pullSecrets:
- secret-name1
- secret-name2
```
因为 ImagePullJob 是一种 namespaced-scope 资源,这些 secret 必须存在 ImagePullJob 所在的 namespace 中。
然后你只需要在 `pullSecrets` 字段中写上这些 secret 的名字即可。
## NodeImage (low-level)
NodeImage 是一个 **cluster-scope** 的资源。
API 定义: https://github.com/openkruise/kruise/blob/master/apis/apps/v1alpha1/nodeimage_types.go
当 Kruise 被安装后nodeimage-controller 会自动为每个 Node 创建一个同名的 NodeImage。
并且当 Node 发生伸缩时nodeimage-controller 也会对应的创建或删除 NodeImage。
除此之外nodeimage-controller 也会将 Node 上的 labels 标签持续同步到 NodeImage 上面,因此对应的 NodeImage 与 Node 拥有相同的名字和标签。
用户可以用 Node 名字来查询一个 NodeImage或者用 Node labels 做 selector 来查询一批 NodeImage。
通常来说一个空的 NodeImage 如下:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: NodeImage
metadata:
labels:
kubernetes.io/arch: amd64
kubernetes.io/os: linux
# ...
name: node-xxx
# ...
spec: {}
status:
desired: 0
failed: 0
pulling: 0
succeeded: 0
```
如果你希望在这个节点上拉取一个 `ubuntu:latest` 镜像,你可以用以下两种方式:
1. 执行 `kubectl edit nodeimage node-xxx` 并将以下写入其中(忽略注释):
```yaml
# ...
spec:
images:
ubuntu: # 镜像 name
tags:
- tag: latest # 镜像 tag
pullPolicy:
ttlSecondsAfterFinished: 300 # [required] 拉取完成(成功或失败)超过 300s 后,将这个任务从 NodeImage 中清除
timeoutSeconds: 600 # [optional] 每一次拉取的超时时间, 默认为 600
backoffLimit: 3 # [optional] 拉取的重试次数,默认为 3
activeDeadlineSeconds: 1200 # [optional] 整个任务的超时时间,无默认值
```
2. `kubectl patch nodeimage node-xxx --type=merge -p '{"spec":{"images":{"ubuntu":{"tags":[{"tag":"latest","pullPolicy":{"ttlSecondsAfterFinished":300}}]}}}}'`
你可以执行 `kubectl get nodeimage node-xxx -o yaml`,从 status 中看到拉取进度以及结果,并且你会发现拉取完成 300s 后任务会被清除。

---
title: PodUnavailableBudget
---
**FEATURE STATE:** Kruise v0.10.0
在诸多[Voluntary Disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/) 场景中 Kubernetes [Pod Disruption Budget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)
通过限制同时中断的 Pod 数量来保证应用的高可用性。然而,PDB 只能防控通过 [Eviction API](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#eviction-api) 来触发的 Pod Disruption,例如 kubectl drain 驱逐 node 上面的所有 Pod。
但在如下 voluntary disruption 场景中,即便有 kubernetes PDB 防护,依然会导致业务中断、服务降级:
1. 应用owner通过deployment正在进行版本升级与此同时集群管理员由于机器资源利用率过低正在进行node缩容。
2. 中间件团队利用sidecarSet正在原地升级集群中的sidecar版本例如ServiceMesh envoy同时HPA正在对同一批应用进行缩容。
3. 应用owner和中间件团队利用cloneSet、sidecarSet原地升级的能力正在对同一批Pod进行升级。
在上面这些 kubernetes PDB 无法很好防护的场景中,Kruise PodUnavailableBudget 通过对 Pod Validating Webhook 的拦截,能够覆盖更多的 Voluntary Disruption 场景,进而为应用提供更加强大的防护能力。
一个简单的例子如下:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: PodUnavailableBudget
metadata:
name: web-server-pub
namespace: web
spec:
targetRef:
apiVersion: apps.kruise.io/v1alpha1
# cloneset, deployment, statefulset etc.
kind: CloneSet
name: web-server
# selector label query over pods managed by the budget
# selector and TargetReference are mutually exclusive, targetRef is priority to take effect.
# selector is commonly used in scenarios where applications are deployed using multiple workloads,
# and targetRef is used for protection against a single workload.
# selector:
# matchLabels:
# app: web-server
# maximum number of Pods unavailable for the current cloneset, the example is cloneset.replicas(5) * 60% = 3
# maxUnavailable and minAvailable are mutually exclusive, maxUnavailable is priority to take effect
maxUnavailable: 60%
# Minimum number of Pods available for the current cloneset, the example is cloneset.replicas(5) * 40% = 2
# minAvailable: 40%
---
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
labels:
app: web-server
name: web-server
namespace: web
spec:
replicas: 5
selector:
matchLabels:
app: web-server
template:
metadata:
labels:
app: web-server
spec:
containers:
- name: nginx
image: nginx:alpine
```
## Implementation
PUB实现原理如下详细设计请参考[Pub Proposal](https://github.com/openkruise/kruise/blob/master/docs/proposals/20210614-podunavailablebudget.md)
![PodUnavailableBudget](/img/docs/user-manuals/podunavailablebudget.png)
## Comparison with Kubernetes native PDB
Kubernetes PDB是通过Eviction API接口来实现Pod安全防护而Kruise PDB则是拦截了Pod Validating Request来实现诸多Voluntary Disruption场景的防护能力。
**Kruise PUB 包含了 PDB 的所有能力(防护 Pod Eviction),业务可以根据需要两者同时使用,也可以单独使用 Kruise PUB(推荐方式)。**
## feature-gates
PodUnavailableBudget Pod 安全防护默认是关闭的,如果要开启,请设置 feature-gates *PodUnavailableBudgetDeleteGate* 和 *PodUnavailableBudgetUpdateGate*:
```bash
$ helm install kruise https://... --set featureGates="PodUnavailableBudgetDeleteGate=true\,PodUnavailableBudgetUpdateGate=true"
```
## PodUnavailableBudget Status
```yaml
# kubectl describe podunavailablebudgets web-server-pub
Name: web-server-pub
Kind: PodUnavailableBudget
Status:
unavailableAllowed: 3 # unavailableAllowed number of pod unavailable that are currently allowed
currentAvailable: 5 # currentAvailable current number of available pods
desiredAvailable: 2 # desiredAvailable minimum desired number of available pods
totalReplicas: 5 # totalReplicas total number of pods counted by this PUB
```

---
title: ResourceDistribution
---
在对 Secret、ConfigMap 等 namespace-scoped 资源进行跨 namespace 分发及同步的场景中,原生 kubernetes 目前只支持用户 one-by-one 地进行手动分发与同步,十分地不方便。
典型的案例有:
- 当用户需要使用 SidecarSet 的 imagePullSecrets 能力时,要先重复地在相关 namespaces 中创建对应的 Secret并且需要确保这些 Secret 配置的正确性和一致性。
- 当用户想要采用 ConfigMap 来配置一些**通用**的环境变量时,往往需要在多个 namespaces 做 ConfigMap 的下发,并且后续的修改往往也要求多 namespaces 之间保持同步。
因此,面对这些需要跨 namespaces 进行资源分发和**多次同步**的场景,我们期望有一种更便捷的分发和同步工具来自动化地去做这件事。为此,我们设计并实现了一个新的 CRD --- **ResourceDistribution**。
ResourceDistribution 目前支持 **Secret****ConfigMap** 两类资源的分发和同步。
## API 说明
ResourceDistribution是一类 **cluster-scoped** 的 CRD其主要由 **`resource`** 和 **`targets`** 两个字段构成,其中 **`resource`** 字段用于描述用户所要分发的资源,**`targets`** 字段用于描述用户所要分发的目标命名空间。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
name: sample
spec:
resource:
... ...
targets:
... ...
```
### Resource 字段说明
**`resource`** 字段必须是一个完整、正确的资源描述。
一个配置正确的 **`resource`** 例子如下所示:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
name: sample
spec:
resource:
apiVersion: v1
kind: ConfigMap
metadata:
name: game-demo
data:
game.properties: |
enemy.types=aliens,monsters
player.maximum-lives=5
player_initial_lives: "3"
ui_properties_file_name: user-interface.properties
user-interface.properties: |
color.good=purple
color.bad=yellow
allow.textmode=true
targets:
... ...
```
Tips: 用户可以先在本地某个命名空间中创建相应资源并进行测试,确认资源配置正确后再拷贝过来。
### Target 字段说明
**`targets`** 字段目前支持四种规则来描述用户所要分发的目标命名空间,包括 `allNamespaces`、`includedNamespaces`、`namespaceLabelSelector` 以及 `excludedNamespaces`
- `allNamespaces`: bool值如果为`true`,则分发至所有命名空间;
- `includedNamespaces`: 通过 Name 来匹配目标命名空间;
- `namespaceLabelSelector`:通过 LabelSelector 来匹配目标命名空间;
- `excludedNamespaces`: 通过 Name 来排除某些不想分发的命名空间;
**目标命名空间的计算规则:**
1. 初始化目标命名空间 *T* = ∅;
2. 如果用户设置了 `allNamespaces=true`,则 *T* 会匹配所有命名空间;
3. 将`includedNamespaces`中列出的命名空间加入 *T*
4. 将与`namespaceLabelSelector`匹配的命名空间加入 *T*
5. 将`excludedNamespaces`中列出的命名空间从 *T* 中剔除;
**`allNamespaces`、`includedNamespaces`、`namespaceLabelSelector` 之间是 或(OR) 的关系,而`excludedNamespaces`一旦被配置则会显式地排除掉这些命名空间。另外targets还将自动忽略kube-system 和 kube-public 两个命名空间。**
一个配置正确的targets字段如下所示
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
name: sample
spec:
resource:
... ...
targets:
includedNamespaces:
list:
- name: ns-1
- name: ns-4
namespaceLabelSelector:
matchLabels:
group: test
excludedNamespaces:
list:
- name: ns-3
```
上例中,该 ResourceDistribution 的目标命名空间一定会包含 ns-1 和 ns-4,并且 Labels 满足 `namespaceLabelSelector` 的命名空间也会被包含进目标命名空间;但是 ns-3 即使满足 `namespaceLabelSelector` 也不会被包含,因为它已经在 `excludedNamespaces` 中被显式地排除了。
## 完整用例
### 分发资源
当用户将 ResourceDistribution 的 resource 和 targets 两个字段正确配置,并创建这个 ResourceDistribution 资源后,相应的 Controller 会执行资源分发逻辑,这一资源会自动地在各个目标命名空间中创建。一个完整的用例如下所示:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
name: sample
spec:
resource:
apiVersion: v1
kind: ConfigMap
metadata:
name: game-demo
data:
game.properties: |
enemy.types=aliens,monsters
player.maximum-lives=5
player_initial_lives: "3"
ui_properties_file_name: user-interface.properties
user-interface.properties: |
color.good=purple
color.bad=yellow
allow.textmode=true
targets:
excludedNamespaces:
list:
- name: ns-3
includedNamespaces:
list:
- name: ns-1
- name: ns-4
namespaceLabelSelector:
matchLabels:
group: test
```
### 分发状态跟踪
当然,资源分发并不总是成功的,在分发的过程中可能会遇到各种各样的错误,导致分发失败。为此,我们在 ResourceDistribution.Status 字段中记录了资源分发的一些状态,以便用户对其进行追踪。
首先,Status 记录了目标命名空间总数(Desired)、成功分发的目标命名空间数量(Succeeded)、以及失败的目标命名空间数量(Failed):
```yaml
status:
Desired: 3
Failed: 1
Succeeded: 2
```
为了进一步方便用户了解分发失败的原因及地点(命名空间),ResourceDistribution 还对分发错误类型进行了归纳整理,总共分为六类,并记录在 status.conditions 之中:
- 四类 condition 记录了操作资源时出现失败的相关原因,即记录资源的 Get、Create、 Update 和 Delete 四类操作出现的错误信息以及对应的失败命名空间;
- 一类 condition 记录了命名空间不存在的错误;
- 一类 condition 记录了资源冲突的情况即目标命名空间中已经存在Name、Kind、APIVersion都相同的资源且该资源不是该ResourceDistribution分发则会发生资源冲突相应的命名空间会被记录下来。
```yaml
Status:
Conditions:
Last Transition Time: 2021-09-06T08:42:28Z
Reason: Succeeded
Status: False
Type: GetResourceFailed
Last Transition Time: 2021-09-06T08:42:28Z
Reason: Succeeded
Status: False
Type: CreateResourceFailed
Last Transition Time: 2021-09-06T08:42:28Z
Reason: Succeeded
Status: False
Type: UpdateResourceFailed
Last Transition Time: 2021-09-06T08:42:28Z
Reason: Succeeded
Status: False
Type: DeleteResourceFailed
Last Transition Time: 2021-09-06T08:42:28Z
Reason: Succeeded
Status: False
Type: ConflictOccurred
Failed Namespace:
ns-1
ns-4
Last Transition Time: 2021-09-06T08:45:08Z
Reason: namespace not found
Status: True
Type: NamespaceNotExists
```
上述例子遇到目标命名空间 ns-1 和 ns-4 不存在的错误,相应的错误类型和命名空间被记录了下来。
### 更新并同步资源
ResourceDistribution 允许用户更新resource字段即更新资源并且会自动地对所有目标命名空间中的资源进行同步更新。
每一次更新资源时ResourceDistribution 都会计算新版本资源的哈希值并记录到资源的Annotations之中当 ResourceDistribution 发现新版本的资源与目前资源的哈希值不同时,才会对资源进行更新。
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: game-demo
annotations:
kruise.io/resourcedistribution.resource.from: sample
kruise.io/resourcedistribution.resource.distributed.timestamp: 2021-09-06 08:44:52.7861421 +0000 UTC m=+12896.810364601
kruise.io/resourcedistribution.resource.hashcode: 0821a13321b2c76b5bd63341a0d97fb46bfdbb2f914e2ad6b613d10632fa4b63
... ...
```
**特别地,我们非常不建议用户绕过 ResourceDistribution 直接对资源进行修改,除非用户知道自己在做什么**
- 直接修改资源后,资源的哈希值不会被自动计算,因此,下次 resource字段被修改后**ResourceDistribution 可能将用户对这些资源的直接修改覆盖掉**
- ResourceDistribution 通过 kruise.io/resourcedistribution.resource.from 来判断资源是否由该 ResourceDistribution 分发,如果该 Annotation 被修改或删除,则被修改的资源会被 ResourceDistribution 当成冲突资源,并且无法通过 ResourceDistribution 进行同步更新。
### 级联删除
**ResourceDistribution 通过 OwnerReference 来管控所分发的资源。因此,需要特别注意,当 ResourceDistribution 被删除时,其所分发的所有资源也会被删除。**

---
title: SidecarSet
---
这个控制器支持通过 admission webhook 来自动为集群中创建的符合条件的 Pod 注入 sidecar 容器。
这个注入过程和 [istio](https://istio.io/docs/setup/kubernetes/additional-setup/sidecar-injection/)
的自动注入方式很类似。
除了在 Pod 创建时候注入外SidecarSet 还提供了为运行时 Pod 原地升级其中已经注入的 sidecar 容器镜像的能力。
简单来说SidecarSet 将 sidecar 容器的定义和生命周期与业务容器解耦。
它主要用于管理无状态的 sidecar 容器,比如监控、日志等 agent。
## 范例
### 创建 SidecarSet
如下的 sidecarset.yaml 定义了一个 SidecarSet其中包括了一个名为 sidecar1 的 sidecar 容器:
```yaml
# sidecarset.yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: test-sidecarset
spec:
selector:
matchLabels:
app: nginx
updateStrategy:
type: RollingUpdate
maxUnavailable: 1
containers:
- name: sidecar1
image: centos:6.7
command: ["sleep", "999d"] # do nothing at all
volumeMounts:
- name: log-volume
mountPath: /var/log
volumes: # this field will be merged into pod.spec.volumes
- name: log-volume
emptyDir: {}
```
创建这个 YAML:
```bash
kubectl apply -f sidecarset.yaml
```
### 创建 Pod
定义一个匹配 SidecarSet selector 的 Pod
```yaml
apiVersion: v1
kind: Pod
metadata:
labels:
app: nginx # matches the SidecarSet's selector
name: test-pod
spec:
containers:
- name: app
image: nginx:1.15.1
```
创建这个 Pod你会发现其中被注入了 sidecar1 容器:
```bash
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
test-pod 2/2 Running 0 118s
```
此时SidecarSet status 被更新为:
```bash
$ kubectl get sidecarset test-sidecarset -o yaml | grep -A4 status
status:
matchedPods: 1
observedGeneration: 1
readyPods: 1
updatedPods: 1
```
### 更新sidecar container Image
更新sidecarSet中sidecar container的image=centos:7
```bash
$ kubectl edit sidecarsets test-sidecarset
# sidecarset.yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: test-sidecarset
spec:
containers:
- name: sidecar1
image: centos:7
```
此时发现pod中的sidecar容器已经被更新为了centos:7并且pod以及其它的容器没有重启。
```bash
$ kubectl get pods |grep test-pod
test-pod 2/2 Running 1 7m34s
$ kubectl get pods test-pod -o yaml |grep 'image: centos'
image: centos:7
$ kubectl describe pods test-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 5m47s kubelet Container sidecar1 definition changed, will be restarted
Normal Pulling 5m17s kubelet Pulling image "centos:7"
Normal Created 5m5s (x2 over 12m) kubelet Created container sidecar1
Normal Started 5m5s (x2 over 12m) kubelet Started container sidecar1
Normal Pulled 5m5s kubelet Successfully pulled image "centos:7"
```
## SidecarSet功能说明
一个简单的 SidecarSet yaml 文件如下:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
selector:
matchLabels:
app: sample
containers:
- name: nginx
image: nginx:alpine
initContainers:
- name: init-container
image: busybox:latest
command: [ "/bin/sh", "-c", "sleep 5 && echo 'init container success'" ]
updateStrategy:
type: RollingUpdate
namespace: ns-1
```
- spec.selector:通过 label 的方式选择需要注入、更新的 pod,支持 matchLabels、matchExpressions 两种方式,详情请参考:https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
- spec.containers:定义需要注入、更新的 pod.spec.containers 容器,支持完整的 k8s container 字段,详情请参考:https://kubernetes.io/docs/concepts/containers/
- spec.initContainers:定义需要注入的 pod.spec.initContainers 容器,支持完整的 k8s initContainer 字段,详情请参考:https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
  - 注入的 initContainers 默认基于 container name 升序排序
  - initContainers 只支持注入,不支持 pod 原地升级
- spec.updateStrategy:sidecarSet 更新策略,type 表明升级方式:
  - NotUpdate:不更新,此模式下只会包含注入能力
  - RollingUpdate:注入+滚动更新,包含了丰富的滚动更新策略,后面会详细介绍
- spec.namespace:sidecarset 默认在 k8s 整个集群范围内生效,即对所有的命名空间生效(除了 kube-system、kube-public);当设置该字段时,只对该 namespace 的 pod 生效
### sidecar container注入
sidecar 的注入只会发生在 Pod 创建阶段,并且只有 Pod spec 会被更新,不会影响 Pod 所属的 workload template 模板。
spec.containers除了默认的k8s container字段还扩展了如下一些字段来方便注入
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
selector:
matchLabels:
app: sample
containers:
# default K8s Container fields
- name: nginx
image: nginx:alpine
volumeMounts:
- mountPath: /nginx/conf
name: nginx.conf
# extended sidecar container fields
podInjectPolicy: BeforeAppContainer
shareVolumePolicy:
type: disabled | enabled
transferEnv:
- sourceContainerName: main
envName: PROXY_IP
volumes:
- Name: nginx.conf
hostPath: /data/nginx/conf
```
- podInjectPolicy 定义container注入到pod.spec.containers中的位置
- BeforeAppContainer(默认) 注入到pod原containers的前面
- AfterAppContainer 注入到pod原containers的后面
- 数据卷共享
- 共享指定卷:通过 spec.volumes 来定义 sidecar 自身需要的 volume详情请参考https://kubernetes.io/docs/concepts/storage/volumes/
- 共享所有卷:通过 spec.containers[i].shareVolumePolicy.type = enabled | disabled 来控制是否挂载 pod 应用容器的卷,常用于日志收集等 sidecar,配置为 enabled 后会把应用容器中所有挂载点注入 sidecar 同一路径下(sidecar 中本身就有声明的数据卷和挂载点除外)
- 环境变量共享
- 可以通过 spec.containers[i].transferEnv 来从别的容器获取环境变量,会把名为 sourceContainerName 容器中名为 envName 的环境变量拷贝到本容器
#### 注入暂停
**FEATURE STATE:** Kruise v0.10.0
对于已经创建的 SidecarSet可通过设置 `spec.injectionStrategy.paused=true` 实现sidecar container的暂停注入
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
... ...
injectionStrategy:
paused: true
```
上述方法只作用于新创建的 Pod对于已注入 Pod 的存量 sidecar container 不产生任何影响。
#### imagePullSecrets
**FEATURE STATE:** Kruise v0.10.0
SidecarSet 可以通过配置 spec.imagePullSecrets来配合 [Secret](https://kubernetes.io/zh/docs/concepts/configuration/secret/) 拉取私有 sidecar 镜像。其实现原理为: 当sidecar注入时SidecarSet 会将其 spec.imagePullSecrets 注入到[ Pod 的 spec.imagePullSecrets](https://kubernetes.io/zh/docs/tasks/configure-pod-container/pull-image-private-registry/)。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
... ....
imagePullSecrets:
- name: my-secret
```
需要特别注意的是,**对于需要拉取私有 sidecar 镜像的 Pod,用户必须确保这些 Pod 所在的命名空间中已存在对应的 Secret**,否则会导致拉取私有镜像失败。
### sidecar更新策略
SidecarSet不仅支持sidecar容器的原地升级而且提供了非常丰富的升级、灰度策略。
#### 分批发布
Partition 的语义是 **保留旧版本 Pod 的数量或百分比**,默认为 `0`。这里的 `partition` 不表示任何 `order` 序号。
如果在发布过程中设置了 `partition`:
- 如果是数字,控制器会将 `(replicas - partition)` 数量的 Pod 更新到最新版本。
- 如果是百分比,控制器会将 `(replicas * (100% - partition))` 数量的 Pod 更新到最新版本。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
# ...
updateStrategy:
type: RollingUpdate
partition: 90
```
假设该SidecarSet关联的pod数量是100个则本次升级只会升级10个保留90个。
#### 最大不可用数量
MaxUnavailable 是发布过程中保证的,同一时间下最大不可用的 Pod 数量,默认值为 1。用户可以将其设置为绝对值或百分比百分比会被控制器按照selected pod做基数来计算出一个背后的绝对值
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
# ...
updateStrategy:
type: RollingUpdate
maxUnavailable: 20%
```
注意maxUnavailable 和 partition 两个值是没有必然关联。举例:
- 当 {matched pod}=100,partition=50,maxUnavailable=10控制器会发布 50 个 Pod 到新版本,但是发布窗口为 10即同一时间只会发布 10 个 Pod每发布好一个 Pod 才会再找一个发布,直到 50 个发布完成。
- 当 {matched pod}=100,partition=80,maxUnavailable=30控制器会发布 20 个 Pod 到新版本,因为满足 maxUnavailable 数量,所以这 20 个 Pod 会同时发布。
#### 更新暂停
用户可以通过设置 paused 为 true 暂停发布:此时对于新创建、扩容的 pod 依旧会实现注入能力;已经更新的 pod 会保持更新后的版本不动;还没有更新的 pod 会暂停更新。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
# ...
updateStrategy:
type: RollingUpdate
paused: true
```
#### 金丝雀发布
对于有金丝雀发布需求的业务,可以通过 strategy.selector 来实现。方式:对于需要率先金丝雀灰度的 pod 打上固定的 label `canary.release: "true"`,再通过 strategy.selector.matchLabels 来选中这些 pod:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
# ...
updateStrategy:
type: RollingUpdate
selector:
matchLabels:
        canary.release: "true"
```
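为需要灰度的 pod 打标签可以直接用 kubectl 完成,下面是一个示意的命令(Pod 名字仅为示例):

```bash
# 给需要率先灰度的 pod 打上 canary 标签,SidecarSet 的 updateStrategy.selector 会选中它们
kubectl label pod test-pod canary.release=true
```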
### 发布顺序控制
- 默认对升级的pod排序保证多次升级的顺序一致
- 默认选择优先顺序是(越小优先级越高): unscheduled < scheduled, pending < unknown < running, not-ready < ready, newer pods < older pods
- scatter打散排序
#### scatter打散顺序
打散策略允许用户定义将符合某些标签的 Pod 打散到整个发布过程中。比如,一个 SidecarSet所管理的pod为10如果下面有 3 个 Pod 带有 foo=bar 标签,且用户在打散策略中设置了这个标签,那么这 3 个 Pod 会被放在第 1、6、10 个位置发布。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
# ...
updateStrategy:
type: RollingUpdate
scatterStrategy:
- key: foo
value: bar
```
**注意:如果使用 scatter 策略,建议只设置一对 key-value 做打散,会比较好理解。**
### Sidecar热升级特性
**FEATURE STATE:** Kruise v0.9.0
SidecarSet原地升级会先停止旧版本的容器然后创建新版本的容器。这种方式更加适合不影响Pod服务可用性的sidecar容器比如说日志收集Agent。
但是对于很多代理或运行时的sidecar容器例如Istio Envoy这种升级方法就有问题了。Envoy作为Pod中的一个代理容器代理了所有的流量如果直接重启Pod服务的可用性会受到影响。如果需要单独升级envoy sidecar就需要复杂的grace终止和协调机制。所以我们为这种sidecar容器的升级提供了一种新的解决方案。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: hotupgrade-sidecarset
spec:
selector:
matchLabels:
app: hotupgrade
containers:
- name: sidecar
image: openkruise/hotupgrade-sample:sidecarv1
imagePullPolicy: Always
lifecycle:
postStart:
exec:
command:
- /bin/sh
- /migrate.sh
upgradeStrategy:
upgradeType: HotUpgrade
hotUpgradeEmptyImage: openkruise/hotupgrade-sample:empty
```
- upgradeType: HotUpgrade 代表该 sidecar 容器的类型是 hot upgrade,将执行热升级方案。
- hotUpgradeEmptyImage: 当热升级 sidecar 容器时,业务必须要提供一个 empty 容器,用于热升级过程中的容器切换。empty 容器同 sidecar 容器具有相同的配置(除了镜像地址),例如 command、lifecycle、probe 等,但是它不做任何工作。
- lifecycle.postStart: 状态迁移,该过程完成热升级过程中的状态迁移,该脚本需要由业务根据自身的特点自行实现,例如 nginx 热升级需要完成 Listen FD 共享以及流量排水(reload)。
热升级特性总共包含以下两个过程:
1. Pod创建时注入热升级容器
2. 原地升级时,完成热升级流程
#### 注入热升级容器
Pod创建时SidecarSet Webhook将会注入两个容器
1. {sidecarContainer.name}-1: 如下图所示 envoy-1这个容器代表正在实际工作的sidecar容器例如envoy:1.16.0
2. {sidecarContainer.name}-2: 如下图所示 envoy-2这个容器是业务配置的hotUpgradeEmptyImage容器例如empty:1.0,用于后面的热升级机制
![sidecarset hotupgrade_injection](/img/docs/user-manuals/sidecarset_hotupgrade_injection.png)
#### 热升级流程
热升级流程主要分为以下三个步骤:
1. Upgrade: 将empty容器升级为当前最新的sidecar容器例如envoy-2.Image = envoy:1.17.0
2. Migration: lifecycle.postStart完成热升级流程中的状态迁移当迁移完成后退出
3. Reset: 状态迁移完成后热升级流程将设置envoy-1容器为empty镜像例如envoy-1.Image = empty:1.0
上述三个步骤完成了热升级中的全部流程当对Pod执行多次热升级时将重复性的执行上述三个步骤。
![sidecarset hotupgrade](/img/docs/user-manuals/sidecarset_hotupgrade.png)
#### Migration Demo
SidecarSet热升级机制不仅完成了mesh容器的切换并且提供了新老版本的协调机制PostStartHook但是至此还只是万里长征的第一步Mesh容器同时还需要提供 PostStartHook 脚本来完成mesh服务自身的平滑升级上述Migration过程Envoy热重启、Mosn无损重启。
为了方便大家能更好的理解Migration过程在kruise仓库下面提供了一个包含代码和镜像的demo供大家参考[Migration Demo](https://github.com/openkruise/samples/tree/master/hotupgrade)
设计文档请参考: [proposals sidecarset hot upgrade](https://github.com/openkruise/kruise/blob/master/docs/proposals/20210305-sidecarset-hotupgrade.md)
当前已知的利用SidecarSet热升级机制的案例
- [ALIYUN ASM](https://help.aliyun.com/document_detail/193804.html) 实现了Service Mesh中数据面的无损升级
### SidecarSet状态说明
通过sidecarset原地升级sidecar容器时可以通过SidecarSet.Status来观察升级的过程
```yaml
# kubectl describe sidecarsets sidecarset-example
Name: sidecarset-example
Kind: SidecarSet
Status:
Matched Pods: 10 # The number of PODs injected and managed by the Sidecarset
Updated Pods: 5 # 5 PODs have been updated to the container version in the latest SidecarSet
Ready Pods: 8 # Matched Pods pod.status.condition.Ready = true number
Updated Ready Pods: 3 # Updated Pods && Ready Pods number
```

---
title: UnitedDeployment
---
这个控制器提供了一个新模式来通过多个 workload 管理多个区域下的 Pod。
这篇 [博客文章](/blog/uniteddeployment) 提供了对 UnitedDeployment 一个高层面的描述。
在一个 Kubernetes 集群中可能存在不同的 node 类型,比如多个可用区、或不同的节点技术(比如 virtual kubelet),这些不同类型的 node 上有 label/taint 标识。
UnitedDeployment 控制器可以提供一个模板来定义应用,并通过管理多个 workload 来匹配下面不同的区域。
每个 UnitedDeployment 下每个区域的 workload 被称为 `subset`,有一个期望的 `replicas` Pod 数量。
目前 subset 支持使用 `StatefulSet`、`Advanced StatefulSet`、`CloneSet`、`Deployment`。
API 定义: https://github.com/openkruise/kruise/blob/master/apis/apps/v1alpha1/uniteddeployment_types.go
下面用一个简单例子来演示如何定义一个 UnitedDeployment 来管理下面三个区域的 StatefulSet所有区域的 Pod 总数为 6。
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
name: sample-ud
spec:
replicas: 6
revisionHistoryLimit: 10
selector:
matchLabels:
app: sample
template:
# statefulSetTemplate or advancedStatefulSetTemplate or cloneSetTemplate or deploymentTemplate
statefulSetTemplate:
metadata:
labels:
app: sample
spec:
selector:
matchLabels:
app: sample
template:
metadata:
labels:
app: sample
spec:
containers:
- image: nginx:alpine
name: nginx
topology:
subsets:
- name: subset-a
nodeSelectorTerm:
matchExpressions:
- key: node
operator: In
values:
- zone-a
replicas: 1
- name: subset-b
nodeSelectorTerm:
matchExpressions:
- key: node
operator: In
values:
- zone-b
replicas: 50%
- name: subset-c
nodeSelectorTerm:
matchExpressions:
- key: node
operator: In
values:
- zone-c
updateStrategy:
manualUpdate:
partitions:
subset-a: 0
subset-b: 0
subset-c: 0
type: Manual
...
```
## Pod 分发管理
上述例子中可以看到,`spec.topology` 中可以定义 Pod 分发的规则:
```go
// Topology defines the spread detail of each subset under UnitedDeployment.
// A UnitedDeployment manages multiple homogeneous workloads which are called subset.
// Each of subsets under the UnitedDeployment is described in Topology.
type Topology struct {
// Contains the details of each subset. Each element in this array represents one subset
// which will be provisioned and managed by UnitedDeployment.
// +optional
Subsets []Subset `json:"subsets,omitempty"`
}
// Subset defines the detail of a subset.
type Subset struct {
// Indicates subset name as a DNS_LABEL, which will be used to generate
// subset workload name prefix in the format '<deployment-name>-<subset-name>-'.
// Name should be unique between all of the subsets under one UnitedDeployment.
Name string `json:"name"`
// Indicates the node selector to form the subset. Depending on the node selector,
// pods provisioned could be distributed across multiple groups of nodes.
// A subset's nodeSelectorTerm is not allowed to be updated.
// +optional
NodeSelectorTerm corev1.NodeSelectorTerm `json:"nodeSelectorTerm,omitempty"`
// Indicates the tolerations the pods under this subset have.
// A subset's tolerations is not allowed to be updated.
// +optional
Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
// Indicates the number of the pod to be created under this subset. Replicas could also be
// percentage like '10%', which means 10% of UnitedDeployment replicas of pods will be distributed
// under this subset. If nil, the number of replicas in this subset is determined by controller.
// Controller will try to keep all the subsets with nil replicas have average pods.
// +optional
Replicas *intstr.IntOrString `json:"replicas,omitempty"`
}
```
`topology.subsets` 里面我们指定了多个 `subset` 组,每个 subset 其实对应了一个下属的 workload。
当一个 subset 从这个列表里增加或去除时UnitedDeployment 控制器会创建或删除相应的 subset workload。
- 每个 subset workload 有一个独立的名字,前缀是 `<UnitedDeployment-name>-<Subset-name>-`
- subset workload 是根据 UnitedDeployment 的 `spec.template` 做基础来创建,同时将 `subset` 中定义的一些特殊配置(如 `nodeSelector`, `replicas`)合并进去成为一个完整的 workload。
- `subset.replicas` 可以设置**绝对值**或**百分比**。其中,百分比会根据 UnitedDeployment 的 `replicas` 总数计算出来 subset 需要的数量;而如果不设置这个 `subset.replicas`,控制器会根据总数划分给每个 subset 对应的数量。
- `subset.nodeSelector` 会合并到 subset workload 的 `spec.template` 下面,因此这个 workload 创建出来的 Pod 都带有对应的调度规则。
## Pod 更新管理
如果用户修改了 `spec.template` 下面的字段,相当于触发了升级流程。
控制器会把新的 template 更新到各个 subset workload 里面,来触发 subset 控制器升级 Pod。
同时,如果 subset workload 支持 `partition` 策略(目前可用的 `AdvancedStatefulSet`, `StatefulSet` 都是支持的),还可以使用 `manual` 升级策略。
```go
// UnitedDeploymentUpdateStrategy defines the update performance
// when template of UnitedDeployment is changed.
type UnitedDeploymentUpdateStrategy struct {
// Type of UnitedDeployment update strategy.
// Default is Manual.
// +optional
Type UpdateStrategyType `json:"type,omitempty"`
// Includes all of the parameters a Manual update strategy needs.
// +optional
ManualUpdate *ManualUpdate `json:"manualUpdate,omitempty"`
}
// ManualUpdate is a update strategy which allows users to control the update progress
// by providing the partition of each subset.
type ManualUpdate struct {
// Indicates number of subset partition.
// +optional
Partitions map[string]int32 `json:"partitions,omitempty"`
}
```
通过 `manual` 升级策略,用户可以指定 UnitedDeployment 下面每个 subset workload 的灰度升级数量,控制器会把不同的 `partition` 数值同步给对应的 subset workload 里面。
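下面是一个示意的片段(沿用上文 sample-ud 的例子,数值仅为说明),演示只升级 subset-a、其余 subset 暂时保留旧版本:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
  name: sample-ud
spec:
  # ...
  updateStrategy:
    type: Manual
    manualUpdate:
      partitions:
        subset-a: 0   # subset-a 全部升级到新版本
        subset-b: 3   # subset-b 保留 3 个旧版本 Pod,暂不升级
        subset-c: 2   # subset-c 保留 2 个旧版本 Pod,暂不升级
```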

---
title: WorkloadSpread
---
**FEATURE STATE:** Kruise v0.10.0
WorkloadSpread能够将workload的Pod按一定规则分布到不同类型的Node节点上赋予单一workload多区域部署和弹性部署的能力。
常见的一些规则包括:
- 水平打散比如按host、az等维度的平均打散
- 按指定比例打散比如按比例部署Pod到几个指定的 az 中)。
- 带优先级的分区管理,比如:
- 优先部署到 ecs,资源不足时部署到 eci。
- 优先部署固定数量的 pod 到 ecs,其余部署到 eci。
- 定制化分区管理,比如:
- 控制workload部署不同数量的Pod到不同的cpu架构上。
- 确保不同的cpu架构上的Pod配有不同的资源配额。
WorkloadSpread 与 OpenKruise 社区的 UnitedDeployment 功能相似,每一个 WorkloadSpread 定义多个区域(定义为 `subset`),
每个 `subset` 对应一个 `maxReplicas` 数量。WorkloadSpread 利用 Webhook 注入 `subset` 定义的域信息,同时控制 Pod 的扩缩容顺序。
与 UnitedDeployment **不同**的是:UnitedDeployment 是帮助用户创建并管理多个 workload;而 WorkloadSpread 仅作用在单个 workload 之上,用户只需提供自己的 workload 即可。
当前支持的 workload 类型:`CloneSet`、`Deployment`、`ReplicaSet`。
## Demo
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
name: workloadspread-demo
spec:
targetRef:
apiVersion: apps/v1 | apps.kruise.io/v1alpha1
kind: Deployment | CloneSet
name: workload-xxx
subsets:
- name: subset-a
requiredNodeSelectorTerm:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- zone-a
preferredNodeSelectorTerms:
- weight: 1
preference:
matchExpressions:
- key: another-node-label-key
operator: In
values:
- another-node-label-value
maxReplicas: 3
    tolerations: []
patch:
metadata:
labels:
xxx-specific-label: xxx
- name: subset-b
requiredNodeSelectorTerm:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- zone-b
scheduleStrategy:
type: Adaptive | Fixed
adaptive:
rescheduleCriticalSeconds: 30
```
`targetRef`: 指定WorkloadSpread管理的workload。不可以变更且一个workload只能对应一个WorkloadSpread。
## subsets
`subsets`定义了多个区域(`subset`),每个区域配置不同的subset信息
### sub-fields
- `name`: subset 的名称,在同一个 WorkloadSpread 下 name 唯一,代表一个 topology 区域。
- `maxReplicas`:该 subset 所期望调度的最大副本数,需为 >= 0 的整数。若设置为空,代表不限制 subset 的副本数。
> 当前版本暂不支持百分比类型。
- `requiredNodeSelectorTerm`: 强制匹配到某个zone。
- `preferredNodeSelectorTerms`: 尽量匹配到某个zone。
**注意**requiredNodeSelectorTerm对应k8s nodeAffinity的requiredDuringSchedulingIgnoredDuringExecution。
preferredNodeSelectorTerms对应nodeAffinity preferredDuringSchedulingIgnoredDuringExecution。
- `tolerations`: `subset`Pod的Node容忍度。
```yaml
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
```
- `patch`: 定制`subset`中的Pod配置可以是Annotations、Labels、Env等。
例子:
```yaml
# patch pod with a topology label:
patch:
metadata:
labels:
topology.application.deploy/zone: "zone-a"
```
```yaml
# patch pod container resources:
patch:
spec:
containers:
- name: main
resources:
          limits:
cpu: "2"
memory: 800Mi
```
```yaml
# patch pod container env with a zone name:
patch:
spec:
containers:
- name: main
env:
- name: K8S_AZ_NAME
value: zone-a
```
## 调度策略
WorkloadSpread提供了两种调度策略默认为`Fixed`:
```yaml
scheduleStrategy:
type: Adaptive | Fixed
adaptive:
rescheduleCriticalSeconds: 30
```
- Fixed:
workload严格按照`subsets`定义分布。
- Adaptive:
**Reschedule**Kruise检查`subset`中调度失败的Pod若超过用户定义的时间就将其调度到其他有可用的`subset`上。
## 配置要求
WorkloadSpread 功能默认是关闭的,你需要在 安装/升级 Kruise 的时候打开 feature-gate*WorkloadSpread*
```bash
$ helm install kruise https://... --set featureGates="WorkloadSpread=true"
```
### Pod Webhook
WorkloadSpread 利用 `webhook` 向Pod注入域规则。
如果`PodWebhook` feature-gate 被设置为 `false`WorkloadSpread 也将不可用。
### deletion-cost feature
`CloneSet` 已经支持该特性。
其他 native workload 需 kubernetes version >= 1.21。且 1.21 版本需要显式开启 `PodDeletionCost` feature-gate自 1.22 起默认开启。
## 扩缩容顺序:
WorkloadSpread所管理的workload会按照`subsets`中定义的顺序扩缩容,**`subset`的顺序允许改变**,即通过改变`subset`的顺序来调整扩缩容的顺序。
规则如下:
### 扩容
- 按照`spec.subsets`中`subset`定义的顺序调度Pod当前`subset`的active Pod数量达到`maxReplicas`时再调度到下一个`subset`。
### 缩容
- 当`subset`的副本数(active)大于定义的maxReplicas时优先缩容多余的Pod。
- 依据`spec.subsets`中`subset`定义的顺序,后面`subset`的Pod先于前面的被删除。
例如:
```yaml
# subset-a subset-b subset-c
# maxReplicas 10 10 nil
# pods number 10 10 10
# deletion order: c -> b -> a
# subset-a subset-b subset-c
# maxReplicas 10 10 nil
# pods number 20 20 20
# deletion order: b -> a -> c
```
## feature-gates
WorkloadSpread 默认是关闭的,如果要开启请通过设置 feature-gates *WorkloadSpread*.
```bash
$ helm install kruise https://... --set featureGates="WorkloadSpread=true"
```
## 例子
### 弹性部署
zone-a(ack)固定 100 个 Pod,zone-b(eci)做弹性区域:
1. 创建WorkloadSpread实例
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
name: ws-demo
namespace: deploy
spec:
targetRef: # 相同namespace下的workload
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
name: cs-demo
subsets:
- name: ack # zone ack最多100个副本。
requiredNodeSelectorTerm:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- ack
maxReplicas: 100
patch: # 注入label
metadata:
labels:
topology.application.deploy/zone: ack
- name: eci # 弹性区域eci副本数量不限。
requiredNodeSelectorTerm:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- eci
patch:
metadata:
labels:
topology.application.deploy/zone: eci
```
2. 创建workload副本数可以自由调整。
#### 部署效果
- 当 replicas <= 100 时,Pod 会被调度到 ack 上。
- 当 replicas > 100 时,100 个 Pod 在 ack 上,多余的 Pod 则调度到弹性域 eci。
- 缩容时优先从弹性域eci上缩容。
### 多域部署
分别部署 100 个副本的 Pod 到两个机房(zone-a, zone-b):
1. 创建WorkloadSpread实例
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
name: ws-demo
namespace: deploy
spec:
targetRef: # 相同namespace下的workload
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
name: cs-demo
subsets:
- name: subset-a # 区域A100个副本。
requiredNodeSelectorTerm:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- zone-a
maxReplicas: 100
patch:
metadata:
labels:
topology.application.deploy/zone: zone-a
- name: subset-b # 区域B100个副本。
requiredNodeSelectorTerm:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- zone-b
maxReplicas: 100
patch:
metadata:
labels:
topology.application.deploy/zone: zone-b
```
2. 创建一个200副本的新`CloneSet`,或者对现有的`CloneSet`执行滚动更新。
3. 若`subset`副本分布需要变动,先调整对应`subset`的`maxReplicas`再调整workload副本数。

@ -14,9 +14,9 @@
"write-heading-ids": "docusaurus write-heading-ids"
},
"dependencies": {
"@docusaurus/core": "^2.0.0-beta.9",
"@docusaurus/preset-classic": "^2.0.0-beta.9",
"@docusaurus/theme-search-algolia": "^2.0.0-beta.9",
"@docusaurus/core": "^2.0.0-beta.13",
"@docusaurus/preset-classic": "^2.0.0-beta.13",
"@docusaurus/theme-search-algolia": "^2.0.0-beta.13",
"@mdx-js/react": "^1.6.21",
"@svgr/webpack": "^5.5.0",
"clsx": "^1.1.1",

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.6 MiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 1.5 MiB

View File

@ -0,0 +1,27 @@
---
title: HPA configuration
---
Kruise workloads, such as CloneSet, Advanced StatefulSet and UnitedDeployment, all implement the scale subresource,
which means they allow systems like HorizontalPodAutoscaler and PodDisruptionBudget to interact with these resources.
### Example
Just set the CloneSet's type and name into `scaleTargetRef`:
```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
# ...
spec:
scaleTargetRef:
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
name: your-cloneset-name
```
Note that:
1. The HPA's namespace should be the same as the namespace of your CloneSet.
2. The `apiVersion` in `scaleTargetRef` should be the same as the `apiVersion` in your workload resource, such as `apps.kruise.io/v1alpha1` or `apps.kruise.io/v1beta1`.
It depends on which version you are using for those workloads that have multiple versions, such as Advanced StatefulSet.
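For reference, a fuller manifest might look like the sketch below; the CloneSet name `web`, the replica bounds and the CPU target are illustrative assumptions, not values from this doc:
```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps.kruise.io/v1alpha1
    kind: CloneSet
    name: web
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```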

View File

@ -0,0 +1,95 @@
---
title: Kubectl Plugin
---
[Kruise-tools](https://github.com/openkruise/kruise-tools) provides commandline tools for kruise features, such as `kubectl-kruise`, which is a standard plugin of `kubectl`.
## Install
1. You can simply download the binary from the [releases](https://github.com/openkruise/kruise-tools/releases) page. Currently `linux`, `darwin` (OS X) and `windows` with `x86_64` and `arm64` are provided. If you are using other systems or architectures, you have to download the source code and execute `make build` to build the binary.
2. Extract and move it to system PATH.
```bash
$ tar xvf kubectl-kruise-darwin-amd64.tar.gz
$ mv darwin-amd64/kubectl-kruise /usr/local/bin/
```
3. Then you can use it with `kubectl-kruise` or `kubectl kruise`.
```bash
$ kubectl-kruise --help
# or
$ kubectl kruise --help
```
## Usage
### expose
Take a workload (e.g. deployment, cloneset), service or pod and expose it as a new Kubernetes Service.
```bash
$ kubectl kruise expose cloneset nginx --port=80 --target-port=8000
```
### scale
Set a new size for a Deployment, ReplicaSet, CloneSet, or Advanced StatefulSet.
```bash
$ kubectl kruise scale --replicas=3 cloneset nginx
```
It equals to `kubectl scale --replicas=3 cloneset nginx`.
### rollout
Available commands: `history`, `pause`, `restart`, `resume`, `status`, `undo`.
```bash
$ kubectl kruise rollout undo cloneset/nginx
# built-in statefulsets
$ kubectl kruise rollout status statefulsets/sts1
# kruise statefulsets
$ kubectl kruise rollout status statefulsets.apps.kruise.io/sts2
```
### set
Available commands: `env`, `image`, `resources`, `selector`, `serviceaccount`, `subject`.
```bash
$ kubectl kruise set env cloneset/nginx STORAGE_DIR=/local
$ kubectl kruise set image cloneset/nginx busybox=busybox nginx=nginx:1.9.1
```
### migrate
Currently it supports migrating from Deployment to CloneSet.
```bash
# Create an empty CloneSet from an existing Deployment.
$ kubectl kruise migrate CloneSet --from Deployment -n default --dst-name deployment-name --create
# Create a same replicas CloneSet from an existing Deployment.
$ kubectl kruise migrate CloneSet --from Deployment -n default --dst-name deployment-name --create --copy
# Migrate replicas from an existing Deployment to an existing CloneSet.
$ kubectl-kruise migrate CloneSet --from Deployment -n default --src-name cloneset-name --dst-name deployment-name --replicas 10 --max-surge=2
```
### scaledown
Scaledown a cloneset with selective Pods.
```bash
# Scale down 2 with selective pods
$ kubectl kruise scaledown cloneset/nginx --pods pod-a,pod-b
```
It will decrease the replicas of this CloneSet by 2 (**replicas=replicas-2**) and delete the specified Pods.

View File

@ -0,0 +1,88 @@
---
title: Architecture
---
The overall architecture of OpenKruise is shown as below:
![alt](/img/docs/core-concepts/architecture.png)
## API
All features provided by OpenKruise follow the **Kubernetes API**, including:
- CRD definition, such as
```shell script
$ kubectl get crd | grep kruise.io
advancedcronjobs.apps.kruise.io 2021-09-16T06:02:36Z
broadcastjobs.apps.kruise.io 2021-09-16T06:02:36Z
clonesets.apps.kruise.io 2021-09-16T06:02:36Z
containerrecreaterequests.apps.kruise.io 2021-09-16T06:02:36Z
daemonsets.apps.kruise.io 2021-09-16T06:02:36Z
imagepulljobs.apps.kruise.io 2021-09-16T06:02:36Z
nodeimages.apps.kruise.io 2021-09-16T06:02:36Z
podunavailablebudgets.policy.kruise.io 2021-09-16T06:02:36Z
resourcedistributions.apps.kruise.io 2021-09-16T06:02:36Z
sidecarsets.apps.kruise.io 2021-09-16T06:02:36Z
statefulsets.apps.kruise.io 2021-09-16T06:02:36Z
uniteddeployments.apps.kruise.io 2021-09-16T06:02:37Z
workloadspreads.apps.kruise.io 2021-09-16T06:02:37Z
# ...
```
- Specific identities (e.g. labels, annotations, envs) in resources, such as
```yaml
apiVersion: v1
kind: Namespace
metadata:
labels:
# To protect pods in this namespace from cascading deletion.
policy.kruise.io/delete-protection: Cascading
```
## Manager
Kruise-manager is a control plane component that runs controllers and webhooks. It is deployed by a Deployment in the `kruise-system` namespace.
```bash
$ kubectl get deploy -n kruise-system
NAME READY UP-TO-DATE AVAILABLE AGE
kruise-controller-manager 2/2 2 2 4h6m
$ kubectl get pod -n kruise-system -l control-plane=controller-manager
NAME READY STATUS RESTARTS AGE
kruise-controller-manager-68dc6d87cc-k9vg8 1/1 Running 0 4h6m
kruise-controller-manager-68dc6d87cc-w7x82 1/1 Running 0 4h6m
```
<!-- It can be deployed as multiple replicas with Deployment, but only one of them could become leader and start working, others will keep retrying to acquire the lock. -->
Logically, each controller like cloneset-controller or sidecarset-controller is a separate process, but to reduce complexity, they are all compiled into a single binary and run in a single `kruise-controller-manager-xxx` Pod.
Besides controllers, this Pod also contains the admission webhooks for Kruise CRDs and Pods. It creates webhook configurations to configure which resources should be handled, and provides a Service for kube-apiserver to call.
```bash
$ kubectl get svc -n kruise-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kruise-webhook-service ClusterIP 172.24.9.234 <none> 443/TCP 4h9m
```
The `kruise-webhook-service` is essential for kube-apiserver to call the webhooks.
## Daemon
This is a new daemon component released since Kruise v0.8.0.
It is deployed by a DaemonSet, runs on every node and manages things like image pre-downloading and container restarting.
```bash
$ kubectl get pod -n kruise-system -l control-plane=daemon
NAME READY STATUS RESTARTS AGE
kruise-daemon-6hw6d 1/1 Running 0 4h7m
kruise-daemon-d7xr4 1/1 Running 0 4h7m
kruise-daemon-dqp8z 1/1 Running 0 4h7m
kruise-daemon-dv96r 1/1 Running 0 4h7m
kruise-daemon-q7594 1/1 Running 0 4h7m
kruise-daemon-vnsbw 1/1 Running 0 4h7m
```

View File

@ -0,0 +1,92 @@
---
title: InPlace Update
---
In-place Update is one of the key features provided by OpenKruise.
Workloads that support in-place update:
- [CloneSet](/docs/user-manuals/cloneset)
- [Advanced StatefulSet](/docs/user-manuals/advancedstatefulset)
- [Advanced DaemonSet](/docs/user-manuals/advanceddaemonset)
- [SidecarSet](/docs/user-manuals/sidecarset)
Currently `CloneSet`, `Advanced StatefulSet` and `Advanced DaemonSet` re-use the same code package [`./pkg/util/inplaceupdate`](https://github.com/openkruise/kruise/tree/master/pkg/util/inplaceupdate) and have similar behaviours of in-place update. In this article, we would like to introduce the usage and workflow of them.
Note that the in-place update workflow of `SidecarSet` is a little different from the other workloads, for example it will not set the Pod to not-ready before update. So what we discuss below does not fully apply to `SidecarSet`.
## What is in-place update?
When we are going to update the image in an existing Pod, look at the comparison between *Recreate* and *InPlace* Update:
![alt](/img/docs/core-concepts/inplace-update-comparation.png)
In the **ReCreate** way we have to delete the old Pod and create a new one:
- Pod name and uid both change, because they are totally different Pod objects (such as in a Deployment update)
- Or the Pod name may not change but the uid changes, because they are still different Pod objects, although they re-use the same name (such as in a StatefulSet update)
- Node name of the Pod changes, because the new Pod is unlikely to be scheduled to the previous node.
- Pod IP changes, because the new Pod is unlikely to be allocated the previous IP.
But with the **InPlace** way we can re-use the Pod object and only modify the fields in it, so that:
- We avoid the additional cost of scheduling, allocating IP, allocating and mounting volumes
- Image pulling is faster, because we can re-use most of the image layers pulled by the old image and only need to pull several new layers
- When a container is in-place updating, the other containers in Pod will not be affected and remain running.
## Understand *InPlaceIfPossible*
The update type in Kruise workloads is named `InPlaceIfPossible`, which tells Kruise to update Pods in-place if possible, and to fall back to ReCreate Update if not.
What changes does it consider possible to update in-place?
1. Updating `spec.template.metadata.*` in workloads, such as labels and annotations: Kruise will only update the metadata of existing Pods without recreating them.
2. Updating `spec.template.spec.containers[x].image` in workloads: Kruise will in-place update the container image in Pods without recreating them.
3. **Since Kruise v1.0 (including v1.0 alpha/beta)**, updating `spec.template.metadata.labels/annotations` when containers consume the changed labels/annotations as env: Kruise will in-place update them to renew the env values in those containers.
Otherwise, changes to other fields such as `spec.template.spec.containers[x].env` or `spec.template.spec.containers[x].resources` will fall back to ReCreate Update.
Take the CloneSet YAML below as an example:
1. Modifying the `app-image:v1` image will trigger an in-place update.
2. Modifying the value of `app-config` in annotations will trigger an in-place update (read the [Requirements](#requirements) below).
3. Modifying the two fields above together will trigger an in-place update of both image and environment.
4. Directly modifying the value of `APP_NAME` in env or adding a new env will trigger a recreate update.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
...
spec:
replicas: 1
template:
metadata:
annotations:
app-config: "... the real env value ..."
spec:
containers:
- name: app
image: app-image:v1
env:
- name: APP_CONFIG
valueFrom:
fieldRef:
fieldPath: metadata.annotations['app-config']
- name: APP_NAME
value: xxx
updateStrategy:
type: InPlaceIfPossible
```
## Workflow overview
You can see the whole workflow of in-place update below (*you may need to right click and open it in a new tab*):
![alt](/img/docs/core-concepts/inplace-update-workflow.png)
## Requirements
To use InPlace Update for env from metadata, you have to enable `kruise-daemon` (*enabled by default*) and the `InPlaceUpdateEnvFromMetadata` feature-gate when installing or upgrading the Kruise chart.
Note that if you have some nodes of virtual-kubelet type, kruise-daemon may not work on them and in-place update for env from metadata will not be executed.
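For example, a minimal Helm values sketch that enables the feature-gate (kruise-daemon stays enabled by default) could look like this; pass it with `helm install/upgrade -f values.yaml` or use the equivalent `--set` flag described in the installation doc:
```yaml
# values.yaml -- a sketch, only overriding the feature-gates
featureGates: "InPlaceUpdateEnvFromMetadata=true"
```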

View File

@ -0,0 +1,165 @@
---
title: Golang client
---
If you want to create/get/update/delete those OpenKruise resources in a Golang project or list-watch them using informer,
you may need a Golang client for OpenKruise.
In that case, you should use the [kruise-api](https://github.com/openkruise/kruise-api) repository,
which only includes schema definition and clientsets of Kruise.
**DO NOT** bring the whole [kruise](https://github.com/openkruise/kruise) repository as dependency into your project.
## Usage
Firstly, import `kruise-api` into your `go.mod` file (the version should preferably match the Kruise version you installed):
```
require github.com/openkruise/kruise-api v0.10.0
```
| Kubernetes Version in your Project | Import Kruise-api < v0.10 | Import Kruise-api >= v0.10 |
| ---------------------------------- | ---------------------------- | ---------------------------- |
| < 1.18 | v0.x.y (x <= 9) | v0.x.y-legacy (x >= 10) |
| >= 1.18 | v0.x.y-1.18 (7 <= x <= 9) | v0.x.y (x >= 10) |
Then, there are two ways to use `kruise-api` in your code: use it directly or with `controller-runtime`.
It is recommended to use it with `controller-runtime` if your project is generated by
[kubebuilder](https://github.com/kubernetes-sigs/kubebuilder) or [operator-sdk](https://github.com/operator-framework/operator-sdk),
which means `controller-runtime` is already imported in your project.
Otherwise, you may use it directly.
### Use kruise-api directly
1. New Kruise client using your rest config:
```go
import kruiseclientset "github.com/openkruise/kruise-api/client/clientset/versioned"
// cfg is the rest config defined in client-go, you should get it using kubeconfig or serviceaccount
kruiseClient := kruiseclientset.NewForConfigOrDie(cfg)
```
2. Get/List Kruise resources:
```go
cloneSet, err := kruiseClient.AppsV1alpha1().CloneSets(namespace).Get(name, metav1.GetOptions{})
cloneSetList, err := kruiseClient.AppsV1alpha1().CloneSets(namespace).List(metav1.ListOptions{})
```
3. Create/Update Kruise resources:
```go
import kruiseappsv1alpha1 "github.com/openkruise/kruise-api/apps/v1alpha1"
cloneSet := kruiseappsv1alpha1.CloneSet{
// ...
}
err = kruiseClient.AppsV1alpha1().CloneSets(namespace).Create(&cloneSet, metav1.CreateOptions)
```
```go
// Get first
cloneSet, err := kruiseClient.AppsV1alpha1().CloneSets(namespace).Get(name, metav1.GetOptions{})
if err != nil {
return err
}
// Modify object, such as replicas or template
cloneSet.Spec.Replicas = utilpointer.Int32Ptr(5)
// Update
// This might get conflict, should retry it
if err = kruiseClient.AppsV1alpha1().CloneSets(namespace).Update(&cloneSet, metav1.UpdateOptions); err != nil {
return err
}
```
4. Watch Kruise resources:
```go
import kruiseinformer "github.com/openkruise/kruise-api/client/informers/externalversions"
kruiseInformerFactory := kruiseinformer.NewSharedInformerFactory(kruiseClient, 0)
kruiseInformerFactory.Apps().V1alpha1().CloneSets().Informer().AddEventHandler(...)
kruiseInformerFactory.Start(...)
```
### Use kruise-api with controller-runtime
1. Add kruise apis into the scheme in your `main.go`
```go
import kruiseapi "github.com/openkruise/kruise-api"
// ...
_ = kruiseapi.AddToScheme(scheme)
```
2. New client
This is needed when using the controller-runtime client directly.
If your project is generated by [kubebuilder](https://github.com/kubernetes-sigs/kubebuilder) or [operator-sdk](https://github.com/operator-framework/operator-sdk),
you should get the client from `mgr.GetClient()` instead of the example below.
```go
import "sigs.k8s.io/controller-runtime/pkg/client"
apiClient, err := client.New(c, client.Options{Scheme: scheme})
```
3. Get/List
```go
import (
kruiseappsv1alpha1 "github.com/openkruise/kruise-api/apps/v1alpha1"
"sigs.k8s.io/controller-runtime/pkg/client"
)
cloneSet := kruiseappsv1alpha1.CloneSet{}
err = apiClient.Get(context.TODO(), types.NamespacedName{Namespace: namespace, Name: name}, &cloneSet)
cloneSetList := kruiseappsv1alpha1.CloneSetList{}
err = apiClient.List(context.TODO(), &cloneSetList, client.InNamespace(instance.Namespace))
```
4. Create/Update/Delete
Create a new CloneSet:
```go
import kruiseappsv1alpha1 "github.com/openkruise/kruise-api/apps/v1alpha1"
cloneSet := kruiseappsv1alpha1.CloneSet{
// ...
}
err = apiClient.Create(context.TODO(), &cloneSet)
```
Update an existing CloneSet:
```go
import kruiseappsv1alpha1 "github.com/openkruise/kruise-api/apps/v1alpha1"
// Get first
cloneSet := kruiseappsv1alpha1.CloneSet{}
if err = apiClient.Get(context.TODO(), types.NamespacedName{Namespace: namespace, Name: name}, &cloneSet); err != nil {
return err
}
// Modify object, such as replicas or template
cloneSet.Spec.Replicas = utilpointer.Int32Ptr(5)
// Update
// This might get conflict, should retry it
if err = apiClient.Update(context.TODO(), &cloneSet); err != nil {
return err
}
```
5. List watch and informer
If your project is generated by [kubebuilder](https://github.com/kubernetes-sigs/kubebuilder) or [operator-sdk](https://github.com/operator-framework/operator-sdk) and get the client from `mgr.GetClient()`,
then methods like `Get`/`List` have already queried from informer instead of apiserver.

View File

@ -0,0 +1,13 @@
---
title: Java client
---
We do have a [client-java](https://github.com/openkruise/client-java) repository,
which only includes schema definition and clientsets of Kruise.
However, it is somewhat deprecated. We strongly recommend you use the [Golang Client](./go-client).
If you have to use the [client-java](https://github.com/openkruise/client-java), please note that:
1. The schema definition in it may be of an older Kruise version, which means we will not dump it for each release version.
2. This package has not been uploaded to the official Maven repository, which means you have to download this repo and package it manually in order to use it.

View File

@ -0,0 +1,9 @@
---
title: Other languages
---
Currently, Kruise does not provide SDKs for languages other than Golang and Java.
Actually we only recommend the [Golang Client](./go-client), which is guaranteed to be the latest and stable.
If you are using other languages such as Python, you have to use the official Kubernetes clients such as [kubernetes-client/python](https://github.com/kubernetes-client/python).
Usually they all provide methods that let you operate any custom resources.

View File

@ -0,0 +1,3 @@
---
title: FAQ
---

View File

@ -0,0 +1,150 @@
---
title: Installation
---
Since v1.0.0 (alpha/beta), OpenKruise requires **Kubernetes version >= 1.16**.
## Install with helm
Kruise can be simply installed by helm v3.1+, which is a simple command-line tool and you can get it from [here](https://github.com/helm/helm/releases).
```bash
# Firstly add openkruise charts repository if you haven't done this.
$ helm repo add openkruise https://openkruise.github.io/charts/
# [Optional]
$ helm repo update
# Install the latest version.
$ helm install kruise openkruise/kruise --version 1.0.0
```
*If you want to install the stable version, read [doc](/docs/installation).*
## Upgrade with helm
```bash
# Firstly add openkruise charts repository if you haven't done this.
$ helm repo add openkruise https://openkruise.github.io/charts/
# [Optional]
$ helm repo update
# Upgrade the latest version.
$ helm upgrade kruise openkruise/kruise --version 1.0.0 [--force]
```
Note that:
1. Before upgrade, you **must** firstly read the [Change Log](https://github.com/openkruise/kruise/blob/master/CHANGELOG.md)
to make sure that you have understood the breaking changes in the new version.
2. If you want to drop the chart parameters you configured for the old release or set some new parameters,
it is recommended to add `--reset-values` flag in `helm upgrade` command.
Otherwise you should use `--reuse-values` flag to reuse the last release's values.
3. If you are **upgrading Kruise from 0.x to 1.x**, you must add `--force` for upgrade command. Otherwise it is an optional flag.
## Optional: download charts manually
If you have problem with connecting to `https://openkruise.github.io/charts/` in production, you might need to download the chart from [here](https://github.com/openkruise/charts/releases) manually and install or upgrade with it.
```bash
$ helm install/upgrade kruise /PATH/TO/CHART
```
## Options
Note that installing this chart directly means it will use the default template values for Kruise.
You may have to set your specific configurations if it is deployed into a production cluster, or you want to configure feature-gates.
### Optional: chart parameters
The following table lists the configurable parameters of the chart and their default values.
| Parameter | Description | Default |
| ----------------------------------------- | ------------------------------------------------------------ | ----------------------------- |
| `featureGates` | Feature gates for Kruise, empty string means all by default | `` |
| `installation.namespace` | namespace for kruise installation | `kruise-system` |
| `manager.log.level` | Log level that kruise-manager printed | `4` |
| `manager.replicas` | Replicas of kruise-controller-manager deployment | `2` |
| `manager.image.repository` | Repository for kruise-manager image | `openkruise/kruise-manager` |
| `manager.image.tag` | Tag for kruise-manager image | `v1.0.0` |
| `manager.resources.limits.cpu` | CPU resource limit of kruise-manager container | `100m` |
| `manager.resources.limits.memory` | Memory resource limit of kruise-manager container | `256Mi` |
| `manager.resources.requests.cpu` | CPU resource request of kruise-manager container | `100m` |
| `manager.resources.requests.memory` | Memory resource request of kruise-manager container | `256Mi` |
| `manager.metrics.port` | Port of metrics served | `8080` |
| `manager.webhook.port` | Port of webhook served | `9443` |
| `manager.nodeAffinity` | Node affinity policy for kruise-manager pod | `{}` |
| `manager.nodeSelector` | Node labels for kruise-manager pod | `{}` |
| `manager.tolerations` | Tolerations for kruise-manager pod | `[]` |
| `daemon.log.level` | Log level that kruise-daemon printed | `4` |
| `daemon.port` | Port of metrics and healthz that kruise-daemon served | `10221` |
| `daemon.resources.limits.cpu` | CPU resource limit of kruise-daemon container | `50m` |
| `daemon.resources.limits.memory` | Memory resource limit of kruise-daemon container | `128Mi` |
| `daemon.resources.requests.cpu` | CPU resource request of kruise-daemon container | `0` |
| `daemon.resources.requests.memory` | Memory resource request of kruise-daemon container | `0` |
| `daemon.affinity` | Affinity policy for kruise-daemon pod | `{}` |
| `daemon.socketLocation` | Location of the container manager control socket | `/var/run` |
| `webhookConfiguration.failurePolicy.pods` | The failurePolicy for pods in mutating webhook configuration | `Ignore` |
| `webhookConfiguration.timeoutSeconds` | The timeoutSeconds for all webhook configuration | `30` |
| `crds.managed` | Kruise will not install CRDs with chart if this is false | `true` |
| `manager.resyncPeriod` | Resync period of informer kruise-manager, defaults no resync | `0` |
| `manager.hostNetwork` | Whether kruise-manager pod should run with hostnetwork | `false` |
Specify each parameter using the `--set key=value[,key=value]` argument to `helm install` or `helm upgrade`.
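As an alternative to repeating `--set` flags, the same parameters can be collected in a values file and passed with `helm install -f` / `helm upgrade -f`. The sketch below only overrides a few of the parameters listed above; the values are illustrative:
```yaml
# my-values.yaml -- a sketch with illustrative values
installation:
  namespace: kruise-system
manager:
  replicas: 2
  log:
    level: "4"
daemon:
  log:
    level: "4"
featureGates: "ResourcesDeletionProtection=true,PreDownloadImageForInPlaceUpdate=true"
```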
### Optional: feature-gate
Feature-gate controls some influential features in Kruise:
| Name | Description | Default | Effect (if closed) |
| ---------------------- | ------------------------------------------------------------ | ------- | --------------------------------------
| `PodWebhook` | Whether to open a webhook for Pod **create** | `true` | SidecarSet/KruisePodReadinessGate disabled |
| `KruiseDaemon` | Whether to deploy `kruise-daemon` DaemonSet | `true` | ImagePulling/ContainerRecreateRequest disabled |
| `DaemonWatchingPod` | Should each `kruise-daemon` watch pods on the same node | `true` | For in-place update with same imageID or env from labels/annotations |
| `CloneSetShortHash` | Enables CloneSet controller only set revision hash name to pod label | `false` | CloneSet name can not be longer than 54 characters |
| `KruisePodReadinessGate` | Enables Kruise webhook to inject 'KruisePodReady' readiness-gate to all Pods during creation | `false` | The readiness-gate will only be injected to Pods created by Kruise workloads |
| `PreDownloadImageForInPlaceUpdate` | Enables CloneSet controller to create ImagePullJobs to pre-download images for in-place update | `false` | No image pre-download for in-place update |
| `CloneSetPartitionRollback` | Enables CloneSet controller to rollback Pods to currentRevision when number of updateRevision pods is bigger than (replicas - partition) | `false` | CloneSet will only update Pods to updateRevision |
| `ResourcesDeletionProtection` | Enables protection for resources deletion | `false` | No protection for resources deletion |
| `TemplateNoDefaults` | Whether to disable defaults injection for pod/pvc template in workloads | `false` | Should not be disabled once it has been enabled |
| `PodUnavailableBudgetDeleteGate` | Enables PodUnavailableBudget for pod deletion, eviction | `false` | No protection for pod deletion, eviction |
| `PodUnavailableBudgetUpdateGate` | Enables PodUnavailableBudget for pod.Spec update | `false` | No protection for in-place update |
| `WorkloadSpread` | Enables WorkloadSpread to manage multi-domain and elastic deploy | `false` | WorkloadSpread disabled |
| `InPlaceUpdateEnvFromMetadata` | Enables Kruise to in-place update a container in Pod when its env from labels/annotations changed and pod is in-place updating | `false` | Only container image can be in-place update |
If you want to configure the feature-gate, just set the parameter when install or upgrade. Such as:
```bash
$ helm install kruise https://... --set featureGates="ResourcesDeletionProtection=true\,PreDownloadImageForInPlaceUpdate=true"
```
If you want to enable all feature-gates, set the parameter as `featureGates=AllAlpha=true`.
### Optional: the local image for China
If you are in China and have problems pulling images from the official DockerHub, you can use the registry hosted on Alibaba Cloud:
```bash
$ helm install kruise https://... --set manager.image.repository=openkruise-registry.cn-hangzhou.cr.aliyuncs.com/openkruise/kruise-manager
```
## Best Practices
### Installation parameters for k3s
Usually k3s has a different runtime path from the default `/var/run`. So you have to set `daemon.socketLocation` to the real runtime socket path on your k3s node (e.g. `/run/k3s` or `/var/run/k3s/`).
## Uninstall
Note that uninstalling will cause all resources created by Kruise, including webhook configurations, services, namespace, CRDs, CR instances and Pods managed by Kruise controllers, to be deleted!
Please do this ONLY when you fully understand the consequence.
To uninstall kruise if it is installed with helm charts:
```bash
$ helm uninstall kruise
release "kruise" uninstalled
```

View File

@ -0,0 +1,69 @@
---
title: Introduction
slug: /
---
Welcome to OpenKruise!
OpenKruise is an extended component suite for Kubernetes, which mainly focuses on application automations, such as *deployment, upgrade, ops and availability protection*.
Most features provided by OpenKruise are built primarily on CRD extensions. They can work in pure Kubernetes clusters without any other dependencies.
## Key features
- **Advanced Workloads**
OpenKruise contains a set of advanced workloads, such as CloneSet, Advanced StatefulSet, Advanced DaemonSet, BroadcastJob.
They all support not only the basic features which are similar to the original Workloads in Kubernetes, but also more advanced abilities like in-place update, configurable scale/upgrade strategies, parallel operations.
In-place Update is a new methodology to update container images and even environments.
It only restarts the specific container with the new image and the Pod will not be recreated, which leads to much faster update process and much less side effects on other sub-systems such as scheduler, CNI or CSI.
- **Bypass Application Management**
OpenKruise provides several bypass ways to manage sidecar containers and multi-domain deployment for applications, which means you can manage these things without modifying the applications' Workloads.
For example, SidecarSet can help you inject sidecar containers into all matching Pods during creation and even update them in-place with no effect on other containers in Pod.
WorkloadSpread constrains the spread of stateless workload, which empowers single workload the abilities for multi-domain and elastic deployment.
- **High-availability Protection**
OpenKruise works hard on protecting high availability for applications.
Now it can prevent your Kubernetes resources from the cascading deletion mechanism, including CRD, Namespace and almost all kinds of Workloads.
In voluntary disruption scenarios, PodUnavailableBudget can achieve the effect of preventing application disruption or SLA degradation, which is not only compatible with Kubernetes PDB protection for Eviction API, but also able to support the protection ability of above scenarios.
- **High-level Operation Features**
OpenKruise also provides high-level operation features to help you manage your applications better.
You can use ImagePullJob to download any images on any nodes you want, or even require one or more containers in a running Pod to be restarted.
## Relationship
### OpenKruise vs. Kubernetes
Briefly speaking, OpenKruise plays a subsidiary role of Kubernetes.
Kubernetes itself already provides some features for application deployment and management, such as some [basic Workloads](https://kubernetes.io/docs/concepts/workloads/).
But these are far from enough for deploying and managing lots of applications in large-scale production clusters.
OpenKruise can be easily installed in any Kubernetes clusters.
It makes up for defects of Kubernetes, including but not limited to application deployment, upgrade, protection and operations.
### OpenKruise vs. Platform-as-a-Service (PaaS)
OpenKruise is **not** a PaaS and it will **not** provide any abilities of PaaS.
It is a standard extended suite for Kubernetes, currently contains two components named `kruise-manager` and `kruise-daemon`.
PaaS can use the features provided by OpenKruise to make applications deployment and management better.
## What's Next
Here are some recommended next steps:
- Start to [install OpenKruise](./installation).
- Learn OpenKruise's [Architecture](core-concepts/architecture).

View File

@ -0,0 +1,54 @@
---
title: AdvancedCronJob
---
AdvancedCronJob is an enhanced version of CronJob.
The original CronJob creates Jobs periodically according to the schedule rule, while AdvancedCronJob provides a template that supports multiple kinds of job resources.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: AdvancedCronJob
spec:
template:
# Option 1: use jobTemplate, which is equivalent to original CronJob
jobTemplate:
# ...
# Option 2: use broadcastJobTemplate, which will create a BroadcastJob object when cron schedule triggers
broadcastJobTemplate:
# ...
# Options 3(future): ...
```
- jobTemplate: creates Jobs periodically, which is equivalent to the original CronJob
- broadcastJobTemplate: creates [BroadcastJobs](./broadcastjob) periodically, which supports dispatching Jobs on every node
![AdvancedCronjob](/img/docs/user-manuals/advancedcronjob.png)
## Example
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: AdvancedCronJob
metadata:
name: acj-test
spec:
schedule: "*/1 * * * *"
template:
broadcastJobTemplate:
spec:
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
completionPolicy:
type: Always
ttlSecondsAfterFinished: 30
```
The YAML above defines an AdvancedCronJob. It creates a BroadcastJob every minute, and each BroadcastJob runs a job on every node.

View File

@ -0,0 +1,158 @@
---
title: Advanced DaemonSet
---
This controller enhances the rolling update workflow of default [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/)
controller in aspects such as partition, selector, pause and surging strategies.
Note that Advanced DaemonSet extends the same CRD schema of default DaemonSet with newly added fields.
The CRD kind name is still `DaemonSet`.
This is done on purpose so that user can easily migrate workload to the Advanced DaemonSet from the
default DaemonSet. For example, one may simply replace the value of `apiVersion` in the DaemonSet yaml
file from `apps/v1` to `apps.kruise.io/v1alpha1` after installing Kruise manager.
```yaml
- apiVersion: apps/v1
+ apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
metadata:
name: sample-ds
spec:
#...
```
## Enhanced strategies
These new fields have been added into RollingUpdateDaemonSet:
```go
const (
+ // StandardRollingUpdateType replace the old daemons by new ones using rolling update i.e replace them on each node one after the other.
+ // this is the default type for RollingUpdate.
+ StandardRollingUpdateType RollingUpdateType = "Standard"
+ // SurgingRollingUpdateType replaces the old daemons by new ones using rolling update i.e replace them on each node one
+ // after the other, creating the new pod and then killing the old one.
+ SurgingRollingUpdateType RollingUpdateType = "Surging"
)
// Spec to control the desired behavior of daemon set rolling update.
type RollingUpdateDaemonSet struct {
+ // Type is to specify which kind of rollingUpdate.
+ Type RollingUpdateType `json:"rollingUpdateType,omitempty" protobuf:"bytes,1,opt,name=rollingUpdateType"`
// ...
MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty" protobuf:"bytes,2,opt,name=maxUnavailable"`
+ // A label query over nodes that are managed by the daemon set RollingUpdate.
+ // Must match in order to be controlled.
+ // It must match the node's labels.
+ Selector *metav1.LabelSelector `json:"selector,omitempty" protobuf:"bytes,3,opt,name=selector"`
+ // The number of DaemonSet pods remained to be old version.
+ // Default value is 0.
+ // Maximum value is status.DesiredNumberScheduled, which means no pod will be updated.
+ // +optional
+ Partition *int32 `json:"partition,omitempty" protobuf:"varint,4,opt,name=partition"`
+ // Indicates that the daemon set is paused and will not be processed by the
+ // daemon set controller.
+ // +optional
+ Paused *bool `json:"paused,omitempty" protobuf:"varint,5,opt,name=paused"`
+ // ...
+ MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty" protobuf:"bytes,7,opt,name=maxSurge"`
}
```
### Type for rolling update
Advanced DaemonSet has a `rollingUpdateType` field in `spec.updateStrategy.rollingUpdate`
which controls the way to rolling update.
- `Standard`: controller will replace the old daemons by new ones using rolling update i.e replace them on each node one after the other.
It is the same behavior as default DaemonSet.
- `Surging`: controller will replace the old daemons by new ones using rolling update i.e replace them on each node one
after the other, creating the new pod and then killing the old one.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
# ...
updateStrategy:
type: RollingUpdate
rollingUpdate:
rollingUpdateType: Standard
```
### Selector for rolling update
It helps users to update Pods on specific nodes whose labels could be matched with the selector.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
# ...
updateStrategy:
type: RollingUpdate
rollingUpdate:
selector:
matchLabels:
nodeType: canary
```
### Partition for rolling update
This strategy defines rules for calculating the priority of updating pods.
Partition is the number of DaemonSet pods that should be remained to be old version.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
# ...
updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 10
```
### MaxSurge for rolling update
MaxSurge is the maximum number of DaemonSet pods that can be scheduled above the desired number of pods during the update.
Only when `rollingUpdateType=Surging`, it works.
Value can be an absolute number (ex: 5) or a percentage of the total number of DaemonSet pods at the start of the update (ex: 10%).
The absolute number is calculated from the percentage by rounding up. This cannot be 0. The default value is 1.
Example: when this is set to 30%, at most 30% of the total number of nodes that should be running the daemon pod (i.e. status.desiredNumberScheduled) can have 2 pods running at any given time.
The update starts by starting replacements for at most 30% of those DaemonSet pods.
Once the new pods are available it then stops the existing pods before proceeding onto other DaemonSet pods,
thus ensuring that at most 130% of the desired final number of DaemonSet pods are running at all times during the update.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
# ...
updateStrategy:
rollingUpdate:
rollingUpdateType: Surging
maxSurge: 30%
```
### Paused for rolling update
`paused` indicates that Pod updating is paused; the controller will not update Pods but just maintain the number of replicas.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
spec:
# ...
updateStrategy:
rollingUpdate:
paused: true
```

View File

@ -0,0 +1,260 @@
---
title: Advanced StatefulSet
---
This controller enhances the rolling update workflow of default [StatefulSet](https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/)
controller in aspects such as adding maxUnavailable and introducing in-place update strategy.
Note that Advanced StatefulSet extends the same CRD schema of default StatefulSet with newly added fields.
The CRD kind name is still `StatefulSet`.
This is done on purpose so that user can easily migrate workload to the Advanced StatefulSet from the
default StatefulSet. For example, one may simply replace the value of `apiVersion` in the StatefulSet yaml
file from `apps/v1` to `apps.kruise.io/v1beta1` after installing Kruise manager.
```yaml
- apiVersion: apps/v1
+ apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
metadata:
name: sample
spec:
#...
```
Note that since Kruise v0.7.0, Advanced StatefulSet has been promoted to `v1beta1`, which is compatible with `v1alpha1`.
And for Kruise version lower than v0.7.0, you can only use `v1alpha1`.
## MaxUnavailable
Advanced StatefulSet adds a `maxUnavailable` capability in the `RollingUpdateStatefulSetStrategy` to allow parallel Pod
updates with the guarantee that the number of unavailable pods during the update cannot exceed this value.
It can only be used when the podManagementPolicy is `Parallel`.
This feature achieves similar update efficiency like Deployment for cases where the order of
update is not critical to the workload. Without this feature, the native `StatefulSet` controller can only
update Pods one by one even if the podManagementPolicy is `Parallel`.
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
podManagementPolicy: Parallel
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 20%
```
For example, assuming an Advanced StatefulSet has five Pods named P0 to P4, and the application can
tolerate losing three replicas temporarily. If we want to update the StatefulSet Pod spec from v1 to
v2, we can perform the following steps using the `MaxUnavailable` feature for fast update.
1. Set `MaxUnavailable` to 3 to allow three unavailable Pods maximally.
2. Optionally, Set `Partition` to 4 in case canary update is needed. Partition means all Pods with an ordinal that is
greater than or equal to the partition will be updated. In this case P4 will be updated even though `MaxUnavailable`
is 3.
3. After P4 finish update, change `Partition` to 0. The controller will update P1,P2 and P3 concurrently.
Note that with default StatefulSet, the Pods will be updated sequentially in the order of P3, P2, P1.
4. Once one of P1, P2 and P3 finishes update, P0 will be updated immediately.
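A strategy sketch for steps 1 and 2 above, using the values from this example (`podManagementPolicy: Parallel` is required for `maxUnavailable` to take effect):
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 3
      partition: 4
```
Once P4 looks good, setting `partition` back to 0 (step 3) lets the controller update P1, P2 and P3 concurrently.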
## In-Place Update
Advanced StatefulSet adds a `podUpdatePolicy` field in `spec.updateStrategy.rollingUpdate`
which controls recreate or in-place update for Pods.
- `ReCreate` controller will delete old Pods and create new ones. This is the same behavior as default StatefulSet.
- `InPlaceIfPossible` controller will try to in-place update Pods instead of recreating them if possible. Please read the concept doc below.
- `InPlaceOnly` controller will in-place update Pod instead of recreating them. With `InPlaceOnly` policy, user cannot modify any fields other than the fields that supported to in-place update.
**You may need to read the [concept doc](../core-concepts/inplace-update) for more details of in-place update.**
We also bring **graceful period** into in-place update. Advanced StatefulSet has supported `gracePeriodSeconds`, which is a period
duration between controller update pod status and update pod images.
So that endpoints-controller could have enough time to remove this Pod from endpoints.
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
podManagementPolicy: Parallel
updateStrategy:
type: RollingUpdate
rollingUpdate:
podUpdatePolicy: InPlaceIfPossible
inPlaceUpdateStrategy:
gracePeriodSeconds: 10
```
**More importantly**, a readiness-gate named `InPlaceUpdateReady` must be added into `template.spec.readinessGates`
when using `InPlaceIfPossible` or `InPlaceOnly`. The condition `InPlaceUpdateReady` in podStatus will be updated to False before in-place
update and updated to True after the update is finished. This ensures that the Pod remains in NotReady state while the in-place
update is happening.
An example for StatefulSet using in-place update:
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
metadata:
name: sample
spec:
replicas: 3
serviceName: fake-service
selector:
matchLabels:
app: sample
template:
metadata:
labels:
app: sample
spec:
readinessGates:
# A new condition that ensures the pod remains at NotReady state while the in-place update is happening
- conditionType: InPlaceUpdateReady
containers:
- name: main
image: nginx:alpine
podManagementPolicy: Parallel # allow parallel updates, works together with maxUnavailable
updateStrategy:
type: RollingUpdate
rollingUpdate:
# Do in-place update if possible, currently only image update is supported for in-place update
podUpdatePolicy: InPlaceIfPossible
# Allow parallel updates with max number of unavailable instances equals to 2
maxUnavailable: 2
```
## Update sequence
Advanced StatefulSet adds a `unorderedUpdate` field in `spec.updateStrategy.rollingUpdate`, which contains strategies for non-ordered update.
If `unorderedUpdate` is not nil, pods will be updated with non-ordered sequence. Noted that UnorderedUpdate can only be allowed to work with Parallel podManagementPolicy.
Currently `unorderedUpdate` only contains one field: `priorityStrategy`.
### Priority strategy
This strategy defines rules for calculating the priority of updating pods.
All update candidates will be applied with the priority terms.
`priority` can be calculated either by weight or by order.
- `weight`: Priority is determined by the sum of weights for terms that match selector. For example,
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
updateStrategy:
rollingUpdate:
unorderedUpdate:
priorityStrategy:
weightPriority:
- weight: 50
matchSelector:
matchLabels:
test-key: foo
- weight: 30
matchSelector:
matchLabels:
test-key: bar
```
- `order`: Priority will be determined by the value of the orderKey. The update candidates are sorted based on the "int" part of the value string. For example, 5 in string "5" and 10 in string "sts-10".
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
updateStrategy:
rollingUpdate:
unorderedUpdate:
priorityStrategy:
orderPriority:
- orderedKey: some-label-key
```
## Paused update
`paused` indicates that Pod updating is paused; the controller will not update Pods but just maintain the number of replicas.
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
updateStrategy:
rollingUpdate:
paused: true
```
## Pre-download image for in-place update
**FEATURE STATE:** Kruise v0.10.0
If you have enabled the `PreDownloadImageForInPlaceUpdate` feature-gate during [Kruise installation or upgrade](../installation#optional-feature-gate),
Advanced StatefulSet controller will automatically pre-download the image you want to update to the nodes of all old Pods.
It is quite useful to accelerate the progress of applications upgrade.
The parallelism of each new image pre-downloading by Advanced StatefulSet is `1`, which means the image is downloaded on nodes one by one.
You can change the parallelism using the annotation on Advanced StatefulSet according to the capability of image registry,
for registries with more bandwidth and P2P image downloading ability, a larger parallelism can speed up the pre-download process.
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
metadata:
annotations:
apps.kruise.io/image-predownload-parallelism: "5"
```
Note that to avoid most unnecessary image downloading, now controller will only pre-download images for Advanced StatefulSet with replicas > `3`.
## Ordinals reserve(skip)
Since Advanced StatefulSet `v1beta1` (Kruise >= v0.7.0), it supports ordinals reserve.
By adding the ordinals to reserve into `reserveOrdinals` fields, Advanced StatefulSet will skip to create Pods with those ordinals.
If these Pods have already existed, controller will delete them.
Note that `spec.replicas` is the expected number of running Pods and `spec.reserveOrdinals` lists the ordinals that should be skipped.
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
replicas: 4
reserveOrdinals:
- 1
```
For an Advanced StatefulSet with `replicas=4, reserveOrdinals=[1]`, the ordinals of running Pods will be `[0,2,3,4]`.
- If you want to migrate Pod-3 and reserve this ordinal, just append `3` into `reserveOrdinals` list.
Then controller will delete Pod-3 and create Pod-5 (existing Pods will be `[0,2,4,5]`).
- If you just want to delete Pod-3, you should append `3` into `reserveOrdinals` list and set `replicas` to `3`.
Then controller will delete Pod-3 (existing Pods will be `[0,2,4]`).
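For instance, the first case above (migrating Pod-3 while keeping its ordinal reserved) could be expressed as the sketch below:
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 4
  reserveOrdinals:
  - 1
  - 3
```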
## Scaling with rate limiting
**FEATURE STATE:** Kruise v0.10.0
To avoid creating all failed Pods at once when a new Advanced StatefulSet is applied, a `maxUnavailable` field for scale strategy has been added since Kruise `v0.10.0`.
```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
# ...
replicas: 100
scaleStrategy:
maxUnavailable: 10% # percentage or absolute number
```
When this field has been set, Advanced StatefulSet will create pods with the guarantee that the number of unavailable pods during the update cannot exceed this value.
For example, the StatefulSet will firstly create 10 pods. After that, it will create one more pod only if one pod created has been running and ready.
Note that it can just be allowed to work with Parallel podManagementPolicy.

View File

@ -0,0 +1,153 @@
---
title: BroadcastJob
---
This controller distributes a Pod on every node in the cluster.
Like a [DaemonSet](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/),
a BroadcastJob makes sure a Pod is created and run on all selected nodes once in a cluster.
Like a [Job](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/),
a BroadcastJob is expected to run to completion.
In the end, BroadcastJob does not consume any resources after each Pod succeeds on every node.
This controller is particularly useful when upgrading a software, e.g., Kubelet, or validation check
in every node, which is typically needed only once within a long period of time or
running an adhoc full cluster inspection script.
Optionally, a BroadcastJob can keep alive after all Pods on desired nodes complete
so that a Pod will be automatically launched for every new node after it is added to the cluster.
## Spec definition
### Template
`Template` describes the Pod template used to run the job.
Note that for the Pod restart policy, only `Never` or `OnFailure` is allowed for BroadcastJob.
### Parallelism
`Parallelism` specifies the maximal desired number of Pods that should be run at
any given time. By default, there's no limit.
For example, if a cluster has ten nodes and `Parallelism` is set to three, there can only be
three pods running in parallel. A new Pod is created only after one running Pod finishes.
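A sketch of the ten-node scenario above, limiting execution to three Pods at a time (the name `broadcastjob-parallel` is illustrative and the pi container is borrowed from the examples later in this page):
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
  name: broadcastjob-parallel
spec:
  parallelism: 3   # at most 3 Pods run at the same time
  template:
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  completionPolicy:
    type: Always
```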
### CompletionPolicy
`CompletionPolicy` specifies the controller behavior when reconciling the BroadcastJob.
#### Always
`Always` policy means the job will eventually complete with either failed or succeeded
condition. The following parameters take effect with this policy:
- `ActiveDeadlineSeconds` specifies the duration in seconds relative to the startTime
that the job may be active before the system tries to terminate it.
For example, if `ActiveDeadlineSeconds` is set to 60 seconds, after the BroadcastJob starts
running for 60 seconds, all the running pods will be deleted and the job will be marked
as Failed.
- `BackoffLimit` specifies the number of retries before marking this job failed.
Currently, the number of retries is defined as the aggregated number of restart
counts across all Pods created by the job, i.e., the sum of the
[ContainerStatus.RestartCount](https://github.com/kruiseio/kruise/blob/d61c12451d6a662736c4cfc48682fa75c73adcbc/vendor/k8s.io/api/core/v1/types.go#L2314)
for all containers in every Pod. If this value exceeds `BackoffLimit`, the job is marked
as Failed and all running Pods are deleted. No limit is enforced if `BackoffLimit` is
not set.
- `TTLSecondsAfterFinished` limits the lifetime of a BroadcastJob that has finished execution
(either Complete or Failed). For example, if TTLSecondsAfterFinished is set to 10 seconds,
the job will be kept for 10 seconds after it finishes. Then the job along with all the Pods
will be deleted.
#### Never
`Never` policy means the BroadcastJob will never be marked as Failed or Succeeded even if
all Pods run to completion. This also means the above `ActiveDeadlineSeconds`, `BackoffLimit`
and `TTLSecondsAfterFinished` parameters take no effect if the `Never` policy is used.
For example, if a user wants to perform an initial configuration validation for every newly
added node in the cluster, they can deploy a BroadcastJob with the `Never` policy.
## Examples
### Monitor BroadcastJob status
Assuming the cluster has only one node, run `kubectl get bcj` (shortcut name for BroadcastJob) and
we will see the following:
```shell
NAME DESIRED ACTIVE SUCCEEDED FAILED
broadcastjob-sample 1 0 1 0
```
- `Desired` : The number of desired Pods. This equals to the number of matched nodes in the cluster.
- `Active`: The number of active Pods.
- `SUCCEEDED`: The number of succeeded Pods.
- `FAILED`: The number of failed Pods.
### ttlSecondsAfterFinished
Run a BroadcastJob that each Pod computes a pi, with `ttlSecondsAfterFinished` set to 30.
The job will be deleted in 30 seconds after it is finished.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
name: broadcastjob-ttl
spec:
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
restartPolicy: Never
completionPolicy:
type: Always
ttlSecondsAfterFinished: 30
```
### activeDeadlineSeconds
Run a BroadcastJob that each Pod sleeps for 50 seconds, with `activeDeadlineSeconds` set to 10 seconds.
The job will be marked as Failed after it runs for 10 seconds, and the running Pods will be deleted.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
name: broadcastjob-active-deadline
spec:
template:
spec:
containers:
- name: sleep
image: busybox
command: ["sleep", "50000"]
restartPolicy: Never
completionPolicy:
type: Always
activeDeadlineSeconds: 10
```
### completionPolicy
Run a BroadcastJob with `Never` completionPolicy. The job will continue to run even if all Pods
have completed on all nodes.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: BroadcastJob
metadata:
name: broadcastjob-never-complete
spec:
template:
spec:
containers:
- name: sleep
image: busybox
command: ["sleep", "5"]
restartPolicy: Never
completionPolicy:
type: Never
```

View File

@ -0,0 +1,596 @@
---
title: CloneSet
---
This controller provides advanced features to efficiently manage stateless applications that
do not have instance order requirement during scaling and rollout. Analogously,
CloneSet can be recognized as an enhanced version of the upstream `Deployment` workload, but it does much more.
As the name suggests, CloneSet is a [Set-suffix controller](/blog/workload-classification-guidance) which
manages Pods directly. A sample CloneSet yaml looks like the following:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
labels:
app: sample
name: sample
spec:
replicas: 5
selector:
matchLabels:
app: sample
template:
metadata:
labels:
app: sample
spec:
containers:
- name: nginx
image: nginx:alpine
```
## Scale features
### Support PVCs
CloneSet allows user to define PVC templates `volumeClaimTemplates` in `CloneSetSpec`, which can support PVCs per Pod.
This cannot be done with `Deployment`. If not specified, CloneSet will only create Pods without PVCs.
A few reminders:
- Each PVC created by CloneSet has an owner reference. So when a CloneSet has been deleted, its PVCs will be cascading deleted.
- Each Pod and PVC created by CloneSet has a "apps.kruise.io/cloneset-instance-id" label key. They use the same string as label value which is composed of a unique **instance-id** as suffix of the CloneSet name.
- When a Pod has been deleted by CloneSet controller, all PVCs related to it will be deleted together.
- When a Pod has been deleted manually, all PVCs related to the Pod are preserved, and CloneSet controller will create a new Pod with the same **instance-id** and reuse the PVCs.
- When a Pod is updated using **recreate** policy, all PVCs related to it will be deleted together.
- When a Pod is updated using **in-place** policy, all PVCs related to it are preserved.
The following shows a sample CloneSet yaml file which contains PVC templates.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
labels:
app: sample
name: sample-data
spec:
replicas: 5
selector:
matchLabels:
app: sample
template:
metadata:
labels:
app: sample
spec:
containers:
- name: nginx
image: nginx:alpine
volumeMounts:
- name: data-vol
mountPath: /usr/share/nginx/html
volumeClaimTemplates:
- metadata:
name: data-vol
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 20Gi
```
### Selective Pod deletion
When a CloneSet is scaled down, sometimes users prefer to delete specific Pods.
This cannot be done using `StatefulSet` or `Deployment`, because `StatefulSet` always deletes Pods
in order and `Deployment`/`ReplicaSet` can only delete Pods by its own sorted sequence.
CloneSet allows user to specify to-be-deleted Pod names when scaling down `replicas`. Take the following
sample as an example,
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
replicas: 4
scaleStrategy:
podsToDelete:
- sample-9m4hp
```
When the controller receives the above update request, it ensures the number of replicas to be 4. If some Pods need to be
deleted, the Pods listed in `podsToDelete` will be deleted first.
Controller will clear `podsToDelete` automatically once the listed Pods are deleted. Note that:
If one just adds a Pod name to `podsToDelete` and does not modify `replicas`, the controller will delete this Pod and create a new one.
If one is unable to change the CloneSet directly, an alternative way is to add the label `apps.kruise.io/specified-delete: true` onto the Pod waiting to be deleted.
Comparing to delete the Pod directly, using `podsToDelete` or `apps.kruise.io/specified-delete: true`
will have CloneSet protection by `maxUnavailable`/`maxSurge` and lifecycle `PreparingDelete` triggering (See below).
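For example, the label is plain Pod metadata, so the effective change on the Pod is just the following sketch (when set in a YAML manifest, quote the value so it is a string):
```yaml
metadata:
  labels:
    apps.kruise.io/specified-delete: "true"
```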
### Deletion Sequence
1. Node unassigned < assigned
2. PodPending < PodUnknown < PodRunning
3. Not ready < ready
4. [Lower pod-deletion cost < higher pod-deletion-cost](#pod-deletion-cost)
5. [Higher spread rank < lower spread rank](#deletion-by-spread-constraints)
6. Been ready for empty time < less time < more time
7. Pods with containers with higher restart counts < lower restart counts
8. Empty creation time pods < newer pods < older pods
#### Pod deletion cost
**FEATURE STATE:** Kruise v0.9.0
The [controller.kubernetes.io/pod-deletion-cost](https://kubernetes.io/docs/reference/labels-annotations-taints/#pod-deletion-cost) annotation
is defined in Kubernetes since `v1.21`, Deployment/ReplicaSet will remove pods according to this cost when downscaling.
And CloneSet has also supported it since Kruise `v0.9.0`.
The annotation should be set on the pod, the range is [-2147483647, 2147483647].
It represents the cost of deleting a pod compared to other pods belonging to the same CloneSet.
Pods with lower deletion cost are preferred to be deleted before pods with higher deletion cost.
The implicit value for this annotation for pods that don't set it is 0; negative values are permitted.
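For example, to prefer a particular Pod for deletion, give it a lower cost than its siblings (the value here is illustrative):
```yaml
metadata:
  annotations:
    # lower than the default 0, so this Pod is deleted before the others
    controller.kubernetes.io/pod-deletion-cost: "-100"
```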
#### Deletion by Spread Constraints
**FEATURE STATE:** Kruise v0.10.0
The original proposal(design doc) is [here](https://github.com/openkruise/kruise/blob/master/docs/proposals/20210624-cloneset-scaledown-topology-spread.md).
Currently, it supports **deletion by same node spread** and **deletion by [pod topology spread constraints](https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/)**.
If there are Pod Topology Spread Constraints defined in the CloneSet template, the controller will choose Pods according to the spread constraints when the CloneSet needs to scale down.
Otherwise, the controller will choose Pods by same node spread by default when scaling down.
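As a sketch, spread constraints can be declared in the CloneSet pod template just like in a regular Pod spec (the topology key and selector below are only illustrative):
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  # ...
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: sample
      # ...
```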
### Short hash label
**FEATURE STATE:** Kruise v0.9.0
By default, CloneSet sets the `controller-revision-hash` in Pod label to the full name of the ControllerRevision, such as:
```yaml
apiVersion: v1
kind: Pod
metadata:
labels:
controller-revision-hash: demo-cloneset-956df7994
```
It is composed of the CloneSet name and the hash of the ControllerRevision.
The hash is usually 8~10 characters long, and a label value in Kubernetes can not be more than 63 characters,
so the name of a CloneSet should be less than 52 characters.
A new feature-gate named `CloneSetShortHash` has been introduced.
If it is enabled, CloneSet will only set the `controller-revision-hash` to the real hash, such as `956df7994`,
so there is no longer any limit on the CloneSet name length.
Don't worry: even if you enable `CloneSetShortHash`, CloneSet will still recognize and manage the old Pods with the full revision label.
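For example, assuming the Helm-based installation described in the installation doc, the feature-gate could be enabled like this (the chart URL is elided here, just as in the other examples in these docs):
```bash
$ helm install kruise https://... --set featureGates="CloneSetShortHash=true"
```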
## Scale features
### Scale up with rate limit
**FEATURE STATE:** Kruise v1.0.0
Users can specify `ScaleStrategy.MaxUnavailable` to limit the step size of CloneSet **scaling up**, so as to minimize the impact on application services.
This value can be an absolute number (e.g., 5) or a percentage of the desired number of Pods (e.g., 10%). Default value is `nil` (i.e., an empty pointer), which indicates no limitation.
The `ScaleStrategy.MaxUnavailable` field can work together with the `Spec.MinReadySeconds` field, for example:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
minReadySeconds: 60
scaleStrategy:
maxUnavailable: 1
```
The effect of the above configuration is that during scaling up, CloneSet will not create the next pod until the previous pod has been ready for more than one minute.
## Update features
### Update types
CloneSet provides three update types, defaulting to `ReCreate`.
- `ReCreate`: controller will delete old Pods and PVCs and create new ones.
- `InPlaceIfPossible`: controller will try to in-place update Pods instead of recreating them if possible. Please read the concept doc below.
- `InPlaceOnly`: controller will in-place update Pods instead of recreating them. With the `InPlaceOnly` policy, users cannot modify any fields other than the fields that support in-place update.
**You may need to read the [concept doc](../core-concepts/inplace-update) for more details of in-place update.**
We also bring a **graceful period** into in-place update. CloneSet supports `gracePeriodSeconds`, which is the
duration the controller waits between updating the Pod status and updating the Pod images,
so that the endpoints-controller has enough time to remove this Pod from its endpoints.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
type: InPlaceIfPossible
inPlaceUpdateStrategy:
gracePeriodSeconds: 10
```
### Template and revision
`spec.template` defines the latest pod template in the CloneSet.
Controller will calculate a revision hash for each version of `spec.template` when it has been initialized or modified.
For example, when we create a sample CloneSet, controller will calculate the revision hash `sample-744d4796cc` and
present the hash in CloneSet Status.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
generation: 1
# ...
spec:
replicas: 5
# ...
status:
observedGeneration: 1
readyReplicas: 5
replicas: 5
currentRevision: sample-d4d4fb5bd
updateRevision: sample-d4d4fb5bd
updatedReadyReplicas: 5
updatedReplicas: 5
# ...
```
Here are the explanations for the counters presented in CloneSet status:
- `status.replicas`: Number of pods
- `status.readyReplicas`: Number of **ready** pods
- `status.availableReplicas`: Number of **ready and available** pods (satisfied with `minReadySeconds`)
- `status.currentRevision`: The latest revision hash to which all Pods have been fully updated
- `status.updateRevision`: Latest revision hash of this CloneSet
- `status.updatedReplicas`: Number of pods with the latest revision
- `status.updatedReadyReplicas`: Number of **ready** pods with the latest revision
### Partition
Partition is the **desired number or percent of Pods in old revisions**, defaults to `0`. This field does **NOT** imply any update order.
When `partition` is set during update:
- If it is a number: `(replicas - partition)` number of pods will be updated with the new version.
- If it is a percent: `(replicas * (100% - partition))` number of pods will be updated with the new version.
For example, when we update sample CloneSet's container image to `nginx:mainline` and set `partition=3`, after a while, the sample CloneSet yaml looks like the following:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
# ...
generation: 2
spec:
replicas: 5
template:
metadata:
labels:
app: sample
spec:
containers:
- image: nginx:mainline
imagePullPolicy: Always
name: nginx
updateStrategy:
partition: 3
# ...
status:
observedGeneration: 2
readyReplicas: 5
replicas: 5
currentRevision: sample-d4d4fb5bd
updateRevision: sample-56dfb978d4
updatedReadyReplicas: 2
updatedReplicas: 2
```
Note that `status.updateRevision` has been updated to `sample-56dfb978d4`, a new hash.
Since we set `partition=3`, controller only updates two Pods to the latest revision.
```bash
$ kubectl get pod -L controller-revision-hash
NAME READY STATUS RESTARTS AGE CONTROLLER-REVISION-HASH
sample-chvnr 1/1 Running 0 6m46s sample-d4d4fb5bd
sample-j6c4s 1/1 Running 0 6m46s sample-d4d4fb5bd
sample-ns85c 1/1 Running 0 6m46s sample-d4d4fb5bd
sample-jnjdp 1/1 Running 0 10s sample-56dfb978d4
sample-qqglp 1/1 Running 0 18s sample-56dfb978d4
```
#### Rollback by partition
**FEATURE STATE:** Kruise v0.9.0
By default, `partition` only controls Pods being updated to `status.updateRevision`.
That means for this CloneSet, when `partition` is changed from 5 to 3, CloneSet will update 2 Pods to `status.updateRevision`.
If `partition` is then changed back from 3 to 5, CloneSet will do nothing.
But if you have enabled the `CloneSetPartitionRollback` feature-gate, in this case,
CloneSet will update the 2 Pods at `status.updateRevision` back to `status.currentRevision`.
### MaxUnavailable
MaxUnavailable is the maximum number of Pods that can be unavailable.
Value can be an absolute number (e.g., 5) or a percentage of desired number of Pods (e.g., 10%).
Default value is 20%.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
maxUnavailable: 20%
```
Since Kruise `v0.9.0`, `maxUnavailable` not only controls Pod updates, but also affects specified Pod deletion.
That means if you declare a Pod to be deleted via `podsToDelete` or `apps.kruise.io/specified-delete: true`,
CloneSet will delete it only if the number of unavailable Pods (compared to the replicas number) is less than `maxUnavailable`.
### MaxSurge
MaxSurge is the maximum number of pods that can be scheduled above the desired replicas.
Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
Defaults to 0.
If maxSurge is set, the cloneset-controller will create `maxSurge` number of Pods above the `replicas`
when it finds multiple active revisions of Pods, which means the CloneSet is in the update stage.
After all Pods except the `partition` number have been updated to the latest revision, the `maxSurge` number of Pods will be deleted,
and the number of Pods will be equal to the `replicas` number.
What's more, maxSurge is forbidden to be used with the `InPlaceOnly` policy.
When maxSurge is used with `InPlaceIfPossible`, the controller will create additional Pods with the latest revision first,
and then update the rest of the Pods with old revisions.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
maxSurge: 3
```
Since Kruise `v0.9.0`, `maxSurge` not only controls Pod updates, but also affects specified Pod deletion.
That means if you declare a Pod to be deleted via `podsToDelete` or `apps.kruise.io/specified-delete: true`,
CloneSet may create a new Pod, wait for it to be ready, and then delete the old one.
It depends on `maxUnavailable` and the current number of unavailable Pods.
For example:
- For a CloneSet with `maxUnavailable=2, maxSurge=1` where the only unavailable Pod is `pod-a`,
  if you patch `apps.kruise.io/specified-delete: true` onto `pod-b` or put the Pod name into `podsToDelete`,
  CloneSet will delete it directly.
- For a CloneSet with `maxUnavailable=1, maxSurge=1` where the only unavailable Pod is `pod-a`,
  if you patch `apps.kruise.io/specified-delete: true` onto `pod-b` or put the Pod name into `podsToDelete`,
  CloneSet will create a new Pod, wait for it to be ready, and finally delete `pod-b`.
- For a CloneSet with `maxUnavailable=1, maxSurge=1` where the only unavailable Pod is `pod-a`,
  if you patch `apps.kruise.io/specified-delete: true` onto `pod-a` or put the Pod name into `podsToDelete`,
  CloneSet will delete it directly.
- ...
### Update sequence
When controller chooses Pods to update, it has default sort logic based on Pod phase and conditions:
**unscheduled < scheduled, pending < unknown < running, not-ready < ready**.
In addition, CloneSet also supports advanced `priority` and `scatter` strategies to allow users to specify the update order.
#### priority
This strategy defines rules for calculating the priority of updating Pods.
The priority terms are applied to all update candidates.
`priority` can be calculated either by weight or by order.
- `weight`: Priority is determined by the sum of weights for terms that match selector. For example,
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
priorityStrategy:
weightPriority:
- weight: 50
matchSelector:
matchLabels:
test-key: foo
- weight: 30
matchSelector:
matchLabels:
test-key: bar
```
- `order`: Priority will be determined by the value of the orderKey. The update candidates are sorted based on the "int" part of the value string. For example, 5 in string "5" and 10 in string "sts-10".
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
priorityStrategy:
orderPriority:
- orderedKey: some-label-key
```
#### scatter
This strategy defines rules to scatter certain Pods during the update.
For example, if a CloneSet has `replicas=10`, and we add the `foo=bar` label to 3 Pods and specify the following scatter rule, these 3 Pods will
be the 1st, 6th and 10th Pods to be updated.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
scatterStrategy:
- key: foo
value: bar
```
Note that:
- Although the `priority` strategy and the `scatter` strategy can be applied together, we strongly suggest using just one of them to avoid confusion.
- If the `scatter` strategy is used, we suggest using just one term. Otherwise, the update order can be hard to understand.
Last but not least, the above advanced update strategies require independent Pod labeling mechanisms, which are not provided by CloneSet.
### Paused update
`paused` indicates that Pods updating is paused, controller will not update Pods but just maintain the number of replicas.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# ...
updateStrategy:
paused: true
```
### Pre-download image for in-place update
**FEATURE STATE:** Kruise v0.9.0
If you have enabled the `PreDownloadImageForInPlaceUpdate` feature-gate during [Kruise installation or upgrade](../installation#optional-feature-gate),
the CloneSet controller will automatically pre-download the image you want to update to onto the nodes of all old Pods.
It is quite useful to accelerate application upgrades.
The parallelism of each new image pre-downloading by CloneSet is `1`, which means the image is downloaded on nodes one by one.
You can change the parallelism using an annotation on the CloneSet according to the capability of your image registry:
for registries with more bandwidth and P2P image downloading ability, a larger parallelism can speed up the pre-download process.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
annotations:
apps.kruise.io/image-predownload-parallelism: "5"
```
Note that to avoid most unnecessary image downloading, the controller will only pre-download images for a CloneSet with replicas > `3`.
## Lifecycle hook
Each Pod managed by CloneSet has a clear state defined in `lifecycle.apps.kruise.io/state` label:
- Normal
- PreparingUpdate
- Updating
- Updated
- PreparingDelete
Lifecycle hook allows users to do something (for example remove pod from service endpoints) during Pod deleting and before/after in-place update.
```golang
type LifecycleStateType string
// Lifecycle contains the hooks for Pod lifecycle.
type Lifecycle struct {
// PreDelete is the hook before Pod to be deleted.
PreDelete *LifecycleHook `json:"preDelete,omitempty"`
// InPlaceUpdate is the hook before Pod to update and after Pod has been updated.
InPlaceUpdate *LifecycleHook `json:"inPlaceUpdate,omitempty"`
}
type LifecycleHook struct {
	// LabelsHandler blocks the lifecycle transition while the Pod still has all of these labels.
	LabelsHandler map[string]string `json:"labelsHandler,omitempty"`
	// FinalizersHandler blocks the lifecycle transition while the Pod still has all of these finalizers.
	FinalizersHandler []string `json:"finalizersHandler,omitempty"`
}
```
Examples:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
# define with finalizer
lifecycle:
preDelete:
finalizersHandler:
- example.io/unready-blocker
inPlaceUpdate:
finalizersHandler:
- example.io/unready-blocker
# or define with label
lifecycle:
inPlaceUpdate:
labelsHandler:
example.io/block-unready: "true"
```
### State circulation
![Lifecycle circulation](/img/docs/user-manuals/cloneset-lifecycle.png)
- When CloneSet deletes a Pod (including scale in and recreate update):
  - Delete it directly if there is no lifecycle hook definition or the Pod does not match the preDelete hook
  - Otherwise, CloneSet will first update the Pod to the `PreparingDelete` state and wait for the user controller to remove the label/finalizer so that the Pod no longer matches the preDelete hook
  - Note that Pods in `PreparingDelete` state will not be updated
- When CloneSet updates a Pod in-place:
  - If a lifecycle hook is defined and the Pod matches the inPlaceUpdate hook, CloneSet will update the Pod to the `PreparingUpdate` state
  - After the user controller removes the label/finalizer (so the Pod no longer matches the inPlaceUpdate hook), CloneSet will update it to the `Updating` state and start updating
  - After the in-place update has completed, CloneSet will update the Pod to the `Updated` state if a lifecycle hook is defined and the Pod does not match the inPlaceUpdate hook
  - When the user controller adds the label/finalizer back into the Pod so that it matches the inPlaceUpdate hook, CloneSet will finally update it to the `Normal` state
Besides, although our design supports changing a Pod from `PreparingDelete` back to `Normal` (by canceling the specified deletion), it is not recommended. Since Pods in the `PreparingDelete` state are not updated by CloneSet, such a Pod might start updating immediately after it comes back to `Normal`, which is hard for the user controller to handle.
### Example for user controller logic
Same as the yaml example above, we can first define:
- the `example.io/unready-blocker` finalizer as the hook
- the `example.io/initialing` annotation as an identity for initializing
Add these fields into CloneSet template:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
template:
metadata:
annotations:
example.io/initialing: "true"
finalizers:
- example.io/unready-blocker
# ...
lifecycle:
preDelete:
finalizersHandler:
- example.io/unready-blocker
inPlaceUpdate:
finalizersHandler:
- example.io/unready-blocker
```
User controller logic:
- For a Pod in `Normal` state, if there is `example.io/initialing: true` in its annotations and the ready condition in the Pod status is `True`, then add it to endpoints and remove the annotation
- For a Pod in `PreparingDelete` or `PreparingUpdate` state, delete it from endpoints and remove the `example.io/unready-blocker` finalizer
- For a Pod in `Updated` state, add it to endpoints and add the `example.io/unready-blocker` finalizer
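The following is a minimal sketch of such decision logic in Go, assuming the k8s.io/api types; the `reconcilePodLifecycle` and `action` names are hypothetical, and applying the returned actions (patching the Pod, updating endpoints) is left to the caller:
```golang
package lifecycle

import (
	corev1 "k8s.io/api/core/v1"
)

const (
	lifecycleStateKey = "lifecycle.apps.kruise.io/state"
	initialingKey     = "example.io/initialing"
)

// action describes what the user controller should do for one Pod.
type action struct {
	addToEndpoints      bool
	removeFromEndpoints bool
	removeAnnotation    bool // remove example.io/initialing
	addFinalizer        bool // add example.io/unready-blocker
	removeFinalizer     bool // remove example.io/unready-blocker
}

// reconcilePodLifecycle maps the lifecycle state of a Pod to the actions
// described in the list above.
func reconcilePodLifecycle(pod *corev1.Pod) action {
	var act action
	switch pod.Labels[lifecycleStateKey] {
	case "Normal":
		if pod.Annotations[initialingKey] == "true" && isPodReady(pod) {
			act.addToEndpoints = true
			act.removeAnnotation = true
		}
	case "PreparingDelete", "PreparingUpdate":
		act.removeFromEndpoints = true
		act.removeFinalizer = true
	case "Updated":
		act.addToEndpoints = true
		act.addFinalizer = true
	}
	return act
}

// isPodReady returns true if the Pod's Ready condition is True.
func isPodReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}
```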
---
title: Container Launch Priority
---
**FEATURE STATE:** Kruise v1.0.0
Container Launch Priority provides a way to help users **control the start sequence of containers** in a Pod.
> Usually the start and stop sequences of containers are controlled by Kubelet. Kubernetes used to have a [KEP](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/753-sidecar-containers), which planned to add a type field to containers to identify the priority of start and stop.
> However, it was rejected because [sig-node thought it would bring too large a change to the code](https://github.com/kubernetes/enhancements/issues/753#issuecomment-713471597).
Note that the feature works for Pods, **no matter what kind of owner they belong to**, which means Deployment, CloneSet and any other workloads are all supported.
## Usage
### Start by containers ordinal
It only requires you to add an annotation in Pod:
```yaml
apiVersion: v1
kind: Pod
annotations:
apps.kruise.io/container-launch-priority: Ordered
spec:
containers:
- name: sidecar
# ...
- name: main
# ...
```
Then Kruise will ensure the former container (sidecar) is started before the latter one (main).
### Start by configurable sequence
You should set the priority env `KRUISE_CONTAINER_PRIORITY` in container:
```yaml
apiVersion: v1
kind: Pod
spec:
containers:
- name: main
# ...
- name: sidecar
env:
- name: KRUISE_CONTAINER_PRIORITY
value: "1"
# ...
```
1. The range of the value is `[-2147483647, 2147483647]`. Defaults to `0` if no such env exists.
2. The container with higher priority will be guaranteed to start before the others with lower priority.
3. The containers with same priority have no limit to their start sequence.
## Requirement
ContainerLaunchPriority requires `PodWebhook` feature-gate to be enabled, which is the default state.
## Implementation Details
The Kruise webhook admits all Pod creation requests.
When the webhook finds that a Pod has the `apps.kruise.io/container-launch-priority` annotation or `KRUISE_CONTAINER_PRIORITY` in env,
it will inject the `KRUISE_CONTAINER_BARRIER` env into its containers.
The value of `KRUISE_CONTAINER_BARRIER` comes from a ConfigMap named `{pod-name}-barrier`, and the key is related to the priority of each container.
For example:
```yaml
apiVersion: v1
kind: Pod
spec:
containers:
- name: main
# ...
env:
- name: KRUISE_CONTAINER_BARRIER
valueFrom:
configMapKeyRef:
name: {pod-name}-barrier
key: "p_0"
- name: sidecar
env:
- name: KRUISE_CONTAINER_PRIORITY
value: "1"
- name: KRUISE_CONTAINER_BARRIER
valueFrom:
configMapKeyRef:
name: {pod-name}-barrier
key: "p_1"
# ...
```
The Kruise controller will create an empty ConfigMap for this Pod, and then add keys into the ConfigMap according to the priorities and containerStatuses of the Pod.
As in the example before, the controller will first add the `p_1` key into the ConfigMap, wait for the sidecar container to be running and ready, and finally add `p_0` into the ConfigMap to let Kubelet start the main container.
Besides, you may see the `CreateContainerConfigError` state when you use `kubectl get` while a Pod is starting with priority.
This is because Kubelet cannot find some keys at that moment; it will be fine after all containers in the Pod have started.
---
title: Container Restart
---
**FEATURE STATE:** Kruise v0.9.0
ContainerRecreateRequest provides a way to let users **restart/recreate** one or more containers in an existing Pod.
Just like the in-place update provided in Kruise, during container recreation, other containers in the same Pod are still running.
Once the recreation is completed, nothing changes in the Pod except that the recreated container's restartCount is increased.
Note that the files written into the **rootfs of the previous container will be lost**, but the data in volume mounts remain.
This feature relies on `kruise-daemon` to stop the container in the Pod.
So if the `KruiseDaemon` feature-gate is disabled, ContainerRecreateRequest will also be disabled.
## Usage
### Submit request
Create a `ContainerRecreateRequest` (short name `CRR`) for each Pod container recreation:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
metadata:
namespace: pod-namespace
name: xxx
spec:
podName: pod-name
containers: # list of container name that need to be recreated, at least one
- name: app
- name: sidecar
strategy:
failurePolicy: Fail # or 'Ignore'. If 'Fail', the CRR will abort when one container failed to stop or recreate
orderedRecreate: false # 'true' indicates to recreate the next container only if the previous one has recreated completely
    terminationGracePeriodSeconds: 30   # optional duration in seconds to wait for the container to terminate gracefully
    unreadyGracePeriodSeconds: 3        # duration for which the Pod is marked as not ready before its preStop hook is executed and it is stopped
    minStartedSeconds: 10               # the new container will be considered Succeeded only if it has been started for more than minStartedSeconds
  activeDeadlineSeconds: 300        # the CRR will be marked as Completed immediately if it has run longer than this deadline since it was created
  ttlSecondsAfterFinished: 1800     # the time a completed CRR remains before it is deleted
```
*All fields in `strategy` and the `activeDeadlineSeconds`/`ttlSecondsAfterFinished` in `spec` are optional.*
1. Typically, the containers in the list will be stopped one by one, but they may be recreated together, unless `orderedRecreate` is `true`.
2. The `unreadyGracePeriodSeconds` depends on a new feature-gate named `KruisePodReadinessGate`, which injects a readinessGate during each Pod creation.
   Otherwise, `unreadyGracePeriodSeconds` only works for those new Pods created by Kruise that have the readinessGate.
3. If users set `ttlSecondsAfterFinished`, the CRR will automatically be deleted after it has been completed for longer than this time.
   Otherwise, users have to delete the CRR manually.
```bash
# for commandline you can
$ kubectl get containerrecreaterequest -n pod-namespace
# or just short name
$ kubectl get crr -n pod-namespace
```
### Check request status
Status of CRR looks like this:
```yaml
status:
completionTime: "2021-03-22T11:53:48Z"
containerRecreateStates:
- name: app
phase: Succeeded
- name: sidecar
phase: Succeeded
phase: Completed
```
The `status.phase` can be:
- `Pending`: the CRR waits to be executed
- `Recreating`: the CRR is executing
- `Completed`: this CRR has completed, and `status.completionTime` is the timestamp of completion
Note that `status.phase=Completed` does not mean all containers in CRR have recreated successfully.
Users should find the information in `status.containerRecreateStates`.
The `status.containerRecreateStates[x].phase` can be:
- `Pending`: this container waits to recreate
- `Recreating`: this container is recreating
- `Failed`: this container has failed to recreate
- `Succeeded`: this container has succeeded to recreate
**When the CRR has completed, only the containers in `Succeeded` phase are successfully recreated.**
## Implementation
When users create a CRR, the Kruise webhook will inject the current containerID and restartCount into `spec.containers[x].statusContext`.
And, when **kruise-daemon** starts to execute, it will skip a container if its containerID is not equal to the one in statusContext or its restartCount has become bigger,
which means the container has already been recreated (maybe by an in-place update).
![ContainerRecreateRequest](/img/docs/user-manuals/containerrecreaterequest.png)
Typically, **kruise-daemon** will stop the container with or without preStop hook, then **kubelet** will create a new container and start again.
Finally, **kruise-daemon** will report the container phase as `Succeeded` only if the new container has started over `minStartedSeconds` duration.
If the recreation occurs together with an in-place update at the same time:
- If **Kubelet** has stopped or recreated the container because of the in-place update, **kruise-daemon** will consider it already recreated.
- If **kruise-daemon** stops the container, **Kubelet** will keep in-place updating the container to the new image.
If multiple ContainerRecreateRequests are submitted for one Pod, they will be executed one by one.
---
title: Deletion Protection
---
**FEATURE STATE:** Kruise v0.9.0
This feature provides a safety policy which could help users protect Kubernetes resources and
applications' availability from the cascading deletion mechanism.
## Usage
Firstly, users have to enable the `ResourcesDeletionProtection` feature-gate during [Kruise installation or upgrade](../installation#optional-feature-gate).
Then, users can add the label named `policy.kruise.io/delete-protection` to some specific resources. The values can be:
- `Always`: this object will always be forbidden to be deleted, unless the label is removed
- `Cascading`: this object will be forbidden to be deleted, if it still owns active resources
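For example, a namespace (hypothetical name) could be protected like this:
```bash
$ kubectl label namespace ns-protected policy.kruise.io/delete-protection=Cascading
```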
The resources supported and the cascading judgement relationship:
| Kind                        | Group                  | Version            | **Cascading** judgement                          |
| --------------------------- | ---------------------- | ------------------ | ------------------------------------------------ |
| `Namespace`                 | core                   | v1                 | whether there are active Pods in this namespace   |
| `CustomResourceDefinition`  | apiextensions.k8s.io   | v1beta1, v1        | whether there are existing CRs of this CRD        |
| `Deployment`                | apps                   | v1                 | whether the replicas is 0                         |
| `StatefulSet`               | apps                   | v1                 | whether the replicas is 0                         |
| `ReplicaSet`                | apps                   | v1                 | whether the replicas is 0                         |
| `CloneSet`                  | apps.kruise.io         | v1alpha1           | whether the replicas is 0                         |
| `StatefulSet`               | apps.kruise.io         | v1alpha1, v1beta1  | whether the replicas is 0                         |
| `UnitedDeployment`          | apps.kruise.io         | v1alpha1           | whether the replicas is 0                         |
## Risk
Using an `objectSelector` in the [webhook configuration](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#webhook-configuration),
the Kruise webhook will only handle those `Namespace/CustomResourceDefinition/Deployment/StatefulSet/ReplicaSet` resources that have the `policy.kruise.io/delete-protection` label.
So, if all kruise-manager Pods are crashed or in other abnormal states and kube-apiserver fails to call the deletion webhook,
only the resources with the `policy.kruise.io/delete-protection` label will temporarily be unable to be deleted.
---
title: ImagePullJob
---
NodeImage and ImagePullJob are new CRDs provided since Kruise v0.8.0.
Kruise will create a NodeImage for each Node, and it contains the images that should be downloaded on this Node.
Users can create an ImagePullJob to declare which nodes an image should be downloaded on.
![Image Pulling](/img/docs/user-manuals/imagepulling.png)
Note that the NodeImage is quite **a low-level API**. You should only use it when you prepare to pull an image on a definite Node.
Otherwise, you should **use the ImagePullJob to pull an image on a batch of Nodes.**
## ImagePullJob (high-level)
ImagePullJob is a **namespaced-scope** resource.
API definition: https://github.com/openkruise/kruise/blob/master/apis/apps/v1alpha1/imagepulljob_types.go
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ImagePullJob
metadata:
name: job-with-always
spec:
image: nginx:1.9.1 # [required] image to pull
parallelism: 10 # [optional] the maximal number of Nodes that pull this image at the same time, defaults to 1
selector: # [optional] the names or label selector to assign Nodes (only one of them can be set)
names:
- node-1
- node-2
matchLabels:
node-type: xxx
# podSelector: # [optional] label selector over pods that should pull image on nodes of these pods. Mutually exclusive with selector.
# pod-label: xxx
completionPolicy:
type: Always # [optional] defaults to Always
    activeDeadlineSeconds: 1200 # [optional] no default, only works for the Always type
    ttlSecondsAfterFinished: 300 # [optional] no default, only works for the Always type
pullPolicy: # [optional] defaults to backoffLimit=3, timeoutSeconds=600
backoffLimit: 3
timeoutSeconds: 300
```
You can write the names or label selector in the `selector` field to assign Nodes **(only one of them can be set)**.
If no `selector` is set, the image will be pulled on all Nodes in the cluster.
Or you can write a `podSelector` to pull the image on the nodes of those pods. `podSelector` is mutually exclusive with `selector`.
Also, ImagePullJob has two completionPolicy types:
- `Always` means this job will eventually complete with either failed or succeeded.
- `activeDeadlineSeconds`: timeout duration for this job
- `ttlSecondsAfterFinished`: after this job finished (including success or failure) over this time, this job will be removed
- `Never` means this job will never complete; it will continuously pull the image on the desired Nodes every day.
### configure secrets
If the image is in a private registry, you may want to configure the pull secrets for the image:
```yaml
# ...
spec:
pullSecrets:
- secret-name1
- secret-name2
```
Because ImagePullJob is a namespace-scoped resource, the secrets must be in the same namespace as the ImagePullJob,
and you only need to put the secret names into the `pullSecrets` field.
## NodeImage (low-level)
NodeImage is a **cluster-scope** resource.
API definition: https://github.com/openkruise/kruise/blob/master/apis/apps/v1alpha1/nodeimage_types.go
When Kruise has been installed, nodeimage-controller will create NodeImages for Nodes with the same names immediately.
And when a Node has been added or removed, nodeimage-controller will also create or delete NodeImage for this Node.
What's more, nodeimage-controller will also synchronize labels from Node to NodeImage. So the NodeImage and Node always have
the same name and labels. You can get NodeImage with the Node name, or list NodeImage with the Node labels as selector.
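For example, assuming a node named `node-xxx` with standard node labels:
```bash
# get the NodeImage of a specific Node
$ kubectl get nodeimage node-xxx
# list NodeImages using Node labels as selector
$ kubectl get nodeimage -l kubernetes.io/os=linux
```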
Typically, an empty NodeImage looks like this:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: NodeImage
metadata:
labels:
kubernetes.io/arch: amd64
kubernetes.io/os: linux
# ...
name: node-xxx
# ...
spec: {}
status:
desired: 0
failed: 0
pulling: 0
succeeded: 0
```
If you want to pull an image such as `ubuntu:latest` on this Node, you can:
1. `kubectl edit nodeimage node-xxx` and write below into it (ignore the comments):
```yaml
# ...
spec:
images:
ubuntu: # image name
tags:
- tag: latest # image tag
pullPolicy:
ttlSecondsAfterFinished: 300 # [required] after this image pulling finished (including success or failure) over 300s, this task will be removed
timeoutSeconds: 600 # [optional] timeout duration for once pulling, defaults to 600
backoffLimit: 3 # [optional] retry times for pulling, defaults to 3
activeDeadlineSeconds: 1200 # [optional] timeout duration for this task, no default
```
2. `kubectl patch nodeimage node-xxx --type=merge -p '{"spec":{"images":{"ubuntu":{"tags":[{"tag":"latest","pullPolicy":{"ttlSecondsAfterFinished":300}}]}}}}'`
You can read the NodeImage status using `kubectl get nodeimage node-xxx -o yaml`,
and you will find the task removed from spec and status after it has finished over 300s.
---
title: PodUnavailableBudget
---
**FEATURE STATE:** Kruise v0.10.0
Kubernetes offers [Pod Disruption Budget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) to help you run highly available applications even when you introduce frequent [voluntary disruptions](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/).
PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions. However, it can only constrain the voluntary disruption triggered by the [Eviction API](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#eviction-api).
For example, when you run kubectl drain, the tool tries to evict all of the Pods on the Node you're taking out of service.
In the following voluntary disruption scenarios, there may still be business disruption or SLA degradation:
1. The application owner updates the deployment's pod template for a general upgrade, while the cluster administrator drains nodes to scale the cluster down (learn about [Cluster Autoscaling](https://github.com/kubernetes/autoscaler/#readme)).
2. The middleware team is using [SidecarSet](./sidecarset) to rolling-upgrade the sidecar containers of the cluster, e.g. ServiceMesh envoy, while HPA triggers the scale-down of business applications.
3. The application owner and the middleware team release the same Pods at the same time based on OpenKruise CloneSet and SidecarSet in-place upgrades.
In voluntary disruption scenarios, PodUnavailableBudget can prevent application disruption or SLA degradation, which greatly improves the high availability of application services.
A sample PodUnavailableBudget yaml looks like following:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: PodUnavailableBudget
metadata:
name: web-server-pub
namespace: web
spec:
targetRef:
apiVersion: apps.kruise.io/v1alpha1
# cloneset, deployment, statefulset etc.
kind: CloneSet
name: web-server
# selector label query over pods managed by the budget
# selector and TargetReference are mutually exclusive, targetRef is priority to take effect.
# selector is commonly used in scenarios where applications are deployed using multiple workloads,
# and targetRef is used for protection against a single workload.
# selector:
# matchLabels:
# app: web-server
# maximum number of Pods unavailable for the current cloneset, the example is cloneset.replicas(5) * 60% = 3
# maxUnavailable and minAvailable are mutually exclusive, maxUnavailable is priority to take effect
maxUnavailable: 60%
# Minimum number of Pods available for the current cloneset, the example is cloneset.replicas(5) * 40% = 2
# minAvailable: 40%
---
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
labels:
app: web-server
name: web-server
namespace: web
spec:
replicas: 5
selector:
matchLabels:
app: web-server
template:
metadata:
labels:
app: web-server
spec:
containers:
- name: nginx
image: nginx:alpine
```
## Implementation
This program customizes the PodUnavailableBudget (later referred to as PUB) CRD resource to describe the desired state of the application, and the working mechanism is shown below:
![PodUnavailableBudget](/img/docs/user-manuals/podunavailablebudget.png)
## Comparison with Kubernetes native PDB
The Kubernetes PodDisruptionBudget implements protection against Pod eviction based on the EvictionREST interface,
while PodUnavailableBudget intercepts all pod modification requests through the admission webhook validating mechanism (many voluntary disruption scenarios can be summarized as modifications to Pod resources),
and rejects the request if the modification does not satisfy the desired state of the PUB.
**PUB contains all the protection capabilities of the Kubernetes PDB. You can use both, or use PUB independently to protect your applications (recommended).**
## feature-gates
PodUnavailableBudget protection for Pods is turned off by default. If you want to turn it on, set the feature-gates *PodUnavailableBudgetDeleteGate* and *PodUnavailableBudgetUpdateGate*.
```bash
$ helm install kruise https://... --set featureGates="PodUnavailableBudgetDeleteGate=true\,PodUnavailableBudgetUpdateGate=true"
```
## PodUnavailableBudget Status
```yaml
# kubectl describe podunavailablebudgets web-server-pub
Name: web-server-pub
Kind: PodUnavailableBudget
Status:
  unavailableAllowed: 3   # number of unavailable pods currently allowed
  currentAvailable: 5     # current number of available pods
  desiredAvailable: 2     # minimum desired number of available pods
  totalReplicas: 5        # total number of pods counted by this PUB
```
---
title: ResourceDistribution
---
In scenarios where namespace-scoped resources such as Secret and ConfigMap need to be distributed or synchronized to different namespaces, native Kubernetes currently only supports manual distribution and synchronization by users one by one, which is very inconvenient.
Typical examples:
- When users want to use the imagePullSecrets capability of SidecarSet, they must repeatedly create corresponding Secrets in the relevant namespaces, and ensure the correctness and consistency of these Secret configurations;
- When users want to configure some common environment variables, they probably need to distribute ConfigMaps to multiple namespaces, and subsequent modifications of these ConfigMaps might require synchronization among these namespaces.
Therefore, for these scenarios that require resource distribution and **continuous synchronization across namespaces**, we provide a tool, namely **ResourceDistribution**, to do this automatically.
Currently, ResourceDistribution supports two kinds of resources: **Secret & ConfigMap**.
## API Description
ResourceDistribution is a kind of **cluster-scoped CRD**, which is mainly composed of two fields: **`resource` and `targets`**.
The **`resource`** field is used to describe the resource to be distributed by the user, and **`targets`** is used to describe the destination namespaces.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
name: sample
spec:
resource:
... ...
targets:
... ...
```
### Resource Field
The `resource` field must be a **complete** and **correct** resource description in YAML style.
An example of a correct configuration of `resource` is as follows:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
name: sample
spec:
resource:
apiVersion: v1
kind: ConfigMap
metadata:
name: game-demo
data:
game.properties: |
enemy.types=aliens,monsters
player.maximum-lives=5
player_initial_lives: "3"
ui_properties_file_name: user-interface.properties
user-interface.properties: |
color.good=purple
color.bad=yellow
allow.textmode=true
targets:
... ...
```
**Tips**: users can first create corresponding resources in a local namespace and test them, and then copy them after confirming that the resource configuration is correct.
### Targets Field
The **`targets`** field currently supports four rules to describe the target namespaces, including `allNamespaces`, `includedNamespaces`, `excludedNamespaces` and `namespaceLabelSelector`:
- `allNamespaces`: match all of the namespaces if it is `true`;
- `includedNamespaces`: match the target namespaces by name;
- `namespaceLabelSelector`: use labelSelector to match the target namespaces;
- `excludedNamespaces`: use names to exclude namespaces that you do not want to distribute to;
**Calculation rules for the target namespaces:**
1. Initialize the target namespace set *T* = ∅;
2. If `allNamespaces=true`, add all namespaces to *T*;
3. Add the namespaces listed in `includedNamespaces` to *T*;
4. Add the namespaces matching the `namespaceLabelSelector` to *T*;
5. Remove the namespaces listed in `excludedNamespaces` from *T*;
**`allNamespaces`, `includedNamespaces` and `namespaceLabelSelector` have an *"OR"* relationship, and `excludedNamespaces` always takes effect if users set it. By the way, `targets` will always ignore the `kube-system` and `kube-public` namespaces.**
A correctly configured targets field is as follows:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
name: sample
spec:
resource:
... ...
targets:
includedNamespaces:
list:
- name: ns-1
- name: ns-4
namespaceLabelSelector:
matchLabels:
group: test
excludedNamespaces:
list:
- name: ns-3
```
In the above example, the target namespaces of the ResourceDistribution will contain `ns-1` and `ns-4`, and the namespaces whose labels meet the `namespaceLabelSelector`. However, even if `ns-3` meets the namespaceLabelSelector, it will not be included because it has been explicitly excluded in `excludedNamespaces`.
## A Complete Use Case
### Distribute Resource
When the user correctly configures the `resource` and `targets` fields, the ResourceDistribution controller will execute the distribution, and this resource will be automatically created in each target namespaces.
A complete configuration is as follows:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ResourceDistribution
metadata:
name: sample
spec:
resource:
apiVersion: v1
kind: ConfigMap
metadata:
name: game-demo
data:
game.properties: |
enemy.types=aliens,monsters
player.maximum-lives=5
player_initial_lives: "3"
ui_properties_file_name: user-interface.properties
user-interface.properties: |
color.good=purple
color.bad=yellow
allow.textmode=true
targets:
excludedNamespaces:
list:
- name: ns-3
includedNamespaces:
list:
- name: ns-1
- name: ns-4
namespaceLabelSelector:
matchLabels:
group: test
```
### Tracking Failures After The Distribution
Of course, resource distribution may not always be successful.
In the process of distribution, various errors may occur. To this end, we record some conditions of distribution failures in the `status` field so that users can track them.
**First**, the `status` records the total number of target namespaces (desired), the number of successfully distributed target namespaces (succeeded), and the number of failed target namespaces (failed):
```yaml
status:
Desired: 3
Failed: 1
Succeeded: 2
```
**Then**, in order to help users further understand the reason and location (namespaces) of the failed distributions, `status` also summarizes the types of distribution errors, which are divided into 6 categories and recorded in `status.conditions`:
- Four types of conditions record failures of operating resources, that is `Get`, `Create`, `Update` and `Delete` errors;
- One type of condition records the error that the namespace does not exist;
- One type of condition records resource conflicts: if a resource with the same name, kind and apiVersion already exists in the target namespace, this conflict will be recorded in `status.conditions`.
```yaml
Status:
Conditions:
Last Transition Time: 2021-09-06T08:42:28Z
Reason: Succeeded
Status: False
Type: GetResourceFailed
Last Transition Time: 2021-09-06T08:42:28Z
Reason: Succeeded
Status: False
Type: CreateResourceFailed
Last Transition Time: 2021-09-06T08:42:28Z
Reason: Succeeded
Status: False
Type: UpdateResourceFailed
Last Transition Time: 2021-09-06T08:42:28Z
Reason: Succeeded
Status: False
Type: DeleteResourceFailed
Last Transition Time: 2021-09-06T08:42:28Z
Reason: Succeeded
Status: False
Type: ConflictOccurred
Failed Namespace:
ns-1
ns-4
Last Transition Time: 2021-09-06T08:45:08Z
Reason: namespace not found
Status: True
Type: NamespaceNotExists
```
The above example shows an error that the target namespaces `ns-1` and `ns-4` do not exist, and both the error type and namespaces are recorded.
### Update/Sync Resource
**ResourceDistribution allows users to update the resource field, and the update will automatically sync to all the target namespaces.**
When a resource is updated, ResourceDistribution will calculate the hash value of the new version of the resource and record it in the `annotations` of the resource CR. When ResourceDistribution finds that the hash value of the resource was changed, it will update it.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: game-demo
annotations:
kruise.io/resourcedistribution.resource.from: sample
kruise.io/resourcedistribution.resource.distributed.timestamp: 2021-09-06 08:44:52.7861421 +0000 UTC m=+12896.810364601
kruise.io/resourcedistribution.resource.hashcode: 0821a13321b2c76b5bd63341a0d97fb46bfdbb2f914e2ad6b613d10632fa4b63
... ...
```
In particular, we **DO NOT** recommend that users bypass ResourceDistribution and directly modify the resources unless they know what they are doing:
- After modifying resources directly, the hash value of the resources will not be recalculated automatically. Therefore, **when the `resource` field is modified, ResourceDistribution may overwrite the user's direct modifications to these resources;**
- ResourceDistribution determines whether a resource was distributed by itself through the `kruise.io/resourcedistribution.resource.from` annotation. If this annotation is changed, the modified resource will be regarded as a conflict and will no longer be synchronized.
### Cascading Deletion
**ResourceDistribution controls the distributed resources through ownerReference. Therefore, it should be noted that when the ResourceDistribution is deleted, all the resources it distributed will also be deleted.**
---
title: SidecarSet
---
This controller leverages the admission webhook to automatically
inject a sidecar container for every selected Pod when the Pod is created. The Sidecar
injection process is similar to the automatic sidecar injection mechanism used in
[istio](https://istio.io/docs/setup/kubernetes/additional-setup/sidecar-injection/).
Besides injection during Pod creation, SidecarSet controller also provides
additional capabilities such as in-place Sidecar container image upgrade, mounting Sidecar volumes, etc.
Basically, SidecarSet decouples the Sidecar container lifecycle
management from the main container lifecycle management.
The SidecarSet is preferable for managing stateless sidecar containers such as
monitoring tools or operation agents.
## Example
### Create SidecarSet
The `sidecarset.yaml` file below describes a SidecarSet that contains a sidecar container named `sidecar1`:
```yaml
# sidecarset.yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: test-sidecarset
spec:
selector:
matchLabels:
app: nginx
updateStrategy:
type: RollingUpdate
maxUnavailable: 1
containers:
- name: sidecar1
image: centos:6.7
command: ["sleep", "999d"] # do nothing at all
volumeMounts:
- name: log-volume
mountPath: /var/log
volumes: # this field will be merged into pod.spec.volumes
- name: log-volume
emptyDir: {}
```
Create a SidecarSet based on the YAML file:
```bash
kubectl apply -f sidecarset.yaml
```
### Create a Pod
Create a pod that matches the sidecarset's selector:
```yaml
apiVersion: v1
kind: Pod
metadata:
labels:
app: nginx # matches the SidecarSet's selector
name: test-pod
spec:
containers:
- name: app
image: nginx:1.15.1
```
Create this pod and now you will find it's injected with `sidecar1`:
```bash
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
test-pod 2/2 Running 0 118s
```
In the meantime, the SidecarSet status updated:
```bash
$ kubectl get sidecarset test-sidecarset -o yaml | grep -A4 status
status:
matchedPods: 1
observedGeneration: 1
readyPods: 1
updatedPods: 1
```
### update sidecar container image
Update the SidecarSet's sidecar container image to `centos:7`:
```bash
$ kubectl edit sidecarsets test-sidecarset
# sidecarset.yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: test-sidecarset
spec:
containers:
- name: sidecar1
image: centos:7
```
The Sidecar container in the pod has been updated to centos:7, and the pod and other containers have not been restarted.
```bash
$ kubectl get pods |grep test-pod
test-pod 2/2 Running 1 7m34s
$ kubectl get pods test-pod -o yaml |grep 'image: centos'
image: centos:7
$ kubectl describe pods test-pod
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 5m47s kubelet Container sidecar1 definition changed, will be restarted
Normal Pulling 5m17s kubelet Pulling image "centos:7"
Normal Created 5m5s (x2 over 12m) kubelet Created container sidecar1
Normal Started 5m5s (x2 over 12m) kubelet Started container sidecar1
Normal Pulled 5m5s kubelet Successfully pulled image "centos:7"
```
## SidecarSet features
A sample SidecarSet yaml looks like following:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
selector:
matchLabels:
app: sample
containers:
- name: nginx
image: nginx:alpine
initContainers:
- name: init-container
image: busybox:latest
command: [ "/bin/sh", "-c", "sleep 5 && echo 'init container success'" ]
updateStrategy:
type: RollingUpdate
namespace: ns-1
```
- spec.selector: selects the Pods that need to be injected and updated by labels. Both matchLabels and matchExpressions are supported. Please refer to the details: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
- spec.containers: defines the pod.spec.containers[x] that need to be injected and updated, supporting the full K8s Container fields. Please refer to the details: https://kubernetes.io/docs/concepts/containers/
- spec.initContainers: defines the pod.spec.initContainers[x] you need to inject, supporting the full K8s InitContainer fields. Please refer to the details: https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
  - Those init containers will be injected by their name in ascending order
  - InitContainers only support injection and do not support in-place update of the Pod
- spec.updateStrategy: the SidecarSet update strategy; type indicates the upgrade method:
  - NotUpdate: no updates; in this type the SidecarSet only injects sidecar containers into Pods
  - RollingUpdate: injection and rolling update, which contains a rich set of update strategies and will be described in more detail later
- spec.namespace: by default, a SidecarSet is cluster-scoped in K8s, that is, it applies to all namespaces (except kube-system and kube-public). When the spec.namespace field is set, it only applies to Pods in that namespace
### sidecar container injection
The injection of sidecar containers happens at Pod creation time and only the Pod spec is updated. The workload template spec will not be updated.
In addition to the default K8s Container fields, the following fields have been extended for injection:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
selector:
matchLabels:
app: sample
containers:
# default K8s Container fields
- name: nginx
image: nginx:alpine
volumeMounts:
- mountPath: /nginx/conf
name: nginx.conf
# extended sidecar container fields
podInjectPolicy: BeforeAppContainer
shareVolumePolicy:
type: disabled | enabled
transferEnv:
- sourceContainerName: main
envName: PROXY_IP
volumes:
- Name: nginx.conf
hostPath: /data/nginx/conf
```
- podInjectPolicy: defines where the containers are injected into pod.spec.containers
  - BeforeAppContainer (default): inject before the original pod containers
  - AfterAppContainer: inject after the original pod containers
- data volume sharing
  - Share specific volumes: use spec.volumes to define the volumes needed by the sidecar itself. See details: https://kubernetes.io/docs/concepts/storage/volumes/
  - Share pod containers volumes: if shareVolumePolicy.type is enabled, the sidecar container will share the VolumeMounts of the other containers in the pod (not including the injected sidecar containers)
- environment variable sharing
  - Environment variables can be fetched from another container through spec.containers[x].transferEnv; the environment variable named envName from the container named sourceContainerName is copied to this container
#### injection pause
**FEATURE STATE:** Kruise v0.10.0
For existing SidecarSets, users can pause sidecar injection by setting `spec.injectionStrategy.paused=true`:
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
... ...
injectionStrategy:
paused: true
```
This feature only works on the newly-created Pods, and has no impact on the sidecar containers that have been injected.
#### imagePullSecrets
**FEATURE STATE:** Kruise v0.10.0
Users can use private images in SidecarSet by configuring [spec.imagePullSecrets](https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/).
SidecarSet will inject them into Pods at the injection stage.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
... ....
imagePullSecrets:
- name: my-secret
```
**Note**: Users must ensure that the corresponding Secrets have already existed in the namespaces where Pods need to pull the private images. Otherwise, pulling private images will not succeed.
### sidecarset update strategy
SidecarSet not only supports in-place update of sidecar containers, but also provides very rich upgrade strategies.
#### partition
Partition is the **desired number or percent of Pods in old revisions**, defaults to `0`. This field does **NOT** imply any update order.
When `partition` is set during update:
- If it is a number: `(replicas - partition)` number of pods will be updated with the new version.
- If it is a percent: `(replicas * (100% - partition))` number of pods will be updated with the new version.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
# ...
updateStrategy:
type: RollingUpdate
partition: 90
```
Assuming that the number of Pods associated with this SidecarSet is 100, this upgrade will only upgrade 10 Pods to the latest version and keep 90 Pods at the old version.
#### MaxUnavailable
MaxUnavailable is the maximum number of Pods guaranteed to be unavailable at the same time during the release process. The default value is 1.
The user can set it to either an absolute value or a percentage (for a percentage, the controller calculates the absolute value based on the number of selected Pods).
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
# ...
updateStrategy:
type: RollingUpdate
maxUnavailable: 20%
```
Note that maxUnavailable and partition are not necessarily related. For example:
- When {matched pod}=100, partition=50 and maxUnavailable=10, the controller will update 50 Pods to the new version, but only 10 Pods will be updated at the same time, until the 50 updates have completed.
- When {matched pod}=100, partition=80 and maxUnavailable=30, the controller will update 20 Pods to the new version. Because the maxUnavailable number is 30, these 20 Pods will be updated simultaneously.
#### Pause
A user can pause the release by setting `paused` to `true`. Injection still works for newly created or scaled-out Pods; Pods that have already been updated stay at the updated version, and those that have not been updated are paused.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
# ...
updateStrategy:
type: RollingUpdate
paused: true
```
#### Selector
For businesses that have canary update requirements, this can be done through the `strategy.selector` field. First, label the Pods that should be updated in the canary release with a fixed label such as `canary.release: true`; then set `strategy.selector.matchLabels` to select those Pods.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
# ...
updateStrategy:
type: RollingUpdate
selector:
matchLabels:
canary.release: true
```
### sidecarset update order
- The Pods to be upgraded are sorted by default to ensure the same order across multiple upgrades
- The default priority is (the smaller the higher the priority): unscheduled < scheduled, pending < unknown < running, not-ready < ready, newer pods < older pods
- scatter order
#### scatter
The scatter policy allows users to scatter the Pods that match certain labels throughout the release process.
For example, if a SidecarSet manages 10 Pods and 3 of them have the label foo=bar, and the user sets this label in the scatter policy, then these 3 Pods will be released in the 1st, 6th, and 10th positions.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
# ...
updateStrategy:
type: RollingUpdate
scatterStrategy:
- key: foo
value: bar
```
**Note: If you use Scatter, it is recommended to set only a pair of key-values for scatter. It will be easier to understand.**
### Hot Upgrade Sidecar
**FEATURE STATE:** Kruise v0.9.0
SidecarSet's in-place upgrade will stop the container of the old version first and then create the container of the new version. This method is more suitable for sidecar containers that do not affect service availability, e.g. a logging collector.
But for many proxy or runtime sidecar containers, e.g. Istio Envoy, this upgrade method is problematic. Envoy, as a proxy container in the Pod, proxies all the traffic, and if it is restarted directly, the availability of the service is affected. Complex graceful termination and coordination is required if one needs to upgrade the envoy sidecar independently of the application container. So we provide a new solution for such sidecar container upgrades.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: hotupgrade-sidecarset
spec:
selector:
matchLabels:
app: hotupgrade
containers:
- name: sidecar
image: openkruise/hotupgrade-sample:sidecarv1
imagePullPolicy: Always
lifecycle:
postStart:
exec:
command:
- /bin/sh
- /migrate.sh
upgradeStrategy:
upgradeType: HotUpgrade
hotUpgradeEmptyImage: openkruise/hotupgrade-sample:empty
```
- upgradeType: HotUpgrade indicates a hot upgrade for stateful sidecar containers.
- hotUpgradeEmptyImage: when upgradeType=HotUpgrade, the user needs to provide an empty image for hot upgrades. hotUpgradeEmptyImage has the same configuration as the sidecar container, for example command, lifecycle, probe, etc., but it does not do anything.
- lifecycle.postStart: state migration. This hook completes the state migration of the stateful container and needs to be provided by the sidecar image developer.
Hot upgrade consists of the following two processes:
- inject hot upgrade sidecar containers
- in-place hot upgrade sidecar container
#### Inject Containers
When the sidecar container upgradeStrategy=HotUpgrade, the SidecarSet Webhook will inject two containers into the Pod:
1. {sidecarContainer.name}-1: envoy-1 in the figure below; this container runs the actual working sidecar image, such as envoy:1.16.0.
2. {sidecarContainer.name}-2: envoy-2 in the figure below; this container runs the hot upgrade empty image, such as empty:1.0, and does not handle any real logic; it just stays in place.
![sidecarset hotupgrade_injection](/img/docs/user-manuals/sidecarset_hotupgrade_injection.png)
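Conceptually, after injection the Pod's container list looks roughly like this (a sketch following the naming convention above; the images and the application container are illustrative, not the webhook's literal output):
```yaml
spec:
  containers:
  - name: envoy-1          # runs the real working sidecar image
    image: envoy:1.16.0
  - name: envoy-2          # hot upgrade empty container, stays idle
    image: empty:1.0
  - name: main             # the application container is not affected
    image: my-app:v1
```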
#### Hot Upgrade
The SidecarSet controller breaks down the hot upgrade process of the sidecar container into three steps:
1. Upgrade: upgrade the empty container to the new version of the sidecar container, such as envoy-2.Image = envoy:1.17.0
2. Migration: the process completes the state migration of stateful container, which needs to be provided by the sidecar image developer. PostStartHook completes the migration of the above process.
(**Note: PostStartHook must block during the migration, and exit when migration complete.**)
3. Reset: the step resets the old version sidecar container into empty container, such as envoy-1.Image = empty:1.0
The above is the complete hot upgrade process. If a Pod needs to be hot upgraded several times, the above three steps can be repeated.
![sidecarset hotupgrade](/img/docs/user-manuals/sidecarset_hotupgrade.png)
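Putting the three steps together for an upgrade from envoy:1.16.0 to envoy:1.17.0, the images of the two injected containers change roughly as follows (illustrative values):
```yaml
# Before the upgrade:
#   envoy-1.image = envoy:1.16.0   (working sidecar)
#   envoy-2.image = empty:1.0      (idle)
# Step 1 - Upgrade:
#   envoy-2.image = envoy:1.17.0   (new version starts up)
# Step 2 - Migration:
#   envoy-2's PostStartHook migrates state from envoy-1 and blocks until it completes
# Step 3 - Reset:
#   envoy-1.image = empty:1.0      (old version becomes the idle container)
```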
#### Migration Demo
The SidecarSet hot upgrade mechanism not only completes the switching between mesh containers, but also provides a coordination mechanism for the old and new versions. Yet this is only the first step of a long journey. The mesh container also needs to provide a PostStartHook script to complete the hot migration of the mesh service itself (the Migration process above), such as Envoy hot restart and Mosn lossless restart.
To facilitate a better understanding of the Migration process, a migration demo is provided in the kruise samples repository: [Migration Demo](https://github.com/openkruise/samples/tree/master/hotupgrade)
For design documentation, please refer to: [proposals sidecarset hot upgrade](https://github.com/openkruise/kruise/blob/master/docs/proposals/20210305-sidecarset-hotupgrade.md)
Currently known cases that utilize the SidecarSet hot upgrade mechanism:
- [ALIYUN ASM](https://help.aliyun.com/document_detail/193804.html) implements lossless upgrade of Data Plane in Service Mesh.
### SidecarSet Status
When upgrading sidecar containers with a SidecarSet, you can observe the upgrade progress through SidecarSet.Status:
```yaml
# kubectl describe sidecarsets sidecarset-example
Name: sidecarset-example
Kind: SidecarSet
Status:
  Matched Pods:         10   # The number of Pods injected and managed by the SidecarSet
  Updated Pods:         5    # 5 Pods have been updated to the container version in the latest SidecarSet
  Ready Pods:           8    # Number of matched Pods with pod.status.condition.Ready = true
  Updated Ready Pods:   3    # Number of Pods that are both updated and ready
```

---
title: UnitedDeployment
---
This controller provides a new way to manage pods in multi-domain by using multiple workloads.
A high level description about this workload can be found in this [blog post](/blog/uniteddeployment).
Different domains in one Kubernetes cluster are represented by multiple groups of
nodes identified by labels. The UnitedDeployment controller provisions one workload
for each group of nodes with a corresponding `NodeSelector`, so that
the pods created by each workload will be scheduled to the target domain.
Each workload managed by UnitedDeployment is called a `subset`.
Each domain should at least provide the capacity to run the `replicas` number of pods.
Currently `StatefulSet`, `Advanced StatefulSet`, `CloneSet` and `Deployment` are the supported workloads.
API definition: https://github.com/openkruise/kruise/blob/master/apis/apps/v1alpha1/uniteddeployment_types.go
The below sample yaml presents a UnitedDeployment which manages three StatefulSet instances in three domains.
The total number of managed pods is 6.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: UnitedDeployment
metadata:
name: sample-ud
spec:
replicas: 6
revisionHistoryLimit: 10
selector:
matchLabels:
app: sample
template:
# statefulSetTemplate or advancedStatefulSetTemplate or cloneSetTemplate or deploymentTemplate
statefulSetTemplate:
metadata:
labels:
app: sample
spec:
selector:
matchLabels:
app: sample
template:
metadata:
labels:
app: sample
spec:
containers:
- image: nginx:alpine
name: nginx
topology:
subsets:
- name: subset-a
nodeSelectorTerm:
matchExpressions:
- key: node
operator: In
values:
- zone-a
replicas: 1
- name: subset-b
nodeSelectorTerm:
matchExpressions:
- key: node
operator: In
values:
- zone-b
replicas: 50%
- name: subset-c
nodeSelectorTerm:
matchExpressions:
- key: node
operator: In
values:
- zone-c
updateStrategy:
manualUpdate:
partitions:
subset-a: 0
subset-b: 0
subset-c: 0
type: Manual
...
```
## Pod Distribution Management
This controller provides `spec.topology` to describe the pod distribution specification.
```go
// Topology defines the spread detail of each subset under UnitedDeployment.
// A UnitedDeployment manages multiple homogeneous workloads which are called subset.
// Each of subsets under the UnitedDeployment is described in Topology.
type Topology struct {
// Contains the details of each subset. Each element in this array represents one subset
// which will be provisioned and managed by UnitedDeployment.
// +optional
Subsets []Subset `json:"subsets,omitempty"`
}
// Subset defines the detail of a subset.
type Subset struct {
// Indicates subset name as a DNS_LABEL, which will be used to generate
// subset workload name prefix in the format '<deployment-name>-<subset-name>-'.
// Name should be unique between all of the subsets under one UnitedDeployment.
Name string `json:"name"`
// Indicates the node selector to form the subset. Depending on the node selector,
// pods provisioned could be distributed across multiple groups of nodes.
// A subset's nodeSelectorTerm is not allowed to be updated.
// +optional
NodeSelectorTerm corev1.NodeSelectorTerm `json:"nodeSelectorTerm,omitempty"`
// Indicates the tolerations the pods under this subset have.
// A subset's tolerations is not allowed to be updated.
// +optional
Tolerations []corev1.Toleration `json:"tolerations,omitempty"`
// Indicates the number of the pod to be created under this subset. Replicas could also be
// percentage like '10%', which means 10% of UnitedDeployment replicas of pods will be distributed
// under this subset. If nil, the number of replicas in this subset is determined by controller.
// Controller will try to keep all the subsets with nil replicas have average pods.
// +optional
Replicas *intstr.IntOrString `json:"replicas,omitempty"`
}
```
`topology.subsets` specifies the desired group of `subset`s.
A subset added to or removed from this array will be created or deleted by the controller during reconciliation.
Each subset workload is created based on the description of UnitedDeployment `spec.template`.
`subset` provides the necessary topology information to create a subset workload.
Each subset has a unique name. A subset workload is created with the name prefix in
format of `<UnitedDeployment-name>-<Subset-name>-`. Each subset will also be configured with
the `nodeSelector`.
When provisioning a StatefulSet `subset`, the `nodeSelector` will be added
to the StatefulSet's `podTemplate`, so that the Pods of the StatefulSet will be created with the
expected node affinity.
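For instance, for `subset-a` in the sample above, the Pods created by its StatefulSet end up constrained to nodes with `node in (zone-a)`. Expressed as a node affinity term in the Pod template, the constraint looks roughly like this (a sketch of the intent, not the controller's literal output):
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node
          operator: In
          values:
          - zone-a
```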
By default, UnitedDeployment's Pods are evenly distributed across all subsets.
There are two scenarios in which the controller does not follow this policy:
The first one is customizing the distribution policy by indicating `subset.replicas`.
A valid `subset.replicas` can be an integer, representing an absolute number of pods,
or a string in percentage format like '40%', representing a fixed proportion of pods.
Once `subset.replicas` is given, the controller reconciles to make sure
each subset has the expected number of replicas.
Subsets with empty `subset.replicas` divide the remaining replicas evenly.
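As a worked example, following these rules for the sample above, the allocation works out as:
```yaml
# spec.replicas: 6
#   subset-a: replicas 1     -> 1 Pod   (explicit integer)
#   subset-b: replicas "50%" -> 3 Pods  (50% of 6)
#   subset-c: replicas nil   -> 2 Pods  (the remainder, shared evenly by all nil-replicas subsets)
```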
The other scenario is that the indicated subset replicas policy becomes invalid.
For example, the UnitedDeployment's `spec.replicas` is decremented to be less
than the sum of all `subset.replicas`.
In this case, the indicated `subset.replicas` is ineffective and the controller
will automatically scale each subset's replicas to match the total replicas number.
The controller will try its best to apply this adjustment smoothly.
## Pod Update Management
When `spec.template` is updated, an upgrade process is triggered.
The new template will be patched to each subset workload, which triggers the subset controllers
to update their pods.
Furthermore, if the subset workloads support `partition`, like StatefulSet and Advanced StatefulSet,
UnitedDeployment is also able to provide a `Manual` update strategy.
```go
// UnitedDeploymentUpdateStrategy defines the update performance
// when template of UnitedDeployment is changed.
type UnitedDeploymentUpdateStrategy struct {
// Type of UnitedDeployment update strategy.
// Default is Manual.
// +optional
Type UpdateStrategyType `json:"type,omitempty"`
// Includes all of the parameters a Manual update strategy needs.
// +optional
ManualUpdate *ManualUpdate `json:"manualUpdate,omitempty"`
}
// ManualUpdate is a update strategy which allows users to control the update progress
// by providing the partition of each subset.
type ManualUpdate struct {
// Indicates number of subset partition.
// +optional
Partitions map[string]int32 `json:"partitions,omitempty"`
}
```
`Manual` update strategy allows users to control the update progress by indicating
the `partition` of each subset. The controller will pass the `partition` to each subset.
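For example, to roll the new template out to `subset-a` first while holding most Pods in the other subsets back, the partitions could be set like this (values are illustrative; for subsets backed by Advanced StatefulSet or CloneSet, `partition` is interpreted as the number of Pods to keep at the old revision):
```yaml
spec:
  # ...
  updateStrategy:
    type: Manual
    manualUpdate:
      partitions:
        subset-a: 0   # update every Pod in subset-a
        subset-b: 3   # keep 3 Pods of subset-b at the old revision for now
        subset-c: 2   # keep 2 Pods of subset-c at the old revision for now
```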

---
title: WorkloadSpread
---
**FEATURE STATE:** Kruise v0.10.0
WorkloadSpread can distribute the Pods of a workload to different types of Node according to certain policies, which gives a single workload the abilities of
multi-domain deployment and elastic deployment.
Some common policies include:
- fault-tolerant spread (for example, spread evenly among hosts, AZs, etc.)
- spread according to a specified ratio (for example, deploy Pods to several specified AZs according to given proportions)
- subset management with priority, such as
- deploy Pods to ecs first, and then deploy to eci when its resources are insufficient.
  - deploy a fixed number of Pods to ecs first, and the remaining Pods are deployed to eci.
- subset management with customization, such as
- control how many pods in a workload are deployed in different cpu arch
- enable pods in different cpu arch to have different resource requirements
WorkloadSpread is similar to UnitedDeployment in the OpenKruise community. Each WorkloadSpread defines multiple domains
called `subsets`. Each domain may limit the number of replicas it runs via `maxReplicas`.
WorkloadSpread injects the domain configuration into the Pod via webhook, and it also controls the order of scaling in and scaling out.
Currently supported workloads: `CloneSet`, `Deployment`, `ReplicaSet`.
## Demo
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
name: workloadspread-demo
spec:
targetRef:
apiVersion: apps/v1 | apps.kruise.io/v1alpha1
kind: Deployment | CloneSet
name: workload-xxx
subsets:
- name: subset-a
requiredNodeSelectorTerm:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- zone-a
preferredNodeSelectorTerms:
- weight: 1
preference:
matchExpressions:
- key: another-node-label-key
operator: In
values:
- another-node-label-value
maxReplicas: 3
    tolerations: []
patch:
metadata:
labels:
xxx-specific-label: xxx
- name: subset-b
requiredNodeSelectorTerm:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- zone-b
scheduleStrategy:
type: Adaptive | Fixed
adaptive:
rescheduleCriticalSeconds: 30
```
`targetRef`: specifies the target workload. It cannot be mutated, and one workload can only correspond to one WorkloadSpread.
## subsets
`subsets` consists of multiple domains called `subsets`, and each topology has its own configuration.
### sub-fields
- `name`: the name of the `subset`; it must be unique within a WorkloadSpread and represents a topology.
- `maxReplicas`: the replicas limit of the `subset`; it must be an integer >= 0. There is no replicas limit when `maxReplicas` is nil.
> Percentage values are not supported in the current version.
- `requiredNodeSelectorTerm`: a hard constraint for matching the zone.
- `preferredNodeSelectorTerms`: a soft constraint for matching the zone.
**Caution**: `requiredNodeSelectorTerm` corresponds to `requiredDuringSchedulingIgnoredDuringExecution` of nodeAffinity.
`preferredNodeSelectorTerms` corresponds to `preferredDuringSchedulingIgnoredDuringExecution` of nodeAffinity.
- `tolerations`: the tolerations of the Pods in the `subset`, for example:
```yaml
tolerations:
- key: "key1"
operator: "Equal"
value: "value1"
effect: "NoSchedule"
```
- `patch`: customize the Pod configuration of `subset`, such as Annotations, Labels, Env.
Example:
```yaml
# patch pod with a topology label:
patch:
metadata:
labels:
topology.application.deploy/zone: "zone-a"
```
```yaml
# patch pod container resources:
patch:
spec:
containers:
- name: main
resources:
        limits:
cpu: "2"
memory: 800Mi
```
```yaml
# patch pod container env with a zone name:
patch:
spec:
containers:
- name: main
env:
- name: K8S_AZ_NAME
value: zone-a
```
## Schedule strategy
WorkloadSpread provides two kinds of scheduling strategies; the default is `Fixed`.
```yaml
scheduleStrategy:
type: Adaptive | Fixed
adaptive:
rescheduleCriticalSeconds: 30
```
- Fixed:
The workload is strictly spread according to the subset definitions.
- Adaptive:
**Reschedule**: Kruise checks the unschedulable Pods of each subset. If they stay unschedulable longer than the defined duration (`rescheduleCriticalSeconds`), the failed Pods are rescheduled to another `subset`.
## Requirements
WorkloadSpread is disabled by default. You have to enable the feature-gate *WorkloadSpread* when installing or upgrading Kruise:
```bash
$ helm install kruise https://... --set featureGates="WorkloadSpread=true"
```
### Pod Webhook
WorkloadSpread uses `webhook` to inject fault domain rules.
If the `PodWebhook` feature-gate is set to false, WorkloadSpread will also be disabled.
### deletion-cost feature
`CloneSet` has supported the deletion-cost feature since its recent versions.
Other native workloads need Kubernetes version >= 1.21. (In 1.21, users need to enable the PodDeletionCost feature-gate; since 1.22 it is enabled by default.)
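For reference, the deletion-cost mechanism is driven by a Pod annotation: workload controllers prefer to delete Pods with a lower deletion cost first when scaling in. A Pod carrying the annotation looks like this (the value is illustrative):
```yaml
metadata:
  annotations:
    # Pods with lower deletion cost are preferred for deletion when scaling in
    controller.kubernetes.io/pod-deletion-cost: "-100"
```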
## Scale order
The workload managed by WorkloadSpread scales according to the order defined in `spec.subsets`.
**The order of the `subsets` in `spec.subsets` can be changed**, which adjusts the scale order of the workload.
### Scale out
- Pods are scheduled in the subset order defined in `spec.subsets`. They are scheduled into the next `subset` once the replica number reaches the `maxReplicas` of the current `subset`.
### Scale in
- When the replica number of a `subset` is greater than its `maxReplicas`, the extra Pods are removed with high priority.
- According to the `subset` order in `spec.subsets`, the Pods of the `subsets` at the back are deleted before the Pods at the front, for example:
```yaml
# subset-a subset-b subset-c
# maxReplicas 10 10 nil
# pods number 10 10 10
# deletion order: c -> b -> a
# subset-a subset-b subset-c
# maxReplicas 10 10 nil
# pods number 20 20 20
# deletion order: b -> a -> c
```
## feature-gates
The WorkloadSpread feature is turned off by default. If you want to turn it on, set the feature-gate *WorkloadSpread*:
```bash
$ helm install kruise https://... --set featureGates="WorkloadSpread=true"
```
## Example
### Elastic deployment
`zone-a`(ACK) holds 100 Pods, `zone-b`(ECI) as an elastic zone holds additional Pods.
1. Create a WorkloadSpread instance.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
name: ws-demo
namespace: deploy
spec:
targetRef: # workload in the same namespace
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
name: workload-xxx
subsets:
- name: ACK # zone ACK
requiredNodeSelectorTerm:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- ack
maxReplicas: 100
patch: # inject label.
metadata:
labels:
deploy/zone: ack
- name: ECI # zone ECI
requiredNodeSelectorTerm:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- eci
patch:
metadata:
labels:
deploy/zone: eci
```
2. Create a corresponding workload; the number of replicas can be adjusted freely.
#### Effect
- When the number of `replicas` <= 100, the Pods are scheduled in `ACK` zone.
- When the number of `replicas` > 100, the 100 Pods are in `ACK` zone, the extra Pods are scheduled in `ECI` zone.
- The Pods in `ECI` elastic zone are removed first when scaling in.
### Multi-domain deployment
Deploy 100 Pods to two `zone`(zone-a, zone-b) separately.
1. Create a WorkloadSpread instance.
```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
name: ws-demo
namespace: deploy
spec:
targetRef:
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
name: workload-xxx
subsets:
- name: subset-a
requiredNodeSelectorTerm:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- zone-a
maxReplicas: 100
patch:
metadata:
labels:
deploy/zone: zone-a
- name: subset-b
requiredNodeSelectorTerm:
matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- zone-b
maxReplicas: 100
patch:
metadata:
labels:
deploy/zone: zone-b
```
2. Create a corresponding workload with 200 replicas, or perform a rolling update on an existing workload.
3. If the zone spread needs to be changed, first adjust the `maxReplicas` of the `subsets`, and then change the `replicas` of the workload.

{
"docs": [
{
"type": "category",
"label": "Getting Started",
"collapsed": false,
"items": [
"introduction",
"installation"
]
},
{
"type": "category",
"label": "Core Concepts",
"collapsed": false,
"items": [
"core-concepts/architecture",
"core-concepts/inplace-update"
]
},
{
"type": "category",
"label": "User Manuals",
"collapsed": true,
"items": [
{
"Typical Workloads": [
"user-manuals/cloneset",
"user-manuals/advancedstatefulset",
"user-manuals/advanceddaemonset"
],
"Job Workloads": [
"user-manuals/broadcastjob",
"user-manuals/advancedcronjob"
],
"Sidecar container Management": [
"user-manuals/sidecarset"
],
"Multi-domain Management": [
"user-manuals/workloadspread",
"user-manuals/uniteddeployment"
],
"Enhanced Operations": [
"user-manuals/containerrecreaterequest",
"user-manuals/imagepulljob",
"user-manuals/containerlaunchpriority",
"user-manuals/resourcedistribution"
],
"Application Protection": [
"user-manuals/deletionprotection",
"user-manuals/podunavailablebudget"
]
}
]
},
{
"type": "category",
"label": "Best Practices",
"collapsed": true,
"items": [
"best-practices/hpa-configuration"
]
},
{
"type": "category",
"label": "Developer Manuals",
"collapsed": true,
"items": [
"developer-manuals/go-client",
"developer-manuals/java-client",
"developer-manuals/other-languages"
]
},
{
"type": "category",
"label": "Reference",
"collapsed": true,
"items": [
{
"CLI tools": [
"cli-tool/kubectl-plugin"
]
}
]
},
{
"type": "doc",
"id": "faq"
}
]
}

[
"v1.0",
"v0.10"
]
