Add blogs for releases 0.9.0 and 0.10.0 (#15)
Signed-off-by: FillZpp <FillZpp.pub@gmail.com>
parent e34508f5ad
commit b2b58908a8

@@ -0,0 +1,211 @@
---
slug: openkruise-0.9.0
title: OpenKruise 0.9.0, Supports Pod Restart and Deletion Protection
authors: [FillZpp]
tags: [release]
---

On May 20, 2021, OpenKruise released the latest version v0.9.0, with new features such as Pod container restart and resource cascading deletion protection. This article provides an overview of the new version.

## Pod Restart and Recreation

Restarting containers is a basic need in daily operation and a common technical means of recovery. However, native Kubernetes offers no operation at container granularity: a Pod, as the minimum operation unit, can only be created or deleted.

Some may ask: *why do users still need to pay attention to an operation such as container restart in the cloud-native era? Aren't the services themselves the only thing for users to focus on in the ideal Serverless model?*

To answer this question, we need to see the differences between cloud-native architecture and traditional infrastructures. In the era of physical and virtual machines, multiple application instances are deployed and run on one machine, but the lifecycles of the machine and the applications are separated. Thus, restarting an application instance may only require a `systemctl` or `supervisor` command, not a restart of the entire machine. However, in the era of containers and cloud-native, the lifecycle of the application is bound to that of the Pod container. In other words, under normal circumstances, one container runs only one application process, and one Pod provides services for only one application instance.

Due to these restrictions, native Kubernetes currently provides no API for container (application) restart for upper-layer services. OpenKruise v0.9.0 supports restarting containers in a single Pod, compatible with standard Kubernetes clusters of version 1.16 or later. After installing or upgrading OpenKruise, users only need to create a `ContainerRecreateRequest` (CRR) object to initiate a restart process. The simplest YAML file is listed below:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
metadata:
  namespace: pod-namespace
  name: xxx
spec:
  podName: pod-name
  containers:
  - name: app
  - name: sidecar
```

The value of `namespace` must be the same as the namespace of the Pod to be operated on, and `name` can be set as needed. In the spec, `podName` is the Pod name, and `containers` is a list specifying one or more container names in that Pod to restart.

In addition to the required fields above, CRR also provides a variety of optional restart policies:

```yaml
spec:
  # ...
  strategy:
    failurePolicy: Fail
    orderedRecreate: false
    terminationGracePeriodSeconds: 30
    unreadyGracePeriodSeconds: 3
    minStartedSeconds: 10
    activeDeadlineSeconds: 300
    ttlSecondsAfterFinished: 1800
```

- `failurePolicy`: Fail or Ignore. Default: Fail, which means the CRR ends immediately once any container fails to stop or recreate.
- `orderedRecreate`: Default: false. true means that, when the list contains multiple containers, the next container is only recreated after the previous recreation has finished.
- `terminationGracePeriodSeconds`: The time to wait for the container to exit gracefully. If not specified, the time defined for the Pod is used.
- `unreadyGracePeriodSeconds`: Set the Pod to the not-ready state before recreation and wait for this period to expire before executing the recreation.
  - Note: this feature requires the `KruisePodReadinessGate` feature-gate to be enabled, which injects a readinessGate into every Pod at creation time. Otherwise, only Pods created by OpenKruise workloads are injected with a readinessGate by default, which means only those Pods can use the `unreadyGracePeriodSeconds` parameter during CRR recreation.
- `minStartedSeconds`: The minimum period the new container must keep running before the recreation is judged successful.
- `activeDeadlineSeconds`: If CRR execution exceeds this duration, it is marked as ended (unfinished containers are marked as failed).
- `ttlSecondsAfterFinished`: The period after which the CRR is deleted automatically once execution has ended.

**How it works under the hood:** after a CRR is created, it is processed by the kruise-manager and then handed to the kruise-daemon on the node where the Pod resides for execution. The execution process is listed below:

1. If `preStop` is specified for the Pod, the kruise-daemon will first call the CRI to exec into the container and run the command specified by `preStop`.
2. If there is no `preStop` or the `preStop` execution has completed, the kruise-daemon will call the CRI to stop the container.
3. When the kubelet detects the container exiting, it creates a new container with an increased "serial number" and starts it (`postStart` is executed at this point).
4. When the kruise-daemon detects that the new container has started, it reports to the CRR that the restart is completed.



The container "serial number" corresponds to the `restartCount` reported by the kubelet in the Pod status. Therefore, the `restartCount` of the Pod increases after the container is restarted. Temporary files written to the `rootfs` of the old container are lost due to the recreation, but data in the volume mounts remains.

## Cascading Deletion Protection

The level-triggered automation of Kubernetes is a double-edged sword. It brings declarative deployment capabilities to applications while potentially enlarging the impact of mistakes to a final-state scale. For example, with the cascading deletion mechanism, once an owning resource is deleted under normal circumstances (non-orphan deletion), all owned resources associated with it are deleted by the following rules:

1. If a CRD is deleted, all of its corresponding CRs are cleared.
2. If a namespace is deleted, all resources in this namespace, including Pods, are cleared.
3. If a workload (Deployment, StatefulSet, etc.) is deleted, all Pods under it are cleared.

We have heard many complaints from Kubernetes users and developers in the community about failures caused by cascading deletion. Mistakenly deleting objects at such a scale in the production environment is unbearable for any enterprise.

Therefore, in OpenKruise v0.9.0, we brought the cascading deletion protection feature to the community in the hope of ensuring stability for more users. If you want to use this feature in the current version, the `ResourcesDeletionProtection` feature-gate needs to be explicitly enabled when installing or upgrading OpenKruise.
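
For reference, a minimal sketch of how such a feature-gate may be switched on at installation time, assuming the `featureGates` value exposed by the openkruise/kruise Helm chart (the value name and chart layout are assumptions here; check the installation docs for your version):

```yaml
# Helm values sketch (assumed value name) for installing/upgrading Kruise.
# Multiple gates can be combined, comma-separated.
featureGates: "ResourcesDeletionProtection=true"
```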

A label of `policy.kruise.io/delete-protection` can be added to the resource objects that require protection. Its value can be one of the following:

- **Always**: The object cannot be deleted unless the label is removed.
- **Cascading**: The object cannot be deleted if it still has active subordinate resources.
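
For example, a minimal sketch of a namespace that may only be deleted once it no longer contains active Pods (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: important-ns   # illustrative name
  labels:
    policy.kruise.io/delete-protection: Cascading
```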

The following table lists the supported resource types and cascading relationships:

| Kind                        | Group                 | Version            | **Cascading** judgement                          |
| --------------------------- | --------------------- | ------------------ | ------------------------------------------------ |
| `Namespace`                 | core                  | v1                 | whether there are active Pods in this namespace  |
| `CustomResourceDefinition`  | apiextensions.k8s.io  | v1beta1, v1        | whether there are existing CRs of this CRD       |
| `Deployment`                | apps                  | v1                 | whether replicas is 0                            |
| `StatefulSet`               | apps                  | v1                 | whether replicas is 0                            |
| `ReplicaSet`                | apps                  | v1                 | whether replicas is 0                            |
| `CloneSet`                  | apps.kruise.io        | v1alpha1           | whether replicas is 0                            |
| `StatefulSet`               | apps.kruise.io        | v1alpha1, v1beta1  | whether replicas is 0                            |
| `UnitedDeployment`          | apps.kruise.io        | v1alpha1           | whether replicas is 0                            |

## New Features of CloneSet

### Deletion Priority

The `controller.kubernetes.io/pod-deletion-cost` annotation was added to Kubernetes in version 1.21. `ReplicaSet` sorts its Pods by this cost value when scaling in. CloneSet has supported the same feature since OpenKruise v0.9.0.

Users can set this annotation on a Pod. Its value is an int that indicates the deletion cost of this Pod compared to other Pods under the same CloneSet; Pods with a lower cost have a higher deletion priority. If the annotation is not set, the deletion cost of the Pod defaults to 0.
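
For example, a minimal sketch of a Pod that should be kept longer than its siblings during scale in (the Pod name is illustrative, and note that annotation values must be strings):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cloneset-demo-abcde   # illustrative Pod managed by a CloneSet
  annotations:
    # a higher cost makes this Pod more "expensive" to delete, so it is scaled in later
    controller.kubernetes.io/pod-deletion-cost: "1000"
```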

*Note*: This deletion order is not determined solely by the deletion cost. The real order is roughly as follows:

1. Not scheduled < scheduled
2. PodPending < PodUnknown < PodRunning
3. Not ready < ready
4. **Smaller pod-deletion cost < larger pod-deletion cost**
5. Period in the Ready state: short < long
6. Container restarts: more times < fewer times
7. Creation time: short < long

### Image Pre-Download for In-Place Update

When CloneSet is used for the in-place update of an application, only the container image is updated while the Pod is not rebuilt, which guarantees that the node where the Pod is located will not change. Therefore, if CloneSet pulls the new image on all the Pod nodes in advance, the Pod in-place update speed improves substantially in the subsequent release batches.

If you want to use this feature in the current version, the `PreDownloadImageForInPlaceUpdate` feature-gate needs to be explicitly enabled when installing or upgrading OpenKruise. If you then update the image in the CloneSet template and the update strategy supports in-place update, CloneSet will automatically create an `ImagePullJob` object (the batch image pre-download feature provided by OpenKruise) to pull the new image in advance on the nodes where the Pods are located.

By default, CloneSet sets the parallelism of the `ImagePullJob` to 1, which means the image is pulled on one node after another. To adjust it, you can set the parallelism via a CloneSet annotation:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  annotations:
    apps.kruise.io/image-predownload-parallelism: "5"
```

### Pod Replacement by Scale Out and Scale In

In previous versions, the `maxUnavailable` and `maxSurge` policies of CloneSet only took effect during the application release process. Since OpenKruise v0.9.0, these two policies also apply when a specified Pod is deleted.

When the user specifies one or more Pods to be deleted through `podsToDelete` or by marking a Pod with `apps.kruise.io/specified-delete: true`, CloneSet will only execute the deletion when the number of unavailable Pods (out of the total replicas) is less than `maxUnavailable`. In addition, if the user has configured the `maxSurge` policy, CloneSet will possibly create a new Pod first, wait for the new Pod to become ready, and then delete the specified old Pod.
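
A minimal sketch of the `podsToDelete` approach, assuming the `scaleStrategy.podsToDelete` field described in the CloneSet documentation (names are illustrative):

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: cloneset-demo           # illustrative
spec:
  replicas: 5
  scaleStrategy:
    podsToDelete:
    - cloneset-demo-abcde       # the Pod to be replaced, respecting maxUnavailable/maxSurge
  # ...
```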

The replacement method depends on the value of `maxUnavailable` and the actual number of unavailable Pods at that moment. For example:

- For a CloneSet with `maxUnavailable=2, maxSurge=1` where only `pod-a` is unavailable, if you specify `pod-b` to be deleted, CloneSet deletes it promptly and creates a new Pod.
- For a CloneSet with `maxUnavailable=1, maxSurge=1` where only `pod-a` is unavailable, if you specify `pod-b` to be deleted, CloneSet first creates a new Pod, waits for it to become ready, and then deletes `pod-b`.
- For a CloneSet with `maxUnavailable=1, maxSurge=1` where only `pod-a` is unavailable, if you specify this `pod-a` itself to be deleted, CloneSet deletes it promptly and creates a new Pod.

### Efficient Rollback Based on Partition Final State

Among the native workloads, Deployment does not support phased release, while StatefulSet provides partition semantics to allow users to control the number of Pods upgraded in each canary step. OpenKruise workloads, such as CloneSet and Advanced StatefulSet, also provide partitions to support phased release.

For CloneSet, the semantics of partition is **the number or percentage of Pods remaining on the old version**. For example, for a CloneSet with 100 replicas, if the partition value is changed step by step in the sequence 80 :arrow_right: 60 :arrow_right: 40 :arrow_right: 20 :arrow_right: 0 during an image upgrade, the CloneSet is released in five batches.
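
For reference, a sketch of where partition sits in the CloneSet spec (the replica count and strategy type here are illustrative):

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
spec:
  replicas: 100
  updateStrategy:
    type: InPlaceIfPossible
    partition: 80   # keep 80 Pods on the old revision; lower it batch by batch (80 -> 60 -> ... -> 0)
  # ...
```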

However, in the past, whether for Deployment, StatefulSet, or CloneSet, if a rollback was required during the release process, the template information (image) had to be changed back to the old version. During the phased release of StatefulSet and CloneSet, decreasing the partition value triggers the upgrade to the new version, but increasing the partition value does not trigger a rollback to the old version.

Since v0.9.0, the partition of CloneSet supports a "final-state rollback" function. If the `CloneSetPartitionRollback` feature-gate is enabled when installing or upgrading OpenKruise, increasing the partition value triggers CloneSet to roll the corresponding number of new-version Pods back to the old version.

The advantage is clear: during a phased release, only the partition value needs to be adjusted to flexibly control the numbers of old and new versions. Note, however, that the "old and new versions" for CloneSet correspond to `updateRevision` and `currentRevision` in its status:

- updateRevision: The version of the template currently defined by the CloneSet.
- currentRevision: The template version of the CloneSet at the **previous successful full release**.

### Short Hash

By default, the value of the `controller-revision-hash` label that CloneSet sets on Pods is the full name of the `ControllerRevision`. For example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    controller-revision-hash: demo-cloneset-956df7994
```

The value is the CloneSet name concatenated with the hash of the `ControllerRevision`. The hash is usually 8 to 10 characters long, and a label value in Kubernetes cannot exceed 63 characters. Therefore, the name of a CloneSet must not exceed 52 characters, or its Pods cannot be created.

In v0.9.0, the new feature-gate `CloneSetShortHash` is introduced. If it is enabled, CloneSet sets the `controller-revision-hash` value on Pods to the hash only, such as 956df7994, so the length restriction on the CloneSet name is eliminated. (Even with this feature enabled, CloneSet still recognizes and manages existing Pods whose revision label is in the full format.)
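
With the gate enabled, the label from the example above would simply become:

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    controller-revision-hash: 956df7994
```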

## New Features of SidecarSet

### Sidecar Hot Upgrade Function

SidecarSet is a workload provided by OpenKruise to manage sidecar containers separately. Users can inject and upgrade specified sidecar containers within a certain range of Pods using SidecarSet.

By default, the standalone in-place upgrade of a sidecar stops the container of the old version first and then creates a container of the new version. This method is suitable for sidecar containers that do not affect Pod service availability, such as a log collection agent. However, for sidecar containers that act as a proxy, such as Istio Envoy, this upgrade method is problematic. Envoy, as a proxy container in the Pod, handles all of its traffic, so restarting it directly for the upgrade affects service availability, and upgrading the envoy sidecar independently would otherwise require a complex graceful-termination and coordination mechanism. Therefore, we offer a new solution for upgrading this kind of sidecar container, namely hot upgrade:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
spec:
  # ...
  containers:
  - name: nginx-sidecar
    image: nginx:1.18
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/bash
          - -c
          - /usr/local/bin/nginx-agent migrate
    upgradeStrategy:
      upgradeType: HotUpgrade
      hotUpgradeEmptyImage: empty:1.0.0
```

- `upgradeType`: `HotUpgrade` indicates that this sidecar container uses the hot upgrade solution.
- `hotUpgradeEmptyImage`: When hot-upgrading the sidecar container, an empty container is required to switch traffic during the upgrade. The empty container has almost the same configuration as the sidecar container (for example, command, lifecycle, and probe) except for the image address, and it does no actual work.
- `lifecycle.postStart`: State migration. This hook completes the state migration during the hot upgrade. The script needs to be implemented according to the characteristics of the business; for example, an NGINX hot upgrade requires sharing the listen FDs and draining traffic (reload).

## More

For more changes, please refer to the [release page](https://github.com/openkruise/kruise/releases) or [ChangeLog](https://github.com/openkruise/kruise/blob/master/CHANGELOG.md).

@@ -0,0 +1,145 @@
---
slug: openkruise-0.10.0
title: OpenKruise 0.10.0, New Features of Multi-Domain Management and Application Protection
authors: [FillZpp]
tags: [release]
---

On Sep 6th, 2021, OpenKruise released the latest version v0.10.0, with new features such as WorkloadSpread and PodUnavailableBudget. This article provides an overview of the new version.

## WorkloadSpread

WorkloadSpread can distribute the Pods of a workload across different types of Nodes according to given policies, which gives a single workload the ability to do multi-domain and elastic deployment.

Some common policies include:
- fault-tolerant spreading (for example, spread evenly among hosts, AZs, etc.)
- spreading by specified ratios (for example, deploy Pods to several specified AZs proportionally)
- subset management with priority, such as
  - deploy Pods to ecs first, and then to eci when its resources are insufficient
  - deploy a fixed number of Pods to ecs first, and the remaining Pods to eci
- subset management with customization, such as
  - control how many Pods of a workload are deployed on each CPU architecture
  - let Pods on different CPU architectures have different resource requirements

WorkloadSpread is similar to UnitedDeployment in the OpenKruise community. Each WorkloadSpread defines multiple domains called `subsets`, and each subset may limit the number of replicas it runs via `maxReplicas`. WorkloadSpread injects the domain configuration into Pods by webhook, and it also controls the order of scale-in and scale-out.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: workloadspread-demo
spec:
  targetRef:
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet
    name: workload-xxx
  subsets:
  - name: subset-a
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-a
    maxReplicas: 10 | 30%
  - name: subset-b
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-b
```

A WorkloadSpread is related to a workload via `targetRef`. When a Pod is created by that workload, Kruise injects the corresponding topology policies into it according to the rules in the WorkloadSpread.

Note that WorkloadSpread uses [Pod Deletion Cost](https://kubernetes.io/docs/reference/labels-annotations-taints/#pod-deletion-cost) to control the priority of scale-down. So:

- If the workload type is CloneSet, it already supports the feature.
- If the workload type is Deployment or ReplicaSet, it requires your Kubernetes version >= 1.22.

You also have to enable the `WorkloadSpread` feature-gate when you install or upgrade Kruise.

## PodUnavailableBudget

Kubernetes offers [Pod Disruption Budget (PDB)](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) to help you run highly available applications even when you introduce frequent [voluntary disruptions](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions. However, it can only constrain voluntary disruptions triggered through the [Eviction API](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#eviction-api). For example, when you run `kubectl drain`, the tool tries to evict all of the Pods on the Node you are taking out of service.

In the following voluntary disruption scenarios, business interruption or SLA degradation can still occur:

1. The application owner updates the Deployment's pod template for a normal release, while the cluster administrator drains nodes to scale the cluster down (learn about [Cluster Autoscaling](https://github.com/kubernetes/autoscaler/#readme)).
2. The middleware team uses SidecarSet to roll out an upgrade of the sidecar containers in the cluster (e.g. the ServiceMesh envoy), while HPA triggers scale-down of the business applications.
3. The application owner and the middleware team release the same set of Pods at the same time, using the in-place upgrades of OpenKruise CloneSet and SidecarSet.

In these voluntary disruption scenarios, PodUnavailableBudget can prevent application disruption or SLA degradation, which greatly improves the availability of application services.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: PodUnavailableBudget
metadata:
  name: web-server-pub
  namespace: web
spec:
  targetRef:
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet | StatefulSet | ...
    name: web-server
  # selector is a label query over the Pods managed by this budget.
  # selector and targetRef are mutually exclusive; targetRef takes priority.
  # selector is commonly used when an application is deployed with multiple workloads,
  # while targetRef is used to protect a single workload.
  # selector:
  #   matchLabels:
  #     app: web-server
  # Maximum number of unavailable Pods for the target workload; in this example cloneset.replicas(5) * 60% = 3.
  # maxUnavailable and minAvailable are mutually exclusive; maxUnavailable takes priority.
  maxUnavailable: 60%
  # Minimum number of available Pods for the target workload; in this example cloneset.replicas(5) * 40% = 2.
  # minAvailable: 40%
```

You have to enable the corresponding feature-gates when installing or upgrading Kruise:

- PodUnavailableBudgetDeleteGate: protects against Pod deletion and eviction.
- PodUnavailableBudgetUpdateGate: protects against Pod update operations, such as in-place update.

## CloneSet Supports Scale-Down Priority by Spread Constraints

When the `replicas` of a CloneSet is decreased, it uses a fixed ordering to choose which Pods to delete:

1. Node unassigned < assigned
2. PodPending < PodUnknown < PodRunning
3. Not ready < ready
4. **Lower pod-deletion cost < higher pod-deletion cost**
5. **Higher spread rank < lower spread rank**
6. Been ready for: empty time < less time < more time
7. Pods whose containers have higher restart counts < lower restart counts
8. Empty creation time < newer Pods < older Pods

Rule 4 has been provided since Kruise v0.9.0, and it is also used by WorkloadSpread to control Pod deletion. **Rule 5 is added in Kruise v0.10.0 to sort Pods by their topology spread constraints during scale-down.**

## Advanced StatefulSet Supports Scale-Up with Rate Limit

To avoid creating a large number of failed Pods after a user applies an incorrect Advanced StatefulSet, Kruise adds a `maxUnavailable` field to its `scaleStrategy`:

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 100
  scaleStrategy:
    maxUnavailable: 10% # percentage or absolute number
```

When this field is set, Advanced StatefulSet guarantees that the number of unavailable Pods is never bigger than this value during Pod creation.

Note that the feature can only be used in StatefulSets with `podManagementPolicy=Parallel`.

## More

For more changes, please refer to the [release page](https://github.com/openkruise/kruise/releases) or [ChangeLog](https://github.com/openkruise/kruise/blob/master/CHANGELOG.md).

@@ -0,0 +1,212 @@
---
slug: openkruise-0.9.0
title: "OpenKruise 0.9.0: Pod Container Restart, Resource Deletion Protection, and More"
authors: [FillZpp]
tags: [release]
---

OpenKruise released the latest version v0.9.0 on May 20, 2021, adding heavyweight features such as Pod container restart and resource cascading deletion protection. This article gives an overall overview of the new version.

## Pod Container Restart/Recreation

"Restart" is a very plain requirement: it is an everyday operational need and one of the most common recovery measures in our field. In native Kubernetes, however, there is no capability to operate at container granularity; a Pod, as the minimum operation unit, can only be created or deleted.

Some may ask: in the cloud-native era, why do users still need to care about an operation like container restart? In an ideal Serverless model, shouldn't the business only need to care about its own service?

This comes from the difference between cloud-native architecture and traditional infrastructure. In the era of physical and virtual machines, multiple application instances are usually deployed and run on one machine, and the lifecycles of the machine and the applications are separate; in that situation, restarting an application instance may only take a command such as `systemctl` or `supervisor`, without restarting the whole machine. In the container and cloud-native era, however, the lifecycle of the application is bound to the Pod container; under normal circumstances, one container runs only one application process and one Pod serves only one application instance.

Because of these restrictions, native Kubernetes currently provides no API for upper-layer business to restart containers (applications). Kruise v0.9.0 provides a container restart capability at single-Pod granularity, compatible with standard Kubernetes clusters of version 1.16 or later. After installing or upgrading Kruise, users only need to create a ContainerRecreateRequest (CRR for short) object to specify a restart. The simplest YAML is as follows:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: ContainerRecreateRequest
metadata:
  namespace: pod-namespace
  name: xxx
spec:
  podName: pod-name
  containers:
  - name: app
  - name: sidecar
```

Here, `namespace` must be the same namespace as the Pod to be operated on, and `name` can be chosen freely. In the spec, `podName` is the Pod name, and the `containers` list can specify one or more container names in the Pod to restart.

Besides the required fields above, CRR also provides a variety of optional restart policies:

```yaml
spec:
  # ...
  strategy:
    failurePolicy: Fail
    orderedRecreate: false
    terminationGracePeriodSeconds: 30
    unreadyGracePeriodSeconds: 3
    minStartedSeconds: 10
    activeDeadlineSeconds: 300
    ttlSecondsAfterFinished: 1800
```

- `failurePolicy`: Fail or Ignore, defaults to Fail; Fail means the CRR ends immediately once any container fails to stop or recreate.
- `orderedRecreate`: defaults to false; true means that when the list contains multiple containers, the next container is only recreated after the previous one has finished.
- `terminationGracePeriodSeconds`: the time to wait for the container to exit gracefully; if unset, the time defined on the Pod is used.
- `unreadyGracePeriodSeconds`: set the Pod to not-ready before recreation, and wait for this period before starting the recreation.
  - Note: this depends on the `KruisePodReadinessGate` feature-gate being enabled, which injects a readinessGate into every Pod at creation. Otherwise, by default only Pods created by Kruise workloads get the readinessGate injected, which means only those Pods can use `unreadyGracePeriodSeconds` during CRR recreation.
- `minStartedSeconds`: the new container must keep running for at least this period after recreation before the recreation is considered successful.
- `activeDeadlineSeconds`: if the CRR execution exceeds this duration, it is directly marked as ended (unfinished containers are marked as failed).
- `ttlSecondsAfterFinished`: after the CRR ends, it is deleted automatically once this period has passed.

How it works: after a user creates a CRR, it is first processed centrally by the kruise-manager, and is then received and executed by the kruise-daemon on the node where the Pod resides. The execution process is as follows:

1. If the Pod container defines a `preStop` hook, the kruise-daemon first uses the CRI runtime to exec into the container and run `preStop`.
2. If there is no `preStop` or it has finished, the kruise-daemon calls the CRI interface to stop the container.
3. When the kubelet detects the container exiting, it creates a new container with an increased "serial number" and starts it (and runs `postStart`).
4. When the kruise-daemon detects that the new container has started successfully, it reports to the CRR that the restart is completed.



The container "serial number" corresponds to the `restartCount` reported by the kubelet in the Pod status, so you will see the Pod's `restartCount` increase after the container restarts. Also, because the container is recreated, files temporarily written into the old container's rootfs are lost, while data in volume mounts remains.

## Cascading Deletion Protection

The level-triggered automation of Kubernetes is a double-edged sword: it brings declarative deployment to applications, but it can also amplify mistakes to a final-state scale. For example, with its "cascading deletion" mechanism, once an owning resource is deleted under normal (non-orphan) circumstances, all of its owned resources are deleted along with it:

1. Deleting a CRD clears all of its corresponding CRs.
2. Deleting a namespace deletes all resources in it, including Pods.
3. Deleting a workload (Deployment/StatefulSet/...) deletes all Pods under it.

We have heard plenty of complaints from community Kubernetes users and developers about failures caused by this kind of cascading deletion. For any enterprise, a mistaken deletion at such a scale in production is an unbearable pain.

Therefore, in Kruise v0.9.0 we built the cascading deletion protection capability, hoping to bring stability guarantees to more users. To use this feature in the current version, the `ResourcesDeletionProtection` feature-gate must be explicitly enabled when installing or upgrading Kruise.

For resource objects that need deletion protection, users can add the `policy.kruise.io/delete-protection` label with one of two values:

- Always: the object may not be deleted unless the label is removed.
- Cascading: the object may not be deleted if it still has available subordinate resources.

The currently supported resource types and their cascading relationships are as follows:

| Kind                        | Group                 | Version            | **Cascading** judgement                          |
| --------------------------- | --------------------- | ------------------ | ------------------------------------------------ |
| `Namespace`                 | core                  | v1                 | whether there are active Pods in this namespace  |
| `CustomResourceDefinition`  | apiextensions.k8s.io  | v1beta1, v1        | whether there are existing CRs of this CRD       |
| `Deployment`                | apps                  | v1                 | whether replicas is 0                            |
| `StatefulSet`               | apps                  | v1                 | whether replicas is 0                            |
| `ReplicaSet`                | apps                  | v1                 | whether replicas is 0                            |
| `CloneSet`                  | apps.kruise.io        | v1alpha1           | whether replicas is 0                            |
| `StatefulSet`               | apps.kruise.io        | v1alpha1, v1beta1  | whether replicas is 0                            |
| `UnitedDeployment`          | apps.kruise.io        | v1alpha1           | whether replicas is 0                            |

## New Features of CloneSet

### Deletion Priority

`controller.kubernetes.io/pod-deletion-cost` is an annotation added to Kubernetes in version 1.21; ReplicaSet refers to this cost value to sort Pods when scaling in. CloneSet also supports this feature since Kruise v0.9.0.

Users can set this annotation on a Pod. Its value is an int indicating the "deletion cost" of this Pod relative to other Pods under the same CloneSet; Pods with a smaller cost have a higher deletion priority. Pods without this annotation default to a deletion cost of 0.

Note that this deletion order is not strictly guaranteed, because real Pod deletion follows an order similar to:

1. Not scheduled < scheduled
2. PodPending < PodUnknown < PodRunning
3. Not ready < ready
4. **Smaller pod-deletion cost < larger pod-deletion cost**
5. Shorter time in the Ready state < longer
6. More container restarts < fewer
7. Shorter creation time < longer

### Image Pre-Download for In-Place Update

When CloneSet is used for in-place application upgrade, only the container image is upgraded and the Pod is not rebuilt. This guarantees that the node where the Pod is located does not change before and after the upgrade. Therefore, if CloneSet pulls the new-version image on all Pod nodes in advance, the Pod in-place upgrade speed in subsequent release batches improves substantially.

To use this feature in the current version, the `PreDownloadImageForInPlaceUpdate` feature-gate must be explicitly enabled when installing or upgrading Kruise. After that, when the user updates the image in the CloneSet template and the release strategy supports in-place upgrade, CloneSet automatically creates an ImagePullJob object (the batch image pre-download feature provided by OpenKruise) for the new image, to warm it up in advance on the nodes where the Pods are located.

By default, CloneSet sets the parallelism of the ImagePullJob to 1, which means images are pulled node by node. To adjust it, you can set the pre-download parallelism in the CloneSet annotations:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  annotations:
    apps.kruise.io/image-predownload-parallelism: "5"
```

### Pod Replacement by Scale-Out Before Scale-In

In previous versions, the `maxUnavailable` and `maxSurge` strategies of CloneSet only took effect during the application release process. Starting with Kruise v0.9.0, these two strategies also take effect for specified Pod deletion.

That is, when the user specifies one or more Pods to be deleted through `podsToDelete` or `apps.kruise.io/specified-delete: true` (see the official documentation for details), CloneSet only executes the deletion when the current number of unavailable Pods (relative to the total replicas) is smaller than `maxUnavailable`. Also, if the user has configured the `maxSurge` strategy, CloneSet may first create a new Pod, wait for it to become ready, and then delete the specified old Pod.

Which replacement method is used depends on the `maxUnavailable` value and the actual number of unavailable Pods at that time. For example:

- For a CloneSet with `maxUnavailable=2, maxSurge=1` and one unavailable `pod-a`, if you specify another `pod-b` for deletion, CloneSet deletes it immediately and then creates a new Pod.
- For a CloneSet with `maxUnavailable=1, maxSurge=1` and one unavailable `pod-a`, if you specify another `pod-b` for deletion, CloneSet first creates a new Pod, waits for it to become ready, and finally deletes `pod-b`.
- For a CloneSet with `maxUnavailable=1, maxSurge=1` and one unavailable `pod-a`, if you specify this `pod-a` itself for deletion, CloneSet deletes it immediately and then creates a new Pod.
- ...

### Efficient Rollback Based on the Partition Final State

Among native workloads, Deployment does not support phased release itself, while StatefulSet has partition semantics allowing users to control the amount of canary upgrades; Kruise workloads such as CloneSet and Advanced StatefulSet also provide partition to support phased release.

For CloneSet, the semantics of partition is **the number or percentage of Pods kept on the old version**. For example, for a CloneSet with 100 replicas, changing the partition value step by step through 80 -> 60 -> 40 -> 20 -> 0 during an image upgrade completes the release in five batches.

But in the past, whether for Deployment, StatefulSet, or CloneSet, if a rollback was needed during a release, the template information (image) had to be changed back to the old version. For the latter two, during a canary release, decreasing partition triggers upgrading old-version Pods to the new version, but increasing partition again does nothing.

Starting with v0.9.0, CloneSet's partition supports a "final-state rollback" function. If the `CloneSetPartitionRollback` feature-gate is enabled when installing or upgrading Kruise, then when the user increases the partition value, CloneSet rolls the corresponding number of new-version Pods back to the old version.

The benefit is obvious: during a canary release, you only need to adjust the partition value back and forth to flexibly control the numbers of old and new versions. Note, however, that the "old and new versions" CloneSet relies on correspond to `updateRevision` and `currentRevision` in its status:

- updateRevision: the template version currently defined by the CloneSet.
- currentRevision: the template version of this CloneSet at the previous successful full release.

### Short Hash

By default, the `controller-revision-hash` value that CloneSet sets in the Pod labels is the full name of the `ControllerRevision`, for example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    controller-revision-hash: demo-cloneset-956df7994
```

It is the concatenation of the CloneSet name and the ControllerRevision hash. The hash is usually 8 to 10 characters long, while a label value in Kubernetes cannot exceed 63 characters. Therefore the CloneSet name generally cannot exceed 52 characters; if it does, the Pods cannot be created.

The v0.9.0 release introduces the new `CloneSetShortHash` feature-gate. If it is enabled, CloneSet sets the `controller-revision-hash` value in Pods to the hash only, such as 956df7994, so there is no longer any length restriction on the CloneSet name. (Even with this feature enabled, CloneSet still recognizes and manages existing Pods whose revision label uses the full format.)

## New Features of SidecarSet

### Sidecar Hot Upgrade

SidecarSet is the workload Kruise provides to manage sidecar containers independently. Users can use a SidecarSet to inject and upgrade specified sidecar containers within a certain range of Pods.

By default, the standalone in-place upgrade of a sidecar stops the old-version container first and then creates the new-version container. This approach is suitable for sidecar containers that do not affect Pod service availability, such as a log collection agent. But for many proxy or runtime sidecar containers, such as Istio Envoy, this upgrade approach is problematic. Envoy, as a proxy container in the Pod, handles all of the traffic, and restarting it directly for the upgrade affects the availability of the Pod's service; upgrading the envoy sidecar independently would otherwise require a complex graceful-termination and coordination mechanism. So we provide a new solution for upgrading this kind of sidecar container, namely hot upgrade.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
spec:
  # ...
  containers:
  - name: nginx-sidecar
    image: nginx:1.18
    lifecycle:
      postStart:
        exec:
          command:
          - /bin/bash
          - -c
          - /usr/local/bin/nginx-agent migrate
    upgradeStrategy:
      upgradeType: HotUpgrade
      hotUpgradeEmptyImage: empty:1.0.0
```

- `upgradeType`: HotUpgrade means this sidecar container uses the hot upgrade solution.
- `hotUpgradeEmptyImage`: when hot-upgrading the sidecar container, the business must provide an empty container used to switch traffic during the upgrade. The empty container has the same configuration as the sidecar container (except for the image address), for example command, lifecycle, and probe, but it does no actual work.
- `lifecycle.postStart`: state migration. This hook completes the state migration during the hot upgrade; the script needs to be implemented by the business according to its own characteristics. For example, an NGINX hot upgrade requires sharing the listen FDs and draining the traffic (reload).

## More

For more changes, please refer to the [release page](https://github.com/openkruise/kruise/releases) or the [ChangeLog](https://github.com/openkruise/kruise/blob/master/CHANGELOG.md).

@@ -0,0 +1,147 @@
---
slug: openkruise-0.10.0
title: "OpenKruise 0.10.0: Elastic Topology Management, Application Protection, and More"
authors: [FillZpp]
tags: [release]
---

This article walks you through the new changes in v0.10.0. Major new features such as WorkloadSpread and PodUnavailableBudget will also be covered in dedicated follow-up articles that explain their design and implementation in detail.

## WorkloadSpread: Bypass-Style Elastic Topology Management for Applications

In application deployment and operations, there are all kinds of demands for topology spreading and elasticity. The most common and basic one is spreading horizontally by one or several topology dimensions, for example:

- spread the application by node to avoid stacking (improving fault tolerance)
- spread the application by AZ (availability zone) (improving fault tolerance)

These basic demands can already be met by native Kubernetes capabilities such as pod affinity and [topology spread constraints](https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/). But in real production scenarios there are many more complex partitioning and elasticity requirements. Some practical examples:

- when spreading by zone, specify the deployment ratio across zones, e.g. an application deploys Pods to zones a, b, and c in a 1 : 1 : 2 ratio (for practical reasons such as unbalanced traffic across zones)
- with multiple zones or machine types, scale up into a preferred zone or machine type first and spill over to the next one when its resources are insufficient (and so on); when scaling down, do it in reverse order, removing Pods from the later zones or machine types first
- with a base node pool and an elastic node pool, keep a fixed number or ratio of Pods in the base pool and let everything else expand into the elastic pool

In the past, these cases could generally only be handled by splitting one application into multiple Workloads (such as Deployments). That barely solves the basic problems of different ratios, scaling priorities, resource awareness, and elastic choices across topologies, and it still requires deep customization in the PaaS layer to manage multiple Workloads of one application in a fine-grained way.

To address these problems, Kruise v0.10.0 adds the WorkloadSpread resource. It currently works with the Deployment, ReplicaSet, and CloneSet workload types to manage the partitioning and elastic topology of their Pods. A simplified example:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: WorkloadSpread
metadata:
  name: workloadspread-demo
spec:
  targetRef:
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet
    name: workload-xxx
  subsets:
  - name: subset-a
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-a
    maxReplicas: 10 | 30%
  - name: subset-b
    requiredNodeSelectorTerm:
      matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-b
```

The created WorkloadSpread is associated with a Workload object via targetRef. When that Workload scales out Pods, Kruise injects the corresponding topology rules into the Pods according to the strategy above. This is a bypass style of injection and management: it does not interfere with the Workload's own scaling and release management of its Pods.

Note: WorkloadSpread controls the scale-down priority of Pods through [Pod Deletion Cost](https://kubernetes.io/docs/reference/labels-annotations-taints/#pod-deletion-cost):

- If the Workload type is CloneSet, it already supports this feature and can honor the scale-down priority.
- If the Workload type is Deployment/ReplicaSet, it requires Kubernetes version >= 1.21, and on 1.21 the `PodDeletionCost` feature-gate has to be enabled on the kube-controller-manager.

To use WorkloadSpread, you need to enable the `WorkloadSpread` feature-gate when installing or upgrading Kruise v0.10.0.

## PodUnavailableBudget: Application Availability Protection

In many [voluntary disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/) scenarios, the native Kubernetes [Pod Disruption Budget (PDB)](https://kubernetes.io/docs/tasks/run-application/configure-pdb/) guarantees application availability by limiting the number of Pods that are disrupted at the same time.

But there are still many scenarios in which, even with PDB protection, the business can be interrupted or the service degraded, for example:

- the application owner is rolling out a new version via Deployment while the cluster administrator is draining nodes because machine utilization is too low
- the middleware team is using SidecarSet to upgrade the sidecar version in the cluster in place (e.g. the ServiceMesh envoy) while HPA is scaling the same applications down
- the application owner and the middleware team are upgrading the same batch of Pods at the same time using the in-place upgrade capabilities of CloneSet and SidecarSet

This is easy to understand: PDB can only guard against Pod eviction triggered through the Eviction API (for example, `kubectl drain` evicting all Pods on a node), but it cannot protect against many other operations such as Pod deletion or in-place upgrade.

The PodUnavailableBudget (PUB) feature added in Kruise v0.10.0 is an enhanced extension of the native PDB. It covers PDB's own capabilities and, on top of that, adds protection against more voluntary disruption operations, including but not limited to Pod deletion and in-place upgrade.

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: PodUnavailableBudget
metadata:
  name: web-server-pub
  namespace: web
spec:
  targetRef:
    apiVersion: apps/v1 | apps.kruise.io/v1alpha1
    kind: Deployment | CloneSet | StatefulSet | ...
    name: web-server
  # configure either selector or targetRef, not both
  # selector:
  #   matchLabels:
  #     app: web-server
  # the guaranteed maximum number of unavailable Pods
  maxUnavailable: 60%
  # the guaranteed minimum number of available Pods
  # minAvailable: 40%
```

To use PodUnavailableBudget, you need to enable the feature-gates when installing or upgrading Kruise v0.10.0 (you can enable either one of the two, or both):

- PodUnavailableBudgetDeleteGate: intercepts and protects against operations such as Pod deletion and eviction.
- PodUnavailableBudgetUpdateGate: intercepts and protects against update operations such as Pod in-place upgrade.

## CloneSet Supports Scale-Down by Topology Rules

When a CloneSet scales down (its replicas is decreased), the Pods to delete are chosen by a fixed sorting algorithm:

1. Not scheduled < scheduled
2. PodPending < PodUnknown < PodRunning
3. Not ready < ready
4. **Smaller pod-deletion cost < larger pod-deletion cost**
5. **Higher spread weight < lower**
6. Shorter time in the Ready state < longer
7. More container restarts < fewer
8. Shorter creation time < longer

Rule 4 is a feature provided since Kruise v0.9.0 that lets users specify the deletion order (WorkloadSpread uses it to implement scale-down priority); **rule 5 is the feature provided in the current v0.10.0, i.e. during scale-down Pods are sorted with reference to the application's topology spreading.**

- If the application has configured [topology spread constraints](https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/), CloneSet will choose Pods to delete according to those topology dimensions during scale-down (for example, trying to even out the number of Pods across zones).
- If the application has not configured [topology spread constraints](https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/), by default CloneSet will choose Pods to delete by spreading across nodes (minimizing stacking on the same node).

## Advanced StatefulSet Supports Streaming Scale-Up

To avoid a large number of failed Pods being created after a new Advanced StatefulSet is created, Kruise v0.10.0 introduces a maxUnavailable strategy in the scale strategy:

```yaml
apiVersion: apps.kruise.io/v1beta1
kind: StatefulSet
spec:
  # ...
  replicas: 100
  scaleStrategy:
    maxUnavailable: 10% # percentage or absolute number
```

When this field is set, Advanced StatefulSet guarantees that the number of unavailable Pods never exceeds this limit while creating Pods. For example, the StatefulSet above will initially create only 10 Pods at once; after that, a new Pod is created only whenever an existing Pod becomes running and ready.

Note: this feature can only be used in StatefulSets whose podManagementPolicy is `Parallel`.

## More

For more changes, please refer to the [release page](https://github.com/openkruise/kruise/releases) or the [ChangeLog](https://github.com/openkruise/kruise/blob/master/CHANGELOG.md).

@@ -60,15 +60,20 @@ table td {
}

.markdown {
  font-size: 0.9rem;
  font-size: 0.8rem;
}

.markdown h1:first-child {
  --ifm-h1-font-size: 2.0rem;
}

.markdown > h2 {
  --ifm-h2-font-size: 1.8rem;
  font-size: 1.4rem;
}
.markdown > h3 {
  font-size: 1.2rem;
}
.markdown > h4 {
  font-size: 1.0rem;
}

.theme-doc-sidebar-menu {

@@ -76,8 +81,21 @@
}

.blogPostTitle_node_modules-\@docusaurus-theme-classic-lib-next-theme-BlogPostItem-styles-module:first-child {
  font-size: 2rem;
  font-size: 1.8rem;
}
/* .sidebarItemLink_node_modules-\@docusaurus-theme-classic-lib-next-theme-BlogSidebar-styles-module {
  font-size: 0.7rem;
} */
.sidebarItem_node_modules-\@docusaurus-theme-classic-lib-next-theme-BlogSidebar-styles-module {
  font-size: 0.7rem;
  margin-top: 2.0rem;
}
.blogPostTitle_GeHD:first-child {
  font-size: 2rem;
  font-size: 1.6rem;
}
.table-of-contents__link {
  font-size: 0.7rem;
}
.pagination-nav__label {
  font-size: 0.8rem;
}