Docs: workflow working mechanism (#544)

* Docs: workflow working mechanism
  Signed-off-by: Yin Da <yd219913@alibaba-inc.com>
* Docs: add workflow related backoff time in bootstrap parameter document
  Signed-off-by: Yin Da <yd219913@alibaba-inc.com>
2022-03-22 13:15:21 +08:00 · 2022-03-22 13:15:21 +08:00 · 7594d1fd6f
parent 7aa11e850a
commit 7594d1fd6f
5 changed files with 102 additions and 0 deletions
--- a/docs/platform-engineers/system-operation/bootstrap-parameters.md
+++ b/docs/platform-engineers/system-operation/bootstrap-parameters.md
@ -32,6 +32,9 @@ The introduction of bootstrap parameters in KubeVela controller are listed as be
 |         pprof-addr          | string |                ""                 | The address of pprof, default to be empty to disable pprof                                                                              |
 |        perf-enabled         |  bool  |               false               | Enable performance logging, working with monitoring tools like Loki and Grafana to discover performance bottleneck                      |
 | enable-cluster-gateway | bool | false | Enable multi cluster feature |
+| max-workflow-wait-backoff-time | int | 60 | Maximum backoff interval between workflow retries while a workflow step is waiting (unit: seconds) |
+| max-workflow-failed-backoff-time | int | 300 | Maximum backoff interval between workflow retries after a workflow step fails (unit: seconds) |
+| max-workflow-step-error-retry-times | int | 10 | Maximum number of times the workflow retries a failed workflow step |

 > Other parameters not listed in the table are old parameters used in previous versions, the latest version ( v1.1 ) does not use them.

--- a/docs/platform-engineers/workflow/working-mechanism.md
+++ b/docs/platform-engineers/workflow/working-mechanism.md
@ -0,0 +1,48 @@
+---
+title: Working Mechanism
+---
+
+This document briefly introduces the core mechanisms of KubeVela Workflow.
+
+## Running Mode
+
+A workflow runs in one of two modes: **DAG** mode and **StepByStep** mode. In DAG mode, all steps in the workflow execute concurrently, and KubeVela automatically builds a dependency graph from the Input/Output declared in each step's configuration. A step whose dependencies are not yet met waits until they are. In StepByStep mode, all steps execute in order, one at a time. In KubeVela v1.2+, the default running mode is StepByStep. Explicitly selecting DAG mode is not currently supported.
+
+## Suspend and Retry
+
+A workflow retries steps or suspends for different reasons:
+1. If a step fails or is waiting for a condition, the workflow retries it after a backoff time. The backoff time increases with the number of retries.
+2. If a step fails too many times, the workflow enters the suspending state and stops retrying.
+3. If a step is waiting for manual approval, the workflow enters the suspending state immediately.
+
+### Backoff Time
+
+The backoff time before a retry is calculated as `int(0.05 * 2^(n-1))` seconds, where `n` is the number of retries. The minimum backoff time is 1 second. The first ten backoff times are:
+
+| Retry (n) | 2^(n-1) | 0.05*2^(n-1) | Requeue After (s) |
+|-------|---------|--------------|------------------|
+| 1     | 1       | 0.05         | 1                |
+| 2     | 2       | 0.1          | 1                |
+| 3     | 4       | 0.2          | 1                |
+| 4     | 8       | 0.4          | 1                |
+| 5     | 16      | 0.8          | 1                |
+| 6     | 32      | 1.6          | 1                |
+| 7     | 64      | 3.2          | 3                |
+| 8     | 128     | 6.4          | 6                |
+| 9     | 256     | 12.8         | 12               |
+| 10    | 512     | 25.6         | 25               |
+| ...   | ...     | ...          | ...              |
+
+If the workflow step is waiting, the maximum backoff time is 60s. You can change it by setting `--max-workflow-wait-backoff-time` in the [bootstrap parameters](../system-operation/bootstrap-parameters) of the KubeVela controller.
+
+If the workflow step has failed, the maximum backoff time is 300s. You can change it by setting `--max-workflow-failed-backoff-time` in the [bootstrap parameters](../system-operation/bootstrap-parameters) of the KubeVela controller.
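
The backoff rule described above can be sketched in Python. This is only an illustration under stated assumptions, not KubeVela's actual implementation: the function name `backoff_seconds` and its parameters are hypothetical, and it assumes the 1-second minimum is applied before the configured cap.

```python
def backoff_seconds(n: int, max_backoff: int) -> int:
    """Illustrative backoff, in whole seconds, before the n-th retry (n >= 1).

    Hypothetical helper, not KubeVela code. It truncates 0.05 * 2^(n-1),
    enforces the 1-second minimum, then applies the configured maximum.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # 0.05 * 2^(n-1) equals 2^(n-1) / 20; integer floor division gives the
    # same truncated value as int(...) without float overflow for large n.
    truncated = 2 ** (n - 1) // 20
    return min(max(truncated, 1), max_backoff)


WAIT_CAP = 60     # documented default of --max-workflow-wait-backoff-time
FAILED_CAP = 300  # documented default of --max-workflow-failed-backoff-time

# Backoff times for the first ten retries of a waiting step.
first_ten = [backoff_seconds(n, WAIT_CAP) for n in range(1, 11)]
print(first_ten)
```

Under these assumptions, the first ten values match the table, and a waiting step reaches the 60s cap on its 12th retry, where the uncapped value would be 102s.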
+
+### Maximum Retry Times
+
+When a step fails, the workflow retries it at most 10 times by default and then enters the suspending state. You can change the number of retries by setting `--max-workflow-step-error-retry-times` in the [bootstrap parameters](../system-operation/bootstrap-parameters) of the KubeVela controller.
+
+> Note that if a workflow step is unhealthy, it is marked as waiting rather than failed, and the workflow waits for it to become healthy.
+
+## Avoid Configuration Drift
+
+When a workflow is running, or is suspended while waiting for a condition, the KubeVela application periodically re-applies the resources it has already applied, to prevent configuration drift. This process is called **State Keep** in KubeVela. By default, the State Keep interval is 5 minutes, which can be configured by setting `--application-re-sync-period` in the [bootstrap parameters](../system-operation/bootstrap-parameters) of the KubeVela controller. To disable State Keep, you can use the [apply-once](https://github.com/oam-dev/kubevela/blob/master/docs/examples/app-with-policy/apply-once-policy/apply-once.md) policy in the application.
--- a/i18n/zh/docusaurus-plugin-content-docs/current/platform-engineers/system-operation/bootstrap-parameters.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/platform-engineers/system-operation/bootstrap-parameters.md
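@ -32,6 +32,9 @@ The bootstrap parameters of the KubeVela controller and their descriptions are as follows.
 |          pprof-addr         | string |                 ""                | The listening address for pprof performance profiling; empty by default, which disables profiling |
 |         perf-enabled        |  bool  |               false               | Enable controller performance logging, which works with monitoring components to find the controller's performance bottlenecks |
 | enable-cluster-gateway | bool | false | Enable the multi-cluster feature |
+| max-workflow-wait-backoff-time | int | 60 | Maximum interval between retries while a workflow step is waiting (unit: seconds) |
+| max-workflow-failed-backoff-time | int | 300 | Maximum interval between retries while a workflow step is in the failed state (unit: seconds) |
+| max-workflow-step-error-retry-times | int | 10 | Maximum number of retries while a workflow step is in an error state |

 > Parameters not listed in the table are from older versions and can be ignored in the current version, v1.1.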

--- a/i18n/zh/docusaurus-plugin-content-docs/current/platform-engineers/workflow/working-mechanism.md
+++ b/i18n/zh/docusaurus-plugin-content-docs/current/platform-engineers/workflow/working-mechanism.md
@ -0,0 +1,47 @@
+---
+title: Working Mechanism
+---
+
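+This document briefly introduces some of the core internal mechanisms of KubeVela Workflow.
+
+## Running Mode
+A workflow runs in one of two modes: DAG mode and StepByStep mode. In DAG mode, the steps in the workflow run concurrently and form dependencies according to each step's Input/Output. A step whose preconditions are not yet met stays in the waiting state. In StepByStep mode, the steps run one by one, in order. In KubeVela v1.2+, when a workflow is configured, StepByStep mode is used by default; explicitly running a workflow in DAG mode is not yet supported.
+
+## Suspend and Retry
+
+A workflow retries or suspends for different reasons.
+1. When a workflow step fails or is waiting for a specific condition, the workflow retries after a period of time. The backoff time increases with the number of retries.
+2. If a workflow step fails too many times, the workflow enters the suspending state and stops retrying.
+3. If a workflow step is waiting for manual approval, the workflow enters the suspending state immediately.
+
+### Backoff Time
+
+The workflow's backoff time can be calculated as `int(0.05 * 2^(n-1))`, where `n` is the number of retries. The minimum backoff time is 1 second. The first 10 backoff times are shown in the table below:
+
+| Retries | 2^(n-1) | 0.05*2^(n-1) | Backoff Time (s) |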
+|-------|---------|--------------|------------------|
+| 1     | 1       | 0.05         | 1                |
+| 2     | 2       | 0.1          | 1                |
+| 3     | 4       | 0.2          | 1                |
+| 4     | 8       | 0.4          | 1                |
+| 5     | 16      | 0.8          | 1                |
+| 6     | 32      | 1.6          | 1                |
+| 7     | 64      | 3.2          | 3                |
+| 8     | 128     | 6.4          | 6                |
+| 9     | 256     | 12.8         | 12               |
+| 10    | 512     | 25.6         | 25               |
+| ...   | ...     | ...          | ...              |
+
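+If the workflow step is waiting, the maximum backoff time is 60 seconds. You can change it by setting the [bootstrap parameter](../system-operation/bootstrap-parameters) `--max-workflow-wait-backoff-time`.
+
+If the workflow step has failed, the maximum backoff time is 300 seconds. You can change it by setting the [bootstrap parameter](../system-operation/bootstrap-parameters) `--max-workflow-failed-backoff-time`.
+
+### Maximum Retry Times
+
+When a workflow step fails, the workflow by default retries at most 10 times and then enters the suspending state. You can change the number of retries by setting the [bootstrap parameter](../system-operation/bootstrap-parameters) `--max-workflow-step-error-retry-times`.
+
+> Note that if a workflow step is held up because a resource is unhealthy (for example, a Pod has not started yet), the step is marked as waiting rather than failed.
+
+## State Keep
+
+When a workflow is running healthily (running) or is suspended while waiting for resources to become healthy (suspending), a KubeVela application, under the default configuration, periodically checks whether the resources it previously dispatched have drifted, and restores them to the configuration originally dispatched. The default check interval is 5 minutes, which can be adjusted by setting `--application-re-sync-period` in the KubeVela controller [bootstrap parameters](../system-operation/bootstrap-parameters). To disable State Keep, you can also configure the [apply-once](https://github.com/oam-dev/kubevela/blob/master/docs/examples/app-with-policy/apply-once-policy/apply-once.md) policy in the application.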
--- a/sidebars.js
+++ b/sidebars.js
@ -160,6 +160,7 @@ module.exports = {
          "Workflow System": [
            "platform-engineers/workflow/workflow",
            "platform-engineers/workflow/cue-actions",
+            "platform-engineers/workflow/working-mechanism",
          ],
        },
        {