Docs: add English version of performance fine-tuning (#220)

Somefive 2021-08-31 14:07:58 +08:00 committed by GitHub
parent 794e74edeb
commit 7f25f5788d
4 changed files with 78 additions and 8 deletions

@@ -1,3 +1,44 @@
---
title: Bootstrap Parameters
---
The bootstrap parameters of the KubeVela controller are listed below:
| Parameter Name | Type | Default Value | Description |
|:---------------------------:|:------:|:---------------------------------:|------------------------------------------------------------------------------|
| use-webhook | bool | false | Use Admission Webhook |
| webhook-cert-dir | string | /k8s-webhook-server/serving-certs | The directory of Admission Webhook cert/secret |
| webhook-port | int | 9443 | The port that Admission Webhook listens on |
| metrics-addr | string | :8080 | The address of Prometheus metrics |
| enable-leader-election | bool | false | Enable Leader Election for Controller Manager and ensure that only one replica is active |
| leader-election-namespace | string | "" | The namespace of Leader Election ConfigMap |
| log-file-path | string | "" | The log file path |
| log-file-max-size | int | 1024 | The maximum size (MB) of log files |
| log-debug | bool | false | Set the logging level to DEBUG, used in development environments |
| application-revision-limit | int | 10 | The maximum number of application revisions to keep. When the number of revisions exceeds this limit, older revisions will be discarded |
| definition-revision-limit | int | 20 | The maximum number of definition revisions to keep |
| autogen-workload-definition | bool | true | Generate WorkloadDefinition for ComponentDefinition automatically |
| health-addr | string | :9440 | The address of the health check endpoint |
| apply-once-only | string | false | Workloads and Traits will not change after being applied. Used in specific scenarios |
| disable-caps | string | "" | Disable internal capabilities |
| storage-driver | string | Local | The storage driver for applications |
| informer-re-sync-interval | time | 1h | The resync period for the controller informer, which is also the interval at which applications are reconciled when no spec changes are made |
| system-definition-namespace | string | vela-system | The namespace for storing system definitions |
| concurrent-reconciles | int | 4 | The number of threads that the controller uses to process requests |
| kube-api-qps | int | 50 | The QPS for the controller to access the apiserver |
| kube-api-burst | int | 100 | The burst for the controller to access the apiserver |
| oam-spec-ver | string | v0.3 | The version of the OAM spec to use |
| pprof-addr | string | "" | The address of pprof, empty by default to disable pprof |
| perf-enabled | bool | false | Enable performance logging, working with monitoring tools like Loki and Grafana to discover performance bottlenecks |
> Parameters not listed in the table are legacy parameters used in previous versions; the latest version (v1.1) does not use them.
### Key Parameters
- **informer-re-sync-interval**: The resync interval of applications when no changes are made. A short interval causes the controller to reconcile frequently without doing useful work, while regular reconciliation helps keep an application and its components up to date in case of unexpected divergence.
- **concurrent-reconciles**: The number of worker threads the controller uses to handle requests. When ample CPU resources are available, too few worker threads leave the CPU underutilized.
- **kube-api-qps / kube-api-burst**: The rate limit for the KubeVela controller's access to the apiserver. When managed applications are complex (containing multiple components and resources), a limited apiserver access rate makes it hard to raise the controller's concurrency; however, a high access rate can put a heavy burden on the apiserver. It is critical to strike a balance when handling massive numbers of applications.
- **pprof-addr**: The pprof address to enable controller performance debugging.
- **perf-enabled**: Use this flag if you would like to check the time costs of different stages when reconciling applications. Switch it off to simplify logging.
> Several sets of recommended parameter configurations are provided in the [Performance Fine-tuning](./performance-finetuning) section.
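
As a reference, the sketch below shows one way these flags can be passed to the controller container. The Deployment name, labels, and image tag are illustrative assumptions rather than the exact manifest shipped with KubeVela; the flag values are the defaults from the table above, except `--pprof-addr`, which is set only to illustrate enabling pprof.

```yaml
# Sketch: passing bootstrap parameters to the KubeVela controller.
# Names, labels, and image are placeholders; flag values are the
# defaults from the table, except --pprof-addr (empty by default).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kubevela-controller        # placeholder name
  namespace: vela-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kubevela-controller
  template:
    metadata:
      labels:
        app: kubevela-controller
    spec:
      containers:
        - name: kubevela
          image: oamdev/vela-core:latest   # placeholder image tag
          args:
            - "--concurrent-reconciles=4"
            - "--kube-api-qps=50"
            - "--kube-api-burst=100"
            - "--informer-re-sync-interval=1h"
            - "--enable-leader-election=true"
            - "--pprof-addr=:6060"          # illustrative; disabled by default
```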

@@ -1,3 +1,28 @@
---
title: Performance Fine-tuning
---
### Recommended Configurations
When the cluster scale becomes large and more applications need to be managed, the KubeVela controller may hit performance bottlenecks if its parameters are set inappropriately.
Based on KubeVela performance tests, three sets of parameters are recommended for clusters of different scales, as below.
| Scale | #Nodes | #Apps | #Pods | concurrent-reconciles | kube-api-qps | kube-api-burst | CPU | Memory |
| :---: | -------: | ------------: | -------: | --------------------: | :----------: | -------------: | ----: | -----: |
| Small | < 200 | < 3,000 | < 18,000 | 2 | 300 | 500 | 0.5 | 1Gi |
| Medium | < 500 | < 5,000 | < 30,000 | 4 | 500 | 800 | 1 | 2Gi |
| Large | < 1,000 | < 12,000 | < 72,000 | 4 | 800 | 1,000 | 2 | 4Gi |
> The above configurations are based on medium-size applications (each containing 2~3 components and 5~6 resources). If the applications in your scenario are generally larger, e.g., containing 20 resources each, scale the application counts proportionally to find the appropriate configuration and parameters.
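
As an illustration, the fragment below applies the Medium row to the controller Deployment; the args and resources mirror the table, while the container name and surrounding manifest layout are assumptions about your installation.

```yaml
# Sketch: Medium-scale settings from the table above.
# Container name and manifest layout are illustrative.
spec:
  template:
    spec:
      containers:
        - name: kubevela
          args:
            - "--concurrent-reconciles=4"
            - "--kube-api-qps=500"
            - "--kube-api-burst=800"
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "1"
              memory: 2Gi
```

Whether the CPU and memory figures map to requests, limits, or both is not specified by the table; setting both to the same value is one conservative reading.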
### Fine-tuning Methods
You might encounter various performance bottlenecks. Read the following examples and try to find the proper solution for your problem.
1. Applications can be created, and their directly managed resources are available, but indirect resources are not. For example, the Deployments of webservice components are created successfully but their Pods are not. Check whether kube-controller-manager has a performance bottleneck.
2. Applications can be created, their managed resources are not available, and there is no rendering problem with the application. Check whether the apiserver has lots of requests waiting in its queue; the mutating requests for managed resources might be blocked at the apiserver.
3. Applications can be found in the cluster but no status information is displayed. If there is no problem with the application content, this might be caused by a bottleneck in the KubeVela controller, such as rate-limited access to the apiserver. Increase **kube-api-qps / kube-api-burst** and check whether the CPU is overloaded. If the CPU is not overloaded, check whether the number of worker threads is below the number of CPU cores.
4. The KubeVela controller itself crashes frequently due to Out-Of-Memory errors. Increase its memory to solve this, as sketched below.
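
For case 4, a strategic-merge patch such as the sketch below is one way to raise the controller's memory. The container name and the new sizes are illustrative assumptions, not values prescribed by KubeVela.

```yaml
# memory-patch.yaml: raise the controller memory (values illustrative)
spec:
  template:
    spec:
      containers:
        - name: kubevela           # placeholder container name
          resources:
            requests:
              memory: 2Gi
            limits:
              memory: 4Gi
```

It could be applied with something like `kubectl patch deployment <controller-deployment> -n vela-system --patch-file memory-patch.yaml`.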
> Read more details in [KubeVela Performance Test Report](/blog/kubevela-performance-test)

@@ -40,3 +40,5 @@ The bootstrap parameters of the KubeVela controller and their descriptions are as follows.
- **kube-api-qps / kube-api-burst**: Controls how frequently the KubeVela controller accesses the apiserver. When the applications managed by the KubeVela controller are complex (containing many components and resources), a limited apiserver access rate makes it hard to raise the controller's concurrency; an overly high request rate, however, may put a heavy burden on the apiserver.
- **pprof-addr**: Setting this address enables pprof for controller performance debugging.
- **perf-enabled**: When enabled, the time costs of each stage of application management by the KubeVela controller appear in the logs; disable it to simplify logging.
> The [Performance Fine-tuning](./performance-finetuning) section contains several sets of recommended parameter configurations for different scenarios.

@@ -14,13 +14,15 @@ title: Performance Fine-tuning
| Medium | < 500 | < 5,000 | < 30,000 | 4 | 500 | 800 | 1 | 2Gi |
| Large | < 1,000 | < 12,000 | < 72,000 | 4 | 800 | 1,000 | 2 | 4Gi |
> In the above configurations, a single application is assumed to contain roughly 2~3 components and 5~6 resources. If applications in your scenario are generally larger, e.g., a single application corresponds to 20 resources, you can scale each configuration up proportionally.
### Fine-tuning Methods
Performance bottlenecks typically manifest in the following ways:
1. A newly created application can be fetched, and its directly associated resources can be fetched, but its indirectly associated resources cannot. For example, the Deployment corresponding to a webservice in the application is created successfully, but its Pods never come up. This is usually related to the controller of the corresponding resource, such as kube-controller-manager; check whether that controller has a performance bottleneck or other problems.
2. A newly created application can be fetched, its associated resources cannot, and the application rendering itself is fine (no rendering errors appear in the application's information). Check whether a large number of requests are queued at the apiserver; in this scenario, requests for dispatched resources such as Deployments may have reached the apiserver, but earlier requests queued ahead of them prevent new ones from being handled in time.
3. A newly created application can be fetched, but it carries no status information. If there is nothing wrong with the application content itself, this may be caused by a bottleneck in the KubeVela controller, such as rate-limited access to the apiserver, leaving throughput unable to keep up with the request rate. This can be solved by raising **kube-api-qps / kube-api-burst**. If rate limiting is not the issue, check whether the controller's CPU is saturated; if the CPU is overloaded, increase the controller's CPU resources. If the CPU is not saturated yet stays at a constant load, the number of threads may be smaller than the number of CPUs provided.
4. The KubeVela controller itself crashes frequently due to insufficient memory; this can be solved by increasing the controller's memory.
> See the [KubeVela Performance Test Report](/blog/kubevela-performance-test) for more details