Docs: rework observability arch & translate (#1046)

* Docs: rework observability

Signed-off-by: Yin Da <yd219913@alibaba-inc.com>

* Docs: update o11y v1.6

Signed-off-by: Yin Da <yd219913@alibaba-inc.com>

Signed-off-by: Yin Da <yd219913@alibaba-inc.com>
This commit is contained in:
Somefive 2022-11-02 22:31:51 +08:00 committed by GitHub
parent 92258e4c5f
commit 2e750f6a9f
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
46 changed files with 1676 additions and 2204 deletions


@ -0,0 +1,91 @@
---
title: Installation
---
:::tip
Before installing the observability addons, we recommend you start with the [introduction of the observability feature](../observability).
:::
## Quick Start
To enable the observability addons, simply run the `vela addon enable` commands as below.
1. Install the kube-state-metrics addon
```shell
vela addon enable kube-state-metrics
```
2. Install the node-exporter addon
```shell
vela addon enable node-exporter
```
3. Install the prometheus-server addon
```shell
vela addon enable prometheus-server
```
4. Install the loki addon
```shell
vela addon enable loki
```
5. Install the grafana addon
```shell
vela addon enable grafana
```
6. Access your grafana through port-forward.
```shell
kubectl port-forward svc/grafana -n o11y-system 8080:3000
```
Now you can access your grafana by visiting `http://localhost:8080` in your browser. The default username and password are `admin` and `kubevela` respectively.
> You can change them by adding `adminUser=super-user adminPassword=PASSWORD` to the `vela addon enable grafana` command in step 5.
You will see several pre-installed dashboards that you can use to view your system and applications. For more details on those pre-installed dashboards, see the [Out-of-the-Box](./out-of-the-box) section.
![kubevela-application-dashboard](../../../resources/kubevela-application-dashboard.png)
:::caution
**Resource**: The observability suite includes several addons which require some computation resources to work properly. The recommended installation resources for your cluster are 2 cores + 4 Gi memory.
**Version**: We recommend KubeVela v1.6.0 or later for the observability addons. In v1.5.0, logging is not supported.
:::
:::tip
**Addon Suite**: If you want to enable these addons in one command, you can use [WorkflowRun](https://github.com/kubevela/workflow) to orchestrate the install process. It allows you to manage the addon enable process as code and make it reusable across different systems.
:::
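As a sketch of what such an Addon Suite could look like, the WorkflowRun below chains the five `vela addon enable` steps from the Quick Start. The `addon-operation` step type and its fields follow examples from the kubevela/workflow project, but treat the exact schema as an assumption to verify against its documentation.

```yaml
apiVersion: core.oam.dev/v1alpha1
kind: WorkflowRun
metadata:
  name: observability
  namespace: vela-system
spec:
  workflowSpec:
    steps:
    # each step enables one addon, mirroring the Quick Start order
    - name: enable-kube-state-metrics
      type: addon-operation
      properties:
        addonName: kube-state-metrics
        operation: enable
    - name: enable-node-exporter
      type: addon-operation
      properties:
        addonName: node-exporter
        operation: enable
    - name: enable-prometheus-server
      type: addon-operation
      properties:
        addonName: prometheus-server
        operation: enable
    - name: enable-loki
      type: addon-operation
      properties:
        addonName: loki
        operation: enable
    - name: enable-grafana
      type: addon-operation
      properties:
        addonName: grafana
        operation: enable
```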
## Multi-cluster Installation
If you want to install the observability addons in a multi-cluster scenario, make sure your Kubernetes clusters support LoadBalancer services and are mutually accessible.
By default, the installation of `kube-state-metrics`, `node-exporter` and `prometheus-server` natively supports multi-cluster (they will be automatically installed to all clusters). But to let the `grafana` on the control plane access the prometheus-server instances in managed clusters, you need to enable `prometheus-server` with the following command.
```shell
vela addon enable prometheus-server thanos=true serviceType=LoadBalancer
```
This will install the [thanos](https://github.com/thanos-io/thanos) sidecar and query components alongside prometheus-server. Then enable grafana, and you will be able to see the aggregated prometheus metrics.
You can also choose which clusters to install the addons on, as below:
```shell
vela addon enable kube-state-metrics clusters=\{local,c2\}
```
For the `loki` addon, the storage is hosted on the hub control plane by default, while the agent ([promtail](https://grafana.com/docs/loki/latest/clients/promtail/) or [vector](https://vector.dev/)) installation supports multi-cluster. You can run the following command to let the agents in all clusters send logs to the loki service on the `local` cluster.
```shell
vela addon enable loki agent=vector serviceType=LoadBalancer
```
> If you add new clusters to your control plane after the addons are installed, you need to re-enable the addon for it to take effect.


@ -94,10 +94,6 @@ For more details, you can refer to [vela-prism](https://github.com/kubevela/pris
It is also possible to make integrations through KubeVela's configuration management system, whether you are using the CLI or VelaUX.
### Prometheus
You can read the Configuration Management documentation for more details.
## Integrate Other Tools or Systems
There is a wide range of community tools and ecosystems that users can leverage to build their observability systems, such as prometheus-operator or DataDog. So far, KubeVela does not have established best practices for those integrations. We may integrate with those popular projects through KubeVela addons in the future. Community contributions for broader exploration and more integrations are also welcome.


@ -227,7 +227,7 @@ spec:
In this example, we transform nginx `combined` format logs into JSON and add a `new_field` key to each log entry, with the value `new value`. Please refer to the [documentation](https://vector.dev/docs/reference/vrl/) for how to write Vector VRL.
If you have a dedicated log analysis dashboard for this processing method, you can refer to the [documentation](./dashboard) to import it into grafana.
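As a rough sketch of the transform described above: `parse_nginx_log` is a real VRL function, but verify the exact signature and the incoming field name (`.message` is an assumption here) against the linked VRL reference for your Vector version.

```
# parse the nginx "combined" access log carried in .message into structured fields
. = parse_nginx_log!(.message, format: "combined")
# add the extra key mentioned above
.new_field = "new value"
```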
## Collecting file log


@ -2,6 +2,47 @@
title: Metrics
---
## Exposing Metrics in your Application
In your application, if you want to expose the metrics of your component (like webservice) to Prometheus, you just need to add the `prometheus-scrape` trait as follows.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-app
spec:
components:
- name: my-app
type: webservice
properties:
image: somefive/prometheus-client-example:new
traits:
- type: prometheus-scrape
```
You can also explicitly specify which port and which path to expose metrics.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-app
spec:
components:
- name: my-app
type: webservice
properties:
image: somefive/prometheus-client-example:new
traits:
- type: prometheus-scrape
properties:
port: 8080
path: /metrics
```
This makes your application scrapable by the prometheus server. If you want to see those metrics on Grafana, you need to create a Grafana dashboard. Go to [Dashboard](./dashboard) to learn the following steps.
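The example image above (`somefive/prometheus-client-example`) is not expanded here. As an illustrative sketch of what such a component serves on `/metrics`, the snippet below hand-writes the Prometheus text exposition format using only the Python standard library; the metric name and port are made up for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS = {"count": 0}  # a trivial in-memory counter

def render_metrics() -> str:
    # Prometheus text exposition format: HELP/TYPE comments, then samples.
    return (
        "# HELP my_app_requests_total Total HTTP requests handled.\n"
        "# TYPE my_app_requests_total counter\n"
        f"my_app_requests_total {REQUESTS['count']}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            REQUESTS["count"] += 1
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

def serve(port: int = 8080) -> None:
    # Listen on the port declared in the trait properties (8080 in the example).
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

With the `prometheus-scrape` trait pointing at this port and path, the prometheus server would collect `my_app_requests_total` on each scrape.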
## Customized Prometheus Installation
If you want to make customizations to your prometheus-server installation, you can put your configuration into a separate ConfigMap, like `my-prom` in the `o11y-system` namespace. To distribute your custom config to all clusters, you can also use a KubeVela Application to do the job.
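The full distribution example is elided in this file; a minimal sketch follows, assuming the `k8s-objects` component type is available and that a `topology` policy with an empty `clusterLabelSelector` targets all clusters.

```yaml
# my-prom.yaml (sketch)
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-prom
  namespace: o11y-system
spec:
  components:
  - name: my-prom
    type: k8s-objects
    properties:
      objects:
      - apiVersion: v1
        kind: ConfigMap
        metadata:
          name: my-prom
          namespace: o11y-system
        data:
          my-recording-rules.yml: |
            groups:
            - name: example
              rules:
              - record: apiserver:requests:rate5m
                expr: sum(rate(apiserver_request_total{job="kubernetes-nodes"}[5m]))
  policies:
  - type: topology
    name: topology
    properties:
      clusterLabelSelector: {}
```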
@ -78,45 +119,3 @@ vela addon enable prometheus-server storage=1G
```
This will create PersistentVolumeClaims and let the addon use the provided storage. The storage will not be automatically recycled even if the addon is disabled. You need to clean up the storage manually.


@ -0,0 +1,94 @@
---
title: Out of the Box
---
By default, a series of dashboards are pre-installed with the `grafana` addon and provide basic panels for viewing observability data. If you follow the [installation guide](./installation), you should be able to use these dashboards without further configuration.
## Dashboards
### KubeVela Application
This dashboard shows the basic information for one application.
URL: http://localhost:8080/d/application-overview/kubevela-applications
![kubevela-application-dashboard](../../../resources/kubevela-application-dashboard.png)
:::info
The KubeVela Application dashboard shows an overview of the application's metadata. It directly accesses the Kubernetes API to retrieve the runtime application information, so you can use it as an entry point. You can navigate to detailed information for application resources by clicking the `Details` link in the *Managed Resources* panel.
The **Basic Information** section extracts key information into panels and gives you the most straightforward view of the current application.
The **Related Resources** section shows the resources that work together with the application itself, including the managed resources, the recorded ResourceTrackers and the revisions.
:::
### Kubernetes Deployment
This dashboard shows an overview of native deployments. You can navigate between deployments across clusters.
URL: http://localhost:8080/d/kubernetes-deployment/kubernetes-deployment
![kubernetes-deployment-dashboard](../../../resources/kubernetes-deployment-dashboard.jpg)
:::info
The Kubernetes Deployment dashboard gives you the detailed running status of a deployment.
The **Pods** panel shows the pods that the deployment is currently managing.
The **Replicas** panel shows how the number of replicas changes, which can be used to diagnose when and how your deployment shifted to an undesired state.
The **Resource** section includes details of resource usage (CPU / memory / network / storage), which can be used to identify whether the pods of the deployment are facing resource pressure or sending/receiving unexpected traffic.
There is a list of dashboards for various types of Kubernetes resources, such as DaemonSet and StatefulSet. You can navigate to those dashboards depending on your workload type.
:::
### KubeVela System
This dashboard shows an overview of the KubeVela system. It can be used to check whether the KubeVela controller is healthy.
URL: http://localhost:8080/d/kubevela-system/kubevela-system
![kubevela-system](../../../resources/kubevela-system.jpg)
:::info
The KubeVela System dashboard gives you the running details of the KubeVela core modules, including the controller and the cluster-gateway. Other modules like velaux or prism are expected to be added in the future.
The **Computation Resource** section shows the resource usage of the core modules. It can be used to track memory leaks (memory usage continuously increasing) or high pressure (CPU usage constantly very high). If memory usage hits the resource limit, the corresponding module will be killed and restarted, which indicates insufficient computation resources; you should add more CPU/memory for them.
The **Controller** section includes a wide range of panels which can help you diagnose the bottleneck of the KubeVela controller in your scenario.
The **Controller Queue** and **Controller Queue Add Rate** panels show you how the controller working queue changes. If the controller queue keeps increasing, there are too many applications or application changes in the system and the controller is unable to handle them in time, which indicates a performance issue in the KubeVela controller. A temporary increase in the controller queue is tolerable, but a long-lasting one leads to growing memory usage and will eventually cause Out-Of-Memory problems.
The **Reconcile Rate** and **Average Reconcile Time** panels give you an overview of the controller status. If the reconcile rate is steady and the average reconcile time is reasonable (say, under 500ms, depending on your scenario), your KubeVela controller is healthy. If the controller queue add rate is increasing but the reconcile rate does not go up, the controller queue will gradually grow and cause trouble. There are several ways your controller can be unhealthy:
1. Reconcile is healthy but there are too many applications: everything looks fine except that the controller queue metrics keep increasing. Check the CPU/memory usage of the controller; you might need to add more computation resources.
2. Reconcile is unhealthy due to too many errors: you will find lots of errors in the **Reconcile Rate** panel. This means your system continuously fails to process applications. It could be caused by invalid application configurations or unexpected errors while running workflows. Check the application details to see which applications are causing the errors.
3. Reconcile is unhealthy due to long reconcile times: check the **ApplicationController Reconcile Time** panel and see whether it is a common case (the average reconcile time is high) or only part of your applications have problems (the p95 reconcile time is high). For the former case, it is usually caused by either insufficient CPU (CPU usage is high) or too many requests being rate limited by kube-apiserver (check the **ApplicationController Client Request Throughput** and **ApplicationController Client Request Average Time** panels to see which resource requests are slow or excessive). For the latter case, you need to check which application is large and takes a long time to reconcile.
Sometimes you might need to refer to the **ApplicationController Reconcile Stage Time** panel to see whether some specific reconcile stage is abnormal. For example, if GCResourceTrackers takes a lot of time, resource recycling might be blocked in the KubeVela system.
The **Application** section shows an overview of the applications in your whole KubeVela system. It can be used to see the changes in application numbers and the workflow steps in use. The **Workflow Initialize Rate** is an auxiliary panel showing how frequently new workflow executions are launched. The **Workflow Average Complete Time** further shows how long it takes to finish a whole workflow.
:::
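If you prefer querying these signals directly in Grafana or Prometheus, the PromQL sketches below approximate some of the panels described above. The metric and label names follow controller-runtime conventions and may differ across KubeVela versions, so treat them as assumptions to verify.

```
# reconcile rate, split by controller and result (errors show up here)
sum(rate(controller_runtime_reconcile_total[5m])) by (controller, result)

# working queue depth; a sustained rise means the controller cannot keep up
sum(workqueue_depth) by (name)

# p95 reconcile time per controller
histogram_quantile(0.95, sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le, controller))
```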
### Kubernetes APIServer
This dashboard shows the running status of all Kubernetes apiservers.
URL: http://localhost:8080/d/kubernetes-apiserver/kubernetes-apiserver
![kubernetes-apiserver](../../../resources/kubernetes-apiserver.jpg)
:::info
The Kubernetes APIServer dashboard helps you examine the most fundamental part of your Kubernetes system. If your Kubernetes APIServer is unhealthy, all controllers and modules in your Kubernetes system will behave abnormally and fail to handle requests. So it is important to make sure everything is fine in this dashboard.
The **Requests** section includes a series of panels showing the QPS and latency of various kinds of requests. Usually, your APIServer may fail to respond if it is flooded with too many requests; here you can see which type of request is causing trouble.
The **WorkQueue** section shows the processing status of the Kubernetes APIServer. If the **Queue Size** is large, the number of requests exceeds the processing capability of your Kubernetes APIServer.
The **Watches** section shows the number of watches on your Kubernetes APIServer. Compared with other types of requests, WATCH requests continuously consume computation resources in the APIServer, so it helps to keep the number of watches limited.
:::
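For reference, panels like these are typically built from the standard apiserver metrics. The queries below are hedged sketches; verify the metric names against your Kubernetes version.

```
# request QPS by verb and HTTP code
sum(rate(apiserver_request_total[5m])) by (verb, code)

# p95 request latency per verb
histogram_quantile(0.95, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))

# currently registered watchers per resource
sum(apiserver_registered_watchers) by (group, kind)
```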


@ -2,8 +2,6 @@
title: Automated Observability
---
## Introduction
Observability is critical for infrastructures and applications. Without an observability system, it is hard to identify what happened when a system breaks down. On the contrary, a strong observability system not only provides confidence for operators but also helps developers quickly locate performance bottlenecks or weak points in the whole system.
To help users build their own observability system from scratch, KubeVela provides a list of addons, including
@ -16,102 +14,21 @@ To help users build their own observability system from scratch, KubeVela provid
**Logging**
- `loki`: A logging server which stores collected logs of Kubernetes pods and provides query interfaces.
**Dashboard**
- `grafana`: A web application that provides analytics and interactive visualizations.
> More addons for alerting & tracing will be introduced in later versions.
## What's Next
- [**Installation**](./o11y/installation): Guide for how to install observability addons in your KubeVela system.
- [**Out of the Box**](./o11y/out-of-the-box): Guide for how to use pre-installed dashboards to monitor your system and applications.
- [**Metrics**](./o11y/metrics): Guide for customizing the process of collecting metrics for your application.
- [**Logging**](./o11y/logging): Guide for how to customize the log collecting rules for your application.
- [**Dashboard**](./o11y/dashboard): Guide for creating your customized dashboards for applications.
- [**Integration**](./o11y/integration): Guide for integrating your existing infrastructure with KubeVela, when you already have Prometheus or Grafana before installing the addons.



@ -0,0 +1,91 @@
---
title: Installation
---
:::tip
If you are running KubeVela in a multi-cluster scenario, please refer to the [Multi-cluster Installation](#multi-cluster-installation) section below.
:::
## Quick Start
To enable the addon suite, simply run the `vela addon enable` commands as below.
1. Install the kube-state-metrics addon
```shell
vela addon enable kube-state-metrics
```
2. Install the node-exporter addon
```shell
vela addon enable node-exporter
```
3. Install the prometheus-server addon
```shell
vela addon enable prometheus-server
```
4. Install the loki addon
```shell
vela addon enable loki
```
5. Install the grafana addon
```shell
vela addon enable grafana
```
6. Access your grafana through port forwarding
```shell
kubectl port-forward svc/grafana -n o11y-system 8080:3000
```
Now you can access your grafana by visiting `http://localhost:8080` in your browser. The default username and password are `admin` and `kubevela` respectively.
> You can change the default Grafana username and password by adding the `adminUser=super-user adminPassword=PASSWORD` parameters to the `vela addon enable grafana` command in step 5.
After installation, you will see several pre-built dashboards in Grafana which help you view the running status of the whole system and of each application. Refer to the [Dashboard](./dashboard) section for details of these pre-installed dashboards.
![kubevela-application-dashboard](../../../resources/kubevela-application-dashboard.png)
:::caution
**Resource**: The observability suite includes several addons which require some computation resources to work properly. The recommended installation resources for the cluster are 2 cores + 4 Gi memory.
**Version**: The required KubeVela version (server-side controller and client CLI) is **no lower than** v1.5.0-beta.4.
:::
:::tip
**Addon Suite**: If you want to install all the observability addons with a single command, you can use [WorkflowRun](https://github.com/kubevela/workflow) to orchestrate the installation process. It helps you codify the complex installation flow and reuse it across different systems.
:::
## Multi-cluster Installation
If you want to install the observability addons in a multi-cluster scenario, make sure your Kubernetes clusters support LoadBalancer services and are mutually accessible.
By default, the installation of `kube-state-metrics`, `node-exporter` and `prometheus-server` natively supports multi-cluster (they will be automatically installed to all clusters). But to let the `grafana` on the control plane access the prometheus-server instances in managed clusters, you need to enable `prometheus-server` with the following command.
```shell
vela addon enable prometheus-server thanos=true serviceType=LoadBalancer
```
This will install the [thanos](https://github.com/thanos-io/thanos) sidecar along with prometheus-server. Then enable grafana, and you will be able to see the aggregated prometheus metrics.
You can also choose which clusters to install the addons on with the following command:
```shell
vela addon enable kube-state-metrics clusters=\{local,c2\}
```
For the `loki` addon, log storage is centralized on the control plane by default, while the installation of the log agent ([promtail](https://grafana.com/docs/loki/latest/clients/promtail/) or [vector](https://vector.dev/)) supports multi-cluster. You can run the following command to install the agents across clusters and let them store the collected logs in the Loki service on the `local` cluster.
```shell
vela addon enable loki agent=vector serviceType=LoadBalancer
```
> If you add new clusters to the control plane after the addons are installed, you need to re-enable the addons for them to take effect.


@ -2,11 +2,11 @@
title: 外部系统集成
---
Sometimes, you might already have Prometheus and Grafana instances, built by other tools or provided by cloud vendors. Follow the guide below to integrate with those existing systems.
## Integrate Prometheus
If you already have an external Prometheus service and want to connect it to Grafana (created by the vela addon), you can create a GrafanaDatasource through a KubeVela Application to register this external Prometheus service.
```yaml
apiVersion: core.oam.dev/v1beta1
@ -33,15 +33,15 @@ spec:
url: <my-prometheus url>
```
For example, if you are using the Prometheus service on Alibaba Cloud (ARMS), you can go to the Prometheus settings page and find the access URL and access token.
![arms-prometheus](../../../resources/arms-prometheus.jpg)
> You need to make sure your grafana access is already available. You can run `kubectl get grafana default` to check whether it exists.
## Integrate Grafana
If you already have an existing Grafana instance, similar to the Prometheus integration, you can register the Grafana access information through a KubeVela Application.
```yaml
apiVersion: core.oam.dev/v1beta1
@ -58,38 +58,38 @@ spec:
token: <access token>
```
To get your grafana access credentials, you can go into your Grafana instance and configure API keys.
![grafana-apikeys](../../../resources/grafana-apikeys.jpg)
Then copy the token into your grafana registration configuration.
![grafana-added-apikeys](../../../resources/grafana-added-apikeys.jpg)
After the application is successfully dispatched, you can check the registration by running the following command.
```shell
kubectl get grafana
```
```shell
> kubectl get grafana
NAME ENDPOINT CREDENTIAL_TYPE
default http://grafana.o11y-system:3000 BasicAuth
my-grafana https://grafana-rngwzwnsuvl4s9p66m.grafana.aliyuncs.com:80/ BearerToken
```
Now you can also manage dashboards and datasources on your grafana instance through the native Kubernetes API.
```shell
# show all the dashboards you have
kubectl get grafanadashboard -l grafana=my-grafana
```
```shell
# show all the datasources you have
kubectl get grafanadatasource -l grafana=my-grafana
```
For more details, you can refer to [vela-prism](https://github.com/kubevela/prism#grafana-related-apis).
## Integrate via Configuration Management
Besides the direct methods above, another approach is to use the configuration management capability of KubeVela to connect existing Prometheus or Grafana instances to the KubeVela system. See the Configuration Management documentation for details.
## Integrate Other Tools or Systems
There is a wide range of community tools and ecosystems that users can leverage to build their own observability systems, such as prometheus-operator or DataDog. So far, KubeVela does not have established best practices for those integrations. We may integrate with those popular projects through KubeVela addons in the future. Community contributions for broader exploration and more integrations are also welcome.


@ -223,7 +223,7 @@ spec:
```
In this example, we transform the nginx `combined` format logs into JSON and add a `new_field` key to each log entry, with the value `new value`. Refer to the [documentation](https://vector.dev/docs/reference/vrl/) for how to write Vector VRL.
If you have built a dedicated log analysis dashboard for this processing method, you can refer to the [documentation](./dashboard) for the three ways to import it into grafana.
## Collecting file log


@ -2,13 +2,54 @@
title: Metrics
---
## Exposing Metrics in your Application
If you want to expose the metrics of a component (like webservice) in your application to Prometheus so that they can be scraped, you just need to add the `prometheus-scrape` trait to it.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-app
spec:
components:
- name: my-app
type: webservice
properties:
image: somefive/prometheus-client-example:new
traits:
- type: prometheus-scrape
```
You can also explicitly specify which port and which path to expose metrics on.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-app
spec:
components:
- name: my-app
type: webservice
properties:
image: somefive/prometheus-client-example:new
traits:
- type: prometheus-scrape
properties:
port: 8080
path: /metrics
```
This will let Prometheus scrape the metrics of your application components. If you want to see those metrics on Grafana, you need to create a Grafana dashboard. See the [Dashboard](./dashboard) section to learn the following steps.
## Customized Prometheus Configuration
If you want to customize your prometheus-server installation, you can put your configuration into a separate ConfigMap, like `my-prom` in the `o11y-system` namespace. To distribute your custom config to all clusters, you can also use a KubeVela Application to do the job.
### Recording Rules
For example, if you want to add some recording rules to all your prometheus server configurations in all clusters, you can first create an Application to distribute the recording rules, as below.
```yaml
# my-prom.yaml
@ -31,10 +72,10 @@ spec:
data:
my-recording-rules.yml: |
groups:
- name: example
  rules:
  - record: apiserver:requests:rate5m
    expr: sum(rate(apiserver_request_total{job="kubernetes-nodes"}[5m]))
policies:
- type: topology
name: topology
@ -42,17 +83,17 @@ spec:
clusterLabelSelector: {}
```
Then you need to add the `customConfig` parameter when enabling the prometheus-server addon, like:
```shell
vela addon enable prometheus-server thanos=true serviceType=LoadBalancer storage=1G customConfig=my-prom
```
Then you will be able to see the recording rules configuration delivered to all prometheus instances.
### Alerting Rules and Other Configurations
To customize other configurations like alerting rules, the process is the same as the recording rules example above. You only need to change or add prometheus configurations in the application.
```yaml
data:
@ -69,54 +110,22 @@ data:
![prometheus-rules-config](../../../resources/prometheus-rules-config.jpg)
### Customized Grafana Credentials
If you want to change the default username and password of Grafana, you can run the following command:
```shell
vela addon enable grafana adminUser=super-user adminPassword=PASSWORD
```
This will change your default admin user to `super-user` and its password to `PASSWORD`.
### Customized Storage
If you want prometheus-server and grafana to persist data in volumes, you can specify the `storage` parameter when enabling the addon, for example:
```shell
vela addon enable prometheus-server storage=1G
```
This will create PersistentVolumeClaims and let the addons use the provided storage. The storage will not be automatically recycled even if the addons are disabled. You need to clean up the storage manually.
@ -0,0 +1,87 @@
---
title: Out of the Box
---
## Dashboards
There are four built-in dashboards for browsing and viewing your system.
### KubeVela Application
This dashboard shows the basic information of an application.
URL: http://localhost:8080/d/application-overview/kubevela-applications
![kubevela-application-dashboard](../../../resources/kubevela-application-dashboard.png)
:::info
The KubeVela Application dashboard shows the overview of the application metadata. It directly accesses the Kubernetes API to retrieve the runtime application information, and you can use it as an entrance.
The **Basic Information** section extracts key information into panels and gives you the most straightforward view of the current application.
The **Related Resource** section shows the resources that work together with the application itself, including the managed resources, the recorded ResourceTrackers and the revisions.
:::
### Kubernetes Deployment
This dashboard shows the overview of native deployments. You can view information of deployments across clusters.
URL: http://localhost:8080/d/deployment-overview/kubernetes-deployment
![kubernetes-deployment-dashboard](../../../resources/kubernetes-deployment-dashboard.jpg)
:::info
The Kubernetes Deployment dashboard gives you the detailed running status of a deployment.
The **Pods** panel shows the pods currently managed by the deployment.
The **Replicas** panel shows how the number of replicas changes, which can be used to diagnose when and how your deployment shifted to an undesired state.
The **Pod** section contains the detailed resource usage (including CPU / Memory / Network / Storage), which can be used to identify whether the pods are facing resource pressure or making/receiving unexpected traffic.
:::
### KubeVela System
This dashboard shows the overview of the KubeVela system. It can be used to see whether the KubeVela controller is healthy.
URL: http://localhost:8080/d/kubevela-system/kubevela-system
![kubevela-system](../../../resources/kubevela-system.jpg)
:::info
The KubeVela System dashboard gives you the running details of the KubeVela core modules, including the controller and the cluster-gateway. Other modules like velaux or prism are expected to be added in the future.
The **Computation Resource** section shows the usage of the core modules. It can be used to track whether there is a memory leak (if the memory usage keeps increasing) or high pressure (if the CPU usage is always very high). If the memory usage hits the resource limit, the corresponding module will be killed and restarted, which indicates insufficient computation resources. You should add more CPU/Memory for them.
The **Controller** section includes various panels that can help diagnose the bottleneck of your KubeVela controller.
The **Controller Queue** and **Controller Queue Add Rate** panels show the changes of the controller working queue. If the controller queue keeps increasing, it means there are too many applications or application changes in the system and the controller can no longer handle them in time, which indicates a performance issue for the KubeVela controller. A temporary increase of the controller queue is tolerable, but lasting for a long time will lead to increased memory usage and finally cause Out-Of-Memory problems.
The **Reconcile Rate** and **Average Reconcile Time** panels give you the overview of the controller status. If the reconcile rate is steady and the average reconcile time is reasonable (e.g. under 500ms, depending on your scenario), your KubeVela controller is healthy. If the controller queue add rate is increasing but the reconcile rate does not go up, the controller queue will gradually grow and cause trouble. There are several cases in which your controller is unhealthy:
1. Reconciliation is healthy but there are too many applications. You will find everything is okay except that the controller queue metrics keep increasing. Check the CPU/Memory usage of the controller. You might need to add more computation resources.
2. Reconciliation is unhealthy due to too many errors. You will find lots of errors in the **Reconcile Rate** panel. This means your system is continuously facing errors while processing applications, which could be caused by invalid application configurations or unexpected errors while running workflows. Check the application details and see which applications are causing the errors.
3. Reconciliation is unhealthy due to long reconcile times. You need to check the **ApplicationController Reconcile Time** panel to see whether it is a common case (the average reconcile time is high) or only part of your applications have problems (the p95 reconcile time is high). For the former case, it is usually caused by either insufficient CPU (high CPU usage) or too many requests being rate-limited by kube-apiserver (check the **ApplicationController Client Request Throughput** and **ApplicationController Client Request Average Time** panels and see which resource requests are slow or excessive). For the latter case, you need to check which application is large and spends lots of time on reconciliation.
Sometimes you might need to refer to the **ApplicationController Reconcile Stage Time** panel to see whether some special reconcile stage is abnormal. For example, GCResourceTracker taking lots of time means resource recycling might be blocked in the KubeVela system.
The **Application** section shows the overview of the applications in your whole KubeVela system. It can be used to see the changes of the application numbers and the workflow steps in use. **Workflow Initialize Rate** is an auxiliary panel that shows how frequently new workflow executions are launched. **Workflow Average Complete Time** further shows how long it takes to finish the whole workflow.
:::
### Kubernetes APIServer
This dashboard shows the running status of all Kubernetes apiservers.
URL: http://localhost:8080/d/kubernetes-apiserver/kubernetes-apiserver
![kubernetes-apiserver](../../../resources/kubernetes-apiserver.jpg)
:::info
The Kubernetes APIServer dashboard helps you see the most fundamental part of your Kubernetes system. If your Kubernetes APIServer is not running healthily, all the controllers and modules in your Kubernetes system will work abnormally and be unable to handle requests successfully. So make sure everything is fine in this dashboard.
The **Requests** section includes a series of panels showing the QPS and latency of various kinds of requests. Usually the APIServer could fail to respond if it is flooded by too many requests. At that point, you can see which type of requests is causing trouble.
The **WorkQueue** section shows the processing status of the Kubernetes APIServer. If the **Queue Size** is large, the number of requests has exceeded the processing capability of your Kubernetes APIServer.
The **Watches** section shows the number of watches in your Kubernetes APIServer. Compared to other types of requests, WATCH requests continuously consume computation resources in the APIServer, so it helps to keep the number of watches limited.
:::
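The panels in this dashboard are generally backed by standard apiserver metrics. Queries in the following spirit (illustrative PromQL, not the exact panel definitions) reproduce the request-rate and latency views:

```
# Request QPS by verb
sum(rate(apiserver_request_total[5m])) by (verb)

# p99 request latency by verb
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (verb, le))
```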
@ -1,390 +0,0 @@
---
title: 可视化
---
Visualization is one of the methods to present the observability information.
For example, metrics can be plotted into different types of graphs depending on their categories, and logs can be filtered and listed.
In KubeVela, leveraging the power of the Kubernetes Aggregated API layer, it is easy for users to manipulate dashboards on Grafana and customize application visualizations.
## Pre-installed Dashboards
When the `grafana` addon is enabled in a KubeVela system, a series of dashboards are pre-installed, providing basic panels for viewing observability data.
### KubeVela Application
This dashboard shows the basic information for one application.
URL: http://localhost:8080/d/application-overview/kubevela-applications
![kubevela-application-dashboard](../../../resources/kubevela-application-dashboard.jpg)
<details>
The KubeVela Application dashboard shows the overview of the metadata for the application. It directly accesses the Kubernetes API to retrieve the runtime application information, and you can use it as an entrance. You can navigate to detailed information for application resources by clicking the `detail` link in the *Managed Resources* panel.
---
The **Basic Information** section extracts key information into panels and give you the most straightforward view for the current application.
---
The **Related Resources** section shows those resources that work together with the application itself, including the managed resources, the recorded ResourceTrackers and the revisions.
</details>
### Kubernetes Deployment
This dashboard shows the overview of native deployments. You can navigate deployments across clusters.
URL: http://localhost:8080/d/kubernetes-deployment/kubernetes-deployment
![kubernetes-deployment-dashboard](../../../resources/kubernetes-deployment-dashboard.jpg)
<details>
The Kubernetes Deployment dashboard gives you the detailed running status of the deployment.
---
The **Pods** panel shows the pods that the deployment itself is currently managing.
---
The **Replicas** panel shows how the number of replicas changes, which can be used to diagnose when and how your deployment shifted to undesired state.
---
The **Resource** section includes the details for the resource usage (including the CPU / Memory / Network / Storage) which can be used to identify if the pods of the deployment are facing resource pressure or making/receiving unexpected traffics.
---
There is a list of dashboards for various types of Kubernetes resources, such as DaemonSet and StatefulSet. You can navigate to those dashboards depending on your workload type.
</details>
### KubeVela System
This dashboard shows the overview of the KubeVela system. It can be used to see if KubeVela controller is healthy.
URL: http://localhost:8080/d/kubevela-system/kubevela-system
![kubevela-system](../../../resources/kubevela-system.jpg)
<details>
The KubeVela System dashboard gives you the running details of the KubeVela core modules, including the controller and the cluster-gateway. Other modules like velaux or prism are expected to be added in the future.
---
The **Computation Resource** section shows the usage for core modules. It can be used to track if there is any memory leak (if the memory usage is continuously increasing) or under high pressure (the cpu usage is always very high). If the memory usage hits the resource limit, the corresponding module will be killed and restarted, which indicates the lack of computation resource. You should add more CPU/Memory for them.
---
The **Controller** section includes a wide range of panels which can help you to diagnose the bottleneck of the KubeVela controller in your scenario.
The **Controller Queue** and **Controller Queue Add Rate** panels show you the controller working queue changes. If the controller queue keeps increasing, it means there are too many applications or application changes in the system, and the controller is unable to handle them in time, which indicates a performance issue for the KubeVela controller. A temporary increase of the controller queue is tolerable, but keeping it for a long time will lead to memory increase and finally cause Out-Of-Memory problems.
The **Reconcile Rate** and **Average Reconcile Time** panels give you the overview of the controller status. If the reconcile rate is steady and the average reconcile time is reasonable (like under 500ms, depending on your scenario), your KubeVela controller is healthy. If the controller queue add rate is increasing but the reconcile rate does not go up, the controller queue will gradually grow and cause trouble. There are several cases in which your controller is unhealthy:
1. Reconciliation is healthy but there are too many applications. You will find everything is okay except that the controller queue metrics keep increasing. Check the CPU/Memory usage of the controller. You might need to add more computation resources.
2. Reconciliation is unhealthy due to too many errors. You will find lots of errors in the **Reconcile Rate** panel. This means your system is continuously facing errors while processing applications. It could be caused by invalid application configurations or unexpected errors while running workflows. Check the application details and see which applications are causing the errors.
3. Reconciliation is unhealthy due to long reconcile times. You need to check the **ApplicationController Reconcile Time** panel and see whether it is a common case (the average reconcile time is high) or only part of your applications have problems (the p95 reconcile time is high). For the former case, it is usually caused by either insufficient CPU (high CPU usage) or too many requests being rate-limited by kube-apiserver (check the **ApplicationController Client Request Throughput** and **ApplicationController Client Request Average Time** panels and see which resource requests are slow or excessive). For the latter case, you need to check which application is large and spends lots of time on reconciliation.
Sometimes you might need to refer to the **ApplicationController Reconcile Stage Time** panel and see whether some special reconcile stage is abnormal. For example, GCResourceTracker taking lots of time means there might be blocking in recycling resources in the KubeVela system.
---
The **Application** section shows the overview of the applications in your whole KubeVela system. It can be used to see the changes of the application numbers and the used workflow steps. The **Workflow Initialize Rate** is an auxiliary panel which can be used to see how frequent new workflow execution is launched. The **Workflow Average Complete Time** can further show how much time it costs to finish the whole workflow.
</details>
### Kubernetes APIServer
This dashboard shows the running status of all Kubernetes apiservers.
URL: http://localhost:8080/d/kubernetes-apiserver/kubernetes-apiserver
![kubernetes-apiserver](../../../resources/kubernetes-apiserver.jpg)
<details>
The Kubernetes APIServer dashboard helps you see the most fundamental part of your Kubernetes system. If your Kubernetes APIServer is not running healthily, all the controllers and modules in your Kubernetes system will work abnormally and be unable to handle requests successfully. So it is important to make sure everything is fine in this dashboard.
---
The **Requests** section includes a series of panels which show the QPS and latency for various kinds of requests. Usually your APIServer could fail to respond if it is flooded by too many requests. At that point, you can see which type of requests is causing trouble.
---
The **WorkQueue** section shows the process status of the Kubernetes APIServer. If the **Queue Size** is large, it means the number of requests is out of the process capability of your Kubernetes APIServer.
---
The **Watches** section shows the number of watches in your Kubernetes APIServer. Compared to other types of requests, WATCH requests will continuously consume computation resources in Kubernetes APIServer, so it will be helpful to keep the number of watches limited.
</details>
## Dashboard Customization
Besides the pre-defined dashboards provided by the `grafana` addon, KubeVela users can deploy customized dashboards to their system as well.
:::tip
If you do not know how to build Grafana dashboards and export them as json data, you can refer to the following Grafana docs for details.
1. [Build your first dashboard](https://grafana.com/docs/grafana/latest/getting-started/build-first-dashboard/)
2. [Exporting a dashboard](https://grafana.com/docs/grafana/latest/dashboards/export-import/#exporting-a-dashboard)
:::
### Using Dashboard as Component
One way to manage your customized dashboard is to use the component in KubeVela application like below.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-dashboard
spec:
components:
- name: my-dashboard
type: grafana-dashboard
properties:
uid: my-example-dashboard
data: |
{
"panels": [{
"gridPos": {
"h": 9,
"w": 12
},
"targets": [{
"datasource": {
"type": "prometheus",
"uid": "prometheus-vela"
},
"expr": "max(up) by (cluster)"
}],
"title": "Clusters",
"type": "timeseries"
}],
"title": "My Dashboard"
}
```
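After the application above is deployed, you can check the created dashboard through the aggregated Kubernetes API provided by vela-prism. Dashboard resources are named `<uid>@<grafana-name>`, so with the default Grafana registration the command would look like:

```shell
kubectl get grafanadashboard my-example-dashboard@default
```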
### Import Dashboard from URL
Sometimes, you might already have some Grafana dashboards stored in OSS or served by another HTTP server. To import these dashboards into your system, you can leverage the `import-grafana-dashboard` workflow step as below.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-dashboard
spec:
components: []
workflow:
steps:
- type: import-grafana-dashboard
name: import-grafana-dashboard
properties:
uid: my-dashboard
title: My Dashboard
url: https://kubevelacharts.oss-accelerate.aliyuncs.com/dashboards/up-cluster-dashboard.json
```
In the `import-grafana-dashboard` step, the application will download the dashboard json from the URL and create dashboards on Grafana in the correct format.
### Using CUE to Generate Dashboards Dynamically
With CUE, you can customize the process of creating dashboards. This empowers you to construct dashboards dynamically and combine the creation with other actions. For example, you can make a WorkflowStepDefinition called `create-dashboard`, which finds the service created by the application itself and gets the metrics from the exposed endpoint. Then, the step builds Grafana dashboard panels from those metrics automatically.
```cue
import (
"vela/op"
"vela/ql"
"strconv"
"math"
"regexp"
)
"create-dashboard": {
type: "workflow-step"
annotations: {}
labels: {}
description: "Create dashboard for application."
}
template: {
resources: ql.#CollectServiceEndpoints & {
app: {
name: context.name
namespace: context.namespace
filter: {}
}
} @step(1)
status: {
endpoints: *[] | [...{...}]
if resources.err == _|_ && resources.list != _|_ {
endpoints: [ for ep in resources.list if ep.endpoint.port == parameter.port {
name: "\(ep.ref.name):\(ep.ref.namespace):\(ep.cluster)"
portStr: strconv.FormatInt(ep.endpoint.port, 10)
if ep.cluster == "local" && ep.ref.kind == "Service" {
url: "http://\(ep.ref.name).\(ep.ref.namespace):\(portStr)"
}
if ep.cluster != "local" || ep.ref.kind != "Service" {
url: "http://\(ep.endpoint.host):\(portStr)"
}
}]
}
} @step(2)
getMetrics: op.#Steps & {
for ep in status.endpoints {
"\(ep.name)": op.#HTTPGet & {
url: ep.url + "/metrics"
}
}
} @step(3)
checkErrors: op.#Steps & {
for ep in status.endpoints if getMetrics["\(ep.name)"] != _|_ {
if getMetrics["\(ep.name)"].response.statusCode != 200 {
"\(ep.name)": op.#Steps & {
src: getMetrics["\(ep.name)"]
err: op.#Fail & {
message: "failed to get metrics for \(ep.name) from \(ep.url), code \(src.response.statusCode)"
}
}
}
}
} @step(4)
createDashboards: op.#Steps & {
for ep in status.endpoints if getMetrics["\(ep.name)"] != _|_ {
if getMetrics["\(ep.name)"].response.body != "" {
"\(ep.name)": dashboard & {
title: context.name
uid: "\(context.name)-\(context.namespace)"
description: "Auto-generated Dashboard"
metrics: *[] | [...{...}]
metrics: regexp.FindAllNamedSubmatch(#"""
# HELP \w+ (?P<desc>[^\n]+)\n# TYPE (?P<name>\w+) (?P<type>\w+)
"""#, getMetrics["\(ep.name)"].response.body, -1)
}
}
}
} @step(5)
applyDashboards: op.#Steps & {
for ep in status.endpoints if createDashboards["\(ep.name)"] != _|_ {
"\(ep.name)": op.#Apply & {
db: {for k, v in createDashboards["\(ep.name)"] if k != "metrics" {
"\(k)": v
}}
value: {
apiVersion: "o11y.prism.oam.dev/v1alpha1"
kind: "GrafanaDashboard"
metadata: name: "\(db.uid)@\(parameter.grafana)"
spec: db
}
}
}
} @step(6)
dashboard: {
title: *"Example Dashboard" | string
uid: *"" | string
description: *"" | string
metrics: [...{...}]
time: {
from: *"now-1h" | string
to: *"now" | string
}
refresh: *"30s" | string
templating: list: [{
type: "datasource"
name: "datasource"
label: "Data Source"
query: "prometheus"
hide: 2
}, {
type: "interval"
name: "rate_interval"
label: "Rate"
query: "3m,5m,10m,30m"
hide: 2
}]
panels: [for i, m in metrics {
title: m.name
type: "graph"
datasource: {
uid: "${datasource}"
type: "prometheus"
}
gridPos: {
w: 6
h: 8
x: math.Floor((i - y * 4) * 6)
y: math.Floor(i / 4)
}
description: m.desc
if m.type == "gauge" {
targets: [{
expr: "sum(\(m.name))"
}]
legend: show: false
}
if m.type == "counter" {
targets: [{
expr: "sum(rate(\(m.name)[$rate_interval]))"
}]
legend: show: false
}
if m.type == "histogram" || m.type == "summary" {
targets: [{
expr: "sum(rate(\(m.name)_sum[$rate_interval])) / sum(rate(\(m.name)_count[$rate_interval]))"
legendFormat: "avg"
}, {
expr: "histogram_quantile(0.75, sum(rate(\(m.name)_bucket[$rate_interval])) by (le))"
legendFormat: "p75"
}, {
expr: "histogram_quantile(0.99, sum(rate(\(m.name)_bucket[$rate_interval])) by (le))"
legendFormat: "p99"
}]
}
}]
}
parameter: {
port: *8080 | int
grafana: *"default" | string
}
}
```
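To register the definition above in your system, save it to a file and apply it with the vela CLI (the file name `create-dashboard.cue` is just an example):

```shell
vela def apply create-dashboard.cue
```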
Then you can create an application as follows.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-app
spec:
# the core workload
components:
- name: my-app
type: webservice
properties:
image: somefive/prometheus-client-example:new
traits:
- type: prometheus-scrape
# deploy and create dashboard automatically
workflow:
steps:
- type: deploy
name: deploy
properties:
policies: []
- type: create-dashboard
name: create-dashboard
```
This application will deploy your webservice first, and then generate a dashboard automatically according to the metrics collected from the webservice.
@ -6,382 +6,29 @@ title: Automated Observability
To help users build their own observability system, KubeVela provides several addons, including:
- prometheus-server: A service that records metrics in time series, with support for flexible queries.
- kube-state-metrics: A metrics collector for the Kubernetes system.
- node-exporter: A metrics collector for the running Kubernetes nodes.
- grafana: A web application that provides analytics and interactive visualization.
**Metrics**
- `prometheus-server`: A service that records metrics in time series, with support for flexible queries.
- `kube-state-metrics`: A metrics collector for the Kubernetes system.
- `node-exporter`: A metrics collector for the running Kubernetes nodes.
More addons for logging and tracing will be introduced in future versions.
**Logging**
- `loki`: A log server that stores collected logs and provides query services.
## Prerequisites
**Dashboards**
- `grafana`: A web application that provides analytics and interactive visualization.
1. The observability suite includes several addons which require some computation resources to work properly. The recommended installation resources for your cluster are 2 cores + 4 Gi memory.
More addons for alerting and tracing will be introduced in future versions.
2. The required KubeVela version (both server-side controller and client-side CLI) for installation is **no earlier than** v1.5.0-beta.4.
## Advanced Guides
## Quick Start
- [**Installation**](./o11y/installation): A guide on how to deploy the observability infrastructure in your KubeVela system.
To enable the addon suite, simply run the `vela addon enable` commands as below.
- [**Out of the Box**](./o11y/out-of-the-box): An introduction to the system and application monitoring capabilities available by default in the KubeVela observability infrastructure.
:::tip
If you are running KubeVela in a multi-cluster scenario, refer to the [Multi-Cluster Installation](#multi-cluster-installation) section below.
:::
- [**Metrics**](./o11y/metrics): A guide on how to customize the metrics collection process for your applications.
1. Install the kube-state-metrics addon
- [**Logging**](./o11y/logging): A guide on how to customize the log collection rules for your applications.
```shell
vela addon enable kube-state-metrics
```
- [**Dashboards**](./o11y/dashboard): A guide on how to configure customized dashboards for your applications.
2. Install the node-exporter addon
```shell
vela addon enable node-exporter
```
3. Install the prometheus-server addon
```shell
vela addon enable prometheus-server
```
4. Install the grafana addon
```shell
vela addon enable grafana
```
5. Access your grafana through port-forward.
```shell
kubectl port-forward svc/grafana -n o11y-system 8080:3000
```
Now you can access your grafana by visiting `http://localhost:8080` in your browser. The default username and password are `admin` and `kubevela` respectively.
## Built-in Dashboards
There are four built-in dashboards for browsing and viewing your system.
### KubeVela Application
This dashboard shows the basic information of an application.
URL: http://localhost:8080/d/application-overview/kubevela-applications
![kubevela-application-dashboard](../../resources/kubevela-application-dashboard.jpg)
<details>
The KubeVela Application dashboard shows the overview of the application metadata. It directly accesses the Kubernetes API to retrieve the runtime application information, and you can use it as an entrance.
---
The **Basic Information** section extracts key information into panels and gives you the most straightforward view of the current application.
---
The **Related Resource** section shows the resources that work together with the application itself, including the managed resources, the recorded ResourceTrackers and the revisions.
</details>
### Kubernetes Deployment
This dashboard shows the overview of native deployments. You can view information of deployments across clusters.
URL: http://localhost:8080/d/deployment-overview/kubernetes-deployment
![kubernetes-deployment-dashboard](../../resources/kubernetes-deployment-dashboard.jpg)
<details>
The Kubernetes Deployment dashboard gives you the detailed running status of a deployment.
---
The **Pods** panel shows the pods currently managed by the deployment.
---
The **Replicas** panel shows how the number of replicas changes, which can be used to diagnose when and how your deployment shifted to an undesired state.
---
The **Pod** section contains the detailed resource usage (including CPU / Memory / Network / Storage), which can be used to identify whether the pods are facing resource pressure or making/receiving unexpected traffic.
</details>
### KubeVela System
This dashboard shows the overview of the KubeVela system. It can be used to see whether the KubeVela controller is healthy.
URL: http://localhost:8080/d/kubevela-system/kubevela-system
![kubevela-system](../../resources/kubevela-system.jpg)
<details>
The KubeVela System dashboard gives you the running details of the KubeVela core modules, including the controller and the cluster-gateway. Other modules like velaux or prism are expected to be added in the future.
---
The **Computation Resource** section shows the usage of the core modules. It can be used to track whether there is a memory leak (if the memory usage keeps increasing) or high pressure (if the CPU usage is always very high). If the memory usage hits the resource limit, the corresponding module will be killed and restarted, which indicates insufficient computation resources. You should add more CPU/Memory for them.
---
The **Controller** section includes various panels that can help diagnose the bottleneck of your KubeVela controller.
The **Controller Queue** and **Controller Queue Add Rate** panels show the changes of the controller working queue. If the controller queue keeps increasing, it means there are too many applications or application changes in the system and the controller can no longer handle them in time, which indicates a performance issue for the KubeVela controller. A temporary increase of the controller queue is tolerable, but lasting for a long time will lead to increased memory usage and finally cause Out-Of-Memory problems.
The **Reconcile Rate** and **Average Reconcile Time** panels give you the overview of the controller status. If the reconcile rate is steady and the average reconcile time is reasonable (e.g. under 500ms, depending on your scenario), your KubeVela controller is healthy. If the controller queue add rate is increasing but the reconcile rate does not go up, the controller queue will gradually grow and cause trouble. There are several cases in which your controller is unhealthy:
1. Reconciliation is healthy but there are too many applications. You will find everything is okay except that the controller queue metrics keep increasing. Check the CPU/Memory usage of the controller. You might need to add more computation resources.
2. Reconciliation is unhealthy due to too many errors. You will find lots of errors in the **Reconcile Rate** panel. This means your system is continuously facing errors while processing applications, which could be caused by invalid application configurations or unexpected errors while running workflows. Check the application details and see which applications are causing the errors.
3. Reconciliation is unhealthy due to long reconcile times. You need to check the **ApplicationController Reconcile Time** panel to see whether it is a common case (the average reconcile time is high) or only part of your applications have problems (the p95 reconcile time is high). For the former case, it is usually caused by either insufficient CPU (high CPU usage) or too many requests being rate-limited by kube-apiserver (check the **ApplicationController Client Request Throughput** and **ApplicationController Client Request Average Time** panels and see which resource requests are slow or excessive). For the latter case, you need to check which application is large and spends lots of time on reconciliation.
Sometimes you might need to refer to the **ApplicationController Reconcile Stage Time** panel to see whether some special reconcile stage is abnormal. For example, GCResourceTracker taking lots of time means resource recycling might be blocked in the KubeVela system.
---
The **Application** section shows the overview of the applications in your whole KubeVela system. It can be used to see the changes of the application numbers and the workflow steps in use. **Workflow Initialize Rate** is an auxiliary panel that shows how frequently new workflow executions are launched. **Workflow Average Complete Time** further shows how long it takes to finish the whole workflow.
</details>
### Kubernetes APIServer
This dashboard shows the running status of all Kubernetes apiservers.
URL: http://localhost:8080/d/kubernetes-apiserver/kubernetes-apiserver
![kubernetes-apiserver](../../resources/kubernetes-apiserver.jpg)
<details>
The Kubernetes APIServer dashboard helps you see the most fundamental part of your Kubernetes system. If your Kubernetes APIServer is not running healthily, all the controllers and modules in your Kubernetes system will work abnormally and be unable to handle requests successfully. So make sure everything is fine in this dashboard.
---
The **Requests** section includes a series of panels showing the QPS and latency of various kinds of requests. Usually the APIServer could fail to respond if it is flooded by too many requests. At that point, you can see which type of requests is causing trouble.
---
The **WorkQueue** section shows the processing status of the Kubernetes APIServer. If the **Queue Size** is large, the number of requests has exceeded the processing capability of your Kubernetes APIServer.
---
The **Watches** section shows the number of watches in your Kubernetes APIServer. Compared to other types of requests, WATCH requests continuously consume computation resources in the APIServer, so it helps to keep the number of watches limited.
</details>
## Customization
The above installation process can be customized in several ways.
### Multi-Cluster Installation
If you want to install the observability addons in a multi-cluster scenario, make sure your Kubernetes clusters support LoadBalancer services and are mutually accessible.
By default, the installation of `kube-state-metrics`, `node-exporter` and `prometheus-server` natively supports multi-cluster (they will be automatically installed to all clusters). But to let `grafana` on the control plane access the prometheus-server in managed clusters, you need to enable `prometheus-server` with the following command:
```shell
vela addon enable prometheus-server thanos=true serviceType=LoadBalancer
```
This will install a [thanos](https://github.com/thanos-io/thanos) sidecar along with prometheus-server. Then enable grafana, and you will be able to see aggregated prometheus metrics.
You can also choose in which clusters to install the addons with the following command:
```shell
vela addon enable kube-state-metrics clusters=\{local,c2\}
```
> If new clusters are added to the control plane after the addons are installed, you need to re-enable the addons to let them take effect.
### Customized Prometheus Configuration
If you want to customize the installation of prometheus-server, you can put your configuration into an individual ConfigMap, like `my-prom` in namespace o11y-system. To distribute your custom config to all clusters, you can also use a KubeVela Application to do the job.
#### Recording Rules
For example, if you want to add some recording rules to all your prometheus server configurations in all clusters, you can first create an Application to distribute your recording rules as below.
```yaml
# my-prom.yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-prom
namespace: o11y-system
spec:
components:
- type: k8s-objects
name: my-prom
properties:
objects:
- apiVersion: v1
kind: ConfigMap
metadata:
name: my-prom
namespace: o11y-system
data:
my-recording-rules.yml: |
groups:
- name: example
rules:
- record: apiserver:requests:rate5m
expr: sum(rate(apiserver_request_total{job="kubernetes-nodes"}[5m]))
policies:
- type: topology
name: topology
properties:
clusterLabelSelector: {}
```
Then you need to add the `customConfig` parameter while enabling the prometheus-server addon, like:
```shell
vela addon enable prometheus-server thanos=true serviceType=LoadBalancer storage=1G customConfig=my-prom
```
Then you will see the recording rules distributed to all prometheus servers.
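You can verify the rule on one of the prometheus instances, for example by port-forwarding and querying the new recording rule (the service name `prometheus-server` in namespace `o11y-system` is the addon default; adjust if yours differs):

```shell
kubectl port-forward svc/prometheus-server -n o11y-system 9090:9090
curl 'http://localhost:9090/api/v1/query?query=apiserver:requests:rate5m'
```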
#### Alerting Rules and Other Configurations
To customize other configurations like alerting rules, the process is the same as the recording rules example shown above. You only need to change/add prometheus configurations in the application.
```yaml
data:
my-alerting-rules.yml: |
groups:
- name: example
rules:
- alert: HighApplicationQueueDepth
expr: sum(workqueue_depth{app_kubernetes_io_name="vela-core",name="application"}) > 100
for: 10m
annotations:
summary: High Application Queue Depth
```
![prometheus-rules-config](../../resources/prometheus-rules-config.jpg)
### Customized Grafana Credential
If you want to change the default username and password of Grafana, you can run the following command:
```shell
vela addon enable grafana adminUser=super-user adminPassword=PASSWORD
```
This will change your default admin user to `super-user` and its password to `PASSWORD`.
### Customized Storage
If you want prometheus-server and grafana to persist data in volumes, you can specify the `storage` parameter while installing, like:
```shell
vela addon enable prometheus-server storage=1G
```
This will create PersistentVolumeClaims and let the addons use the provided storage. The storage will not be automatically recycled even if the addons are disabled. You need to clean up the storage manually.
## Integration with Existing Prometheus and Grafana
Sometimes, you might already have Prometheus and Grafana instances. They might be built by other tools, or come from cloud providers. Follow the guide below to integrate with those existing systems.
### Integrate Prometheus
If you already have an external prometheus service and want to connect it to the Grafana created by the vela addon, you can use a KubeVela Application to create a GrafanaDatasource that registers this external prometheus service.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: register-prometheus
spec:
components:
- type: grafana-datasource
name: my-prometheus
properties:
access: proxy
basicAuth: false
isDefault: false
name: MyPrometheus
readOnly: true
withCredentials: true
jsonData:
httpHeaderName1: Authorization
tlsSkipVerify: true
secureJsonFields:
httpHeaderValue1: <token of your prometheus access>
type: prometheus
url: <my-prometheus url>
```
例如如果你在阿里云ARMS上使用 Prometheus 服务,你可以进入 Prometheus 设置页面,找到访问的 url 和 token。
![arms-prometheus](../../resources/arms-prometheus.jpg)
> You need to ensure your grafana access is already available. You can run `kubectl get grafana default` and see if it exists.
### Integrate Grafana
If you already have an existing Grafana, similar to the Prometheus integration, you can register the Grafana access information through a KubeVela Application.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: register-grafana
spec:
components:
- type: grafana-access
name: my-grafana
properties:
name: my-grafana
endpoint: <my-grafana url>
token: <access token>
```
To get the Grafana access credential, you can go to your Grafana instance and configure API keys.
![grafana-apikeys](../../resources/grafana-apikeys.jpg)
Then copy the token into your grafana registration configuration.
![grafana-added-apikeys](../../resources/grafana-added-apikeys.jpg)
After the application is successfully dispatched, you can check the registration by running the following command.
```shell
> kubectl get grafana
NAME ENDPOINT CREDENTIAL_TYPE
default http://grafana.o11y-system:3000 BasicAuth
my-grafana https://grafana-rngwzwnsuvl4s9p66m.grafana.aliyuncs.com:80/ BearerToken
```
Now you can manage dashboards and datasources on your grafana instance through the native Kubernetes API as well.
```shell
# show all the dashboards you have
kubectl get grafanadashboard -l grafana=my-grafana
# show all the datasources you have
kubectl get grafanadatasource -l grafana=my-grafana
```
For more details, you can refer to [vela-prism](https://github.com/kubevela/prism#grafana-related-apis).
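For example, you can inspect a dashboard through the aggregated API just like any other Kubernetes resource (the dashboard name below is hypothetical):

```shell
kubectl get grafanadashboard my-dashboard@my-grafana -o yaml
```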
### Integrate Other Tools and Systems
There is a wide range of community tools and eco-systems that users can leverage to build their own observability systems, such as prometheus-operator or DataDog. By far, KubeVela does not have best practices for those integrations. We may integrate those popular projects through KubeVela addons in the future. Community contributions for broader explorations and more connections are welcome.
## Comparisons
### Compared with Helm
Although you can install these resources into your Kubernetes system through Helm, one of the major benefits of installing them through KubeVela addons is that multi-cluster delivery is natively supported: once you add a managed cluster to the KubeVela control plane, you can install, upgrade or uninstall these addons in all clusters with a single command.
### Compared with the Legacy Observability Addon
The legacy [KubeVela observability addon](https://github.com/kubevela/catalog/tree/master/experimental/addons/observability) installs prometheus, grafana and some other components as one monolith. The latest observability addon suite (since KubeVela v1.5.0) splits it into multiple pieces, allowing users to install only part of them.
Besides, the legacy observability addon relies on the Fluxcd addon to install components as Helm Releases. The latest version uses the native webservice component in KubeVela, which is more flexible for customization.
## Future Plans
KubeVela will integrate more observability addons in the future, such as logging and tracing addons. Community operators like [prometheus-operator](https://github.com/prometheus-operator/prometheus-operator) also provide alternative ways to manage observability applications, which are intended to be included in KubeVela addons as well. More integrations through the KubeVela addon ecosystem are welcome.
- [**Integration**](./o11y/integration): A guide on how to integrate existing monitoring systems into KubeVela. If Prometheus or Grafana services already exist in your system (self-hosted or provided by a cloud vendor), you can read this doc to learn the integration approach.
@ -0,0 +1,91 @@
---
title: Installation
---
:::tip
If you are running KubeVela in a multi-cluster scenario, refer to the [Multi-Cluster Installation](#multi-cluster-installation) section below.
:::
## Quick Start
To enable the addon suite, simply run the `vela addon enable` commands as below.
1. Install the kube-state-metrics addon
```shell
vela addon enable kube-state-metrics
```
2. Install the node-exporter addon
```shell
vela addon enable node-exporter
```
3. Install the prometheus-server addon
```shell
vela addon enable prometheus-server
```
4. Install the loki addon
```shell
vela addon enable loki
```
5. Install the grafana addon
```shell
vela addon enable grafana
```
6. Access your grafana through port-forward.
```shell
kubectl port-forward svc/grafana -n o11y-system 8080:3000
```
Now you can access your grafana by visiting `http://localhost:8080` in your browser. The default username and password are `admin` and `kubevela` respectively.
> You can change the default Grafana username and password by adding the `adminUser=super-user adminPassword=PASSWORD` parameters in step 5.
After the installation, you will see several pre-installed dashboards on Grafana which help you view the running status of the whole system and each application. Refer to the [Dashboards](./dashboard) section for details of these pre-installed dashboards.
![kubevela-application-dashboard](../../../resources/kubevela-application-dashboard.png)
:::caution
**Resources**: The observability suite includes several addons which require some computation resources to work properly. The recommended installation resources for your cluster are 2 cores + 4 Gi memory.
**Version**: The required KubeVela version (both server-side controller and client-side CLI) for the installation is **no earlier than** v1.5.0-beta.4.
:::
:::tip
**Observability addon suite**: If you want to install all the observability addons with one single command, you can use [WorkflowRun](https://github.com/kubevela/workflow) to orchestrate the installation processes. It helps you codify the complex installation process and reuse it across systems.
:::
## Multi-Cluster Installation
If you want to install the observability addons in a multi-cluster scenario, make sure your Kubernetes clusters support LoadBalancer services and are mutually accessible.
By default, the installation of `kube-state-metrics`, `node-exporter` and `prometheus-server` natively supports multi-cluster (they will be automatically installed to all clusters). But to let `grafana` on the control plane access the prometheus-server in managed clusters, you need to enable `prometheus-server` with the following command:
```shell
vela addon enable prometheus-server thanos=true serviceType=LoadBalancer
```
This will install a [thanos](https://github.com/thanos-io/thanos) sidecar along with prometheus-server. Then enable grafana, and you will be able to see aggregated prometheus metrics.
You can also choose in which clusters to install the addons with the following command:
```shell
vela addon enable kube-state-metrics clusters=\{local,c2\}
```
For the `loki` addon, by default the log storage is centralized on the control plane, while the installation of the log collector ([promtail](https://grafana.com/docs/loki/latest/clients/promtail/) or [vector](https://vector.dev/)) supports multi-cluster. You can run the following command to install log collectors in multiple clusters and let them store the collected logs in the Loki service in the `local` cluster.
```shell
vela addon enable loki agent=vector serviceType=LoadBalancer
```
> If new clusters are added to the control plane after the addons are installed, you need to re-enable the addon to let it take effect.
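After enabling, you can check the delivery status of the addon across clusters with the vela CLI (addon applications are installed in the `vela-system` namespace by convention):

```shell
vela status addon-loki -n vela-system
```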
@ -2,11 +2,11 @@
title: External System Integration
---
Sometimes, you might already have Prometheus & Grafana instances. They might be built by other tools, or come from cloud providers. Follow the below guide to integrate with existing systems.
## Integrate Prometheus
## 集成 Prometheus
If you already have external prometheus service and you want to connect it to Grafana (established by vela addon), you can create a GrafanaDatasource to register it through KubeVela application.
```yaml
apiVersion: core.oam.dev/v1beta1
@ -33,15 +33,15 @@ spec:
url: <my-prometheus url>
```
For example, if you are using the Prometheus service on Alibaba Cloud (ARMS), you can go to the Prometheus setting page and find the access url & access token.
例如如果你在阿里云ARMS上使用 Prometheus 服务,你可以进入 Prometheus 设置页面,找到访问的 url 和 token。
![arms-prometheus](../../../resources/arms-prometheus.jpg)
> You need to ensure your grafana access is already available. You can run `kubectl get grafana default` and see if it exists.
## Integrate Grafana
If you already have an existing Grafana, similar to the Prometheus integration, you can create the Grafana access through a KubeVela application.
```yaml
apiVersion: core.oam.dev/v1beta1
token: <access token>
```
To get your grafana access, you can go into your Grafana instance and configure API keys.
![grafana-apikeys](../../../resources/grafana-apikeys.jpg)
Then copy the token into your grafana registration configuration.
![grafana-added-apikeys](../../../resources/grafana-added-apikeys.jpg)
After the application is successfully dispatched, you can check the registration by running the following command.
```shell
> kubectl get grafana
NAME ENDPOINT CREDENTIAL_TYPE
default http://grafana.o11y-system:3000 BasicAuth
my-grafana https://grafana-rngwzwnsuvl4s9p66m.grafana.aliyuncs.com:80/ BearerToken
```
Now you can manage the dashboards and datasources on your grafana instance through the native Kubernetes API as well.
```shell
# show all the dashboards you have
kubectl get grafanadashboard -l grafana=my-grafana
```
```shell
# show all the datasources you have
kubectl get grafanadatasource -l grafana=my-grafana
```
For more details, you can refer to [vela-prism](https://github.com/kubevela/prism#grafana-related-apis).
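For instance, a dashboard on the registered instance can be declared directly as a Kubernetes object through the aggregated API that vela-prism exposes. The sketch below is an illustration only: `my-grafana` stands for the Grafana name registered above, and the object name encodes `<dashboard uid>@<grafana name>`, following the convention used by the `create-dashboard` example elsewhere in these docs.

```yaml
# Sketch only: manage a Grafana dashboard via the vela-prism aggregated API.
# "my-dashboard@my-grafana" means dashboard uid "my-dashboard" on the
# Grafana instance registered as "my-grafana".
apiVersion: o11y.prism.oam.dev/v1alpha1
kind: GrafanaDashboard
metadata:
  name: my-dashboard@my-grafana
spec:
  title: My Dashboard
  panels: []
```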
## Integrate via Configuration Management

Besides the direct integration methods above, another approach is to use the configuration management capability in KubeVela to connect existing Prometheus or Grafana instances to the KubeVela system. See the configuration management chapter for details.

## Integrate Other Tools or Systems

There are a wide range of community tools or eco-systems that users can leverage for building their observability systems, such as prometheus-operator or DataDog. So far, KubeVela does not have established best practices for those integrations. We may integrate with those popular projects through KubeVela addons in the future. Community contributions for broader explorations and more connections are also welcome.


```
In this example, the `combined` logs emitted by nginx are converted into JSON format, and a `new_field` key is added to each log entry, with the JSON value set to `new value`. For how to write vector VRL, refer to the [documentation](https://vector.dev/docs/reference/vrl/).
If you have built a dedicated log-analysis dashboard for logs processed this way, you can import it into grafana with any of the three methods described in the [documentation](./dashboard).
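As one illustration of such an import, the `import-grafana-dashboard` workflow step can pull a dashboard JSON from a URL and create it on Grafana. The URL and names below are placeholders, not real artifacts:

```yaml
# Sketch: import a pre-built log-analysis dashboard from an HTTP location.
# Replace the url with wherever your dashboard JSON is actually served.
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: nginx-log-dashboard
spec:
  components: []
  workflow:
    steps:
      - type: import-grafana-dashboard
        name: import-grafana-dashboard
        properties:
          uid: nginx-log-dashboard
          title: Nginx Log Dashboard
          url: https://example.com/dashboards/nginx-log-dashboard.json
```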
## Application File Logs


---
title: Metrics
---
## Exposing Application Metrics

In your application, if you want to expose the metrics of a component (such as a webservice) to Prometheus so that they can be scraped by the metrics collector, you only need to add the `prometheus-scrape` trait to it.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-app
spec:
  components:
    - name: my-app
      type: webservice
      properties:
        image: somefive/prometheus-client-example:new
      traits:
        - type: prometheus-scrape
```
You can also explicitly specify the port and the path to expose the metrics.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-app
spec:
  components:
    - name: my-app
      type: webservice
      properties:
        image: somefive/prometheus-client-example:new
      traits:
        - type: prometheus-scrape
          properties:
            port: 8080
            path: /metrics
```
The configuration above will let Prometheus collect the metrics of your application components. If you want to see these metrics on Grafana, you need to create a corresponding dashboard on Grafana. See the [Dashboard](./dashboard) section for the following steps.
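Such a dashboard can also be delivered as a `grafana-dashboard` component. Below is a minimal sketch: the datasource uid `prometheus-vela` follows the convention used elsewhere in these docs, while the metric name in `expr` is hypothetical — substitute one that your component actually exposes.

```yaml
# Sketch: a dashboard with a single panel charting a scraped metric.
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-app-dashboard
spec:
  components:
    - name: my-app-dashboard
      type: grafana-dashboard
      properties:
        uid: my-app-dashboard
        data: |
          {
            "panels": [{
              "gridPos": {"h": 9, "w": 12},
              "targets": [{
                "datasource": {"type": "prometheus", "uid": "prometheus-vela"},
                "expr": "sum(rate(promhttp_metric_handler_requests_total[5m]))"
              }],
              "title": "Request Rate",
              "type": "timeseries"
            }],
            "title": "My App Dashboard"
          }
```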
## Customized Prometheus Installation

If you want to make customizations to your prometheus-server installation, you can put your configuration into an individual ConfigMap, like `my-prom` in the namespace o11y-system. To distribute your custom config to all clusters, you can also use a KubeVela Application to do the job.

### Recording Rules

For example, if you want to add some recording rules to all your prometheus-server configurations in all clusters, you can first create an Application to distribute your recording rules, as below.
```yaml
# my-prom.yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-prom
  namespace: o11y-system
spec:
  components:
    - type: k8s-objects
      name: my-prom
      properties:
        objects:
          - apiVersion: v1
            kind: ConfigMap
            metadata:
              name: my-prom
              namespace: o11y-system
            data:
              my-recording-rules.yml: |
                groups:
                  - name: example
                    rules:
                      - record: apiserver:requests:rate5m
                        expr: sum(rate(apiserver_request_total{job="kubernetes-nodes"}[5m]))
  policies:
    - type: topology
      name: topology
      properties:
        clusterLabelSelector: {}
```
Then you need to add the `customConfig` parameter to the enabling process of the prometheus-server addon, like
```shell
vela addon enable prometheus-server thanos=true serviceType=LoadBalancer storage=1G customConfig=my-prom
```
Then you will be able to see the recording rules configuration being delivered into all prometheus instances.
### Alerting Rules & Other Configurations
To make customizations to other configurations such as alerting rules, the process is the same as the recording rules example shown above. You only need to change/add prometheus configurations in the application.
```yaml
data:
  my-alerting-rules.yml: |
    groups:
      - name: example
        rules:
          - alert: HighApplicationQueueDepth
            expr: sum(workqueue_depth{app_kubernetes_io_name="vela-core",name="application"}) > 100
            for: 10m
            annotations:
              summary: High Application Queue Depth
```
![prometheus-rules-config](../../../resources/prometheus-rules-config.jpg)
### Customized Grafana Credentials

If you want to change the default username and password of Grafana, you can run the following command:
```shell
vela addon enable grafana adminUser=super-user adminPassword=PASSWORD
```
This will change your default admin user to `super-user` and its password to `PASSWORD`.
### Customized Storage

If you want prometheus-server and grafana to persist data in volumes, you can also specify the `storage` parameter at installation time, like
```shell
vela addon enable prometheus-server storage=1G
```
This will create PersistentVolumeClaims and let the addon use the provided storage. The storage will not be automatically recycled even if the addon is disabled. You need to clean up the storage manually.


---
title: Out of the Box
---
## Dashboards

There are four automated dashboards for browsing and viewing your system.

### KubeVela Application

This dashboard shows the basic information of an application.
URL: http://localhost:8080/d/application-overview/kubevela-applications
![kubevela-application-dashboard](../../../resources/kubevela-application-dashboard.png)
:::info
The KubeVela Application dashboard shows the overview of the application's metadata. It directly accesses the Kubernetes API to retrieve the runtime application information, so you can use it as an entrance.

The **Basic Information** section extracts key information into panels and gives you the most straightforward view of the current application.

The **Related Resource** section shows the resources that work together with the application itself, including the managed resources, the recorded ResourceTrackers and the revisions.
:::
### Kubernetes Deployment

This dashboard shows the overview of native deployments. You can navigate the deployments across clusters.
URL: http://localhost:8080/d/deployment-overview/kubernetes-deployment
![kubernetes-deployment-dashboard](../../../resources/kubernetes-deployment-dashboard.jpg)
:::info
The Kubernetes Deployment dashboard gives you the detailed running status of a deployment.

The **Pods** panel shows the pods that the deployment is currently managing.

The **Replicas** panel shows how the number of replicas changes, which can be used to diagnose when and how your deployment shifted to an undesired state.

The **Pod** section contains the details of resource usage (including CPU / memory / network / storage), which can be used to identify whether the pods are facing resource pressure or making/receiving unexpected traffic.
:::
### KubeVela System

This dashboard shows the overview of the KubeVela system. It can be used to see whether the KubeVela controller is healthy.
URL: http://localhost:8080/d/kubevela-system/kubevela-system
![kubevela-system](../../../resources/kubevela-system.jpg)
:::info
The KubeVela System dashboard gives you the running details of the KubeVela core modules, including the controller and the cluster-gateway. Other modules such as velaux or prism are expected to be added in the future.

The **Computation Resource** section shows the usage of the core modules. It can be used to track whether there is a memory leak (if the memory usage keeps increasing) or high pressure (the CPU usage is always very high). If the memory usage hits the resource limit, the corresponding module will be killed and restarted, which indicates insufficient computation resources. You should add more CPU/memory for them.

The **Controller** section includes a wide range of panels that can help diagnose the bottleneck of your KubeVela controller.

The **Controller Queue** and **Controller Queue Add Rate** panels show the changes of the controller working queue. If the controller queue keeps increasing, there are too many applications or application changes in the system, and the controller is unable to handle them in time, which indicates a performance issue in the KubeVela controller. A temporary increase of the controller queue is tolerable, but if it lasts for a long time, memory usage will grow and finally cause out-of-memory problems.

The **Reconcile Rate** and **Average Reconcile Time** panels give you an overview of the controller status. If the reconcile rate is steady and the average reconcile time is reasonable (e.g. under 500ms, depending on your scenario), your KubeVela controller is healthy. If the controller queue add rate is increasing but the reconcile rate does not go up, the controller queue will gradually grow and cause trouble. There are several cases where your controller is unhealthy:

1. Reconciles are healthy but there are too many applications: you will find everything fine except the controller queue metrics increasing. Check the CPU/memory usage of the controller. You might need to add more computation resources.
2. Reconciles are unhealthy due to too many errors: you will find lots of errors in the **Reconcile Rate** panel. This means your system is continuously facing errors while processing applications, which could be caused by invalid application configurations or by unexpected errors while running workflows. Check the application details and see which applications are causing the errors.
3. Reconciles are unhealthy due to long reconcile times: you need to check the **ApplicationController Reconcile Time** panel and see whether it is a common case (the average reconcile time is high) or only part of your applications have problems (the p95 reconcile time is high). For the former, it is usually caused by either insufficient CPU (high CPU usage) or too many requests being rate-limited by kube-apiserver (check the **ApplicationController Client Request Throughput** and **ApplicationController Client Request Average Time** panels and see which resource requests are slow or excessive). For the latter, you need to check which application is large and takes lots of time to reconcile.

Sometimes you might need to refer to **ApplicationController Reconcile Stage Time** and see whether some particular reconcile stage is abnormal. For example, GCResourceTrackers taking lots of time means there might be blockings for recycling resources in the KubeVela system.

The **Application** section shows the overview of the applications in your whole KubeVela system. It can be used to see the changes of the application numbers and the used workflow steps. **Workflow Initialize Rate** is an auxiliary panel that can be used to see how frequently new workflow executions are launched. **Workflow Average Complete Time** further shows how long it takes to finish the whole workflow.
:::
### Kubernetes APIServer

This dashboard shows the running status of all Kubernetes apiservers.
URL: http://localhost:8080/d/kubernetes-apiserver/kubernetes-apiserver
![kubernetes-apiserver](../../../resources/kubernetes-apiserver.jpg)
:::info
The Kubernetes APIServer dashboard helps you see the most fundamental part of your Kubernetes system. If your Kubernetes APIServer does not run healthily, all the controllers and modules in your Kubernetes system will be abnormal and unable to handle requests successfully. So it is important to make sure everything is fine in this dashboard.

The **Requests** section includes a series of panels showing the QPS and latency of various kinds of requests. Usually, the APIServer may fail to respond if it is flooded by too many requests. At that point, you can see which type of requests is causing trouble.

The **WorkQueue** section shows the processing status of the Kubernetes APIServer. If the **Queue Size** is large, the number of requests exceeds the processing capability of your Kubernetes APIServer.

The **Watches** section shows the number of watches in your Kubernetes APIServer. Compared to other types of requests, WATCH requests continuously consume computation resources in the APIServer, so keeping the number of watches limited will be helpful.
:::


---
title: Visualization
---
Visualization is one of the methods to present the observability information.
For example, metrics can be plotted into different types of graphs depending on their categories and logs can be filtered and listed.
In KubeVela, leveraging the power of Kubernetes Aggregated API layer, it is easy for users to manipulate dashboards on Grafana and make customizations to application visualizations.
## Dashboard Customization
Besides the pre-defined dashboards provided by the `grafana` addon, KubeVela users can deploy customized dashboards to their system as well.
:::tip
If you do not know how to build Grafana dashboards and export them as json data, you can refer to the following Grafana docs for details.
1. [Build your first dashboard](https://grafana.com/docs/grafana/latest/getting-started/build-first-dashboard/)
2. [Exporting a dashboard](https://grafana.com/docs/grafana/latest/dashboards/export-import/#exporting-a-dashboard)
:::
### Using Dashboard as Component
One way to manage your customized dashboards is to use the `grafana-dashboard` component in a KubeVela application, as below.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-dashboard
spec:
  components:
    - name: my-dashboard
      type: grafana-dashboard
      properties:
        uid: my-example-dashboard
        data: |
          {
            "panels": [{
              "gridPos": {
                "h": 9,
                "w": 12
              },
              "targets": [{
                "datasource": {
                  "type": "prometheus",
                  "uid": "prometheus-vela"
                },
                "expr": "max(up) by (cluster)"
              }],
              "title": "Clusters",
              "type": "timeseries"
            }],
            "title": "My Dashboard"
          }
```
### Import Dashboard from URL
Sometimes, you might already have some Grafana dashboards stored in OSS or served by another HTTP server. To import these dashboards into your system, you can leverage the `import-grafana-dashboard` workflow step as below.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-dashboard
spec:
  components: []
  workflow:
    steps:
      - type: import-grafana-dashboard
        name: import-grafana-dashboard
        properties:
          uid: my-dashboard
          title: My Dashboard
          url: https://kubevelacharts.oss-accelerate.aliyuncs.com/dashboards/up-cluster-dashboard.json
```
In the `import-grafana-dashboard` step, the application will download the dashboard json from the URL and create dashboards on Grafana with correct format.
### Using CUE to Generate Dashboards Dynamically
With CUE, you can customize the process of creating dashboards. This empowers you to construct dashboards dynamically and combine the construction with other actions. For example, you can make a WorkflowStepDefinition called `create-dashboard`, which finds the service created by the application itself and gets the metrics from the exposed endpoint. Then the step builds Grafana dashboard panels from those metrics automatically.
```cue
import (
"vela/op"
"vela/ql"
"strconv"
"math"
"regexp"
)
"create-dashboard": {
type: "workflow-step"
annotations: {}
labels: {}
description: "Create dashboard for application."
}
template: {
resources: ql.#CollectServiceEndpoints & {
app: {
name: context.name
namespace: context.namespace
filter: {}
}
} @step(1)
status: {
endpoints: *[] | [...{...}]
if resources.err == _|_ && resources.list != _|_ {
endpoints: [ for ep in resources.list if ep.endpoint.port == parameter.port {
name: "\(ep.ref.name):\(ep.ref.namespace):\(ep.cluster)"
portStr: strconv.FormatInt(ep.endpoint.port, 10)
if ep.cluster == "local" && ep.ref.kind == "Service" {
url: "http://\(ep.ref.name).\(ep.ref.namespace):\(portStr)"
}
if ep.cluster != "local" || ep.ref.kind != "Service" {
url: "http://\(ep.endpoint.host):\(portStr)"
}
}]
}
} @step(2)
getMetrics: op.#Steps & {
for ep in status.endpoints {
"\(ep.name)": op.#HTTPGet & {
url: ep.url + "/metrics"
}
}
} @step(3)
checkErrors: op.#Steps & {
for ep in status.endpoints if getMetrics["\(ep.name)"] != _|_ {
if getMetrics["\(ep.name)"].response.statusCode != 200 {
"\(ep.name)": op.#Steps & {
src: getMetrics["\(ep.name)"]
err: op.#Fail & {
message: "failed to get metrics for \(ep.name) from \(ep.url), code \(src.response.statusCode)"
}
}
}
}
} @step(4)
createDashboards: op.#Steps & {
for ep in status.endpoints if getMetrics["\(ep.name)"] != _|_ {
if getMetrics["\(ep.name)"].response.body != "" {
"\(ep.name)": dashboard & {
title: context.name
uid: "\(context.name)-\(context.namespace)"
description: "Auto-generated Dashboard"
metrics: *[] | [...{...}]
metrics: regexp.FindAllNamedSubmatch(#"""
# HELP \w+ (?P<desc>[^\n]+)\n# TYPE (?P<name>\w+) (?P<type>\w+)
"""#, getMetrics["\(ep.name)"].response.body, -1)
}
}
}
} @step(5)
applyDashboards: op.#Steps & {
for ep in status.endpoints if createDashboards["\(ep.name)"] != _|_ {
"\(ep.name)": op.#Apply & {
db: {for k, v in createDashboards["\(ep.name)"] if k != "metrics" {
"\(k)": v
}}
value: {
apiVersion: "o11y.prism.oam.dev/v1alpha1"
kind: "GrafanaDashboard"
metadata: name: "\(db.uid)@\(parameter.grafana)"
spec: db
}
}
}
} @step(6)
dashboard: {
title: *"Example Dashboard" | string
uid: *"" | string
description: *"" | string
metrics: [...{...}]
time: {
from: *"now-1h" | string
to: *"now" | string
}
refresh: *"30s" | string
templating: list: [{
type: "datasource"
name: "datasource"
label: "Data Source"
query: "prometheus"
hide: 2
}, {
type: "interval"
name: "rate_interval"
label: "Rate"
query: "3m,5m,10m,30m"
hide: 2
}]
panels: [for i, m in metrics {
title: m.name
type: "graph"
datasource: {
uid: "${datasource}"
type: "prometheus"
}
gridPos: {
w: 6
h: 8
x: math.Floor((i - y * 4) * 6)
y: math.Floor(i / 4)
}
description: m.desc
if m.type == "gauge" {
targets: [{
expr: "sum(\(m.name))"
}]
legend: show: false
}
if m.type == "counter" {
targets: [{
expr: "sum(rate(\(m.name)[$rate_interval]))"
}]
legend: show: false
}
if m.type == "histogram" || m.type == "summary" {
targets: [{
expr: "sum(rate(\(m.name)_sum[$rate_interval])) / sum(rate(\(m.name)_count[$rate_interval]))"
legendFormat: "avg"
}, {
expr: "histogram_quantile(0.75, sum(rate(\(m.name)_bucket[$rate_interval])) by (le))"
legendFormat: "p75"
}, {
expr: "histogram_quantile(0.99, sum(rate(\(m.name)_bucket[$rate_interval])) by (le))"
legendFormat: "p99"
}]
}
}]
}
parameter: {
port: *8080 | int
grafana: *"default" | string
}
}
```
Then you can create an application as follows.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: my-app
spec:
  # the core workload
  components:
    - name: my-app
      type: webservice
      properties:
        image: somefive/prometheus-client-example:new
      traits:
        - type: prometheus-scrape
  # deploy and create dashboard automatically
  workflow:
    steps:
      - type: deploy
        name: deploy
        properties:
          policies: []
      - type: create-dashboard
        name: create-dashboard
```
This application will deploy your webservice first, and then generate a dashboard automatically according to the metrics collected from the webservice.


---
title: Automated Observability
---
To help users build their own observability systems, KubeVela provides several addons, including:

**Metrics**
- `prometheus-server`: a service that records metrics in time series, with flexible queries supported.
- `kube-state-metrics`: a metrics collector for the Kubernetes system.
- `node-exporter`: a metrics collector for the running Kubernetes nodes.

**Logging**
- `loki`: a log server for storing the collected logs and providing query services.

**Dashboards**
- `grafana`: a web application that provides analytics and interactive visualizations.

More addons for alerting and tracing will be introduced in later versions.

## Advanced Guides

- [**Installation**](./o11y/installation): how to deploy the observability infrastructure in your KubeVela system.
- [**Out of the Box**](./o11y/out-of-the-box): the system and application monitoring capabilities available by default in the KubeVela observability infrastructure.
- [**Metrics**](./o11y/metrics): a guide on how to customize the metrics collection process for your applications.
- [**Logging**](./o11y/logging): a guide on how to customize the log collection rules for your applications.
- [**Dashboard**](./o11y/dashboard): a guide on how to configure customized dashboards for your applications.
### 自定义 Prometheus 配置
如果你想自定义安装 prometheus-server ,你可以把配置放到一个单独的 ConfigMap 中,比如在命名空间 o11y-system 中的 `my-prom`。 要将你的自定义配置分发到所有集群,你还可以使用 KubeVela Application 来完成这项工作。
#### 记录规则
例如,如果你想在所有集群中的所有 prometheus 服务配置中添加一些记录规则,你可以首先创建一个 Application 来分发你的记录规则,如下所示。
```yaml
# my-prom.yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-prom
namespace: o11y-system
spec:
components:
- type: k8s-objects
name: my-prom
properties:
objects:
- apiVersion: v1
kind: ConfigMap
metadata:
name: my-prom
namespace: o11y-system
data:
my-recording-rules.yml: |
groups:
- name: example
rules:
- record: apiserver:requests:rate5m
expr: sum(rate(apiserver_request_total{job="kubernetes-nodes"}[5m]))
policies:
- type: topology
name: topology
properties:
clusterLabelSelector: {}
```
然后你需要在 prometheus-server 插件的启用过程中添加 `customConfig` 参数,比如:
```shell
vela addon enable prometheus-server thanos=true serviceType=LoadBalancer storage=1G customConfig=my-prom
```
Then you will see the recording rule configurations distributed to all prometheus servers.
#### Alerting Rules and Other Configurations
To customize other configurations such as alerting rules, the process is the same as the recording rules example shown above. You only need to change/add prometheus configurations in the application.
```yaml
data:
my-alerting-rules.yml: |
groups:
- name: example
rules:
- alert: HighApplicationQueueDepth
expr: sum(workqueue_depth{app_kubernetes_io_name="vela-core",name="application"}) > 100
for: 10m
annotations:
summary: High Application Queue Depth
```
![prometheus-rules-config](../../resources/prometheus-rules-config.jpg)
### Customized Grafana Credentials
If you want to change the default username and password of Grafana, you can run the following command:
```shell
vela addon enable grafana adminUser=super-user adminPassword=PASSWORD
```
This will change your default admin user to `super-user` and its password to `PASSWORD`.
### Customized Storage
If you want prometheus-server and grafana to persist data in volumes, you can specify the `storage` parameter at installation time, for example:
```shell
vela addon enable prometheus-server storage=1G
```
This will create PersistentVolumeClaims and let the addon use the provided storage. The storage will not be automatically recycled even if the addon is disabled; you need to clean it up manually.
## Integrating Existing Prometheus & Grafana
Sometimes you may already have Prometheus and Grafana instances, built by other tools or provided by cloud vendors. Follow the guide below to integrate with your existing systems.
### Integrating Prometheus
If you already have an external prometheus service and want to connect it to the Grafana created by vela addons, you can use a KubeVela Application to create a GrafanaDatasource that registers this external prometheus service.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: register-prometheus
spec:
components:
- type: grafana-datasource
name: my-prometheus
properties:
access: proxy
basicAuth: false
isDefault: false
name: MyPrometheus
readOnly: true
withCredentials: true
jsonData:
httpHeaderName1: Authorization
tlsSkipVerify: true
secureJsonFields:
httpHeaderValue1: <token of your prometheus access>
type: prometheus
url: <my-prometheus url>
```
For example, if you are using the Prometheus service on Alibaba Cloud (ARMS), you can go to the Prometheus settings page to find the access url & token.
![arms-prometheus](../../resources/arms-prometheus.jpg)
> You need to make sure your grafana is accessible. You can run `kubectl get grafana default` to check whether it already exists.
### Integrating Grafana
If you already have a Grafana instance, similar to integrating Prometheus, you can register its access info through a KubeVela Application.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: register-grafana
spec:
components:
- type: grafana-access
name: my-grafana
properties:
name: my-grafana
endpoint: <my-grafana url>
token: <access token>
```
To get the Grafana access token, you can go into your Grafana instance and configure an API key.
![grafana-apikeys](../../resources/grafana-apikeys.jpg)
Then copy the token into your grafana registration config.
![grafana-added-apikeys](../../resources/grafana-added-apikeys.jpg)
After the Application is successfully dispatched, you can check the registration by running the following command.
```shell
> kubectl get grafana
NAME ENDPOINT CREDENTIAL_TYPE
default http://grafana.o11y-system:3000 BasicAuth
my-grafana https://grafana-rngwzwnsuvl4s9p66m.grafana.aliyuncs.com:80/ BearerToken
```
Now you can also manage dashboards and datasources on the grafana instance through the native Kubernetes API.
```shell
# show all the dashboards you have
kubectl get grafanadashboard -l grafana=my-grafana
# show all the datasources you have
kubectl get grafanadatasource -l grafana=my-grafana
```
For more details, you can refer to [vela-prism](https://github.com/kubevela/prism#grafana-related-apis).
### Integrating Other Tools and Systems
There is a wide range of community tools or ecosystems that users can leverage to build their own observability systems, such as prometheus-operator or DataDog. So far, KubeVela does not have best practices for those integrations. We may integrate those popular projects through KubeVela addons in the future. We also welcome community contributions for broader explorations and more connections.
## Comparisons
### Compared with Helm
Although you can install these resources into your Kubernetes system through Helm, one of the major benefits of installing them through KubeVela addons is native support for multi-cluster delivery: once managed clusters are joined to the KubeVela control plane, you can install, upgrade or uninstall these addons in all clusters with a single command.
### Compared with the Legacy Observability Addon
The old [KubeVela observability addon](https://github.com/kubevela/catalog/tree/master/experimental/addons/observability) installs prometheus, grafana and some other components as a whole. The latest observability addon suite (after KubeVela v1.5.0) divides it into multiple parts, allowing users to install only some of them.
Besides, the old observability addon relies on the Fluxcd addon to install components as Helm Releases. The latest version uses native webservice components in KubeVela, which allows more flexible customization.
## Future
KubeVela will integrate more observability addons in the future, such as logging and tracing addons. Community operators like [prometheus-operator](https://github.com/prometheus-operator/prometheus-operator) also provide alternative ways to manage observability applications, which are also intended to be included as KubeVela addons. We welcome more integrations through the KubeVela addon ecosystem.
- [**Integration**](./o11y/integration): Guide on how to integrate your existing monitoring system into KubeVela. If you already have Prometheus or Grafana services (self-hosted or provided by a cloud vendor), read this doc to learn how to integrate them.


View File

@ -181,9 +181,11 @@ module.exports = {
id: 'platform-engineers/operations/observability',
},
items: [
'platform-engineers/operations/o11y/installation',
'platform-engineers/operations/o11y/out-of-the-box',
'platform-engineers/operations/o11y/metrics',
'platform-engineers/operations/o11y/logging',
'platform-engineers/operations/o11y/visualization',
'platform-engineers/operations/o11y/dashboard',
'platform-engineers/operations/o11y/integration',
],
},

View File

@ -0,0 +1,91 @@
---
title: Installation
---
:::tip
Before installing the observability addons, we recommend starting with the [introduction of the observability feature](../observability).
:::
## Quick Start
To enable the observability addons, you simply need to run the `vela addon enable` commands as below.
1. Install the kube-state-metrics addon
```shell
vela addon enable kube-state-metrics
```
2. Install the node-exporter addon
```shell
vela addon enable node-exporter
```
3. Install the prometheus-server addon
```shell
vela addon enable prometheus-server
```
4. Install the loki addon
```shell
vela addon enable loki
```
5. Install the grafana addon
```shell
vela addon enable grafana
```
6. Access your grafana through port-forward.
```shell
kubectl port-forward svc/grafana -n o11y-system 8080:3000
```
Now you can access your grafana by visiting `http://localhost:8080` in your browser. The default username and password are `admin` and `kubevela`, respectively.
> You can change them by adding `adminUser=super-user adminPassword=PASSWORD` to the `vela addon enable grafana` command in step 5.
You will see several pre-installed dashboards and can use them to view your system and applications. For more details on those pre-installed dashboards, see the [Out-of-the-Box](./out-of-the-box) section.
![kubevela-application-dashboard](../../../resources/kubevela-application-dashboard.png)
:::caution
**Resource**: The observability suite includes several addons which require some computation resources to work properly. The recommended installation resources for your cluster are 2 cores + 4 Gi memory.
**Version**: We recommend using KubeVela v1.6.0 or above for the observability addons. In v1.5.0, logging is not supported.
:::
:::tip
**Addon Suite**: If you want to enable these addons in one command, you can use [WorkflowRun](https://github.com/kubevela/workflow) to orchestrate the install process. It allows you to manage the addon enable process as code and make it reusable across different systems.
:::
## Multi-cluster Installation
If you want to install observability addons in a multi-cluster scenario, make sure your Kubernetes clusters support LoadBalancer services and are mutually accessible.
By default, the installation processes of `kube-state-metrics`, `node-exporter` and `prometheus-server` natively support multi-cluster (they will be automatically installed to all clusters). But to let the `grafana` on the control plane access the prometheus-server in managed clusters, you need to enable `prometheus-server` with the following command.
```shell
vela addon enable prometheus-server thanos=true serviceType=LoadBalancer
```
This will install a [thanos](https://github.com/thanos-io/thanos) sidecar & query along with prometheus-server. Then enable grafana, and you will be able to see the aggregated prometheus metrics.
You can also choose which clusters to install the addons to, using commands as below:
```shell
vela addon enable kube-state-metrics clusters=\{local,c2\}
```
For the `loki` addon, the storage is hosted on the hub control plane by default, and the agent ([promtail](https://grafana.com/docs/loki/latest/clients/promtail/) or [vector](https://vector.dev/)) installation supports multi-cluster. You can run the following command to let the agents in all clusters send logs to the loki service on the `local` cluster.
```shell
vela addon enable loki agent=vector serviceType=LoadBalancer
```
> If you add new clusters to your control plane after the addons are installed, you need to re-enable the addons for them to take effect.

View File

@ -94,10 +94,6 @@ For more details, you can refer to [vela-prism](https://github.com/kubevela/pris
It is also possible to make integrations through KubeVela's configuration management system, whether you are using the CLI or VelaUX.
### Prometheus
You can read the Configuration Management documentation for more details.
## Integrate Other Tools or Systems
There is a wide range of community tools or ecosystems that users can leverage to build their observability systems, such as prometheus-operator or DataDog. So far, KubeVela does not have existing best practices for those integrations. We may integrate with those popular projects through KubeVela addons in the future. We also welcome community contributions for broader explorations and more connections.

View File

@ -227,7 +227,7 @@ spec:
In this example, we transform nginx `combined` format logs to json format, and add a `new_field` json key to each log whose value is `new value`. Please refer to this [document](https://vector.dev/docs/reference/vrl/) for how to write vector VRL.
If you have a special log analysis dashboard for this processing method, you can refer to [document](./visualization#dashboard-customization) to import it into grafana.
If you have a special log analysis dashboard for this processing method, you can refer to [document](./dashboard) to import it into grafana.
## Collecting file log

View File

@ -2,6 +2,47 @@
title: Metrics
---
## Exposing Metrics in your Application
In your application, if you want to expose the metrics of your component (like webservice) to Prometheus, you just need to add the `prometheus-scrape` trait as follows.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-app
spec:
components:
- name: my-app
type: webservice
properties:
image: somefive/prometheus-client-example:new
traits:
- type: prometheus-scrape
```
You can also explicitly specify which port and which path to expose metrics.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-app
spec:
components:
- name: my-app
type: webservice
properties:
image: somefive/prometheus-client-example:new
traits:
- type: prometheus-scrape
properties:
port: 8080
path: /metrics
```
This will make your application scrapable by the prometheus server. If you want to see those metrics on Grafana, you need to further create a Grafana dashboard. Go to [Dashboard](./dashboard) to learn the following steps.
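After the trait is applied, you can sanity-check in the Prometheus UI that the target is being scraped. The exact label values depend on the generated scrape config, so treat the selectors below as assumptions:

```promql
# 1 for targets that were scraped successfully, 0 otherwise
up

# number of healthy targets per scrape job
count by (job) (up == 1)
```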
## Customized Prometheus Installation
If you want to make customization to your prometheus-server installation, you can put your configuration into an individual ConfigMap, like `my-prom` in namespace o11y-system. To distribute your custom config to all clusters, you can also use a KubeVela Application to do the job.
@ -78,45 +119,3 @@ vela addon enable prometheus-server storage=1G
```
This will create PersistentVolumeClaims and let the addon use the provided storage. The storage will not be automatically recycled even if the addon is disabled. You need to clean up the storage manually.
## Exposing Metrics in your Application
In your application, if you want to expose the metrics of your component (like webservice) to Prometheus, you just need to add the `prometheus-scrape` trait as follows.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-app
spec:
components:
- name: my-app
type: webservice
properties:
image: somefive/prometheus-client-example:new
traits:
- type: prometheus-scrape
```
You can also explicitly specify which port and which path to expose metrics.
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
name: my-app
spec:
components:
- name: my-app
type: webservice
properties:
image: somefive/prometheus-client-example:new
traits:
- type: prometheus-scrape
properties:
port: 8080
path: /metrics
```
This will let your application be scrapable by the prometheus server. If you want to see those metrics on Grafana, you need to create Grafana dashboard further. Go to [Visualization](./visualization#dashboard-customization) for learning the following steps.

View File

@ -0,0 +1,94 @@
---
title: Out of the Box
---
By default, a series of dashboards are pre-installed with the `grafana` addon and provide basic panels for viewing observability data. If you follow the [installation guide](./installation), you should be able to use these dashboards without further configurations.
## Dashboards
### KubeVela Application
This dashboard shows the basic information for one application.
URL: http://localhost:8080/d/application-overview/kubevela-applications
![kubevela-application-dashboard](../../../resources/kubevela-application-dashboard.png)
:::info
The KubeVela Application dashboard shows the overview of the metadata for the application. It directly accesses the Kubernetes API to retrieve the runtime application information, where you can use it as an entrance. You can navigate to detail information for application resources by clicking the `Details` link in the *Managed Resources* panel.
The **Basic Information** section extracts key information into panels and give you the most straightforward view for the current application.
The **Related Resources** section shows those resources that work together with the application itself, including the managed resources, the recorded ResourceTrackers and the revisions.
:::
### Kubernetes Deployment
This dashboard shows the overview of native deployments. You can navigate deployments across clusters.
URL: http://localhost:8080/d/kubernetes-deployment/kubernetes-deployment
![kubernetes-deployment-dashboard](../../../resources/kubernetes-deployment-dashboard.jpg)
:::info
The Kubernetes Deployment dashboard gives you the detailed running status of the deployment.
The **Pods** panel shows the pods that the deployment itself is currently managing.
The **Replicas** panel shows how the number of replicas changes, which can be used to diagnose when and how your deployment shifted to an undesired state.
The **Resource** section includes details of the resource usage (CPU / Memory / Network / Storage), which can be used to identify whether the pods of the deployment are facing resource pressure or making/receiving unexpected traffic.
There is a list of dashboards for various types of Kubernetes resources, such as DaemonSet and StatefulSet. You can navigate to those dashboards depending on your workload type.
:::
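The **Resource** panels are usually backed by cAdvisor container metrics. Assuming the standard metric names, per-pod usage for a deployment can be sketched as follows (`my-deployment` is a hypothetical deployment name used only for illustration):

```promql
# CPU usage (cores) per pod, selecting pods by a name prefix
sum(rate(container_cpu_usage_seconds_total{pod=~"my-deployment-.*"}[5m])) by (pod)

# working set memory per pod (bytes)
sum(container_memory_working_set_bytes{pod=~"my-deployment-.*"}) by (pod)
```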
### KubeVela System
This dashboard shows the overview of the KubeVela system. It can be used to see if KubeVela controller is healthy.
URL: http://localhost:8080/d/kubevela-system/kubevela-system
![kubevela-system](../../../resources/kubevela-system.jpg)
:::info
The KubeVela System dashboard gives you the running details of the KubeVela core modules, including the controller and the cluster-gateway. Other modules like velaux or prism are expected to be added in the future.
The **Computation Resource** section shows the usage of the core modules. It can be used to track whether there is a memory leak (the memory usage is continuously increasing) or whether a module is under high pressure (the cpu usage is always very high). If the memory usage hits the resource limit, the corresponding module will be killed and restarted, which indicates a lack of computation resources. You should add more CPU/Memory for them.
The **Controller** section includes a wide range of panels which can help you to diagnose the bottleneck of the KubeVela controller in your scenario.
The **Controller Queue** and **Controller Queue Add Rate** panels show you how the controller working queue changes. If the controller queue keeps increasing, there are too many applications or application changes in the system and the controller is unable to handle them in time, which indicates a performance issue in the KubeVela controller. A temporary increase of the controller queue is tolerable, but a long-lasting one leads to memory growth which will finally cause Out-Of-Memory problems.
The **Reconcile Rate** and **Average Reconcile Time** panels give you an overview of the controller status. If the reconcile rate is steady and the average reconcile time is reasonable (like under 500ms, depending on your scenario), your KubeVela controller is healthy. If the controller queue add rate is increasing but the reconcile rate does not go up, the controller queue will gradually grow and cause trouble. There are several cases in which your controller can be unhealthy:
1. Reconcile is healthy but there are too many applications. You will find everything is okay except that the controller queue metrics keep increasing. Check the CPU/Memory usage of the controller; you might need to add more computation resources.
2. Reconcile is not healthy due to too many errors. You will find lots of errors in the **Reconcile Rate** panel. This means your system is continuously facing processing errors for applications. It could be caused by invalid application configurations or by unexpected errors while running workflows. Check the application details to see which applications are causing errors.
3. Reconcile is not healthy due to long reconcile times. You need to check the **ApplicationController Reconcile Time** panel to see whether it is a common case (the average reconcile time is high) or only part of your applications have problems (the p95 reconcile time is high). The former case is usually caused by either insufficient CPU (CPU usage is high) or too many requests being rate limited by kube-apiserver (check the **ApplicationController Client Request Throughput** and **ApplicationController Client Request Average Time** panels to see which resource requests are slow or excessive). For the latter case, you need to check which application is large and takes lots of time to reconcile.
Sometimes you might need to refer to the **ApplicationController Reconcile Stage Time** panel to see whether some particular reconcile stage is abnormal. For example, if GCResourceTrackers takes lots of time, resource recycling might be blocked in the KubeVela system.
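The queue and reconcile panels are built from controller-runtime's standard metrics, so when digging deeper you can query them directly. The metric and label names below follow controller-runtime conventions and may vary across versions:

```promql
# depth of the application controller's working queue
workqueue_depth{name="application"}

# reconcile rate by result
sum(rate(controller_runtime_reconcile_total{controller="application"}[5m])) by (result)

# average reconcile time (seconds)
sum(rate(controller_runtime_reconcile_time_seconds_sum{controller="application"}[5m]))
  / sum(rate(controller_runtime_reconcile_time_seconds_count{controller="application"}[5m]))
```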
The **Application** section shows an overview of the applications in your whole KubeVela system. It can be used to see the changes of the application numbers and the workflow steps in use. The **Workflow Initialize Rate** is an auxiliary panel showing how frequently new workflow executions are launched. The **Workflow Average Complete Time** further shows how long it takes to finish the whole workflow.
:::
### Kubernetes APIServer
This dashboard shows the running status of all Kubernetes apiservers.
URL: http://localhost:8080/d/kubernetes-apiserver/kubernetes-apiserver
![kubernetes-apiserver](../../../resources/kubernetes-apiserver.jpg)
:::info
The Kubernetes APIServer dashboard helps you see the most fundamental part of your Kubernetes system. If your Kubernetes APIServer is not running healthily, all controllers and modules in your Kubernetes system will be abnormal and unable to handle requests successfully. So it is important to make sure everything is fine in this dashboard.
The **Requests** section includes a series of panels showing the QPS and latency of various kinds of requests. Usually, your APIServer may fail to respond if it is flooded by too many requests. At that point, you can see which type of request is causing trouble.
The **WorkQueue** section shows the processing status of the Kubernetes APIServer. If the **Queue Size** is large, the number of requests is beyond the processing capability of your Kubernetes APIServer.
The **Watches** section shows the number of watches in your Kubernetes APIServer. Compared to other types of requests, WATCH requests continuously consume computation resources in the APIServer, so keeping the number of watches limited helps.
:::

View File

@ -2,8 +2,6 @@
title: Automated Observability
---
## Introduction
Observability is critical for infrastructures and applications. Without an observability system, it is hard to identify what happened when a system breaks down. On the contrary, a strong observability system not only gives operators confidence but also helps developers quickly locate performance bottlenecks or weak points in the whole system.
To help users build their own observability system from scratch, KubeVela provides a list of addons, including
@ -16,102 +14,21 @@ To help users build their own observability system from scratch, KubeVela provid
**Logging**
- `loki`: A logging server which stores collected logs of Kubernetes pods and provide query interfaces.
**Visualization**
**Dashboard**
- `grafana`: A web application that provides analytics and interactive visualizations.
> More addons for alerting & tracing will be introduced in later versions.
## Quick Start
To enable the observability addons, you simply need to run the `vela addon enable` commands as below.
1. Install the kube-state-metrics addon
```shell
vela addon enable kube-state-metrics
```
2. Install the node-exporter addon
```shell
vela addon enable node-exporter
```
3. Install the prometheus-server addon
```shell
vela addon enable prometheus-server
```
4. Install the loki addon
```shell
vela addon enable loki
```
5. Install the grafana addon
```shell
vela addon enable grafana
```
6. Access your grafana through port-forward.
```shell
kubectl port-forward svc/grafana -n o11y-system 8080:3000
```
Now you can access your grafana by visiting `http://localhost:8080` in your browser. The default username and password are `admin` and `kubevela`, respectively.
> You can change them by adding `adminUser=super-user adminPassword=PASSWORD` to the `vela addon enable grafana` command in step 5.
You will see several pre-installed dashboards and can use them to view your system and applications. For more details on those pre-installed dashboards, see the [Visualization](./o11y/visualization#pre-installed-dashboards) section.
![kubevela-application-dashboard](../../resources/kubevela-application-dashboard.jpg)
:::caution
**Resource**: The observability suite includes several addons which require some computation resources to work properly. The recommended installation resources for your cluster are 2 cores + 4 Gi memory.
**Version**: We recommend using KubeVela v1.6.0 or above for the observability addons. In v1.5.0, logging is not supported.
:::
:::tip
**Addon Suite**: If you want to enable these addons in one command, you can use [WorkflowRun](https://github.com/kubevela/workflow) to orchestrate the install process. It allows you to manage the addon enable process as code and make it reusable across different systems.
:::
## Multi-cluster Installation
If you want to install observability addons in a multi-cluster scenario, make sure your Kubernetes clusters support LoadBalancer services and are mutually accessible.
By default, the installation processes of `kube-state-metrics`, `node-exporter` and `prometheus-server` natively support multi-cluster (they will be automatically installed to all clusters). But to let the `grafana` on the control plane access the prometheus-server in managed clusters, you need to enable `prometheus-server` with the following command.
```shell
vela addon enable prometheus-server thanos=true serviceType=LoadBalancer
```
This will install a [thanos](https://github.com/thanos-io/thanos) sidecar & query along with prometheus-server. Then enable grafana, and you will be able to see the aggregated prometheus metrics.
You can also choose which clusters to install the addons to, using commands as below:
```shell
vela addon enable kube-state-metrics clusters=\{local,c2\}
```
For the `loki` addon, the storage is hosted on the hub control plane by default, and the agent ([promtail](https://grafana.com/docs/loki/latest/clients/promtail/) or [vector](https://vector.dev/)) installation supports multi-cluster. You can run the following command to let the agents in all clusters send logs to the loki service on the `local` cluster.
```shell
vela addon enable loki agent=vector serviceType=LoadBalancer
```
> If you add new clusters to your control plane after the addons are installed, you need to re-enable the addons for them to take effect.
## What's Next
- [**Installation**](./o11y/installation): Guide for how to install observability addons in your KubeVela system.
- [**Out of the Box**](./o11y/out-of-the-box): Guide for how to use pre-installed dashboards to monitor your system and applications.
- [**Metrics**](./o11y/metrics): Guide for customizing the process of collecting metrics for your application.
- [**Logging**](./o11y/logging): Guide for how to customize the log collecting rules for your application.
- [**Visualization**](./o11y/visualization): Guide for creating your customized dashboards for applications.
- [**Dashboard**](./o11y/dashboard): Guide for creating your customized dashboards for applications.
- [**Integration**](./o11y/integration): Guide for integrating your existing infrastructure to KubeVela, when you already have Prometheus or Grafana before installing addons.


View File

@ -193,9 +193,11 @@
"id": "platform-engineers/operations/observability"
},
"items": [
"platform-engineers/operations/o11y/installation",
"platform-engineers/operations/o11y/out-of-the-box",
"platform-engineers/operations/o11y/metrics",
"platform-engineers/operations/o11y/logging",
"platform-engineers/operations/o11y/visualization",
"platform-engineers/operations/o11y/dashboard",
"platform-engineers/operations/o11y/integration"
]
},