# Metrics Analysis

As part of the analysis process, Flagger can validate service level objectives
(SLOs) like availability, error rate percentage, average response time and any other objective
based on app specific metrics.
If a drop in performance is noticed during the SLOs analysis,
the release will be automatically rolled back with minimum impact to end-users.

## Builtin metrics

Flagger comes with two builtin metric checks: HTTP request success rate and duration.

```yaml
  analysis:
    metrics:
      - name: request-success-rate
        interval: 1m
        # minimum req success rate (non 5xx responses)
        # percentage (0-100)
        thresholdRange:
          min: 99
      - name: request-duration
        interval: 1m
        # maximum req duration P99
        # milliseconds
        thresholdRange:
          max: 500
```

For each metric you can specify a range of accepted values with `thresholdRange` and
the window size of the time series with `interval`.
The builtin checks are available for every service mesh / ingress controller
and are implemented with [Prometheus queries](../faq.md#metrics).

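To show where the `analysis` block sits, here is a minimal sketch of a complete `Canary` resource that uses both builtin checks. The names `podinfo` and `test`, the service port, and the rollout settings (`interval`, `threshold`, `maxWeight`, `stepWeight`) are illustrative assumptions and are not covered on this page.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo        # hypothetical canary name
  namespace: test      # hypothetical namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo      # hypothetical deployment under analysis
  service:
    port: 9898         # hypothetical container port
  analysis:
    # how often the checks run (assumed value)
    interval: 1m
    # number of failed checks tolerated before rollback (assumed value)
    threshold: 5
    # traffic shifting settings (assumed values)
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        interval: 1m
        thresholdRange:
          min: 99
      - name: request-duration
        interval: 1m
        thresholdRange:
          max: 500
```
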
## Custom metrics

The canary analysis can be extended with custom metric checks.
Using a `MetricTemplate` custom resource,
you configure Flagger to connect to a metric provider and run a query that returns a `float64` value.
The query result is used to validate the canary based on the specified threshold range.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: my-metric
spec:
  provider:
    type: # can be prometheus, datadog, cloudwatch or newrelic
    address: # API URL
    secretRef:
      name: # name of the secret containing the API credentials
  query: # metric query
```

The following variables are available in query templates:

* `name` (canary.metadata.name)
* `namespace` (canary.metadata.namespace)
* `target` (canary.spec.targetRef.name)
* `service` (canary.spec.service.name)
* `ingress` (canary.spec.ingressRef.name)
* `interval` (canary.spec.analysis.metrics[].interval)

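As a rough illustration of how these variables are substituted, the sketch below assumes a hypothetical canary in namespace `test` whose `targetRef.name` is `podinfo`, analysed with a metric `interval` of `1m`. The template name, the metric `http_requests_total` and its labels are assumptions made for the example only.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: request-rate   # hypothetical template name
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  # Template as written; the {{ ... }} placeholders are filled in per canary.
  query: |
    sum(
        rate(
            http_requests_total{
              namespace="{{ namespace }}",
              pod=~"{{ target }}-.*"
            }[{{ interval }}]
        )
    )
  # With the assumed values above, the rendered query would read:
  #   sum(rate(http_requests_total{namespace="test", pod=~"podinfo-.*"}[1m]))
```
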
A canary analysis metric can reference a template with `templateRef`:

```yaml
  analysis:
    metrics:
      - name: "my metric"
        templateRef:
          name: my-metric
          # namespace is optional
          # when not specified, the canary namespace will be used
          namespace: flagger
        # accepted values
        thresholdRange:
          min: 10
          max: 1000
        # metric query time window
        interval: 1m
```

## Prometheus

You can create custom metric checks targeting a Prometheus server by
setting the provider type to `prometheus` and writing the query in PromQL.

Prometheus template example:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090
  query: |
    100 - sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}",
              response_code!="404"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}"
            }[{{ interval }}]
        )
    ) * 100
```

Reference the template in the canary analysis:

```yaml
  analysis:
    metrics:
      - name: "404s percentage"
        templateRef:
          name: not-found-percentage
          namespace: istio-system
        thresholdRange:
          max: 5
        interval: 1m
```

The above configuration validates the canary by checking that HTTP 404 responses
stay below 5 percent of the total traffic. If the 404 rate reaches the 5% threshold, the canary fails.

Prometheus gRPC error rate example:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: grpc-error-rate-percentage
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://flagger-prometheus.flagger-system:9090
  # percentage of gRPC requests that finished with a non-OK status code
  query: |
    sum(
        rate(
            grpc_server_handled_total{
              grpc_code!="OK",
              kubernetes_namespace="{{ namespace }}",
              kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            grpc_server_started_total{
              kubernetes_namespace="{{ namespace }}",
              kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
            }[{{ interval }}]
        )
    ) * 100
```

The above template is for gRPC services instrumented with
[go-grpc-prometheus](https://github.com/grpc-ecosystem/go-grpc-prometheus).

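The gRPC template can be referenced from the canary analysis in the same way as the 404 example. The sketch below assumes the query returns the percentage of failed gRPC requests; the metric name and the 5% ceiling are illustrative choices, not recommended values.

```yaml
  analysis:
    metrics:
      - name: "gRPC error rate"   # illustrative metric name
        templateRef:
          name: grpc-error-rate-percentage
          namespace: flagger
        # assumed ceiling: fail the check above 5% failed requests
        thresholdRange:
          max: 5
        interval: 1m
```
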
## Prometheus authentication

If your Prometheus API requires basic authentication, you can create a secret in the same namespace
as the `MetricTemplate` with the basic-auth credentials:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prom-basic-auth
  namespace: flagger
data:
  # values under `data` must be base64-encoded
  username: your-user
  password: your-password
```

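Kubernetes expects the values under `data` to be base64-encoded. As an alternative sketch, the same secret can be written with `stringData`, which accepts plain-text values that the API server encodes when the object is stored. The credentials are the same placeholders as above.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prom-basic-auth
  namespace: flagger
type: Opaque
stringData:
  # plain-text placeholders; replace with your real credentials
  username: your-user
  password: your-password
```
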
Then reference the secret in the `MetricTemplate`:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: my-metric
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
    secretRef:
      name: prom-basic-auth
```

## Datadog

You can create custom metric checks using the Datadog provider.

Create a secret with your Datadog API credentials:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: datadog
  namespace: istio-system
data:
  # values under `data` must be base64-encoded
  datadog_api_key: your-datadog-api-key
  datadog_application_key: your-datadog-application-key
```

Datadog template example:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: datadog
    address: https://api.datadoghq.com
    secretRef:
      name: datadog
  query: |
    100 - (
      sum:istio.mesh.request.count{
        reporter:destination,
        destination_workload_namespace:{{ namespace }},
        destination_workload:{{ target }},
        !response_code:404
      }.as_count()
      /
      sum:istio.mesh.request.count{
        reporter:destination,
        destination_workload_namespace:{{ namespace }},
        destination_workload:{{ target }}
      }.as_count()
    ) * 100
```

Reference the template in the canary analysis:

```yaml
  analysis:
    metrics:
      - name: "404s percentage"
        templateRef:
          name: not-found-percentage
          namespace: istio-system
        thresholdRange:
          max: 5
        interval: 1m
```

## Amazon CloudWatch

You can create custom metric checks using the CloudWatch metrics provider.

CloudWatch template example:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: cloudwatch-error-rate
spec:
  provider:
    type: cloudwatch
    region: ap-northeast-1 # specify the region of your metrics
  query: |
    [
        {
            "Id": "e1",
            "Expression": "m1 / m2",
            "Label": "ErrorRate"
        },
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "MyKubernetesCluster",
                    "MetricName": "ErrorCount",
                    "Dimensions": [
                        {
                            "Name": "appName",
                            "Value": "{{ name }}.{{ namespace }}"
                        }
                    ]
                },
                "Period": 60,
                "Stat": "Sum",
                "Unit": "Count"
            },
            "ReturnData": false
        },
        {
            "Id": "m2",
            "MetricStat": {
                "Metric": {
                    "Namespace": "MyKubernetesCluster",
                    "MetricName": "RequestCount",
                    "Dimensions": [
                        {
                            "Name": "appName",
                            "Value": "{{ name }}.{{ namespace }}"
                        }
                    ]
                },
                "Period": 60,
                "Stat": "Sum",
                "Unit": "Count"
            },
            "ReturnData": false
        }
    ]
```

The query format documentation can be found
[here](https://aws.amazon.com/premiumsupport/knowledge-center/cloudwatch-getmetricdata-api/).

Reference the template in the canary analysis:

```yaml
  analysis:
    metrics:
      - name: "app error rate"
        templateRef:
          name: cloudwatch-error-rate
        thresholdRange:
          max: 0.1
        interval: 1m
```

**Note** that Flagger needs AWS IAM permission to perform `cloudwatch:GetMetricData` to use this provider.

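One way to grant this permission is an IAM managed policy attached to the role that the Flagger pod assumes. The CloudFormation fragment below is a minimal sketch: the logical resource name `FlaggerCloudWatchPolicy` is hypothetical, and how the policy reaches Flagger (for example, through IAM roles for service accounts) depends on your cluster setup and is not shown.

```yaml
Resources:
  FlaggerCloudWatchPolicy:   # hypothetical logical name
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: Allow Flagger to read CloudWatch metrics
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action:
              - cloudwatch:GetMetricData
            # GetMetricData does not support resource-level restrictions
            Resource: "*"
```
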
## New Relic

You can create custom metric checks using the New Relic provider.

Create a secret with your New Relic Insights credentials:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: newrelic
  namespace: istio-system
data:
  # values under `data` must be base64-encoded
  newrelic_account_id: your-account-id
  newrelic_query_key: your-insights-query-key
```

New Relic template example:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: newrelic-error-rate
  namespace: ingress-nginx
spec:
  provider:
    type: newrelic
    secretRef:
      name: newrelic
  query: |
    SELECT
        filter(sum(nginx_ingress_controller_requests), WHERE status >= '500') /
        sum(nginx_ingress_controller_requests) * 100
    FROM Metric
    WHERE metricName = 'nginx_ingress_controller_requests'
    AND ingress = '{{ ingress }}' AND namespace = '{{ namespace }}'
```

Reference the template in the canary analysis:

```yaml
  analysis:
    metrics:
      - name: "error rate"
        templateRef:
          name: newrelic-error-rate
          namespace: ingress-nginx
        thresholdRange:
          max: 5
        interval: 1m
```