mirror of https://github.com/fluxcd/flagger.git
# Metrics Analysis

As part of the analysis process, Flagger can validate service level objectives
(SLOs) like availability, error rate percentage, average response time and any other objective
based on app-specific metrics.
If a drop in performance is noticed during the SLO analysis,
the release will be automatically rolled back with minimum impact to end-users.

## Builtin metrics

Flagger comes with two builtin metric checks: HTTP request success rate and duration.

```yaml
  analysis:
    metrics:
    - name: request-success-rate
      interval: 1m
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
    - name: request-duration
      interval: 1m
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
```

For each metric you can specify a range of accepted values with `thresholdRange` and
the window size of the time series with `interval`.
The builtin checks are available for every service mesh / ingress controller
and are implemented with [Prometheus queries](../faq.md#metrics).
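
To show where the `analysis.metrics` fragment sits, here is a minimal sketch of a complete `Canary` resource that uses both builtin checks. The workload name `podinfo`, the namespace `test`, the port and the rollout values (`threshold`, `maxWeight`, `stepWeight`) are illustrative assumptions, not values taken from this page.

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo          # hypothetical workload name
  namespace: test        # hypothetical namespace
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  service:
    port: 9898           # assumed service port
  analysis:
    # how often the metric checks are evaluated
    interval: 1m
    # number of failed checks tolerated before rollback
    threshold: 5
    # traffic shifting values, chosen only for illustration
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      interval: 1m
      thresholdRange:
        min: 99
    - name: request-duration
      interval: 1m
      thresholdRange:
        max: 500
```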

## Custom metrics

The canary analysis can be extended with custom metric checks.
Using a `MetricTemplate` custom resource,
you configure Flagger to connect to a metric provider and run a query that returns a `float64` value.
The query result is used to validate the canary based on the specified threshold range.

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: my-metric
spec:
  provider:
    type: # can be prometheus, datadog, etc
    address: # API URL
    insecureSkipVerify: # if set to true, disables the TLS cert validation
    secretRef:
      name: # name of the secret containing the API credentials
  query: # metric query
```

The following variables are available in query templates:

* `name` (canary.metadata.name)
* `namespace` (canary.metadata.namespace)
* `target` (canary.spec.targetRef.name)
* `service` (canary.spec.service.name)
* `ingress` (canary.spec.ingressRef.name)
* `interval` (canary.spec.analysis.metrics[].interval)
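
As an illustration of how these variables are substituted, the fragment below shows a query as it might be written in a template, followed by the rendered form in comments. It assumes a hypothetical canary in namespace `test` whose `targetRef.name` is `podinfo` and whose metric `interval` is `1m`; the metric name `http_requests_total` and its labels are also assumptions.

```yaml
# query as written in the MetricTemplate
query: |
  sum(
    rate(
      http_requests_total{
        namespace="{{ namespace }}",
        pod=~"{{ target }}-.*"
      }[{{ interval }}]
    )
  )
# the same query after substitution for the hypothetical canary
# (shown as a comment only, this is not a MetricTemplate field):
#   sum(
#     rate(
#       http_requests_total{
#         namespace="test",
#         pod=~"podinfo-.*"
#       }[1m]
#     )
#   )
```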

A canary analysis metric can reference a template with `templateRef`:

```yaml
  analysis:
    metrics:
      - name: "my metric"
        templateRef:
          name: my-metric
          # namespace is optional
          # when not specified, the canary namespace will be used
          namespace: flagger
        # accepted values
        thresholdRange:
          min: 10
          max: 1000
        # metric query time window
        interval: 1m
```

## Prometheus

You can create custom metric checks targeting a Prometheus server by
setting the provider type to `prometheus` and writing the query in PromQL.

Prometheus template example:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090
  query: |
    100 - sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}",
              response_code!="404"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}"
            }[{{ interval }}]
        )
    ) * 100
```

Reference the template in the canary analysis:

```yaml
  analysis:
    metrics:
      - name: "404s percentage"
        templateRef:
          name: not-found-percentage
          namespace: istio-system
        thresholdRange:
          max: 5
        interval: 1m
```

The above configuration validates the canary by checking if the HTTP 404 req/sec percentage
is below 5 percent of the total traffic. If the 404s rate reaches the 5% threshold, then the canary fails.

Prometheus gRPC error rate example:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: grpc-error-rate-percentage
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://flagger-prometheus.flagger-system:9090
  query: |
    sum(
        rate(
            grpc_server_handled_total{
              grpc_code!="OK",
              kubernetes_namespace="{{ namespace }}",
              kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            grpc_server_started_total{
              kubernetes_namespace="{{ namespace }}",
              kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
            }[{{ interval }}]
        )
    ) * 100
```

The above template is for gRPC services instrumented with
[go-grpc-prometheus](https://github.com/grpc-ecosystem/go-grpc-prometheus).

## Prometheus authentication

If your Prometheus API requires basic authentication, you can create a secret in the same namespace
as the `MetricTemplate` with the basic-auth credentials:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prom-basic-auth
  namespace: flagger
data:
  # values under `data` must be base64-encoded
  username: your-user
  password: your-password
```
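
Kubernetes expects the values under `data` to be base64-encoded. As an alternative sketch, the same secret can be written with `stringData`, which accepts plain-text values that the API server encodes when the object is stored. The credentials shown are placeholders.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prom-basic-auth
  namespace: flagger
type: Opaque
stringData:
  # plain-text placeholders, encoded by the API server on write
  username: your-user
  password: your-password
```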

Then reference the secret in the `MetricTemplate`:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: my-metric
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
    secretRef:
      name: prom-basic-auth
```

## Datadog

You can create custom metric checks using the Datadog provider.

Create a secret with your Datadog API credentials:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: datadog
  namespace: istio-system
data:
  # values under `data` must be base64-encoded
  datadog_api_key: your-datadog-api-key
  datadog_application_key: your-datadog-application-key
```

Datadog template example:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: datadog
    address: https://api.datadoghq.com
    secretRef:
      name: datadog
  query: |
    100 - (
      sum:istio.mesh.request.count{
        reporter:destination,
        destination_workload_namespace:{{ namespace }},
        destination_workload:{{ target }},
        !response_code:404
      }.as_count()
      /
      sum:istio.mesh.request.count{
        reporter:destination,
        destination_workload_namespace:{{ namespace }},
        destination_workload:{{ target }}
      }.as_count()
    ) * 100
```

Reference the template in the canary analysis:

```yaml
  analysis:
    metrics:
      - name: "404s percentage"
        templateRef:
          name: not-found-percentage
          namespace: istio-system
        thresholdRange:
          max: 5
        interval: 1m
```

## Amazon CloudWatch

You can create custom metric checks using the CloudWatch metrics provider.

CloudWatch template example:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: cloudwatch-error-rate
spec:
  provider:
    type: cloudwatch
    region: ap-northeast-1 # specify the region of your metrics
  query: |
    [
        {
            "Id": "e1",
            "Expression": "m1 / m2",
            "Label": "ErrorRate"
        },
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "MyKubernetesCluster",
                    "MetricName": "ErrorCount",
                    "Dimensions": [
                        {
                            "Name": "appName",
                            "Value": "{{ name }}.{{ namespace }}"
                        }
                    ]
                },
                "Period": 60,
                "Stat": "Sum",
                "Unit": "Count"
            },
            "ReturnData": false
        },
        {
            "Id": "m2",
            "MetricStat": {
                "Metric": {
                    "Namespace": "MyKubernetesCluster",
                    "MetricName": "RequestCount",
                    "Dimensions": [
                        {
                            "Name": "appName",
                            "Value": "{{ name }}.{{ namespace }}"
                        }
                    ]
                },
                "Period": 60,
                "Stat": "Sum",
                "Unit": "Count"
            },
            "ReturnData": false
        }
    ]
```

The query format documentation can be found
[here](https://aws.amazon.com/premiumsupport/knowledge-center/cloudwatch-getmetricdata-api/).

Reference the template in the canary analysis:

```yaml
  analysis:
    metrics:
      - name: "app error rate"
        templateRef:
          name: cloudwatch-error-rate
        thresholdRange:
          max: 0.1
        interval: 1m
```

**Note** that Flagger needs the AWS IAM permission `cloudwatch:GetMetricData` to use this provider.
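
One possible way to grant this permission, assuming the cluster runs on Amazon EKS with IAM Roles for Service Accounts already set up, is to annotate the Flagger service account with a role that allows `cloudwatch:GetMetricData`. This is only a sketch: the service account name, namespace, account ID and role name are placeholders, and other credential mechanisms may suit your setup better.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: flagger               # assumed name of the Flagger service account
  namespace: <namespace>      # namespace where Flagger is installed
  annotations:
    # hypothetical role; it must allow cloudwatch:GetMetricData
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/<role-name>
```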

## New Relic

You can create custom metric checks using the New Relic provider.

Create a secret with your New Relic Insights credentials:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: newrelic
  namespace: istio-system
data:
  # values under `data` must be base64-encoded
  newrelic_account_id: your-account-id
  newrelic_query_key: your-insights-query-key
```

New Relic template example:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: newrelic-error-rate
  namespace: ingress-nginx
spec:
  provider:
    type: newrelic
    secretRef:
      name: newrelic
  query: |
    SELECT
      filter(sum(nginx_ingress_controller_requests), WHERE status >= '500') /
      sum(nginx_ingress_controller_requests) * 100
    FROM Metric
    WHERE metricName = 'nginx_ingress_controller_requests'
    AND ingress = '{{ ingress }}' AND namespace = '{{ namespace }}'
```

Reference the template in the canary analysis:

```yaml
  analysis:
    metrics:
      - name: "error rate"
        templateRef:
          name: newrelic-error-rate
          namespace: ingress-nginx
        thresholdRange:
          max: 5
        interval: 1m
```

## Graphite

You can create custom metric checks using the Graphite provider.

Graphite template example:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: graphite-request-success-rate
spec:
  provider:
    type: graphite
    address: http://graphite.monitoring
  query: |
    target=summarize(
      asPercent(
        sumSeries(
          stats.timers.httpServerRequests.app.{{target}}.exception.*.method.*.outcome.{CLIENT_ERROR,INFORMATIONAL,REDIRECTION,SUCCESS}.status.*.uri.*.count
        ),
        sumSeries(
          stats.timers.httpServerRequests.app.{{target}}.exception.*.method.*.outcome.*.status.*.uri.*.count
        )
      ),
      {{interval}},
      'avg'
    )
```

Reference the template in the canary analysis:

```yaml
  analysis:
    metrics:
      - name: "success rate"
        templateRef:
          name: graphite-request-success-rate
        thresholdRange:
          min: 90
        interval: 1min
```

## Graphite authentication

If your Graphite API requires basic authentication, you can create a secret in the same namespace
as the `MetricTemplate` with the basic-auth credentials:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: graphite-basic-auth
  namespace: flagger
data:
  # values under `data` must be base64-encoded
  username: your-user
  password: your-password
```

Then, reference the secret in the `MetricTemplate`:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: my-metric
  namespace: flagger
spec:
  provider:
    type: graphite
    address: http://graphite.monitoring
    secretRef:
      name: graphite-basic-auth
```

## Google Cloud Monitoring (Stackdriver)

Enable Workload Identity on your cluster, create a GCP service account that has read access to the
Cloud Monitoring API and then create an IAM policy binding between the GCP service account and the Flagger
service account on Kubernetes. You can take a look at this [guide](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity).

Annotate the Flagger service account:

```shell
kubectl annotate serviceaccount flagger \
    --namespace <namespace> \
    iam.gke.io/gcp-service-account=<gcp-serviceaccount-name>@<project-id>.iam.gserviceaccount.com
```
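
If the Flagger installation is managed declaratively, the same annotation can be set in the `ServiceAccount` manifest instead of with `kubectl annotate`. A minimal sketch, assuming the service account is named `flagger` as in the command above; the bracketed values are the same placeholders.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: flagger
  namespace: <namespace>
  annotations:
    # binds this Kubernetes service account to the GCP service account
    iam.gke.io/gcp-service-account: <gcp-serviceaccount-name>@<project-id>.iam.gserviceaccount.com
```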

Alternatively, you can download the JSON key and add it to your secret under the key `serviceAccountKey` (this method is not recommended).

Create a secret that contains your project ID (and, if Workload Identity is not enabled on your cluster,
your [service account JSON](https://cloud.google.com/docs/authentication/production#create_service_account)).

```shell
kubectl create secret generic gcloud-sa --from-literal=project=<project-id>
```
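
The same secret can also be written as a manifest. This is a sketch that uses `stringData` so the values can be given in plain text; the commented-out `serviceAccountKey` entry is only relevant when Workload Identity is not enabled, and its content is left as a placeholder.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: gcloud-sa
type: Opaque
stringData:
  project: <project-id>
  # only when Workload Identity is not enabled (not recommended):
  # serviceAccountKey: |
  #   <contents of the service account JSON key>
```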

Then reference the secret in the metric template.

Note: The particular MQL query used here works if [Istio is installed on GKE](https://cloud.google.com/istio/docs/istio-on-gke/installing).

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: bytes-sent
  namespace: test
spec:
  provider:
    type: stackdriver
    secretRef:
      name: gcloud-sa
  query: |
    fetch k8s_container
     | metric 'istio.io/service/server/response_latencies'
     | filter
         (metric.destination_service_name == '{{ service }}-canary'
         && metric.destination_service_namespace == '{{ namespace }}')
     | align delta(1m)
     | every 1m
     | group_by [],
         [value_response_latencies_percentile:
            percentile(value.response_latencies, 99)]
```

The reference for the query language can be found [here](https://cloud.google.com/monitoring/mql/reference).
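
As with the other providers, the template can then be referenced from the canary analysis. A minimal sketch: the metric name and the `max` value are placeholders chosen for illustration, and the unit of the threshold is whatever the MQL query returns.

```yaml
  analysis:
    metrics:
      - name: "p99 latency"      # hypothetical display name
        templateRef:
          name: bytes-sent
          namespace: test
        thresholdRange:
          max: 500               # placeholder threshold
        interval: 1m
```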

## InfluxDB

The InfluxDB provider uses the [Flux](https://docs.influxdata.com/influxdb/v2.0/query-data/get-started/) query language.

Create a secret that contains your authentication token, which can be found in the InfluxDB UI.

```shell
kubectl create secret generic influx-token --from-literal=token=<token>
```

Then reference the secret in the metric template.

Note: The particular Flux query used here works if [Istio is installed on GKE](https://cloud.google.com/istio/docs/istio-on-gke/installing).

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found
  namespace: test
spec:
  provider:
    type: influxdb
    secretRef:
      name: influx-token
  query: |
    from(bucket: "default")
      |> range(start: -2h)
      |> filter(fn: (r) => r["_measurement"] == "istio_requests_total")
      |> filter(fn: (r) => r["destination_workload_namespace"] == "{{ namespace }}")
      |> filter(fn: (r) => r["destination_workload"] == "{{ target }}")
      |> filter(fn: (r) => r["response_code"] == "500")
      |> count()
      |> yield(name: "count")
```
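
The template can then be referenced from the canary analysis in the same way as for the other providers. A minimal sketch: the metric name and the `max` value are placeholders, and since the query above returns a count, the threshold is read as a count as well.

```yaml
  analysis:
    metrics:
      - name: "influxdb check"   # hypothetical display name
        templateRef:
          name: not-found
          namespace: test
        thresholdRange:
          max: 10                # placeholder threshold
        interval: 1m
```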

## Dynatrace

You can create custom metric checks using the Dynatrace provider.

Create a secret with your Dynatrace token:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: dynatrace
  namespace: istio-system
data:
  dynatrace_token: ZHQwYz...
```

Dynatrace metric template example:

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: response-time-95pct
  namespace: istio-system
spec:
  provider:
    type: dynatrace
    address: https://xxxxxxxx.live.dynatrace.com
    secretRef:
      name: dynatrace
  query: |
    builtin:service.response.time:filter(eq(dt.entity.service,SERVICE-ABCDEFG0123456789)):percentile(95)
```

Reference the template in the canary analysis:

```yaml
  analysis:
    metrics:
      - name: "response-time-95pct"
        templateRef:
          name: response-time-95pct
          namespace: istio-system
        thresholdRange:
          max: 1000
        interval: 1m
```