Update docs for Flagger v1beta1 API

This commit is contained in:
stefanprodan 2020-02-26 10:52:25 +02:00
parent 8f12128aaf
commit 0e81b5f4d2
10 changed files with 1090 additions and 1039 deletions

View File

@ -17,12 +17,12 @@ contribution.
## Chat
The project uses Slack: To join the conversation, simply join the
[Weave community](https://slack.weave.works/) Slack workspace.
[Weave community](https://slack.weave.works/) Slack workspace #flagger channel.
## Getting Started
- Fork the repository on GitHub
- If you want to contribute as a developer, continue reading this document for further instructions
- If you want to contribute as a developer, read [Flagger Development Guide](https://docs.flagger.app/dev-guide)
- If you have questions, concerns, get stuck or need a hand, let us know
on the Slack channel. We are happy to help and look forward to having
you as part of the team, in whatever capacity.
@ -59,7 +59,7 @@ get asked to resubmit the PR or divide the changes into more than one PR.
### Format of the Commit Message
For Flux we prefer the following rules for good commit messages:
For Flagger we prefer the following rules for good commit messages:
- Limit the subject to 50 characters and write as the continuation
of the sentence "If applied, this commit will ..."
@ -69,4 +69,4 @@ For Flux we prefer the following rules for good commit messages:
The [following article](https://chris.beams.io/posts/git-commit/#seven-rules)
has some more helpful advice on documenting your work.
This doc is adapted from the [Weaveworks Flux](https://github.com/weaveworks/flux/blob/master/CONTRIBUTING.md)
This doc is adapted from [FluxCD](https://github.com/fluxcd/flux/blob/master/CONTRIBUTING.md).

View File

@ -87,7 +87,7 @@ spec:
kind: HorizontalPodAutoscaler
name: podinfo
service:
# service name (optional)
# service name (defaults to targetRef.name)
name: podinfo
# ClusterIP port number
port: 9898
@ -95,6 +95,9 @@ spec:
targetPort: 9898
# port name can be http or grpc (default http)
portName: http
# add all the other container ports
# to the ClusterIP services (default false)
portDiscovery: true
# HTTP match conditions (optional)
match:
- uri:
@ -118,36 +121,57 @@ spec:
# canary increment step
# percentage (0-100)
stepWeight: 5
# Istio Prometheus checks
# validation (optional)
metrics:
# builtin checks
- name: request-success-rate
# builtin Prometheus check
# minimum req success rate (non 5xx responses)
# percentage (0-100)
threshold: 99
interval: 1m
- name: request-duration
# builtin Prometheus check
# maximum req duration P99
# milliseconds
threshold: 500
interval: 30s
# custom check
- name: "kafka lag"
threshold: 100
query: |
avg_over_time(
kafka_consumergroup_lag{
consumergroup=~"podinfo-consumer-.*",
topic="podinfo"
}[1m]
)
- name: "database connections"
# custom Prometheus check
templateRef:
name: db-connections
thresholdRange:
min: 2
max: 100
interval: 1m
# testing (optional)
webhooks:
- name: load-test
- name: "conformance test"
type: pre-rollout
url: http://flagger-helmtester.test/
timeout: 5m
metadata:
type: "helmv3"
cmd: "test run podinfo -n test"
- name: "load test"
type: rollout
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
cmd: "hey -z 1m -q 10 -c 2 http://podinfo.test:9898/"
# alerting (optional)
alerts:
- name: "dev team Slack"
severity: error
providerRef:
name: dev-slack
namespace: flagger
- name: "qa team Discord"
severity: warn
providerRef:
name: qa-discord
- name: "on-call MS Teams"
severity: info
providerRef:
name: on-call-msteams
```
For more details on how the canary analysis and promotion works please [read the docs](https://docs.flagger.app/how-it-works).

View File

@ -15,6 +15,8 @@
## Usage
* [Deployment Strategies](usage/deployment-strategies.md)
* [Metrics Analysis](usage/metrics.md)
* [Webhooks](usage/webhooks.md)
* [Alerting](usage/alerting.md)
* [Monitoring](usage/monitoring.md)

View File

@ -1,54 +1,70 @@
# Development guide
# Flagger Development Guide
This document describes how to build, test and run Flagger from source.
## Setup dev environment
### Setup dev environment
Flagger is written in Go and uses Go modules for dependency management.
On your dev machine install the following tools:
* go >= 1.13
* git >= 2.20
* bash >= 5.0
* make >= 3.81
* kubectl >= 1.16
* kustomize >= 3.5
* helm >= 3.0
* docker >= 19.03
* go >= 1.13
* git >= 2.20
* bash >= 5.0
* make >= 3.81
* kubectl >= 1.16
* kustomize >= 3.5
* helm >= 3.0
* docker >= 19.03
You'll also need a Kubernetes cluster for testing Flagger.
You can use Minikube, Kind, Docker desktop or any remote cluster
(AKS/EKS/GKE/etc) Kubernetes version 1.14 or newer.
You'll also need a Kubernetes cluster for testing Flagger. You can use Minikube, Kind, Docker desktop or any remote cluster \(AKS/EKS/GKE/etc\) Kubernetes version 1.14 or newer.
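For example, a local cluster can be created with Kind (the cluster name here is arbitrary):

```bash
kind create cluster --name flagger-dev
kubectl cluster-info --context kind-flagger-dev
```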
To start contributing to Flagger, fork the [repository](https://github.com/weaveworks/flagger) on GitHub.
## Build
To start contributing to Flagger, fork the repository and clone it locally:
Create a dir inside your `GOPATH`:
```bash
git clone https://github.com/<YOUR-USERNAME>/flagger
mkdir -p $GOPATH/src/github.com/weaveworks
```
Clone your fork:
```bash
cd $GOPATH/src/github.com/weaveworks
git clone https://github.com/YOUR_USERNAME/flagger
cd flagger
```
Set Flagger repository as upstream:
```bash
git remote add upstream https://github.com/weaveworks/flagger.git
```
Sync your fork regularly to keep it up-to-date with upstream:
```bash
git fetch upstream
git checkout master
git merge upstream/master
```
### Build
Download Go modules:
```bash
go mod download
```
Build Flagger binary:
```bash
CGO_ENABLED=0 go build -o ./bin/flagger ./cmd/flagger/
```
Build Flagger container image:
```bash
make build
```
## Unit testing
Make a change to the source code and run the linter and unit tests:
Run unit tests:
```bash
make test
@ -66,9 +82,10 @@ If you made changes to `pkg/apis` regenerate Kubernetes client sets with:
./hack/update-codegen.sh
```
## Manual testing
### Manual testing
Install a service mesh and/or an ingress controller on your cluster and deploy Flagger using one of the install options [listed here](https://docs.flagger.app/install/flagger-install-on-kubernetes).
Install a service mesh and/or an ingress controller on your cluster and deploy Flagger
using one of the install options [listed here](https://docs.flagger.app/install/flagger-install-on-kubernetes).
If you made changes to the CRDs, apply your local copy with:
@ -76,7 +93,7 @@ If you made changes to the CRDs, apply your local copy with:
kubectl apply -f artifacts/flagger/crd.yaml
```
Shutdown the Flagger instance installed on your cluster \(replace the namespace with your mesh/ingress one\):
Shutdown the Flagger instance installed on your cluster (replace the namespace with your mesh/ingress one):
```bash
kubectl -n istio-system scale deployment/flagger --replicas=0
@ -112,9 +129,9 @@ kubectl -n istio-system set image deployment/flagger flagger=<YOUR-DOCKERHUB-USE
kubectl -n istio-system scale deployment/flagger --replicas=1
```
Now you can use one of the [tutorials](dev-guide.md) to manually test your changes.
Now you can use one of the [tutorials](https://docs.flagger.app/) to manually test your changes.
## Integration testing
### Integration testing
Flagger end-to-end tests can be run locally with [Kubernetes Kind](https://github.com/kubernetes-sigs/kind).
@ -155,14 +172,14 @@ Run the Linkerd e2e tests:
./test/e2e-linkerd-tests.sh
```
For each service mesh and ingress controller there is a dedicated e2e test suite, choose one that matches your changes from this [list](https://github.com/weaveworks/flagger/tree/master/test).
For each service mesh and ingress controller there is a dedicated e2e test suite,
choose one that matches your changes from this [list](https://github.com/weaveworks/flagger/tree/master/test).
When you open a pull request on the Flagger repo, the unit and integration tests will be run in CI.
## Release
To release a new Flagger version \(e.g. `2.0.0`\) follow these steps:
### Release
To release a new Flagger version (e.g. `2.0.0`), follow these steps:
* create a branch `git checkout -b prep-2.0.0`
* set the version in code and manifests `TAG=2.0.0 make version-set`
* commit changes and merge PR
@ -170,10 +187,8 @@ To release a new Flagger version \(e.g. `2.0.0`\) follow these steps:
* tag master `make release`
After the tag has been pushed to GitHub, the CI release pipeline does the following:
* creates a GitHub release
* pushes the Flagger binary and change log to GitHub release
* pushes the Flagger container image to Docker Hub
* pushes the Helm chart to github-pages branch
* GitHub pages publishes the new chart version on the Helm repository

View File

@ -1,159 +1,32 @@
# FAQ
# Frequently asked questions
## Deployment Strategies
### Deployment Strategies
**Which deployment strategies are supported by Flagger?**
Flagger can run automated application analysis, promotion and rollback for the following deployment strategies:
* Canary \(progressive traffic shifting\)
* Istio, Linkerd, App Mesh, NGINX, Contour, Gloo
* Canary \(traffic mirroring\)
* Istio
* A/B Testing \(HTTP headers and cookies traffic routing\)
* Istio, App Mesh, NGINX, Contour
* Blue/Green \(traffic switch\)
* Kubernetes CNI, Istio, Linkerd, App Mesh, NGINX, Contour, Gloo
For Canary deployments and A/B testing you'll need a Layer 7 traffic management solution like a service mesh or an ingress controller. For Blue/Green deployments no service mesh or ingress controller is required.
Flagger implements the following deployment strategies:
* [Canary Release](usage/deployment-strategies.md#canary-release)
* [A/B Testing](usage/deployment-strategies.md#a-b-testing)
* [Blue/Green](usage/deployment-strategies.md#blue-green-deployments)
* [Blue/Green Mirroring](usage/deployment-strategies.md#blue-green-with-traffic-mirroring)
**When should I use A/B testing instead of progressive traffic shifting?**
For frontend applications that require session affinity you should use HTTP headers or cookies match conditions to ensure a set of users will stay on the same version for the whole duration of the canary analysis. A/B testing is supported by Istio and NGINX only.
Istio example:
```yaml
canaryAnalysis:
# schedule interval (default 60s)
interval: 1m
# total number of iterations
iterations: 10
# max number of failed iterations before rollback
threshold: 2
# canary match condition
match:
- headers:
x-canary:
regex: ".*insider.*"
- headers:
cookie:
regex: "^(.*?;)?(canary=always)(;.*)?$"
```
App Mesh example:
```yaml
canaryAnalysis:
interval: 1m
threshold: 10
iterations: 2
match:
- headers:
user-agent:
regex: ".*Chrome.*"
```
Note that App Mesh supports a single condition.
Contour example:
```yaml
canaryAnalysis:
interval: 1m
threshold: 10
iterations: 2
match:
- headers:
user-agent:
prefix: "Chrome"
```
Note that Contour does not support regex; you can use prefix, suffix or exact matching.
NGINX example:
```yaml
canaryAnalysis:
interval: 1m
threshold: 10
iterations: 2
match:
- headers:
x-canary:
exact: "insider"
- headers:
cookie:
exact: "canary"
```
Note that the NGINX ingress controller supports only exact matching for a single header and the cookie value is set to `always`.
The above configurations will route users with the x-canary header or canary cookie to the canary instance during analysis:
```bash
curl -H 'X-Canary: insider' http://app.example.com
curl -b 'canary=always' http://app.example.com
```
For frontend applications that require session affinity you should use HTTP headers or cookies match conditions
to ensure a set of users will stay on the same version for the whole duration of the canary analysis.
**Can I use Flagger to manage applications that live outside of a service mesh?**
For applications that are not deployed on a service mesh, Flagger can orchestrate Blue/Green style deployments with Kubernetes L4 networking.
Blue/Green example:
```yaml
apiVersion: flagger.app/v1alpha3
kind: Canary
spec:
provider: kubernetes
canaryAnalysis:
interval: 30s
threshold: 2
iterations: 10
metrics:
- name: request-success-rate
threshold: 99
interval: 1m
- name: request-duration
threshold: 500
interval: 30s
webhooks:
- name: load-test
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
type: cmd
cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"
```
The above configuration will run an analysis for five minutes. Flagger starts the load test for the canary service \(green version\) and checks the Prometheus metrics every 30 seconds. If the analysis result is positive, Flagger will promote the canary \(green version\) to primary \(blue version\).
For applications that are not deployed on a service mesh, Flagger can orchestrate Blue/Green style deployments
with Kubernetes L4 networking.
**When can I use traffic mirroring?**
Traffic Mirroring is a pre-stage in a Canary \(progressive traffic shifting\) or Blue/Green deployment strategy. Traffic mirroring will copy each incoming request, sending one request to the primary and one to the canary service. The response from the primary is sent back to the user. The response from the canary is discarded. Metrics are collected on both requests so that the deployment will only proceed if the canary metrics are healthy.
Traffic mirroring can be used for Blue/Green deployment strategy or a pre-stage in a Canary release.
Traffic mirroring will copy each incoming request, sending one request to the primary and one to the canary service.
Mirroring should be used for requests that are **idempotent** or capable of being processed twice (once by the primary and once by the canary).
Mirroring is supported by Istio only.
In Istio, mirrored requests have `-shadow` appended to the `Host` \(HTTP\) or `Authority` \(HTTP/2\) header; for example requests to `podinfo.test` that are mirrored will be reported in telemetry with a destination host `podinfo.test-shadow`.
Mirroring must only be used for requests that are **idempotent** or capable of being processed twice \(once by the primary and once by the canary\). Reads are idempotent. Before using mirroring on requests that may be writes, you should consider what will happen if a write is duplicated and handled by the primary and canary.
To use mirroring, set `spec.canaryAnalysis.mirror` to `true`. Example for traffic shifting:
```yaml
apiVersion: flagger.app/v1alpha3
kind: Canary
spec:
provider: istio
canaryAnalysis:
mirror: true
interval: 30s
stepWeight: 20
maxWeight: 50
```
## Kubernetes services
### Kubernetes services
**How is an application exposed inside the cluster?**
@ -181,23 +54,20 @@ spec:
portName: http
```
If the `service.name` is not specified, then `targetRef.name` is used for the apex domain and canary/primary services name prefix. You should treat the service name as an immutable field; changing it could result in routing conflicts.
If the `service.name` is not specified, then `targetRef.name` is used for the apex domain and canary/primary services name prefix.
You should treat the service name as an immutable field; changing it could result in routing conflicts.
Based on the canary spec service, Flagger generates the following Kubernetes ClusterIP services:
* `<service.name>.<namespace>.svc.cluster.local`
selector `app=<name>-primary`
* `<service.name>-primary.<namespace>.svc.cluster.local`
selector `app=<name>-primary`
* `<service.name>-canary.<namespace>.svc.cluster.local`
selector `app=<name>`
This ensures that traffic coming from a namespace outside the mesh to `podinfo.test:9898` will be routed to the latest stable release of your app.
This ensures that traffic coming from a namespace outside the mesh to `podinfo.test:9898`
will be routed to the latest stable release of your app.
```yaml
apiVersion: v1
@ -243,13 +113,16 @@ spec:
targetPort: http
```
The `podinfo-canary.test:9898` address is available only during the canary analysis and can be used for conformance testing or load testing.
The `podinfo-canary.test:9898` address is available only during the
canary analysis and can be used for conformance testing or load testing.
## Multiple ports
### Multiple ports
**My application listens on multiple ports, how can I expose them inside the cluster?**
If port discovery is enabled, Flagger scans the deployment spec and extracts the container ports excluding the port specified in the canary service and Envoy sidecar ports. These ports will be used when generating the ClusterIP services.
If port discovery is enabled, Flagger scans the deployment spec and extracts the container
ports excluding the port specified in the canary service and Envoy sidecar ports.
These ports will be used when generating the ClusterIP services.
For a deployment that exposes two ports:
@ -291,7 +164,7 @@ spec:
Both port `8080` and `9090` will be added to the ClusterIP services.
## Label selectors
### Label selectors
**What label selectors are supported by Flagger?**
@ -312,7 +185,8 @@ spec:
app: podinfo
```
Besides `app` Flagger supports `name` and `app.kubernetes.io/name` selectors. If you use a different convention you can specify your label with the `-selector-labels` flag.
Besides `app` Flagger supports `name` and `app.kubernetes.io/name` selectors. If you use a different
convention you can specify your label with the `-selector-labels` flag.
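For example, assuming pods are labeled with `my-app-label`, the flag can be passed in the Flagger container args (a sketch, not a full manifest):

```yaml
containers:
  - name: flagger
    args:
      - -selector-labels=my-app-label
```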
**Are pod affinity and anti-affinity supported?**
@ -347,13 +221,131 @@ spec:
topologyKey: kubernetes.io/hostname
```
## Istio routing
### Metrics
**How does Flagger measure the request success rate and duration?**
Flagger measures the request success rate and duration using Prometheus queries.
**HTTP requests success rate percentage**
Spec:
```yaml
canaryAnalysis:
metrics:
- name: request-success-rate
# minimum req success rate (non 5xx responses)
# percentage (0-100)
threshold: 99
interval: 1m
```
Istio query:
```javascript
sum(
rate(
istio_requests_total{
reporter="destination",
destination_workload_namespace=~"$namespace",
destination_workload=~"$workload",
response_code!~"5.*"
}[$interval]
)
)
/
sum(
rate(
istio_requests_total{
reporter="destination",
destination_workload_namespace=~"$namespace",
destination_workload=~"$workload"
}[$interval]
)
)
```
Envoy query (App Mesh, Contour or Gloo):
```javascript
sum(
rate(
envoy_cluster_upstream_rq{
kubernetes_namespace="$namespace",
kubernetes_pod_name=~"$workload",
envoy_response_code!~"5.*"
}[$interval]
)
)
/
sum(
rate(
envoy_cluster_upstream_rq{
kubernetes_namespace="$namespace",
kubernetes_pod_name=~"$workload"
}[$interval]
)
)
```
**HTTP requests milliseconds duration P99**
Spec:
```yaml
canaryAnalysis:
metrics:
- name: request-duration
# maximum req duration P99
# milliseconds
threshold: 500
interval: 1m
```
Istio query:
```javascript
histogram_quantile(0.99,
sum(
irate(
istio_request_duration_seconds_bucket{
reporter="destination",
destination_workload=~"$workload",
destination_workload_namespace=~"$namespace"
}[$interval]
)
) by (le)
)
```
Envoy query (App Mesh, Contour or Gloo):
```javascript
histogram_quantile(0.99,
sum(
irate(
envoy_cluster_upstream_rq_time_bucket{
kubernetes_pod_name=~"$workload",
kubernetes_namespace=~"$namespace"
}[$interval]
)
) by (le)
)
```
> **Note** that the metric interval should be lower than or equal to the control loop interval.
### Istio routing
**How does Flagger interact with Istio?**
Flagger creates an Istio Virtual Service and Destination Rules based on the Canary service spec. The service configuration lets you expose an app inside or outside the mesh. You can also define traffic policies, HTTP match conditions, URI rewrite rules, CORS policies, timeout and retries.
Flagger creates an Istio Virtual Service and Destination Rules based on the Canary service spec.
The service configuration lets you expose an app inside or outside the mesh.
You can also define traffic policies, HTTP match conditions, URI rewrite rules, CORS policies, timeout and retries.
The following spec exposes the `frontend` workload inside the mesh on `frontend.test.svc.cluster.local:9898` and outside the mesh on `frontend.example.com`. You'll have to specify an Istio ingress gateway for external hosts.
The following spec exposes the `frontend` workload inside the mesh on `frontend.test.svc.cluster.local:9898`
and outside the mesh on `frontend.example.com`. You'll have to specify an Istio ingress gateway for external hosts.
```yaml
apiVersion: flagger.app/v1alpha3
@ -487,9 +479,11 @@ spec:
mode: DISABLE
```
Flagger keeps in sync the virtual service and destination rules with the canary service spec. Any direct modification to the virtual service spec will be overwritten.
Flagger keeps in sync the virtual service and destination rules with the canary service spec.
Any direct modification to the virtual service spec will be overwritten.
To expose a workload inside the mesh on `http://backend.test.svc.cluster.local:9898`, the service spec can contain only the container port and the traffic policy:
To expose a workload inside the mesh on `http://backend.test.svc.cluster.local:9898`,
the service spec can contain only the container port and the traffic policy:
```yaml
apiVersion: flagger.app/v1alpha3
@ -530,13 +524,15 @@ spec:
app: backend-primary
```
Flagger works for user facing apps exposed outside the cluster via an ingress gateway and for backend HTTP APIs that are accessible only from inside the mesh.
Flagger works for user facing apps exposed outside the cluster via an ingress gateway
and for backend HTTP APIs that are accessible only from inside the mesh.
## Istio Ingress Gateway
### Istio Ingress Gateway
**How can I expose multiple canaries on the same external domain?**
Assume you have two apps: one that serves the main website and one that serves the REST API. For each app you can define a canary object as:
Assume you have two apps: one that serves the main website and one that serves the REST API.
For each app you can define a canary object as:
```yaml
apiVersion: flagger.app/v1alpha3
@ -574,11 +570,13 @@ spec:
uri: /
```
Based on the above configuration, Flagger will create two virtual services bound to the same ingress gateway and external host. Istio Pilot will [merge](https://istio.io/help/ops/traffic-management/deploy-guidelines/#multiple-virtual-services-and-destination-rules-for-the-same-host) the two services and the website rule will be moved to the end of the list in the merged configuration.
Based on the above configuration, Flagger will create two virtual services bound to the same ingress gateway and external host.
Istio Pilot will [merge](https://istio.io/help/ops/traffic-management/deploy-guidelines/#multiple-virtual-services-and-destination-rules-for-the-same-host)
the two services and the website rule will be moved to the end of the list in the merged configuration.
Note that host merging only works if the canaries are bound to an ingress gateway other than the `mesh` gateway.
## Istio Mutual TLS
### Istio Mutual TLS
**How can I enable mTLS for a canary?**
@ -633,4 +631,3 @@ spec:
ports:
- number: 80
```

View File

@ -1,8 +1,11 @@
# How it works
[Flagger](https://github.com/weaveworks/flagger) takes a Kubernetes deployment and optionally a horizontal pod autoscaler \(HPA\) and creates a series of objects \(Kubernetes deployments, ClusterIP services, virtual service, traffic split or ingress\) to drive the canary analysis and promotion.
[Flagger](https://github.com/weaveworks/flagger) takes a Kubernetes deployment and optionally
a horizontal pod autoscaler (HPA) and creates a series of objects
(Kubernetes deployments, ClusterIP services, virtual service, traffic split or ingress)
to drive the canary analysis and promotion.
## Canary Custom Resource
### Canary Custom Resource
For a deployment named _podinfo_, a canary promotion can be defined using Flagger's custom resource:
@ -11,71 +14,57 @@ apiVersion: flagger.app/v1alpha3
kind: Canary
metadata:
name: podinfo
namespace: test
spec:
# service mesh provider (optional)
# can be: kubernetes, istio, linkerd, appmesh, nginx, gloo, supergloo
provider: linkerd
# deployment reference
targetRef:
apiVersion: apps/v1
kind: Deployment
name: podinfo
# the maximum time in seconds for the canary deployment
# to make progress before it is rolled back (default 600s)
progressDeadlineSeconds: 60
# HPA reference (optional)
autoscalerRef:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
name: podinfo
service:
# service name (optional)
name: podinfo
# ClusterIP port number
port: 9898
# ClusterIP port name can be http or grpc (default http)
portName: http
# container port number or name (optional)
targetPort: 9898
# add all the other container ports
# to the ClusterIP services (default false)
portDiscovery: false
# promote the canary without analysing it (default false)
skipAnalysis: false
# define the canary analysis timing and KPIs
portDiscovery: true
canaryAnalysis:
# schedule interval (default 60s)
interval: 1m
# max number of failed metric checks before rollback
threshold: 10
# max traffic percentage routed to canary
# percentage (0-100)
maxWeight: 50
# canary increment step
# percentage (0-100)
stepWeight: 5
# Prometheus checks
metrics:
- name: request-success-rate
# minimum req success rate (non 5xx responses)
# percentage (0-100)
threshold: 99
interval: 1m
- name: request-duration
# maximum req duration P99
# milliseconds
threshold: 500
interval: 30s
# testing (optional)
- name: request-success-rate
threshold: 99
interval: 1m
- name: request-duration
threshold: 500
interval: 1m
webhooks:
- name: load-test
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
cmd: "hey -z 1m -q 10 -c 2 http://podinfo.test:9898/"
cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"
```
Based on the above configuration, Flagger generates the following Kubernetes objects:
* `deployment/<targetRef.name>-primary`
* `hpa/<autoscalerRef.name>-primary`
The primary deployment is considered the stable release of your app; by default all traffic is routed to this version
and the target deployment is scaled to zero.
Flagger will detect changes to the target deployment (including secrets and configmaps) and will perform a
canary analysis before promoting the new version as primary.
The autoscaler reference is optional; when specified, Flagger will pause the traffic increase while the
target and primary deployments are scaled up or down. HPA can help reduce the resource usage during the canary analysis.
If the target deployment uses secrets and/or configmaps, Flagger will create a copy of each object using the `-primary`
prefix and will reference these objects in the primary deployment. You can disable the secrets/configmaps tracking
with the `-enable-config-tracking=false` command flag in the Flagger deployment manifest under containers args
or by setting `--set configTracking.enabled=false` when installing Flagger with Helm.
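For example, when installing with Helm (a sketch; the usual install values are omitted):

```bash
helm upgrade -i flagger flagger/flagger \
  --namespace=istio-system \
  --set configTracking.enabled=false
```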
**Note** that the target deployment must have a single label selector in the format `app: <DEPLOYMENT-NAME>`:
```yaml
@ -93,13 +82,30 @@ spec:
app: podinfo
```
Besides `app` Flagger supports `name` and `app.kubernetes.io/name` selectors. If you use a different convention you can specify your label with the `-selector-labels=my-app-label` command flag in the Flagger deployment manifest under containers args or by setting `--set selectorLabels=my-app-label` when installing Flagger with Helm.
Besides `app` Flagger supports `name` and `app.kubernetes.io/name` selectors.
If you use a different convention you can specify your label with
the `-selector-labels=my-app-label` command flag in the Flagger deployment manifest under containers args
or by setting `--set selectorLabels=my-app-label` when installing Flagger with Helm.
The target deployment should expose a TCP port that will be used by Flagger to create the ClusterIP Services. The container port from the target deployment should match the `service.port` or `service.targetPort`.
The target deployment should expose a TCP port that will be used by Flagger to create the ClusterIP Services.
The container port from the target deployment should match the `service.port` or `service.targetPort`.
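For instance, with `port: 9898` and `targetPort: 9898`, the pod template should expose a matching container port:

```yaml
ports:
  - name: http
    containerPort: 9898
    protocol: TCP
```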
## Canary status
Based on the canary spec service, Flagger generates the following Kubernetes ClusterIP services:
Get the current status of canary deployments cluster wide:
* `<service.name>.<namespace>.svc.cluster.local`
selector `app=<name>-primary`
* `<service.name>-primary.<namespace>.svc.cluster.local`
selector `app=<name>-primary`
* `<service.name>-canary.<namespace>.svc.cluster.local`
selector `app=<name>`
This ensures that traffic to `podinfo.test:9898` will be routed to the latest stable release of your app.
The `podinfo-canary.test:9898` address is available only during the
canary analysis and can be used for conformance testing or load testing.
### Canary status
You can use kubectl to get the current status of canary deployments cluster wide:
```bash
kubectl get canaries --all-namespaces
@ -110,7 +116,7 @@ prod frontend Succeeded 0 2019-06-30T16:15:07Z
prod backend Failed 0 2019-06-30T17:05:07Z
```
The status condition reflects the last know state of the canary analysis:
The status condition reflects the last known state of the canary analysis:
```bash
kubectl -n test get canary/podinfo -oyaml | awk '/status/,0'
@ -134,7 +140,10 @@ status:
type: Promoted
```
The `Promoted` status condition can have one of the following reasons: Initialized, Waiting, Progressing, Promoting, Finalising, Succeeded or Failed. A failed canary will have the promoted status set to `false`, the reason to `failed` and the last applied spec will be different to the last promoted one.
The `Promoted` status condition can have one of the following reasons:
Initialized, Waiting, Progressing, Promoting, Finalising, Succeeded or Failed.
A failed canary will have the promoted status set to `false`,
the reason to `failed` and the last applied spec will be different to the last promoted one.
Wait for a successful rollout:
@ -162,747 +171,47 @@ kubectl wait canary/podinfo --for=condition=promoted --timeout=5m
kubectl get canary/podinfo | grep Succeeded
```
## Canary Stages
### Canary Analysis
![Flagger Canary Stages](https://raw.githubusercontent.com/weaveworks/flagger/master/docs/diagrams/flagger-canary-steps.png)
A canary deployment is triggered by changes in any of the following objects:
* Deployment PodSpec \(container image, command, ports, env, resources, etc\)
* ConfigMaps mounted as volumes or mapped to environment variables
* Secrets mounted as volumes or mapped to environment variables
Gated canary promotion stages:
* scan for canary deployments
* check primary and canary deployment status
* halt advancement if a rolling update is underway
* halt advancement if pods are unhealthy
* call confirm-rollout webhooks and check results
* halt advancement if any hook returns a non HTTP 2xx result
* call pre-rollout webhooks and check results
* halt advancement if any hook returns a non HTTP 2xx result
* increment the failed checks counter
* increase canary traffic weight percentage from 0% to 5% \(step weight\)
* call rollout webhooks and check results
* check canary HTTP request success rate and latency
* halt advancement if any metric is under the specified threshold
* increment the failed checks counter
* check if the number of failed checks reached the threshold
* route all traffic to primary
* scale to zero the canary deployment and mark it as failed
* call post-rollout webhooks
* post the analysis result to Slack
* wait for the canary deployment to be updated and start over
* increase canary traffic weight by 5% \(step weight\) till it reaches 50% \(max weight\)
* halt advancement if any webhook call fails
* halt advancement while canary request success rate is under the threshold
* halt advancement while canary request duration P99 is over the threshold
* halt advancement while any custom metric check fails
* halt advancement if the primary or canary deployment becomes unhealthy
* halt advancement while canary deployment is being scaled up/down by HPA
* call confirm-promotion webhooks and check results
* halt advancement if any hook returns a non HTTP 2xx result
* promote canary to primary
* copy ConfigMaps and Secrets from canary to primary
* copy canary deployment spec template over primary
* wait for primary rolling update to finish
* halt advancement if pods are unhealthy
* route all traffic to primary
* scale to zero the canary deployment
* mark rollout as finished
* call post-rollout webhooks
* post the analysis result to Slack or MS Teams
* wait for the canary deployment to be updated and start over
## Canary Analysis
The canary analysis runs periodically until it reaches the maximum traffic weight or the failed checks threshold.
The canary analysis defines:
* the type of [deployment strategy](usage/deployment-strategies.md)
* the [metrics](usage/metrics.md) used to validate the canary version
* the [webhooks](usage/webhooks.md) used for conformance testing, load testing and manual gating
* the [alerting settings](usage/alerting.md)
Spec:
```yaml
canaryAnalysis:
# schedule interval (default 60s)
interval: 1m
interval:
# max number of failed metric checks before rollback
threshold: 10
threshold:
# max traffic percentage routed to canary
# percentage (0-100)
maxWeight: 50
maxWeight:
# canary increment step
# percentage (0-100)
stepWeight: 2
# deploy straight to production without
# the metrics and webhook checks
skipAnalysis: false
```
The above analysis, if it succeeds, will run for 25 minutes while validating the HTTP metrics and webhooks every minute. You can determine the minimum time that it takes to validate and promote a canary deployment using this formula:
```text
interval * (maxWeight / stepWeight)
```
And the time it takes for a canary to be rolled back when the metrics or webhook checks are failing:
```text
interval * threshold
```
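With the example values above (interval 1m, maxWeight 50, stepWeight 2, threshold 10) that comes to:

```text
1m * (50 / 2) = 25m   # minimum time to validate and promote
1m * 10       = 10m   # maximum time before a failing canary is rolled back
```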
In emergency cases, you may want to skip the analysis phase and ship changes directly to production. At any time you can set `spec.skipAnalysis: true`. When skip analysis is enabled, Flagger checks if the canary deployment is healthy and promotes it without analysing it. If an analysis is underway, Flagger cancels it and runs the promotion.
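One way to toggle this without editing the full manifest is a merge patch (canary name and namespace taken from the examples above):

```bash
kubectl -n test patch canary/podinfo \
  --type=merge -p '{"spec":{"skipAnalysis":true}}'
```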
## A/B Testing
Besides weighted routing, Flagger can be configured to route traffic to the canary based on HTTP match conditions. In an A/B testing scenario, you'll be using HTTP headers or cookies to target a certain segment of your users. This is particularly useful for frontend applications that require session affinity.
You can enable A/B testing by specifying the HTTP match conditions and the number of iterations:
```yaml
canaryAnalysis:
# schedule interval (default 60s)
interval: 1m
stepWeight:
# total number of iterations
iterations: 10
# max number of failed iterations before rollback
threshold: 2
# canary match condition
# used for A/B Testing and Blue/Green
iterations:
# canary match conditions
# used for A/B Testing
match:
- headers:
user-agent:
regex: "^(?!.*Chrome).*Safari.*"
- headers:
cookie:
regex: "^(.*?;)?(user=test)(;.*)?$"
```
If Flagger finds an HTTP match condition, it will ignore the `maxWeight` and `stepWeight` settings.
The above configuration will run an analysis for ten minutes targeting the Safari users and those that have a test cookie. You can determine the minimum time that it takes to validate and promote a canary deployment using this formula:
```text
interval * iterations
```
And the time it takes for a canary to be rolled back when the metrics or webhook checks are failing:
```text
interval * threshold
```
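With the values above (interval 1m, iterations 10, threshold 2):

```text
1m * 10 = 10m   # analysis duration before promotion
1m * 2  = 2m    # rollback time on failed checks
```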
Make sure that the analysis threshold is lower than the number of iterations.
## Blue/Green deployments
For applications that are not deployed on a service mesh, Flagger can orchestrate blue/green style deployments with Kubernetes L4 networking. When using Istio you have the option to mirror traffic between blue and green.
You can use the blue/green deployment strategy by replacing `stepWeight/maxWeight` with `iterations` in the `canaryAnalysis` spec:
```yaml
canaryAnalysis:
# schedule interval (default 60s)
interval: 1m
# total number of iterations
iterations: 10
# max number of failed iterations before rollback
threshold: 2
# Traffic shadowing (compatible with Istio only)
mirror: true
```
With the above configuration Flagger will run conformance and load tests on the canary pods for ten minutes. If the metrics analysis succeeds, live traffic will be switched from the old version to the new one when the canary is promoted.
The blue/green deployment strategy is supported for all service mesh providers.
Blue/Green rollout steps for service mesh:
* scale up the canary \(green\)
* run conformance tests for the canary pods
* run load tests and metric checks for the canary pods
* route traffic to canary
* promote canary spec over primary \(blue\)
* wait for primary rollout
* route traffic to primary
* scale down canary
After the analysis finishes, the traffic is routed to the canary \(green\) before triggering the primary \(blue\) rolling update; this ensures a smooth transition to the new version, avoiding dropped in-flight requests during the Kubernetes deployment rollout.
## HTTP Metrics
The canary analysis is using the following Prometheus queries:
**HTTP requests success rate percentage**
Spec:
```yaml
canaryAnalysis:
- # HTTP header
# key performance indicators
metrics:
- name: request-success-rate
# minimum req success rate (non 5xx responses)
# percentage (0-100)
threshold: 99
interval: 1m
```
Istio query:
```javascript
sum(
rate(
istio_requests_total{
reporter="destination",
destination_workload_namespace=~"$namespace",
destination_workload=~"$workload",
response_code!~"5.*"
}[$interval]
)
)
/
sum(
rate(
istio_requests_total{
reporter="destination",
destination_workload_namespace=~"$namespace",
destination_workload=~"$workload"
}[$interval]
)
)
```
Envoy query \(App Mesh, Contour or Gloo\):
```javascript
sum(
rate(
envoy_cluster_upstream_rq{
kubernetes_namespace="$namespace",
kubernetes_pod_name=~"$workload",
envoy_response_code!~"5.*"
}[$interval]
)
)
/
sum(
rate(
envoy_cluster_upstream_rq{
kubernetes_namespace="$namespace",
kubernetes_pod_name=~"$workload"
}[$interval]
)
)
```
**HTTP requests milliseconds duration P99**
Spec:
```yaml
canaryAnalysis:
metrics:
- name: request-duration
# maximum req duration P99
# milliseconds
threshold: 500
interval: 1m
```
Istio query:
```javascript
histogram_quantile(0.99,
sum(
irate(
istio_request_duration_seconds_bucket{
reporter="destination",
destination_workload=~"$workload",
destination_workload_namespace=~"$namespace"
}[$interval]
)
) by (le)
)
```
Envoy query \(App Mesh, Contour or Gloo\):
```javascript
histogram_quantile(0.99,
sum(
irate(
envoy_cluster_upstream_rq_time_bucket{
kubernetes_pod_name=~"$workload",
kubernetes_namespace=~"$namespace"
}[$interval]
)
) by (le)
)
```
> **Note** that the metric interval should be lower than or equal to the control loop interval.
## Custom Metrics
The canary analysis can be extended with custom Prometheus queries.
```yaml
canaryAnalysis:
threshold: 1
maxWeight: 50
stepWeight: 5
metrics:
- name: "404s percentage"
threshold: 5
query: |
100 - sum(
rate(
istio_requests_total{
reporter="destination",
destination_workload_namespace="test",
destination_workload="podinfo",
response_code!="404"
}[1m]
)
)
/
sum(
rate(
istio_requests_total{
reporter="destination",
destination_workload_namespace="test",
destination_workload="podinfo"
}[1m]
)
) * 100
```
The above configuration validates the canary by checking if the HTTP 404 req/sec percentage is below 5 percent of the total traffic. If the 404s rate reaches the 5% threshold, then the canary fails.
```yaml
canaryAnalysis:
threshold: 1
maxWeight: 50
stepWeight: 5
metrics:
- name: "rpc error rate"
threshold: 5
query: |
100 - sum(
rate(
grpc_server_handled_total{
grpc_service="my.TestService",
grpc_code!="OK"
}[1m]
)
)
/
sum(
rate(
grpc_server_started_total{
grpc_service="my.TestService"
}[1m]
)
) * 100
```
The above configuration validates the canary by checking if the percentage of non-OK gRPC req/sec is below 5 percent of the total requests. If the non-OK rate reaches the 5% threshold, then the canary fails.
When specifying a query, Flagger will run the PromQL query and convert the result to float64. Then it compares the query result value with the metric threshold value.
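A minimal sketch of that comparison, assuming a scalar query result (illustrative Go, not Flagger's actual implementation):

```go
package main

import "fmt"

// compare mimics the described behavior: the PromQL result is
// converted to float64 and checked against the metric threshold.
// For custom queries the canary fails once the value reaches the
// threshold (e.g. a 404s percentage crossing 5).
func compare(result, threshold float64) bool {
	return result < threshold // true -> check passed
}

func main() {
	fmt.Println(compare(3.2, 5)) // 404s at 3.2% vs 5% threshold -> pass
	fmt.Println(compare(6.1, 5)) // 6.1% -> fail, failed checks counter increments
}
```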
## Webhooks
The canary analysis can be extended with webhooks. Flagger will call each webhook URL and determine from the response status code \(HTTP 2xx\) if the canary is failing or not.
There are several types of hooks:
* **confirm-rollout** hooks are executed before scaling up the canary deployment and can be used for manual approval.
The rollout is paused until the hook returns a successful HTTP status code.
* **pre-rollout** hooks are executed before routing traffic to canary.
The canary advancement is paused if a pre-rollout hook fails and if the number of failures reaches the
threshold the canary will be rolled back.
* **rollout** hooks are executed during the analysis on each iteration before the metric checks.
If a rollout hook call fails the canary advancement is paused and eventually rolled back.
* **confirm-promotion** hooks are executed before the promotion step.
The canary promotion is paused until the hooks return HTTP 200.
While the promotion is paused, Flagger will continue to run the metrics checks and rollout hooks.
* **post-rollout** hooks are executed after the canary has been promoted or rolled back.
If a post rollout hook fails the error is logged.
* **rollback** hooks are executed while a canary deployment is in either Progressing or Waiting status.
This provides the ability to roll back during analysis or while waiting for a confirmation. If a rollback hook
returns a successful HTTP status code, Flagger will stop the analysis and mark the canary release as failed.
* **event** hooks are executed every time Flagger emits a Kubernetes event. When configured,
every action that Flagger takes during a canary deployment will be sent as JSON via an HTTP POST request.
Spec:
```yaml
canaryAnalysis:
- # metric check
# alerting
alerts:
- # alert provider
# external checks
webhooks:
- name: "start gate"
type: confirm-rollout
url: http://flagger-loadtester.test/gate/approve
- name: "smoke test"
type: pre-rollout
url: http://flagger-helmtester.kube-system/
timeout: 3m
metadata:
type: "helm"
cmd: "test podinfo --cleanup"
- name: "load test"
type: rollout
url: http://flagger-loadtester.test/
timeout: 15s
metadata:
cmd: "hey -z 1m -q 5 -c 2 http://podinfo-canary.test:9898/"
- name: "promotion gate"
type: confirm-promotion
url: http://flagger-loadtester.test/gate/approve
- name: "notify"
type: post-rollout
url: http://telegram.bot:8080/
timeout: 5s
metadata:
some: "message"
- name: "rollback gate"
type: rollback
url: http://flagger-loadtester.test/rollback/check
- name: "send to Slack"
type: event
url: http://event-receiver.notifications/slack
- # hook
```
> **Note** that the sum of all rollout webhook timeouts should be lower than the analysis interval.
Webhook payload \(HTTP POST\):
```javascript
{
"name": "podinfo",
"namespace": "test",
"phase": "Progressing",
"metadata": {
"test": "all",
"token": "16688eb5e9f289f1991c"
}
}
```
Response status codes:
* 200-202 - advance canary by increasing the traffic weight
* timeout or non-2xx - halt advancement and increment failed checks
On a non-2xx response Flagger will include the response body \(if any\) in the failed checks log and Kubernetes events.
Event payload \(HTTP POST\):
```javascript
{
"name": "string (canary name)",
"namespace": "string (canary namespace)",
"phase": "string (canary phase)",
"metadata": {
"eventMessage": "string (canary event message)",
"eventType": "string (canary event type)",
"timestamp": "string (unix timestamp ms)"
}
}
```
The event receiver can create alerts based on the received phase \(possible values: `Initialized`, `Waiting`, `Progressing`, `Promoting`, `Finalising`, `Succeeded` or `Failed`\).
## Load Testing
For workloads that are not receiving constant traffic, Flagger can be configured with a webhook that, when called, will start a load test for the target workload. If the target workload doesn't receive any traffic during the canary analysis, Flagger metric checks will fail with "no values found for metric request-success-rate".
Flagger comes with a load testing service based on [rakyll/hey](https://github.com/rakyll/hey) that generates traffic during analysis when configured as a webhook.
![Flagger Load Testing Webhook](https://raw.githubusercontent.com/weaveworks/flagger/master/docs/diagrams/flagger-load-testing.png)
First you need to deploy the load test runner in a namespace with sidecar injection enabled:
```bash
export REPO=https://raw.githubusercontent.com/weaveworks/flagger/master
kubectl -n test apply -f ${REPO}/artifacts/loadtester/deployment.yaml
kubectl -n test apply -f ${REPO}/artifacts/loadtester/service.yaml
```
Or by using Helm:
```bash
helm repo add flagger https://flagger.app
helm upgrade -i flagger-loadtester flagger/loadtester \
--namespace=test \
--set cmd.timeout=1h
```
When deployed, the load tester API will be available at `http://flagger-loadtester.test/`.
Now you can add webhooks to the canary analysis spec:
```yaml
webhooks:
- name: load-test-get
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
type: cmd
cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"
- name: load-test-post
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
type: cmd
cmd: "hey -z 1m -q 10 -c 2 -m POST -d '{test: 2}' http://podinfo-canary.test:9898/echo"
```
When the canary analysis starts, Flagger will call the webhooks and the load tester will run the `hey` commands in the background, if they are not already running. This will ensure that during the analysis, the `podinfo-canary.test` service will receive a steady stream of GET and POST requests.
If your workload is exposed outside the mesh you can point `hey` to the public URL and use HTTP2.
```yaml
webhooks:
- name: load-test-get
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
type: cmd
cmd: "hey -z 1m -q 10 -c 2 -h2 https://podinfo.example.com/"
```
For gRPC services you can use [bojand/ghz](https://github.com/bojand/ghz), which is a similar tool to hey but for gRPC:
```yaml
webhooks:
- name: grpc-load-test
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
type: cmd
cmd: "ghz -z 1m -q 10 -c 2 --insecure podinfo.test:9898"
```
`ghz` uses reflection to identify which gRPC method to call. If you do not wish to enable reflection for your gRPC service you can implement a standardized health check from the [grpc-proto](https://github.com/grpc/grpc-proto) library. To use this [health check schema](https://github.com/grpc/grpc-proto/blob/master/grpc/health/v1/health.proto) without reflection you can pass a parameter to `ghz` like this:
```yaml
webhooks:
- name: grpc-load-test-no-reflection
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
type: cmd
cmd: "ghz --insecure --proto=/tmp/ghz/health.proto --call=grpc.health.v1.Health/Check podinfo.test:9898"
```
The load tester can run arbitrary commands as long as the binary is present in the container image. For example, if you want to replace `hey` with another CLI, you can create your own Docker image:
```text
FROM weaveworks/flagger-loadtester:<VER>
RUN curl -Lo /usr/local/bin/my-cli https://github.com/user/repo/releases/download/ver/my-cli \
&& chmod +x /usr/local/bin/my-cli
```
## Load Testing Delegation
The load tester can also forward testing tasks to external tools; currently [nGrinder](https://github.com/naver/ngrinder) is supported.
To use this feature, add a load test task of type 'ngrinder' to the canary analysis spec:
```yaml
webhooks:
- name: load-test-post
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
# type of this load test task, cmd or ngrinder
type: ngrinder
# base url of your nGrinder controller server
server: http://ngrinder-server:port
# id of the test to clone from, the test must have been defined.
clone: 100
# user name and base64 encoded password to authenticate against the nGrinder server
username: admin
passwd: YWRtaW4=
# the interval between nGrinder test status polls (default 1s)
pollInterval: 5s
```
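The base64 value for the password can be generated with:

```bash
echo -n 'admin' | base64
# YWRtaW4=
```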
When the canary analysis starts, the load tester will initiate a [clone\_and\_start request](https://github.com/naver/ngrinder/wiki/REST-API-PerfTest) to the nGrinder server and start a new performance test. The load tester will periodically poll the nGrinder server for the status of the test, and prevent duplicate requests from being sent in subsequent analysis loops.
## Integration Testing
Flagger comes with a testing service that can run Helm tests or Bats tests when configured as a webhook.
Deploy the Helm test runner in the `kube-system` namespace using the `tiller` service account:
```bash
helm repo add flagger https://flagger.app
helm upgrade -i flagger-helmtester flagger/loadtester \
--namespace=kube-system \
--set serviceAccountName=tiller
```
When deployed, the Helm tester API will be available at `http://flagger-helmtester.kube-system/`.
Now you can add pre-rollout webhooks to the canary analysis spec:
```yaml
canaryAnalysis:
webhooks:
- name: "smoke test"
type: pre-rollout
url: http://flagger-helmtester.kube-system/
timeout: 3m
metadata:
type: "helm"
cmd: "test {{ .Release.Name }} --cleanup"
```
When the canary analysis starts, Flagger will call the pre-rollout webhooks before routing traffic to the canary. If the helm test fails, Flagger will retry until the analysis threshold is reached and the canary is rolled back.
If you are using Helm v3, you'll have to create a dedicated service account and add the release namespace to the test command:
```yaml
canaryAnalysis:
webhooks:
- name: "smoke test"
type: pre-rollout
url: http://flagger-helmtester.kube-system/
timeout: 3m
metadata:
type: "helmv3"
cmd: "test run {{ .Release.Name }} --cleanup -n {{ .Release.Namespace }}"
```
As an alternative to Helm you can use the [Bash Automated Testing System](https://github.com/bats-core/bats-core) to run your tests.
```yaml
canaryAnalysis:
webhooks:
- name: "acceptance tests"
type: pre-rollout
url: http://flagger-batstester.default/
timeout: 5m
metadata:
type: "bash"
cmd: "bats /tests/acceptance.bats"
```
Note that you should create a ConfigMap with your Bats tests and mount it inside the tester container.
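For example (names are illustrative), the tests can be packaged with kubectl and then mounted as a volume in the tester deployment:

```bash
kubectl -n default create configmap bats-tests \
  --from-file=acceptance.bats=./tests/acceptance.bats
```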
## Manual Gating
For manual approval of a canary deployment you can use the `confirm-rollout` and `confirm-promotion` webhooks. The confirmation rollout hooks are executed before the pre-rollout hooks. Flagger will halt the canary traffic shifting and analysis until the confirm webhook returns HTTP status 200.
For manual rollback of a canary deployment you can use the `rollback` webhook. The rollback hook will be called during the analysis and confirmation states. If a rollback webhook returns a successful HTTP status code, Flagger will shift all traffic back to the primary instance and fail the canary.
Manual gating with Flagger's tester:
```yaml
canaryAnalysis:
webhooks:
- name: "gate"
type: confirm-rollout
url: http://flagger-loadtester.test/gate/halt
```
The `/gate/halt` endpoint returns HTTP 403, thus blocking the rollout.
If you have notifications enabled, Flagger will post a message to Slack or MS Teams if a canary rollout is waiting for approval.
Change the URL to `/gate/approve` to start the canary analysis:
```yaml
canaryAnalysis:
webhooks:
- name: "gate"
type: confirm-rollout
url: http://flagger-loadtester.test/gate/approve
```
Manual gating can be driven with Flagger's tester API. Set the confirmation URL to `/gate/check`:
```yaml
canaryAnalysis:
webhooks:
- name: "ask for confirmation"
type: confirm-rollout
url: http://flagger-loadtester.test/gate/check
```
By default the gate is closed; you can start or resume the canary rollout with:
```bash
kubectl -n test exec -it flagger-loadtester-xxxx-xxxx sh
curl -d '{"name": "podinfo","namespace":"test"}' http://localhost:8080/gate/open
```
You can pause the rollout at any time with:
```bash
curl -d '{"name": "podinfo","namespace":"test"}' http://localhost:8080/gate/close
```
If a canary analysis is paused, the status will change to `Waiting`:
```bash
kubectl get canary/podinfo
NAME STATUS WEIGHT
podinfo Waiting 0
```
The `confirm-promotion` hook type can be used to manually approve the canary promotion. While the promotion is paused, Flagger will continue to run the metrics checks and load tests.
```yaml
canaryAnalysis:
webhooks:
- name: "promotion gate"
type: confirm-promotion
url: http://flagger-loadtester.test/gate/halt
```
The `rollback` hook type can be used to manually roll back the canary promotion. As with gating, rollbacks can be driven with Flagger's tester API by setting the rollback URL to `/rollback/check`.
```yaml
canaryAnalysis:
webhooks:
- name: "rollback"
type: rollback
url: http://flagger-loadtester.test/rollback/check
```
By default the rollback gate is closed; you can roll back a canary with:
```bash
kubectl -n test exec -it flagger-loadtester-xxxx-xxxx sh
curl -d '{"name": "podinfo","namespace":"test"}' http://localhost:8080/rollback/open
```
You can close the rollback with:
```bash
curl -d '{"name": "podinfo","namespace":"test"}' http://localhost:8080/rollback/close
```
If you have notifications enabled, Flagger will post a message to Slack or MS Teams if a canary promotion is waiting for approval.
The canary analysis runs periodically until it reaches the maximum traffic weight or the number of iterations.
On each run, Flagger calls the webhooks, checks the metrics and if the failed checks threshold is reached, stops the
analysis and rolls back the canary. If alerting is configured, Flagger will post the analysis result using the alert providers.

View File

@ -1,6 +1,9 @@
# Alerting
## Slack
Flagger can be configured to send alerts to various chat platforms. You can define a global alert provider at
install time or configure alerts on a per canary basis.
### Global configuration
Flagger can be configured to send Slack notifications:
@ -11,16 +14,16 @@ helm upgrade -i flagger flagger/flagger \
--set slack.user=flagger
```
Once configured with a Slack incoming **webhook**, Flagger will post messages when a canary deployment has been initialised, when a new revision has been detected and if the canary analysis failed or succeeded.
Once configured with a Slack incoming **webhook**, Flagger will post messages when a canary deployment
has been initialised, when a new revision has been detected and if the canary analysis failed or succeeded.
![Slack Notifications](https://raw.githubusercontent.com/weaveworks/flagger/master/docs/screens/slack-canary-notifications.png)
A canary deployment will be rolled back if the progress deadline is exceeded or if the analysis reaches the maximum number of failed checks:
A canary deployment will be rolled back if the progress deadline is exceeded or if the analysis reaches the
maximum number of failed checks:
![Slack Notifications](https://raw.githubusercontent.com/weaveworks/flagger/master/docs/screens/slack-canary-failed.png)
## Microsoft Teams
Flagger can be configured to send notifications to Microsoft Teams:
```bash
@ -28,17 +31,89 @@ helm upgrade -i flagger flagger/flagger \
--set msteams.url=https://outlook.office.com/webhook/YOUR/TEAMS/WEBHOOK
```
Flagger will post a message card to MS Teams when a new revision has been detected and if the canary analysis failed or succeeded:
Similar to Slack, Flagger alerts on canary analysis events:
![MS Teams Notifications](https://raw.githubusercontent.com/weaveworks/flagger/master/docs/screens/flagger-ms-teams-notifications.png)
And you'll get a notification on rollback:
![MS Teams Notifications](https://raw.githubusercontent.com/weaveworks/flagger/master/docs/screens/flagger-ms-teams-failed.png)
## Prometheus Alert Manager
### Canary configuration
Besides Slack, you can use Alertmanager to trigger alerts when a canary deployment fails:
Configuring alerting globally has several limitations as it's not possible to specify different channels
or configure the verbosity on a per canary basis.
To make alerting more flexible, the canary analysis can be extended
with a list of alerts that reference an alert provider.
For each alert, users can configure the severity level.
The alerts section overrides the global setting.
Slack example:
```yaml
apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
name: on-call
namespace: flagger
spec:
type: slack
channel: on-call-alerts
username: flagger
# webhook address (ignored if secretRef is specified)
address: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
# secret containing the webhook address (optional)
secretRef:
name: on-call-url
---
apiVersion: v1
kind: Secret
metadata:
name: on-call-url
namespace: flagger
data:
address: <encoded-url>
```
The alert provider **type** can be: `slack`, `msteams`, `rocket` or `discord`. When set to `discord`,
Flagger will use [Slack formatting](https://birdie0.github.io/discord-webhooks-guide/other/slack_formatting.html)
and will append `/slack` to the Discord address.
When not specified, **channel** defaults to `general` and **username** defaults to `flagger`.
When **secretRef** is specified, the Kubernetes secret must contain a data field named `address`;
the address in the secret takes precedence over the **address** field in the provider spec.
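As a sketch, the `on-call-url` secret referenced above could be created with `kubectl` (the webhook URL is a placeholder):

```bash
kubectl -n flagger create secret generic on-call-url \
  --from-literal=address=https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
```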
The canary analysis can have a list of alerts, each alert referencing an alert provider:
```yaml
canaryAnalysis:
alerts:
- name: "on-call Slack"
severity: error
providerRef:
name: on-call
namespace: flagger
- name: "qa Discord"
severity: warn
providerRef:
name: qa-discord
- name: "dev MS Teams"
severity: info
providerRef:
name: dev-msteams
```
Alert fields:
* **name** (required)
* **severity** levels: `info`, `warn`, `error` (default info)
* **providerRef.name** alert provider name (required)
* **providerRef.namespace** alert provider namespace (defaults to the canary namespace)
When the severity is set to `warn`, Flagger will alert when waiting on manual confirmation or if the analysis fails.
When the severity is set to `error`, Flagger will alert only if the canary analysis fails.
### Prometheus Alert Manager
You can use Alertmanager to trigger alerts when a canary deployment failed:
```yaml
  # fires based on the flagger_canary_status gauge exposed by Flagger
  - alert: canary_rollback
    expr: flagger_canary_status > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Canary failed"
      description: "Workload {{ $labels.name }} namespace {{ $labels.namespace }}"
```

View File

@ -1,31 +1,33 @@
# Deployment Strategies
Flagger can run automated application analysis, promotion and rollback for the following deployment strategies:
* **Canary Release** (progressive traffic shifting)
* Istio, Linkerd, App Mesh, NGINX, Contour, Gloo
* **A/B Testing** (HTTP headers and cookies traffic routing)
* Istio, App Mesh, NGINX, Contour
* **Blue/Green** (traffic switching)
* Kubernetes CNI, Istio, Linkerd, App Mesh, NGINX, Contour, Gloo
* **Blue/Green Mirroring** (traffic shadowing)
* Istio
For Canary releases and A/B testing you'll need a Layer 7 traffic management solution like a service mesh or an ingress controller.
For Blue/Green deployments no service mesh or ingress controller is required.
A canary analysis is triggered by changes in any of the following objects:
* Deployment PodSpec (container image, command, ports, env, resources, etc)
* ConfigMaps mounted as volumes or mapped to environment variables
* Secrets mounted as volumes or mapped to environment variables
### Canary Release
Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance
indicators like HTTP requests success rate, requests average duration and pod health.
Based on analysis of the KPIs a canary is promoted or aborted.
![Flagger Canary Stages](https://raw.githubusercontent.com/weaveworks/flagger/master/docs/diagrams/flagger-canary-steps.png)
The canary analysis runs periodically until it reaches the maximum traffic weight or the failed checks threshold.
Spec:
@ -46,27 +48,76 @@ Spec:
skipAnalysis: false
```
The above analysis, if it succeeds, will run for 25 minutes while validating the HTTP metrics and webhooks every minute.
You can determine the minimum time that it takes to validate and promote a canary deployment using this formula:
```text
interval * (maxWeight / stepWeight)
```
And the time it takes for a canary to be rolled back when the metrics or webhook checks are failing:
```text
interval * threshold
```
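As a worked example, assuming the analysis above uses `interval: 1m`, `stepWeight: 2`, `maxWeight: 50` and a `threshold` of 10 (the threshold value is illustrative):

```text
promotion: 1m * (50 / 2) = 25m
rollback:  1m * 10       = 10m
```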
In emergency cases, you may want to skip the analysis phase and ship changes directly to production.
At any time you can set the `spec.skipAnalysis: true`.
When skip analysis is enabled, Flagger checks if the canary deployment is healthy and
promotes it without analysing it. If an analysis is underway, Flagger cancels it and runs the promotion.
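A minimal sketch of flipping that field on a live canary with `kubectl patch` (the canary name and namespace are illustrative):

```bash
kubectl -n test patch canary/podinfo \
  --type=merge -p '{"spec":{"skipAnalysis":true}}'
```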
Gated canary promotion stages:
* scan for canary deployments
* check primary and canary deployment status
  * halt advancement if a rolling update is underway
  * halt advancement if pods are unhealthy
* call confirm-rollout webhooks and check results
  * halt advancement if any hook returns a non HTTP 2xx result
* call pre-rollout webhooks and check results
  * halt advancement if any hook returns a non HTTP 2xx result
  * increment the failed checks counter
* increase canary traffic weight percentage from 0% to 2% (step weight)
* call rollout webhooks and check results
* check canary HTTP request success rate and latency
  * halt advancement if any metric is under the specified threshold
  * increment the failed checks counter
* check if the number of failed checks reached the threshold
  * route all traffic to primary
  * scale to zero the canary deployment and mark it as failed
  * call post-rollout webhooks
  * post the analysis result to Slack
  * wait for the canary deployment to be updated and start over
* increase canary traffic weight by 2% (step weight) till it reaches 50% (max weight)
  * halt advancement if any webhook call fails
  * halt advancement while canary request success rate is under the threshold
  * halt advancement while canary request duration P99 is over the threshold
  * halt advancement while any custom metric check fails
  * halt advancement if the primary or canary deployment becomes unhealthy
  * halt advancement while canary deployment is being scaled up/down by HPA
* call confirm-promotion webhooks and check results
  * halt advancement if any hook returns a non HTTP 2xx result
* promote canary to primary
  * copy ConfigMaps and Secrets from canary to primary
  * copy canary deployment spec template over primary
* wait for primary rolling update to finish
  * halt advancement if pods are unhealthy
* route all traffic to primary
* scale to zero the canary deployment
* mark rollout as finished
* call post-rollout webhooks
* send notification with the canary analysis result
* wait for the canary deployment to be updated and start over
### A/B Testing
For frontend applications that require session affinity you should use HTTP headers or cookies match conditions
to ensure a set of users will stay on the same version for the whole duration of the canary analysis.
![Flagger A/B Testing Stages](https://raw.githubusercontent.com/weaveworks/flagger/master/docs/diagrams/flagger-abtest-steps.png)
You can enable A/B testing by specifying the HTTP match conditions and the number of iterations.
If Flagger finds an HTTP match condition, it will ignore the `maxWeight` and `stepWeight` settings.
Istio example:
@ -88,16 +139,17 @@ Istio example:
regex: "^(.*?;)?(canary=always)(;.*)?$"
```
The above configuration will run an analysis for ten minutes targeting the Safari users and those that have a test cookie.
You can determine the minimum time that it takes to validate and promote a canary deployment using this formula:
```text
interval * iterations
```
And the time it takes for a canary to be rolled back when the metrics or webhook checks are failing:
```text
interval * threshold
```
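With the example above (`interval: 1m`, `iterations: 10`, and an assumed `threshold` of 2), that gives:

```text
promotion: 1m * 10 = 10m
rollback:  1m * 2  = 2m
```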
App Mesh example:
@ -155,9 +207,10 @@ curl -H 'X-Canary: insider' http://app.example.com
curl -b 'canary=always' http://app.example.com
```
### Blue/Green Deployments
For applications that are not deployed on a service mesh, Flagger can orchestrate blue/green style deployments
with Kubernetes L4 networking. When using Istio you have the option to mirror traffic between blue and green.
![Flagger Blue/Green Stages](https://raw.githubusercontent.com/weaveworks/flagger/master/docs/diagrams/flagger-bluegreen-steps.png)
@ -173,30 +226,44 @@ You can use the blue/green deployment strategy by replacing `stepWeight/maxWeigh
threshold: 2
```
With the above configuration Flagger will run conformance and load tests on the canary pods for ten minutes.
If the metrics analysis succeeds, live traffic will be switched from the old version to the new one when the
canary is promoted.
The blue/green deployment strategy is supported for all service mesh providers.
Blue/Green rollout steps for service mesh:
* detect new revision (deployment spec, secrets or configmaps changes)
* scale up the canary (green)
* run conformance tests for the canary pods
* run load tests and metric checks for the canary pods every minute
* abort the canary release if the failure threshold is reached
* route traffic to canary
* promote canary spec over primary (blue)
* wait for primary rollout
* route traffic to primary
* scale down canary
After the analysis finishes, the traffic is routed to the canary (green) before triggering the primary (blue)
rolling update; this ensures a smooth transition to the new version and avoids dropping in-flight requests
during the Kubernetes deployment rollout.
### Blue/Green with Traffic Mirroring
Traffic Mirroring is a pre-stage in a Canary (progressive traffic shifting) or
Blue/Green deployment strategy. Traffic mirroring will copy each incoming
request, sending one request to the primary and one to the canary service.
The response from the primary is sent back to the user. The response from the canary
is discarded. Metrics are collected on both requests so that the deployment will
only proceed if the canary metrics are healthy.
Mirroring should be used for requests that are **idempotent** or capable of
being processed twice (once by the primary and once by the canary). Reads are
idempotent. Before using mirroring on requests that may be writes, you should
consider what will happen if a write is duplicated and handled by the primary
and canary.
To use mirroring, set `spec.canaryAnalysis.mirror` to `true`.
Istio example:
@ -212,3 +279,27 @@ Istio example:
mirror: true
```
Mirroring rollout steps for service mesh:
* detect new revision (deployment spec, secrets or configmaps changes)
* scale from zero the canary deployment
* wait for the HPA to set the canary minimum replicas
* check canary pods health
* run the acceptance tests
* abort the canary release if tests fail
* start the load tests
* mirror traffic from primary to canary
* check request success rate and request duration every minute
* abort the canary release if the failure threshold is reached
* stop traffic mirroring after the number of iterations is reached
* route live traffic to the canary pods
* promote the canary (update the primary secrets, configmaps and deployment spec)
* wait for the primary deployment rollout to finish
* wait for the HPA to set the primary minimum replicas
* check primary pods health
* switch live traffic back to primary
* scale to zero the canary
* send notification with the canary analysis result
After the analysis finishes, the traffic is routed to the canary (green) before triggering the primary (blue)
rolling update; this ensures a smooth transition to the new version and avoids dropping in-flight requests
during the Kubernetes deployment rollout.

View File

@ -0,0 +1,137 @@
# Metrics Analysis
As part of the analysis process, Flagger can validate service level objectives (SLOs) like
availability, error rate percentage, average response time and any other objective based on app specific metrics.
If a drop in performance is noticed during the SLOs analysis,
the release will be automatically rolled back with minimum impact to end-users.
### Builtin Metrics
Flagger comes with two builtin metric checks: HTTP request success rate and duration.
```yaml
canaryAnalysis:
metrics:
- name: request-success-rate
interval: 1m
# minimum req success rate (non 5xx responses)
# percentage (0-100)
thresholdRange:
min: 99
- name: request-duration
interval: 1m
# maximum req duration P99
# milliseconds
thresholdRange:
max: 500
```
For each metric you can specify a range of accepted values with `thresholdRange`
and the window size of the time series with `interval`.
The builtin checks are available for every service mesh / ingress controller
and are implemented with [Prometheus queries](../faq.md#metrics).
### Custom Metrics
The canary analysis can be extended with custom metric checks. Using a `MetricTemplate` custom resource, you
configure Flagger to connect to a metric provider and run a query that returns a `float64` value.
The query result is used to validate the canary based on the specified threshold range.
Prometheus template example:
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: not-found-percentage
namespace: istio-system
spec:
provider:
type: prometheus
    address: http://prometheus.istio-system:9090
query: |
100 - sum(
rate(
istio_requests_total{
reporter="destination",
destination_workload_namespace="{{ namespace }}",
destination_workload="{{ target }}",
response_code!="404"
}[{{ interval }}]
)
)
/
sum(
rate(
istio_requests_total{
reporter="destination",
destination_workload_namespace="{{ namespace }}",
destination_workload="{{ target }}"
}[{{ interval }}]
)
) * 100
```
The following variables are available in templates:
- `name` (canary.metadata.name)
- `namespace` (canary.metadata.namespace)
- `target` (canary.spec.targetRef.name)
- `service` (canary.spec.service.name)
- `ingress` (canary.spec.ingressRef.name)
- `interval` (canary.spec.canaryAnalysis.metrics[].interval)
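For illustration, for a canary named `podinfo` in a `test` namespace with a `1m` interval, the selector in the
template above would render roughly as follows (a hedged sketch of the substitution, not actual Flagger output):

```text
istio_requests_total{
  reporter="destination",
  destination_workload_namespace="test",
  destination_workload="podinfo",
  response_code!="404"
}[1m]
```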
A canary analysis metric can reference a template with `templateRef`:
```yaml
canaryAnalysis:
metrics:
- name: "404s percentage"
templateRef:
name: not-found-percentage
# namespace is optional
# when not specified, the canary namespace will be used
namespace: istio-system
thresholdRange:
max: 5
interval: 1m
```
The above configuration validates the canary by checking
if the HTTP 404 req/sec percentage is below 5 percent of the total traffic.
If the 404s rate reaches the 5% threshold, then the canary fails.
Prometheus gRPC error rate example:
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: grpc-error-rate-percentage
namespace: flagger
spec:
provider:
type: prometheus
    address: http://flagger-prometheus.flagger-system:9090
query: |
100 - sum(
rate(
grpc_server_handled_total{
grpc_code!="OK",
kubernetes_namespace="{{ namespace }}",
kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
}[{{ interval }}]
)
)
/
sum(
rate(
grpc_server_started_total{
kubernetes_namespace="{{ namespace }}",
kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
}[{{ interval }}]
)
) * 100
```
The above template is for gRPC services instrumented with [go-grpc-prometheus](https://github.com/grpc-ecosystem/go-grpc-prometheus).
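A usage sketch, wiring the gRPC template above into a canary analysis (the threshold value is illustrative):

```yaml
canaryAnalysis:
  metrics:
  - name: "gRPC error rate"
    templateRef:
      name: grpc-error-rate-percentage
      namespace: flagger
    thresholdRange:
      max: 5
    interval: 1m
```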

View File

@ -0,0 +1,401 @@
# Webhooks
The canary analysis can be extended with webhooks. Flagger will call each webhook URL and
determine from the response status code (HTTP 2xx) if the canary is failing or not.
There are several types of hooks:
* **confirm-rollout** hooks are executed before scaling up the canary deployment and can be used for manual approval.
The rollout is paused until the hook returns a successful HTTP status code.
* **pre-rollout** hooks are executed before routing traffic to canary.
The canary advancement is paused if a pre-rollout hook fails and if the number of failures reaches the
threshold the canary will be rolled back.
* **rollout** hooks are executed during the analysis on each iteration before the metric checks.
If a rollout hook call fails the canary advancement is paused and eventually rolled back.
* **confirm-promotion** hooks are executed before the promotion step.
The canary promotion is paused until the hooks return HTTP 200.
While the promotion is paused, Flagger will continue to run the metrics checks and rollout hooks.
* **post-rollout** hooks are executed after the canary has been promoted or rolled back.
If a post-rollout hook fails, the error is logged.
* **rollback** hooks are executed while a canary deployment is in either Progressing or Waiting status.
This provides the ability to roll back during analysis or while waiting for a confirmation. If a rollback hook
returns a successful HTTP status code, Flagger will stop the analysis and mark the canary release as failed.
* **event** hooks are executed every time Flagger emits a Kubernetes event. When configured,
every action that Flagger takes during a canary deployment will be sent as JSON via an HTTP POST request.
Spec:
```yaml
canaryAnalysis:
webhooks:
- name: "start gate"
type: confirm-rollout
url: http://flagger-loadtester.test/gate/approve
- name: "smoke test"
type: pre-rollout
url: http://flagger-helmtester.kube-system/
timeout: 3m
metadata:
type: "helm"
cmd: "test podinfo --cleanup"
- name: "load test"
type: rollout
url: http://flagger-loadtester.test/
timeout: 15s
metadata:
cmd: "hey -z 1m -q 5 -c 2 http://podinfo-canary.test:9898/"
- name: "promotion gate"
type: confirm-promotion
url: http://flagger-loadtester.test/gate/approve
- name: "notify"
type: post-rollout
url: http://telegram.bot:8080/
timeout: 5s
metadata:
some: "message"
- name: "rollback gate"
type: rollback
url: http://flagger-loadtester.test/rollback/check
- name: "send to Slack"
type: event
    url: http://event-receiver.notifications/slack
```
> **Note** that the sum of all rollout webhook timeouts should be lower than the analysis interval.
Webhook payload (HTTP POST):
```json
{
"name": "podinfo",
"namespace": "test",
"phase": "Progressing",
"metadata": {
"test": "all",
"token": "16688eb5e9f289f1991c"
}
}
```
Response status codes:
* 200-202 - advance canary by increasing the traffic weight
* timeout or non-2xx - halt advancement and increment failed checks
On a non-2xx response Flagger will include the response body (if any) in the failed checks log and Kubernetes events.
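As a quick sanity check of this contract, you can call one of Flagger's tester gate endpoints by hand
(described under Manual Gating below); a closed gate answers with HTTP 403, which halts the advancement:

```bash
curl -s -o /dev/null -w "%{http_code}\n" \
  -d '{"name": "podinfo", "namespace": "test"}' \
  http://flagger-loadtester.test/gate/check
```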
Event payload (HTTP POST):
```json
{
"name": "string (canary name)",
"namespace": "string (canary namespace)",
"phase": "string (canary phase)",
"metadata": {
"eventMessage": "string (canary event message)",
"eventType": "string (canary event type)",
"timestamp": "string (unix timestamp ms)"
}
}
```
The event receiver can create alerts based on the received phase
(possible values: `Initialized`, `Waiting`, `Progressing`, `Promoting`, `Finalising`, `Succeeded` or `Failed`).
### Load Testing
For workloads that are not receiving constant traffic Flagger can be configured with a webhook
that, when called, will start a load test for the target workload.
If the target workload doesn't receive any traffic during the canary analysis,
Flagger metric checks will fail with "no values found for metric request-success-rate".
Flagger comes with a load testing service based on [rakyll/hey](https://github.com/rakyll/hey)
that generates traffic during analysis when configured as a webhook.
![Flagger Load Testing Webhook](https://raw.githubusercontent.com/weaveworks/flagger/master/docs/diagrams/flagger-load-testing.png)
First you need to deploy the load test runner in a namespace with sidecar injection enabled:
```bash
export REPO=https://raw.githubusercontent.com/weaveworks/flagger/master
kubectl -n test apply -f ${REPO}/artifacts/loadtester/deployment.yaml
kubectl -n test apply -f ${REPO}/artifacts/loadtester/service.yaml
```
Or by using Helm:
```bash
helm repo add flagger https://flagger.app
helm upgrade -i flagger-loadtester flagger/loadtester \
--namespace=test \
--set cmd.timeout=1h
```
When deployed, the load tester API will be available at `http://flagger-loadtester.test/`.
Now you can add webhooks to the canary analysis spec:
```yaml
webhooks:
- name: load-test-get
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
type: cmd
cmd: "hey -z 1m -q 10 -c 2 http://podinfo-canary.test:9898/"
- name: load-test-post
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
type: cmd
cmd: "hey -z 1m -q 10 -c 2 -m POST -d '{test: 2}' http://podinfo-canary.test:9898/echo"
```
When the canary analysis starts, Flagger will call the webhooks and the load tester will run the `hey` commands
in the background, if they are not already running. This will ensure that during the
analysis, the `podinfo-canary.test` service will receive a steady stream of GET and POST requests.
If your workload is exposed outside the mesh you can point `hey` to the
public URL and use HTTP2.
```yaml
webhooks:
- name: load-test-get
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
type: cmd
cmd: "hey -z 1m -q 10 -c 2 -h2 https://podinfo.example.com/"
```
For gRPC services you can use [bojand/ghz](https://github.com/bojand/ghz) which is a similar tool to Hey but for gRPC:
```yaml
webhooks:
- name: grpc-load-test
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
type: cmd
cmd: "ghz -z 1m -q 10 -c 2 --insecure podinfo.test:9898"
```
`ghz` uses reflection to identify which gRPC method to call. If you do not wish to enable reflection for your gRPC service you can implement a standardized health check from the [grpc-proto](https://github.com/grpc/grpc-proto) library. To use this [health check schema](https://github.com/grpc/grpc-proto/blob/master/grpc/health/v1/health.proto) without reflection you can pass a parameter to `ghz` like this:
```yaml
webhooks:
- name: grpc-load-test-no-reflection
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
type: cmd
cmd: "ghz --insecure --proto=/tmp/ghz/health.proto --call=grpc.health.v1.Health/Check podinfo.test:9898"
```
The load tester can run arbitrary commands as long as the binary is present in the container image.
For example, if you want to replace `hey` with another CLI, you can create your own Docker image:
```dockerfile
FROM weaveworks/flagger-loadtester:<VER>
RUN curl -Lo /usr/local/bin/my-cli https://github.com/user/repo/releases/download/ver/my-cli \
&& chmod +x /usr/local/bin/my-cli
```
### Load Testing Delegation
The load tester can also forward testing tasks to external tools; currently [nGrinder](https://github.com/naver/ngrinder)
is supported.
To use this feature, add a load test task of type 'ngrinder' to the canary analysis spec:
```yaml
webhooks:
- name: load-test-post
url: http://flagger-loadtester.test/
timeout: 5s
metadata:
# type of this load test task, cmd or ngrinder
type: ngrinder
# base url of your nGrinder controller server
server: http://ngrinder-server:port
      # id of the test to clone from; the test must already exist
clone: 100
# user name and base64 encoded password to authenticate against the nGrinder server
username: admin
passwd: YWRtaW4=
      # the interval between nGrinder test status polls (defaults to 1s)
pollInterval: 5s
```
When the canary analysis starts, the load tester will initiate a [clone_and_start request](https://github.com/naver/ngrinder/wiki/REST-API-PerfTest)
to the nGrinder server and start a new performance test. The load tester will periodically poll the nGrinder server
for the status of the test, and prevent duplicate requests from being sent in subsequent analysis loops.
### Integration Testing
Flagger comes with a testing service that can run Helm tests or Bats tests when configured as a webhook.
Deploy the Helm test runner in the `kube-system` namespace using the `tiller` service account:
```bash
helm repo add flagger https://flagger.app
helm upgrade -i flagger-helmtester flagger/loadtester \
--namespace=kube-system \
--set serviceAccountName=tiller
```
When deployed, the Helm tester API will be available at `http://flagger-helmtester.kube-system/`.
Now you can add pre-rollout webhooks to the canary analysis spec:
```yaml
canaryAnalysis:
webhooks:
- name: "smoke test"
type: pre-rollout
url: http://flagger-helmtester.kube-system/
timeout: 3m
metadata:
type: "helm"
cmd: "test {{ .Release.Name }} --cleanup"
```
When the canary analysis starts, Flagger will call the pre-rollout webhooks before routing traffic to the canary.
If the helm test fails, Flagger will retry until the analysis threshold is reached and the canary is rolled back.
If you are using Helm v3, you'll have to create a dedicated service account and add the release namespace to the test command:
```yaml
canaryAnalysis:
webhooks:
- name: "smoke test"
type: pre-rollout
url: http://flagger-helmtester.kube-system/
timeout: 3m
metadata:
type: "helmv3"
cmd: "test run {{ .Release.Name }} --timeout 3m -n {{ .Release.Namespace }}"
```
As an alternative to Helm you can use the [Bash Automated Testing System](https://github.com/bats-core/bats-core) to run your tests.
```yaml
canaryAnalysis:
webhooks:
- name: "acceptance tests"
type: pre-rollout
url: http://flagger-batstester.default/
timeout: 5m
metadata:
type: "bash"
cmd: "bats /tests/acceptance.bats"
```
Note that you should create a ConfigMap with your Bats tests and mount it inside the tester container.
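As a sketch, the tests could be packaged like this (the ConfigMap name and file path are illustrative, and the
ConfigMap still has to be mounted at `/tests` in the tester deployment):

```bash
kubectl -n default create configmap bats-acceptance-tests \
  --from-file=tests/acceptance.bats
```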
### Manual Gating
For manual approval of a canary deployment you can use the `confirm-rollout` and `confirm-promotion` webhooks.
The confirm-rollout hooks are executed before the pre-rollout hooks.
Flagger will halt the canary traffic shifting and analysis until the confirm webhook returns HTTP status 200.
For manual rollback of a canary deployment you can use the `rollback` webhook. The rollback hook will be called
during the analysis and confirmation states. If a rollback webhook returns a successful HTTP status code, Flagger
will shift all traffic back to the primary instance and fail the canary.
Manual gating with Flagger's tester:
```yaml
canaryAnalysis:
webhooks:
- name: "gate"
type: confirm-rollout
url: http://flagger-loadtester.test/gate/halt
```
The `/gate/halt` endpoint returns HTTP 403, thus blocking the rollout.
If you have notifications enabled, Flagger will post a message to Slack or MS Teams if a canary rollout is waiting for approval.
Change the URL to `/gate/approve` to start the canary analysis:
```yaml
canaryAnalysis:
webhooks:
- name: "gate"
type: confirm-rollout
url: http://flagger-loadtester.test/gate/approve
```
Manual gating can be driven with Flagger's tester API. Set the confirmation URL to `/gate/check`:
```yaml
canaryAnalysis:
webhooks:
- name: "ask for confirmation"
type: confirm-rollout
url: http://flagger-loadtester.test/gate/check
```
By default the gate is closed; you can start or resume the canary rollout with:
```bash
kubectl -n test exec -it flagger-loadtester-xxxx-xxxx sh
curl -d '{"name": "podinfo","namespace":"test"}' http://localhost:8080/gate/open
```
You can pause the rollout at any time with:
```bash
curl -d '{"name": "podinfo","namespace":"test"}' http://localhost:8080/gate/close
```
If a canary analysis is paused, the status will change to waiting:
```bash
kubectl get canary/podinfo
NAME STATUS WEIGHT
podinfo Waiting 0
```
The `confirm-promotion` hook type can be used to manually approve the canary promotion.
While the promotion is paused, Flagger will continue to run the metrics checks and load tests.
```yaml
canaryAnalysis:
webhooks:
- name: "promotion gate"
type: confirm-promotion
url: http://flagger-loadtester.test/gate/halt
```
The `rollback` hook type can be used to manually roll back the canary promotion. As with gating, rollbacks can be driven
with Flagger's tester API by setting the rollback URL to `/rollback/check`:
```yaml
canaryAnalysis:
webhooks:
- name: "rollback"
type: rollback
url: http://flagger-loadtester.test/rollback/check
```
By default the rollback gate is closed; you can roll back a canary with:
```bash
kubectl -n test exec -it flagger-loadtester-xxxx-xxxx sh
curl -d '{"name": "podinfo","namespace":"test"}' http://localhost:8080/rollback/open
```
You can close the rollback with:
```bash
curl -d '{"name": "podinfo","namespace":"test"}' http://localhost:8080/rollback/close
```
If you have notifications enabled, Flagger will post a message to Slack or MS Teams if a canary has been rolled back.