# Canary Deployments with Helm Charts and GitOps
This guide shows you how to package a web app into a Helm chart, trigger canary deployments on Helm upgrade
and automate the chart release process with Weave Flux.

### Packaging

You'll be using the [podinfo](https://github.com/stefanprodan/k8s-podinfo) chart.
This chart packages a web app written in Go, its configuration, a horizontal pod autoscaler (HPA)
and the canary configuration file.
```
├── Chart.yaml
├── README.md
├── templates
│   ├── NOTES.txt
│   ├── _helpers.tpl
│   ├── canary.yaml
│   ├── configmap.yaml
│   ├── deployment.yaml
│   └── hpa.yaml
└── values.yaml
```
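The `canary.yaml` template turns the chart values into a Flagger canary custom resource. A rough sketch of the canary-related values used throughout this guide (check the chart's values.yaml for the actual defaults):

```yaml
# excerpt of the podinfo chart values driving the canary template (defaults shown are illustrative)
image:
  repository: quay.io/stefanprodan/podinfo
  tag: 1.4.0
canary:
  enabled: false
  istioIngress:
    enabled: false
    gateway: public-gateway.istio-system.svc.cluster.local
    host: frontend.istio.example.com
  loadtest:
    enabled: false
```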
You can find the chart source [here](https://github.com/stefanprodan/flagger/tree/master/charts/podinfo).
### Install

Create a test namespace with Istio sidecar injection enabled:
```bash
export REPO=https://raw.githubusercontent.com/stefanprodan/flagger/master

kubectl apply -f ${REPO}/artifacts/namespaces/test.yaml
```
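The applied manifest is essentially a namespace carrying the Istio injection label; a minimal equivalent, assuming the repo file contains nothing else:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: test
  labels:
    # tells Istio to inject the Envoy sidecar into pods created in this namespace
    istio-injection: enabled
```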
Add the Flagger Helm repository:

```bash
helm repo add flagger https://flagger.app
```
Install podinfo with the release name `frontend` (replace `example.com` with your own domain):

```bash
helm upgrade -i frontend flagger/podinfo \
--namespace test \
--set nameOverride=frontend \
--set backend=http://backend.test:9898/echo \
--set canary.enabled=true \
--set canary.istioIngress.enabled=true \
--set canary.istioIngress.gateway=public-gateway.istio-system.svc.cluster.local \
--set canary.istioIngress.host=frontend.istio.example.com
```
Flagger takes a Kubernetes deployment and a horizontal pod autoscaler (HPA),
then creates a series of objects (Kubernetes deployments, ClusterIP services and Istio virtual services).
These objects expose the application on the mesh and drive the canary analysis and promotion.
```bash
# generated by Helm
configmap/frontend
deployment.apps/frontend
horizontalpodautoscaler.autoscaling/frontend
canary.flagger.app/frontend

# generated by Flagger
configmap/frontend-primary
deployment.apps/frontend-primary
horizontalpodautoscaler.autoscaling/frontend-primary
service/frontend
service/frontend-canary
service/frontend-primary
virtualservice.networking.istio.io/frontend
```
When the `frontend-primary` deployment comes online,
Flagger will route all traffic to the primary pods and scale the `frontend` deployment to zero.
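Trimmed to its routing essentials, the generated virtual service at this point looks roughly like this (a sketch; hosts and gateways depend on the `canary.istioIngress` values you set):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
  namespace: test
spec:
  gateways:
    - public-gateway.istio-system.svc.cluster.local
  hosts:
    - frontend.istio.example.com
  http:
    - route:
        # all traffic goes to the primary until a canary analysis starts
        - destination:
            host: frontend-primary
          weight: 100
        - destination:
            host: frontend-canary
          weight: 0
```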
Open your browser and navigate to the frontend URL:



Now let's install the `backend` release without exposing it outside the mesh:
```bash
helm upgrade -i backend flagger/podinfo \
--namespace test \
--set nameOverride=backend \
--set canary.enabled=true \
--set canary.istioIngress.enabled=false
```

Check if Flagger has successfully deployed the canaries:
```
kubectl -n test get canaries

NAME       STATUS        WEIGHT   LASTTRANSITIONTIME
backend    Initialized   0        2019-02-12T18:53:18Z
frontend   Initialized   0        2019-02-12T17:50:50Z
```
Click on the ping button in the `frontend` UI to trigger an HTTP POST request
that will reach the `backend` app:



We'll use the `/echo` endpoint (the same one the ping button calls)
to generate load on both apps during a canary deployment.
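When `canary.loadtest.enabled` is set, the chart wires this same endpoint into the canary analysis as a load test webhook; a rough sketch of that webhook (the URL and the `hey` command are assumptions based on the load tester defaults):

```yaml
canaryAnalysis:
  webhooks:
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 5s
      metadata:
        # hammer the canary's /echo endpoint for the duration of each check
        cmd: "hey -z 1m -q 5 -c 2 -m POST -d 'test' http://frontend-canary.test:9898/echo"
```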
### Upgrade

First let's install a load testing service that will generate traffic during the canary analysis:

```bash
helm upgrade -i flagger-loadtester flagger/loadtester \
--namespace=test
```

Enable the load tester and deploy a new `frontend` version:
```bash
helm upgrade -i frontend flagger/podinfo \
--namespace test \
--reuse-values \
--set canary.loadtest.enabled=true \
--set image.tag=1.4.1
```

Flagger detects that the deployment revision changed and starts the canary analysis along with the load test:
```
kubectl -n istio-system logs deployment/flagger -f | jq .msg

New revision detected! Scaling up frontend.test
Halt advancement frontend.test waiting for rollout to finish: 0 of 2 updated replicas are available
Starting canary analysis for frontend.test
Advance frontend.test canary weight 5
Advance frontend.test canary weight 10
Advance frontend.test canary weight 15
Advance frontend.test canary weight 20
Advance frontend.test canary weight 25
Advance frontend.test canary weight 30
Advance frontend.test canary weight 35
Advance frontend.test canary weight 40
Advance frontend.test canary weight 45
Advance frontend.test canary weight 50
Copying frontend.test template spec to frontend-primary.test
Halt advancement frontend-primary.test waiting for rollout to finish: 1 old replicas are pending termination
Promotion completed! Scaling down frontend.test
```
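The 5% to 50% progression in the log above is driven by the analysis settings in the canary spec; a plausible configuration producing it (illustrative values, not necessarily the chart defaults):

```yaml
canaryAnalysis:
  # run the checks every minute
  interval: 1m
  # shift traffic to the canary in 5% increments
  stepWeight: 5
  # promote once the canary serves half of the traffic
  maxWeight: 50
```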
You can monitor the canary deployment with Grafana. Open the Flagger dashboard,
select `test` from the namespace dropdown, `frontend-primary` from the primary dropdown and `frontend` from the
canary dropdown.



Now trigger a canary deployment for the `backend` app, but this time you'll change a value in the configmap:

```bash
helm upgrade -i backend flagger/podinfo \
--namespace test \
--reuse-values \
--set canary.loadtest.enabled=true \
--set httpServer.timeout=25s
```
Generate HTTP 500 errors:

```bash
kubectl -n test exec -it flagger-loadtester-xxx-yyy sh

watch curl http://backend-canary:9898/status/500
```

Generate latency:

```bash
kubectl -n test exec -it flagger-loadtester-xxx-yyy sh

watch curl http://backend-canary:9898/delay/1
```
Flagger detects the config map change and starts a canary analysis. Flagger will pause the advancement
when the HTTP success rate drops under 99% or when the average request duration in the last minute is over 500ms:
```
kubectl -n test describe canary backend

Events:

ConfigMap backend has changed
New revision detected! Scaling up backend.test
Starting canary analysis for backend.test
Advance backend.test canary weight 5
Advance backend.test canary weight 10
Advance backend.test canary weight 15
Advance backend.test canary weight 20
Advance backend.test canary weight 25
Advance backend.test canary weight 30
Advance backend.test canary weight 35
Halt backend.test advancement success rate 62.50% < 99%
Halt backend.test advancement success rate 88.24% < 99%
Advance backend.test canary weight 40
Advance backend.test canary weight 45
Halt backend.test advancement request duration 2.415s > 500ms
Halt backend.test advancement request duration 2.42s > 500ms
Advance backend.test canary weight 50
ConfigMap backend-primary synced
Copying backend.test template spec to backend-primary.test
Promotion completed! Scaling down backend.test
```


If the number of failed checks reaches the canary analysis threshold, the traffic is routed back to the primary,
the canary is scaled to zero and the rollout is marked as failed.
```bash
kubectl -n test get canary

NAME       STATUS      WEIGHT   LASTTRANSITIONTIME
backend    Succeeded   0        2019-02-12T19:33:11Z
frontend   Failed      0        2019-02-12T19:47:20Z
```
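Both the halting and the rollback behaviour come from the analysis thresholds; a sketch matching the numbers above (metric names follow Flagger's built-in Istio metrics, values are illustrative):

```yaml
canaryAnalysis:
  # mark the rollout as failed after ten failed checks
  threshold: 10
  metrics:
    - name: request-success-rate
      # halt the advancement when less than 99% of requests succeed
      threshold: 99
      interval: 1m
    - name: request-duration
      # halt the advancement when the average request duration exceeds 500ms
      threshold: 500
      interval: 1m
```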
If you have enabled Slack notifications, you'll receive an alert with the reason why the promotion failed.

### GitOps automation
Instead of using the Helm CLI from a CI tool to perform the install and upgrade,
you could use a Git based approach. GitOps is a way to do Continuous Delivery:
it works by using Git as the source of truth for declarative infrastructure and workloads.
In the [GitOps model](https://www.weave.works/technologies/gitops/),
any change to production must be committed in source control
prior to being applied on the cluster. This way rollback and audit logs are provided by Git.



In order to apply the GitOps pipeline model to Flagger canary deployments you'll need
a Git repository with your workload definitions in YAML format,
a container registry where your CI system pushes immutable images and
an operator that synchronizes the Git repo with the cluster state.

Create a Git repository with the following content:
```
├── namespaces
│   └── test.yaml
└── releases
    └── test
        ├── backend.yaml
        ├── frontend.yaml
        └── loadtester.yaml
```
You can find the Git source [here](https://github.com/stefanprodan/flagger/tree/master/artifacts/cluster).

Define the `frontend` release using the Flux `HelmRelease` custom resource:
```yaml
apiVersion: flux.weave.works/v1beta1
kind: HelmRelease
metadata:
  name: frontend
  namespace: test
  annotations:
    flux.weave.works/automated: "true"
    flux.weave.works/tag.chart-image: semver:~1.4
spec:
  releaseName: frontend
  chart:
    repository: https://stefanprodan.github.io/flagger/
    name: podinfo
    version: 2.0.0
  values:
    image:
      repository: quay.io/stefanprodan/podinfo
      tag: 1.4.0
    backend: http://backend-podinfo:9898/echo
    canary:
      enabled: true
      istioIngress:
        enabled: true
        gateway: public-gateway.istio-system.svc.cluster.local
        host: frontend.istio.example.com
      loadtest:
        enabled: true
```
In the `chart` section I've defined the release source by specifying the Helm repository (hosted on GitHub Pages), the chart name and the chart version.
In the `values` section I've overwritten the defaults set in values.yaml.

With the `flux.weave.works` annotations I instruct Flux to automate this release.
When an image tag in the semver range of `1.4.0 - 1.4.99` is pushed to Quay,
Flux will upgrade the Helm release, and from there Flagger will pick up the change and start a canary deployment.
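The `backend` release can be described the same way, minus the ingress settings; a minimal sketch mirroring the earlier `helm upgrade -i backend` command (not the repo's actual backend.yaml):

```yaml
apiVersion: flux.weave.works/v1beta1
kind: HelmRelease
metadata:
  name: backend
  namespace: test
  annotations:
    flux.weave.works/automated: "true"
    flux.weave.works/tag.chart-image: semver:~1.4
spec:
  releaseName: backend
  chart:
    repository: https://stefanprodan.github.io/flagger/
    name: podinfo
    version: 2.0.0
  values:
    image:
      repository: quay.io/stefanprodan/podinfo
      tag: 1.4.0
    canary:
      enabled: true
      # keep the backend mesh-only, no Istio gateway exposure
      istioIngress:
        enabled: false
      loadtest:
        enabled: true
```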
Install [Weave Flux](https://github.com/weaveworks/flux) and its Helm Operator by specifying your Git repo URL:

```bash
helm repo add weaveworks https://weaveworks.github.io/flux

helm install --name flux \
--set helmOperator.create=true \
--set git.url=git@github.com:<USERNAME>/<REPOSITORY> \
--namespace flux \
weaveworks/flux
```
At startup Flux generates an SSH key and logs the public key. Find the SSH public key with:

```bash
kubectl -n flux logs deployment/flux | grep identity.pub | cut -d '"' -f2
```
In order to sync your cluster state with Git you need to copy the public key and create a
deploy key with write access on your GitHub repository.

Open GitHub, navigate to your fork, go to _Settings > Deploy keys_, click on _Add deploy key_,
check _Allow write access_, paste the Flux public key and click _Add key_.

After a couple of seconds Flux will apply the Kubernetes resources from Git and Flagger will
launch the `frontend` and `backend` apps.

A CI/CD pipeline for the `frontend` release could look like this:

* cut a release from the master branch of the podinfo code repo with the git tag `1.4.1`
* CI builds the image and pushes the `podinfo:1.4.1` image to the container registry
* Flux scans the registry and updates the Helm release `image.tag` to `1.4.1`
* Flux commits and pushes the change to the cluster repo
* Flux applies the updated Helm release on the cluster
* Flux Helm Operator picks up the change and calls Tiller to upgrade the release
* Flagger detects a revision change and scales up the `frontend` deployment
* Flagger starts the load test and runs the canary analysis
* Based on the analysis result the canary deployment is promoted to production or rolled back
* Flagger sends a Slack notification with the canary result

If the canary fails, fix the bug, do another patch release, e.g. `1.4.2`, and the whole process will run again.
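Cutting such a patch release is just a tag push from the application repository (repository name assumed for illustration); CI and Flux take it from there:

```bash
# in the podinfo application repo (name assumed for illustration)
git checkout master
git pull
git tag 1.4.2
git push origin 1.4.2
```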
A canary deployment can fail for any of the following reasons:

* the container image can't be downloaded
* the deployment replica set is stuck for more than ten minutes (e.g. due to a container crash loop)
* the webhooks (acceptance tests, load tests, etc.) are returning a non-2xx response
* the HTTP success rate (non-5xx responses) metric drops under the threshold
* the HTTP average duration metric goes over the threshold
* the Istio telemetry service is unable to collect traffic metrics
* the metrics server (Prometheus) can't be reached

If you want to find out more about managing Helm releases with Flux, here is an in-depth guide:
[github.com/stefanprodan/gitops-helm](https://github.com/stefanprodan/gitops-helm).