---
id: kafka-broker-disk-failure
title: Kafka Broker Disk Failure Experiment Details
sidebar_label: Broker Disk Failure
original_id: kafka-broker-disk-failure
---

## Experiment Metadata

| Type  | Description                    | Kafka Distribution    | Tested K8s Platform |
| ----- | ------------------------------ | --------------------- | ------------------- |
| Kafka | Fail kafka broker disk/storage | Confluent, Kudo-Kafka | GKE                 |

## Prerequisites

- Ensure that the Litmus Chaos Operator is running by executing `kubectl get pods` in the operator namespace (typically, `litmus`). If not, install from here.

- Ensure that Kafka & Zookeeper are deployed as Statefulsets.

- If Confluent/Kudo Operators have been used to deploy Kafka, note the instance name, which will be used as the value of the KAFKA_INSTANCE_NAME experiment environment variable:

  - In case of Confluent, it is specified by the `--name` flag
  - In case of Kudo, it is specified by the `--instance` flag

  This name is used to construct the chroot path on Zookeeper under which the Kafka cluster data is stored.

- Ensure that the kafka-broker-disk-failure experiment resource is available in the cluster by executing `kubectl get chaosexperiments` in the desired namespace. If not, install from here.

- Create a secret with the gcloud serviceaccount key (placed in a file `cloud_config.yml`) named `kafka-broker-disk-failure` in the namespace where the experiment CRs are created. This is necessary to perform the disk-detach steps from the litmus experiment container.

      kubectl create secret generic kafka-broker-disk-failure --from-file=cloud_config.yml -n <kafka-namespace>
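
A quick pre-flight check covering the prerequisites above could look like this (the `litmus` namespace and the `<kafka-namespace>` placeholder are assumptions; adjust to your setup):

```bash
# Chaos operator should be running in the operator namespace
kubectl get pods -n litmus

# The experiment CR should be installed in the target namespace
kubectl get chaosexperiments -n <kafka-namespace>

# The gcloud serviceaccount secret should exist alongside the experiment CRs
kubectl get secret kafka-broker-disk-failure -n <kafka-namespace>
```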

## Entry Criteria

- Kafka Cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy

## Exit Criteria

- Kafka Cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy
- Kafka Message stream (if enabled) is unbroken

## Details

- Causes forced detach of the specified disk serving as storage for the Kafka broker pod
- Tests deployment sanity (replica availability & uninterrupted service) and recovery workflows of the Kafka cluster
- Tests unbroken message stream when the KAFKA_LIVENESS_STREAM experiment environment variable is set to `enabled` (see the env snippet just below this list)
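
For instance, enabling the liveness stream amounts to adding env entries such as the following to the ChaosEngine experiment spec (a sketch; the replication-factor value is illustrative and should match the available broker replicas):

```yaml
# Sketch: experiment env entries in the ChaosEngine (placement as in the sample manifest further below)
experiments:
  - name: kafka-broker-disk-failure
    spec:
      components:
        env:
          - name: KAFKA_LIVENESS_STREAM
            value: "enabled"
          # partition replicas for the liveness topic; required when the stream is enabled
          - name: KAFKA_REPLICATION_FACTOR
            value: "3"
```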

## Integrations

- Currently, the disk detach is supported only on GKE using LitmusLib, which internally uses the gcloud tools.
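
The detach and re-attach operations performed by the experiment are broadly equivalent to the following gcloud calls (illustrative only; this is not the exact LitmusLib implementation, and the node, disk, and zone names are placeholders):

```bash
# Detach the target disk from the node hosting the broker pod
gcloud compute instances detach-disk <node-name> --disk <disk-name> --zone <zone-name>

# Re-attach the disk during the recovery phase
gcloud compute instances attach-disk <node-name> --disk <disk-name> --zone <zone-name>
```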

## Steps to Execute the Chaos Experiment

- This Chaos Experiment can be triggered by creating a ChaosEngine resource on the cluster. To understand the values to provide in a ChaosEngine specification, refer to Getting Started.

- Follow the steps in the sections below to create the chaosServiceAccount, prepare the ChaosEngine & execute the experiment.

### Prepare chaosServiceAccount

- Use this sample RBAC manifest to create a chaosServiceAccount in the desired (app) namespace. This example consists of the minimum necessary role permissions to execute the experiment.

#### Sample Rbac Manifest

```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kafka-broker-disk-failure-sa
  namespace: default
  labels:
    name: kafka-broker-disk-failure-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kafka-broker-disk-failure-sa
  labels:
    name: kafka-broker-disk-failure-sa
    app.kubernetes.io/part-of: litmus
rules:
  - apiGroups: ["", "litmuschaos.io", "batch", "apps"]
    resources:
      [
        "pods",
        "jobs",
        "pods/log",
        "events",
        "pods/exec",
        "statefulsets",
        "secrets",
        "chaosengines",
        "chaosexperiments",
        "chaosresults",
      ]
    verbs: ["create", "list", "get", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kafka-broker-disk-failure-sa
  labels:
    name: kafka-broker-disk-failure-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kafka-broker-disk-failure-sa
subjects:
  - kind: ServiceAccount
    name: kafka-broker-disk-failure-sa
    namespace: default
```
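
Apply the manifest and confirm the service account exists before wiring it into the ChaosEngine (the filename `rbac.yaml` is an assumption; use whatever you saved the manifest as):

```bash
# Apply the RBAC manifest saved from the sample above
kubectl apply -f rbac.yaml

# Verify the service account was created in the target namespace
kubectl get sa kafka-broker-disk-failure-sa -n default
```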

### Prepare ChaosEngine

- Provide the application info in `spec.appinfo`
- Provide the experiment tunables. While many tunables have default values specified in the ChaosExperiment CR, some need to be explicitly supplied in `experiments.spec.components.env`
- To understand the values to be provided in a ChaosEngine specification, refer to ChaosEngine Concepts

#### Supported Experiment Tunables

| Parameter | Description | Specify In ChaosEngine | Notes |
| --------- | ----------- | ---------------------- | ----- |
| KAFKA_NAMESPACE | Namespace of Kafka Brokers | Mandatory | May be same as value for `spec.appinfo.appns` |
| KAFKA_LABEL | Unique label of Kafka Brokers | Mandatory | May be same as value for `spec.appinfo.applabel` |
| KAFKA_SERVICE | Headless service of the Kafka Statefulset | Mandatory | |
| KAFKA_PORT | Port of the Kafka ClusterIP service | Mandatory | |
| ZOOKEEPER_NAMESPACE | Namespace of Zookeeper Cluster | Mandatory | May be same as value for KAFKA_NAMESPACE or other |
| ZOOKEEPER_LABEL | Unique label of the Zookeeper Statefulset | Mandatory | |
| ZOOKEEPER_SERVICE | Headless service of the Zookeeper Statefulset | Mandatory | |
| ZOOKEEPER_PORT | Port of the Zookeeper ClusterIP service | Mandatory | |
| CLOUD_PLATFORM | Cloud platform type on which to inject disk loss | Mandatory | Supported platforms: GKE |
| PROJECT_ID | GCP Project ID in which the Cluster is created | Mandatory | |
| DISK_NAME | GCloud Disk attached to the Cluster Node where the specified broker is scheduled | Mandatory | |
| ZONE_NAME | Zone in which the Disks/Cluster are created | Mandatory | |
| KAFKA_BROKER | Kafka broker pod which is using the specified disk | Mandatory | Experiment verifies this by mapping node details |
| KAFKA_KIND | Kafka deployment type | Optional | Same as `spec.appinfo.appkind`. Supported: `statefulset` |
| KAFKA_LIVENESS_STREAM | Kafka liveness message stream | Optional | Supported: `enabled`, `disabled` |
| KAFKA_LIVENESS_IMAGE | Image used for liveness message stream | Optional | Image as `<registry_url>/<repository>/<image>:<tag>` |
| KAFKA_REPLICATION_FACTOR | Number of partition replicas for the liveness topic | Optional | Necessary if KAFKA_LIVENESS_STREAM is `enabled` |
| KAFKA_INSTANCE_NAME | Name of the Kafka chroot path on Zookeeper | Optional | Necessary if the installation involves use of such a path |
| KAFKA_CONSUMER_TIMEOUT | Kafka consumer message timeout, post which it terminates | Optional | Defaults to 30000 ms |
| TOTAL_CHAOS_DURATION | The time duration for chaos insertion (seconds) | Optional | Defaults to 15s |
| INSTANCE_ID | A user-defined string that holds metadata/info about the current run/instance of chaos, e.g. 04-05-2020-9-00. This string is appended as a suffix to the chaosresult CR name. | Optional | Ensure that the overall length of the chaosresult CR name is still < 64 characters |
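
One way to locate the values for DISK_NAME and ZONE_NAME is to look up the node hosting the target broker and list its attached disks (a sketch; the pod name, namespace, and zone are the sample values used in the manifest below):

```bash
# Find the node on which the target broker pod is scheduled
NODE=$(kubectl get pod kafka-0 -n default -o jsonpath='{.spec.nodeName}')

# List the disks attached to that node (the zone must match the node's zone)
gcloud compute instances describe "$NODE" --zone us-central1-a \
  --format="value(disks[].source)"
```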

#### Sample ChaosEngine Manifest

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: kafka-chaos
  namespace: default
spec:
  # It can be true/false
  annotationCheck: "true"
  # It can be active/stop
  engineState: "active"
  #ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ""
  appinfo:
    appns: "default"
    applabel: "app=cp-kafka"
    appkind: "statefulset"
  chaosServiceAccount: kafka-broker-disk-failure-sa
  monitoring: false
  # It can be delete/retain
  jobCleanUpPolicy: "delete"
  experiments:
    - name: kafka-broker-disk-failure
      spec:
        components:
          env:
            # choose based on available kafka broker replicas
            - name: KAFKA_REPLICATION_FACTOR
              value: "3"

            # get via 'kubectl get pods --show-labels -n <kafka-namespace>'
            - name: KAFKA_LABEL
              value: "app=cp-kafka"

            - name: KAFKA_NAMESPACE
              value: "default"

            # get via 'kubectl get svc -n <kafka-namespace>'
            - name: KAFKA_SERVICE
              value: "kafka-cp-kafka-headless"

            # get via 'kubectl get svc -n <kafka-namespace>'
            - name: KAFKA_PORT
              value: "9092"

            # in milliseconds
            - name: KAFKA_CONSUMER_TIMEOUT
              value: "70000"

            # ensure to set the instance name if using KUDO operator
            - name: KAFKA_INSTANCE_NAME
              value: ""

            - name: ZOOKEEPER_NAMESPACE
              value: "default"

            # get via 'kubectl get pods --show-labels -n <zk-namespace>'
            - name: ZOOKEEPER_LABEL
              value: "app=cp-zookeeper"

            # get via 'kubectl get svc -n <zk-namespace>
            - name: ZOOKEEPER_SERVICE
              value: "kafka-cp-zookeeper-headless"

            # get via 'kubectl get svc -n <zk-namespace>
            - name: ZOOKEEPER_PORT
              value: "2181"

            # get from google cloud console or 'gcloud projects list'
            - name: PROJECT_ID
              value: "argon-tractor-237811"

            # attached to (in use by) node where 'kafka-0' is scheduled
            - name: DISK_NAME
              value: "disk-1"

            - name: ZONE_NAME
              value: "us-central1-a"

            # Uses 'disk-1' attached to the node on which it is scheduled
            - name: KAFKA_BROKER
              value: "kafka-0"

            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: "60"

Create the ChaosEngine Resource

- Apply the ChaosEngine manifest prepared in the previous step to trigger the chaos.

      kubectl apply -f chaosengine.yml

- If the chaos experiment is not executed, refer to the troubleshooting section to identify the root cause and fix the issues.
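
Before diving into the troubleshooting docs, a couple of quick checks usually narrow things down (assuming the `kafka-chaos` engine from the sample manifest, running in the `default` namespace):

```bash
# Inspect the engine status and events for admission or validation errors
kubectl describe chaosengine kafka-chaos -n default

# Confirm that the chaos runner and experiment job pods were launched
# (pod names typically contain the engine and experiment names)
kubectl get pods -n default | grep -E 'kafka-chaos|kafka-broker-disk-failure'
```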

### Watch Chaos progress

- View broker pod termination upon disk loss by setting up a watch on the pods in the Kafka namespace:

      watch -n 1 kubectl get pods -n <kafka-namespace>
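
The detach can also be observed from the GCP side (illustrative; `disk-1` and `us-central1-a` are the sample values from the manifest above):

```bash
# Lists the instances the disk is attached to; the output is empty while the disk is detached
gcloud compute disks describe disk-1 --zone us-central1-a --format="value(users)"
```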

### Check Chaos Experiment Result

- Check whether the Kafka deployment is resilient to the broker disk failure once the experiment (job) is completed. The ChaosResult resource name is derived as `<ChaosEngine-Name>-<ChaosExperiment-Name>`.

      kubectl describe chaosresult kafka-chaos-kafka-broker-disk-failure -n <kafka-namespace>
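
To pull just the verdict rather than the full description, something along these lines can work (the status field path shown here may differ across Litmus versions, so treat it as an assumption):

```bash
# Prints the verdict (e.g. Pass/Fail) once the experiment has completed
kubectl get chaosresult kafka-chaos-kafka-broker-disk-failure -n <kafka-namespace> \
  -o jsonpath='{.status.experimentstatus.verdict}'
```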

## Kafka Broker Recovery Post Experiment Execution

- The experiment re-attaches the detached disk to the same node as part of the recovery steps. However, if the disk is not provisioned as a Persistent Volume and instead provides the backing store to a PV carved out of it (for example, as the hostPath directory for a Kubernetes Local PV), the brokers may remain in CrashLoopBackOff.

- The complete recovery steps involve:

  - Remounting the disk into the desired mount point
  - Deleting the affected broker pod to force a reschedule
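
A minimal sketch of that manual recovery, assuming the disk backs a Local PV on the node (the device path and mount point are placeholders, not values from the experiment):

```bash
# On the affected node: remount the re-attached disk at the expected mount point
sudo mount /dev/<device> /mnt/<local-pv-mount-point>

# Delete the affected broker pod so the Statefulset controller reschedules it
kubectl delete pod kafka-0 -n <kafka-namespace>
```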

## Kafka Broker Disk Failure Demo

- TODO: A sample recording of this experiment execution is provided here.