---
id: kafka-broker-disk-failure
title: Kafka Broker Disk Failure Experiment Details
sidebar_label: Broker Disk Failure
original_id: kafka-broker-disk-failure
---
Experiment Metadata
| Type | Description | Kafka Distribution | Tested K8s Platform |
|---|---|---|---|
| Kafka | Fail kafka broker disk/storage | Confluent, Kudo-Kafka | GKE |
Prerequisites
- Ensure that the Litmus Chaos Operator is running by executing `kubectl get pods` in the operator namespace (typically, `litmus`). If not, install from here (a consolidated verification sketch for these prerequisites follows this list)
- Ensure that Kafka & Zookeeper are deployed as Statefulsets
- If Confluent/Kudo Operators have been used to deploy Kafka, note the instance name, which will be used as the value of the `KAFKA_INSTANCE_NAME` experiment environment variable
  - In case of Confluent, specified by the `--name` flag
  - In case of Kudo, specified by the `--instance` flag

  Zookeeper uses this to construct a path in which Kafka cluster data is stored.
- Ensure that the kafka-broker-disk-failure experiment resource is available in the cluster by executing `kubectl get chaosexperiments` in the desired namespace. If not, install from here
- Create a secret with the gcloud serviceaccount key (placed in a file `cloud_config.yml`) named `kafka-broker-disk-failure` in the namespace where the experiment CRs are created. This is necessary to perform the disk-detach steps from the litmus experiment container.

  `kubectl create secret generic kafka-broker-disk-failure --from-file=cloud_config.yml -n <kafka-namespace>`
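A consolidated verification sketch for the prerequisites above; the namespaces and label selectors are assumptions taken from the sample manifests later in this document and should be adjusted to your setup:

```bash
# Litmus Chaos Operator is running (operator namespace assumed to be 'litmus')
kubectl get pods -n litmus

# Kafka & Zookeeper are deployed as Statefulsets (namespace/labels are assumptions)
kubectl get statefulsets -n default -l app=cp-kafka
kubectl get statefulsets -n default -l app=cp-zookeeper

# The chaos experiment CR is available in the desired namespace
kubectl get chaosexperiments -n default | grep kafka-broker-disk-failure

# The gcloud serviceaccount secret exists in the namespace of the experiment CRs
kubectl get secret kafka-broker-disk-failure -n default
```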
Entry Criteria
- Kafka Cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy
Exit Criteria
- Kafka Cluster (comprising the Kafka-broker & Zookeeper Statefulsets) is healthy
- Kafka Message stream (if enabled) is unbroken
Details
- Causes forced detach of specified disk serving as storage for the Kafka broker pod
- Tests deployment sanity (replica availability & uninterrupted service) and recovery workflows of the Kafka cluster
- Tests unbroken message stream when the `KAFKA_LIVENESS_STREAM` experiment environment variable is set to `enabled`
Integrations
- Currently, the disk detach is supported only on GKE using LitmusLib, which internally uses the gcloud tools.
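For context, the forced detach roughly corresponds to the gcloud operations sketched below. This is an illustrative approximation, not the exact sequence LitmusLib runs; the pod, disk, and zone names are assumptions taken from the sample ChaosEngine later in this document:

```bash
# Find the GKE node (GCE instance) hosting the target broker pod
kubectl get pod kafka-0 -n default -o jsonpath='{.spec.nodeName}'

# Forcibly detach the disk backing the broker's storage from that node
gcloud compute instances detach-disk <node-name> --disk disk-1 --zone us-central1-a

# ...chaos duration elapses...

# Re-attach the disk to the same node as part of recovery
gcloud compute instances attach-disk <node-name> --disk disk-1 --zone us-central1-a
```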
Steps to Execute the Chaos Experiment
- This Chaos Experiment can be triggered by creating a ChaosEngine resource on the cluster. To understand the values to provide in a ChaosEngine specification, refer Getting Started
- Follow the steps in the sections below to create the chaosServiceAccount, prepare the ChaosEngine & execute the experiment.
Prepare chaosServiceAccount
- Use this sample RBAC manifest to create a chaosServiceAccount in the desired (app) namespace. This example consists of the minimum necessary role permissions to execute the experiment.
Sample RBAC Manifest
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kafka-broker-disk-failure-sa
  namespace: default
  labels:
    name: kafka-broker-disk-failure-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kafka-broker-disk-failure-sa
  labels:
    name: kafka-broker-disk-failure-sa
    app.kubernetes.io/part-of: litmus
rules:
  - apiGroups: ["", "litmuschaos.io", "batch", "apps"]
    resources:
      [
        "pods",
        "jobs",
        "pods/log",
        "events",
        "pods/exec",
        "statefulsets",
        "secrets",
        "chaosengines",
        "chaosexperiments",
        "chaosresults",
      ]
    verbs: ["create", "list", "get", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kafka-broker-disk-failure-sa
  labels:
    name: kafka-broker-disk-failure-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kafka-broker-disk-failure-sa
subjects:
  - kind: ServiceAccount
    name: kafka-broker-disk-failure-sa
    namespace: default
```
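Once saved locally (for example as `rbac.yaml`, a hypothetical filename), the manifest can be applied and the resulting ServiceAccount verified:

```bash
# Apply the RBAC manifest and confirm the ServiceAccount exists
kubectl apply -f rbac.yaml
kubectl get serviceaccount kafka-broker-disk-failure-sa -n default
```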
Prepare ChaosEngine
- Provide the application info in `spec.appinfo`
- Provide the experiment tunables. While many tunables have default values specified in the ChaosExperiment CR, some need to be explicitly supplied in `experiments.spec.components.env`
- To understand the values to be provided in a ChaosEngine specification, refer ChaosEngine Concepts
Supported Experiment Tunables
| Parameter | Description | Specify In ChaosEngine | Notes |
|---|---|---|---|
| KAFKA_NAMESPACE | Namespace of Kafka Brokers | Mandatory | May be same as value for `spec.appinfo.appns` |
| KAFKA_LABEL | Unique label of Kafka Brokers | Mandatory | May be same as value for `spec.appinfo.applabel` |
| KAFKA_SERVICE | Headless service of the Kafka Statefulset | Mandatory | |
| KAFKA_PORT | Port of the Kafka ClusterIP service | Mandatory | |
| ZOOKEEPER_NAMESPACE | Namespace of Zookeeper Cluster | Mandatory | May be same as value for KAFKA_NAMESPACE or other |
| ZOOKEEPER_LABEL | Unique label of Zookeeper statefulset | Mandatory | |
| ZOOKEEPER_SERVICE | Headless service of the Zookeeper Statefulset | Mandatory | |
| ZOOKEEPER_PORT | Port of the Zookeeper ClusterIP service | Mandatory | |
| CLOUD_PLATFORM | Cloud platform type on which to inject disk loss | Mandatory | Supported platforms: GKE |
| PROJECT_ID | GCP Project ID in which the Cluster is created | Mandatory | |
| DISK_NAME | GCloud Disk attached to the Cluster Node where specified broker is scheduled | Mandatory | |
| ZONE_NAME | Zone in which the Disks/Cluster are created | Mandatory | |
| KAFKA_BROKER | Kafka broker pod which is using the specified disk | Mandatory | Experiment verifies this by mapping node details |
| KAFKA_KIND | Kafka deployment type | Optional | Same as `spec.appinfo.appkind`. Supported: `statefulset` |
| KAFKA_LIVENESS_STREAM | Kafka liveness message stream | Optional | Supported: `enabled`, `disabled` |
| KAFKA_LIVENESS_IMAGE | Image used for liveness message stream | Optional | Image as `<registry_url>/<repository>/<image>:<tag>` |
| KAFKA_REPLICATION_FACTOR | Number of partition replicas for liveness topic partition | Optional | Necessary if KAFKA_LIVENESS_STREAM is `enabled` |
| KAFKA_INSTANCE_NAME | Name of the Kafka chroot path on zookeeper | Optional | Necessary if installation involves use of such path |
| KAFKA_CONSUMER_TIMEOUT | Kafka consumer message timeout, post which it terminates | Optional | Defaults to 30000ms |
| TOTAL_CHAOS_DURATION | The time duration for chaos insertion (seconds) | Optional | Defaults to 15s |
| INSTANCE_ID | A user-defined string that holds metadata/info about current run/instance of chaos. Ex: 04-05-2020-9-00. This string is appended as suffix in the chaosresult CR name. | Optional | Ensure that the overall length of the chaosresult CR is still < 64 characters |
Sample ChaosEngine Manifest
```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: kafka-chaos
  namespace: default
spec:
  # It can be true/false
  annotationCheck: "true"
  # It can be active/stop
  engineState: "active"
  # ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ""
  appinfo:
    appns: "default"
    applabel: "app=cp-kafka"
    appkind: "statefulset"
  chaosServiceAccount: kafka-broker-disk-failure-sa
  monitoring: false
  # It can be delete/retain
  jobCleanUpPolicy: "delete"
  experiments:
    - name: kafka-broker-disk-failure
      spec:
        components:
          env:
            # choose based on available kafka broker replicas
            - name: KAFKA_REPLICATION_FACTOR
              value: "3"
            # get via 'kubectl get pods --show-labels -n <kafka-namespace>'
            - name: KAFKA_LABEL
              value: "app=cp-kafka"
            - name: KAFKA_NAMESPACE
              value: "default"
            # get via 'kubectl get svc -n <kafka-namespace>'
            - name: KAFKA_SERVICE
              value: "kafka-cp-kafka-headless"
            # get via 'kubectl get svc -n <kafka-namespace>'
            - name: KAFKA_PORT
              value: "9092"
            # in milliseconds
            - name: KAFKA_CONSUMER_TIMEOUT
              value: "70000"
            # ensure to set the instance name if using KUDO operator
            - name: KAFKA_INSTANCE_NAME
              value: ""
            - name: ZOOKEEPER_NAMESPACE
              value: "default"
            # get via 'kubectl get pods --show-labels -n <zk-namespace>'
            - name: ZOOKEEPER_LABEL
              value: "app=cp-zookeeper"
            # get via 'kubectl get svc -n <zk-namespace>'
            - name: ZOOKEEPER_SERVICE
              value: "kafka-cp-zookeeper-headless"
            # get via 'kubectl get svc -n <zk-namespace>'
            - name: ZOOKEEPER_PORT
              value: "2181"
            # get from google cloud console or 'gcloud projects list'
            - name: PROJECT_ID
              value: "argon-tractor-237811"
            # attached to (in use by) node where 'kafka-0' is scheduled
            - name: DISK_NAME
              value: "disk-1"
            - name: ZONE_NAME
              value: "us-central1-a"
            # uses 'disk-1' attached to the node on which it is scheduled
            - name: KAFKA_BROKER
              value: "kafka-0"
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: "60"
```
Create the ChaosEngine Resource
- Apply the ChaosEngine manifest prepared in the previous step to trigger the chaos (a quick verification sketch follows this list).

  `kubectl apply -f chaosengine.yml`

- If the chaos experiment is not executed, refer to the troubleshooting section to identify the root cause and fix the issues.
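As a quick sanity check (resource names follow the sample manifest above; pod names may vary slightly across Litmus versions), confirm that the engine was admitted and the chaos pods come up:

```bash
# ChaosEngine should exist and be in the 'active' engine state
kubectl get chaosengine kafka-chaos -n default

# The chaos runner and experiment job pods are created in the app namespace
kubectl get pods -n default | grep -E 'chaos-runner|kafka-broker-disk-failure'
```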
Watch Chaos progress
- View broker pod termination upon disk loss by setting up a watch on the pods in the Kafka namespace

  `watch -n 1 kubectl get pods -n <kafka-namespace>`
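In parallel, the detach can be observed from the GCP side. A minimal sketch, assuming the disk and zone values from the sample ChaosEngine above:

```bash
# The 'users' field lists the instances the disk is attached to;
# it should become empty while the chaos is in progress
gcloud compute disks describe disk-1 --zone us-central1-a --format="value(users)"
```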
Check Chaos Experiment Result
- Check whether the Kafka deployment is resilient to the broker disk failure once the experiment (job) is completed. The ChaosResult resource name is derived as `<ChaosEngine-Name>-<ChaosExperiment-Name>`.

  `kubectl describe chaosresult kafka-chaos-kafka-broker-disk-failure -n <kafka-namespace>`
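To query just the verdict, something like the following can be used; the exact status field path is an assumption that may vary across Litmus versions, so verify it against the `kubectl describe` output above:

```bash
# Expect 'Pass' if the Kafka cluster tolerated the broker disk failure
kubectl get chaosresult kafka-chaos-kafka-broker-disk-failure -n <kafka-namespace> \
  -o jsonpath='{.status.experimentstatus.verdict}'
```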
Kafka Broker Recovery Post Experiment Execution
- The experiment re-attaches the detached disk to the same node as part of its recovery steps. However, if the disk is not provisioned as a Persistent Volume and instead provides the backing store to a PV carved out of it (for example, as a hostPath directory for a Kubernetes Local PV), the brokers may continue to stay in `CrashLoopBackOff` state.
- The complete recovery steps then involve (see the sketch below):
  - Remounting the disk into the desired mount point
  - Deleting the affected broker pod to force a reschedule
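A minimal sketch of those manual recovery steps, assuming the disk backs a hostPath-based Local PV mounted at a hypothetical path `/mnt/disks/disk-1`; the node name and device path are assumptions for your environment:

```bash
# Re-attach the disk to the node if it is still detached
gcloud compute instances attach-disk <node-name> --disk disk-1 --zone us-central1-a

# On the node: remount the disk into the expected mount point
sudo mount /dev/sdb /mnt/disks/disk-1

# Delete the affected broker pod so the Statefulset controller reschedules it
kubectl delete pod kafka-0 -n default
```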
Kafka Broker Disk Failure Demo
- TODO: A sample recording of this experiment execution is provided here.