| id | title | sidebar_label | original_id |
|---|---|---|---|
| node-restart | Node Restart Experiment Details | Node Restart | node-restart |
Experiment Metadata
| Type | Description | Tested K8s Platform |
|---|---|---|
| Generic | Restart the target node | Kubevirt VMs |
Prerequisites
- Ensure that the Litmus Chaos Operator is running by executing `kubectl get pods` in the operator namespace (typically, `litmus`). If not, install from here.
- Ensure that the `node-restart` experiment resource is available in the cluster by executing `kubectl get chaosexperiments` in the desired namespace. If not, install from here.
- Create a Kubernetes secret holding the private SSH key of the `SSH_USER` used to connect to `TARGET_NODE`. The secret must be named `id-rsa` and must store the private key data under the key `ssh-privatekey`. A sample secret is given below:
apiVersion: v1
kind: Secret
metadata:
  name: id-rsa
type: Opaque
stringData:
  ssh-privatekey: |-
    # Add the private key for ssh here
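The same secret can also be created directly from a key file instead of pasting the key into a manifest. The sketch below wraps the standard `kubectl create secret generic` command in a small helper; the function name `create_ssh_secret`, the key path, and the choice of namespace are assumptions for illustration, not part of the experiment itself.

```shell
# Hypothetical helper: create the id-rsa secret from an existing private key file.
# $1 = path to the private key, $2 = namespace in which the experiment will run (assumed).
create_ssh_secret() {
  local key_file="$1"
  local namespace="$2"
  # "generic" secrets are of type Opaque; the data key must be ssh-privatekey.
  kubectl create secret generic id-rsa \
    --from-file=ssh-privatekey="$key_file" \
    --namespace "$namespace"
}
```

For example, `create_ssh_secret "$HOME/.ssh/id_rsa" default` would create the secret in the `default` namespace, assuming the key for `SSH_USER` lives at that path.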
Entry-Criteria
- Application pods should be healthy before chaos injection.
- Target Nodes should be in Ready state before chaos injection.
Exit-Criteria
- Application pods should be healthy after chaos injection.
- Target Nodes should be in Ready state after chaos injection.
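The node-readiness criteria above can be checked from the command line before and after the run. This is a minimal sketch that reads the node's `Ready` condition through kubectl's JSONPath output; the helper name `node_is_ready` is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) only if the given node reports Ready=True.
node_is_ready() {
  local node="$1"
  local status
  # Select the status of the condition whose type is "Ready".
  status="$(kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  [ "$status" = "True" ]
}
```

A possible use is `node_is_ready node01 && echo "node01 is Ready"`, run once before creating the ChaosEngine and again after the experiment finishes.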
Details
- Disrupts the state of the target node by restarting it.
- Tests deployment sanity (replica availability & uninterrupted service) and recovery workflows of the application pod.
Integrations
- Node Restart can be effected using the chaos library: `litmus`.
Steps to Execute the Chaos Experiment
- This Chaos Experiment can be triggered by creating a ChaosEngine resource on the cluster. To understand the values to provide in a ChaosEngine specification, refer to Getting Started.
- Follow the steps in the sections below to create the chaosServiceAccount, prepare the ChaosEngine & execute the experiment.
Prepare chaosServiceAccount
- Use this sample RBAC manifest to create a chaosServiceAccount in the desired (app) namespace. This example consists of the minimum necessary role permissions to execute the experiment.
Sample RBAC Manifest
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-restart-sa
  namespace: default
  labels:
    name: node-restart-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-restart-sa
  labels:
    name: node-restart-sa
    app.kubernetes.io/part-of: litmus
rules:
  - apiGroups: ["", "litmuschaos.io", "batch", "apps"]
    resources:
      [
        "pods",
        "jobs",
        "secrets",
        "events",
        "chaosengines",
        "pods/log",
        "pods/exec",
        "chaosexperiments",
        "chaosresults",
      ]
    verbs:
      ["create", "list", "get", "patch", "update", "delete", "deletecollection"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-restart-sa
  labels:
    name: node-restart-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-restart-sa
subjects:
  - kind: ServiceAccount
    name: node-restart-sa
    namespace: default
Note: On restricted systems/setups, create a PodSecurityPolicy (PSP) with the required permissions. The chaosServiceAccount can subscribe to this PSP to work around the respective limitations. An example of a standard PSP that can be used for Litmus chaos experiments can be found here.
Prepare ChaosEngine
- Provide the application info in `spec.appinfo`.
- Provide the auxiliary applications info (ns & labels) in `spec.auxiliaryAppInfo`.
- Override the experiment tunables if desired in `experiments.spec.components.env`.
- To understand the values to provide in a ChaosEngine specification, refer to ChaosEngine Concepts.
Supported Experiment Tunables
| Variables | Description | Specify In ChaosEngine | Notes |
|---|---|---|---|
| LIB_IMAGE | The image used to restart the node | Optional | Defaults to `litmuschaos/go-runner:1.11.0` |
| SSH_USER | Name of the SSH user | Mandatory | Defaults to `root` |
| TARGET_NODE | Name of the target node subjected to chaos | Mandatory | |
| TARGET_NODE_IP | IP of the target node subjected to chaos | Mandatory | |
| REBOOT_COMMAND | Command used for reboot | Mandatory | Defaults to `sudo systemctl reboot` |
| TOTAL_CHAOS_DURATION | The time duration for chaos insertion (sec) | Optional | Defaults to 30s |
| RAMP_TIME | Period to wait before injection of chaos (sec) | Optional | |
| LIB | The chaos lib used to inject the chaos | Optional | Defaults to `litmus`; only `litmus` is supported |
| INSTANCE_ID | A user-defined string that holds metadata/info about the current run/instance of chaos. Ex: 04-05-2020-9-00. This string is appended as a suffix to the chaosresult CR name. | Optional | Ensure that the overall length of the chaosresult CR name is still < 64 characters |
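The length limit noted for `INSTANCE_ID` can be checked before the run. The sketch below builds the expected ChaosResult name and tests it against the 64-character bound. Both helper names are hypothetical, and joining the `INSTANCE_ID` suffix with a hyphen is an assumption; the table above only states that the string is appended as a suffix.

```shell
# Hypothetical helper: print the expected ChaosResult name.
# $1 = ChaosEngine name, $2 = experiment name, $3 = optional INSTANCE_ID.
chaosresult_name() {
  local engine="$1"
  local experiment="$2"
  local instance_id="${3:-}"
  local name="${engine}-${experiment}"
  # Assumption: the INSTANCE_ID suffix is joined with a hyphen.
  if [ -n "$instance_id" ]; then
    name="${name}-${instance_id}"
  fi
  printf '%s\n' "$name"
}

# Hypothetical helper: succeed only if the name is shorter than 64 characters.
name_fits() {
  [ "${#1}" -lt 64 ]
}
```

For instance, `name_fits "$(chaosresult_name nginx-chaos node-restart 04-05-2020-9-00)"` succeeds, since the resulting name is well under the limit.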
Sample ChaosEngine Manifest
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  # It can be true/false
  annotationCheck: "false"
  # It can be active/stop
  engineState: "active"
  #ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ""
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: node-restart-sa
  monitoring: false
  # It can be delete/retain
  jobCleanUpPolicy: "delete"
  experiments:
    - name: node-restart
      spec:
        components:
          nodeSelector:
            # provide the node labels
            kubernetes.io/hostname: "node02"
          env:
            # ENTER THE TARGET NODE NAME
            - name: TARGET_NODE
              value: "node01"
            # ENTER THE TARGET NODE IP
            - name: TARGET_NODE_IP
              value: ""
            # ENTER THE USER TO BE USED FOR SSH AUTH
            - name: SSH_USER
              value: ""
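The `TARGET_NODE_IP` value left empty in the manifest can be looked up from the node object. This sketch assumes the node is reachable over SSH at its `InternalIP` address, which may not hold in every setup (for example, when the VM is exposed on a different network); the helper name `node_internal_ip` is hypothetical.

```shell
# Hypothetical helper: print the InternalIP address reported for a node.
node_internal_ip() {
  local node="$1"
  # Select the address entry whose type is "InternalIP".
  kubectl get node "$node" \
    -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'
}
```

Running `node_internal_ip node01` prints the address to use as the `TARGET_NODE_IP` value, under the assumption stated above.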
Create the ChaosEngine Resource
- Create the ChaosEngine manifest prepared in the previous step to trigger the chaos.
  `kubectl apply -f chaosengine.yml`
- If the chaos experiment is not executed, refer to the troubleshooting section to identify the root cause and fix the issues.
Watch Chaos progress
- View the status of the nodes as they are subjected to node restart.
  `watch -n 1 kubectl get nodes`
Check Chaos Experiment Result
- Check whether the application is resilient to the node restart once the experiment (job) is completed. The ChaosResult resource name is derived like this: `<ChaosEngine-Name>-<ChaosExperiment-Name>`.
  `kubectl describe chaosresult nginx-chaos-node-restart -n <application-namespace>`
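For scripted checks, the verdict can be read as a single value rather than scanned from the `describe` output. This is a sketch only: the helper name `chaos_verdict` is hypothetical, and the field path `.status.experimentStatus.verdict` is an assumption about the ChaosResult schema that may differ (for example, in casing) between Litmus releases. Confirm it against `kubectl get chaosresult <name> -o yaml` on the cluster.

```shell
# Hypothetical helper: print the verdict recorded in a ChaosResult.
# $1 = ChaosResult name, $2 = application namespace.
chaos_verdict() {
  local result="$1"
  local namespace="$2"
  # Assumed field path; verify it against the ChaosResult YAML in your cluster.
  kubectl get chaosresult "$result" -n "$namespace" \
    -o jsonpath='{.status.experimentStatus.verdict}'
}
```

With the sample engine above, `chaos_verdict nginx-chaos-node-restart default` would print the recorded verdict once the experiment job has completed.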
Node Restart Experiment Demo
- A sample recording of this experiment execution will be added soon.