| id | title | sidebar_label | original_id |
| --- | --- | --- | --- |
| node-restart | Node Restart Experiment Details | Node Restart | node-restart |

Experiment Metadata

| Type | Description | Tested K8s Platform |
| --- | --- | --- |
| Generic | Restart the target node | Kubevirt VMs |

Prerequisites

  • Ensure that the Litmus Chaos Operator is running by executing kubectl get pods in the operator namespace (typically, litmus). If not, install from here
  • Ensure that the node-restart experiment resource is available in the cluster by executing kubectl get chaosexperiments in the desired namespace. If not, install from here
  • Create a Kubernetes secret that holds the private SSH key for the SSH_USER used to connect to TARGET_NODE. The secret must be named id-rsa, and the private key data must be stored under the key ssh-privatekey. A sample secret is given below:
apiVersion: v1
kind: Secret
metadata:
  name: id-rsa
type: Opaque
stringData:
  ssh-privatekey: |-
    # Add the private key for ssh here    
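
If the private key already exists as a file, the manifest above can be generated instead of pasted by hand. The following is a minimal sketch under stated assumptions: `render_ssh_secret` is a hypothetical helper name (it is not part of Litmus), and the key file path passed to it is whatever path holds your key.

```shell
# render_ssh_secret KEY_FILE
# Prints a Secret manifest named id-rsa whose ssh-privatekey field holds
# the contents of KEY_FILE. Hypothetical helper, not shipped with Litmus.
render_ssh_secret() {
  key_file="$1"
  if [ ! -r "$key_file" ]; then
    echo "cannot read key file: $key_file" >&2
    return 1
  fi
  cat <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: id-rsa
type: Opaque
stringData:
  ssh-privatekey: |-
EOF
  # Indent every key line by four spaces so it sits inside the block scalar.
  sed 's/^/    /' "$key_file"
}
```

The output can be piped to `kubectl apply -f -`. An equivalent one-liner that skips the manifest is `kubectl create secret generic id-rsa --from-file=ssh-privatekey=<path-to-key>`.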

Entry-Criteria

  • Application pods should be healthy before chaos injection.
  • Target Nodes should be in Ready state before chaos injection.
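
The node half of these criteria can be checked from the command line before the experiment is started. This is a minimal sketch, assuming `kubectl` is configured for the target cluster; `node_is_ready` is a hypothetical helper name introduced here for illustration.

```shell
# node_is_ready NODE
# Succeeds (exit status 0) when the node's Ready condition reports "True".
# Hypothetical helper; it only wraps a kubectl JSONPath query.
node_is_ready() {
  status="$(kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  [ "$status" = "True" ]
}
```

For example, `node_is_ready node01 && echo "node01 is Ready"`. Application pod health can be inspected separately with `kubectl get pods -n <application-namespace> -l <application-label>`.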

Exit-Criteria

  • Application pods should be healthy after chaos injection.
  • Target Nodes should be in Ready state after chaos injection.

Details

  • Disrupts the state of the node by restarting it.
  • Tests deployment sanity (replica availability & uninterrupted service) and recovery workflows of the application pod.

Integrations

  • Node Restart can be effected using the chaos library: litmus.

Steps to Execute the Chaos Experiment

  • This Chaos Experiment can be triggered by creating a ChaosEngine resource on the cluster. To understand the values to provide in a ChaosEngine specification, refer to Getting Started.

  • Follow the steps in the sections below to create the chaosServiceAccount, prepare the ChaosEngine & execute the experiment.

Prepare chaosServiceAccount

  • Use this sample RBAC manifest to create a chaosServiceAccount in the desired (app) namespace. This example consists of the minimum necessary role permissions to execute the experiment.

Sample RBAC Manifest

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-restart-sa
  namespace: default
  labels:
    name: node-restart-sa
    app.kubernetes.io/part-of: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-restart-sa
  labels:
    name: node-restart-sa
    app.kubernetes.io/part-of: litmus
rules:
  - apiGroups: ["", "litmuschaos.io", "batch", "apps"]
    resources:
      [
        "pods",
        "jobs",
        "secrets",
        "events",
        "chaosengines",
        "pods/log",
        "pods/exec",
        "chaosexperiments",
        "chaosresults",
      ]
    verbs:
      ["create", "list", "get", "patch", "update", "delete", "deletecollection"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-restart-sa
  labels:
    name: node-restart-sa
    app.kubernetes.io/part-of: litmus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-restart-sa
subjects:
  - kind: ServiceAccount
    name: node-restart-sa
    namespace: default

Note: In case of restricted systems/setup, create a PodSecurityPolicy (PSP) with the required permissions. The chaosServiceAccount can subscribe to this PSP to work around the respective limitations. An example of a standard PSP that can be used for litmus chaos experiments can be found here.

Prepare ChaosEngine

  • Provide the application info in spec.appinfo
  • Provide the auxiliary applications info (namespace & labels) in spec.auxiliaryAppInfo
  • Override the experiment tunables, if desired, in experiments.spec.components.env
  • To understand the values to provide in a ChaosEngine specification, refer to ChaosEngine Concepts

Supported Experiment Tunables

| Variables | Description | Specify In ChaosEngine | Notes |
| --- | --- | --- | --- |
| LIB_IMAGE | The image used to restart the node | Optional | Defaults to `litmuschaos/go-runner:1.11.0` |
| SSH_USER | Name of the SSH user | Mandatory | Defaults to `root` |
| TARGET_NODE | Name of the target node subjected to chaos | Mandatory | |
| TARGET_NODE_IP | IP of the target node subjected to chaos | Mandatory | |
| REBOOT_COMMAND | Command used for reboot | Mandatory | Defaults to `sudo systemctl reboot` |
| TOTAL_CHAOS_DURATION | The time duration for chaos insertion (sec) | Optional | Defaults to 30s |
| RAMP_TIME | Period to wait before injection of chaos (sec) | Optional | |
| LIB | The chaos lib used to inject the chaos | Optional | Defaults to `litmus`; only `litmus` is supported |
| INSTANCE_ID | A user-defined string that holds metadata/info about the current run/instance of chaos. Ex: 04-05-2020-9-00. This string is appended as a suffix in the chaosresult CR name. | Optional | Ensure that the overall length of the chaosresult CR name is still < 64 characters |

Sample ChaosEngine Manifest

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  # It can be true/false
  annotationCheck: "false"
  # It can be active/stop
  engineState: "active"
  #ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ""
  appinfo:
    appns: "default"
    applabel: "app=nginx"
    appkind: "deployment"
  chaosServiceAccount: node-restart-sa
  monitoring: false
  # It can be delete/retain
  jobCleanUpPolicy: "delete"
  experiments:
    - name: node-restart
      spec:
        components:
          nodeSelector:
            # provide the node labels
            kubernetes.io/hostname: "node02"
          env:
            # ENTER THE TARGET NODE NAME
            - name: TARGET_NODE
              value: "node01"

            # ENTER THE TARGET NODE IP
            - name: TARGET_NODE_IP
              value: ""

            # ENTER THE USER TO BE USED FOR SSH AUTH
            - name: SSH_USER
              value: ""
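
The manifest leaves TARGET_NODE_IP empty. One way to look up a value for it is to ask the API server for the node's address. This is a minimal sketch: `node_internal_ip` is a hypothetical helper name, and it assumes the node is reachable over SSH at the InternalIP address it reports. In some networks a different address type may be the one to use.

```shell
# node_internal_ip NODE
# Prints the InternalIP address that the API server reports for NODE.
# Hypothetical helper; assumes kubectl is configured for the target cluster.
node_internal_ip() {
  kubectl get node "$1" \
    -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'
}
```

For example, `node_internal_ip node01` prints the address to place in the TARGET_NODE_IP value. `kubectl get nodes -o wide` shows the same information for all nodes.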

Create the ChaosEngine Resource

  • Apply the ChaosEngine manifest prepared in the previous step to trigger the chaos.

    kubectl apply -f chaosengine.yml

  • If the chaos experiment is not executed, refer to the troubleshooting section to identify the root cause and fix the issues.

Watch Chaos progress

  • View the status of the nodes as they are subjected to node restart.

    watch -n 1 kubectl get nodes

Check Chaos Experiment Result

  • Check whether the application is resilient to the node restart, once the experiment (job) is completed. The ChaosResult resource name is derived like this: <ChaosEngine-Name>-<ChaosExperiment-Name>.

    kubectl describe chaosresult nginx-chaos-node-restart -n <application-namespace>
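
The naming rule, together with the INSTANCE_ID suffix and the 64-character limit noted in the tunables, can be sketched as a small helper. `chaosresult_name` is a hypothetical function name, and joining INSTANCE_ID with a hyphen is an assumption: this page only says the string is appended as a suffix.

```shell
# chaosresult_name ENGINE EXPERIMENT [INSTANCE_ID]
# Prints <ENGINE>-<EXPERIMENT>, optionally followed by -<INSTANCE_ID>,
# and fails if the result is 64 characters or longer.
# Hypothetical helper; the hyphen before INSTANCE_ID is an assumption.
chaosresult_name() {
  name="$1-$2"
  if [ -n "${3:-}" ]; then
    name="$name-$3"
  fi
  if [ "${#name}" -ge 64 ]; then
    echo "chaosresult name too long (${#name} characters): $name" >&2
    return 1
  fi
  printf '%s\n' "$name"
}
```

With the sample manifest, `chaosresult_name nginx-chaos node-restart` prints `nginx-chaos-node-restart`, which matches the name used in the describe command above.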

Node Restart Experiment Demo

  • A sample recording of this experiment execution will be added soon.