diff --git a/_data/guides.yml b/_data/guides.yml
index f5654a6efd..91ce1bef50 100644
--- a/_data/guides.yml
+++ b/_data/guides.yml
@@ -265,3 +265,5 @@ toc:
    path: /docs/admin/garbage-collection/
  - title: Configuring Kubernetes with Salt
    path: /docs/admin/salt/
+ - title: Monitoring Node Health
+   path: /docs/admin/node-problem/
diff --git a/docs/admin/node-problem.md b/docs/admin/node-problem.md
new file mode 100644
index 0000000000..5dc2d4cb52
--- /dev/null
+++ b/docs/admin/node-problem.md
@@ -0,0 +1,245 @@
---
---

* TOC
{:toc}

## Node Problem Detector

*Node problem detector* is a [DaemonSet](/docs/admin/daemons/) that monitors
node health. It collects node problems from various daemons and reports them
to the apiserver as [NodeConditions](/docs/admin/node/#node-condition) and
[Events](/docs/api-reference/v1/definitions/#_v1_event).

It currently supports detection of some known kernel issues, and will detect
more node problems over time.

Currently Kubernetes takes no action on the node conditions and events
generated by node problem detector. In the future, a remedy system could be
introduced to deal with node problems.

See the [node-problem-detector repository](https://github.com/kubernetes/node-problem-detector)
for more information.

## Limitations

* The kernel issue detection of node problem detector currently only supports
file-based kernel logs. It does not support log tools such as journald.

* The kernel issue detection of node problem detector makes assumptions about
the kernel log format, and currently only works on Ubuntu and Debian. However,
it is easy to extend it to [support other log formats](/docs/admin/node-problem/#support-other-log-format).

## Enable/Disable in GCE cluster

Node problem detector runs as a cluster
[addon](/docs/admin/cluster-large/#addon-resources) that is enabled by default
in the GCE cluster.

You can enable or disable it by setting the environment variable
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before running `kube-up.sh`.
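For example, to bring up a GCE cluster with node problem detector explicitly disabled (a minimal sketch; the commented-out `kube-up.sh` invocation is the usual one from a Kubernetes checkout, adjust it for your setup):

```shell
# Disable the node problem detector addon for the next cluster deployment.
export KUBE_ENABLE_NODE_PROBLEM_DETECTOR=false

# Then bring up the cluster as usual, e.g.:
#   cluster/kube-up.sh
echo "node problem detector enabled: ${KUBE_ENABLE_NODE_PROBLEM_DETECTOR}"
```

Setting the variable to `true` has the opposite effect.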
## Use in Other Environments

To enable node problem detector in environments outside of GCE, you can use
either `kubectl` or an addon pod.

### Kubectl

This is the recommended way to start node problem detector outside of GCE. It
provides more flexible management, such as overwriting the default
configuration to fit your environment or to detect customized node problems.

* **Step 1:** Create `node-problem-detector.yaml`:

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
```

***Notice that you should make sure the system log directory is right for your
OS distro.***

* **Step 2:** Start node problem detector with `kubectl`:

```shell
kubectl create -f node-problem-detector.yaml
```

### Addon Pod

This is for those who have their own cluster bootstrap solution and don't need
to overwrite the default configuration. They can leverage the addon pod to
further automate the deployment.

Just create `node-problem-detector.yaml`, and put it in the addon pods directory
`/etc/kubernetes/addons/node-problem-detector` on the master node.

## Overwrite the Configuration

The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
is embedded when building the Docker image of node problem detector.
However, you can use a [ConfigMap](/docs/user-guide/configmap/) to overwrite it
with the following steps:

* **Step 1:** Change the config files in `config/`.
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap
node-problem-detector-config --from-file=config/`.
* **Step 3:** Change `node-problem-detector.yaml` to use the ConfigMap:

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config
```

* **Step 4:** Re-create the node problem detector with the new YAML file:

```shell
kubectl delete -f node-problem-detector.yaml # If you have a node-problem-detector running
kubectl create -f node-problem-detector.yaml
```

***Notice that this approach only applies to node problem detector started with `kubectl`.***

Overwriting the configuration is not supported for node problem detector
running as a cluster addon, because the addon manager does not support
ConfigMap.

## Kernel Monitor

*Kernel Monitor* is a problem daemon in node problem detector. It monitors the
kernel log and detects known kernel issues by following predefined rules.
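Conceptually, each rule couples a problem type with a regular expression; when a new kernel log line matches the pattern, Kernel Monitor reports the corresponding condition or event. A minimal sketch of that matching step (the pattern and log line below are illustrative values, not taken from the actual config):

```shell
# Hypothetical rule pattern and kernel log line (illustrative values only).
pattern='task [[:alnum:]:]+ blocked for more than [0-9]+ seconds'
logline='kernel: INFO: task docker:20744 blocked for more than 120 seconds.'

# Kernel Monitor does this matching internally for each rule and each new log line.
if echo "$logline" | grep -qE "$pattern"; then
  echo "rule matched: report the problem"
fi
```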
The Kernel Monitor matches kernel issues against a set of predefined rules in
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
The rule list is extensible; you can always extend it by [overwriting the
configuration](/docs/admin/node-problem/#overwrite-the-configuration).

### Add New NodeConditions

To support new node conditions, you can extend the `conditions` field in
`config/kernel-monitor.json` with a new condition definition:

```json
{
  "type": "NodeConditionType",
  "reason": "CamelCaseDefaultNodeConditionReason",
  "message": "arbitrary default node condition message"
}
```

### Detect New Problems

To detect new problems, you can extend the `rules` field in `config/kernel-monitor.json`
with a new rule definition:

```json
{
  "type": "temporary/permanent",
  "condition": "NodeConditionOfPermanentIssue",
  "reason": "CamelCaseShortReason",
  "message": "regexp matching the issue in the kernel log"
}
```

### Change Log Path

The kernel log may be located at different paths in different OS distros. The
`log` field in `config/kernel-monitor.json` is the log path inside the
container. You can always configure it to match your OS distro.

### Support Other Log Format

Kernel monitor uses the [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
plugin to translate the kernel log into its internal data structure. It is easy
to implement a new translator for a new log format.

## Caveats

It is recommended to run node problem detector in your cluster to monitor node
health. However, you should be aware that this introduces extra resource
overhead on each node. Usually this is fine, because:

* The kernel log grows relatively slowly.
* A resource limit is set for node problem detector.
* Even under high load, the resource usage is acceptable.
+(see [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629))
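As an illustration of the detector's output, a node condition reported by Kernel Monitor might look like the following (the `type`, `reason`, and `message` values here are hypothetical examples following the condition definition format described above, not guaranteed defaults):

```json
{
  "type": "KernelDeadlock",
  "status": "True",
  "reason": "DockerHung",
  "message": "kernel: INFO: task docker:20744 blocked for more than 120 seconds."
}
```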