Add document for node problem detector.

commit c65d7400fd (parent ce6bf3d9c2)
```diff
@@ -265,3 +265,5 @@ toc:
     path: /docs/admin/garbage-collection/
   - title: Configuring Kubernetes with Salt
     path: /docs/admin/salt/
+  - title: Monitoring Node Health
+    path: /docs/admin/node-problem/
```

@@ -0,0 +1,245 @@
---
---

* TOC
{:toc}
## Node Problem Detector

*Node problem detector* is a [DaemonSet](/docs/admin/daemons/) that monitors
node health. It collects node problems from various daemons and reports them
to the apiserver as [NodeCondition](/docs/admin/node/#node-condition) and
[Event](/docs/api-reference/v1/definitions/#_v1_event) objects.

It currently supports detection of some known kernel issues, and will detect
more node problems over time.

For now, Kubernetes won't take any action on the node conditions and events
generated by node problem detector. In the future, a remedy system could be
introduced to deal with node problems.

See more information
[here](https://github.com/kubernetes/node-problem-detector).
## Limitations

* The kernel issue detection of node problem detector only supports file-based
kernel logs now. It doesn't support logging tools like journald.

* The kernel issue detection of node problem detector makes assumptions about
the kernel log format, so for now it only works on Ubuntu and Debian. However,
it is easy to extend it to
[support other log formats](/docs/admin/node-problem/#support-other-log-format).
## Enable/Disable in GCE cluster

Node problem detector runs as a cluster
[addon](/docs/admin/cluster-large/#addon-resources) that is enabled by default
in GCE clusters.

You can enable/disable it by setting the environment variable
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before running `kube-up.sh`.
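For example, the toggle can be set from a wrapper script. A minimal Python sketch (the `cluster/kube-up.sh` path assumes you run it from the root of a Kubernetes checkout; the actual cluster bring-up line is commented out):

```python
import os

# Sketch: prepare an environment that brings up a GCE cluster with node
# problem detector disabled. KUBE_ENABLE_NODE_PROBLEM_DETECTOR is the real
# toggle; everything else here is illustrative.
env = dict(os.environ, KUBE_ENABLE_NODE_PROBLEM_DETECTOR="false")

# Uncomment to actually bring up the cluster (requires a Kubernetes checkout):
# import subprocess
# subprocess.run(["cluster/kube-up.sh"], env=env, check=True)

print(env["KUBE_ENABLE_NODE_PROBLEM_DETECTOR"])  # → false
```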

## Use in Other Environment

To enable node problem detector in environments outside of GCE, you can use
either `kubectl` or an addon pod.

### Kubectl

This is the recommended way to start node problem detector outside of GCE. It
provides more flexible management, such as overwriting the default
configuration to fit your environment, or detecting customized node problems.

* **Step 1:** Create `node-problem-detector.yaml`:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
```
***Notice that you should make sure the system log directory is right for your
OS distro.***

* **Step 2:** Start node problem detector with `kubectl`:

```shell
kubectl create -f node-problem-detector.yaml
```
### Addon Pod

This is for those who have their own cluster bootstrap solution and don't need
to overwrite the default configuration. They can leverage the addon pod to
further automate the deployment.

Just create `node-problem-detector.yaml`, and put it under the addon pods
directory `/etc/kubernetes/addons/node-problem-detector` on the master node.

## Overwrite the Configuration

The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
is embedded when building the Docker image of node problem detector.

However, you can use a [ConfigMap](/docs/user-guide/configmap/) to overwrite it
with the following steps:

* **Step 1:** Change the config files in `config/`.
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap
node-problem-detector-config --from-file=config/`.
* **Step 3:** Change `node-problem-detector.yaml` to use the ConfigMap:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config
```
* **Step 4:** Re-create node problem detector with the new yaml file:

```shell
kubectl delete -f node-problem-detector.yaml # If you have a node-problem-detector running
kubectl create -f node-problem-detector.yaml
```

***Notice that this approach only applies to node problem detector started
with `kubectl`.***

For node problem detector running as a cluster addon, overwriting the
configuration is not supported yet, because the addon manager doesn't support
ConfigMap.
## Kernel Monitor

*Kernel Monitor* is a problem daemon in node problem detector. It monitors the
kernel log and detects known kernel issues following predefined rules.

The Kernel Monitor matches kernel issues against a predefined rule list in
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
The rule list is extensible; you can always extend it by [overwriting the
configuration](/docs/admin/node-problem/#overwrite-the-configuration).
### Add New NodeConditions

To support new node conditions, you can extend the `conditions` field in
`config/kernel-monitor.json` with a new condition definition:

```json
{
  "type": "NodeConditionType",
  "reason": "CamelCaseDefaultNodeConditionReason",
  "message": "arbitrary default node condition message"
}
```
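Before rebuilding the config, it is worth sanity-checking that a new entry is valid JSON with the fields shown in the template above. A minimal Python sketch (the `NTPProblem` condition is a hypothetical example, not part of the default config):

```python
import json

# A hypothetical new condition entry in the shape of the template above.
new_condition_json = """
{
  "type": "NTPProblem",
  "reason": "NTPIsUp",
  "message": "ntp service is up"
}
"""

condition = json.loads(new_condition_json)

# Check it carries exactly the fields the template above shows.
assert set(condition) == {"type", "reason", "message"}
print(condition["type"])  # → NTPProblem
```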

### Detect New Problems

To detect new problems, you can extend the `rules` field in
`config/kernel-monitor.json` with a new rule definition:

```json
{
  "type": "temporary/permanent",
  "condition": "NodeConditionOfPermanentIssue",
  "reason": "CamelCaseShortReason",
  "message": "regexp matching the issue in the kernel log"
}
```
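The `message` field is a regular expression matched against kernel log lines. The sketch below shows how such a rule would fire; the rule and log line are hypothetical examples modeled on the defaults, and `re.search` is a Python stand-in for the matching Kernel Monitor does in Go:

```python
import re

# A hypothetical permanent-issue rule in the shape shown above.
rule = {
    "type": "permanent",
    "condition": "KernelDeadlock",
    "reason": "TaskHung",
    "message": r"task \S+:\w+ blocked for more than \w+ seconds\.",
}

# A sample kernel log line this rule should fire on.
log_line = "task nfsd:12345 blocked for more than 120 seconds."

# Kernel Monitor applies each rule's regexp to new kernel log lines;
# when it matches, the rule's condition/reason/message are reported.
fired = re.search(rule["message"], log_line) is not None
print(fired)  # → True
```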

### Change Log Path

The kernel log may be located at different paths in different OS distros. The
`log` field in `config/kernel-monitor.json` is the log path inside the
container. You can always configure it to match your OS distro.
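Keep in mind that the DaemonSet above mounts host `/var/log/` at `/log` inside the container, so the `log` field must be the in-container path. A small Python sketch of that mapping (the `/var/log/syslog` location is illustrative; check where your distro writes the kernel log):

```python
import posixpath

# The hostPath mount from the DaemonSet spec: host /var/log/ appears as /log.
host_mount, container_mount = "/var/log/", "/log"

# Hypothetical host-side kernel log location on some distro.
host_kernel_log = "/var/log/syslog"

# The value to put in the "log" field of config/kernel-monitor.json.
container_log = posixpath.join(
    container_mount, posixpath.relpath(host_kernel_log, host_mount)
)
print(container_log)  # → /log/syslog
```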

### Support Other Log Format

Kernel monitor uses the [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
plugin to translate the kernel log into its internal data structure. It is
easy to implement a new translator for a new log format.
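To illustrate the idea: a translator takes a raw log line and extracts the timestamp and message that Kernel Monitor matches rules against. The real interface is Go; the function name, the syslog-style format, and the sample line below are all an illustrative Python sketch, not the actual plugin API:

```python
import re
from datetime import datetime

# Sketch of a translator for a syslog-style kernel log line, e.g.
# "Jun  1 09:00:01 host kernel: [ 123.456] <message>".
LINE_RE = re.compile(
    r"^(?P<month>\w+)\s+(?P<day>\d+)\s+(?P<time>[\d:]+)"
    r"\s+\S+\s+kernel:\s+\[\s*[\d.]+\]\s*(?P<message>.*)$"
)

def translate(line, year=2016):
    """Translate one log line into (timestamp, message), or None if unparsable."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Syslog timestamps omit the year, so it must be supplied.
    ts = datetime.strptime(
        "%d %s %s %s" % (year, m.group("month"), m.group("day"), m.group("time")),
        "%Y %b %d %H:%M:%S",
    )
    return ts, m.group("message")

line = "Jun  1 09:00:01 host kernel: [ 123.456] BUG: unable to handle kernel NULL pointer dereference"
print(translate(line))
```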

## Caveats

It is recommended to run node problem detector in your cluster to monitor node
health. However, you should be aware that this will introduce extra resource
overhead on each node. Usually this is fine, because:

* The kernel log grows relatively slowly.
* A resource limit is set for node problem detector.
* Even under high load, the resource usage is acceptable
(see the [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629)).