Add document for node problem detector.

commit c65d7400fd (parent ce6bf3d9c2)
```diff
@@ -265,3 +265,5 @@ toc:
     path: /docs/admin/garbage-collection/
   - title: Configuring Kubernetes with Salt
     path: /docs/admin/salt/
+  - title: Monitoring Node Health
+    path: /docs/admin/node-problem/
```

@@ -0,0 +1,245 @@
---
---

* TOC
{:toc}
## Node Problem Detector

*Node problem detector* is a [DaemonSet](/docs/admin/daemons/) that monitors
node health. It collects node problems from various daemons and reports them
to the apiserver as [NodeCondition](/docs/admin/node/#node-condition) and
[Event](/docs/api-reference/v1/definitions/#_v1_event) objects.

It currently supports detection of some known kernel issues, and will detect
more node problems over time.

For now, Kubernetes won't take any action on the node conditions and events
generated by node problem detector. In the future, a remedy system could be
introduced to deal with node problems.

See more information
[here](https://github.com/kubernetes/node-problem-detector).
## Limitations

* The kernel issue detection of node problem detector only supports file-based
kernel logs now. It doesn't support logging tools like journald.

* The kernel issue detection of node problem detector makes assumptions about
the kernel log format, so for now it only works on Ubuntu and Debian. However,
it is easy to extend it to
[support other log formats](/docs/admin/node-problem/#support-other-log-format).
## Enable/Disable in GCE cluster

Node problem detector runs as a cluster
[addon](/docs/admin/cluster-large/#addon-resources) that is enabled by default
in GCE clusters.

You can enable/disable it by setting the environment variable
`KUBE_ENABLE_NODE_PROBLEM_DETECTOR` before running `kube-up.sh`.
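For example, the toggle can be set from a wrapper script. A minimal Python sketch (the `cluster/kube-up.sh` path assumes you run it from the root of a Kubernetes checkout; the actual cluster bring-up line is commented out):

```python
import os

# Sketch: prepare an environment that brings up a GCE cluster with node
# problem detector disabled. KUBE_ENABLE_NODE_PROBLEM_DETECTOR is the real
# toggle; everything else here is illustrative.
env = dict(os.environ, KUBE_ENABLE_NODE_PROBLEM_DETECTOR="false")

# Uncomment to actually bring up the cluster (requires a Kubernetes checkout):
# import subprocess
# subprocess.run(["cluster/kube-up.sh"], env=env, check=True)

print(env["KUBE_ENABLE_NODE_PROBLEM_DETECTOR"])  # → false
```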

## Use in Other Environment

To enable node problem detector in environments outside of GCE, you can use
either `kubectl` or an addon pod.

### Kubectl

This is the recommended way to start node problem detector outside of GCE. It
provides more flexible management, such as overwriting the default
configuration to fit your environment, or detecting customized node problems.

* **Step 1:** Create `node-problem-detector.yaml`:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
```
***Notice that you should make sure the system log directory is right for your
OS distro.***

* **Step 2:** Start node problem detector with `kubectl`:

```shell
kubectl create -f node-problem-detector.yaml
```
### Addon Pod

This is for those who have their own cluster bootstrap solution and don't need
to overwrite the default configuration. They can leverage the addon pod to
further automate the deployment.

Just create `node-problem-detector.yaml`, and put it under the addon pods
directory `/etc/kubernetes/addons/node-problem-detector` on the master node.

## Overwrite the Configuration

The [default configuration](https://github.com/kubernetes/node-problem-detector/tree/v0.1/config)
is embedded when building the Docker image of node problem detector.

However, you can use a [ConfigMap](/docs/user-guide/configmap/) to overwrite it
with the following steps:

* **Step 1:** Change the config files in `config/`.
* **Step 2:** Create the ConfigMap `node-problem-detector-config` with `kubectl create configmap
node-problem-detector-config --from-file=config/`.
* **Step 3:** Change `node-problem-detector.yaml` to use the ConfigMap:
```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: gcr.io/google_containers/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config
```
* **Step 4:** Re-create node problem detector with the new yaml file:

```shell
kubectl delete -f node-problem-detector.yaml # If you have a node-problem-detector running
kubectl create -f node-problem-detector.yaml
```

***Notice that this approach only applies to node problem detector started
with `kubectl`.***

For node problem detector running as a cluster addon, overwriting the
configuration is not supported yet, because the addon manager doesn't support
ConfigMap.
## Kernel Monitor

*Kernel Monitor* is a problem daemon in node problem detector. It monitors the
kernel log and detects known kernel issues following predefined rules.

The Kernel Monitor matches kernel issues against a predefined rule list in
[`config/kernel-monitor.json`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/config/kernel-monitor.json).
The rule list is extensible; you can always extend it by [overwriting the
configuration](/docs/admin/node-problem/#overwrite-the-configuration).
### Add New NodeConditions

To support new node conditions, you can extend the `conditions` field in
`config/kernel-monitor.json` with a new condition definition:

```json
{
  "type": "NodeConditionType",
  "reason": "CamelCaseDefaultNodeConditionReason",
  "message": "arbitrary default node condition message"
}
```
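Before rebuilding the config, it is worth sanity-checking that a new entry is valid JSON with the fields shown in the template above. A minimal Python sketch (the `NTPProblem` condition is a hypothetical example, not part of the default config):

```python
import json

# A hypothetical new condition entry in the shape of the template above.
new_condition_json = """
{
  "type": "NTPProblem",
  "reason": "NTPIsUp",
  "message": "ntp service is up"
}
"""

condition = json.loads(new_condition_json)

# Check it carries exactly the fields the template above shows.
assert set(condition) == {"type", "reason", "message"}
print(condition["type"])  # → NTPProblem
```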

### Detect New Problems

To detect new problems, you can extend the `rules` field in
`config/kernel-monitor.json` with a new rule definition:

```json
{
  "type": "temporary/permanent",
  "condition": "NodeConditionOfPermanentIssue",
  "reason": "CamelCaseShortReason",
  "message": "regexp matching the issue in the kernel log"
}
```
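The `message` field is a regular expression matched against kernel log lines. The sketch below shows how such a rule would fire; the rule and log line are hypothetical examples modeled on the defaults, and `re.search` is a Python stand-in for the matching Kernel Monitor does in Go:

```python
import re

# A hypothetical permanent-issue rule in the shape shown above.
rule = {
    "type": "permanent",
    "condition": "KernelDeadlock",
    "reason": "TaskHung",
    "message": r"task \S+:\w+ blocked for more than \w+ seconds\.",
}

# A sample kernel log line this rule should fire on.
log_line = "task nfsd:12345 blocked for more than 120 seconds."

# Kernel Monitor applies each rule's regexp to new kernel log lines;
# when it matches, the rule's condition/reason/message are reported.
fired = re.search(rule["message"], log_line) is not None
print(fired)  # → True
```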

### Change Log Path

The kernel log may be located at different paths in different OS distros. The
`log` field in `config/kernel-monitor.json` is the log path inside the
container. You can always configure it to match your OS distro.
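Keep in mind that the DaemonSet above mounts host `/var/log/` at `/log` inside the container, so the `log` field must be the in-container path. A small Python sketch of that mapping (the `/var/log/syslog` location is illustrative; check where your distro writes the kernel log):

```python
import posixpath

# The hostPath mount from the DaemonSet spec: host /var/log/ appears as /log.
host_mount, container_mount = "/var/log/", "/log"

# Hypothetical host-side kernel log location on some distro.
host_kernel_log = "/var/log/syslog"

# The value to put in the "log" field of config/kernel-monitor.json.
container_log = posixpath.join(
    container_mount, posixpath.relpath(host_kernel_log, host_mount)
)
print(container_log)  # → /log/syslog
```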

### Support Other Log Format

Kernel monitor uses the [`Translator`](https://github.com/kubernetes/node-problem-detector/blob/v0.1/pkg/kernelmonitor/translator/translator.go)
plugin to translate the kernel log into its internal data structure. It is
easy to implement a new translator for a new log format.
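To illustrate the idea: a translator takes a raw log line and extracts the timestamp and message that Kernel Monitor matches rules against. The real interface is Go; the function name, the syslog-style format, and the sample line below are all an illustrative Python sketch, not the actual plugin API:

```python
import re
from datetime import datetime

# Sketch of a translator for a syslog-style kernel log line, e.g.
# "Jun  1 09:00:01 host kernel: [ 123.456] <message>".
LINE_RE = re.compile(
    r"^(?P<month>\w+)\s+(?P<day>\d+)\s+(?P<time>[\d:]+)"
    r"\s+\S+\s+kernel:\s+\[\s*[\d.]+\]\s*(?P<message>.*)$"
)

def translate(line, year=2016):
    """Translate one log line into (timestamp, message), or None if unparsable."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    # Syslog timestamps omit the year, so it must be supplied.
    ts = datetime.strptime(
        "%d %s %s %s" % (year, m.group("month"), m.group("day"), m.group("time")),
        "%Y %b %d %H:%M:%S",
    )
    return ts, m.group("message")

line = "Jun  1 09:00:01 host kernel: [ 123.456] BUG: unable to handle kernel NULL pointer dereference"
print(translate(line))
```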

## Caveats

It is recommended to run node problem detector in your cluster to monitor node
health. However, you should be aware that this will introduce extra resource
overhead on each node. Usually this is fine, because:

* The kernel log grows relatively slowly.
* A resource limit is set for node problem detector.
* Even under high load, the resource usage is acceptable
(see the [benchmark result](https://github.com/kubernetes/node-problem-detector/issues/2#issuecomment-220255629)).