---
assignees:
- caesarxuchao
- dchen1107
- lavalamp

---

* TOC
{:toc}

## What is a node?

A `node` is a worker machine in Kubernetes, previously known as a `minion`. A node
may be a VM or physical machine, depending on the cluster. Each node has
the services necessary to run [pods](/docs/user-guide/pods) and is managed by the master
components. The services on a node include Docker, kubelet and kube-proxy. See
[The Kubernetes Node](https://github.com/kubernetes/kubernetes/blob/{{page.githubbranch}}/docs/design/architecture.md#the-kubernetes-node) section in the
architecture design doc for more details.

## Node Status

A node's status comprises the following information.

### Addresses

The usage of these fields varies depending on your cloud provider or bare metal configuration.

* HostName: The hostname as reported by the node's kernel. Can be overridden via the kubelet `--hostname-override` parameter.
* ExternalIP: Typically the IP address of the node that is externally routable (available from outside the cluster).
* InternalIP: Typically the IP address of the node that is routable only within the cluster.

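For example, you can print the addresses reported for a node with `kubectl` (here `$NODENAME` stands for the name of one of your nodes):

```shell
# Print the addresses block of a node's status.
kubectl get node $NODENAME -o jsonpath='{.status.addresses}'
```
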
### Phase

Deprecated: node phase is no longer used.

### Condition

The `conditions` field describes the status of all `Running` nodes.

| Node Condition | Description |
|----------------|-------------|
| `OutOfDisk`    | `True` if there is insufficient free space on the node for adding new pods, otherwise `False` |
| `Ready`        | `True` if the node is healthy and ready to accept pods, `False` if the node is not healthy and is not accepting pods, and `Unknown` if the node controller has not heard from the node in the last 40 seconds |

The node condition is represented as a JSON object. For example, the following response describes a healthy node.

```json
"conditions": [
  {
    "type": "Ready",
    "status": "True"
  }
]
```

If the Status of the Ready condition is Unknown or False for more than five
minutes, then all of the pods on the node are terminated by the node
controller. (The timeout length is configurable by the `--pod-eviction-timeout`
parameter on the controller manager.)

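You can view a node's conditions, along with the rest of its status, in human-readable form with:

```shell
# Describe a node; the output includes its Conditions table.
kubectl describe node $NODENAME
```
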
### Capacity

Describes the resources available on the node: CPU, memory and the maximum
number of pods that can be scheduled onto the node.

### Info

General information about the node, such as kernel version, Kubernetes version
(kubelet and kube-proxy version), Docker version (if used), OS name.
The information is gathered by Kubelet from the node.

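Both fields can be read straight from the node object; for example (`$NODENAME` is a placeholder):

```shell
# Print the reported capacity and the general node info.
kubectl get node $NODENAME -o jsonpath='{.status.capacity}'
kubectl get node $NODENAME -o jsonpath='{.status.nodeInfo}'
```
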
## Management

Unlike [pods](/docs/user-guide/pods) and [services](/docs/user-guide/services),
a node is not inherently created by Kubernetes: it is created externally by cloud
providers like Google Compute Engine, or exists in your pool of physical or virtual
machines. What this means is that when Kubernetes creates a node, it is really
just creating an object that represents the node. After creation, Kubernetes
will check whether the node is valid or not. For example, if you try to create
a node from the following content:

```json
{
  "kind": "Node",
  "apiVersion": "v1",
  "metadata": {
    "name": "10.240.79.157",
    "labels": {
      "name": "my-first-k8s-node"
    }
  }
}
```

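As a sketch, assuming the JSON above were saved to a file named `node.json` (a hypothetical filename), you could submit it with:

```shell
# Create the node object from the manifest; Kubernetes then health checks it.
kubectl create -f node.json
```
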
Kubernetes will create a node object internally (the representation), and
validate the node by health checking based on the `metadata.name` field (we
assume `metadata.name` can be resolved). If the node is valid, i.e. all necessary
services are running, it is eligible to run a pod; otherwise, it will be
ignored for any cluster activity until it becomes valid. Note that Kubernetes
will keep the object for the invalid node unless it is explicitly deleted by
the client, and it will keep checking to see if it becomes valid.

Currently, there are three components that interact with the Kubernetes node
interface: node controller, kubelet, and kubectl.

### Node Controller

The node controller is a Kubernetes master component which manages various
aspects of nodes.

The node controller has multiple roles in a node's life. The first is assigning a
CIDR block to the node when it is registered (if CIDR assignment is turned on).

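A minimal sketch of turning CIDR assignment on, assuming you run the controller manager yourself and using `10.244.0.0/16` purely as an example cluster CIDR (other required flags omitted):

```shell
# Enable per-node CIDR assignment on the controller manager.
kube-controller-manager \
  --allocate-node-cidrs=true \
  --cluster-cidr=10.244.0.0/16
```
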
The second is keeping the node controller's internal list of nodes up to date with
the cloud provider's list of available machines. When running in a cloud
environment, whenever a node is unhealthy the node controller asks the cloud
provider if the VM for that node is still available. If not, the node
controller deletes the node from its list of nodes.

The third is monitoring the nodes' health. The node controller is
responsible for updating the NodeReady condition of NodeStatus to
ConditionUnknown when a node becomes unreachable (i.e. the node controller stops
receiving heartbeats for some reason, e.g. due to the node being down), and then later evicting
all the pods from the node (using graceful termination) if the node continues
to be unreachable. (The default timeouts are 40s to start reporting
ConditionUnknown and 5m after that to start evicting pods.) The node controller
checks the state of each node every `--node-monitor-period` seconds.

In Kubernetes 1.4, we updated the logic of the node controller to better handle
cases when a large number of nodes have problems reaching the master
(e.g. because the master has a networking problem). Starting with 1.4, the node
controller will look at the state of all nodes in the cluster when making a
decision about pod eviction.

In most cases, the node controller limits the eviction rate to
`--node-eviction-rate` (default 0.1) per second, meaning it won't evict pods
from more than 1 node per 10 seconds.

The node eviction behavior changes when a node in a given availability zone
becomes unhealthy. The node controller checks what percentage of nodes in the zone
are unhealthy (NodeReady condition is ConditionUnknown or ConditionFalse) at
the same time. If the fraction of unhealthy nodes is at least
`--unhealthy-zone-threshold` (default 0.55) then the eviction rate is reduced:
if the cluster is small (i.e. has `--large-cluster-size-threshold` nodes or
fewer - default 50) then evictions are stopped, otherwise the eviction rate is
reduced to `--secondary-node-eviction-rate` (default 0.01) per second. The
reason these policies are implemented per availability zone is because one
availability zone might become partitioned from the master while the others
remain connected. If your cluster does not span multiple cloud provider
availability zones, then there is only one availability zone (the whole
cluster).

A key reason for spreading your nodes across availability zones is so that the
workload can be shifted to healthy zones when one entire zone goes down.
Therefore, if all nodes in a zone are unhealthy then the node controller evicts
at the normal rate `--node-eviction-rate`. The corner case is when all zones
are completely unhealthy (i.e. there are no healthy nodes in the cluster). In
that case, the node controller assumes that there's some problem with master
connectivity and stops all evictions until some connectivity is restored.

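The timeouts and rates above correspond to controller manager flags. A hedged sketch that simply spells out the defaults (other required flags omitted):

```shell
# Defaults written out explicitly; tune with care.
kube-controller-manager \
  --node-monitor-period=5s \
  --pod-eviction-timeout=5m0s \
  --node-eviction-rate=0.1 \
  --secondary-node-eviction-rate=0.01 \
  --unhealthy-zone-threshold=0.55 \
  --large-cluster-size-threshold=50
```
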
### Self-Registration of Nodes

When the kubelet flag `--register-node` is true (the default), the kubelet will attempt to
register itself with the API server.  This is the preferred pattern, used by most distros.

For self-registration, the kubelet is started with the following options:

  - `--api-servers=` - Location of the apiservers.
  - `--kubeconfig=` - Path to credentials to authenticate itself to the apiserver.
  - `--cloud-provider=` - How to talk to a cloud provider to read metadata about itself.
  - `--register-node` - Automatically register with the API server.

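A hedged sketch of what such an invocation might look like; the apiserver address, kubeconfig path and cloud provider below are placeholders, not recommendations:

```shell
# Placeholder values throughout; omit --cloud-provider on bare metal.
kubelet \
  --api-servers=https://10.240.0.2:6443 \
  --kubeconfig=/var/lib/kubelet/kubeconfig \
  --cloud-provider=gce \
  --register-node=true
```
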
Currently, any kubelet is authorized to create/modify any node resource, but in practice it only creates/modifies
its own. (In the future, we plan to only allow a kubelet to modify its own node resource.)

#### Manual Node Administration

A cluster administrator can create and modify node objects.

If the administrator wishes to create node objects manually, set the kubelet flag
`--register-node=false`.

The administrator can modify node resources (regardless of the setting of `--register-node`).
Modifications include setting labels on the node and marking it unschedulable.

Labels on nodes can be used in conjunction with node selectors on pods to control scheduling,
e.g. to constrain a pod to only be eligible to run on a subset of the nodes.

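For example, to attach a hypothetical `disktype=ssd` label that a pod's node selector could then match:

```shell
# Label a node; pods with a matching nodeSelector become eligible only for such nodes.
kubectl label nodes $NODENAME disktype=ssd
```
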
Marking a node as unschedulable will prevent new pods from being scheduled to that
node, but will not affect any existing pods on the node. This is useful as a
preparatory step before a node reboot, etc. For example, to mark a node
unschedulable, run this command:

```shell
kubectl cordon $NODENAME
```

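Once maintenance is finished, the node can be made schedulable again with the companion command:

```shell
kubectl uncordon $NODENAME
```
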
Note that pods which are created by a DaemonSet controller bypass the Kubernetes scheduler,
and do not respect the unschedulable attribute on a node.  The assumption is that daemons belong on
the machine even if it is being drained of applications in preparation for a reboot.

### Node capacity

The capacity of the node (number of CPUs and amount of memory) is part of the node object.
Normally, nodes register themselves and report their capacity when creating the node object. If
you are doing [manual node administration](#manual-node-administration), then you need to set node
capacity when adding a node.

The Kubernetes scheduler ensures that there are enough resources for all the pods on a node.  It
checks that the sum of the limits of containers on the node is no greater than the node capacity.  This
check includes all containers started by the kubelet, but not containers started directly by Docker,
nor processes not running in containers.

If you want to explicitly reserve resources for non-pod processes, you can create a placeholder
pod. Use the following template:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-reserver
spec:
  containers:
  - name: sleep-forever
    image: gcr.io/google_containers/pause:0.8.0
    resources:
      limits:
        cpu: 100m
        memory: 100Mi
```

Set the `cpu` and `memory` values to the amount of resources you want to reserve.
Place the file in the manifest directory (`--config=DIR` flag of kubelet).  Do this
on each kubelet where you want to reserve resources.

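As a sketch, assuming the template above is saved as `resource-reserver.yaml` and the kubelet's manifest directory is `/etc/kubernetes/manifests` (both are assumptions; check the kubelet's `--config` setting on your nodes):

```shell
# Copy the placeholder pod into the kubelet's manifest directory on each node.
cp resource-reserver.yaml /etc/kubernetes/manifests/
```
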

## API Object

Node is a top-level resource in the Kubernetes REST API. More details about the
API object can be found at: [Node API
object](/docs/api-reference/v1/definitions/#_v1_node).
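
To see the full API object for one of your nodes, for example:

```shell
# Dump a node object as stored in the API server.
kubectl get node $NODENAME -o yaml
```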