Clean up troubleshooting-kubeadm.md

Michael 2023-11-27 21:13:14 +08:00
parent ce359cb7b0
commit b231bcf8dd
1 changed file with 83 additions and 34 deletions


@@ -73,7 +73,8 @@ If you see the following warnings while running `kubeadm init`

[preflight] WARNING: ethtool not found in system path
```

Then you may be missing `ebtables`, `ethtool` or a similar executable on your node.
You can install them with the following commands:

- For Ubuntu/Debian users, run `apt install ebtables ethtool`.
- For CentOS/Fedora users, run `yum install ebtables ethtool`.

@@ -90,9 +91,9 @@ This may be caused by a number of problems. The most common are:

- network connection problems. Check that your machine has full network connectivity before continuing.
- the cgroup driver of the container runtime differs from that of the kubelet. To understand how to
  configure it properly, see [Configuring a cgroup driver](/docs/tasks/administer-cluster/kubeadm/configure-cgroup-driver/).
- control plane containers are crashlooping or hanging. You can check this by running `docker ps`
  and investigating each container by running `docker logs`. For other container runtimes, see
  [Debugging Kubernetes nodes with crictl](/docs/tasks/debug/debug-cluster/crictl/).
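
  For example, with a containerd-based setup you could inspect the control plane containers
  like this (a minimal sketch; the runtime endpoint below is an assumption for containerd):

  ```sh
  # List all containers, including exited ones, and locate the kube-apiserver container
  crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube-apiserver
  # Show the logs of a specific container by its ID
  crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs <container-id>
  ```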

## kubeadm blocks when removing managed containers

@@ -144,10 +145,12 @@ provider. Please contact the author of the Pod Network add-on to find out whethe

Calico, Canal, and Flannel CNI providers are verified to support HostPort.

For more information, see the
[CNI portmap documentation](https://github.com/containernetworking/plugins/blob/master/plugins/meta/portmap/README.md).

If your network provider does not support the portmap CNI plugin, you may need to use the
[NodePort feature of services](/docs/concepts/services-networking/service/#type-nodeport)
or use `HostNetwork=true`.

## Pods are not accessible via their Service IP

@@ -157,9 +160,10 @@ services](/docs/concepts/services-networking/service/#type-nodeport) or use `Hos

add-on provider to get the latest status of their support for hairpin mode.

- If you are using VirtualBox (directly or via Vagrant), you will need to
  ensure that `hostname -i` returns a routable IP address. By default, the first
  interface is connected to a non-routable host-only network. A workaround
  is to modify `/etc/hosts`; see this
  [Vagrantfile](https://github.com/errordeveloper/k8s-playground/blob/22dd39dfc06111235620e6c4404a96ae146f26fd/Vagrantfile#L11)
  for an example.
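
  As a rough illustration (the IP address below is an assumption; use whichever address
  VirtualBox assigned to your routable interface):

  ```sh
  # Remove any existing loopback mapping for this hostname, then map it to the routable interface
  sudo sed -i "/$(hostname)/d" /etc/hosts
  echo "192.168.56.10 $(hostname)" | sudo tee -a /etc/hosts
  hostname -i   # should now print the routable address
  ```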

## TLS certificate errors

@@ -175,6 +179,7 @@ Unable to connect to the server: x509: certificate signed by unknown authority (

  regenerate a certificate if necessary. The certificates in a kubeconfig file
  are base64 encoded. The `base64 --decode` command can be used to decode the certificate
  and `openssl x509 -text -noout` can be used for viewing the certificate information.
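
  For instance, you could pull the embedded client certificate out of the admin kubeconfig
  and inspect it like this (a minimal sketch; it assumes the certificate is embedded as
  `client-certificate-data` rather than referenced by a file path):

  ```sh
  # Decode the embedded client certificate and print its details
  grep 'client-certificate-data' /etc/kubernetes/admin.conf | awk '{print $2}' | \
    base64 --decode | openssl x509 -text -noout
  ```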

- Unset the `KUBECONFIG` environment variable using:

  ```sh

@@ -190,7 +195,7 @@ Unable to connect to the server: x509: certificate signed by unknown authority (

- Another workaround is to overwrite the existing `kubeconfig` for the "admin" user:

  ```sh
  mv $HOME/.kube $HOME/.kube.bak
  mkdir $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

@@ -198,7 +203,8 @@ Unable to connect to the server: x509: certificate signed by unknown authority (

## Kubelet client certificate rotation fails {#kubelet-client-cert}

By default, kubeadm configures a kubelet with automatic rotation of client certificates by using the
`/var/lib/kubelet/pki/kubelet-client-current.pem` symlink specified in `/etc/kubernetes/kubelet.conf`.
If this rotation process fails, you might see errors such as `x509: certificate has expired or is not yet valid`
in kube-apiserver logs. To fix the issue, you must follow these steps:
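
Before following the recovery steps, it can help to confirm that the certificate behind the
symlink has actually expired. A quick sketch of that check:

```sh
# Show which certificate file the symlink currently points to
sudo ls -l /var/lib/kubelet/pki/kubelet-client-current.pem
# Print the expiry date of the client certificate stored in that file
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate
```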

@@ -231,24 +237,34 @@ The following error might indicate that something was wrong in the pod network:

Error from server (NotFound): the server could not find the requested resource
```

- If you're using flannel as the pod network inside Vagrant, then you will have to
  specify the default interface name for flannel.

  Vagrant typically assigns two interfaces to all VMs. The first, for which all hosts
  are assigned the IP address `10.0.2.15`, is for external traffic that gets NATed.

  This may lead to problems with flannel, which defaults to the first interface on a host.
  This leads to all hosts thinking they have the same public IP address. To prevent this,
  pass the `--iface eth1` flag to flannel so that the second interface is chosen.
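
  As a hedged sketch (the namespace and DaemonSet name below are assumptions that vary
  between flannel versions), you can add the flag by editing the flannel DaemonSet and
  appending `--iface=eth1` to the `kube-flannel` container's `args`:

  ```sh
  # Open the flannel DaemonSet for editing; add "- --iface=eth1" to the container args,
  # then let the DaemonSet roll the change out to all nodes.
  kubectl -n kube-flannel edit daemonset kube-flannel-ds
  ```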

## Non-public IP used for containers

In some situations `kubectl logs` and `kubectl run` commands may return with the
following errors in an otherwise functional cluster:

```console
Error from server: Get https://10.19.0.41:10250/containerLogs/default/mysql-ddc65b868-glc5m/mysql: dial tcp 10.19.0.41:10250: getsockopt: no route to host
```

- This may be due to Kubernetes using an IP that cannot communicate with other IPs on
  the seemingly same subnet, possibly by policy of the machine provider.
- DigitalOcean assigns a public IP to `eth0` as well as a private one to be used internally
  as anchor for their floating IP feature, yet `kubelet` will pick the latter as the node's
  `InternalIP` instead of the public one.

Use `ip addr show` to check for this scenario instead of `ifconfig` because `ifconfig` will
not display the offending alias IP address. Alternatively, an API endpoint specific to
DigitalOcean allows you to query for the anchor IP from the droplet:

```sh
curl http://169.254.169.254/metadata/v1/interfaces/public/0/anchor_ipv4/address

@@ -270,12 +286,13 @@ Error from server: Get https://10.19.0.41:10250/containerLogs/default/mysql-ddc6

## `coredns` pods have `CrashLoopBackOff` or `Error` state

If you have nodes that are running SELinux with an older version of Docker, you might experience a scenario
where the `coredns` pods are not starting. To solve that, you can try one of the following options:

- Upgrade to a [newer version of Docker](/docs/setup/production-environment/container-runtimes/#docker).
- [Disable SELinux](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/security-enhanced_linux/sect-security-enhanced_linux-enabling_and_disabling_selinux-disabling_selinux).
- Modify the `coredns` deployment to set `allowPrivilegeEscalation` to `true`:

```bash
@@ -284,7 +301,8 @@ kubectl -n kube-system get deployment coredns -o yaml | \
  kubectl apply -f -
```

Another cause for CoreDNS to have `CrashLoopBackOff` is when a CoreDNS Pod deployed in Kubernetes detects a loop.
[A number of workarounds](https://github.com/coredns/coredns/tree/master/plugin/loop#troubleshooting-loops-in-kubernetes-clusters)
are available to avoid Kubernetes trying to restart the CoreDNS Pod every time CoreDNS detects the loop and exits.
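
For example, one of the workarounds described on that page is to stop handing CoreDNS a
`resolv.conf` that points back at a local stub resolver. A hedged sketch for a host running
systemd-resolved (the kubelet config path and the presence of a `resolvConf` field are
assumptions about a standard kubeadm install):

```sh
# Point the kubelet at the real upstream resolv.conf instead of the local stub listener,
# then restart the kubelet so newly created CoreDNS Pods inherit the corrected file.
sudo sed -i 's|^resolvConf: .*|resolvConf: /run/systemd/resolve/resolv.conf|' /var/lib/kubelet/config.yaml
sudo systemctl restart kubelet
```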

{{< warning >}}

@@ -300,7 +318,7 @@ If you encounter the following error:

rpc error: code = 2 desc = oci runtime error: exec failed: container_linux.go:247: starting container process caused "process_linux.go:110: decoding init error from pipe caused \"read parent: connection reset by peer\""
```

This issue appears if you run CentOS 7 with Docker 1.13.1.84.
This version of Docker can prevent the kubelet from executing into the etcd container.

To work around the issue, choose one of these options:

@@ -344,6 +362,7 @@ to pick up the node's IP address properly and has knock-on effects to the proxy

load balancers.

The following error can be seen in kube-proxy Pods:

```
server.go:610] Failed to retrieve node IP: host IP unknown; known addresses: []
proxier.go:340] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP

@@ -352,8 +371,26 @@ proxier.go:340] invalid nodeIP, initializing kube-proxy with 127.0.0.1 as nodeIP

A known solution is to patch the kube-proxy DaemonSet to allow scheduling it on control-plane
nodes regardless of their conditions, keeping it off of other nodes until their initial guarding
conditions abate:

```
kubectl -n kube-system patch ds kube-proxy -p='{
  "spec": {
    "template": {
      "spec": {
        "tolerations": [
          {
            "key": "CriticalAddonsOnly",
            "operator": "Exists"
          },
          {
            "effect": "NoSchedule",
            "key": "node-role.kubernetes.io/control-plane"
          }
        ]
      }
    }
  }
}'
```

The tracking issue for this problem is [here](https://github.com/kubernetes/kubeadm/issues/1027).

@@ -365,12 +402,15 @@ For [flex-volume support](https://github.com/kubernetes/community/blob/ab55d85/c

Kubernetes components like the kubelet and kube-controller-manager use the default path of
`/usr/libexec/kubernetes/kubelet-plugins/volume/exec/`, yet the flex-volume directory _must be writeable_
for the feature to work.

{{< note >}}
FlexVolume was deprecated in the Kubernetes v1.23 release.
{{< /note >}}

To work around this issue, you can configure the flex-volume directory using the kubeadm
[configuration file](/docs/reference/config-api/kubeadm-config.v1beta3/).

On the primary control-plane Node (created using `kubeadm init`), pass the following
file using `--config`:

```yaml

@@ -402,7 +442,10 @@ be advised that this is modifying a design principle of the Linux distribution.

## `kubeadm upgrade plan` prints out `context deadline exceeded` error message

This error message is shown when upgrading a Kubernetes cluster with `kubeadm` in
the case of running an external etcd. This is not a critical bug and happens because
older versions of kubeadm perform a version check on the external etcd cluster.
You can proceed with `kubeadm upgrade apply ...`.

This issue is fixed as of version 1.19.
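
For illustration only (the target version below is a placeholder; use the version you are
actually upgrading to):

```sh
# Proceed with the upgrade despite the failed etcd version check reported by `kubeadm upgrade plan`
kubeadm upgrade apply v1.19.3
```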

@@ -422,6 +465,7 @@ can be used insecurely by passing the `--kubelet-insecure-tls` to it. This is no

If you want to use TLS between the metrics-server and the kubelet, there is a problem,
since kubeadm deploys a self-signed serving certificate for the kubelet. This can cause the following errors
on the side of the metrics-server:

```
x509: certificate signed by unknown authority
x509: certificate is valid for IP-foo not IP-bar

@@ -438,6 +482,7 @@ Only applicable to upgrading a control plane node with a kubeadm binary v1.28.3

where the node is currently managed by kubeadm versions v1.28.0, v1.28.1 or v1.28.2.

Here is the error message you may encounter:

```
[upgrade/etcd] Failed to upgrade etcd: couldn't upgrade control plane. kubeadm has tried to recover everything into the earlier state. Errors faced: static Pod hash for component etcd on Node kinder-upgrade-control-plane-1 did not change after 5m0s: timed out waiting for the condition
[upgrade/etcd] Waiting for previous etcd to become available

@@ -454,16 +499,19 @@ k8s.io/kubernetes/cmd/kubeadm/app/phases/upgrade.performEtcdStaticPodUpgrade

...
```

The reason for this failure is that the affected versions generate an etcd manifest file with
unwanted defaults in the PodSpec. This will result in a diff from the manifest comparison,
and kubeadm will expect a change in the Pod hash, but the kubelet will never update the hash.

There are two ways to work around this issue if you see it in your cluster:

- The etcd upgrade can be skipped between the affected versions and v1.28.3 (or later) by using:

  ```shell
  kubeadm upgrade {apply|node} [version] --etcd-upgrade=false
  ```

  This is not recommended in case a new etcd version was introduced by a later v1.28 patch version.

- Before upgrade, patch the manifest for the etcd static pod, to remove the problematic defaulted attributes:

@@ -509,4 +557,5 @@ This is not recommended in case a new etcd version was introduced by a later v1.

path: /etc/kubernetes/pki/etcd
```

More information can be found in the
[tracking issue](https://github.com/kubernetes/kubeadm/issues/2927) for this bug.