Issue:
While attempting to join a secondary master (control-plane) node to the cluster, all nodes became unable to contact the primary master.
The failure was after this line in the init:
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
$ kubectl get pods
The connection to the server masternode:6443 was refused - did you specify the right host or port?
Troubleshooting:
Confirmed port 6443 is refusing connections:
$ curl https://masternode:6443
curl: (7) Failed to connect to masternode port 6443: Connection refused
$ curl https://localhost:6443
curl: (7) Failed to connect to localhost port 6443: Connection refused
$ ping masternode
< successful pings to this node>
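Beyond curl, a quick way to confirm whether anything is bound to the API server port at all is ss (part of iproute2 on most modern Linux distributions). A minimal sketch; exact output columns vary by version:

```shell
# List TCP listeners and look for the API server port (6443).
# An empty result means kube-apiserver is not up at all, which matches
# the "connection refused" from curl above.
listeners=$(ss -tln 2>/dev/null | grep 6443 || true)
echo "${listeners:-nothing listening on 6443}"
```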
Confirm kubelet running:
$ sudo systemctl status kubelet
<running = good>
Confirm swap entry was commented out or removed:
$ sudo vi /etc/fstab
<no swap entry = good>
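Instead of opening an editor, the swap check can be scripted. A minimal sketch; the sample file below is hypothetical and stands in for /etc/fstab on the node:

```shell
# Hypothetical fstab standing in for /etc/fstab; the swap line is commented
# out, which is the state kubeadm wants.
cat > /tmp/fstab.sample <<'EOF'
UUID=abcd-1234 /    ext4 defaults 0 1
#/swapfile     none swap sw       0 0
EOF

# Any uncommented line mentioning swap means swap may come back at next boot.
swap_lines=$(grep -vE '^[[:space:]]*#' /tmp/fstab.sample | grep -w swap || true)
echo "uncommented swap entries: '${swap_lines}'"
```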
If a swap entry was present, comment it out, turn swap off for the current boot, and restart the kubelet service (restart stops and starts the running service; disable/enable only change whether it starts at boot):
$ sudo swapoff -a
$ sudo systemctl restart kubelet
Confirm UFW disabled:
$ sudo ufw status
<inactive = good>
Performed a re-setup / init of the cluster:
$ kubeadm init --ignore-preflight-errors=all
The init failed; the last messages displayed were:
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
error execution phase kubeconfig/admin: a kubeconfig file "/etc/kubernetes/admin.conf" exists already but has got the wrong API Server URL
$ sudo vi /etc/kubernetes/admin.conf
<confirmed the server URL is https://masternode:6443 = looks correct, yet it still doesn't work>
Decided to loop back to the 6443 port:
$ sudo lsof -i
kube-apis 14622 root 3u IPv6 1536092 0t0 TCP *:6443 (LISTEN)
<no IPv4 entry = suspicious at first glance>
Is the issue that it is now only listening on IPv6? Probably not: on Linux, a single IPv6 wildcard socket normally accepts IPv4 connections as well (unless net.ipv6.bindv6only is set), so the lone IPv6 entry is most likely a red herring.
Decided to loop back and look at the raw system log. Found the following:
Nov 07 16:35:09 masternode kubelet[16875]: E1107 16:35:09.398583 16875 kubelet.go:2248] node "masternode" not found
The "node not found" message means the kubelet cannot look up its own Node object, i.e. it cannot reach the API server. At the time we suspected the container runtime instead, so we checked the Docker version. (I saw Docker was running when I looked at lsof earlier.)
$ docker --version
<Docker version 18.09.7, build 2d0083d>
Docker 18.09 was not a validated version for Kubernetes 1.13, but we are running 1.15, and the 1.15 preflight check raised no warning about it - not the issue.
Since this is NOT a production cluster, we gave up on salvaging it. Instead we tried removing the configuration and running init again, clearing one directory of configuration at a time to see which one was blocking the control plane past the 40s time-out.
$ sudo rm -rf /etc/kubernetes/
$ sudo kubeadm init --ignore-preflight-errors=all
(time out at 40s - waited 2 minutes, never succeeds. )
$ sudo rm -rf /etc/kubernetes/
$ sudo rm -rf ~/.kube/
$ sudo kubeadm init --ignore-preflight-errors=all
(time out at 40s - waited 2 minutes, never succeeds. )
$ sudo rm -rf /etc/kubernetes/
$ sudo rm -rf ~/.kube/
$ sudo rm -rf /var/lib/etcd/
$ sudo kubeadm init --ignore-preflight-errors=all
(time out at 40s - waited 2 minutes, never succeeds. )
$ sudo rm -rf /etc/kubernetes/
$ sudo rm -rf ~/.kube/
$ sudo rm -rf /var/lib/etcd/
$ sudo rm -rf /var/lib/kubelet/
$ sudo kubeadm init --ignore-preflight-errors=all
Only after we cleared the /var/lib/kubelet/ directory did the init succeed, with the problematic control-plane step completing in under 5 seconds, as before.
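For reference, kubeadm ships a cleanup command that performs roughly this teardown for you (it stops the kubelet's pods and wipes /etc/kubernetes, /var/lib/kubelet, and, for a local etcd, /var/lib/etcd). A sketch of the tidier path we could have taken, assuming a kubeadm new enough to support reset -f:

```shell
# Tear down everything "kubeadm init" set up on this node, then re-init.
sudo kubeadm reset -f        # -f skips the confirmation prompt
sudo rm -rf ~/.kube          # reset does not touch the user kubeconfig
sudo kubeadm init --ignore-preflight-errors=all
```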
Note:
If the kubelet had still been running properly, we would not have been able to clear the /var/lib/kubelet/ directory, because even after the service shuts down, pod volume files can remain locked. In other words, the kubernetes.io~secret volumes for the calico-node token and the kube-proxy token would still be in use.
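Such locks usually come from lingering tmpfs mounts for pod volumes under /var/lib/kubelet/pods/. A hedged sketch for checking, assuming a Linux host:

```shell
# Pod secret/token volumes are tmpfs mounts under /var/lib/kubelet/pods/...;
# if any remain mounted, removing /var/lib/kubelet will fail or misbehave.
kubelet_mounts=$(mount | grep /var/lib/kubelet || true)
echo "${kubelet_mounts:-no kubelet volume mounts remain}"
# To clear them before deleting the directory (hypothetical one-liner):
# sudo umount $(mount | grep /var/lib/kubelet | awk '{print $3}')
```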