How to restore a cluster from an unhealthy state.
When nodes are restarted, either manually or because of an external event such as a power outage, they may fail to return to Ready status. This is typically caused by a TLS certificate problem affecting RKE2, which prevents the restarted node from re-establishing its connection with the cluster.
To fix the issue:
- Open Grafana and monitor the node statuses for a few minutes.
- If a node is in NotReady status, reboot the node. Important: Only restart one node at a time.
- Continue monitoring Grafana for at least 10 minutes. If any node (including the one that was just rebooted) remains in NotReady status during that time, restart the affected nodes one at a time.
- Once the node enters Ready status, wait for all affected pods to transition to Running status.
- After confirming system stability, resume using HCI.
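
The node and pod checks in the steps above can also be cross-checked from the command line. The following is a minimal sketch, not part of the documented procedure. It assumes `kubectl` is installed and configured with access to the cluster, which this article does not state. The function names (`not_ready_nodes`, `pods_not_running`, `kubectl_json`) are hypothetical, and treating pods in the Succeeded phase as healthy is also an assumption.

```python
import json
import subprocess


def not_ready_nodes(node_list: dict) -> list[str]:
    """Return the names of nodes whose Ready condition is not "True"."""
    names = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition, "False", or "Unknown" is treated as NotReady.
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


def pods_not_running(pod_list: dict) -> list[str]:
    """Return "namespace/name" for pods that are not Running.

    Assumption: pods in the Succeeded phase (completed jobs) are not reported.
    """
    pending = []
    for pod in pod_list.get("items", []):
        phase = pod.get("status", {}).get("phase")
        if phase not in ("Running", "Succeeded"):
            meta = pod["metadata"]
            pending.append(f"{meta['namespace']}/{meta['name']}")
    return pending


def kubectl_json(*args: str) -> dict:
    """Run kubectl with JSON output and return the parsed result."""
    result = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    nodes = not_ready_nodes(kubectl_json("get", "nodes"))
    if nodes:
        # Reboot only one of these at a time, then check again.
        print("NotReady nodes:", ", ".join(nodes))
    else:
        pods = pods_not_running(kubectl_json("get", "pods", "--all-namespaces"))
        print("Pods not yet Running:", ", ".join(pods) if pods else "none")
```

Running the script repeatedly during the 10-minute monitoring window gives a view comparable to the Grafana dashboard. It only reports status and does not reboot anything.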