Adventures in Kubernetes: When Your Node Changes Data Centers
Today we faced an interesting challenge after migrating our prod26-k8s node to a new data center using snapshots. What seemed like a straightforward move turned into a fascinating journey through Kubernetes tolerations, cert-manager, and ingress configurations.
The Initial State
Our setup was relatively standard:
- Single-node Kubernetes cluster running on prod26-k8s
- Mercury bot deployment with nginx-ingress
- Let's Encrypt certificates managed by cert-manager
- All workloads running on the control-plane node
The Challenge
After restoring the node from a snapshot in the new DC, everything seemed fine at first glance. However, we quickly ran into several issues:
- Ingress Controller Pod: The nginx-ingress controller pod wouldn't schedule because it needed tolerations for the control-plane taint.
- Certificate Management: cert-manager's HTTP01 solver pods were stuck in Pending state.
- ACME Challenges: Let's Encrypt validation was failing with 503 errors.
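
To make the first issue concrete, here is a minimal sketch of the kind of toleration a pod needs before it can be scheduled on a tainted control-plane node. It assumes the node carries the standard kubeadm-style taint; the exact taint on prod26-k8s and the manifest the toleration would go into are not shown in this post, so treat the placement as illustrative rather than as the exact change we made.

```yaml
# Pod-spec fragment (sketch): belongs under spec.template.spec of the
# ingress controller's Deployment or DaemonSet.
tolerations:
  # Taint key applied to control-plane nodes on recent Kubernetes versions
  # (assumed here; check the node's actual taints first).
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
  # Older clusters used the "master" key instead. Tolerating a taint that
  # is not present is harmless.
  - key: node-role.kubernetes.io/master
    operator: Exists
    effect: NoSchedule
```

On a single-node cluster, every workload that has to run needs either a toleration like this or the taint removed from the node, since there is no other node for the scheduler to fall back to.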
