Adventures in Kubernetes: When Your Node Changes Data Centers
Today we faced an interesting challenge after migrating our prod26-k8s node to a new data center using snapshots. What seemed like a straightforward move turned into a fascinating journey through Kubernetes tolerations, cert-manager, and ingress configurations.
The Initial State
Our setup was relatively standard:
- Single-node Kubernetes cluster running on prod26-k8s
- Mercury bot deployment with nginx-ingress
- Let's Encrypt certificates managed by cert-manager
- All workloads running on the control-plane node
The Challenge
After restoring the node from a snapshot in the new DC, everything seemed fine at first glance. However, we quickly ran into several issues:
- Ingress Controller Pod: The nginx-ingress controller pod wouldn't schedule because it needed tolerations for the control-plane taint.
- Certificate Management: cert-manager's HTTP01 solver pods were stuck in Pending state.
- ACME Challenges: Let's Encrypt validation was failing with 503 errors.
The Root Cause
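The core issue was that our node served as both control-plane and worker while carrying the taint node-role.kubernetes.io/control-plane:NoSchedule, so any pod without a matching toleration stayed unschedulable. That also explains the 503s: nginx-ingress returns 503 when the Service behind a path has no ready endpoints, and the solver pods behind the challenge path were still Pending. After the migration, we needed to ensure all critical components had the proper tolerations: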
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
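For context, a toleration only matters relative to a taint on the node. The sketch below shows how that taint would typically appear in the Node object (for example in the output of `kubectl get node prod26-k8s -o yaml`), trimmed to the relevant fields. It is an illustration, not a dump from our cluster.

```yaml
# Excerpt of a Node object; all unrelated fields are omitted.
apiVersion: v1
kind: Node
metadata:
  name: prod26-k8s
spec:
  taints:
    # Pods without a matching toleration are not scheduled onto this node.
    - key: node-role.kubernetes.io/control-plane
      effect: NoSchedule
```

A toleration with `operator: Exists` matches this taint regardless of its value, which is why the toleration above carries no `value` field.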
The Solution
We solved this in several steps:
- First, we added tolerations to the nginx-ingress controller:
spec:
  template:
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
- Then, we discovered that cert-manager's HTTP01 solver pods also needed tolerations. We modified the ClusterIssuer:
spec:
  acme:
    solvers:
      - http01:
          ingress:
            class: nginx
            podTemplate:
              spec:
                tolerations:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
                    effect: NoSchedule
- Finally, we simplified the ingress configuration to let cert-manager handle the ACME challenge paths automatically.
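To illustrate the last step, here is a minimal sketch of what a simplified Ingress can look like. The host `mercury.example.com`, the Service name `mercury-bot`, the Secret name `mercury-tls`, and the issuer name `letsencrypt-prod` are all hypothetical placeholders, not values from our cluster.

```yaml
# Hypothetical names throughout: host, Service, Secret, and ClusterIssuer.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mercury-bot
  annotations:
    # Asks cert-manager to issue a certificate for the hosts listed under tls.
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - mercury.example.com
      secretName: mercury-tls
  rules:
    - host: mercury.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mercury-bot
                port:
                  number: 80
```

Note that there is no hand-written `/.well-known/acme-challenge` path here. With an HTTP01 solver configured for an ingress class, cert-manager creates its own temporary solver Ingress during validation and removes it afterwards.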
Lessons Learned
- Tolerations are Critical: In a single-node cluster where the control-plane node also runs workloads, every pod that has to be scheduled needs a toleration for the control-plane taint.
- cert-manager Complexity: The cert-manager stack involves multiple components (controller, webhook, solver pods), and each one needs the right tolerations before it can be scheduled.
- Migration Checklist: When moving nodes between DCs, we need to verify:
  - Node labels and taints
  - Pod scheduling requirements
  - Network connectivity for ACME challenges
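The point about cert-manager's multiple components can be made concrete with Helm values. This sketch assumes cert-manager was installed from its official Helm chart, which may not match every setup. It also lists the chart's cainjector component alongside the controller and webhook.

```yaml
# values.yaml fragment for the cert-manager Helm chart (assumed install method).
# Top-level tolerations apply to the controller Deployment.
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
webhook:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
cainjector:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
```

These values do not reach the HTTP01 solver pods. Those are created per challenge and take their tolerations from the solver's podTemplate on the issuer.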
Best Practices Moving Forward
- Always include control-plane tolerations in critical components when running a single-node cluster.
- Set a podTemplate on the HTTP01 solver in cert-manager issuers so that solver pods get the tolerations they need and ACME challenges can complete.
- Keep ingress configurations simple and let cert-manager handle its paths.
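Putting the second practice into a complete manifest, a ClusterIssuer with solver tolerations might look like the sketch below. The issuer name, email address, and account-key Secret name are hypothetical placeholders. The server URL is Let's Encrypt's production ACME endpoint.

```yaml
# Hypothetical values: issuer name, email, and privateKeySecretRef name.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      # Secret where cert-manager stores the ACME account key.
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
            podTemplate:
              spec:
                # Lets solver pods schedule onto the tainted control-plane node.
                tolerations:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
                    effect: NoSchedule
```

Keeping the tolerations in the issuer rather than patching solver pods by hand matters because solver pods are short-lived and recreated for every challenge.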
Conclusion
What started as a simple DC migration turned into a deep dive into Kubernetes scheduling and certificate management. The key takeaway? In a single-node cluster, tolerations are just as important as the workloads themselves.
Remember: Just because a node moves to a new home doesn't mean its Kubernetes personality changes - you just need to make sure everyone's invited to the party! 🎉
Tags: #kubernetes #devops #cert-manager #ingress #tolerations #migration