Adventures in Kubernetes: When Your Node Changes Data Centers

Today we faced an interesting challenge after migrating our prod26-k8s node to a new data center using snapshots. What seemed like a straightforward move turned into a fascinating journey through Kubernetes tolerations, cert-manager, and ingress configurations.

The Initial State

Our setup was relatively standard:

  • Single-node Kubernetes cluster running on prod26-k8s
  • Mercury bot deployment with nginx-ingress
  • Let's Encrypt certificates managed by cert-manager
  • All workloads running on the control-plane node

The Challenge

After restoring the node from a snapshot in the new DC, everything seemed fine at first glance. However, we quickly ran into several issues:

  1. Ingress Controller Pod: The nginx-ingress controller pod wouldn't schedule because it needed tolerations for the control-plane taint.
  2. Certificate Management: cert-manager's HTTP01 solver pods were stuck in Pending state.
  3. ACME Challenges: Let's Encrypt validation was failing with 503 errors.

The Root Cause

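The core issue was that our single node acts as both control plane and worker, and it carries the taint node-role.kubernetes.io/control-plane:NoSchedule. The scheduler will not place a pod on a tainted node unless the pod tolerates that taint, so any pod without a matching toleration stays Pending. After the migration, we needed to ensure all critical components had the proper tolerations: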

    tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
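To make the matching rule concrete, here is a minimal Python sketch of how a toleration is compared against a taint. It is our own illustration of the documented rules, not the scheduler's actual code, and the function names tolerates and untolerated_taints are made up for this post.

```python
# Illustrative only: a simplified model of taint/toleration matching.
# The function names are hypothetical; the real logic lives in the scheduler.

CONTROL_PLANE_TAINT = {
    "key": "node-role.kubernetes.io/control-plane",
    "effect": "NoSchedule",
}
CONTROL_PLANE_TOLERATION = {
    "key": "node-role.kubernetes.io/control-plane",
    "operator": "Exists",
    "effect": "NoSchedule",
}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint."""
    # An empty effect on the toleration matches every effect.
    effect = toleration.get("effect", "")
    if effect and effect != taint.get("effect"):
        return False
    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")  # Equal is the default
    # An empty key is only meaningful with Exists and matches every key.
    if not key:
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    # Equal: the values have to match as well.
    return toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(tolerations: list, taints: list) -> list:
    """Return the hard taints that none of the tolerations cover."""
    hard = [t for t in taints if t.get("effect") in ("NoSchedule", "NoExecute")]
    return [t for t in hard if not any(tolerates(tol, t) for tol in tolerations)]


# A pod without tolerations is blocked by the control-plane taint ...
print(untolerated_taints([], [CONTROL_PLANE_TAINT]))
# ... while a pod carrying the toleration above is not.
print(untolerated_taints([CONTROL_PLANE_TOLERATION], [CONTROL_PLANE_TAINT]))
```

An empty result from untolerated_taints means nothing about the node's taints stops the pod from being scheduled there.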

The Solution

We solved this in several steps:

  1. First, we added tolerations to the nginx-ingress controller:
     spec:
       template:
         spec:
           tolerations:
           - key: node-role.kubernetes.io/control-plane
             operator: Exists
             effect: NoSchedule
  2. Then, we discovered that cert-manager's HTTP01 solver pods also needed tolerations. We modified the ClusterIssuer:
     spec:
       acme:
         solvers:
         - http01:
             ingress:
               class: nginx
               podTemplate:
                 spec:
                   tolerations:
                   - key: node-role.kubernetes.io/control-plane
                     operator: Exists
                     effect: NoSchedule
  3. Finally, we simplified the ingress configuration to let cert-manager handle the ACME challenge paths automatically.
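If you would rather script the first two changes than edit manifests by hand, something like the following Python sketch could work. It assumes the Deployment and ClusterIssuer manifests have already been loaded into plain dicts (for example with a YAML parser); the helper names patch_deployment and patch_cluster_issuer are hypothetical and not part of any Kubernetes or cert-manager tooling.

```python
import copy

# The toleration from this post; everything else below is a hypothetical helper.
CONTROL_PLANE_TOLERATION = {
    "key": "node-role.kubernetes.io/control-plane",
    "operator": "Exists",
    "effect": "NoSchedule",
}


def _ensure_toleration(pod_spec: dict) -> None:
    """Append the control-plane toleration unless it is already present."""
    tolerations = pod_spec.setdefault("tolerations", [])
    if CONTROL_PLANE_TOLERATION not in tolerations:
        tolerations.append(dict(CONTROL_PLANE_TOLERATION))


def patch_deployment(manifest: dict) -> dict:
    """Return a copy of a Deployment with the toleration in spec.template.spec."""
    patched = copy.deepcopy(manifest)
    pod_spec = (
        patched.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("spec", {})
    )
    _ensure_toleration(pod_spec)
    return patched


def patch_cluster_issuer(manifest: dict) -> dict:
    """Return a copy of a ClusterIssuer whose HTTP01 ingress solvers tolerate the taint."""
    patched = copy.deepcopy(manifest)
    solvers = patched.get("spec", {}).get("acme", {}).get("solvers", [])
    for solver in solvers:
        ingress = solver.get("http01", {}).get("ingress")
        if ingress is None:
            continue  # not an HTTP01 ingress solver, leave it alone
        pod_spec = ingress.setdefault("podTemplate", {}).setdefault("spec", {})
        _ensure_toleration(pod_spec)
    return patched


issuer = {"spec": {"acme": {"solvers": [{"http01": {"ingress": {"class": "nginx"}}}]}}}
print(patch_cluster_issuer(issuer))
```

Both helpers copy their input and are idempotent, so running them twice does not duplicate the toleration. Writing the result back to the cluster is left out here on purpose.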

Lessons Learned

  1. Tolerations are Critical: In a single-node cluster where the control-plane node also runs workloads, every pod that has to land on that node needs a toleration for its taint.
  2. cert-manager Complexity: The cert-manager stack involves multiple components (controller, webhook, solver pods), and each one has to be schedulable on the tainted node.
  3. Migration Checklist: When moving nodes between DCs, we need to verify:
    • Node labels and taints
    • Pod scheduling requirements
    • Network connectivity for ACME challenges
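The first two checklist items are easy to automate. Below is a rough Python sketch that summarizes labels and taints per node and lists Pending pods. It assumes you feed it the parsed JSON from kubectl get nodes -o json and kubectl get pods -A -o json; the function names are our own, and the network connectivity check is not covered.

```python
# Hypothetical helpers for a post-migration check. The inputs are the parsed
# JSON lists printed by "kubectl get nodes -o json" and
# "kubectl get pods -A -o json".


def format_taint(taint: dict) -> str:
    """Render a taint as key[=value]:effect."""
    value = taint.get("value")
    key = taint["key"] if not value else f'{taint["key"]}={value}'
    return f'{key}:{taint["effect"]}'


def node_summary(node_list: dict) -> list:
    """Collect name, labels and taints for every node in the list."""
    summary = []
    for node in node_list.get("items", []):
        taints = node.get("spec", {}).get("taints") or []
        summary.append(
            {
                "name": node["metadata"]["name"],
                "labels": node["metadata"].get("labels", {}),
                "taints": [format_taint(t) for t in taints],
            }
        )
    return summary


def pending_pods(pod_list: dict) -> list:
    """Return namespace/name for every pod whose phase is Pending."""
    return [
        f'{pod["metadata"]["namespace"]}/{pod["metadata"]["name"]}'
        for pod in pod_list.get("items", [])
        if pod.get("status", {}).get("phase") == "Pending"
    ]
```

To try it, save the two kubectl outputs to files, load them with json.load, and pass the resulting dicts in. Any pod that shows up in pending_pods is worth checking against the taints reported by node_summary.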

Best Practices Moving Forward

  1. Always include control-plane tolerations in critical components when running a single-node cluster.
  2. Use podTemplate in cert-manager's HTTP01 solver configuration so the solver pods get the tolerations they need and ACME challenges can complete.
  3. Keep ingress configurations simple and let cert-manager handle its paths.

Conclusion

What started as a simple DC migration turned into a deep dive into Kubernetes scheduling and certificate management. The key takeaway? In a single-node cluster, tolerations are just as important as the workloads themselves.

Remember: Just because a node moves to a new home doesn't mean its Kubernetes personality changes - you just need to make sure everyone's invited to the party! 🎉

Tags: #kubernetes #devops #cert-manager #ingress #tolerations #migration