Adventures in Kubernetes: When Your Node Changes Data Centers

January 9, 2025 · 2 min read

Today we faced an interesting challenge after migrating our prod26-k8s node to a new data center using snapshots. What seemed like a straightforward move turned into a fascinating journey through Kubernetes tolerations, cert-manager, and ingress configurations.

The Initial State

Our setup was relatively standard:

- Single-node Kubernetes cluster running on prod26-k8s
- Mercury bot deployment with nginx-ingress
- Let's Encrypt certificates managed by cert-manager
- All workloads running on the control-plane node

The Challenge

After restoring the node from a snapshot in the new DC, everything seemed fine at first glance. However, we quickly ran into several issues:

- Ingress Controller Pod: The nginx-ingress controller pod wouldn't schedule because it needed tolerations for the control-plane taint.
- Certificate Management: cert-manager's HTTP01 solver pods were stuck in Pending state.
- ACME Challenges: Let's Encrypt validation was failing with 503 errors.

The Root Cause

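The core issue was that our single node acts as both control plane and worker, and it carries the taint node-role.kubernetes.io/control-plane:NoSchedule. The scheduler will not place a pod on a tainted node unless the pod tolerates that taint, and with only one node in the cluster there is nowhere else for the pod to go, so it stays Pending. After the migration, we needed to ensure all critical components had the proper tolerations: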

tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
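
For context, the taint that this toleration matches lives on the Node object itself. A minimal sketch of the relevant part of the node, with every other field omitted (the node name is the one from this post; the rest of a real Node object will contain much more):

```yaml
# Excerpt of a Node object showing only the control-plane taint.
apiVersion: v1
kind: Node
metadata:
  name: prod26-k8s
spec:
  taints:
    # A pod needs a toleration with the same key and effect
    # (or operator: Exists on this key) to be scheduled here.
    - key: node-role.kubernetes.io/control-plane
      effect: NoSchedule
```

The toleration above uses operator: Exists, so it matches this taint regardless of any value set on it.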

The Solution

We solved this in several steps:

First, we added tolerations to the nginx-ingress controller:

spec:
  template:
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
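
If the controller is installed with Helm (an assumption; this post does not depend on how it was deployed), a manual change to the Deployment can be overwritten on the next upgrade. In that case the same toleration can be kept in the chart values instead. A sketch assuming the ingress-nginx chart, which exposes a controller.tolerations value:

```yaml
# values.yaml excerpt, assuming the ingress-nginx Helm chart.
controller:
  tolerations:
    # Allow the controller pod onto the tainted control-plane node
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
```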

Then, we discovered that cert-manager's HTTP01 solver pods also needed tolerations. We modified the ClusterIssuer:

spec:
  acme:
    solvers:
      - http01:
          ingress:
            class: nginx
            podTemplate:
              spec:
                tolerations:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
                    effect: NoSchedule

Finally, we simplified the ingress configuration to let cert-manager handle the ACME challenge paths automatically.
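
As an illustration of what "simple" can look like, here is a sketch of an Ingress that only routes application traffic and leaves certificate issuance to cert-manager. All names in it (mercury-bot, mercury.example.com, letsencrypt-prod, mercury-tls, port 80) are placeholders for illustration, not values from the actual cluster:

```yaml
# Hypothetical Ingress: every name, host and port below is a placeholder.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mercury-bot
  annotations:
    # Ask cert-manager to issue a certificate via this ClusterIssuer
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - mercury.example.com
      # cert-manager stores the issued certificate in this Secret
      secretName: mercury-tls
  rules:
    - host: mercury.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mercury-bot
                port:
                  number: 80
```

Note that there is no rule for /.well-known/acme-challenge here. During validation, cert-manager's HTTP01 solver sets up the routing for that path itself and removes it again afterwards.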

Lessons Learned

- Tolerations are Critical: In a single-node cluster where the control-plane node also runs workloads, every pod that has to run needs a toleration for the control-plane taint.
- cert-manager Complexity: The cert-manager stack involves multiple components (controller, webhook, solver pods), and each one needs its own tolerations before it can be scheduled.
- Migration Checklist: When moving nodes between DCs, we need to verify:
  - Node labels and taints
  - Pod scheduling requirements
  - Network connectivity for ACME challenges
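
To make the cert-manager point concrete: if cert-manager is installed from its Helm chart (again an assumption, not something stated above), each long-running component has its own tolerations value. A sketch of how they could be set together:

```yaml
# values.yaml excerpt, assuming the cert-manager Helm chart.
# Top-level tolerations apply to the controller.
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
webhook:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
cainjector:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
startupapicheck:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule
```

These values do not cover the HTTP01 solver pods. Those are created on demand per challenge, which is why their tolerations are set through podTemplate in the ClusterIssuer instead.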

Best Practices Moving Forward

- Always include control-plane tolerations in critical components when running a single-node cluster.
- Set podTemplate in the cert-manager HTTP01 solver configuration so the solver pods can be scheduled and ACME challenges can complete.
- Keep ingress configurations simple and let cert-manager handle the ACME challenge paths.

Conclusion

What started as a simple DC migration turned into a deep dive into Kubernetes scheduling and certificate management. The key takeaway? In a single-node cluster, tolerations are just as important as the workloads themselves.

Remember: Just because a node moves to a new home doesn't mean its Kubernetes personality changes - you just need to make sure everyone's invited to the party! 🎉

Tags: #kubernetes #devops #cert-manager #ingress #tolerations #migration
