The Art of Pain: A Tale of Kubernetes Ingress Configuration
Date: January 20, 2025
Today, I want to share a story about the intricate dance between Kubernetes, Ingress Controllers, and SSL certificates - a tale that many DevOps engineers might find painfully familiar.
The Setup
Picture this: A seemingly straightforward task of setting up an Ingress Controller with SSL certificates for two domains. The players in our story:
- An NGINX Ingress Controller
- cert-manager for SSL certificate management
- Let's Encrypt as our certificate authority
- MetalLB for load balancing in a bare metal environment
Act 1: The Initial Deployment
Everything started smoothly. The Ingress Controller deployment went up without complaints, MetalLB assigned our IP address, and cert-manager was ready to do its magic. But as we all know, in the world of Kubernetes, silence often precedes the storm.
Act 2: The Certificate Challenge
The first sign of trouble came when our ACME challenges started failing. The Ingress Controller was running, but it wasn't accepting connections on port 80. A classic case of "it's running but not working."
The investigation revealed multiple layers of complexity:
- The Ingress Controller pod was using host networking
- The service selectors didn't match the pod labels
- iptables rules were rejecting traffic because the Service had no endpoints registered behind it
Act 3: The Label Mismatch
The root cause? A simple yet maddening mismatch between service selectors and pod labels. The service was looking for:
app.kubernetes.io/component: controller
app.kubernetes.io/instance: ingress-nginx
app.kubernetes.io/name: ingress-nginx
While our pods were labeled with just:
app: ingress-nginx
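To make the mismatch concrete, here is a minimal Python sketch of how an equality-based selector is evaluated: every key/value pair in the selector has to appear in the pod's labels. The function name missing_selector_labels is made up for illustration; the label values are the ones shown above.

```python
def missing_selector_labels(selector: dict[str, str], pod_labels: dict[str, str]) -> dict[str, str]:
    """Return the selector entries that the pod's labels do not satisfy.

    An empty result means the pod would be selected by the Service.
    """
    return {key: value for key, value in selector.items() if pod_labels.get(key) != value}


# What the Service was looking for.
service_selector = {
    "app.kubernetes.io/component": "controller",
    "app.kubernetes.io/instance": "ingress-nginx",
    "app.kubernetes.io/name": "ingress-nginx",
}

# What the pods actually carried.
pod_labels = {"app": "ingress-nginx"}

unmet = missing_selector_labels(service_selector, pod_labels)
for key in sorted(unmet):
    print(f"pod is missing {key}={unmet[key]}")
```

With the labels from this story, all three selector entries come back as unmet, so the Service selects no pods and ends up with no endpoints.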
The Learning
This experience reinforces several key lessons:
- Always verify label consistency across your Kubernetes resources
- Check endpoint registration when network issues arise
- Don't assume host networking will simplify your setup - it often adds complexity
- Monitor iptables rules when troubleshooting network issues
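The endpoint check can be scripted. The sketch below assumes you have saved the output of kubectl get endpoints <service-name> -n <namespace> -o json and want to know whether any ready addresses sit behind the Service. The service name and IP in the sample data are hypothetical, not taken from our clusters.

```python
import json


def ready_addresses(endpoints: dict) -> list[str]:
    """Collect the IPs of ready addresses from an Endpoints object.

    A Service whose selector matches no pods has no "subsets" at all,
    so a missing or null value is treated as an empty list.
    """
    ips = []
    for subset in endpoints.get("subsets") or []:
        for address in subset.get("addresses") or []:
            ips.append(address["ip"])
    return ips


# Hypothetical output for a Service whose selector matches nothing.
broken = json.loads("""
{"kind": "Endpoints", "metadata": {"name": "ingress-nginx-controller"}}
""")

# Hypothetical output once labels and selector agree.
healthy = json.loads("""
{"kind": "Endpoints",
 "metadata": {"name": "ingress-nginx-controller"},
 "subsets": [{"addresses": [{"ip": "10.0.0.5"}],
              "ports": [{"port": 80}, {"port": 443}]}]}
""")

for name, obj in (("broken", broken), ("healthy", healthy)):
    ips = ready_addresses(obj)
    print(f"{name}: {len(ips)} ready address(es) {ips}")
```

An empty list here is the signal to go back and compare selectors with pod labels before digging into iptables.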
Conclusion
In the end, the solution was to realign our labels and selectors. A simple fix for a problem that took hours to track down. But isn't that often the case in our field? The hardest problems tend to have the simplest solutions - once you find them.
Remember: In Kubernetes, labels are not just organizational tools - they're the glue that holds your services together. Treat them with the respect they deserve.
The Real Question: Why So Much Pain?
Everything described above happened on Notus, our ARM cluster. Boreas, our AMD64 cluster, runs the same stack smoothly. That contrast raises an important question: Why did we face these issues on Notus and not on Boreas? The answer lies in several critical differences:
1. Infrastructure Maturity
Boreas, being our AMD64 cluster, benefits from more mature and tested configurations. Most Kubernetes components are primarily developed and tested on AMD64 architectures. Notus, running on ARM, occasionally encounters edge cases and compatibility issues that aren't immediately apparent in standard testing.
2. Deployment Evolution
Boreas's configuration evolved organically through multiple iterations, with each issue being addressed and documented. Its current state represents months of refinement. In contrast, Notus was deployed with a "clean slate" approach, exposing us to issues that had already been solved on Boreas but were never fully captured in our infrastructure-as-code.
3. The Host Networking Trap
The decision to use host networking on Notus seemed like a simplification at first - direct access to host ports, no overlay network complexity. However, this "simplification" actually introduced subtle complications:
- Label propagation became more critical
- Service endpoint registration became more sensitive
- iptables rules needed more precise configuration
4. The Hidden Cost of Automation
Our automation tools and Helm charts, optimized for Boreas, made assumptions about the underlying infrastructure that didn't hold true for Notus. These assumptions included:
- Default label schemas
- Network topology
- Service discovery patterns
Moving Forward: Bridging the Gap
To prevent similar issues in future deployments, we need to:
- Document architecture-specific configurations explicitly
- Maintain parity in deployment processes between clusters
- Avoid making assumptions about infrastructure compatibility
- Create robust testing procedures for both AMD64 and ARM environments
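One possible shape for such a test is a pre-deploy check that runs against the manifests for either cluster and flags any Service whose selector matches no pod. This is only a sketch under assumptions: the manifests are taken to be already parsed into Python dicts (for example from rendered chart output), the function name services_without_pods is hypothetical, and only equality-based selectors are handled.

```python
def services_without_pods(services: list[dict], pods: list[dict]) -> list[str]:
    """Return names of Services whose selector matches no pod in the same namespace."""
    orphaned = []
    for svc in services:
        selector = svc.get("spec", {}).get("selector") or {}
        if not selector:
            continue  # Services without a selector have manually managed endpoints
        namespace = svc["metadata"].get("namespace", "default")
        matched = any(
            pod["metadata"].get("namespace", "default") == namespace
            and all(
                (pod["metadata"].get("labels") or {}).get(key) == value
                for key, value in selector.items()
            )
            for pod in pods
        )
        if not matched:
            orphaned.append(svc["metadata"]["name"])
    return orphaned


# Hypothetical manifests mirroring the mismatch from this post.
services = [{
    "metadata": {"name": "ingress-nginx-controller", "namespace": "ingress-nginx"},
    "spec": {"selector": {
        "app.kubernetes.io/component": "controller",
        "app.kubernetes.io/instance": "ingress-nginx",
        "app.kubernetes.io/name": "ingress-nginx",
    }},
}]
pods = [{
    "metadata": {"name": "controller-pod", "namespace": "ingress-nginx",
                 "labels": {"app": "ingress-nginx"}},
}]

print("Services with no matching pods:", services_without_pods(services, pods))
```

Run against both clusters' manifests in CI, a check like this would have caught our label mismatch before anything was deployed.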
Remember: In the pursuit of infrastructure consistency, what works on one platform might need careful adaptation for another. The "pain" we experienced wasn't just about technical issues - it was about the gap between assumptions and reality in cross-architecture deployments.
This post is part of our series on Kubernetes Best Practices and Lessons Learned.