The Art of Pain: A Tale of Kubernetes Ingress Configuration

January 22, 2025 · 3 min read

Date: January 20, 2025

Today, I want to share a story about the intricate dance between Kubernetes, Ingress Controllers, and SSL certificates - a tale that many DevOps engineers might find painfully familiar.

The Setup

Picture this: A seemingly straightforward task of setting up an Ingress Controller with SSL certificates for two domains. The players in our story:

An NGINX Ingress Controller
cert-manager for SSL certificate management
Let's Encrypt as our certificate authority
MetalLB for load balancing in a bare metal environment

Act 1: The Initial Deployment

Everything started smoothly. The Ingress Controller deployment went up without complaints, MetalLB assigned our IP address, and cert-manager was ready to do its magic. But as we all know, in the world of Kubernetes, silence often precedes the storm.

Act 2: The Certificate Challenge

The first sign of trouble came when our ACME challenges started failing. The Ingress Controller was running, but it wasn't accepting connections on port 80, which Let's Encrypt has to reach in order to validate an HTTP-01 challenge. A classic case of "it's running but not working."

The investigation revealed multiple layers of complexity:

The Ingress Controller pod was using host networking
The service selectors didn't match the pod labels
iptables rules were rejecting traffic because endpoints weren't properly registered

Act 3: The Label Mismatch

The root cause? A simple yet maddening mismatch between service selectors and pod labels. The service was looking for:

app.kubernetes.io/component: controller
app.kubernetes.io/instance: ingress-nginx
app.kubernetes.io/name: ingress-nginx

While our pods were labeled with just:

app: ingress-nginx
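
To see why this mismatch leaves the Service with nothing to route to, here is a minimal Python sketch of equality-based selector matching: a pod is selected only if it carries every key/value pair listed in the selector. The function name selector_matches is hypothetical, and the dictionaries simply restate the labels shown above; the sketch models the rule and does not call the Kubernetes API.

```python
def selector_matches(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """Return True if the pod carries every key/value pair in the selector.

    Models a non-empty, equality-based selector only.
    """
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Labels the Service was selecting on (from the post).
service_selector = {
    "app.kubernetes.io/component": "controller",
    "app.kubernetes.io/instance": "ingress-nginx",
    "app.kubernetes.io/name": "ingress-nginx",
}

# Labels the pods actually carried (from the post).
pod_labels = {"app": "ingress-nginx"}

# Selector keys that the pod does not have at all.
missing = sorted(key for key in service_selector if key not in pod_labels)

print(selector_matches(service_selector, pod_labels))  # False
print(missing)
```

Because none of the three selector keys are present on the pod, the Service selects zero pods, and therefore has no endpoints to send traffic to.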

The Learning

This experience reinforces several key lessons:

Always verify label consistency across your Kubernetes resources
Check endpoint registration when network issues arise
Don't assume host networking will simplify your setup - it often adds complexity
Monitor iptables rules when troubleshooting network issues
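
The endpoint check in the second lesson is easy to script. The sketch below assumes the output of kubectl get endpoints -A -o json has already been captured and parsed; the helper name services_without_endpoints and the sample data (including the Service names) are illustrative and not taken from the actual incident. It reports every Endpoints object with no ready addresses, which is the situation described in Act 2.

```python
import json


def services_without_endpoints(endpoints_list: dict) -> list[str]:
    """Return 'namespace/name' for each Endpoints object with no ready addresses."""
    empty = []
    for item in endpoints_list.get("items", []):
        # 'subsets' is absent (or empty) when no pod matches the selector.
        subsets = item.get("subsets") or []
        ready = sum(len(subset.get("addresses") or []) for subset in subsets)
        if ready == 0:
            meta = item["metadata"]
            empty.append(f"{meta['namespace']}/{meta['name']}")
    return empty


# Hypothetical sample in the shape of `kubectl get endpoints -A -o json`.
sample = json.loads("""
{
  "items": [
    {"metadata": {"namespace": "ingress-nginx", "name": "ingress-nginx-controller"}},
    {"metadata": {"namespace": "default", "name": "kubernetes"},
     "subsets": [{"addresses": [{"ip": "10.0.0.1"}]}]}
  ]
}
""")

print(services_without_endpoints(sample))
```

With the sample above, only the ingress controller Service is reported, since its Endpoints object has no subsets at all.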

Conclusion

In the end, the solution was to realign our labels and selectors. A simple fix for a problem that caused hours of investigation. But isn't that often the case in our field? The most challenging problems often have the simplest solutions - once you find them.

Remember: In Kubernetes, labels are not just organizational tools - they're the glue that holds your services together. Treat them with the respect they deserve.

The Real Question: Why So Much Pain?

The incident above happened on Notus, our ARM cluster, while Boreas, our AMD64 cluster, runs the same stack smoothly. That contrast raises an important question: Why did we face these issues on Notus but not on Boreas? The answer lies in several critical differences:

1. Infrastructure Maturity

Boreas, being our AMD64 cluster, benefits from more mature and tested configurations. Most Kubernetes components are primarily developed and tested on AMD64 architectures. Notus, running on ARM, occasionally encounters edge cases and compatibility issues that aren't immediately apparent in standard testing.

2. Deployment Evolution

Boreas's configuration evolved organically through multiple iterations, with each issue being addressed and documented. Its current state represents months of refinement. In contrast, Notus was deployed with a "clean slate" approach, exposing us to issues that were previously solved but not fully documented in our infrastructure-as-code.

3. The Host Networking Trap

The decision to use host networking on Notus seemed like a simplification at first - direct access to host ports, no overlay network complexity. However, this "simplification" actually introduced subtle complications:

Label propagation became more critical
Service endpoint registration became more sensitive
iptables rules needed more precise configuration

4. The Hidden Cost of Automation

Our automation tools and Helm charts, optimized for Boreas, made assumptions about the underlying infrastructure that didn't hold true for Notus. These assumptions included:

Default label schemas
Network topology
Service discovery patterns

Moving Forward: Bridging the Gap

To prevent similar issues in future deployments, we need to:

Document architecture-specific configurations explicitly
Maintain parity in deployment processes between clusters
Avoid making assumptions about infrastructure compatibility
Create robust testing procedures for both AMD64 and ARM environments
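
One way to make the parity goal concrete is a small drift check that compares the labels a workload carries on each cluster. This is only a sketch: label_drift is a hypothetical helper, and the Boreas labels below are an assumption for illustration, since the post only states what the Service expected and what the Notus pods carried.

```python
def label_drift(reference: dict[str, str], candidate: dict[str, str]) -> dict[str, tuple]:
    """Map each differing label key to (reference value, candidate value).

    None in either position means the label is absent on that side.
    """
    keys = reference.keys() | candidate.keys()
    return {
        key: (reference.get(key), candidate.get(key))
        for key in sorted(keys)
        if reference.get(key) != candidate.get(key)
    }


# ASSUMPTION: Boreas pods carry the labels the Service selects on.
boreas_labels = {
    "app.kubernetes.io/component": "controller",
    "app.kubernetes.io/instance": "ingress-nginx",
    "app.kubernetes.io/name": "ingress-nginx",
}

# Labels found on the Notus pods (from the post).
notus_labels = {"app": "ingress-nginx"}

for key, (on_boreas, on_notus) in label_drift(boreas_labels, notus_labels).items():
    print(f"{key}: boreas={on_boreas!r} notus={on_notus!r}")
```

Running a check like this before rollout would surface the differing label schema as a list of keys, rather than as failing ACME challenges later on.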

Remember: In the pursuit of infrastructure consistency, what works on one platform might need careful adaptation for another. The "pain" we experienced wasn't just about technical issues - it was about the gap between assumptions and reality in cross-architecture deployments.

This post is part of our series on Kubernetes Best Practices and Lessons Learned.
