The Great SSL Certificate Mystery: A Tale of Dokku, Domains, and DevOps Drama
Origin and Evolution
It all started in the dead of night when our production server p12 ran out of disk space. The culprit? A misconfigured Prometheus instance happily hoarding metrics data, blissfully ignoring its retention parameters. This seemingly simple storage issue would spiral into a cascade of problems that would take hours to fully unravel.
To make matters worse, the server became completely inaccessible during our recovery attempts. We suspect Prometheus rebuilding its WAL (write-ahead log) was to blame, but for hours we were stuck repeating the same troubleshooting steps, unable to keep a stable connection to the server.
The breakthrough came when we decided to fall back on our infrastructure-as-code approach and run the Ansible playbook. Like magic, it not only restored server connectivity but also delivered a surprise: four abandoned applications that had been silent for months suddenly sprang back to life!
While restoring monitoring after cleaning up the disk space, we tried accessing prometheus.kaido.team/targets. Instead of the expected monitoring dashboard, we were served a Let's Encrypt certificate for... Anytracker? This kicked off an hours-long investigation that would reveal some interesting quirks in our Dokku-based infrastructure.
The setup seemed simple enough: a Dokku installation managing multiple applications, each with its own domain and SSL certificate. But as we would discover, the devil was in the details of domain configuration and certificate management.
Four Revolutionary Use Cases
-
Domain Flexibility Gone Wrong
- Dokku's ability to handle multiple domains per app is powerful
- But this flexibility can lead to configuration drift
- Our case: prometheus-p12.kaido.team vs prometheus.kaido.team
-
SSL Certificate Management
- Let's Encrypt integration in Dokku is seamless when configured correctly
- Automatic renewal works like magic once the plugin's renewal cron job is enabled
- But certificate-domain mismatches can cause cryptic issues
-
Configuration as Code Saves the Day
- Ansible playbooks ensure consistent server setup
- Infrastructure as Code prevents configuration drift
- Running the playbook fixed server connectivity issues
- Unexpectedly revived four abandoned applications
-
Monitoring Stack Reliability
- Prometheus is crucial for system observability
- SSL issues can break monitoring silently
- Proper retention configuration is critical for stability
- WAL rebuilding can impact server accessibility
- Quick detection and resolution are essential
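A certificate-domain mismatch like ours can be caught with a small check rather than by accident in a browser. The following is a minimal sketch, not what we ran during the incident: it assumes Python 3.9+ and only the standard library, and the function names (`name_matches`, `served_dns_names`, `covers`) as well as the `anytracker.example.com` hostname are made up for illustration.

```python
import socket
import ssl


def name_matches(pattern: str, hostname: str) -> bool:
    """Return True if a certificate name (exact or leading wildcard) covers hostname."""
    pattern, hostname = pattern.lower(), hostname.lower()
    if pattern.startswith("*."):
        # A wildcard stands in for exactly one left-most label.
        head, _, rest = hostname.partition(".")
        return bool(head) and rest == pattern[2:]
    return pattern == hostname


def served_dns_names(host: str, port: int = 443, timeout: float = 5.0) -> list[str]:
    """Fetch the DNS names (SANs) of the certificate actually served for host."""
    ctx = ssl.create_default_context()
    # Keep chain validation, but skip the hostname check so a mismatched
    # certificate can be inspected instead of aborting the handshake.
    ctx.check_hostname = False
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return [value for kind, value in cert.get("subjectAltName", ()) if kind == "DNS"]


def covers(host: str, names: list[str]) -> bool:
    """True if any of the certificate names covers the requested host."""
    return any(name_matches(name, host) for name in names)


if __name__ == "__main__":
    # Hypothetical certificate names, standing in for the wrong app's cert.
    print(covers("prometheus.kaido.team", ["anytracker.example.com"]))
```

Against a live server, `covers(host, served_dns_names(host))` returning `False` is the signal that the wrong application's certificate is being served for that domain.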
Comparison of Alternatives
While Dokku has served us well, this incident made us evaluate alternatives:
-
PaaS Alternatives
- Kubernetes with Rancher
- Pros: Built-in certificate management, better resource isolation, automated scaling
- Cons: Steep learning curve, higher operational overhead
- Best for: When you need true multi-node scaling and HA
- CapRover
- Pros: Similar to Dokku but with web UI, built-in monitoring, automatic SSL
- Cons: Less mature, smaller community
- Best for: Single-node deployments needing better UI/UX
- Coolify
- Pros: Modern UI, built-in monitoring, automatic backups
- Cons: Relatively new, community support still growing
- Best for: Developer-focused teams needing simple deployment
-
Monitoring Alternatives
- VictoriaMetrics
- Pros: Better resource usage, simpler operation, built-in data retention
- Cons: Uses MetricsQL, which is largely PromQL-compatible but differs in some behaviors
- Best for: When storage efficiency is critical
- Thanos
- Pros: Long-term storage, high availability, unlimited retention
- Cons: More complex setup, requires object storage
- Best for: When you need to scale Prometheus horizontally
-
SSL/Domain Management Alternatives
- Caddy
- Pros: Automatic HTTPS, simpler configuration (rate limiting is available through a plugin rather than built in)
- Cons: Less flexible than Nginx for complex setups
- Best for: Simple, automated SSL management
- Traefik
- Pros: Dynamic configuration, automatic SSL, Docker-native
- Cons: Configuration can be verbose
- Best for: Container-heavy environments
-
Storage Management Alternatives
- Loki for Logs
- Pros: Better log compression, label-based querying
- Cons: Different mental model than traditional logging
- Best for: When metrics and logs need similar querying
- MinIO for Object Storage
- Pros: S3-compatible, can offload metrics/logs
- Cons: Additional service to maintain
- Best for: Long-term storage of metrics/logs
Each alternative addresses specific pain points from our incident:
-
PaaS Choice Impact
- Better resource isolation helps contain disk space issues to one workload
- Built-in monitoring reduces configuration complexity
- Automated certificate management reduces the room for SSL misconfiguration
-
Monitoring Stack Impact
- Better storage efficiency
- Automatic data retention management
- Built-in high availability
-
SSL Management Impact
- Automated certificate renewal
- Simplified domain configuration
- Better error reporting
-
Storage Solutions Impact
- Predictable storage growth
- Easy scaling of storage
- Better backup options
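On the "predictable storage growth" point, Prometheus's own storage documentation gives a rough capacity formula: needed disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample (about 1 to 2 bytes on average). The sketch below just turns that into arithmetic; the ingestion rate and disk budget in the example are invented numbers, not measurements from p12.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimated_disk_bytes(retention_days: float,
                         samples_per_second: float,
                         bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB size: retention_seconds * samples/s * bytes/sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


def retention_days_that_fit(disk_budget_bytes: float,
                            samples_per_second: float,
                            bytes_per_sample: float = 2.0) -> float:
    """Invert the formula: how many days of data fit into a given budget."""
    return disk_budget_bytes / (SECONDS_PER_DAY * samples_per_second * bytes_per_sample)


if __name__ == "__main__":
    # Hypothetical workload: 10,000 samples/s kept for 15 days.
    size = estimated_disk_bytes(15, 10_000)
    print(f"{size / 1e9:.1f} GB")
    # Hypothetical budget: how long can we retain data in 10 GB?
    print(f"{retention_days_that_fit(10e9, 10_000):.1f} days")
```

An estimate like this is only a starting point. Setting both `--storage.tsdb.retention.time` and `--storage.tsdb.retention.size` on the Prometheus command line is what actually caps disk usage.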
Practical Application Steps
If you encounter similar SSL certificate issues with Dokku, here's our battle-tested solution:
-
Check domain configuration:
dokku domains:report your-app
-
Verify SSL certificate details:
dokku certs:report your-app
-
Fix domain mismatch:
dokku domains:clear your-app
dokku domains:add your-app your-app.example.com
-
Renew SSL certificate:
dokku letsencrypt:enable your-app
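If you find yourself repeating these steps across several apps, they are easy to wrap in a small script. This is a sketch under assumptions: it runs on the Dokku host itself with the letsencrypt plugin installed, the helper names `fix_commands` and `run_all` are ours, and it defaults to a dry run because `domains:clear` removes every domain configured on the app.

```python
import subprocess


def fix_commands(app: str, domain: str) -> list[list[str]]:
    """Return the steps above as argv lists, in the order they should run."""
    return [
        ["dokku", "domains:report", app],         # inspect current domains
        ["dokku", "certs:report", app],           # inspect current certificate
        ["dokku", "domains:clear", app],          # drop all configured domains
        ["dokku", "domains:add", app, domain],    # add back the intended one
        ["dokku", "letsencrypt:enable", app],     # request a matching cert
    ]


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$", " ".join(argv))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(fix_commands("your-app", "your-app.example.com"))
```

Passing argv lists rather than a shell string avoids quoting problems with app or domain names, and stopping on the first failure avoids requesting a certificate for a half-fixed domain setup.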
Monitoring Improvements
After this incident, we implemented several crucial monitoring improvements:
-
Key Metrics to Watch
- Disk usage per application
- WAL write rate and size
- Certificate expiration dates
- Domain configuration changes
- Application uptime and last deployment date
-
Alert Configuration
- alert: DiskSpaceRunningOut
  expr: disk_free_percent < 15
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: 'Disk space running low'
- alert: CertificateExpiringSoon
  expr: ssl_cert_expiry_days < 7
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: 'SSL certificate expiring soon'
-
Recovery Runbook
- Immediate actions for disk space issues
- Certificate renewal process
- Domain configuration verification steps
- Application state recovery procedures
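The two alert thresholds above can also be checked locally, which is handy as a runbook step when Prometheus itself is the thing that is down. The sketch below is an illustration using only the Python standard library; the function names are hypothetical, and the `not_after` string is expected in the format that `ssl.SSLSocket.getpeercert()` reports for `notAfter`.

```python
import shutil
import ssl
import time

DISK_FREE_WARN_PERCENT = 15  # mirrors `disk_free_percent < 15`
CERT_EXPIRY_WARN_DAYS = 7    # mirrors `ssl_cert_expiry_days < 7`


def disk_free_percent(path: str = "/") -> float:
    """Percentage of free space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def cert_days_left(not_after: str, now=None) -> float:
    """Days until a certificate's notAfter timestamp (negative if expired)."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def warnings(free_percent: float, days_left: float) -> list[str]:
    """Return the same warnings the two alert rules would raise."""
    found = []
    if free_percent < DISK_FREE_WARN_PERCENT:
        found.append("Disk space running low")
    if days_left < CERT_EXPIRY_WARN_DAYS:
        found.append("SSL certificate expiring soon")
    return found


if __name__ == "__main__":
    # The date below is an arbitrary example, not one of our certificates.
    print(warnings(disk_free_percent("/"),
                   cert_days_left("Jan  5 09:34:43 2018 GMT")))
```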
Challenges and Opportunities
Current Challenges with Dokku
-
Configuration Management
- Manual changes can drift from IaC
- Multiple domain configurations can get messy
- SSL certificate management needs careful attention
- Storage management requires vigilant monitoring
-
Scaling Limitations
- Single-host deployment model
- Limited high availability options
- Resource allocation can be tricky
- Storage constraints can cascade into SSL/domain issues
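Drift between what the playbook declares and what is actually configured on the host is simple to detect once both sides are available as data. Here is a minimal sketch of that comparison. How the two mappings get populated (for example from Ansible variables and from `dokku domains:report`) is left out, and the app name and the direction of the mismatch in the example are assumptions for illustration.

```python
def domain_drift(declared: dict[str, set[str]],
                 actual: dict[str, set[str]]) -> dict[str, dict[str, set[str]]]:
    """Compare declared vs. actual domains per app and report differences."""
    drift = {}
    for app in sorted(declared.keys() | actual.keys()):
        want = declared.get(app, set())
        have = actual.get(app, set())
        if want != have:
            drift[app] = {
                "missing": want - have,      # declared in IaC, absent on host
                "unexpected": have - want,   # present on host, not in IaC
            }
    return drift


if __name__ == "__main__":
    # Hypothetical inputs modeled on our prometheus domain mix-up.
    declared = {"prometheus": {"prometheus.kaido.team"}}
    actual = {"prometheus": {"prometheus-p12.kaido.team"}}
    for app, diff in domain_drift(declared, actual).items():
        print(app, diff)
```

An empty result means the host matches the playbook. Apps that exist only on the host also show up, with all of their domains listed as unexpected, which doubles as a check for forgotten applications.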
Future Opportunities
-
Infrastructure Evolution
- Consider gradual migration to Kubernetes for critical services
- Keep Dokku for simple, single-host applications
- Implement better monitoring for SSL certificate issues
-
Process Improvements
- Strengthen CI/CD to catch configuration issues early
- Implement automated testing for SSL certificate validity
- Document common issues and solutions
The Moral of the Story
While Dokku remains a fantastic tool for simple deployments, this incident highlighted the importance of:
- Regular infrastructure audits
- Consistent use of Infrastructure as Code (it saved our day!)
- Clear documentation of domain and SSL configurations
- Having a solid troubleshooting process
- Regular checks on abandoned applications
- Proper monitoring configuration to prevent cascading failures
Remember: sometimes the simplest issues can teach us the most valuable lessons about our infrastructure. Keep your domains aligned, your certificates valid, and your Ansible playbooks ready - they might just save you from a late-night DevOps nightmare and even resurrect some long-forgotten applications!
