ADR 0006: Prometheus Retention
Status
Accepted
Context
During a production incident on January 1, 2025, we encountered several cascading issues:
- Prometheus instance ran out of disk space due to misconfigured retention parameters
- Server became inaccessible during WAL rebuilding
- Domain/SSL certificate mismatches in Dokku applications were discovered
- Several abandoned applications were found in an inconsistent state
Alternatives Considered
Platform Alternatives
- Replace Dokku with CapRover
  - Pros: Built-in monitoring, web UI, automatic SSL
  - Cons: Migration effort, less mature than Dokku
  - Decision: Rejected due to migration complexity
- Migrate to Kubernetes
  - Pros: Better scaling, robust cert-manager
  - Cons: Significant complexity increase
  - Decision: Deferred for future consideration
Monitoring Alternatives
- VictoriaMetrics + Grafana
  - Pros: Better storage efficiency, simpler operation
  - Cons: Would discard the team's existing familiarity with Prometheus
  - Decision: Rejected to maintain team expertise
- Thanos for HA Prometheus
  - Pros: Solves storage and HA issues
  - Cons: Overkill for current scale
  - Decision: Deferred until scale requires it
Storage Solutions
- External Object Storage
  - Pros: Effectively unlimited capacity, better isolation
  - Cons: Additional cost, network latency
  - Decision: Accepted for future implementation
- Separate Metrics Node
  - Pros: Resource isolation, dedicated storage
  - Cons: Additional infrastructure cost
  - Decision: Rejected due to cost/benefit ratio
Decision
We will implement the following changes:
Prometheus Configuration
- Enforce strict retention policies through Ansible configuration
- Set up disk usage alerts at 70% and 85% thresholds
- Configure WAL directory on a separate volume
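The 70% and 85% disk usage thresholds can be illustrated with a small check. This is a minimal sketch, not the alerting implementation: the function names and the severity labels "warning" and "critical" are assumptions made for illustration, and the path passed to `check_path` is whatever directory holds the Prometheus data.

```python
import shutil
from typing import Optional

# Thresholds taken from the decision: alert at 70% and at 85% disk usage.
# The severity names attached to them are an assumption.
WARNING_THRESHOLD = 0.70
CRITICAL_THRESHOLD = 0.85


def classify_disk_usage(used_bytes: int, total_bytes: int) -> Optional[str]:
    """Return "critical", "warning", or None for a used/total byte pair."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= CRITICAL_THRESHOLD:
        return "critical"
    if ratio >= WARNING_THRESHOLD:
        return "warning"
    return None


def check_path(path: str) -> Optional[str]:
    """Classify the filesystem that contains `path` (hypothetical data dir)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return classify_disk_usage(usage.used, usage.total)
```

Calling `check_path` on the data volume and again on the WAL volume would report each one independently, which is the point of separating them. In practice the same thresholds would more likely be expressed as alert rules deployed through Ansible; the sketch only shows the comparison logic.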
Dokku Domain Management
- Standardize domain naming:
  - {app-name}.kaido.team format
  - Remove legacy -p12 subdomain pattern
  - Manage all domain configurations through Ansible
- Regular automated SSL certificate validation checks
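The naming rule and the certificate check above could look roughly like the following. This is a sketch under stated assumptions: it assumes the legacy -p12 marker appears as a suffix on the app label (for example "blog-p12"), which the ADR does not spell out, and the helper names `standard_domain` and `cert_matches` are hypothetical. The certificate is expected as a dict in the shape returned by Python's `ssl.SSLSocket.getpeercert()`.

```python
import re
import ssl
import time
from typing import Optional

BASE_DOMAIN = "kaido.team"
LEGACY_SUFFIX = "-p12"  # assumption: the legacy marker trails the app label


def standard_domain(app_name: str) -> str:
    """Build {app-name}.kaido.team, dropping a trailing legacy -p12 suffix."""
    label = app_name.strip().lower()
    if label.endswith(LEGACY_SUFFIX):
        label = label[: -len(LEGACY_SUFFIX)]
    # A DNS label: 1-63 chars, alphanumeric at both ends, hyphens inside.
    if not re.fullmatch(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?", label):
        raise ValueError(f"not a valid DNS label: {app_name!r}")
    return f"{label}.{BASE_DOMAIN}"


def cert_matches(cert: dict, domain: str, now: Optional[float] = None) -> bool:
    """Return True if the cert is unexpired and lists `domain` as a DNS name."""
    now = time.time() if now is None else now
    if ssl.cert_time_to_seconds(cert["notAfter"]) <= now:
        return False  # certificate has expired
    _, _, parent = domain.partition(".")
    for kind, value in cert.get("subjectAltName", ()):
        if kind != "DNS":
            continue
        if value == domain:
            return True
        # A wildcard entry covers exactly one left-most label.
        if value.startswith("*.") and value[2:] == parent:
            return True
    return False
```

A scheduled job could call `standard_domain` for each Dokku app, fetch the served certificate, and flag any app for which `cert_matches` returns False.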
Application Lifecycle Management
- Monthly audit of running applications
- Automated reporting of unused/abandoned applications
- Clear decommissioning process for unused apps
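The automated report of unused or abandoned applications could be sketched as below. The ADR does not define "abandoned", so this example assumes an app counts as abandoned when it has not been deployed for 90 days. That cutoff, the `AppRecord` type, and the `find_abandoned` helper are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

STALE_AFTER = timedelta(days=90)  # assumed cutoff, not specified in the ADR


@dataclass(frozen=True)
class AppRecord:
    name: str
    last_deployed: datetime  # timezone-aware timestamp of the last deploy


def find_abandoned(apps: Iterable[AppRecord], now: datetime) -> List[str]:
    """Return names of apps whose last deploy is older than STALE_AFTER, oldest first."""
    stale = [app for app in apps if now - app.last_deployed > STALE_AFTER]
    stale.sort(key=lambda app: app.last_deployed)
    return [app.name for app in stale]


if __name__ == "__main__":
    now = datetime(2025, 2, 1, tzinfo=timezone.utc)
    apps = [
        AppRecord("blog", datetime(2025, 1, 20, tzinfo=timezone.utc)),
        AppRecord("old-demo", datetime(2024, 3, 1, tzinfo=timezone.utc)),
    ]
    print(find_abandoned(apps, now))
```

The monthly audit could start from this list, with each flagged app either claimed by an owner or moved into the decommissioning process.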
Consequences
Positive
- Prevents disk space issues through proactive monitoring
- Reduces risk of SSL certificate mismatches
- Better visibility into application lifecycle
- More consistent domain naming scheme
- Improved recovery process through IaC
Negative
- Additional operational overhead for monitoring
- Need to update existing documentation and scripts
- Migration effort for legacy domain names
- Potential temporary disruption during domain standardization