ADR 0006: Prometheus Retention

Status

Accepted

Context

During a production incident on January 1, 2025, we encountered several cascading issues:

- Prometheus instance ran out of disk space due to misconfigured retention parameters
- Server became inaccessible during WAL rebuilding
- Domain/SSL certificate mismatches in Dokku applications were discovered
- Several abandoned applications were found in an inconsistent state

Alternatives Considered

Platform Alternatives
- Replace Dokku with CapRover
  - Pros: Built-in monitoring, web UI, automatic SSL
  - Cons: Migration effort, less mature than Dokku
  - Decision: Rejected due to migration complexity
- Migrate to Kubernetes
  - Pros: Better scaling, robust cert-manager
  - Cons: Significant complexity increase
  - Decision: Deferred for future consideration

Monitoring Alternatives
- VictoriaMetrics + Grafana
  - Pros: Better storage efficiency, simpler operation
  - Cons: Requires learning a new system; the team's expertise is in Prometheus
  - Decision: Rejected to maintain team expertise
- Thanos for HA Prometheus
  - Pros: Solves storage and HA issues
  - Cons: Overkill for current scale
  - Decision: Deferred until scale requires it

Storage Solutions
- External Object Storage
  - Pros: Virtually unlimited capacity, better isolation
  - Cons: Additional cost, network latency
  - Decision: Accepted for future implementation
- Separate Metrics Node
  - Pros: Resource isolation, dedicated storage
  - Cons: Additional infrastructure cost
  - Decision: Rejected because the benefit does not justify the added cost

Decision

We will implement the following changes:

Prometheus Configuration
- Enforce strict retention policies through Ansible configuration
- Set up disk usage alerts at 70% and 85% thresholds
- Configure WAL directory on a separate volume
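
As a rough illustration of the 70% and 85% disk usage thresholds above, the following Python sketch classifies a volume's usage into alert levels. The function names and the "ok"/"warning"/"critical" labels are assumptions made for this example and are not part of our tooling; the real alerts may well be defined differently, for example as Prometheus alerting rules.

```python
import shutil

# Thresholds taken from the decision above.
WARNING_THRESHOLD = 0.70
CRITICAL_THRESHOLD = 0.85


def classify_usage(used_fraction: float) -> str:
    """Map a used-space fraction (0.0 to 1.0) to a hypothetical alert level."""
    if used_fraction >= CRITICAL_THRESHOLD:
        return "critical"
    if used_fraction >= WARNING_THRESHOLD:
        return "warning"
    return "ok"


def disk_used_fraction(path: str) -> float:
    """Return the used fraction of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total
```

Under these assumptions, `classify_usage(disk_used_fraction(path))` reports the level for whichever path holds the Prometheus data directory.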

Dokku Domain Management
- Standardize domain naming: {app-name}.kaido.team format
- Remove legacy -p12 subdomain pattern
- Manage all domain configurations through Ansible
- Regular automated SSL certificate validation checks
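
The naming convention above lends itself to an automated check. The sketch below classifies a configured domain as standard, legacy, or nonstandard. It assumes the legacy pattern has the form `{app-name}-p12.kaido.team`, which this ADR does not spell out, and the function names are hypothetical.

```python
BASE_DOMAIN = "kaido.team"
# Assumption: legacy domains look like "{app-name}-p12.kaido.team".
LEGACY_SUFFIX = "-p12"


def expected_domain(app_name: str) -> str:
    """Return the standardized domain for an application."""
    return f"{app_name}.{BASE_DOMAIN}"


def classify_domain(app_name: str, domain: str) -> str:
    """Classify a configured domain against the naming convention."""
    # Normalize case, surrounding whitespace, and a trailing root dot.
    normalized = domain.strip().lower().rstrip(".")
    if normalized == expected_domain(app_name):
        return "standard"
    if normalized == f"{app_name}{LEGACY_SUFFIX}.{BASE_DOMAIN}":
        return "legacy"
    return "nonstandard"
```

A periodic job could run `classify_domain` over each application's configured domains and flag anything that is not "standard" for cleanup.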

Application Lifecycle Management
- Monthly audit of running applications
- Automated reporting of unused/abandoned applications
- Clear decommissioning process for unused apps
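
One possible shape for the automated report of unused applications is sketched below. It assumes that "abandoned" means no recorded activity for more than 90 days and that a last-activity timestamp per application is available from some source. Both are assumptions for illustration; this ADR does not define the criterion or the data source.

```python
from datetime import datetime, timedelta
from typing import List, Mapping

# Assumption: an app with no activity for more than 90 days counts as stale.
STALE_AFTER = timedelta(days=90)


def find_stale_apps(
    last_activity: Mapping[str, datetime],
    now: datetime,
    stale_after: timedelta = STALE_AFTER,
) -> List[str]:
    """Return the sorted names of apps whose last activity is too old."""
    return sorted(
        name
        for name, seen in last_activity.items()
        if now - seen > stale_after
    )
```

The monthly audit could then review the returned names and feed confirmed cases into the decommissioning process.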