
ADR 0006: Prometheus Retention

Status

Accepted

Context

During a production incident on January 1, 2025, we encountered several cascading issues:

  1. The Prometheus instance ran out of disk space because its retention parameters were misconfigured
  2. The server became inaccessible while the Prometheus write-ahead log (WAL) was being rebuilt
  3. Domain/SSL certificate mismatches were discovered in Dokku applications
  4. Several abandoned applications were found in an inconsistent state

Alternatives Considered

  1. Platform Alternatives

    • Replace Dokku with CapRover
      • Pros: Built-in monitoring, web UI, automatic SSL
      • Cons: Migration effort, less mature than Dokku
      • Decision: Rejected due to migration complexity
    • Migrate to Kubernetes
      • Pros: Better scaling, robust cert-manager
      • Cons: Significant complexity increase
      • Decision: Deferred for future consideration
  2. Monitoring Alternatives

    • VictoriaMetrics + Grafana
      • Pros: Better storage efficiency, simpler operation
      • Cons: Requires learning a new stack; the team is already familiar with Prometheus
      • Decision: Rejected to retain existing Prometheus expertise
    • Thanos for HA Prometheus
      • Pros: Solves storage and HA issues
      • Cons: Overkill for current scale
      • Decision: Deferred until scale requires it
  3. Storage Solutions

    • External Object Storage
      • Pros: Effectively unlimited capacity, better isolation
      • Cons: Additional cost, network latency
      • Decision: Accepted for future implementation
    • Separate Metrics Node
      • Pros: Resource isolation, dedicated storage
      • Cons: Additional infrastructure cost
      • Decision: Rejected due to cost/benefit ratio

Decision

We will implement the following changes:

  1. Prometheus Configuration

    • Enforce strict retention policies through Ansible configuration
    • Set up disk usage alerts at 70% and 85% thresholds
    • Configure WAL directory on a separate volume
  2. Dokku Domain Management

    • Standardize domain naming on the {app-name}.kaido.team format
    • Remove legacy -p12 subdomain pattern
    • Manage all domain configurations through Ansible
    • Run regular automated SSL certificate validation checks
  3. Application Lifecycle Management

    • Monthly audit of running applications
    • Automated reporting of unused/abandoned applications
    • Clear decommissioning process for unused apps
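
The two disk usage thresholds in the Prometheus Configuration item can be illustrated with a minimal sketch. It assumes the 70% threshold maps to a "warning" level and the 85% threshold to a "critical" level; the ADR does not name the severities, and the function names and the path passed to `check_path` are hypothetical.

```python
import shutil

# Thresholds taken from the decision above; the severity names are assumptions.
WARNING_THRESHOLD = 0.70
CRITICAL_THRESHOLD = 0.85


def classify_disk_usage(used_fraction: float) -> str:
    """Map a used-space fraction (0.0 to 1.0) to an alert level."""
    if not 0.0 <= used_fraction <= 1.0:
        raise ValueError(f"used_fraction out of range: {used_fraction}")
    if used_fraction >= CRITICAL_THRESHOLD:
        return "critical"
    if used_fraction >= WARNING_THRESHOLD:
        return "warning"
    return "ok"


def check_path(path: str) -> tuple[float, str]:
    """Measure the filesystem that holds `path` and classify its usage."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    return used_fraction, classify_disk_usage(used_fraction)


if __name__ == "__main__":
    # "/" is a placeholder; point this at the Prometheus data volume.
    fraction, level = check_path("/")
    print(f"{fraction:.1%} used: {level}")
```

In a Prometheus deployment the same thresholds would more typically be expressed as alerting rules over node_exporter filesystem metrics. The script only illustrates the threshold logic, and a standalone check like this keeps working if Prometheus itself is down.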

Consequences

Positive

  • Reduces the risk of running out of disk space through proactive monitoring
  • Reduces risk of SSL certificate mismatches
  • Better visibility into application lifecycle
  • More consistent domain naming scheme
  • Improved recovery process through infrastructure as code (IaC)

Negative

  • Additional operational overhead for monitoring
  • Need to update existing documentation and scripts
  • Migration effort for legacy domain names
  • Potential temporary disruption during domain standardization