Our Alerting System Alerts Us That The Alerting System Is Down

We have an alert for everything. CPU high? Alert. Memory high? Alert. Disk full? Alert. Alertmanager down? You guessed it—alert.

The problem? When Alertmanager is down, who sends the alert?

The Setup

groups:
  - name: meta-alerts
    rules:
      - alert: AlertmanagerDown
        expr: up{job="alertmanager"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'Alertmanager is down'
          description: 'Good luck receiving this alert lol'
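The standard escape from this paradox is to invert it with a dead man's switch: instead of alerting when Alertmanager is down, you run an alert that always fires and route it to an external service, which pages you when the heartbeat stops arriving. A minimal sketch of the rule side (the external receiver is assumed, not shown):

```yaml
groups:
  - name: meta-alerts
    rules:
      # Always-firing heartbeat: vector(1) is constant, so this
      # alert is permanently active. Route it to an external
      # dead man's switch service; if the heartbeat stops,
      # *that* service does the alerting.
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: 'Alerting pipeline heartbeat (should always be firing)'
```

The absence of this alert is the signal, so the one component that must not fail is outside your infrastructure entirely.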

The Paradox

┌─────────────────────────────────────────────┐
│  Alertmanager is DOWN                       │
│                                             │
│  Status: Unable to send alert               │
│  Reason: Alertmanager is down               │
│  Irony level: Maximum                       │
└─────────────────────────────────────────────┘

Our Solution

We now have:

  • Primary Alertmanager
  • Secondary Alertmanager that monitors primary
  • Tertiary Alertmanager that monitors secondary
  • A cron job that emails us if all three are down
  • A Post-it note on the monitor that says “check alerts”
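
The cron-job layer of the stack above might look something like this. Everything here is a sketch: the hostnames, the on-call address, and the assumption that a local MTA provides a `mail` command are all placeholders, not our real setup. The `/-/healthy` endpoint is Alertmanager's built-in health check.

```shell
#!/bin/sh
# Hypothetical fallback checker; run from cron, e.g.:
#   */5 * * * * /opt/monitoring/check-alertmanagers.sh
# Hostnames below are placeholders for the three instances.
ALERTMANAGERS="http://am-primary:9093 http://am-secondary:9093 http://am-tertiary:9093"

down=0
for url in $ALERTMANAGERS; do
  # -f: treat HTTP errors as failure; --max-time caps hangs.
  if curl -sf --max-time 5 "$url/-/healthy" > /dev/null 2>&1; then
    echo "OK   $url"
  else
    echo "DOWN $url"
    down=$((down + 1))
  fi
done

# Only page when every instance is unreachable; anything less
# is (in theory) caught by the surviving Alertmanagers.
if [ "$down" -eq 3 ]; then
  if command -v mail >/dev/null 2>&1; then
    echo "All Alertmanagers down as of $(date)" \
      | mail -s "ALL ALERTMANAGERS DOWN" oncall@example.com
  else
    echo "ALERT: all Alertmanagers down (no MTA available to send mail)"
  fi
fi
```

If the cron host itself dies, you are down to the Post-it note.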

Current Alert Count

Severity    Count    Action Taken
Critical        3    Acknowledged
Warning        47    Filtered to Slack
Info        2,841    Spiritual damage

The system is working as designed. We just designed it wrong.