Our Alerting System Alerts Us That The Alerting System Is Down
We have an alert for everything. CPU high? Alert. Memory high? Alert. Disk full? Alert. Alertmanager down? You guessed it—alert.
The problem? When Alertmanager is down, who sends the alert?
The Setup
```yaml
groups:
  - name: meta-alerts
    rules:
      - alert: AlertmanagerDown
        expr: up{job="alertmanager"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'Alertmanager is down'
          description: 'Good luck receiving this alert lol'
```
The Paradox
```
┌─────────────────────────────────────────────┐
│  Alertmanager is DOWN                       │
│                                             │
│  Status: Unable to send alert               │
│  Reason: Alertmanager is down               │
│  Irony level: Maximum                       │
└─────────────────────────────────────────────┘
```
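The usual escape from this paradox is to invert the rule: instead of alerting when Alertmanager is down (a message that can never be delivered), you emit an alert that fires *constantly*, and an external heartbeat service pages you when the alert stops arriving. This "dead man's switch" pattern wasn't part of our original setup; a sketch of what the rule might look like:

```yaml
groups:
  - name: meta-alerts
    rules:
      # Watchdog fires at all times. An external dead man's switch
      # service pages you when this heartbeat *stops* arriving, which
      # catches a dead Alertmanager without Alertmanager's help.
      - alert: Watchdog
        expr: vector(1)   # always true, so the alert is always firing
        labels:
          severity: none
        annotations:
          summary: 'Heartbeat: if this stops arriving, the alerting pipeline is broken'
```

Because delivery failure, not rule evaluation, is the signal, this covers the whole path from Prometheus through Alertmanager to the receiver.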
Our Solution
We now have:
- Primary Alertmanager
- Secondary Alertmanager that monitors primary
- Tertiary Alertmanager that monitors secondary
- A cron job that emails us if all three are down
- A Post-it note on the monitor that says “check alerts”
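The cron-job fallback above can be sketched roughly like this. Hostnames, ports, and email addresses are placeholders, and it assumes Alertmanager's standard `/-/healthy` endpoint and a local SMTP relay:

```python
# Last-resort watchdog: probe every Alertmanager, and if all are
# unreachable, send a plain email with no Alertmanager in the loop.
import smtplib
import urllib.error
import urllib.request

# Hypothetical hosts; substitute your own.
ALERTMANAGERS = [
    "http://alertmanager-primary:9093",
    "http://alertmanager-secondary:9093",
    "http://alertmanager-tertiary:9093",
]


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Probe Alertmanager's /-/healthy endpoint; False on any failure."""
    try:
        with urllib.request.urlopen(f"{base_url}/-/healthy", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def all_down(urls) -> bool:
    """True only when no Alertmanager responds as healthy."""
    return not any(is_healthy(u) for u in urls)


if __name__ == "__main__":
    if all_down(ALERTMANAGERS):
        # Raw SMTP, deliberately bypassing the whole alerting stack.
        msg = "Subject: ALL ALERTMANAGERS DOWN\n\nCheck the Post-it note."
        with smtplib.SMTP("localhost") as smtp:
            smtp.sendmail("cron@example.com", "oncall@example.com", msg)
```

Run it from cron every minute or two; the important property is that it shares nothing with the stack it watches.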
Current Alert Count
| Severity | Count | Action Taken |
|---|---|---|
| Critical | 3 | Acknowledged |
| Warning | 47 | Filtered to Slack |
| Info | 2,841 | Spiritual damage |
The system is working as designed. We just designed it wrong.