Our Alerting System Alerts Us That The Alerting System Is Down
We have an alert for everything. CPU high? Alert. Memory high? Alert. Disk full? Alert. Alertmanager down? You guessed it—alert.
The problem? When Alertmanager is down, who sends the alert?
The Setup
```yaml
groups:
  - name: meta-alerts
    rules:
      - alert: AlertmanagerDown
        expr: up{job="alertmanager"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: 'Alertmanager is down'
          description: 'Good luck receiving this alert lol'
```
The Paradox
```
┌─────────────────────────────────────────────┐
│  Alertmanager is DOWN                       │
│                                             │
│  Status: Unable to send alert               │
│  Reason: Alertmanager is down               │
│  Irony level: Maximum                       │
└─────────────────────────────────────────────┘
```
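The usual escape from this paradox is to invert the rule: instead of alerting when Alertmanager is down (a message that can never be delivered), you emit an alert that fires *constantly*, and an external heartbeat service pages you when the alert stops arriving. This "dead man's switch" pattern wasn't part of our original setup; a sketch of what the rule might look like:

```yaml
groups:
  - name: meta-alerts
    rules:
      # Watchdog fires at all times. An external dead man's switch
      # service pages you when this heartbeat *stops* arriving, which
      # catches a dead Alertmanager without Alertmanager's help.
      - alert: Watchdog
        expr: vector(1)   # always true, so the alert is always firing
        labels:
          severity: none
        annotations:
          summary: 'Heartbeat: if this stops arriving, the alerting pipeline is broken'
```

Because delivery failure, not rule evaluation, is the signal, this covers the whole path from Prometheus through Alertmanager to the receiver.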
Our Solution
We now have:
- Primary Alertmanager
- Secondary Alertmanager that monitors primary
- Tertiary Alertmanager that monitors secondary
- A cron job that emails us if all three are down
- A Post-it note on the monitor that says “check alerts”
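The cron-job fallback above can be sketched roughly like this. Hostnames, ports, and email addresses are placeholders, and it assumes Alertmanager's standard `/-/healthy` endpoint and a local SMTP relay:

```python
# Last-resort watchdog: probe every Alertmanager, and if all are
# unreachable, send a plain email with no Alertmanager in the loop.
import smtplib
import urllib.error
import urllib.request

# Hypothetical hosts; substitute your own.
ALERTMANAGERS = [
    "http://alertmanager-primary:9093",
    "http://alertmanager-secondary:9093",
    "http://alertmanager-tertiary:9093",
]


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Probe Alertmanager's /-/healthy endpoint; False on any failure."""
    try:
        with urllib.request.urlopen(f"{base_url}/-/healthy", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def all_down(urls) -> bool:
    """True only when no Alertmanager responds as healthy."""
    return not any(is_healthy(u) for u in urls)


if __name__ == "__main__":
    if all_down(ALERTMANAGERS):
        # Raw SMTP, deliberately bypassing the whole alerting stack.
        msg = "Subject: ALL ALERTMANAGERS DOWN\n\nCheck the Post-it note."
        with smtplib.SMTP("localhost") as smtp:
            smtp.sendmail("cron@example.com", "oncall@example.com", msg)
```

Run it from cron every minute or two; the important property is that it shares nothing with the stack it watches.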
Current Alert Count
| Severity | Count | Action Taken |
|---|---|---|
| Critical | 3 | Acknowledged |
| Warning | 47 | Filtered to Slack |
| Info | 2,841 | Spiritual damage |
The system is working as designed. We just designed it wrong.