The pages came in looking unrelated. Postgres wouldn't accept connections, the mail relay was bouncing, and cron jobs were silently not running. Three different services, three different teams reaching for their own runbooks. The common thread was the host, and the host had a full /var.
That's the thing about a full /var: it doesn't fail loudly in one place. It fails quietly everywhere that needs to write a socket, a lock, a pid file, a spool entry, or a log line. Postgres couldn't create its stats temp file. The mailer couldn't queue. Cron couldn't write its own logs and gave up. Each service's error pointed inward at itself rather than at the shared disk underneath them all.
    $ df -h /var
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda3        20G   20G   20K 100% /var
The culprit was, predictably, logs. A misconfigured application had been writing a stack trace per request at several thousand requests a minute, and nobody had logrotate pointed at its directory. du -xhd1 /var | sort -h found it in seconds once we were looking at the right thing.
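For anyone who wants that search as a reusable helper, here is a minimal sketch. It is not what we ran that night: largest_children is a made-up name, and it swaps in du -xk with sort -n so it doesn't depend on a sort that understands human-readable sizes.

```shell
# Hypothetical helper: show the N largest entries directly under a directory.
#   -x   stay on one filesystem (don't wander into other mounts)
#   -k   report sizes in KiB so a plain numeric sort works
#   -d1  one level deep only
# du also prints the directory's own total, which sorts last.
largest_children() {
    dir=${1:-/var}
    count=${2:-5}
    du -xk -d1 "$dir" 2>/dev/null | sort -n | tail -n "$count"
}
```

Run largest_children /var, then run it again on the biggest child (the line just above the total) until you reach the file or directory that is actually eating the disk.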
The fix was trivial: truncate the offending log (truncate rather than delete, because a file the application still holds open keeps its blocks until the process closes it), get logrotate watching it, and add a disk-space alert that fires at 85 percent rather than waiting for the services to fall over at 100. The lesson is the diagnostic order. When several unrelated services on one box go strange at once, don't debug the services. Check the boring shared resources first: disk, inodes, memory, file descriptors. That means df, df -i, free, and on Linux a look at /proc/sys/fs/file-nr. It's almost never a coincidence, and it's usually /var.
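That checklist is short enough to script. What follows is a sketch under assumptions, not our actual tooling: over_threshold and check_shared_resources are hypothetical names, the 85 percent default mirrors the alert level above, and it assumes a Linux box with procps free and /proc mounted. Mount points containing spaces would confuse the awk field split.

```shell
# Hypothetical filter: read `df -P` (or `df -Pi`) output on stdin and print
# the mount point and usage of every filesystem at or above the limit.
over_threshold() {
    limit=${1:-85}
    awk -v limit="$limit" '
        NR > 1 {                       # skip the header line
            use = $5
            sub(/%/, "", use)          # "100%" -> "100"; "-" counts as 0
            if (use + 0 >= limit) print $6, $5
        }'
}

# Hypothetical wrapper: the boring shared resources, in order.
check_shared_resources() {
    echo "== disk space at or above 85% =="
    df -P | over_threshold 85
    echo "== inodes at or above 85% =="
    df -Pi | over_threshold 85
    echo "== memory (MiB) =="
    free -m
    echo "== file handles: allocated, unused, max =="
    cat /proc/sys/fs/file-nr
}
```

Calling check_shared_resources on the sick host takes a second and either points at the shared cause or rules it out before anyone opens a service runbook. The over_threshold filter on its own is also a reasonable starting point for the alert, wired into whatever monitoring you already have.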