The symptom was that the service was almost fine. Requests mostly worked. Logins sometimes didn't. Logs had gone strangely quiet. Nothing was throwing a clean error, which is always the worst kind of broken, because a stack trace at least tells you where to stand.
I went round the usual houses for ten minutes, app logs, database, the load balancer, before doing the thing I should have done first.
$ df -h
/dev/sda1 20G 20G 0 100% /var
There it is. /var was completely full. And a full /var does not fail loudly, it fails broadly. Logging silently dropped writes, which is why the logs had gone quiet. Session files couldn't be written, which is why logins sometimes failed. The package cache, the mail spool, half a dozen things that all assume they can write a few bytes, were quietly returning ENOSPC and the application was mostly shrugging it off and carrying on, badly.
The culprit was a debug log left on after a deployment a fortnight earlier, growing at a few hundred megabytes a day until it ate the partition. A truncate on the offending file bought immediate relief, then I turned the debug logging off and added a logrotate stanza so it could never do this again. A disk-usage alert at 85 percent went on afterwards, because the real failure here was that the first I heard of a full disk was a user telling me logins were flaky. The machine knew hours earlier. It just had no way to tell me.