The service was throwing 500s and the logs said nothing useful, which is the first clue that the problem isn't in the code. The application was healthy. The thing underneath it was not.
df -h told the whole story in one line: /var at 100%. A logger someone had left at DEBUG was writing several gigabytes a day, logrotate wasn't configured for it, and the disk had been creeping towards full for a fortnight. Postgres couldn't write its WAL, so writes failed, so the app threw 500s it had no idea how to explain.
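When df only narrows things down to the volume, a sketch like the one below is one way to find what is actually eating the space. It assumes a du that supports -x and -a (GNU coreutils does); the function name largest_under is made up for illustration.

```shell
# List the N largest files and directories under DIR, biggest first.
# Sizes are in KiB so a plain numeric sort is enough.
largest_under() {
    dir=$1
    n=${2:-10}
    # -x: stay on one filesystem, -a: include files, -k: sizes in KiB.
    # Permission errors are discarded so the listing stays readable.
    du -xak "$dir" 2>/dev/null | sort -rn | head -n "$n"
}
```

Run as root with something like `largest_under /var 15`. Directories show up with their totals, so a runaway log tends to appear just below the directories that contain it.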
The immediate fix took thirty seconds: truncate the runaway log, restart the service, breathe. The real fix took longer: a logrotate rule, the log level dropped back to INFO, and a free-space check that shouts at me before this becomes an outage rather than after.
*/5 * * * * df --output=pcent /var | tail -1 | tr -dc '0-9' | awk '$1>85{print "var "$1"\%"}'
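The logrotate rule from the longer fix might look something like the sketch below. The path /var/log/myapp/*.log and the seven-day retention are placeholders made up for illustration, not the values from the incident; the directives themselves are standard logrotate ones. It is wrapped in a small shell function (logrotate_rule, also a made-up name) so the rule can be printed and reviewed before it is written anywhere.

```shell
# Print a logrotate rule for a hypothetical app log directory.
# /var/log/myapp and the retention numbers are illustrative assumptions.
logrotate_rule() {
    cat <<'EOF'
/var/log/myapp/*.log {
    # rotate once a day and keep a week of history
    daily
    rotate 7
    # gzip old logs, but leave the most recent rotation uncompressed
    compress
    delaycompress
    # no error if the log is absent, no rotation if it is empty
    missingok
    notifempty
    # copy then truncate in place, for apps that keep the file open
    copytruncate
}
EOF
}
```

Something like `logrotate_rule | sudo tee /etc/logrotate.d/myapp` would install it, and `logrotate -d /etc/logrotate.d/myapp` does a dry run that reports what would be rotated without touching anything. copytruncate is there on the assumption that the app holds its log file open and can't be told to reopen it; the trade-off is that a few lines written during the copy can be lost.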
The lesson is the same one I keep relearning. When an app misbehaves for no reason it can articulate, look down a layer. The bug is rarely where the error is.