A service died overnight and would not come back. The process started, ran for a second or two, and exited without anything useful in its logs, which was the first oddity: a clean crash usually leaves a complaint behind, and this one left nothing. Worse than nothing, actually. The logs simply stopped, as though the service had decided partway through writing a line that it could not be bothered to finish the sentence.
That truncation was the tell, though I did not read it correctly for a while. A log that stops mid-line is not usually a log being too quiet. It is a log that could not be written.
$ tail -f /var/log/app/service.log
2021-01-20T03:14:22Z INFO starting up
2021-01-20T03:14:22Z INFO loading config from /etc/app
2021-01-20T03:14:23Z INFO connecting to datab
It cut off mid-word. "datab". The service had tried to log "connecting to database", got most of the way through, and the write had failed. So I did the thing I should have done first, before I read a single log line, and checked the disk.
$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   31G   17G  64% /
Sixty-four percent. Plenty of room. Which is exactly the trap, and I fell into it for a good few minutes, because the root filesystem was fine and I let that reassure me. The mistake was checking /. The thing that was full was not /.
Per-mount, not per-disk
This box, like a lot of our older fleet, had /var on its own separate partition. That is generally a sensible thing to do: it stops runaway logs or a swollen package cache from filling the root filesystem and taking the whole machine down with it. The flip side is that "the disk has space" and "the partition this service writes to has space" are two completely different statements, and df -h / only answers the first one.
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   31G   17G  64% /
/dev/sda2        10G   10G   20K 100% /var
There it is. /var at one hundred percent, twenty kilobytes free, which for practical purposes is zero. Every log write, every temp file, anything under /var was failing with ENOSPC, "no space left on device". The service could not write its log, could not write its pid file, could not write whatever scratch state it needed at startup, and gave up. The root filesystem having seventeen gigabytes spare was completely irrelevant, because nothing the service cared about lived there.
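It is worth knowing that df also accepts a path and reports the filesystem that path lives on, which answers the question without needing to know in advance which mounts exist. A minimal sketch, assuming a df that supports the POSIX -P flag; fs_for_path is a name made up for illustration, and it assumes the mount point contains no spaces:

```shell
# fs_for_path is a hypothetical helper: print the mount point and use% of the
# filesystem that holds the given path.
fs_for_path() {
    # -P forces the POSIX one-line-per-filesystem format, so row 2 is the data.
    # Assumes the mount point itself contains no spaces.
    df -P "$1" | awk 'NR == 2 { print $6, $5 }'
}

# Example: ask about the directory the service writes to, not about /.
# fs_for_path /var/log/app
```

Pointed at the service's log directory on this box, that would have reported /var rather than /, which is the number that mattered.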
Where ten gigabytes went
Finding the culprit is the satisfying bit. Sort the directory tree by size and follow the fat branch down.
$ du -h -d2 /var | sort -rh | head
9.4G /var/log
9.1G /var/log/app
312M /var/cache
...
Almost all of it was one application's logs. A library had been bumped a fortnight earlier and, somewhere in that change, debug logging had been left switched on in production. It had been quietly writing hundreds of megabytes a day ever since. The log rotation that should have kept it in check was configured for the old log filename, so the rotator was diligently rotating a file that no longer received writes whilst the real, renamed log grew without limit. Two small misconfigurations, individually harmless, together filled a ten-gigabyte partition over a couple of weeks, until the night it finally hit the ceiling and took the service with it.
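Following the fat branch down can be scripted if you find yourself doing it often. This is a rough sketch of one way to do it, not a record of what was run during the incident; fattest_child is a hypothetical helper name, and -d is a GNU and BSD du option rather than a POSIX one:

```shell
# fattest_child is a hypothetical helper: print the largest immediate
# subdirectory of a directory, as "<size in KiB><TAB><path>".
fattest_child() {
    # -x stays on one filesystem, so another mount nested below is not counted.
    # -k reports KiB, which sorts numerically without needing sort -h.
    # The awk filter drops the total line for the starting directory itself.
    du -x -k -d1 "$1" 2>/dev/null |
        awk -F '\t' -v top="$1" '$2 != top' |
        sort -rn |
        head -n 1
}

# Example: run it, then run it again on whatever it prints.
# fattest_child /var
# fattest_child /var/log
```

The -x matters for the same reason the rest of this post exists: without it, du happily wanders into other filesystems mounted underneath and blames the wrong partition.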
There was a second casualty I only noticed afterwards. A small local database on the same box had quietly flipped itself read-only when /var filled, because it could not write its write-ahead log, which is exactly the correct and safe thing for it to do and also exactly the sort of thing that turns a single dead service into a confusing multi-service incident.
Fixing it, and not getting bitten again
The emergency fix was the obvious one: free space. I truncated the runaway log rather than deleting it, because deleting a file a process still holds open does not return the space until the process closes the handle, and a quick : > /var/log/app/service.log reclaims it immediately whilst leaving the file in place. The partition dropped to a few percent used, the database came back to read-write on its own, and the service started cleanly.
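The difference between truncating and deleting is easy to see in the inode number: truncation empties the same file, whereas rm followed by a fresh file gives the writer's old handle nothing to do with the new one. A small sketch, with truncate_in_place and inode_of as hypothetical helper names:

```shell
# truncate_in_place is a hypothetical helper: empty a file without unlinking
# it, so a process that still holds it open keeps a valid handle and the
# blocks are released straight away.
truncate_in_place() {
    : > "$1"
}

# inode_of prints a file's inode number. Truncation leaves it unchanged;
# deleting and recreating the file would produce a different one.
inode_of() {
    ls -i "$1" | awk '{ print $1 }'
}

# Example:
# inode_of /var/log/app/service.log
# truncate_in_place /var/log/app/service.log
# inode_of /var/log/app/service.log    # same number as before
```

If a log has already been deleted whilst still open, lsof +L1 (where lsof is installed) lists open files whose link count has dropped to zero, which is usually how the missing space is found. One caveat: a writer that did not open the file in append mode will carry on at its old offset after a truncate, leaving a sparse file whose apparent size looks alarming even though the blocks really were freed.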
Then the real fixes. Debug logging went back off in production, where it should never have been. The rotation config was pointed at the actual log filename and given a sane size cap and retention, and I tested it with a forced rotation rather than trusting it, having just learned what trusting it costs. And, most importantly, we added disk-usage alerting per mount, not per host, with a warning at eighty percent and a page well before a partition can hit one hundred. The whole outage existed because a partition crept from ninety to one hundred percent over days with nobody watching that specific number.
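The per-mount check itself is not complicated. What follows is a minimal sketch of the idea rather than the alerting we actually deployed; classify_mounts is a hypothetical name, the eighty percent warning comes from the paragraph above, and the ninety percent paging level is an assumption chosen for illustration:

```shell
# classify_mounts is a hypothetical helper. It reads `df -P` output on stdin
# and prints one line per mount that is at or above a threshold.
#   $1 = warning percentage (80 in the text above)
#   $2 = paging percentage (90 here is an assumption, not from the incident)
classify_mounts() {
    awk -v warn="$1" -v page="$2" '
        NR == 1 { next }    # skip the df header row
        {
            pct = $5
            sub(/%/, "", pct)    # "100%" -> "100"
            pct += 0             # force a numeric comparison
            if (pct >= page)      print "PAGE", $6, pct "%"
            else if (pct >= warn) print "WARN", $6, pct "%"
        }'
}

# Example: check every mount on the host, not just /.
# df -P | classify_mounts 80 90
```

Fed the two-line df output from earlier, this prints PAGE /var 100% and says nothing about /, which is the whole point: each mount is judged on its own number.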
The lesson is small and I keep relearning it in new outfits: "the disk has space" is not a thing you can know from one df. Separate partitions exist precisely so that one of them can fill without the others noticing, which is a feature right up until it is the reason you cannot tell that one of them has filled. Check the mount the service actually writes to. When a process dies quietly and its log stops mid-line, suspect the disk before you suspect the code. The byte that would not write is usually louder than any error you will find in the logs, because it is the reason there is no error in the logs.
And keep df -h (no path, all mounts) and du -h -d1 somewhere in muscle memory, because between them they answer the only two questions that matter when a disk fills: which partition, and which directory. I have wasted more time than I care to admit reading application logs about a problem the application was in no position to describe, because the application could not write the very log line that would have explained it. The layer below the application knew the whole time. It always does.