parsing a log format with nom, and learning to think in combinators

I had a pile of log lines in a format nobody documented, produced by a daemon that has long since outlived the person who wrote it. The lines were almost regular but not quite, with optional fields and a timestamp that changed shape depending on the log level. My first instinct was a regex, and my first regex was correct for about ninety per cent of the lines, which is the worst possible result because it looks like it works.

So I reached for nom instead. If you've not used it, nom is a parser combinator library: you write small functions that each parse one little thing, and you glue them together into bigger parsers. The appeal is that each piece is testable on its own, and the failures tell you where in the input they gave up rather than just returning None and shrugging.

The mental shift took me an evening. With a regex you describe the whole shape at once and hope. With combinators you describe it in layers. Parse a timestamp. Parse a level. Parse the rest. Then a line is just those three, in sequence.
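To make the layering concrete before any library gets involved, here is a minimal sketch in plain Rust. It is not nom and not the real log format: the `word` and `spaces` helpers, the borrowed `Entry` fields, and the assumption that a timestamp is a single space-free token are all made up for illustration. The only idea it carries is that a parser is a function returning the remaining input plus a value, and a line is three of those in sequence.

```rust
// Library-free sketch. A parser takes input and returns
// (remaining input, parsed value), or None if it does not match.
// All names and the "one token per field" rule are assumptions.

#[derive(Debug, PartialEq)]
struct Entry<'a> {
    ts: &'a str,
    level: &'a str,
    message: &'a str,
}

// Take one non-empty run of non-space characters.
fn word(input: &str) -> Option<(&str, &str)> {
    let end = input.find(' ').unwrap_or(input.len());
    if end == 0 {
        return None;
    }
    Some((&input[end..], &input[..end]))
}

// Skip one or more spaces; fail if there are none.
fn spaces(input: &str) -> Option<(&str, ())> {
    let trimmed = input.trim_start_matches(' ');
    if trimmed.len() == input.len() {
        None
    } else {
        Some((trimmed, ()))
    }
}

// A line is three layers in sequence: timestamp, level, the rest.
fn entry(input: &str) -> Option<Entry<'_>> {
    let (input, ts) = word(input)?;
    let (input, _) = spaces(input)?;
    let (input, level) = word(input)?;
    let (input, _) = spaces(input)?;
    Some(Entry { ts, level, message: input })
}
```

Each of `word`, `spaces` and `entry` can be called and tested on its own, which is most of the appeal.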

Here's roughly the shape of it, using nom 4's macros:

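// nom 4 exposes its combinators as macros.
#[macro_use]
extern crate nom;

use nom::{rest, space};

// Level, Entry and the timestamp parser are defined elsewhere.

// Map each literal tag to its Level variant; the first matching branch wins.
named!(level<&str, Level>,
    alt!(
        tag!("INFO")  => { |_| Level::Info }  |
        tag!("WARN")  => { |_| Level::Warn }  |
        tag!("ERROR") => { |_| Level::Error }
    )
);

// A whole line: timestamp, level, then everything left over as the message.
named!(entry<&str, Entry>,
    do_parse!(
        ts:    timestamp >>
        space >>
        lvl:   level     >>
        space >>
        msg:   rest      >>
        (Entry { ts, level: lvl, message: msg.to_string() })
    )
);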

alt! tries each branch in order until one matches; do_parse! runs its parsers in sequence and binds the results you care about to names.
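If the macros feel opaque, it may help to see the "try each branch" idea written out by hand. The following is a sketch in plain Rust, not what nom's alt! expands to; the `tag` helper here is a hypothetical stand-in that only borrows the name from nom.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Level {
    Info,
    Warn,
    Error,
}

// Match a literal prefix and return the input that follows it.
// Hypothetical helper; it shares a name with nom's tag! but nothing else.
fn tag<'a>(literal: &str, input: &'a str) -> Option<&'a str> {
    input.strip_prefix(literal)
}

// Try each branch in order; the first one that matches wins.
fn level(input: &str) -> Option<(&str, Level)> {
    let branches = [
        ("INFO", Level::Info),
        ("WARN", Level::Warn),
        ("ERROR", Level::Error),
    ];
    branches
        .iter()
        .find_map(|&(literal, lvl)| tag(literal, input).map(|rest| (rest, lvl)))
}
```

Because the branches are tried in order, a literal that is a prefix of another one would need to come after the longer one. That doesn't arise with these three names, but it is worth keeping in mind when adding levels.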

The optional fields, the ones that had quietly broken my regex, were the real test. One of the log levels carried an extra request ID; the others didn't. With nom that became an opt! wrapped around a small sub-parser, and the result was an Option I could match on later. No lookahead gymnastics, no branch in the regex that subtly changed how a later group matched. The line either had the ID or it didn't, and the type told me which.
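The "optional" wrapper is small enough to sketch by hand. This is plain Rust rather than nom's opt!, and the `req=<id>` field shape is an assumption invented for the example, not the daemon's real format. The point is only that a parser which may fail becomes one that always succeeds, returns an Option, and consumes nothing when the field is absent.

```rust
// Parse a hypothetical "req=<id>" field. The field name and shape are
// assumptions for illustration only.
fn request_id(input: &str) -> Option<(&str, &str)> {
    let rest = input.strip_prefix("req=")?;
    let end = rest.find(' ').unwrap_or(rest.len());
    if end == 0 {
        return None;
    }
    Some((&rest[end..], &rest[..end]))
}

// Wrap a parser so that a non-match becomes None instead of a failure.
// On a non-match the input is handed back untouched.
fn opt<'a, T>(
    parser: impl Fn(&'a str) -> Option<(&'a str, T)>,
    input: &'a str,
) -> (&'a str, Option<T>) {
    match parser(input) {
        Some((rest, value)) => (rest, Some(value)),
        None => (input, None),
    }
}
```

Calling `opt(request_id, line)` on a line without the field returns the line unchanged alongside `None`, so whatever parser comes next sees exactly the same input either way.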

Two things sold me. First, when a line failed, nom told me it failed parsing the timestamp at byte such-and-such, which meant I found a third timestamp format I didn't know existed instead of silently mangling it. The regex would have matched the wrong field and handed me a plausible, wrong answer, which is the failure mode that costs you a day six weeks later. Second, the parser reads top to bottom like the grammar it represents. Six months from now I'll be able to read it, which is more than I can say for the regex I nearly committed.
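The "where did it give up" part can be sketched without nom too. The `ParseError` type below is hypothetical and is not nom's error type, and the eight-digit timestamp rule is invented for the example. It shows one way to turn "the input that was left when a stage failed" into a byte offset in the original line.

```rust
#[derive(Debug, PartialEq)]
struct ParseError {
    stage: &'static str, // which sub-parser gave up
    offset: usize,       // byte offset into the original line
}

// The offset is how much of the line had been consumed before the failure.
fn fail(stage: &'static str, line: &str, rest: &str) -> ParseError {
    ParseError { stage, offset: line.len() - rest.len() }
}

// Assumed timestamp rule for this sketch: exactly eight ASCII digits.
fn timestamp(input: &str) -> Option<(&str, &str)> {
    let digits = input.bytes().take_while(|b| b.is_ascii_digit()).count();
    if digits == 8 {
        Some((&input[8..], &input[..8]))
    } else {
        None
    }
}

// Accept one of the three known level names.
fn level(input: &str) -> Option<(&str, &str)> {
    ["INFO", "WARN", "ERROR"]
        .iter()
        .find_map(|&name| input.strip_prefix(name).map(|rest| (rest, name)))
}

// Returns (timestamp, level, message), or the stage and offset of the failure.
fn parse_line(line: &str) -> Result<(&str, &str, &str), ParseError> {
    let (rest, ts) = timestamp(line).ok_or_else(|| fail("timestamp", line, line))?;
    let rest = rest
        .strip_prefix(' ')
        .ok_or_else(|| fail("separator", line, rest))?;
    let (rest, lvl) = level(rest).ok_or_else(|| fail("level", line, rest))?;
    Ok((ts, lvl, rest.trim_start()))
}
```

A line with an unexpected timestamp shape comes back as an error naming the timestamp stage, rather than as a plausible entry with the wrong fields.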

It is more code than a regex, no question. But it's code I can test, code that fails loudly, and code I can extend without holding the entire pattern in my head at once. For a throwaway script I'd still reach for a regex. For anything that has to keep working, nom won me over.