Parsing a config format with nom, and learning to think in combinators

I had a small, irritating config format to parse. Not quite INI, not quite TOML: something a previous tool had grown organically, which I now had to read from Rust. The pragmatic answer would have been a pile of regular expressions and a prayer. Instead I reached for nom, partly because I needed a real parser and partly because I had been meaning to learn it properly for ages.

nom is a parser combinator library. Rather than write one big function that walks bytes and tracks state, you build small parsers and glue them together. A parser for a digit, combined into a parser for a number, combined into a parser for a key-value pair, combined into a parser for the whole file. Each piece is tiny and testable on its own, which is the bit that sold me.
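To make that layering concrete, here is a rough sketch in plain Rust with no nom involved. The names PResult, digit, many1 and number are made up for this example and are not nom's API; the only idea borrowed from nom is that each parser hands back the leftover input alongside its value, so the next parser knows where to start.

```rust
// Hypothetical result type: the leftover input plus the parsed value, or an error.
type PResult<'a, T> = Result<(&'a str, T), String>;

/// Smallest piece: exactly one ASCII digit.
fn digit(input: &str) -> PResult<'_, u32> {
    match input.chars().next().and_then(|c| c.to_digit(10)) {
        // to_digit(10) only accepts '0'..='9', so the match is one byte long.
        Some(d) => Ok((&input[1..], d)),
        None => Err(format!("expected a digit at {:?}", input)),
    }
}

/// Glue: run `p` one or more times and collect the results.
/// `p` must consume input on success, otherwise this would loop forever.
fn many1<'a, T>(
    p: impl Fn(&'a str) -> PResult<'a, T>,
) -> impl Fn(&'a str) -> PResult<'a, Vec<T>> {
    move |mut input: &'a str| {
        let mut items = Vec::new();
        while let Ok((rest, item)) = p(input) {
            items.push(item);
            input = rest;
        }
        if items.is_empty() {
            Err(format!("expected at least one match at {:?}", input))
        } else {
            Ok((input, items))
        }
    }
}

/// Built from the pieces: one or more digits folded into a number.
/// Overflow on very long digit runs is not handled in this sketch.
fn number(input: &str) -> PResult<'_, u32> {
    let (rest, digits) = many1(digit)(input)?;
    Ok((rest, digits.into_iter().fold(0, |acc, d| acc * 10 + d)))
}
```

Each function can be tested on its own: digit knows nothing about numbers, and many1 knows nothing about digits.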

The format had lines like this:

host = db01.internal
port = 5432
# comments start with hash
tags = primary, replica

Here is the shape of recognising a single key-value line. This is nom in its current macro-heavy style, which is what you get in the 4.x releases:

#[macro_use]
extern crate nom;

// With &str input, take_while1! already yields a &str,
// so no UTF-8 conversion step is needed.
named!(key<&str, &str>,
    take_while1!(|c: char| c.is_alphanumeric() || c == '_')
);

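use nom::not_line_ending;

// Helper assumed for this snippet: skip optional spaces and tabs.
named!(ws<&str, &str>,
    take_while!(|c: char| c == ' ' || c == '\t')
);

// key = value, with the value trimmed of surrounding whitespace.
named!(kv<&str, (&str, &str)>,
    do_parse!(
        k: key             >>
        ws                 >>
        char!('=')         >>
        ws                 >>
        v: not_line_ending >>
        (k, v.trim())
    )
);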

The thing that took me a while to internalise: a nom parser does not return your value, it returns the value plus the unconsumed remainder of the input. The type is roughly IResult<Input, Output>, an Ok((remaining, parsed)) or an error. Once that clicked, the combinators stopped feeling like magic. do_parse! threads the remaining input through each step for you, which is why you can read it almost like a sequence of statements.
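If it helps to see that threading spelled out, here is roughly what do_parse! is doing for you, written by hand in plain Rust. This is a sketch and not nom's actual expansion: PResult stands in for IResult, and rest_of_line and ch are hypothetical stand-ins for not_line_ending and char!.

```rust
// Stand-in for nom's IResult: Ok((remaining, parsed)) or an error.
type PResult<'a, T> = Result<(&'a str, T), String>;

/// Consume one or more alphanumeric or '_' characters.
fn key(input: &str) -> PResult<'_, &str> {
    let end = input
        .find(|c: char| !(c.is_alphanumeric() || c == '_'))
        .unwrap_or(input.len());
    if end == 0 {
        return Err(format!("expected a key at {:?}", input));
    }
    Ok((&input[end..], &input[..end]))
}

/// Consume zero or more spaces or tabs.
fn ws(input: &str) -> PResult<'_, ()> {
    Ok((input.trim_start_matches(|c: char| c == ' ' || c == '\t'), ()))
}

/// Consume exactly one expected character.
fn ch(expected: char, input: &str) -> PResult<'_, char> {
    match input.chars().next() {
        Some(c) if c == expected => Ok((&input[c.len_utf8()..], c)),
        _ => Err(format!("expected {:?} at {:?}", expected, input)),
    }
}

/// Consume everything up to, but not including, the next line ending.
fn rest_of_line(input: &str) -> PResult<'_, &str> {
    let end = input
        .find(|c: char| c == '\n' || c == '\r')
        .unwrap_or(input.len());
    Ok((&input[end..], &input[..end]))
}

/// Every step takes the previous step's remainder and returns a new one.
fn kv(input: &str) -> PResult<'_, (&str, &str)> {
    let (rest, k) = key(input)?;
    let (rest, _) = ws(rest)?;
    let (rest, _) = ch('=', rest)?;
    let (rest, _) = ws(rest)?;
    let (rest, v) = rest_of_line(rest)?;
    Ok((rest, (k, v.trim())))
}
```

Calling kv("port = 5432\nhost = db01.internal\n") gives back the pair ("port", "5432") together with a remainder that starts at the first newline, ready for whatever parser runs next.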

The macros are the part people either love or quietly resent. They give you very terse parsers, but the error messages when you get one wrong are genuinely awful. A misplaced >> and the compiler vomits half a page of macro expansion at you with no obvious cause. I lost a good half hour to a trailing comma before I learned to read past the noise. I gather the next major version is moving towards plain functions instead of macros, which I would welcome, because the macro errors are the single biggest barrier to picking this up.

What I ended up with was a parser of about sixty lines that handles comments, blank lines, key-value pairs and comma-separated list values, with a unit test per combinator. It is probably faster than a regex version would have been, but honestly speed was never the point. The point is that each rule is small, named and verifiable, and when the format inevitably grows another wart I can add one combinator rather than untangling a thicket of patterns.
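For a feel of how those rules divide up, here is a std-only approximation. It is not the nom parser described above, just the same rules written as small named functions. The Line type and the list, line and parse functions are hypothetical, and the sketch assumes every value can be treated as a comma-separated list, so a scalar comes back as a one-item list.

```rust
/// One parsed line of the config file (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Line<'a> {
    Blank,
    Comment(&'a str),
    Pair(&'a str, Vec<&'a str>),
}

/// Split a value on commas, trimming each item and dropping empty ones.
fn list(value: &str) -> Vec<&str> {
    value
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}

/// Try each rule in turn: blank line, comment, then key = value.
fn line(input: &str) -> Result<Line<'_>, String> {
    let trimmed = input.trim();
    if trimmed.is_empty() {
        return Ok(Line::Blank);
    }
    if let Some(text) = trimmed.strip_prefix('#') {
        return Ok(Line::Comment(text.trim()));
    }
    let (k, v) = trimmed
        .split_once('=')
        .ok_or_else(|| format!("expected key = value, got {:?}", input))?;
    let k = k.trim();
    if k.is_empty() || !k.chars().all(|c| c.is_alphanumeric() || c == '_') {
        return Err(format!("invalid key {:?}", k));
    }
    Ok(Line::Pair(k, list(v)))
}

/// Parse a whole file, stopping at the first line that fails.
fn parse(input: &str) -> Result<Vec<Line<'_>>, String> {
    input.lines().map(line).collect()
}
```

Adding a new wart to the format would mean one more branch in line and one more test, which is the property I care about.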

Would I use nom for a throwaway script? No. For anything I have to maintain that involves real structure, yes, and I am glad I finally sat down with it.