parsing a config format with nom, and liking it

I needed to parse a small config format this week, the sort of thing you'd normally handle with a regex or a hand-rolled line splitter. I'd been meaning to learn nom properly for ages, so I used it as the excuse. The verdict: once it clicks, it's genuinely pleasant, and the thing it teaches you about composing parsers is worth more than this one job.

nom is a parser combinator library. You don't write a big stateful parser; you write tiny parsers that each consume a bit of input, and then you combine them. A parser in nom is a function that takes some input and returns the remaining input plus whatever it parsed, or an error. For string input the return type is IResult<&str, Output>, where the &str is the remaining input, and almost everything is built on that one shape.

the format

Nothing exotic. Lines of key = value, comments starting with #, blank lines ignored. The kind of thing where a regex works until the value contains an = and then it doesn't.

building it up

Start with the smallest piece. A key is one or more alphanumeric-or-underscore characters:

use nom::{
    bytes::complete::{tag, take_while1},
    character::complete::{space0, not_line_ending},
    sequence::separated_pair,
    IResult,
};

// One or more alphanumeric characters or underscores; fails on empty input.
fn key(input: &str) -> IResult<&str, &str> {
    take_while1(|c: char| c.is_alphanumeric() || c == '_')(input)
}

The value is everything up to the end of the line, which not_line_ending gives you for free. Then the whole key = value line is just two parsers joined by a separator, and nom has separated_pair for exactly that:

fn entry(input: &str) -> IResult<&str, (&str, &str)> {
    separated_pair(
        key,
        // the separator: optional space, an equals, optional space
        nom::sequence::tuple((space0, tag("="), space0)),
        not_line_ending,
    )(input)
}

And that's the bit that made me grin. The =-in-the-value problem that breaks the regex approach simply doesn't exist here, because key stops at the first non-key character and not_line_ending swallows the rest verbatim. The parser is structurally incapable of getting confused by it.

what took me a while

Two things tripped me up, both worth flagging if you're starting out.

The first is complete versus streaming. nom ships two versions of most of its basic parsers, in modules named complete and streaming. The streaming ones return Err::Incomplete if they run out of input before they can decide, because they assume more might be coming over a socket. For parsing a string you already hold in memory, you want the complete variants, every time. I lost half an hour to a baffling Incomplete error before that penny dropped.

The second is that the errors are unfriendly. nom's default error type records only the input where parsing stopped and an ErrorKind, so what you get back is closer to a debug dump than a message, and the ergonomics of turning it into something a human wants to read are a bit of a faff. For this job I didn't care, but for anything user-facing I'd look at nom_locate for line and column positions, or a custom error type, before going much further.

was it worth it over a hand-rolled split

For this format, on its own, honestly no. A split_once('=') with a bit of care would have done. But that's not really the point. The format will grow, it always does, and a combinator parser grows by adding small functions and gluing them on, whereas the hand-rolled version grows into a thicket of indices and edge cases. I came out the other side able to read nom code without flinching, which was the actual goal. Next time I won't need an excuse.