parsing a config format with nom, one combinator at a time

I had a small, dull problem: a homegrown config format, somewhere between INI and a shrug, that a tool of mine had grown organically over a couple of years. It started as "split on equals" and had since accreted comments, quoted values, sections, and the occasional line continuation. The parsing was a tangle of split, trim, and a regex I no longer understood. Every new edge case made it worse. I decided to do it properly and reach for nom, the parser combinator library, partly to solve the problem and partly because I'd been meaning to actually learn nom rather than admire it from a distance.

The headline, before I bury it: this went far better than the regex ever did, and I'd reach for nom again without hesitating.

what a combinator actually is

The idea behind nom, and parser combinators generally, is that a parser is just a function. It takes some input and either returns the bit it consumed plus the rest, or an error. Crucially, parsers compose. You write tiny parsers that recognise tiny things, then glue them together with combinators into bigger parsers, and the gluing is itself just more functions. There's no separate grammar file, no code generation step, no .y file staring at you. It's all ordinary Rust, which means the compiler and the borrow checker are in the loop the whole way.
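To make "a parser is just a function" concrete, here is a minimal hand-rolled sketch with no nom dependency. The names `ParseResult`, `letters`, `equals` and `letters_then_equals` are hypothetical and exist only for this illustration; nom's real types carry richer errors than a `String`.

```rust
// Hand-rolled sketch, not nom: a parser is a function from input to
// either (remaining input, output) or an error.
type ParseResult<'a, T> = Result<(&'a str, T), String>;

/// Consumes one or more ASCII letters; returns (rest, consumed).
fn letters(input: &str) -> ParseResult<'_, &str> {
    // Index of the first character that is not a letter, or end of input.
    let end = input
        .find(|c: char| !c.is_ascii_alphabetic())
        .unwrap_or(input.len());
    if end == 0 {
        return Err(format!("expected a letter at {:?}", input));
    }
    Ok((&input[end..], &input[..end]))
}

/// Matches a single literal '='.
fn equals(input: &str) -> ParseResult<'_, char> {
    match input.strip_prefix('=') {
        Some(rest) => Ok((rest, '=')),
        None => Err(format!("expected '=' at {:?}", input)),
    }
}

/// Composition: run one parser, then hand whatever is left to the next.
fn letters_then_equals(input: &str) -> ParseResult<'_, &str> {
    let (rest, word) = letters(input)?;
    let (rest, _) = equals(rest)?;
    Ok((rest, word))
}
```

Calling `letters_then_equals("port=8080")` returns `Ok(("8080", "port"))`: the word that was recognised, plus the input still to be parsed. Everything nom adds on top is a library of ready-made small parsers and the glue to combine them.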

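In nom 7, whose function-call style these snippets use, the convention is a parser that takes &str and returns IResult<&str, Output>: on success, a tuple of the remaining input and the parsed output, in that order. (nom 8 keeps that signature for parsers you write yourself, but combinators are invoked with .parse(input) rather than called directly.) The IResult is the part that frightened me off for ages, because the type signatures look like someone sat on the keyboard. Here's the smallest real one I wrote, a parser for a comment line: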

use nom::{
    bytes::complete::{tag, take_till},
    character::complete::line_ending,
    sequence::{delimited, preceded},
    IResult,
};

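// Matches '#', then everything up to (but not including) the newline.
// The output is the comment text after the '#'.
fn comment(input: &str) -> IResult<&str, &str> {
    preceded(tag("#"), take_till(|c| c == '\n'))(input)
}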

That reads, once you've stared at it for a minute, exactly like the English: a comment is a # followed by everything up to the newline, and we throw away the # and keep the rest. preceded is the combinator that runs two parsers and discards the output of the first. tag matches a literal. take_till consumes characters until a predicate is true, stopping before the matching character, so the newline itself stays in the remaining input. None of these is clever on its own, and that's the appeal.
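To show that there is no magic in the glue, here is a hedged sketch of what hand-written versions of those three pieces could look like. These are stand-ins written for this post's explanation, not nom's actual implementation: the real combinators are generic over input and error types, while these work only on &str and report errors as a `String`. `PResult` is a hypothetical alias.

```rust
// Hand-rolled stand-ins that only mimic the behaviour described above.
type PResult<'a, T> = Result<(&'a str, T), String>;

/// Returns a parser that matches `literal` at the start of the input.
fn tag<'a>(literal: &'static str) -> impl Fn(&'a str) -> PResult<'a, &'a str> {
    move |input: &'a str| match input.strip_prefix(literal) {
        Some(rest) => Ok((rest, &input[..literal.len()])),
        None => Err(format!("expected {:?}", literal)),
    }
}

/// Returns a parser that consumes characters until `stop` is true.
/// The character that stops it is left in the remaining input.
fn take_till<'a>(stop: impl Fn(char) -> bool) -> impl Fn(&'a str) -> PResult<'a, &'a str> {
    move |input: &'a str| {
        let end = input.find(|c: char| stop(c)).unwrap_or(input.len());
        Ok((&input[end..], &input[..end]))
    }
}

/// Runs `first`, throws its output away, then runs `second` on the rest.
fn preceded<'a, A, B>(
    first: impl Fn(&'a str) -> PResult<'a, A>,
    second: impl Fn(&'a str) -> PResult<'a, B>,
) -> impl Fn(&'a str) -> PResult<'a, B> {
    move |input: &'a str| {
        let (rest, _discarded) = first(input)?;
        second(rest)
    }
}

/// Same shape as the nom version: '#' followed by the rest of the line.
fn comment(input: &str) -> PResult<'_, &str> {
    preceded(tag("#"), take_till(|c| c == '\n'))(input)
}
```

With these definitions, `comment("# hello\nkey = 1")` returns `Ok(("\nkey = 1", " hello"))`: the text after the # as output, and the newline onwards as remaining input.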

the bit that clicked

For a while I was fighting it, because I kept trying to write one big parser that did everything, and the types got away from me. The thing that made it click was accepting that I should write the smallest possible parser for each concept and name it, then build up. A key. A value. A quoted value. A key-value pair. A section header. Once each of those was its own little named function that I could unit test in isolation, assembling them was almost mechanical.

A key-value pair, for instance, is a key, then optional whitespace, then =, then optional whitespace, then a value:

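use nom::character::complete::{alphanumeric1, space0};

fn key_value(input: &str) -> IResult<&str, (&str, &str)> {
    // Key: one or more ASCII letters or digits.
    let (input, key) = alphanumeric1(input)?;
    // Separator: '=' with optional spaces or tabs on either side.
    let (input, _) = delimited(space0, tag("="), space0)(input)?;
    // Value: the rest of the line; the newline itself is left unconsumed.
    let (input, value) = take_till(|c| c == '\n')(input)?;
    Ok((input, (key.trim(), value.trim())))
}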

The ? operator does exactly what you want here: it bails out cleanly on the first parser that doesn't match, while rebinding input on each line threads the remaining input through to the next step. That's the part the regex could never give me: when something failed, I'd get the point in the input where it failed and the kind of parser that failed there, not a sullen None and a fifteen-minute debugging session squinting at backslashes.
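nom's default error carries the input that was left at the point of failure, which is enough to recover a human-readable position. A minimal sketch of that conversion, assuming the remainder is a suffix of the original source; `line_col` is a hypothetical helper, not part of nom:

```rust
/// Given the full source and the unparsed remainder (assumed to be a
/// suffix of `source`), returns the 1-based (line, column) at which the
/// remainder begins. Columns are counted in characters, not bytes.
fn line_col(source: &str, remaining: &str) -> (usize, usize) {
    // Byte offset of the failure point within the source.
    let offset = source.len() - remaining.len();
    let consumed = &source[..offset];
    // Every newline already consumed moves us down one line.
    let line = consumed.matches('\n').count() + 1;
    // Column is the distance from the last consumed newline, if any.
    let column = match consumed.rfind('\n') {
        Some(newline) => consumed[newline + 1..].chars().count() + 1,
        None => consumed.chars().count() + 1,
    };
    (line, column)
}
```

For the source `"name = x\n= oops\n"`, a failure whose remaining input is `"= oops\n"` maps to `(2, 1)`: second line, first column.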

handling the messy reality

Real config files have alternatives. A line might be a comment, or a section header, or a key-value pair, or blank. That's what alt is for: it tries each parser in turn and returns the first that succeeds.

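use nom::{branch::alt, combinator::map};

enum Line<'a> {
    Comment,
    Section(&'a str),
    Pair(&'a str, &'a str),
    Blank,
}

// `section`, the section-header parser, is not shown in this post.
fn line(input: &str) -> IResult<&str, Line<'_>> {
    alt((
        map(comment, |_| Line::Comment),
        map(section, Line::Section),
        map(key_value, |(k, v)| Line::Pair(k, v)),
        // space0 matches zero or more spaces, so it always succeeds: keep it last.
        map(space0, |_| Line::Blank),
    ))(input)
}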

There's an ordering subtlety lurking in alt that bit me: the parsers are tried in order, and the first to match wins, so you have to put the more specific ones first. I had blank near the top at one point, and because space0 happily matches zero characters, it succeeded on every line, leading whitespace or none, and declared victory. Move it to the bottom, problem gone. This is the kind of bug that's obvious in hindsight and invisible while you're writing it, and it's the one genuine footgun I hit.
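The ordering bug is easy to reproduce in miniature. The sketch below uses a hand-rolled first-match-wins chooser in place of alt, so it runs without nom; `Kind`, `first_of`, `blank_first` and `blank_last` are hypothetical names for this illustration, and `blank` only imitates the zero-or-more behaviour of space0.

```rust
#[derive(Debug, PartialEq)]
enum Kind {
    Blank,
    Comment,
}

/// Zero or more spaces or tabs. Zero is allowed, so this never fails.
fn blank(input: &str) -> Result<(&str, Kind), String> {
    let rest = input.trim_start_matches(|c: char| c == ' ' || c == '\t');
    Ok((rest, Kind::Blank))
}

/// '#' followed by everything up to the newline.
fn comment(input: &str) -> Result<(&str, Kind), String> {
    let rest = input
        .strip_prefix('#')
        .ok_or_else(|| format!("expected '#' at {:?}", input))?;
    let end = rest.find('\n').unwrap_or(rest.len());
    Ok((&rest[end..], Kind::Comment))
}

/// Tries each parser in order and returns the first success.
fn first_of<'a>(
    parsers: &[fn(&str) -> Result<(&str, Kind), String>],
    input: &'a str,
) -> Result<(&'a str, Kind), String> {
    for parser in parsers {
        if let Ok(hit) = parser(input) {
            return Ok(hit);
        }
    }
    Err(format!("no alternative matched at {:?}", input))
}

/// The buggy order: the always-succeeding parser shadows everything after it.
fn blank_first(input: &str) -> Result<(&str, Kind), String> {
    first_of(&[blank, comment], input)
}

/// The fixed order: specific parsers first, the catch-all last.
fn blank_last(input: &str) -> Result<(&str, Kind), String> {
    first_of(&[comment, blank], input)
}
```

Here `blank_first("# note")` returns `Ok(("# note", Kind::Blank))`, consuming nothing and misclassifying the line, while `blank_last("# note")` returns `Ok(("", Kind::Comment))`.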

was it worth it

The final parser is about ninety lines, every one of which I can explain. It has unit tests at the level of individual combinators, so when the format inevitably grows another wart I'll add a small parser and a small test and slot it in, rather than extending an incomprehensible regex by trial and error. Error messages now point at the line and the reason. And it's fast, though for a config file that was never the concern.

What surprised me is how much the experience resembled writing ordinary code rather than learning a tool. There's no grammar DSL to keep in your head, no separate mental model. A parser is a function, functions compose, and the compiler checks the joins. The IResult signatures stop looking like noise within an afternoon. If you've been avoiding nom because the types look hostile, I'd gently suggest the types are the whole point: they're what stops you assembling something that can't possibly work. I came for a config parser and left with a tool I'll actually reach for again, which is more than I can say for most afternoons spent learning a library.