Parsing Without the Regret, Using nom

I had a config format to parse. Not a standard one, a slightly odd in-house thing with key-value lines, nested blocks, and comments. My first instinct was a pile of regular expressions, and my second, wiser instinct was the memory of every other time I have done that and regretted it within a month.

So I reached for nom, the parser combinator library, and built it properly. The thing that sells nom is that small parsers compose into bigger ones, and the code ends up reading like the grammar it implements rather than a heap of string slicing.

Combinators, from the bottom up

You start with tiny parsers that each recognise one thing, then glue them together: a parser for an identifier, a parser for whitespace, a parser for a value. Each is a function that takes input and returns either the rest of the input plus what it matched, or an error.

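use nom::{
    bytes::complete::take_while1,
    character::complete::{char, space0},
    sequence::{delimited, separated_pair},
    IResult,
};

// One or more alphanumeric characters or underscores.
fn ident(input: &str) -> IResult<&str, &str> {
    take_while1(|c: char| c.is_alphanumeric() || c == '_')(input)
}

// `key = value`, with optional spaces or tabs around the `=`.
// In nom 7 a bare tuple is not a parser, so the separator is built with `delimited`.
fn key_value(input: &str) -> IResult<&str, (&str, &str)> {
    separated_pair(ident, delimited(space0, char('='), space0), ident)(input)
}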

key_value is built entirely from smaller parsers. separated_pair runs an identifier, then the = with optional spaces around it, then another identifier, and hands back the pair. No backtracking I had to reason about, no clever regex group I would fail to understand next year.

Where it pays off

The win is not raw speed, though nom is fast because it works on slices and does not allocate unless you ask. The win is that the structure of the parser matches the structure of the language. When I needed to add nested blocks, I wrote a block parser that recursively used the line parser, and it slotted in without disturbing anything else. Try doing that to a 200-character regex.

The error handling is the other quiet benefit. When a parse fails, nom's error carries the input remaining at the point it gave up and the kind of parser that failed, which beats a regex returning a flat "no match" for the entire file. With a little effort you get errors precise enough to point a human at the offending line.

The honest downsides

It is not free. The type signatures get baroque, and the first time you hit a lifetime error inside a combinator chain you will question your choices. The complete versus streaming distinction trips everyone up once: the complete parsers assume they have all the input, while the streaming ones expect more to arrive and report the input as incomplete when it runs out. Read the docs on that before, not after.

But for anything beyond a trivial split-on-comma, I will take a nom parser over a regex every time. The regex is faster to write and slower to live with. The parser is the reverse, and I am the one who has to read it in a year.