Ramblings of an aging IT geek
rust

parsing a log line with nom, and why i stopped reaching for regex

A short note on using nom to parse a structured log line in Rust, and where combinators beat a regex once the format has nested fields.

I had a regex for parsing a log line and it had grown to the length of a sentence you'd need to read aloud twice. Timestamp, level, a bracketed component, then a free-text message that might itself contain brackets. The regex worked until the message contained a ] and then it didn't, and the fix made it longer and more fragile. Classic.

So I rewrote it with nom. The point isn't that combinators are shorter, because here they aren't. It's that they're readable a year later, and they fail in a place you can name.

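// Macro-style nom. These macros were removed in nom 7, so this needs an older release.
#[macro_use]
extern crate nom;
use nom::rest; // import path as in nom 4; an assumption about the version in use

// The timestamp is not handled in this snippet; parsing starts at the level.

// One of the known levels, as raw bytes.
named!(level, alt!(tag!("INFO") | tag!("WARN") | tag!("ERROR")));

// Whatever sits between '[' and the first ']'.
named!(component, delimited!(char!('['), is_not!("]"), char!(']')));

// level, component, then everything left over as the message.
named!(pub line<(&[u8], &[u8], &[u8])>,
    do_parse!(
        l: ws!(level)     >>
        c: ws!(component) >>
        msg: rest         >>
        (l, c, msg)
    )
);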

rest grabs everything left over as the message, so a bracket in the free text is no longer my problem. Each piece is named, each piece is a function, and when a new format quirk turns up I add a combinator rather than performing surgery on a regex with a scalpel and a prayer.
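To make the "each piece is a function" point concrete without pinning a nom version, here is a std-only sketch of the same shape. It is not nom's API: `ParseError`, `Step`, and `parse_line` are names I made up for illustration, and like the snippet above it starts at the level and leaves the timestamp out. It shows two things: a ] in the message is harmless because the message is simply whatever is left, and a failure comes back labelled with the piece that failed.

```rust
/// Which piece of the line failed to parse (hypothetical type, not from nom).
#[derive(Debug, PartialEq)]
pub enum ParseError {
    Level,
    Component,
}

/// Each step returns (remaining input, parsed piece), or names what failed.
type Step<'a> = Result<(&'a str, &'a str), ParseError>;

/// Match one of the known levels at the start of the input.
fn level<'a>(input: &'a str) -> Step<'a> {
    for name in ["INFO", "WARN", "ERROR"].iter() {
        if let Some(rest) = input.strip_prefix(*name) {
            return Ok((rest, *name));
        }
    }
    Err(ParseError::Level)
}

/// Match a non-empty `[component]`, stopping at the first `]`.
fn component<'a>(input: &'a str) -> Step<'a> {
    let inner = input.strip_prefix('[').ok_or(ParseError::Component)?;
    let end = inner.find(']').ok_or(ParseError::Component)?;
    if end == 0 {
        // Empty brackets are rejected, in the spirit of is_not!.
        return Err(ParseError::Component);
    }
    Ok((&inner[end + 1..], &inner[..end]))
}

/// Level, then component, then everything left over as the message.
pub fn parse_line<'a>(input: &'a str) -> Result<(&'a str, &'a str, &'a str), ParseError> {
    let (input, lvl) = level(input.trim_start())?;
    let (input, comp) = component(input.trim_start())?;
    // No further matching happens here, so brackets in the message are fine.
    Ok((lvl, comp, input.trim_start()))
}
```

Feeding it "WARN [db] retry [attempt 2] failed" gives back the level, "db", and the message with its brackets intact. A line starting with an unknown level comes back as ParseError::Level, and a missing bracket as ParseError::Component, which is the "fail in a place you can name" part.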

For a one-off, grep and a regex still win on effort. But the moment the format has nesting, or you'll be staring at the parser again in six months, nom earns its keep. The code tells you what it expects, which the regex never quite did.