
Tom Hodson

Maker, Baker, Programmer, Reformed Physicist, RSE@ECMWF


Parsing is fun!

Usually when I want to parse something so that I can manipulate it in code, be it JSON, YAML, HTML, XML, whatever, there is a nice existing library to do that for me. The solution is a simple import json away. However, if the language is a bit more niche, there may not be a good parser available for it, or the parser that exists might be missing features.

Recently I came across a tiny language at work that looks like this:

[foo, bar, bazz
    [more, names, of, things
     [even, more]]]

[another, one, [here, too]]

I won’t get into what this is, but it was an interesting excuse to muck about with writing a grammar for a parser, something I had never tried before. So I went looking for a library and, after a false start, settled on pe. Don’t ask me what the gold standard in this space is, but I like pe.

To avoid getting too verbose, let’s just see some examples. Let’s start with an easy version of this problem: “[a, b, c]”.

import pe

parser = pe.compile(
    r'''
    List    <- "[" String ("," Spacing String)* "]"
    String  <- ~[a-zA-Z]+
    Spacing <- [\t\n\f\r ]*
    ''',
)
parser.match("[a, b, c]").groups()

>>> ('a', 'b', 'c')

So what’s going on here? Many characters mean the same as they do in regular expressions, so “[a-zA-Z]+” is one or more upper or lowercase letters while “[\t\n\f\r ]*” matches zero or more whitespace characters. The tilde “~” tells pe that we’re interested in keeping the string, while we don’t really care about the spacing characters. The pattern “String ("," Spacing String)*” seems to be the classic way to express a list-like structure of arbitrary length: one item, followed by zero or more repetitions of a comma and another item.
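
To make the regular expression comparison concrete, here is a rough sketch of the same flat-list grammar using only Python’s built-in re module instead of pe. The names STRING, SPACING, LIST and parse_flat_list are made up for this illustration and are not part of pe. It is only meant to show how the pieces line up, not how pe works internally.

```python
import re

# The two character classes from the grammar, as ordinary regex fragments.
STRING = r"[a-zA-Z]+"        # one or more letters
SPACING = r"[\t\n\f\r ]*"    # zero or more whitespace characters

# "[" String ("," Spacing String)* "]" written as a single regex.
LIST = re.compile(rf"\[{STRING}(?:,{SPACING}{STRING})*\]")


def parse_flat_list(text: str) -> tuple:
    """Return the items of a flat list such as "[a, b, c]"."""
    if LIST.fullmatch(text) is None:
        raise ValueError(f"not a list: {text!r}")
    # A repeated regex group only remembers its last match, so the
    # items are collected in a second pass over the validated text.
    return tuple(re.findall(STRING, text))


print(parse_flat_list("[a, b, c]"))  # ('a', 'b', 'c')
```

The second pass with re.findall is the awkward part: a plain regex has no equivalent of the “~” capture inside a repetition, which is one reason a grammar library is nicer for this kind of job.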

Whitespace turns out to be annoying: “[ a, b, c]” does not parse with this grammar, because it only allows spacing directly after a comma. We’d have to change the grammar to something like this:

import pe

parser = pe.compile(
    r'''
    List    <- "[" Spacing String (Comma String)* Spacing "]"
    Comma   <- Spacing "," Spacing
    String  <- ~[a-zA-Z]+
    Spacing <- [\t\n\f\r ]*
    ''',
)
parser.match("[ a, b , c ]").groups()

NB: there is a branch of pe, which hopefully will be merged soon, that includes the ability to auto-ignore whitespace.

We can now allow nested lists by changing the grammar slightly. We also give pe a hint about what kind of Python object to make from each rule:

import pe
from pe.actions import Pack

parser = pe.compile(
    r'''
    List    <- "[" Spacing Value (Comma Value)* Spacing "]"
    Value   <- List / String
    Comma   <- Spacing "," Spacing
    String  <- ~[a-zA-Z]+
    Spacing <- [\t\n\f\r ]*
    ''',
    actions={
        'List': Pack(list),
    },
)
parser.match("[ a, b , c, [d, e, f]]").value()
>>> ['a', 'b', 'c', ['d', 'e', 'f']]
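
For comparison, here is roughly what that grammar would look like as a hand-written recursive descent parser in plain Python, with one function per rule. This is a sketch for illustration only: the function names are invented, it is not how pe is implemented, and unlike parser.match it insists on consuming the whole input. It also peeks at the next character to choose between List and String, which stands in for the ordered choice “/” here because the two alternatives can never start with the same character.

```python
WHITESPACE = "\t\n\f\r "


def parse(text: str) -> list:
    """Parse a nested list such as "[ a, b , [c, d]]" into Python lists."""
    value, pos = parse_list(text, 0)
    if pos != len(text):
        raise ValueError(f"unexpected trailing text at position {pos}")
    return value


def skip_spacing(text: str, pos: int) -> int:
    # Spacing <- [\t\n\f\r ]*
    while pos < len(text) and text[pos] in WHITESPACE:
        pos += 1
    return pos


def expect(text: str, pos: int, char: str) -> int:
    # Match a single literal character or fail.
    if pos >= len(text) or text[pos] != char:
        raise ValueError(f"expected {char!r} at position {pos}")
    return pos + 1


def parse_list(text: str, pos: int):
    # List <- "[" Spacing Value (Comma Value)* Spacing "]"
    pos = skip_spacing(text, expect(text, pos, "["))
    item, pos = parse_value(text, pos)
    items = [item]
    while True:
        # Comma <- Spacing "," Spacing
        after = skip_spacing(text, pos)
        if after >= len(text) or text[after] != ",":
            break
        item, pos = parse_value(text, skip_spacing(text, after + 1))
        items.append(item)
    pos = expect(text, skip_spacing(text, pos), "]")
    return items, pos  # plays the role of Pack(list)


def parse_value(text: str, pos: int):
    # Value <- List / String
    if pos < len(text) and text[pos] == "[":
        return parse_list(text, pos)
    return parse_string(text, pos)


def parse_string(text: str, pos: int):
    # String <- ~[a-zA-Z]+
    end = pos
    while end < len(text) and text[end].isascii() and text[end].isalpha():
        end += 1
    if end == pos:
        raise ValueError(f"expected a name at position {pos}")
    return text[pos:end], end


print(parse("[ a, b , c, [d, e, f]]"))  # ['a', 'b', 'c', ['d', 'e', 'f']]
```

Five lines of grammar versus fifty-odd lines of bookkeeping is a fair summary of why I’d rather write the grammar.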

I’ll wrap up here because this post already feels long, but one thing I really like about pe is that you can easily push parts of what you’re parsing into named arguments to Python functions. In the example below I have set it up so that any time a “Name” rule gets parsed, the parser will call N(name="foo", value="bar"), and this works well with optional values too.

import pe
from pe.actions import Pack
from dataclasses import dataclass

@dataclass
class N:
    name: str
    value: str | None = None

parser = pe.compile(
    r'''
    List    <- "[" Spacing Value (Comma Value)* Spacing "]"
    Value   <- List / Name
    Name    <- name:String Spacing ("=" Spacing value:String)?
    Comma   <- Spacing "," Spacing
    String  <- ~[a-zA-Z]+
    Spacing <- [\t\n\f\r ]*
    ''',
    actions={
        'List': Pack(list),
        'Name': N,
    },
)
parser.match("[ a=b, b=g, c, [d, e, f]]").value()
>>> [N(name='a', value='b'),
     N(name='b', value='g'),
     N(name='c', value=None),
     [N(name='d', value=None), N(name='e', value=None), N(name='f', value=None)]]
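
The optional case is easiest to see on the dataclass by itself. The sketch below leaves pe out entirely and uses two hand-written dictionaries as hypothetical stand-ins for the named bindings collected from “a=b” and from “c”. It assumes that a binding which never matched is simply left out of the call, so the dataclass default applies. The output above is consistent with that, but it is an assumption about pe rather than something taken from its documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class N:
    name: str
    value: Optional[str] = None  # default used when no "=value" part is present


# Hypothetical stand-ins for the bindings gathered while matching a Name rule.
with_value = {"name": "a", "value": "b"}   # from "a=b"
without_value = {"name": "c"}              # from "c": no value binding at all

print(N(**with_value))     # N(name='a', value='b')
print(N(**without_value))  # N(name='c', value=None)
```

Optional[str] is used here instead of str | None only so that the sketch also runs on Python versions older than 3.10.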
