In 2007 I wrote a regular expression engine that I called wregex. I've now cleaned up the code and placed it on GitHub.
This essay is a description of how it is implemented. I think that if you understand the concepts you will never be puzzled by a pattern again.
Regular expression engines are implemented as finite state machines (FSM). The pattern you supply is compiled into a data structure that represents this state machine.
When you match a string against this pattern, the regex engine takes each character and decides the state transition within the FSM. If there are no valid state transitions for an input character the match fails.
One of the states in the FSM is a terminating/end state. If the regex engine gets there it reports success.
There are two ways[1] in which these state machines can be arranged, namely Deterministic Finite Automata (DFA) and Non-Deterministic Finite Automata (NFA).
These names refer to the way in which the internal state of the regex engine changes on each input character.
A DFA is called deterministic because it can always choose the next state for a given input character; if it cannot move to a valid state, the input string does not match the regex pattern.
The following graphic shows an example DFA for the regular expression "ab*c":
Consider an example input string "abbc":
If we were to read, say, a 'd' while we were in state 2, we know that the input doesn't match the regex.
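The table-driven nature of a DFA can be sketched in C. This is only an illustration of the concept, not wregex's code (wregex is an NFA engine); the state numbering here follows the description, with a separate dead-end value instead of a "no transition" state:

```c
#include <assert.h>
#include <stdbool.h>

/* Hand-written DFA for the pattern "ab*c".
 * States: 0 = start, 1 = seen 'a' (and zero or more 'b's), 2 = accept.
 * -1 means "no valid transition": the match fails immediately. */
static int dfa_step(int state, char c) {
    switch (state) {
    case 0: return c == 'a' ? 1 : -1;
    case 1: if (c == 'b') return 1;
            if (c == 'c') return 2;
            return -1;
    default: return -1; /* no transitions out of the accept state */
    }
}

static bool dfa_match(const char *s) {
    int state = 0;
    for (; *s; s++) {
        state = dfa_step(state, *s);
        if (state < 0) return false; /* dead end: no backtracking needed */
    }
    return state == 2; /* did we end in the accept state? */
}
```

Note that the engine never has to reconsider a decision: each input character determines the next state completely, which is exactly why DFAs don't need to backtrack.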
An NFA is called non-deterministic because there are cases where the regex engine has to guess which state to go to next. If it guesses wrong, it has to go back to a previous state and try a different transition.
The following graphic shows an example NFA for the regular expression "ab*c" from before. I've drawn it a bit differently, but I'll explain:
Here's how the NFA would process the input string "abbc":
Note that I've drawn the arrow from X to B thick and the arrow from X to C thin. The regex compiler decided that the engine should go to B first, but it is a detail I'll explain later.
This process of going back to a previous state and trying another state transition in a NFA (as in steps 8 and 9) is called backtracking.
The NFA will have to try all possible routes through the state machine until it finds the terminating state, the possible routes are exhausted, or there are no more input characters.
DFAs have some big advantages over NFAs: because DFA regex engines don't need to backtrack, they are in general much faster than NFAs. Also, because NFAs need to backtrack, it is possible to structure your pattern in such a way that the backtracking causes nearly infinite loops on certain input sequences. DFAs also don't need the non-greedy operators *? and +?.
NFAs have advantages of their own over DFAs. The most important of these is that NFAs allow for capture groups. The other advantage is that NFA regex engines are easier to implement (in my experience, at least).
For these reasons, what you see in practice is that most modern languages implement their regex engines as NFAs, and most regex tutorials on the internet explain how to use the non-greedy operators to avoid the problems with backtracking.
wregex is implemented as an NFA regex engine.
The code is available on GitHub if you'd like to follow along.
The API has 2 broad phases[2]:

- The compilation phase parses the pattern into a wregex_t structure. The compilation is done in the function wrx_comp(), defined in the file wrx_comp.c.
- The execution phase matches an input string against the compiled wregex_t. This is done by the function wrx_exec(), defined in the file wrx_exec.c.

The remainder of the API are functions for error handling and debugging.
In the description below, I will need to describe the execution phase first. It is easier to understand how the parser generates the NFA structure if you know what we want to achieve in the execution phase.
I will also split my description of the compilation phase into two stages: parsing and code generation. Parsing is responsible for reading the regex pattern into memory, while code generation is the process of outputting the NFA data structure. In practice wregex generates the NFA while its parser reads the pattern, but it will be clearer if I describe the two stages separately.
The core of wregex is built around the wregex_t
structure defined in
wregex.h
. Here are the most important parts:
typedef struct _wregex_t
{
wrx_state *states; /* The states themselves */
short n_states; /* The number of states */
short start, /* The start state */
stop; /* The stop state */
// ...
} wregex_t;
The states member is the most important part, because it contains the actual state machine. Each state is a wrx_state object, which looks like the following:
typedef struct _wrx_state
{
char op; /* opcode */
short s[2]; /* State transitions */
union {
char c; /* Actual character */
char *bv; /* Bit vector for storing ranges of characters */
short idx; /* Index if this is a submatch/backreference state (REC, STP, BRF) */
} data;
} wrx_state;
- op is the code for an operation (opcode) that should be performed when the NFA is in that state. Example opcodes are MTC, which tells the NFA to match a single character, and CHC, which tells the NFA to try one state transition and, if it fails, try a different state transition. The opcodes are defined at the top of wrxcfg.h.
- s[2] contains the state transitions.
- data contains the operand of the state. It is a union, and the particular sub-member used depends on the specific opcode. For example, if the opcode is MTC then the NFA will compare the next character in the input string to the c member of data.
If you are familiar with compilers, interpreters or assembly language you
would've realised by now that the states in the wregex_t
are actually the
instructions for a very specific interpreter executed by the wrx_exec()
function.
The start
and stop
members of the wregex_t
are the indices of the first
and last states of the NFA within the array (the first state is not necessarily
at states[0]
as will become apparent shortly).
I'm going to skip forward for a moment and show you what a compiled
wregex_t
NFA looks like. The following is the wregex_t
's states
array for
the regular expression "ab"
:
start: 4; stop: 6
0 MTC 'a' 2
2 MTC 'b' 5
4 REC <0> 0
5 STP <0> 6
6 EOM
The output above can be obtained if you compile wregex in debug mode (use make debug in your shell) and then run the test program with "ab" as parameter: ./test ab.
Each row in the listing shows the state's index in the states array[3], its opcode, its operand (the data member) and its first state transition (the s[0] member). A CHC instruction will have a fifth column for the second state transition (the s[1] member).
This is what the NFA would look like graphically:
The wrx_exec()
function goes through these steps to match a string against
this NFA:
1. Execution starts at state 4, the REC instruction represented by the triangle in the graphic.
2. The engine executes the REC instruction and goes to state 0 (ignore the specifics of the REC and STP instructions for now; I'll discuss them when I get to submatches below).
3. The MTC instruction in state 0 matches the first character of the input string to the literal 'a' and then goes to state 2.
4. The MTC instruction in state 2 matches the next character of the input string to 'b' and goes to the STP instruction in state 5 (represented by the inverted triangle).
5. The STP instruction is executed and the engine proceeds to state 6.
6. State 6 holds the EOM instruction, which means End of Match: the input string matched the regular expression.
If you concluded that backtracking would somehow involve a stack from my description of the NFA above, then you would be correct.
We introduce the CHC
("choice") instruction to implement backtracking. It
has two arguments corresponding to the two possible next states we should try.
These are stored in the s[0]
and s[1]
members of the wrx_state
structure.
When executing this instruction the engine pushes the second state onto a stack and tries the first state. If a match fails later, it pops a state from the stack and tries again from there. If the stack is empty when we try to pop it, we know that the input string does not match the regex.
Let's look at the states for the expression "a|b":
start: 6; stop: 8
0 MTC 'a' 7
2 MTC 'b' 7
4 CHC --- 0 2
6 REC <0> 4
7 STP <0> 8
8 EOM
Consider what wrx_exec() should do when it encounters the input string "b":
1. Execution starts at the REC instruction in state 6, which goes to state 4.
2. The CHC instruction pushes the current position in the input string and state 2 (the value of wrx_state.s[1]) onto a stack and goes to state 0 (the value of wrx_state.s[0]). The arrow from state 4 to state 0 is drawn thick and the one from state 4 to state 2 thin to show that state 0 will be tried first. Which one to try first is an important detail for NFA-based regex engines.
3. The MTC instruction in state 0 tries to match the 'a' to the "b" in the input string. It fails, so we go to the stack. It is not empty, so we pop the previous position in the input string along with the other state transition, 2. So we now go to state 2.
4. The MTC instruction in state 2 matches the 'b' to the "b". It is successful, so we go to state 7.
5. We execute the STP instruction in state 7 and go to state 8.
6. State 8 holds EOM, so the match was successful.
If the input string was "d" instead, we would've been unsuccessful in step 4. In this case the stack would be empty, so we would know that the input string does not match the regex.
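The backtracking loop described above can be sketched in C. This is a toy version, not wregex's actual code: the opcode names and state indices follow the "a|b" listing, but the structures are simplified, submatch recording is omitted, and REC/STP are treated as no-ops:

```c
#include <assert.h>
#include <stdbool.h>

enum { MTC, CHC, REC, STP, EOM };

typedef struct { int op; char c; int s0, s1; } toy_state;

/* The NFA for "a|b" from the listing above (indices match the listing). */
static const toy_state nfa[] = {
    [0] = { MTC, 'a', 7, 0 },
    [2] = { MTC, 'b', 7, 0 },
    [4] = { CHC,  0,  0, 2 },
    [6] = { REC,  0,  4, 0 },
    [7] = { STP,  0,  8, 0 },
    [8] = { EOM,  0,  0, 0 },
};

static bool toy_match(const char *input) {
    struct { int state; const char *pos; } stack[64];
    int sp = 0;
    int state = 6;                   /* the start state */
    const char *pos = input;
    for (;;) {
        const toy_state *st = &nfa[state];
        switch (st->op) {
        case MTC:
            if (*pos == st->c) { pos++; state = st->s0; break; }
            /* Match failed: backtrack, or give up if the stack is empty. */
            if (sp == 0) return false;
            sp--; state = stack[sp].state; pos = stack[sp].pos;
            break;
        case CHC:
            /* Remember the second alternative, then try the first. */
            stack[sp].state = st->s1; stack[sp].pos = pos; sp++;
            state = st->s0;
            break;
        case REC: case STP:
            state = st->s0;          /* submatch recording ignored here */
            break;
        case EOM:
            return true;             /* reached the end state: match */
        }
    }
}
```

Note how a failed MTC either pops a saved (state, position) pair or, if the stack is empty, rejects the input — exactly the two outcomes in steps 3 and 4 above.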
wregex implements its parser as a recursive descent parser. Here is the complete syntax of wregex as a BNF-like grammar:
pattern ::= ['^'] [list] ['$']
list ::= element ["|" list]
element ::= ("(" [":"] list ")" | value)
[(("*"|"+"|"?")["?"])|("{" [digit+] ["," [digit+]] "}" ["?"])]
[element]
value ::= (A-Za-z0-9!"#%&',-/:;=@\\_`~\r\t\n) | '<' | '>' |
"[" ["^"] ranges "]" | "." | 'escape sequence'
ranges ::= (c ["-" c])+
where c is any printable ASCII character (>= 0x20)
If you are not familiar with BNF, the above states that:

- A pattern is a list, optionally preceded by a caret ^ and optionally followed by a dollar sign $.
- A list is an element, optionally followed by a | and another list (hence the "recursive" in recursive descent parser).
- An element is a parenthesised list or a value, optionally followed by a *, + or ? (like a*), which in turn may be non-greedy (like a*?), or by a number of repetitions in curly braces (like a{4} or a{5,10}), which in turn may be non-greedy (like a{4,5}?); all of which may be followed by another element.
- A value is a literal character, a character set (like [a-z]), the '.' operator or an escape sequence.

What I like about recursive descent parsers is that it is very easy to
implement a parser by hand from a grammar like the above and follow the code of
the parser later. Also each of the rules in the grammar has a corresponding
function in wrx_comp.c
, which nicely organises the code for maintainability.
Recursive descent parsers have drawbacks, but I'll leave that as an exercise for the reader.
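To illustrate how directly such a grammar maps to code, here is a minimal recursive descent recognizer for a subset of the grammar (literal values, grouping, the postfix operators and alternation). It is my own sketch, not wregex's parser: it only answers whether a pattern is well formed, whereas the real functions in wrx_comp.c also generate NFA segments as they go:

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

static const char *p;            /* current position in the pattern */

static bool rd_list(void);

/* value ::= literal character (a simplified subset) */
static bool rd_value(void) {
    if (!isalnum((unsigned char)*p)) return false;
    p++;
    return true;
}

/* element ::= ("(" list ")" | value) [("*"|"+"|"?")["?"]] [element] */
static bool rd_element(void) {
    if (*p == '(') {
        p++;
        if (!rd_list() || *p != ')') return false;
        p++;
    } else if (!rd_value()) {
        return false;
    }
    if (*p == '*' || *p == '+' || *p == '?') {
        p++;
        if (*p == '?') p++;      /* non-greedy variant */
    }
    /* concatenation: another element may follow */
    if (isalnum((unsigned char)*p) || *p == '(') return rd_element();
    return true;
}

/* list ::= element ["|" list] */
static bool rd_list(void) {
    if (!rd_element()) return false;
    if (*p == '|') { p++; return rd_list(); }
    return true;
}

static bool rd_pattern(const char *pattern) {
    p = pattern;
    return rd_list() && *p == '\0';
}
```

Each grammar rule becomes one function, and each function calls the functions for the rules it references — which is what makes the real parser's code so easy to follow.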
The nice thing about the recursive descent parser is that it very neatly splits the regex pattern we're compiling into sub-patterns. Every level down, the parser creates a segment of the NFA and returns it to its caller.
All these instructions are stored in the dynamically growing array of instructions described in the section on the interpreter basics. The function next_state() returns the next available index in this array, and resizes the array if it becomes full (you can think of this index as an instruction address for our bytecode interpreter).
The caller needs some mechanism to keep track of the start and end indices of the segments generated by each of the functions it calls. It does this through the push_seg() and pop_seg() functions, which push and pop an nfa_segment structure onto and off a stack. The nfa_segment simply contains a beginning and an end index for a particular segment.
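That bookkeeping can be sketched like this. The names mirror the article (nfa_segment, push_seg, pop_seg), but the code is simplified: a fixed-size stack with no error handling, whereas wregex's versions in wrx_comp.c grow the stack and report errors:

```c
#include <assert.h>

typedef struct { short beg, end; } nfa_segment;

static nfa_segment seg_stack[128];
static int seg_top = 0;

static void push_seg(short beg, short end) {
    nfa_segment s = { beg, end };
    seg_stack[seg_top++] = s;
}

static nfa_segment pop_seg(void) {
    return seg_stack[--seg_top];
}

/* Mimic the "ab" example below: value() pushes a segment for 'a' (0-1)
 * and one for 'b' (2-3); element() then pops both to concatenate them. */
static int demo_concat(void) {
    push_seg(0, 1);
    push_seg(2, 3);
    nfa_segment b = pop_seg();
    nfa_segment a = pop_seg();
    return a.beg == 0 && a.end == 1 && b.beg == 2 && b.end == 3;
}
```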
As an example, consider compiling the pattern ab|c.
The parser goes down through the functions, calling pattern(), list() and then element(). element() doesn't find any of the operators, so it calls value() right at its end.
The first pattern is the character 'a'
, for which a MTC
instruction followed by a
MOV
instruction is generated:
0 MTC 'a' 1
1 MOV 0
The MOV
(move) instruction is just a dummy instruction that causes the interpreter
to move to the state specified by its operand (it didn't occur to me at the time that
JMP
might have been a more appropriate name). At this point the operand is not set, so
it is 0.
The beginning and end addresses of this segment (0 and 1, respectively) are pushed onto the stack through push_seg(), and value() returns.
We're now back in the element() function, which calls itself recursively and ends up calling the value() function again. The parser now encounters the 'b' character, so it generates the same two instructions after the first pair:
0 MTC 'a' 1
1 MOV 0
2 MTC 'b' 3
3 MOV 0
This new segment's addresses (2 and 3) are also pushed onto the stack, value() returns, and then the recursive call to element() returns.
We're now back in the initial call to the element() function. element() knows it has to concatenate two segments, so it pops both segments off the stack through pop_seg().
Now it uses the transition() function to tie the end of the first segment to the second segment by setting the operand of the first MOV instruction to 2:
0 MTC 'a' 1
1 MOV 2
2 MTC 'b' 3
3 MOV 0
The element()
function now returns up one level to the list()
function that encounters the
'|'
symbol and calls element()
again, which encounters the 'c'
and adds a new segment:
0 MTC 'a' 1
1 MOV 2
2 MTC 'b' 3
3 MOV 0
4 MTC 'c' 5
5 MOV 0
So back in the list() function, there are now two segments on the stack: 0-3 and 4-5. They get popped off. list() then generates a CHC and a MOV instruction. The CHC has the start addresses of the two segments, 0 and 4, as operands to control the branching. list() also connects the two segments' end points to the generated MOV instruction:
0 MTC 'a' 1
1 MOV 2
2 MTC 'b' 3
3 MOV 7
4 MTC 'c' 5
5 MOV 7
6 CHC --- 0 4
7 MOV 0
The addresses of the CHC and MOV instructions (6 and 7) are now pushed onto the stack as the segment resulting from the alternation. This demonstrates why the MOV instructions are necessary: they allow each function in the parser to have a single entry and a single exit point in the generated NFA.
list()
now returns to pattern()
which wraps the segment it pops from the stack in
between REC
and STP
instructions (used for capturing groups, as described
below), and pushes the new segment back onto the stack.
We're now at the top of the parser in wrx_comp(). It pops the completed NFA (which is now one big segment) off the stack and appends the EOM instruction that signals a successful match.
The completed NFA looks like this:
0 MTC 'a' 1
1 MOV 2
2 MTC 'b' 3
3 MOV 7
4 MTC 'c' 5
5 MOV 7
6 CHC --- 0 4
7 MOV 9
8 REC <0> 6
9 STP <0>10
10 EOM
Using these principles, we can compile the *, + and ? operators. In the examples I'm about to give, P, Q, R and S are addresses of instructions, and A is an arbitrary pattern that is compiled to a segment through recursive calls.
To compile the * operator in a pattern like A*, we generate a CHC instruction that will first try to match the pattern A. It looks like this:
start at R
P <segment for matching A>
...
Q MOV R
R CHC --- P S
S ...
The A+
pattern is then just a simple modification: We start at P
rather than R
to ensure that the input string matches A
at least once.
start at P
P <segment for matching A>
...
Q MOV R
R CHC --- P S
S ...
The ?
operator works a lot like the *
, but the MOV
instruction at Q
jumps directly to S
instead of going through the CHC
at R
start at R
P <segment for matching A>
...
Q MOV S
R CHC --- P S
S ...
The last step of the compiler is to call the optimize() function, which removes all the MOV instructions, which have now become redundant: if an instruction's destination is a MOV, we just change the instruction's destination to the MOV's destination. For the "ab|c" example, the result is:
start: 8; stop: 10
0 MTC 'a' 2
2 MTC 'b' 9
4 MTC 'c' 9
6 CHC --- 0 4
8 REC <0> 6
9 STP <0>10
10 EOM
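The core of such a MOV-elision pass can be sketched like this. It is a simplification of the idea rather than wregex's optimize(): the real function also has to fix up the start state and both transitions of CHC instructions, and then compact the array:

```c
#include <assert.h>

enum { MTC, MOV, EOM };

typedef struct { int op; int s0; } op_state;

/* If a transition points at a MOV, retarget it to wherever that MOV
 * points, following chains of MOVs until a real instruction is reached. */
static int resolve(const op_state *states, int target) {
    while (states[target].op == MOV)
        target = states[target].s0;  /* skip over the dummy instruction */
    return target;
}

/* A tiny demo program: state 0 jumps through two MOVs to reach EOM. */
static const op_state demo[] = {
    { MTC, 1 },  /* 0: match a character, then go to state 1 */
    { MOV, 2 },  /* 1: dummy */
    { MOV, 3 },  /* 2: dummy */
    { EOM, 0 },  /* 3: end of match */
};
```

After every transition has been resolved this way, the MOV states are never entered again and can simply be skipped during execution.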
If you followed the above, then the extra features are quite straightforward to implement.
Implementing the non-greedy *? and +? operators is laughably simple: you simply swap the order in which you try the next states in the CHC instructions.
Remember how I showed that we try state B first in step 3 of my description of
NFAs above? With that setup, the regex engine will always try to consume
another 'b'
character from state X before it tries a 'c'
, making it
greedy for more 'b'
s.
In order to make the '*' operator non-greedy, the regex engine simply has to try state C first. The following graphic shows the NFA for the regular expression "ab*?c":
I've drawn the arrow from X to C thick and the arrow from X to B thin this time to show the difference.
In this case, once you're in state X after you've read an 'a'
, the regex will
always try to match a 'c'
first and go to the "stop" state before it tries to
match any 'b'
characters.
The function weaken()
in wrx_comp.c
swaps the state transitions of an NFA
if a ?
is encountered after a *
or a +
.
I cheated a bit with expressions containing curly braces, like A{m,n}: they are implemented by repeating the preceding expression A up to n times. For example, A{1,4} is handled internally as 'AA?A?A?'.
The function duplicate()
in wrx_comp.c
takes care of the details.
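The idea behind this rewrite can be illustrated with a toy function that expands a single-character A{m,n} into the equivalent pattern text. This is only an analogy of my own: wregex's duplicate() copies whole NFA segments rather than rewriting the pattern string:

```c
#include <assert.h>
#include <string.h>

/* Expand a{m,n} into m mandatory copies of the character followed by
 * (n - m) optional "a?" copies, e.g. a{1,4} becomes "aa?a?a?". */
static void expand_braces(char a, int m, int n, char *out, size_t outsz) {
    size_t len = 0;
    for (int i = 0; i < m && len + 2 < outsz; i++)    /* mandatory copies */
        out[len++] = a;
    for (int i = m; i < n && len + 3 < outsz; i++) {  /* optional copies */
        out[len++] = a;
        out[len++] = '?';
    }
    out[len] = '\0';
}

/* Helper for testing: does the expansion equal the expected string? */
static int check_expansion(const char *expected, char a, int m, int n) {
    char buf[64];
    expand_braces(a, m, n, buf, sizeof buf);
    return strcmp(buf, expected) == 0;
}
```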
Character sets are implemented as 256-bit bit vectors. Each bit in the bit vector corresponds to an ASCII character. The SET opcode compares the current input character against the bit vector.
This is why wregex's implementation cannot support Unicode.
A set like [a]
will have the bits associated with both 'a'
and 'A'
set in
case-insensitive mode.
The '.'
operator uses a bit vector with all bits set except for the bits
associated with the ASCII control characters.
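The bit vector representation can be sketched as follows. The macro and function names here are my own, not wregex's; the real engine keeps its vector in the bv member of the state's data union:

```c
#include <assert.h>
#include <string.h>

/* 256 bits = 32 bytes: one bit per ASCII character. */
typedef unsigned char char_set[32];

#define SET_BIT(set, c)  ((set)[(unsigned char)(c) >> 3] |= (1u << ((c) & 7)))
#define TEST_BIT(set, c) ((set)[(unsigned char)(c) >> 3] &  (1u << ((c) & 7)))

/* Build the vector for a range like [a-z]. */
static void set_range(char_set set, char from, char to) {
    for (char c = from; c <= to; c++)
        SET_BIT(set, c);
}

/* Helper for testing: is c in the set built from [a-z]? */
static int demo_lower(char c) {
    char_set s;
    memset(s, 0, sizeof s);
    set_range(s, 'a', 'z');
    return TEST_BIT(s, c) != 0;
}
```

With this representation, matching a character against a set is a single bit test regardless of how complicated the set expression was.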
wregex uses the '\i'
and '\I'
control sequences to enable and disable
case-insensitivity, rather than depending on flags passed to wrx_comp()
. I
wanted wregex to be usable as the regex engine of a scripting language, so
doing it this way gave control to the person writing the scripts rather than
the person implementing the scripting language.
These control sequences toggle a flag in the compiler which controls whether MTC or MCI opcodes are generated and how the bit vectors of the character sets are generated.
Special operators like '^', '$', '<' and '>' all have special opcodes that cause wrx_exec() to check for the beginning and end of lines and words, respectively.
The REC ("record") and STP ("stop recording") instructions are used to record submatches.
Submatches are stored in an array of wregmatch_t
structures while the engine
is executing. Each wregmatch_t
structure contains a pointer to the beginning
and the end (called beg
and end
respectively) of the current submatch.
When the REC instruction is encountered, the engine sets the beg pointer to the current position in the input string, and when the STP instruction is encountered it sets the end pointer. When wrx_exec() is done, the wregmatch_t array contains the correct values.
The wregex_t has a member n_subm that counts the opening brackets to ensure that the correct wregmatch_t structure's value is set.
All NFAs are surrounded by a "REC <0>
- STP <0>
" instruction
pair to capture the entire matching string into submatch 0.
As an example, consider how the pattern a(b*)c is represented as an NFA:
start: 10; stop: 12
0 MTC 'a' 6
2 MTC 'b' 4
4 CHC --- 2 7
6 REC <1> 4
7 STP <1> 8
8 MTC 'c'11
10 REC <0> 0
11 STP <0>12
12 EOM
At state 6 the REC instruction records the current position in the input string as the start of submatch 1. It then proceeds to state 4, the CHC instruction, which uses the backtracking mechanism with state 2 to read a sequence of 'b' symbols.
Once it cannot read any more 'b's, it jumps to state 7, which records the current input position as the end of submatch 1 and proceeds to state 8 to read the 'c' at the end of the pattern.
Backreferences are then simply implemented by matching the input string at the current position against the corresponding submatch.
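That mechanism can be sketched like this. The names are illustrative, not wregex's actual structures: the recorded submatch is a [beg, end) range of pointers into the input, and a backreference just compares the text at the current position against that range:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef struct { const char *beg, *end; } toy_match_t;

/* Compare the input at *pos against the recorded submatch; on success,
 * advance *pos past the matched text. On failure, the real engine would
 * backtrack as usual. */
static bool match_backref(const toy_match_t *m, const char **pos) {
    size_t len = (size_t)(m->end - m->beg);
    if (strncmp(*pos, m->beg, len) != 0)
        return false;
    *pos += len;
    return true;
}

/* Helper for testing: pretend "(ab)\1" already matched "ab" as submatch 1
 * at the start of the input, then check the backreference consumes the rest. */
static bool demo_abab(const char *input) {
    toy_match_t m = { input, input + 2 };
    const char *pos = input + 2;
    return match_backref(&m, &pos) && *pos == '\0';
}
```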
I hope this article gave you some insight into how regex engines work.
I remember how complex it all seemed when I implemented wregex originally, way back, but the code seems so simple looking at it now.
Below are some of the references I used when I implemented wregex. I wrote the original implementation in 2007, so some of these links may have gone stale; apologies in advance for that.
While writing this entry, I also noticed that Russ Cox wrote a new article in the intervening years titled Regular Expression Matching: the Virtual Machine Approach, which covers a lot of the topics I've discussed here and discusses several alternative approaches, optimizations and practical considerations.