wregex - How Regular Expression Engines Work

In 2007 I wrote a regular expression engine that I called wregex. I've now cleaned up the code and placed it on GitHub.

This essay is a description of how it is implemented. I think that if you understand the concepts you will never be puzzled by a pattern again.

State Machines

Regular expression engines are implemented as finite state machines (FSM). The pattern you supply is compiled into a data structure that represents this state machine.

When you match a string against this pattern, the regex engine reads each character and uses it to decide the next state transition within the FSM. If there is no valid state transition for an input character, the match fails.

One of the states in the FSM is a terminating/end state. If the regex engine gets there it reports success.

There are two ways[1] in which these state machines can be arranged, namely Deterministic Finite Automata (DFA) and Non-Deterministic Finite Automata (NFA).

These names refer to the way in which the internal state of the regex engine changes on each input character.

DFAs

A DFA is called deterministic because, for a given state and input character, there is never more than one possible next state; if there is no valid next state, it means that the input string does not match the regex pattern.

The following graphic shows an example DFA for the regular expression "ab*c":

DFA example for the regular expression

Consider an example input string "abbc":

  1. We start out in the "start" state (surprise!).
  2. We read the 'a' character and go to state 1.
  3. We read a 'b' and go to state 2.
  4. We read another 'b' and remain in state 2.
  5. We read a 'c' and go to state 3.
  6. The stop state is now reachable, so we're done: The input matches the regular expression.

If we were to read, say, a 'd' while we were in state 2, we know that the input doesn't match the regex.
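
The walkthrough above can be captured in a small table-driven matcher. This is a hand-coded illustration of the DFA idea (the state numbers follow the graphic), not wregex code:

```c
/* A hand-coded DFA for the pattern "ab*c". State numbers follow the
 * walkthrough above; -1 means "no valid transition", and state 3 is the
 * stop state. */
static int dfa_step(int state, char c)
{
    switch (state) {
    case 0: return c == 'a' ? 1 : -1;            /* start state: only 'a' is valid */
    case 1:                                      /* after 'a': 'b' goes to 2, 'c' to 3 */
    case 2: return c == 'b' ? 2 : (c == 'c' ? 3 : -1);
    default: return -1;                          /* the stop state accepts no more input */
    }
}

/* Returns 1 if the whole input string matches "ab*c". */
int dfa_match(const char *s)
{
    int state = 0;
    for (; *s; s++)
        if ((state = dfa_step(state, *s)) < 0)
            return 0;                            /* no valid transition: match fails */
    return state == 3;                           /* success iff we ended in the stop state */
}
```

Note that the matcher never backtracks: each character moves the machine to exactly one next state, or fails immediately.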

NFAs

An NFA is called non-deterministic because there are cases where the regex engine has to guess which state to go to next. If it guesses wrong, it has to go back to a previous state and try a different transition.

The following graphic shows an example NFA for the regular expression "ab*c" from before. I've drawn it a bit differently, but I'll explain as we go:

NFA example for the regular expression

Here's how the NFA would process the input string "abbc":

  1. From the "start" state we proceed immediately to state A.
  2. We read the 'a' character and go to state X.
  3. State X is special - in this state the next character can be either a 'b' or a 'c'. We guess that it will be a 'b' (accepting that we may be wrong), so we go to state B.

    Note that I've drawn the arrow from X to B thick and the arrow from X to C thin. The regex compiler decided that the engine should go to B first, but it is a detail I'll explain later.

  4. We read a 'b' and go back to state X.
  5. We repeat step 3 and go back to state B.
  6. We read another 'b' and go back to state X.
  7. We repeat step 3 again and go to state B.
  8. We read a 'c', but since we're in state B we can't go forward. So we go back to where we came from: back to state X (step 6).
  9. Because we already tried to go to state B from state X (the thick arrow) and it didn't work out, we now go to state C.
  10. The stop state is now reachable, so we're done: The input matches the regular expression.

This process of going back to a previous state and trying another state transition in an NFA (as in steps 8 and 9) is called backtracking.

The NFA will have to try all possible routes through the state machine until it finds the terminating state, the possible routes are exhausted, or there are no more input characters.
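
The guess-and-backtrack behaviour can be sketched with a tiny recursive matcher that handles literal characters and 'x*' (a literal followed by a star) only. This is an illustration of the idea, not wregex's algorithm:

```c
/* A tiny backtracking matcher: patterns are literal characters, where any
 * literal may be followed by '*' (zero or more). Matches the whole input. */
static int nfa_match(const char *pat, const char *s)
{
    if (!*pat)
        return *s == '\0';          /* pattern used up: match iff input is too */
    if (pat[1] == '*') {
        /* Guess greedily: try to consume one more occurrence first... */
        if (*s == *pat && nfa_match(pat, s + 1))
            return 1;
        /* ...and backtrack to "done repeating" if that guess fails. */
        return nfa_match(pat + 2, s);
    }
    return *s == *pat && nfa_match(pat + 1, s + 1);
}
```

Each failed recursive call unwinds to the most recent guess and tries the other branch, exactly like steps 8 and 9 above.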

So, which type do we prefer?

DFAs have some big advantages over NFAs: because DFA regex engines don't need to backtrack, they are in general much faster than NFAs. Also, because NFAs need to backtrack, it is possible to structure your pattern in such a way that the backtracking takes practically forever on certain input strings (so-called catastrophic backtracking). DFAs also don't need the non-greedy operators *? and +?.

NFAs have their own advantages over DFAs. The most important is that NFAs allow for capture groups. The other advantage is that NFA regex engines are easier to implement (in my experience, at least).

For these reasons, what you see in practice is that most modern languages implement their regex engines as NFAs, and most regex tutorials on the internet explain how to use the non-greedy operators to avoid the problems with backtracking.

wregex is implemented as an NFA regex engine.

Implementation

The code is available on GitHub if you'd like to follow along.

The API has 2 broad phases[2]:

  1. A compilation phase, in which wrx_comp() compiles the regex pattern into the NFA data structure.
  2. An execution phase, in which wrx_exec() matches an input string against the compiled NFA.

The remainder of the API consists of functions for error handling and debugging.

In the description below, I will need to describe the execution phase first. It is easier to understand how the parser generates the NFA structure if you know what we want to achieve in the execution phase.

I will also split my description of the compilation phase into two stages: parsing and code generation. Parsing is responsible for reading the regex pattern into memory, while code generation is the process of outputting the NFA data structure. In practice wregex generates the NFA while its parser reads the pattern, but it will be clearer if I describe the two stages separately.

Execution

Structures

The core of wregex is built around the wregex_t structure defined in wregex.h. Here are the most important parts:

typedef struct _wregex_t
{
    wrx_state *states;  /* The states themselves */
    short n_states;     /* The number of states */

    short start,        /* The start state */
          stop;         /* The stop state */

    /* ... */

} wregex_t;

The states member is the most important part, because it contains the actual state machine. Each state is a wrx_state object, which looks like the following:

typedef struct _wrx_state
{
    char op;        /* opcode */
    short s[2];     /* State transitions */
    union {
        char c;     /* Actual character */
        char *bv;   /* Bit vector for storing ranges of characters */
        short idx;  /* Index if this is a submatch/backreference state (REC, STP, BRF) */
    } data;
} wrx_state;

If you are familiar with compilers, interpreters or assembly language you would've realised by now that the states in the wregex_t are actually the instructions for a very specific interpreter executed by the wrx_exec() function.

The start and stop members of the wregex_t are the indices of the first and last states of the NFA within the array (the first state is not necessarily at states[0] as will become apparent shortly).

Interpreter basics

I'm going to skip forward for a moment and show you what a compiled wregex_t NFA looks like. The following is the wregex_t's states array for the regular expression "ab"[3]:

start: 4; stop: 6
 0 MTC 'a' 2
 2 MTC 'b' 5
 4 REC <0> 0
 5 STP <0> 6
 6 EOM

The output above can be obtained if you compile wregex in debug mode (run make debug in your shell) and then run the test program with "ab" as parameter: ./test ab.

This is what the NFA would look like graphically:

NFA example for the regular expression

The wrx_exec() function goes through these steps to match a string against this NFA:

  1. It starts in state 4, the REC instruction, represented by the triangle in the graphic.
  2. It executes the REC instruction, and goes to state 0 (Ignore the specifics of the REC and STP instructions for now. I'll discuss them when I get to submatches below).
  3. The MTC instruction in state 0 matches the first character of the input string to the literal 'a' and then goes to state 2.
  4. The MTC instruction in state 2 matches the next character of the input string to 'b' and goes to the STP instruction in state 5 (represented by the inverted triangle).
  5. The STP instruction is executed and it proceeds to state 6.
  6. We reach the EOM instruction, which means End of Match: The input string matched the regular expression.
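
To make the steps concrete, here is a stripped-down sketch of such an interpreter loop with just the MTC and EOM opcodes. The field names are modelled on wrx_state, but this is an illustration rather than wregex's actual wrx_exec():

```c
/* A minimal bytecode interpreter for an NFA without backtracking:
 * MTC matches one literal character and moves on, EOM signals success. */
enum { MTC, EOM };

typedef struct {
    int op;     /* opcode */
    int next;   /* next state for MTC */
    char c;     /* literal to match for MTC */
} state_t;

static int run(const state_t *states, int start, const char *s)
{
    int pc = start;                  /* "program counter": the current state */
    for (;;) {
        switch (states[pc].op) {
        case MTC:
            if (*s != states[pc].c)
                return 0;            /* literal mismatch: no match */
            s++;
            pc = states[pc].next;
            break;
        case EOM:
            return 1;                /* reached the end-of-match state */
        }
    }
}
```

A program for "ab" is then just two MTC states followed by an EOM, mirroring the listing above.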

Backtracking

If you concluded that backtracking would somehow involve a stack from my description of the NFA above, then you would be correct.

We introduce the CHC ("choice") instruction to implement backtracking. It has two arguments corresponding to the two possible next states we should try. These are stored in the s[0] and s[1] members of the wrx_state structure.

When executing this instruction the engine pushes the second state, along with the current position in the input string, onto a stack and tries the first state. If a later match fails, it pops a state and input position from the stack and tries again from there. If the stack is empty when we try to pop it, we know that the input string does not match the regex.

Let's look at the states for the expression "a|b":

start: 6; stop: 8
  0 MTC 'a' 7
  2 MTC 'b' 7
  4 CHC --- 0  2
  6 REC <0> 4
  7 STP <0> 8
  8 EOM

NFA example for the regular expression

Consider what wrx_exec() should do when it encounters the input string "b":

  1. It starts at the REC instruction in state 6, which goes to state 4.
  2. The CHC instruction pushes the current position in the input string and state 2 (the value of wrx_state.s[1]) onto a stack and goes to state 0 (the value of wrx_state.s[0]). The arrow from state 4 to state 0 is drawn thick and the one from state 4 to state 2 thin to show that state 0 will be tried first. Which one to try first is an important detail for NFA-based regex engines.
  3. The MTC instruction tries to match its 'a' against the "b" in the input string. This fails, so we turn to the stack. It is not empty, so we pop the previous position in the input string along with the other state transition, 2. So, we now go to state 2.
  4. We now match the 'b' to "b". It is successful, so we go to state 7.
  5. We perform the STP instruction and go to state 8.
  6. We are now at an EOM so the match was successful.

If the input string was "d" instead, we would've been unsuccessful in step 4 too. In this case the stack would be empty, so we would know that the input string does not match the regex.
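
The stack mechanism can be sketched as a small interpreter with MTC, CHC and EOM opcodes. The layout is modelled on wrx_state but simplified; this is an illustration, not the real wrx_exec():

```c
/* A sketch of CHC-driven backtracking: on CHC we push the untried branch
 * and the current input position; on a failed MTC we pop and retry instead
 * of giving up. */
enum { MTC, CHC, EOM };

typedef struct {
    int op;
    int s0, s1;   /* next states; CHC tries s0 first, keeps s1 in reserve */
    char c;       /* literal for MTC */
} state_t;

static int run(const state_t *st, int start, const char *input)
{
    struct { int pc; const char *pos; } stack[64];
    int sp = 0;
    int pc = start;
    const char *s = input;

    for (;;) {
        switch (st[pc].op) {
        case MTC:
            if (*s == st[pc].c) {
                s++;
                pc = st[pc].s0;
                break;
            }
            if (sp == 0)
                return 0;           /* nothing to backtrack to: no match */
            sp--;                   /* backtrack: restore state and input */
            pc = stack[sp].pc;
            s = stack[sp].pos;
            break;
        case CHC:
            stack[sp].pc = st[pc].s1;   /* remember the second choice... */
            stack[sp].pos = s;          /* ...and where in the input we were */
            sp++;
            pc = st[pc].s0;             /* try the first choice now */
            break;
        case EOM:
            return 1;
        }
    }
}
```

Encoding "a|b" as two MTC states behind a CHC reproduces the walkthrough above: input "b" fails on the 'a' branch, pops the stack, and succeeds on the 'b' branch.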

Parsing

wregex implements its parser as a recursive descent parser. Here is the complete syntax of wregex as a BNF-like grammar:

pattern ::= ['^'] [list] ['$']
list    ::= element ["|" list]
element ::= ("(" [":"] list ")" | value)
            [(("*"|"+"|"?")["?"])|("{" [digit+] ["," [digit+]] "}" ["?"])]
            [element]
value   ::= (A-Za-z0-9!"#%&',-/:;=@\\_`~\r\t\n) | '<' | '>' |
            "[" ["^"] ranges "]" | "." | 'escape sequence'
ranges  ::= (c ["-" c])+
    where c is any printable ASCII character (>= 0x20)

If you are not familiar with BNF, the above states that:

  - A pattern is an optional '^', followed by an optional list, followed by an optional '$'.
  - A list is an element, optionally followed by a '|' and another list (this gives us alternation).
  - An element is either a parenthesised list (with an optional leading ':') or a value, optionally followed by one of the operators '*', '+' or '?' (each of which may in turn be followed by a '?'), or by a repetition count in curly braces; an element may be followed by another element (this gives us concatenation).
  - A value is a literal character, a '<' or '>' word marker, a character set in square brackets (optionally negated with '^'), the '.' wildcard, or an escape sequence.

What I like about recursive descent parsers is that it is very easy to implement one by hand from a grammar like the above, and to follow the code of the parser afterwards. Each of the rules in the grammar also has a corresponding function in wrx_comp.c, which organises the code nicely for maintainability.

Recursive descent parsers have drawbacks, but I'll leave that as an exercise for the reader.
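
To show how grammar rules map to functions, here is a recognizer for a cut-down version of the grammar (groups, alternation, '*' and literal letters only). The functions mirror the structure of those in wrx_comp.c, but they only validate the pattern and generate no NFA:

```c
#include <ctype.h>

/* Cut-down grammar:
 *   list    ::= element ["|" list]
 *   element ::= ("(" list ")" | letter) ["*"] [element]
 */
static const char *p;       /* current position in the pattern */

static int parse_list(void);

static int parse_element(void)
{
    if (*p == '(') {
        p++;
        if (!parse_list() || *p != ')')
            return 0;       /* unbalanced parenthesis */
        p++;
    } else if (isalpha((unsigned char)*p)) {
        p++;
    } else {
        return 0;           /* an element must start with '(' or a letter */
    }
    if (*p == '*')
        p++;                /* optional postfix operator */
    if (*p == '(' || isalpha((unsigned char)*p))
        return parse_element();   /* optional trailing element: concatenation */
    return 1;
}

static int parse_list(void)
{
    if (!parse_element())
        return 0;
    if (*p == '|') {        /* alternation: recurse on the right-hand side */
        p++;
        return parse_list();
    }
    return 1;
}

/* Returns 1 if pat is well-formed under the cut-down grammar. */
static int parse_pattern(const char *pat)
{
    p = pat;
    return parse_list() && *p == '\0';
}
```

Each rule became one function, and the recursion in the grammar became recursion in the code; that one-to-one mapping is what makes the technique so readable.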

Code Generation

The nice thing about the recursive descent parser is that it very neatly splits the regex pattern we're compiling into sub-patterns. Every level down, the parser creates a segment of the NFA and returns it to its caller.

All these instructions are stored in a dynamically growing array of the instructions I described in the section on interpreter basics. The function next_state() returns the next available index in this array, and resizes the array if it becomes full (you can think of this index as an instruction address for our bytecode interpreter).

So the caller needs some mechanism to keep track of the start and end indices of the segments generated by each of the functions it calls. It does this through the push_seg() and pop_seg() functions, which push and pop an nfa_segment structure onto and off a stack. The nfa_segment simply contains a beginning and an end index for a particular segment.
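
A minimal sketch of that bookkeeping, with fixed-size stand-ins for the real push_seg() and pop_seg() in wrx_comp.c:

```c
/* A segment records the begin and end instruction indices of a chunk of
 * the growing states array. The fixed-size stack here is a simplification;
 * the real functions live in wrx_comp.c. */
typedef struct {
    int beg, end;   /* first and last instruction index of the segment */
} nfa_segment;

static nfa_segment seg_stack[32];
static int seg_top = 0;

static void push_seg(int beg, int end)
{
    seg_stack[seg_top].beg = beg;
    seg_stack[seg_top].end = end;
    seg_top++;
}

static nfa_segment pop_seg(void)
{
    return seg_stack[--seg_top];
}
```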

As an example, consider compiling the pattern ab|c.

The parser goes down through the functions, calling the pattern(), list() and the element() functions. element() doesn't find any of the operators so it calls value() right at its end.

The first value is the character 'a', for which a MTC instruction followed by a MOV instruction is generated:

0 MTC 'a' 1
1 MOV  0

The MOV (move) instruction is just a dummy instruction that causes the interpreter to move to the state specified by its operand (it didn't occur to me at the time that JMP might have been a more appropriate name). At this point the operand is not set, so it is 0.

The beginning and end addresses of this segment (0 and 1, respectively) are pushed onto the stack through push_seg() and value() returns.

We're now back in the element() function, which calls itself recursively and ends up calling the value() function again. The parser now encounters the 'b' character, so it generates the same two instructions again after the first pair:

0 MTC 'a' 1
1 MOV  0
2 MTC 'b' 3
3 MOV  0

This new segment's addresses (2 and 3) are also pushed onto the stack, value() returns, and then the recursive call to element() returns as well.

We're now back in the initial call to the element() function. element() knows it has to concatenate two segments, so it pops them off the stack through pop_seg().

Now it uses the transition() function to tie the end of the first segment to the second segment by setting the operand of the first MOV instruction to 2:

0 MTC 'a' 1
1 MOV  2
2 MTC 'b' 3
3 MOV  0

The element() function now returns up one level to the list() function, which encounters the '|' symbol and calls element() again; this in turn encounters the 'c' and adds a new segment:

0 MTC 'a' 1
1 MOV  2
2 MTC 'b' 3
3 MOV  0
4 MTC 'c' 5
5 MOV  0

So back in the list() function, there are now two segments on the stack: 0-3 and 4-5. They get popped off. list() then generates a CHC and a MOV instruction.

The CHC instruction gets the start addresses of the two segments, 0 and 4, as operands to control the branching.

It also connects both segments' end points to the newly generated MOV instruction:

0 MTC 'a' 1
1 MOV  2
2 MTC 'b' 3
3 MOV  7
4 MTC 'c' 5
5 MOV  7
6 CHC --- 0  4
7 MOV  0

The addresses of the CHC and MOV instructions (6 and 7) are now pushed onto the stack as the segment resulting from list().

This demonstrates why the MOV instructions are necessary: they allow each function in the parser to have a single entry and exit point in the generated NFA.

list() now returns to pattern() which wraps the segment it pops from the stack in between REC and STP instructions (used for capturing groups, as described below), and pushes the new segment back onto the stack.

We're now at the top of the parser in wrx_comp(). It pops the completed NFA (which is now one big segment) off the stack and appends the EOM instruction that shows a successful match.

The completed NFA looks like this:

 0 MTC 'a' 1
 1 MOV  2
 2 MTC 'b' 3
 3 MOV  7
 4 MTC 'c' 5
 5 MOV  7
 6 CHC --- 0  4
 7 MOV  9
 8 REC <0> 6
 9 STP <0>10
10 EOM

The Other Operators

Using these principles, we can compile the *, + and ? operators. In the examples I'm about to give, P, Q, R and S are addresses of instructions, and A is an arbitrary pattern that is compiled to a segment through recursive calls.

To compile the * operator in a pattern like A*, we generate a CHC instruction that will first try to match the pattern A. It looks like this:

 start at R
 P <segment for matching A>
   ...
 Q MOV  R
 R CHC --- P  S
 S ...

NFA example

The A+ pattern is then just a simple modification: We start at P rather than R to ensure that the input string matches A at least once.

 start at P
 P <segment for matching A>
   ...
 Q MOV  R
 R CHC --- P  S
 S ...

NFA example

The ? operator works a lot like the *, but the MOV instruction at Q jumps directly to S instead of going through the CHC at R:

 start at R
 P <segment for matching A>
   ...
 Q MOV  S
 R CHC --- P  S
 S ...

NFA example

Optimizer

The last step of the compiler is to call the optimize() function, which removes all the MOV instructions, since they have now become redundant: if an instruction's destination is a MOV, we just change that destination to the MOV's destination. For our "ab|c" example this produces:

start: 8; stop: 10
 0 MTC 'a' 2
 2 MTC 'b' 9
 4 MTC 'c' 9
 6 CHC --- 0  4
 8 REC <0> 6
 9 STP <0>10
10 EOM
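
The pass itself can be sketched with a simplified instruction layout (MTC, MOV and EOM only). This is an illustration of the idea, not wregex's actual optimize():

```c
/* MOV elimination: follow every chain of MOV instructions so that each
 * real instruction points directly at the next real instruction. The MOVs
 * themselves then simply become unreachable. */
enum { MTC, MOV, EOM };

typedef struct {
    int op;
    int next;   /* destination state */
    char c;
} state_t;

/* Follow MOVs from state i to the first non-MOV state. */
static int resolve(const state_t *st, int i)
{
    while (st[i].op == MOV)
        i = st[i].next;
    return i;
}

static void optimize(state_t *st, int n)
{
    for (int i = 0; i < n; i++)
        if (st[i].op == MTC)   /* only real transitions keep their targets */
            st[i].next = resolve(st, st[i].next);
}
```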

Extra details

If you followed the above, then the extra features are quite straightforward to implement.

"lazy"/"non-greedy" operators

Implementing the non-greedy *? and +? operators is laughably simple: you just swap the order in which the two next states of the CHC instruction are tried.

Remember how I showed that we try state B first in step 3 of my description of NFAs above? With that setup, the regex engine will always try to consume another 'b' character from state X before it tries a 'c', making it greedy for more 'b's.

In order to make the '*' operator non-greedy, the regex engine simply has to try state C first. The following graphic shows the NFA for the regular expression "ab*?c":

NFA example for the regular expression

I've drawn the arrow from X to C thick and the arrow from X to B thin this time to show the difference.

In this case, once you're in state X after you've read an 'a', the regex will always try to match a 'c' first and go to the "stop" state before it tries to match any 'b' characters.

The function weaken() in wrx_comp.c swaps the state transitions of an NFA if a ? is encountered after a * or a +.
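
The swap itself is tiny. Using the wrx_state layout shown earlier (reduced to the relevant fields), a sketch of the operation weaken() performs on a CHC state:

```c
/* s[0] is the transition tried first, s[1] the backtracking branch.
 * Swapping them turns a greedy choice into a non-greedy one. */
typedef struct {
    char op;
    short s[2];
} wrx_state;

static void weaken(wrx_state *st)
{
    short tmp = st->s[0];
    st->s[0] = st->s[1];
    st->s[1] = tmp;
}
```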

Curly braces

I cheated a bit with expressions containing curly braces, like A{m,n}: They are implemented by repeating the preceding expression A n times. For example, A{1,4} is handled internally as 'AA?A?A?'.

The function duplicate() in wrx_comp.c takes care of the details.
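
The rewrite can be sketched at the pattern-string level. wregex's duplicate() copies already-generated NFA segments rather than characters, but the idea is the same; expand_braces() is an illustrative name, and A is restricted here to a single character:

```c
#include <stddef.h>

/* Expand A{m,n} into m mandatory copies of A followed by (n - m) optional
 * copies "A?", writing the result into out (at most outsz bytes). */
static void expand_braces(char a, int m, int n, char *out, size_t outsz)
{
    size_t len = 0;
    for (int i = 0; i < m && len + 2 < outsz; i++)
        out[len++] = a;                 /* mandatory repetitions */
    for (int i = m; i < n && len + 3 < outsz; i++) {
        out[len++] = a;                 /* optional repetitions: A? */
        out[len++] = '?';
    }
    out[len] = '\0';
}
```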

Character sets

Character sets are implemented as 256-bit bit vectors. Each bit in the bit vector corresponds to an ASCII character. The SET opcode compares the current input character against the bit vector.

This is the reason why wregex cannot support Unicode in its implementation.

A set like [a] will have the bits associated with both 'a' and 'A' set in case-insensitive mode.

The '.' operator uses a bit vector with all bits set except for the bits associated with the ASCII control characters.
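
A sketch of such a 256-bit set; the names set_add(), set_add_range() and set_test() are illustrative, not wregex's actual API:

```c
/* One bit per ASCII code, stored in 32 bytes. The SET opcode tests the
 * current input character against a vector like this. */
typedef struct {
    unsigned char bits[32];   /* 256 bits */
} charset;

static void set_add(charset *cs, unsigned char c)
{
    cs->bits[c >> 3] |= (unsigned char)(1u << (c & 7));
}

static void set_add_range(charset *cs, unsigned char lo, unsigned char hi)
{
    for (unsigned c = lo; c <= hi; c++)
        set_add(cs, (unsigned char)c);
}

static int set_test(const charset *cs, unsigned char c)
{
    return (cs->bits[c >> 3] >> (c & 7)) & 1;
}
```

A set like [a-z] thus becomes 26 set bits, and membership testing is a single shift and mask, regardless of how many ranges the set contains.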

Case insensitivity

wregex uses the '\i' and '\I' control sequences to enable and disable case-insensitivity, rather than depending on flags passed to wrx_comp(). I wanted wregex to be usable as the regex engine of a scripting language, so doing it this way gave control to the person writing the scripts rather than the person implementing the scripting language.

These control sequences toggle a flag in the compiler which controls whether MTC or MCI opcodes are generated and how the bit vectors of the character sets are generated.

Special operators

Special operators like '^', '$', '<' and '>' all have special opcodes that cause wrx_exec() to check for the beginning and end of lines and words, respectively.

Submatches and backreferences

The REC ("record") and STP ("stop recording") instructions are used to record submatches.

Submatches are stored in an array of wregmatch_t structures while the engine is executing. Each wregmatch_t structure contains a pointer to the beginning and the end (called beg and end respectively) of the current submatch.

When the REC instruction is encountered the engine sets the beg pointer to the current position in the input string, and when the STP instruction is encountered it sets the end pointer. When wrx_exec() is done, the wregmatch_t array contains the correct values.

The wregex_t has a member n_subm that counts the opening brackets to ensure that the correct wregmatch_t structure's value is set.

All NFAs are surrounded by a "REC <0> - STP <0>" instruction pair to capture the entire matching string into submatch 0.

For an example, consider how the pattern a(b*)c is represented in an NFA:

start: 10; stop: 12
  0 MTC 'a' 6
  2 MTC 'b' 4
  4 CHC --- 2  7
  6 REC <1> 4
  7 STP <1> 8
  8 MTC 'c'11
 10 REC <0> 0
 11 STP <0>12
 12 EOM

NFA example for the regular expression

At state 6 the REC instruction records the current position in the input string as the start of submatch 1. It then proceeds to state 4, the CHC instruction, which uses the backtracking mechanism with state 2 to read a sequence of 'b' symbols.

Once it cannot read any more 'b's, it jumps to state 7, which records the current input position as the end of submatch 1, and proceeds to state 8 to read the 'c' symbol at the end of the pattern.

Backreferences are then simply implemented by matching the input string at the current position against the corresponding submatch.
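
A sketch of both mechanisms: wregmatch_t mirrors the structure described above, while record_beg(), record_end() and match_backref() are illustrative stand-ins for what the REC, STP and backreference instructions do:

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *beg;  /* start of the submatch in the input */
    const char *end;  /* one past the end of the submatch */
} wregmatch_t;

static void record_beg(wregmatch_t *subm, int idx, const char *pos)
{
    subm[idx].beg = pos;   /* what REC does for submatch idx */
}

static void record_end(wregmatch_t *subm, int idx, const char *pos)
{
    subm[idx].end = pos;   /* what STP does for submatch idx */
}

/* Returns the number of characters matched if the input at pos repeats
 * submatch idx, or -1 if it does not. */
static int match_backref(const wregmatch_t *subm, int idx, const char *pos)
{
    size_t len = (size_t)(subm[idx].end - subm[idx].beg);
    return strncmp(pos, subm[idx].beg, len) == 0 ? (int)len : -1;
}
```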

Conclusion

I hope this article gave you some insight into how regex engines work.

I remember that it seemed so complex when I implemented wregex originally way back, but the code seems so simple looking at it now.

References

Below are some of the references I used when I implemented wregex:

I wrote the original implementation in 2007, so some of these links may have gone stale. Apologies in advance for that.

While writing this entry, I also noticed that Russ Cox wrote a new article in the intervening years titled Regular Expression Matching: the Virtual Machine Approach, which covers a lot of the topics I've discussed here and discusses several alternative approaches, optimizations and practical considerations.


  1. Broadly speaking - the rest are variants and combinations of these.
  2. You'll notice that the POSIX regex API, regex(3), specifies functions for similar purposes.
  3. You'll notice that some of the indices, like 1 and 3, are missing. This is an artifact of the compilation process: those elements in the array contain states that are never reached.