How do regular expressions work behind the scenes (at the CPU level)?

Tags:

Do interpreters and compilers compare (and ultimately match) two strings for a potential match in a character-by-character and left-to-right fashion? Or is there an underlying binary value (e.g., a bit pattern) assigned to each string in a comparison function? Or does it depend on the string being encoded in a certain way (ASCII or UTF-32), or the interpreter, compiler, database engine, or programming language?

Redesigning the data store (data files or databases) is a considerable effort. The answer to a similar question on stackoverflow didn't definitively describe the encoding question (whether bit patterns were being evaluated or actual alphabetic characters). The answer to this question could be important for an optimization effort.

I don't want to know how to implement a regular expression (e.g., write my own). I want to know for educational purposes for the benefit of using existing regular expressions in an optimal way (e.g., when it is time to design data to be stored as a composition of substrings, should I be mindful of the left to right evaluation). A similar StackOverflow question's answer (which is a link that has an untrusted certificate to view it) focuses on finite automata (the theory of how strings are compared). That answer emphasizes how it can work and the computational complexity of comparing strings. It does imply that there is a left-to-right character evaluation. I don't think it was definitive by any means. The article was largely specific to Perl and the language agnostic Thomson non-deterministic finite automata algorithm. I would like to know for sure with these three technology combinations: 1) Java native functions using ASCII data files, 2) MySQL (table data and SELECT statements), and 3) with Python native functions and UTF-32 data files.

My question and approach is different from the older post in that I am not trying to develop a parser for doing regexes. I'm trying to architect data for future development. I want to know how to utilize existing regex tools in an optimal way. I believe stackoverflow is the right forum because it is central to regexes, and this question in its original and less verbose form has been voted up.

I want to know at the CPU level, are bit patterns the representations of the characters in the string? Is there a short-lived index of the bit patterns corresponding to each character of the strings participating in the comparisons wherein one string is anchored? I would think the technology (e.g., the database, the programming language, and/or the encoding of the data) would vary.

744

asked May 23 '15 17:05

Yousef

2 Answers

There are two big families of regex engines, called NFA and DFA (I'm using the terminology from Jeffrey Friedl's book):

Nondeterministic finite automaton
Deterministic finite automaton

A NFA implementation will roughly work the following way:

Keep a pointer to a current offset in the input string
Keep a pointer to the current position in the pattern (which is interpreted as a graph or tree).

Then use the pattern as a recipe of how to advance in the input string. If the pattern says a for instance, and if the current input offset points to an a character, then that character will be consumed and both pointers will advance to the next position. If the character doesn't match, there will be a backtrack (the input pointer will go to a previous valid position and the pattern pointer will be set to a different possible alternative at the input position).

The point is that the recognition is driven by the pattern.

(the above explanation is very rough, as it doesn't include optimizations etc - and modern patterns cannot be implemented with a formal automaton anyway)

A DFA implementation works the other way around:

There is still one input pointer, but there are multiple pattern pointers. The input pattern will advance character by character, and the pattern pointers will keep track of a valid state in the pattern for the given input.

The recognition is driven by the input.

Both these methods have very different properties:

NFA engines can offer much more features, but their running time is dependent on the combination of the input and the pattern itself
DFA engines offer less features, but their complexity is O(n), where n is the length of the input string.

Some regex engines (such as PCRE) can implement both recognition methods. I recommend you read the PCRE docs about the two matching algorithms, which explain the differences in more technical terms.

As to the actual implementation, it highly depends on the regex engine itself. PCRE has several of them:

A NFA algorithm based on a tree traversal approach
An optimized version of the above based on JIT compilation (one version for each supported instruction set)
A DFA implementation

So you can actually see there are several possible approaches for NFA alone. Other engines have different implementations that allow for a different feature set. For instance, .NET's regexes can be matched left-to-right, or right-to-left and thus can easily provide variable-length lookbehind, whereas PCRE's lookbehind is implemented by temporarily shifting the input pointer to the left by the lookbehind's expected input length, and performing a left-to-right match from this offset.

115

answered Oct 05 '22 01:10

Lucas Trzesniewski

Regular expressions don't specify implementation details. It's up to the language to decide the best way to compile them to machine code. For example, this regex.c implementation looks like it goes more than one character at a time.

answered Oct 05 '22 01:10

JDong

Related questions
                            
                                Regex for fixed length string of character A followed by character B
                            
                                regex lookarounds, combining lookahead and lookbehind
                            
                                Escaping plus signs doesn't work
                            
                                Python regex -- extraneous matchings
                            
                                Value.match() Regex in Google Refine
                            
                                Remove commas in a string, surrounded by a comma and double quotes / Python
                            
                                Capture the match content of two different regexp in perl
                            
                                htaccess - Rewrite files that have a directory with the same name
                            
                                Java regular expression to split string
                            
                                Regex for a valid 32-bit signed integer
                            
                                Reading std::string, remove all special characters from a std::string
                            
                                web.config rewrite to redirect html files but exclude certain folders
                            
                                Regex that matches anything not ending in .json
                            
                                Should you initialize $matches before calling preg_match?
                            
                                Regex multiple match substring
                            
                                How to Generate a String from a String Template in Java?
                            
                                Regex must start with a letter or number, string can only have letters, numbers or slashes, and no double slashes
                            
                                Github Issues Partial Match
                            
                                Regular Expressions in MarkLogic's xQuery
                            
                                Powershell script to remove double quotes from CSV unless comma exists inside double quotes

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How do regular expressions work behind the scenes (at the CPU level)?

Tags:

string

regex

optimization

filter

Yousef

People also ask

2 Answers

Lucas Trzesniewski

JDong

Recent Activity

Donate For Us