Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do regular expressions work behind the scenes (at the CPU level)?

Do interpreters and compilers compare (and ultimately match) two strings for a potential match in a character-by-character and left-to-right fashion? Or is there an underlying binary value (e.g., a bit pattern) assigned to each string in a comparison function? Or does it depend on the string being encoded in a certain way (ASCII or UTF-32), or the interpreter, compiler, database engine, or programming language?

Redesigning the data store (data files or databases) is a considerable effort. The answer to a similar question on stackoverflow didn't definitively describe the encoding question (whether bit patterns were being evaluated or actual alphabetic characters). The answer to this question could be important for an optimization effort.

I don't want to know how to implement a regular expression (e.g., write my own). I want to know for educational purposes for the benefit of using existing regular expressions in an optimal way (e.g., when it is time to design data to be stored as a composition of substrings, should I be mindful of the left to right evaluation). A similar StackOverflow question's answer (which is a link that has an untrusted certificate to view it) focuses on finite automata (the theory of how strings are compared). That answer emphasizes how it can work and the computational complexity of comparing strings. It does imply that there is a left-to-right character evaluation. I don't think it was definitive by any means. The article was largely specific to Perl and the language agnostic Thomson non-deterministic finite automata algorithm. I would like to know for sure with these three technology combinations: 1) Java native functions using ASCII data files, 2) MySQL (table data and SELECT statements), and 3) with Python native functions and UTF-32 data files.

My question and approach is different from the older post in that I am not trying to develop a parser for doing regexes. I'm trying to architect data for future development. I want to know how to utilize existing regex tools in an optimal way. I believe stackoverflow is the right forum because it is central to regexes, and this question in its original and less verbose form has been voted up.

I want to know at the CPU level, are bit patterns the representations of the characters in the string? Is there a short-lived index of the bit patterns corresponding to each character of the strings participating in the comparisons wherein one string is anchored? I would think the technology (e.g., the database, the programming language, and/or the encoding of the data) would vary.

like image 744
Yousef Avatar asked May 23 '15 17:05

Yousef


People also ask

How does a regular expression work?

Regular expression is not a library nor is it a programming language. Instead, regular expression is a sequence of characters that specifies a search pattern in any given text (string). A text can consist of pretty much anything from letters to numbers, space characters to special characters.

What is a regular expression and what would you use it for?

Regular expressions are useful in search and replace operations. The typical use case is to look for a sub-string that matches a pattern and replace it with something else. Most APIs using regular expressions allow you to reference capture groups from the search pattern in the replacement string.

How do regular expressions work in JavaScript?

Regular expressions are a sequence of characters that are used for matching character combinations in strings for text matching/searching. In JavaScript, regular expressions are search patterns (JavaScript objects) from sequences of characters. RegExp makes searching and matching of strings easier and faster.

What is the role of regular expressions in text processing explain?

A regular expression (RE) is a language for specifying text search strings. It helps us to match or extract other strings or sets of strings, with the help of a specialized syntax present in a pattern.


2 Answers

There are two big families of regex engines, called NFA and DFA (I'm using the terminology from Jeffrey Friedl's book):

  • Nondeterministic finite automaton
  • Deterministic finite automaton

A NFA implementation will roughly work the following way:

  • Keep a pointer to a current offset in the input string
  • Keep a pointer to the current position in the pattern (which is interpreted as a graph or tree).

Then use the pattern as a recipe of how to advance in the input string. If the pattern says a for instance, and if the current input offset points to an a character, then that character will be consumed and both pointers will advance to the next position. If the character doesn't match, there will be a backtrack (the input pointer will go to a previous valid position and the pattern pointer will be set to a different possible alternative at the input position).

The point is that the recognition is driven by the pattern.

(the above explanation is very rough, as it doesn't include optimizations etc - and modern patterns cannot be implemented with a formal automaton anyway)

A DFA implementation works the other way around:

There is still one input pointer, but there are multiple pattern pointers. The input pattern will advance character by character, and the pattern pointers will keep track of a valid state in the pattern for the given input.

The recognition is driven by the input.

Both these methods have very different properties:

  • NFA engines can offer much more features, but their running time is dependent on the combination of the input and the pattern itself
  • DFA engines offer less features, but their complexity is O(n), where n is the length of the input string.

Some regex engines (such as PCRE) can implement both recognition methods. I recommend you read the PCRE docs about the two matching algorithms, which explain the differences in more technical terms.

As to the actual implementation, it highly depends on the regex engine itself. PCRE has several of them:

  • A NFA algorithm based on a tree traversal approach
  • An optimized version of the above based on JIT compilation (one version for each supported instruction set)
  • A DFA implementation

So you can actually see there are several possible approaches for NFA alone. Other engines have different implementations that allow for a different feature set. For instance, .NET's regexes can be matched left-to-right, or right-to-left and thus can easily provide variable-length lookbehind, whereas PCRE's lookbehind is implemented by temporarily shifting the input pointer to the left by the lookbehind's expected input length, and performing a left-to-right match from this offset.

like image 115
Lucas Trzesniewski Avatar answered Oct 05 '22 01:10

Lucas Trzesniewski


Regular expressions don't specify implementation details. It's up to the language to decide the best way to compile them to machine code. For example, this regex.c implementation looks like it goes more than one character at a time.

like image 37
JDong Avatar answered Oct 05 '22 01:10

JDong