Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How a RegEx engine works [closed]

Tags:

regex

theory

In learning Regular Expressions it had me wondering how the underlying engine works. Probably more specifically, I'd like to know more about how it evalutates, prioritizies and parses the expression. I feel the RegEx engine is a blackbox to me, and I would really enjoy deciphering it.

So I'd like to ask if there are some great resources that I could read up on that discuss RegEx engine theory.

*Note: I am not interested in building an engine, just learning the inner workings of it.

like image 524
Robb Avatar asked Sep 01 '10 21:09

Robb


People also ask

Is regex faster than for loop?

Regex is faster for large string than an if (perhaps in a for loops) to check if anything matches your requirement.

How does regex engine work?

A regex engine executes the regex one character at a time in left-to-right order. This input string itself is parsed one character at a time, in left-to-right order. Once a character is matched, it's said to be consumed from the input, and the engine moves to the next input character. The engine is by default greedy.


1 Answers

There are two main classes of regex engines.

  1. Those based on Finite State Automaton. These are generally the fastest. They work by building a state machine, and feeding it characters from the input string. It is difficult, if not impossible, to implement some more advanced features in engines like this.

    Examples of FSA based engines:

    • Posix/GNU ERE/BRE — Used in most unix utilities, such as grep, sed and awk.
    • Re2 — A relatively new project for trying to give more power to the Automata based method.
       
  2. Those based on back-tracking. These often compile the pattern into byte-code, resembling machine instructions. The engine then executes the code, jumping from instruction to instruction. When an instruction fails, it then back-tracks to find another way to match the input.

    Examples of back-tracking based engines:

    • Perl — The original. Most other engines of this type try to replicate the functionality of regexes in the Perl language.
    • PCRE — The most successful implementation. This library is the most widely used implementation. It has a rich set of features, some of which can't be considered as "Regular" any more.
    • Python, Ruby, Java, .NET — Other implementations I don't intend to describe further.

For more information:

  • regular-expressions.info - Tutorial
  • regular-expressions.info - Flavor comparison
  • swtch.com - Implementing Regular Expressions — A good set of articles about effective, Automata based, regular expressions.

If you want me to expand on something, post a comment.

like image 82
Markus Jarderot Avatar answered Sep 23 '22 11:09

Markus Jarderot