Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What kind of formal languages can modern regex engines parse?

Here on SO people sometimes say something like "you cannot parse X with regular expressions, because X is not a regular language". From my understanding however, modern regular expressions engines can match more than just regular languages in Chomsky's sense. My questions:

given a regular expression engine that supports

  • backreferences
  • lookaround assertions of unlimited width
  • recursion, like (?R)

what kind of languages can it parse? Can it parse any context-free language, and if not, what would be the counterexample?

(To be precise, by "parse" I mean "build a single regular expression that would accept all strings generated by the grammar X and reject all other strings").

Add.: I'm particularly interested to see an example of a context-free language that modern regex engines (Perl, Net, python regex module) would be unable to parse.

like image 677
georg Avatar asked Jul 03 '12 07:07

georg


People also ask

What languages use regex?

Regex support is part of the standard library of many programming languages, including Java and Python, and is built into the syntax of others, including Perl and ECMAScript. Implementations of regex functionality is often called a regex engine, and a number of libraries are available for reuse.

Does regex work for other languages?

Short answer: yes.

What are regex engines?

A regex engine executes the regex one character at a time in left-to-right order. This input string itself is parsed one character at a time, in left-to-right order. Once a character is matched, it's said to be consumed from the input, and the engine moves to the next input character. The engine is by default greedy.

What is regex and parsing?

The Parse Regex operator (also called the extract operator) enables users comfortable with regular expression syntax to extract more complex data from log lines. Parse regex can be used, for example, to extract nested fields.


1 Answers

I recently wrote a rather long article on this topic: The true power of regular expressions.

To summarize:

  • Regular expressions with support for recursive subpattern references can match all context-free languages (e.g a^n b^n).
  • Regular expressions with lookaround assertions and subpattern references can match at least some context-sensitive languages (e.g. ww and a^n b^n c^n).
  • If the assertions have unlimited width (as you say), then all context-sensitive grammars can be matched. I don't know any regex flavor though that does not have fixed-width restrictions on lookbehind (and at the same time supports subpattern references).
  • Regular expressions with backreferences are NP-complete, so any other NP problem can be solved using regular expressions (after applying a polynomial-time transformation).

Some examples:

  • Matching the context-free language {a^n b^n, n>0}:

    /^(a(?1)?b)$/ # or /^ (?: a (?= a* (\1?+ b) ) )+ \1 $/x 
  • Matching the context-sensitive language {a^n b^n c^n, n>0}:

    /^     (?=(a(?-1)?b)c)     a+(b(?-1)?c) $/x # or /^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $/x 
like image 187
NikiC Avatar answered Sep 21 '22 01:09

NikiC