
Algorithmic complexity of Regular Languages in Extended Regular Language frameworks

I have a bit of a background in formal languages, and I recently found out that Java and other languages use what are called extended regular expressions. Because of my background, I had always assumed that in a language like Java, calling compile on a Pattern produced a DFA or transducer in the background. As a result, I assumed that no matter how ugly or how long my regex was, Pattern.matches and similar methods would run in linear time. But that assumption seems to be incorrect.

A post I read seems to suggest that some regex expressions do run in linear time, but I don't entirely trust a single source.

I'll eventually write my own formal regular expression library for Java (the existing ones I've found are all GNU GPL licensed), but in the meantime I have a few questions about the time complexity of Java/C# regexes. I want to make sure what I read elsewhere is correct.

Questions:

  1. For a formal regular language such as \sRT\s., would the match method of a regex in Java/C# solve the membership problem in linear or non-linear time?
  2. In general, how can I tell whether a regex engine solves the membership problem for a given regular language's expression in linear time?
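To make question 1 concrete, here is a minimal sketch exercising the pattern above with Java's Pattern API (the pattern string is taken verbatim from the question; the test inputs are made up for illustration):

```java
import java.util.regex.Pattern;

public class LinearMatchDemo {
    public static void main(String[] args) {
        // The pattern from question 1: whitespace, "RT", whitespace, any char.
        Pattern p = Pattern.compile("\\sRT\\s.");
        // There is no alternation and no nested quantifier here, so the
        // backtracking engine never has more than one way to match each
        // position; in practice this runs in time linear in the input.
        System.out.println(p.matcher(" RT x").matches()); // true
        System.out.println(p.matcher("RT x").matches());  // false
    }
}
```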

I do text analysis, so finding out that Java regexes aren't DFAs was a real downer.

asked Dec 11 '13 by Lan

2 Answers

Because of my background, I had always assumed in languages like Java when I call compile for Pattern it produced a DFA or Transducer in the background.

This belief is common among academics. In practice, regular expression compilation does not produce a DFA and then execute it. I have only a small amount of experience in this area; I worked briefly on the regular expression compilation system in Microsoft's implementation of JavaScript in the 1990s. We chose to compile the "regular" expression into a simple domain-specific bytecode language and then built an interpreter for that language.

As you note, this can lead to situations where repeated backtracking has exponentially bad time behavior in the length of the input, but the construction of the compiled state is essentially linear in the size of the expression.
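The exponential backtracking behavior can be observed directly. The pattern below is the standard textbook example of nested quantifiers, not anything specific to the bytecode interpreter described above:

```java
import java.util.regex.Pattern;

public class BacktrackDemo {
    public static void main(String[] args) {
        // Nested quantifiers: the engine can split a run of 'a's between
        // the inner and outer '+' in exponentially many ways, and it
        // tries them all before conceding that there is no final 'b'.
        Pattern evil = Pattern.compile("(a+)+b");
        String input = "a".repeat(21); // no 'b', so the match must fail

        long start = System.nanoTime();
        boolean matched = evil.matcher(input).matches(); // false
        long ms = (System.nanoTime() - start) / 1_000_000;

        // Each additional 'a' roughly doubles the running time; a DFA
        // for the same (regular!) language rejects in one linear pass.
        System.out.println("matched=" + matched + " in " + ms + " ms");
    }
}
```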

So let me counter your questions with two questions of my own -- and I note that these are genuine questions, not rhetorical ones.

1) Every truly regular expression corresponds to an NFA with, let's say, n states. The corresponding DFA might require up to 2^n states, one for each subset of the NFA's states. So what stops the time taken to construct the DFA from being exponential in pathological cases? The runtime might be linear in the input, but if the construction is exponential in the pattern size then basically you're just trading one nonlinearity for another.
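The 2^n blowup can be counted concretely for the classic family (a|b)*a(a|b){k-1}, "the k-th symbol from the end is 'a'", whose NFA has k+1 states but whose DFA needs 2^k. The sketch below hard-codes that NFA's transitions and runs the textbook subset construction; it is an illustration, not any regex library's internals:

```java
import java.util.*;

public class SubsetBlowup {
    // Count reachable DFA states produced by the subset construction for
    // the NFA of (a|b)*a(a|b){k-1}.
    static int dfaStates(int k) {
        // NFA states 0..k: state 0 loops on a/b and also moves to 1 on 'a';
        // states 1..k-1 move to the next state on either symbol; state k
        // is accepting with no outgoing transitions.
        Set<Set<Integer>> seen = new HashSet<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = Set.of(0);
        seen.add(start);
        work.add(start);
        while (!work.isEmpty()) {
            Set<Integer> cur = work.poll();
            for (char sym : new char[] {'a', 'b'}) {
                Set<Integer> next = new HashSet<>();
                for (int s : cur) {
                    if (s == 0) {
                        next.add(0);
                        if (sym == 'a') next.add(1);
                    } else if (s < k) {
                        next.add(s + 1);
                    }
                }
                if (seen.add(next)) work.add(next);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Prints 2, 4, 8, ..., 1024: exponential in the pattern size k.
        for (int k = 1; k <= 10; k++)
            System.out.println("k=" + k + " -> " + dfaStates(k) + " DFA states");
    }
}
```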

2) So-called "regular" expressions these days are nothing of the sort; they can do parenthesis matching. They correspond to pushdown automata, not nondeterministic finite automata. Is there a linear algorithm for constructing the corresponding pushdown automaton for a "regular" expression?
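One quick illustration of that extra power, using Java's backreference syntax (a sketch; the pattern is my own example, not from the discussion above):

```java
import java.util.regex.Pattern;

public class BackrefDemo {
    public static void main(String[] args) {
        // \1 must repeat exactly what group 1 captured, so this matches
        // the language a^n b a^n -- something no finite automaton can
        // recognize, since it requires unbounded counting.
        Pattern p = Pattern.compile("(a+)b\\1");
        System.out.println(p.matcher("aaabaaa").matches()); // true
        System.out.println(p.matcher("aaabaa").matches());  // false
    }
}
```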

answered by Eric Lippert


Regular expressions are implemented in many languages as backtracking NFA simulations, which is what makes features like backreferences possible (see http://msdn.microsoft.com/en-us/library/e347654k(v=vs.110).aspx). Because of backtracking, you can construct regular expressions which have terrible performance on some strings (see http://www.regular-expressions.info/catastrophic.html).
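In Java specifically, one pragmatic mitigation is the possessive quantifier (or equivalently an atomic group), which forbids the engine from giving back characters it has already consumed. A minimal sketch, assuming the pathological shape discussed in the catastrophic-backtracking link:

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    public static void main(String[] args) {
        String input = "a".repeat(50); // no trailing 'b', so no match

        // 'a++' is possessive: once it has consumed the run of 'a's it
        // never backtracks into it, so failure is detected in a single
        // linear pass rather than an exponential search.
        Pattern fast = Pattern.compile("a++b");
        System.out.println(fast.matcher(input).matches()); // false
    }
}
```

Note that possessive quantifiers change which strings can match (they prune alternatives), so they are safe only when the pruned alternatives could never have led to a match anyway, as here.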

As far as analyzing regular expressions to determine their performance, I doubt there is a good general method. You can look for warning signs, like the compounded backtracking in the second link's example, but even that can be hard to detect reliably in some cases.

answered by Sean