Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Capture groups using DFA-based (linear-time) regular expressions: possible?

Is it possible to implement capture groups with DFA-based regular expressions while maintaining a linear time complexity with respect to the input length?

Intuitively I think not, because the subset construction procedure does not know which capture group it may have landed inside, but this is the first time I've realized this may be a potential problem, so I don't know.

like image 877
user541686 Avatar asked Mar 09 '15 11:03

user541686


People also ask

Can all DFA be converted to regex?

Every DFA/NFA can be converted to a regular expression and there are standard techniques for doing this.

Is DFA a regular expression?

It turns out that for any regular expression, a deterministic finite automaton (DFA) can be constructed that recognizes any string that the regular expression describes in time linear in the length of the string! (For those who care about language: "automaton" is the singular, "automata" is the plural.

How do you capture a regular expression?

To capture all matches to a regex group we need to use the finditer() method. The finditer() method finds all matches and returns an iterator yielding match objects matching the regex pattern. Next, we can iterate each Match object and extract its value.


2 Answers

Is it possible to implement capture groups with DFA-based regular expressions while maintaining a linear time complexity with respect to the input length?

Yes - at least when the capture groups are deterministic. Consider the example regex /a|(a)/; matching that against the input "a" could either produce a captured group or none.

I think that capture groups could be based on a theoretical foundation using finite state transducers, which are like automatons but also may output strings while changing states. You may echo the input, but surround each capture group with parenthesis for example.

Intuitively I think not, because the subset construction procedure does not know which capture group it may have landed inside, but this is the first time I've realized this may be a potential problem, so I don't know.

Indeed, this is a problem. I think you can solve it by tagging your sets with the capture state, and similarly distinguish the states of your result DFA. You may fail to produce a fully deterministic automaton for regular expression like the above, as Wikipedia writes: "some non-deterministic transducers do not admit equivalent deterministic transducers".

However, a modification of the subset construction procedure is possible, see Determinization of Transducers. Their algorithm seems to revolve around the following:

local ambiguities […] are solved by delaying the outputs as far as needed, until these symbols can be written out deterministically.

For example, the regexes /ab|(a)c/ and even /(a[bc])|ad/ can be resolved into deterministic transducers. Notice that their memory representation may be much larger than if they had no capture groups.

like image 122
Bergi Avatar answered Oct 25 '22 16:10

Bergi


My http://github.com/hoehrmann/demo-parselov does this. I do not currently explain the construction on the web page, but suppose you have a grammar like

X = "a" B "c"
B = "b"

You can turn this regular grammar into a graph with labeled vertices

  1. start X
  2. "a"
  3. start B
  4. "b"
  5. final B
  6. "c"
  7. final X

DFA states correspond to sets of these vertices. The first one would consist of vertices 1 and 2, the second one of vertices 3 and 4, then 5 and 6, and finally 7. If you parse the string "abc", you have

  1. { offset: 0, vertices: [1, 2] }
  2. { offset: 1, vertices: [3, 4] }
  3. { offset: 2, vertices: [5, 6] }
  4. { offset: EOF, vertices: [7] }

That is also a graph. You can write out the edges using (offset, vertex) pairs as vertices:

  1. (o0, v1) -> (o0, v2)
  2. (o0, v2) -> (o1, v3)
  3. (o1, v3) -> (o1, v4)
  4. ...

Such a graph might contain vertices that do not ultimately reach the final vertex (EOF, v7), but such vertices can be eliminated in O(n) time. If the grammar is ambiguous, a match would be a path through the resulting graph. There may be many possible paths.

like image 34
Björn Höhrmann Avatar answered Oct 25 '22 17:10

Björn Höhrmann