Capture groups using DFA-based (linear-time) regular expressions: possible?

Tags: regex, finite-automata, dfa

Is it possible to implement capture groups with DFA-based regular expressions while maintaining a linear time complexity with respect to the input length?

Intuitively I think not, because the subset construction procedure does not know which capture group it may have landed inside, but this is the first time I've realized this may be a potential problem, so I don't know.


asked Mar 09 '15 11:03

user541686

2 Answers

"Is it possible to implement capture groups with DFA-based regular expressions while maintaining a linear time complexity with respect to the input length?"

Yes, at least when the capture groups are deterministic. Consider the regex /a|(a)/ as a counterexample: matching it against the input "a" could either produce a captured group or none.

I think that capture groups could be given a theoretical foundation using finite state transducers, which are like automata but may also output strings while changing states. For example, you could echo the input but surround each capture group with parentheses.
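A minimal sketch of that idea, assuming the regex /a(b*)c/ (chosen here for illustration, it is not from the question). The transition table and the names TRANSITIONS, ACCEPTING and transduce are hypothetical; each transition consumes one character and emits a string, so the run stays linear in the input length.

```python
# Hypothetical deterministic transducer for /a(b*)c/.
# Each entry maps (state, input char) -> (next state, emitted string).
TRANSITIONS = {
    (0, "a"): (1, "a("),   # the group opens right after "a"
    (1, "b"): (1, "b"),    # inside the group: echo the input
    (1, "c"): (2, ")c"),   # the group closes just before "c"
}
ACCEPTING = {2}


def transduce(text, transitions=TRANSITIONS, accepting=ACCEPTING, start=0):
    """Echo text with the capture group parenthesised, or return None."""
    state = start
    out = []
    for ch in text:
        step = transitions.get((state, ch))
        if step is None:       # no transition: the input does not match
            return None
        state, emitted = step
        out.append(emitted)
    return "".join(out) if state in accepting else None


print(transduce("abbc"))
```

Here every output can be written immediately because it is always clear whether the current position lies inside the group.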

"Intuitively I think not, because the subset construction procedure does not know which capture group it may have landed inside, but this is the first time I've realized this may be a potential problem, so I don't know."

Indeed, this is a problem. I think you can solve it by tagging the state sets with the capture state, and similarly distinguishing the states of the resulting DFA. You may fail to produce a fully deterministic automaton for regular expressions like the one above; as Wikipedia writes: "some non-deterministic transducers do not admit equivalent deterministic transducers".

However, a modification of the subset construction procedure is possible; see Determinization of Transducers. Their algorithm seems to revolve around the following:

"local ambiguities […] are solved by delaying the outputs as far as needed, until these symbols can be written out deterministically."

For example, the regexes /ab|(a)c/ and even /(a[bc])|ad/ can be resolved into deterministic transducers. Notice that their memory representation may be much larger than if they had no capture groups.
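To illustrate the delayed output, here is a hand-built sketch of deterministic transducers for those two regexes. The tables were written by hand, not produced by the cited determinization algorithm, and the names AB_OR_CAPTURED_AC, CAPTURED_OR_AD and run are hypothetical. After reading "a" nothing is emitted yet; the next character decides whether the group was entered.

```python
# Hand-written tables: (state, input char) -> (next state, emitted string).

# /ab|(a)c/: after "a" it is still unknown whether the group matched.
AB_OR_CAPTURED_AC = {
    (0, "a"): (1, ""),        # ambiguous: hold the output back
    (1, "b"): (2, "ab"),      # first alternative, no group
    (1, "c"): (2, "(a)c"),    # second alternative, "a" was captured
}

# /(a[bc])|ad/: the same trick, with the group spanning two characters.
CAPTURED_OR_AD = {
    (0, "a"): (1, ""),
    (1, "b"): (2, "(ab)"),
    (1, "c"): (2, "(ac)"),
    (1, "d"): (2, "ad"),
}


def run(table, text, start=0, accepting=frozenset({2})):
    """Single left-to-right pass; returns the annotated text or None."""
    state = start
    out = []
    for ch in text:
        step = table.get((state, ch))
        if step is None:
            return None
        state, emitted = step
        out.append(emitted)
    return "".join(out) if state in accepting else None


print(run(AB_OR_CAPTURED_AC, "ac"))
print(run(CAPTURED_OR_AD, "ad"))
```

The delay is bounded by one character in both tables, which is why a deterministic version exists for these regexes.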


answered Oct 25 '22 16:10

Bergi

My demo-parselov project (http://github.com/hoehrmann/demo-parselov) does this. I do not currently explain the construction on the web page, but suppose you have a grammar like this:

X = "a" B "c"
B = "b"

You can turn this regular grammar into a graph with labeled vertices:

1. start X
2. "a"
3. start B
4. "b"
5. final B
6. "c"
7. final X

DFA states correspond to sets of these vertices. The first one would consist of vertices 1 and 2, the second one of vertices 3 and 4, then 5 and 6, and finally 7. If you parse the string "abc", you have

{ offset: 0, vertices: [1, 2] }
{ offset: 1, vertices: [3, 4] }
{ offset: 2, vertices: [5, 6] }
{ offset: EOF, vertices: [7] }

That is also a graph. You can write out the edges using (offset, vertex) pairs as vertices:

(o0, v1) -> (o0, v2)
(o0, v2) -> (o1, v3)
(o1, v3) -> (o1, v4)
...
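The vertex sets and edges above can be reproduced with a short sketch. It is not taken from demo-parselov: it computes the vertex sets on the fly while reading the input instead of looking up precomputed DFA states, and the names LABELS, SUCC, closure and parse are hypothetical. The numeric offset 3 stands for EOF.

```python
# The seven vertices of the grammar graph, numbered as in the answer.
LABELS = {
    1: ("start", "X"), 2: ("char", "a"), 3: ("start", "B"),
    4: ("char", "b"), 5: ("final", "B"), 6: ("char", "c"),
    7: ("final", "X"),
}
SUCC = {1: [2], 2: [3], 3: [4], 4: [5], 5: [6], 6: [7], 7: []}


def closure(seeds, offset, edges):
    """Follow start/final vertices without consuming input.

    Character vertices are included in the set but not followed.
    """
    seen = []
    stack = list(seeds)
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.append(v)
        if LABELS[v][0] != "char":
            for w in SUCC[v]:
                edges.append(((offset, v), (offset, w)))
                stack.append(w)
    return sorted(seen)


def parse(text):
    """Return (rows, edges): the vertex set per offset and the parse graph."""
    edges = []
    current = closure([1], 0, edges)
    rows = [(0, current)]
    for i, ch in enumerate(text):
        seeds = []
        for v in current:
            if LABELS[v] == ("char", ch):   # this vertex consumes ch
                for w in SUCC[v]:
                    edges.append(((i, v), (i + 1, w)))
                    seeds.append(w)
        current = closure(seeds, i + 1, edges)
        rows.append((i + 1, current))
    return rows, edges


rows, edges = parse("abc")
for offset, vertices in rows:
    print({"offset": offset, "vertices": vertices})
```

Each input character is handled with work bounded by the size of the grammar graph, so the pass is linear in the input length for a fixed grammar.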

Such a graph might contain vertices that do not ultimately reach the final vertex (EOF, v7), but such vertices can be eliminated in O(n) time. If the grammar is ambiguous, a match would be a path through the resulting graph. There may be many possible paths.
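One way to do that elimination, sketched under the assumption that the parse graph is given as a list of ((offset, vertex), (offset, vertex)) edges: walk backwards from the final vertex and keep only what is reached. The function name prune and the dead-end vertex 8 in the example are hypothetical. Every edge is visited a constant number of times, so the cost is linear in the size of the graph.

```python
from collections import defaultdict


def prune(edges, final):
    """Keep only the edges that lie on some path to the final vertex."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    # Backward reachability from the final vertex.
    alive = {final}
    stack = [final]
    while stack:
        node = stack.pop()
        for p in preds[node]:
            if p not in alive:
                alive.add(p)
                stack.append(p)

    # An edge survives if its target can still reach the final vertex.
    return [(src, dst) for src, dst in edges if dst in alive]


# Illustrative graph: vertex 8 is a made-up dead end.
example = [
    ((0, 1), (0, 2)),
    ((0, 1), (0, 8)),   # never reaches the final vertex
    ((0, 2), (1, 7)),
]
print(prune(example, (1, 7)))
```

If several paths remain after pruning, each of them corresponds to one possible match.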

answered Oct 25 '22 17:10

Björn Höhrmann
