Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to implement regular expression assertions/lookaround (i.e. \b style word boundary) using a DFA regular expression matcher

I would like to implement "word boundary" matches within a DFA based regular expression matcher. Can someone tell me how this is done?

To give some background, I'm currently using the "dk.brics.automaton" library, but it does not support assertions (e.g. \b, word boundary). I need to use a DFA based engine because my main goal is actually determining equivalence of regular expressions, not doing the actual matching.

Additionally, the answer to the following question seems to indicate this is possible: DFA based regular expression matching - how to get all matches? by saying

"Again, we manage this by adding an epsilon transition with special instructions to the simulator. If the assertion passes, then the state pointer continues, otherwise it is discarded."

I can't quite figure out what this means, however. Is it suggesting that it can only be done with a special type of epsilon transition that looks at its endpoints and can only be traversed if its endpoint meet the assertion, or can it be done with "normal" epsilon transitions configured in some way? If I need these "special" type of epsilon transitions, then how can these be determinized (i.e. converted to a standard DFA)?

Pointers to any descriptions of how to actually implement this are greatly appreciated.

like image 325
user1255384 Avatar asked Nov 13 '22 07:11

user1255384


1 Answers

You cannot do a lookaround type regular expression engine using a pure DFA implementation. Since you will need to keep track of what has been previously seen you are making the engine into a different beast which keeps context in memory in order to do the pattern matching.

For a regular expression engine to handle this means that it will need to have special transitions that look at the context of what has already been parsed. A normal DFA cannot do this, since this context is thrown away. Incidentally, this is also why capture groups are slow and why matching (.*)something(.*) is deadly slow on some engines since it will copy a LOT of characters into buffers in order to keep this context around.

I suppose that you are going to try to minimize two resulting DFAs and see if they are equal in order to solve your problem. This might still be achievable if you handle each "special" transition as unique and only mergeable with a transition equal to itself when doing the state minimization algorithm.

like image 102
Dervall Avatar answered Nov 16 '22 03:11

Dervall