
Python: check if one regex covers another regex [duplicate]

Tags: regex

I want to find out if there could ever be conflicts between two known regular expressions, in order to allow the user to construct a list of mutually exclusive regular expressions.

For example, we know that the regular expressions below are quite different, but they both match xy12:

'^xy1\d'
'[^\d]\d2$'

Is it possible to determine, using a computer algorithm, if two regular expressions can have such a conflict? How?
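
Testing one candidate string is easy; the hard part is deciding whether any such string exists at all. A minimal sketch of the single-string check with Python's re module, where both_match and the sample strings are illustrative assumptions and not part of the question:

```python
import re

# The two patterns from the question.
first = re.compile(r'^xy1\d')
second = re.compile(r'[^\d]\d2$')

def both_match(text):
    """Return True when both patterns find a match somewhere in text."""
    return bool(first.search(text)) and bool(second.search(text))

print(both_match("xy12"))   # accepted by both patterns: a conflict
print(both_match("xy13"))   # accepted by the first pattern only
```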

Tom asked Aug 04 '10 22:08


4 Answers

There's no halting problem involved here. All you need is to determine whether the intersection of ^xy1\d and [^\d]\d2$ is non-empty.

I can't give you an algorithm here, but here is a discussion of a method to generate the intersection without resorting to the construction of a DFA:

  • http://sulzmann.blogspot.com/2008/11/playing-with-regular-expressions.html

And then there's Ragel

  • http://www.complang.org/ragel/

which can compute the intersection of regular expressions too.

UPDATE: I just tried out Ragel with the OP's regexps. Ragel can generate a "dot" file for Graphviz from the resulting state machine, which is terrific. The intersection of the OP's regexps looks like this in Ragel syntax:

('xy1' digit any*) & (any* ^digit digit '2') 

and has the following state machine:

[image: Graphviz state machine for the non-empty intersection]

While the empty intersection:

('xy1' digit any*) & ('q' any* ^digit digit '2')

looks like this:

[image: Graphviz state machine for the empty intersection]

So if all else fails, then you can still have Ragel compute the intersection and check if it outputs the empty state machine, by comparing the generated "dot" file.

Nordic Mainframe answered Oct 17 '22 03:10


The problem can be restated as, "do the languages described by two or more regular expressions have a non-empty intersection"?

If you confine the question to pure regular expressions (no backreferences, lookahead, lookbehind, or other features that allow recognition of context-free or more complex languages), the question is at least decidable. Regular languages are closed under intersection, and there is an algorithm that takes the two regular expressions as inputs and produces, in finite time, a DFA that recognizes the intersection.

Each regular expression can be converted to a nondeterministic finite automaton, and then to a deterministic finite automaton. A pair of DFAs can be converted to a DFA that recognizes the intersection. If there is a path from the start state to any accepting state of that final DFA, the intersection is non-empty (a "conflict", using your terminology).
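
The last two steps (product DFA, then a reachability check) can be sketched in Python. This is only an illustration under stated assumptions: the DFAs are written by hand as dictionaries instead of being derived from regular expressions, and the names intersection_nonempty, starts_with_a, starts_with_b and ends_with_b are hypothetical.

```python
from collections import deque

def intersection_nonempty(dfa_a, dfa_b, alphabet):
    """Breadth-first search over the reachable state pairs of the product DFA."""
    start = (dfa_a["start"], dfa_b["start"])
    seen = {start}
    queue = deque([start])
    while queue:
        a, b = queue.popleft()
        # A pair is accepting only when both components are accepting.
        if a in dfa_a["accept"] and b in dfa_b["accept"]:
            return True
        for ch in alphabet:
            nxt = (dfa_a["delta"][a, ch], dfa_b["delta"][b, ch])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Toy complete DFAs over the alphabet {a, b}; state names are arbitrary.
starts_with_a = {
    "start": "s", "accept": {"ok"},
    "delta": {("s", "a"): "ok", ("s", "b"): "dead",
              ("ok", "a"): "ok", ("ok", "b"): "ok",
              ("dead", "a"): "dead", ("dead", "b"): "dead"},
}
starts_with_b = {
    "start": "s", "accept": {"ok"},
    "delta": {("s", "a"): "dead", ("s", "b"): "ok",
              ("ok", "a"): "ok", ("ok", "b"): "ok",
              ("dead", "a"): "dead", ("dead", "b"): "dead"},
}
ends_with_b = {
    "start": "n", "accept": {"y"},
    "delta": {("n", "a"): "n", ("n", "b"): "y",
              ("y", "a"): "n", ("y", "b"): "y"},
}

print(intersection_nonempty(starts_with_a, ends_with_b, "ab"))    # True, e.g. "ab"
print(intersection_nonempty(starts_with_a, starts_with_b, "ab"))  # False
```

The product is never materialised as a whole: only state pairs reachable from the start pair are visited, and the search stops at the first accepting pair.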

Unfortunately, there is a possibly-exponential blowup when converting the initial NFA to a DFA, so the problem quickly becomes infeasible in practice as the size of the input expressions grows.
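
To make the blowup concrete, here is a small Python sketch of the subset construction applied to a textbook worst case: the language over {a, b} whose n-th symbol from the end is a. The helper names nfa_nth_from_end and count_dfa_states are made up for this illustration.

```python
def nfa_nth_from_end(n):
    """Transitions of an NFA with n + 1 states: n-th symbol from the end is 'a'."""
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        delta[i, "a"] = {i + 1}
        delta[i, "b"] = {i + 1}
    return delta

def count_dfa_states(delta, start=0, alphabet="ab"):
    """Run the subset construction and count the reachable DFA states."""
    start_set = frozenset({start})
    seen = {start_set}
    stack = [start_set]
    while stack:
        current = stack.pop()
        for ch in alphabet:
            # The DFA successor is the union of all NFA successors.
            nxt = frozenset(t for s in current for t in delta.get((s, ch), ()))
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen)

for n in range(1, 6):
    dfa_states = count_dfa_states(nfa_nth_from_end(n))
    print(n + 1, "NFA states ->", dfa_states, "DFA states")
```

For this family the DFA needs 2^n states while the NFA has only n + 1, so the counts printed are 2, 4, 8, 16 and 32.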

And if extensions to pure regular expressions are permitted, all bets are off -- such languages are no longer closed under intersection, so this construction won't work.

Jim Lewis answered Oct 17 '22 02:10


Yes, I think this is solvable: instead of thinking of regular expressions as matching strings, you can also think of them as generating strings, that is, all the strings they would match.

Let [R] be the set of strings generated by the regular expression R. Then, given two regular expressions R and T, we could try to match T against [R]. That is, T matches [R] iff there is a string in [R] that T matches.

It should be possible to develop this into an algorithm where [R] is lazily constructed as needed: where normal regular expression matching would try to match the next character from an input string and then advance to the next character in the string, the modified algorithm would check whether the FSM corresponding to the input regular expression can generate a matching character at its current state, and then advance to 'all next states' simultaneously.
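
One way to read this idea is as a lazy search over pairs of NFA states, one state from each expression, advancing both whenever they can agree on a character. The sketch below is an interpretation, not the answerer's code: the NFAs for the question's patterns are written by hand, the alphabet is restricted to digits plus x, y and q, and all names (R1, R2, R3, find_conflict) are assumptions made for the example.

```python
from collections import deque
import string

DIGITS = frozenset(string.digits)
ALPHABET = frozenset(string.digits + "xyq")   # assumed small alphabet
NON_DIGITS = ALPHABET - DIGITS

# Each NFA: start state, accepting states, and edges state -> [(chars, next)].
# R1 models ^xy1\d followed by anything (search semantics).
R1 = {"start": 0, "accept": {4}, "edges": {
    0: [(frozenset("x"), 1)],
    1: [(frozenset("y"), 2)],
    2: [(frozenset("1"), 3)],
    3: [(DIGITS, 4)],
    4: [(ALPHABET, 4)],
}}
# R2 models anything followed by [^\d]\d2$.
R2 = {"start": 0, "accept": {3}, "edges": {
    0: [(ALPHABET, 0), (NON_DIGITS, 1)],
    1: [(DIGITS, 2)],
    2: [(frozenset("2"), 3)],
}}
# R3 is R2 forced to start with 'q', like the empty Ragel example.
R3 = {"start": 0, "accept": {4}, "edges": {
    0: [(frozenset("q"), 1)],
    1: [(ALPHABET, 1), (NON_DIGITS, 2)],
    2: [(DIGITS, 3)],
    3: [(frozenset("2"), 4)],
}}

def find_conflict(left, right):
    """Explore pairs of NFA states lazily; return a shared string or None."""
    start = (left["start"], right["start"])
    seen = {start}
    queue = deque([(start, "")])
    while queue:
        (p, q), text = queue.popleft()
        if p in left["accept"] and q in right["accept"]:
            return text
        for chars_p, p_next in left["edges"].get(p, []):
            for chars_q, q_next in right["edges"].get(q, []):
                common = chars_p & chars_q   # characters both sides can emit
                pair = (p_next, q_next)
                if common and pair not in seen:
                    seen.add(pair)
                    queue.append((pair, text + min(common)))
    return None

print(find_conflict(R1, R2))   # xy12
print(find_conflict(R1, R3))   # None
```

Because only pairs of NFA states are visited, this check touches at most |Q1| * |Q2| pairs and needs no determinization. It also returns a witness string, which is handy for explaining a conflict to the user.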

michid answered Oct 17 '22 03:10


Another approach would be to leverage Dan Kogai's Perl Regexp::Optimizer instead.

  use Regexp::Optimizer;
  my $re = Regexp::Optimizer->new->optimize(qr/foobar|fooxar|foozap/);
  # $re is now qr/foo(?:[bx]ar|zap)/

That is, first optimize, then discard all redundant patterns.

Maybe Ron Savage's Regexp::Assemble could be even more helpful. It allows assembling an arbitrary number of regular expressions into a single regular expression that matches all that the individual REs match.* Or a combination of both.

* However, you need to be aware of some differences between Perl and Java or other PCRE-flavors.
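
Since the question is about Python, a rough analogue of assembling several patterns into one is to join them as non-capturing alternatives. This is a minimal sketch under stated assumptions: assemble is a hypothetical helper, it does none of the prefix merging that the Perl modules perform, and it assumes the patterns contain no numbered backreferences or global inline flags.

```python
import re

def assemble(patterns):
    """Join patterns into one alternation of non-capturing groups."""
    return re.compile("|".join(f"(?:{p})" for p in patterns))

combined = assemble([r"^xy1\d", r"[^\d]\d2$"])

print(bool(combined.search("xy13")))   # matched by the first pattern only
print(bool(combined.search("ab42")))   # matched by the second pattern only
print(bool(combined.search("zzz")))    # matched by neither pattern
```

Note that the combined pattern only reports whether some member matches; on its own it does not reveal whether two members can match the same string.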

wp78de answered Oct 17 '22 01:10