Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Check if one regex covers another regex

I'm attempting to implement a text clustering algorithm. The algorithm clusters similar lines of raw text by replacing them with regexes, and aggregates the number of patterns matching each regex so as to provide a neat summary of the input text instead of showing repetitive patterns from the input text. In this attempt I ran into the need to find if one regex covers another.

Assume we are concerned only about regular expressions with '*' and '+' wild cards i.e. '*' meaning zero or more occurences of an alphabet, and '+' meaning 1 or more occurences of an alphabet. Also assume the character set to be ASCII.

For example:

1. AB covers AB
      This is straightforward.
2. ABC* covers ABC
      Because ABC* can generate: ABC, ABCC, ABCCC etc.
3. A*B+C* covers AB+C*
      Because A*B+C* can generate ABBC, AABBC, AABBCC etc. which covers
      all strings generated by AB+C*.
4. A+M+BC* covers AMM+B+C+M+BC*
      Similar to case [3] above.

Basically I'm looking for an efficient implementation of the following method which tells if strA (may contain a regex) covers for strB (may contain a regex). Note that there should also be a way to escape the regex characters '*' and '+' in the input strings strA and strB.

Method signature in C++:

bool isParentRegex(const string& strA, const string& strB)

My thinking is that a recursive approach is required for the implementation and it may be a bit complicated. But I'm curious to know if I could reuse existing implementations instead of re-inventing the wheel or if there are any other straightforward ways to do it.

like image 652
Kowshik Avatar asked Mar 27 '12 10:03

Kowshik


2 Answers

Considering the simple regex grammar you propose, the solution is fairly trivial.

Take your more complex example, A+M+BC* covers AMM+B+C+M+BC* You can rewrite it as A{1,}M{1,}B{1,1}C{0,} covers A{1,1}M{2,}B{1,}C{1,}M{1,}B{1,1}C{0,}

This leads us to the simple rule: R1 covers R2 if all symbols appear in the same order, all lower bounds of R1 are less or equal to those of R2, and the upper bounds of R1 are more or equal than those of R2.

Now there's one slight problem with the simple rule. AB*C covers AC, i.e. there's a possibility that an optional symbol appears in R1 and not in R2. You can solve that by inserting a {0,0} in R2 when there's an (optional) symbol in R1 that doesn't appear in the equivalent position in R2. E.g. AB*C does cover AB{0,0}C.

The "optional symbol" rule is an optimization. If the symbol in R1 isn't optional, R1 certainly doesn't cover R2. E.g. AB+C doesn't cover AC. Therefore there's no need to insert a B{0,0}. But if you do, you'll see that A{1,1}B{1,}C{1,1} doesn't cover A{1,1}B{0,0}C{1,1} since the R1 lower bound on B (1) is more that the R2 lower bound on B (0)

like image 194
MSalters Avatar answered Oct 21 '22 04:10

MSalters


I would do something like implementing a function for finding the minimal DFA from a given Regular Expression. Let's Assume that

DFA GetMinimalDFA(Regex r1) does that.

bool isParentRegex(Regex r1, Regex r2) {
    DFA a = GetMinimalDFA(r1);
    DFA b = GetMinimalDFA(Regex.OR(r1,r2))
    return a.Equals(b);
}
like image 38
Stavros Zotalis Avatar answered Oct 21 '22 02:10

Stavros Zotalis