I'm attempting to implement a text clustering algorithm. The algorithm clusters similar lines of raw text by replacing them with regexes, and aggregates the number of patterns matching each regex so as to provide a neat summary of the input text instead of showing repetitive patterns from the input text. In this attempt I ran into the need to find if one regex covers another.
Assume we are concerned only about regular expressions with '*' and '+' wild cards i.e. '*' meaning zero or more occurences of an alphabet, and '+' meaning 1 or more occurences of an alphabet. Also assume the character set to be ASCII.
For example:
1. AB covers AB
This is straightforward.
2. ABC* covers ABC
Because ABC* can generate: ABC, ABCC, ABCCC etc.
3. A*B+C* covers AB+C*
Because A*B+C* can generate ABBC, AABBC, AABBCC etc. which covers
all strings generated by AB+C*.
4. A+M+BC* covers AMM+B+C+M+BC*
Similar to case [3] above.
Basically I'm looking for an efficient implementation of the following method which tells if strA (may contain a regex) covers for strB (may contain a regex). Note that there should also be a way to escape the regex characters '*' and '+' in the input strings strA and strB.
Method signature in C++:
bool isParentRegex(const string& strA, const string& strB)
My thinking is that a recursive approach is required for the implementation and it may be a bit complicated. But I'm curious to know if I could reuse existing implementations instead of re-inventing the wheel or if there are any other straightforward ways to do it.
Considering the simple regex grammar you propose, the solution is fairly trivial.
Take your more complex example, A+M+BC* covers AMM+B+C+M+BC*
You can rewrite it as A{1,}M{1,}B{1,1}C{0,}
covers A{1,1}M{2,}B{1,}C{1,}M{1,}B{1,1}C{0,}
This leads us to the simple rule: R1
covers R2
if all symbols appear in the same order, all lower bounds of R1
are less or equal to those of R2
, and the upper bounds of R1
are more or equal than those of R2
.
Now there's one slight problem with the simple rule. AB*C
covers AC
, i.e. there's a possibility that an optional symbol appears in R1
and not in R2
. You can solve that by inserting a {0,0}
in R2
when there's an (optional) symbol in R1 that doesn't appear in the equivalent position in R2
. E.g. AB*C
does cover AB{0,0}C
.
The "optional symbol" rule is an optimization. If the symbol in R1
isn't optional, R1
certainly doesn't cover R2
. E.g. AB+C
doesn't cover AC
. Therefore there's no need to insert a B{0,0}
. But if you do, you'll see that A{1,1}B{1,}C{1,1}
doesn't cover A{1,1}B{0,0}C{1,1}
since the R1
lower bound on B
(1) is more that the R2
lower bound on B
(0)
I would do something like implementing a function for finding the minimal DFA from a given Regular Expression. Let's Assume that
DFA GetMinimalDFA(Regex r1) does that.
bool isParentRegex(Regex r1, Regex r2) {
DFA a = GetMinimalDFA(r1);
DFA b = GetMinimalDFA(Regex.OR(r1,r2))
return a.Equals(b);
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With