Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can you detect if two regular expressions overlap in the strings they can match?

I have a container of regular expressions. I'd like to analyze them to determine if it's possible to generate a string that matches more than 1 of them. Short of writing my own regex engine with this use case in mind, is there an easy way in C++ or Python to solve this problem?

like image 551
Joseph Garvin Avatar asked Dec 04 '09 20:12

Joseph Garvin


People also ask

How do you know if two regular expressions are equal?

We say that two regular expressions R and S are equivalent if they describe the same language. In other words, if L(R) = L(S) for two regular expressions R and S then R = S. Examples. Are R and S equivalent?

Can regex matches overlap?

The first items are the matches – they are empty strings because all matchable characters are inside the lookahead/lookbehind assertions. The second items are the matches from the capturing groups. And that's how we can do overlapping matches with regular expressions.

How can you match regular expressions?

The Match(String) method returns the first substring that matches a regular expression pattern in an input string. For information about the language elements used to build a regular expression pattern, see Regular Expression Language - Quick Reference.

What is the regular expression matching one or more specific characters?

The character + in a regular expression means "match the preceding character one or more times". For example A+ matches one or more of character A. The plus character, used in a regular expression, is called a Kleene plus .


1 Answers

There's no easy way.

As long as your regular expressions use only standard features (Perl lets you embed arbitrary code in matching, I think), you can produce from each one a nondeterministic finite-state automaton (NFA) that compactly encodes all the strings that the RE matches.

Given any pair of NFA, it's decidable whether their intersection is empty. If the intersection isn't empty, then some string matches both REs in the pair (and conversely).

The standard decidability proof is to determinize them into DFAs first, and then construct a new DFA whose states are pairs of the two DFAs' states, and whose final states are exactly those in which both states in the pair are final in their original DFA. Alternatively, if you've already shown how to compute the complement of a NFA, then you can (DeMorgan's law style) get the intersection by complement(union(complement(A),complement(B))).

Unfortunately, NFA->DFA involves a potentially exponential size explosion (because states in the DFA are subsets of states in the NFA). From Wikipedia:

Some classes of regular languages can only be described by deterministic finite automata whose size grows exponentially in the size of the shortest equivalent regular expressions. The standard example are here the languages L_k consisting of all strings over the alphabet {a,b} whose kth-last letter equals a.

By the way, you should definitely use OpenFST. You can create automata as text files and play around with operations like minimization, intersection, etc. in order to see how efficient they are for your problem. There already exist open source regexp->nfa->dfa compilers (I remember a Perl module); modify one to output OpenFST automata files and play around.

Fortunately, it's possible to avoid the subset-of-states explosion, and intersect two NFA directly using the same construction as for DFA:

if A ->a B (in one NFA, you can go from state A to B outputting the letter 'a')

and X ->a Y (in the other NFA)

then (A,X) ->a (B,Y) in the intersection

(C,Z) is final iff C is final in the one NFA and Z is final in the other.

To start the process off, you start in the pair of start states for the two NFAs e.g. (A,X) - this is the start state of the intersection-NFA. Each time you first visit a state, generate an arc by the above rule for every pair of arcs leaving the two states, and then visit all the (new) states those arcs reach. You'd store the fact that you expanded a state's arcs (e.g. in a hash table) and end up exploring all the states reachable from the start.

If you allow epsilon transitions (that don't output a letter), that's fine:

if A ->epsilon B in the first NFA, then for every state (A,Y) you reach, add the arc (A,Y) ->epsilon (B,Y) and similarly for epsilons in the second-position NFA.

Epsilon transitions are useful (but not necessary) in taking the union of two NFAs when translating a regexp to an NFA; whenever you have alternation regexp1|regexp2|regexp3, you take the union: an NFA whose start state has an epsilon transition to each of the NFAs representing the regexps in the alternation.

Deciding emptiness for an NFA is easy: if you ever reach a final state in doing a depth-first-search from the start state, it's not empty.

This NFA-intersection is similar to finite state transducer composition (a transducer is an NFA that outputs pairs of symbols, that are concatenated pairwise to match both an input and output string, or to transform a given input to an output).

like image 170
Jonathan Graehl Avatar answered Oct 03 '22 00:10

Jonathan Graehl