How can you detect if two regular expressions overlap in the strings they can match?

I have a container of regular expressions. I'd like to analyze them to determine if it's possible to generate a string that matches more than 1 of them. Short of writing my own regex engine with this use case in mind, is there an easy way in C++ or Python to solve this problem?

551

asked Dec 04 '09 20:12

Joseph Garvin

1 Answer

There's no easy way.

As long as your regular expressions use only truly regular features (no backreferences, and none of the arbitrary embedded code that Perl allows in matching, I think), you can produce from each one a nondeterministic finite-state automaton (NFA) that compactly encodes all the strings that the RE matches.

Given any pair of NFAs, it's decidable whether the intersection of their languages is empty. If the intersection isn't empty, then some string matches both REs in the pair (and conversely).

The standard decidability proof is to determinize them into DFAs first, and then construct a new DFA whose states are pairs of the two DFAs' states, and whose final states are exactly those pairs in which both components are final in their original DFA. Alternatively, if you've already shown how to compute the complement of an NFA, then you can (De Morgan style) get the intersection as complement(union(complement(A),complement(B))).
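The pair-state construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each DFA is a hypothetical (start, finals, delta) tuple with delta[(state, symbol)] giving the next state, and the function and variable names are made up for this example, not taken from any library.

```python
def intersect_dfas(dfa_a, dfa_b):
    """Product construction for two DFAs.

    Each DFA is (start, finals, delta), where delta[(state, symbol)] -> state.
    This representation is an assumption of the sketch.
    """
    start_a, finals_a, delta_a = dfa_a
    start_b, finals_b, delta_b = dfa_b
    # Only symbols known to both machines can appear in a common string.
    alphabet = {sym for (_, sym) in delta_a} & {sym for (_, sym) in delta_b}

    start = (start_a, start_b)
    delta, finals = {}, set()
    todo, seen = [start], {start}
    while todo:
        p, q = todo.pop()
        # A pair is final exactly when both components are final.
        if p in finals_a and q in finals_b:
            finals.add((p, q))
        for sym in alphabet:
            if (p, sym) in delta_a and (q, sym) in delta_b:
                nxt = (delta_a[(p, sym)], delta_b[(q, sym)])
                delta[((p, q), sym)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    return start, finals, delta


# Example: "even number of a's" intersected with "ends in b" over {a, b}.
even_a = (0, {0}, {(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1})
ends_b = ('n', {'y'}, {('n', 'a'): 'n', ('n', 'b'): 'y',
                       ('y', 'a'): 'n', ('y', 'b'): 'y'})
start, finals, delta = intersect_dfas(even_a, ends_b)
print(finals)  # {(0, 'y')}
```

Only pair states reachable from the start pair are built, so the result can be smaller than the full product of the two state sets.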

Unfortunately, NFA->DFA involves a potentially exponential size explosion (because states in the DFA are subsets of states in the NFA). From Wikipedia:

Some classes of regular languages can only be described by deterministic finite automata whose size grows exponentially in the size of the shortest equivalent regular expressions. The standard example are here the languages L_k consisting of all strings over the alphabet {a,b} whose kth-last letter equals a.
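To see the blow-up concretely, here is a small Python sketch that builds an NFA for L_k with k+1 states and counts the subsets reached by the subset construction. The tuple representation and the function names are assumptions made for this illustration.

```python
def kth_last_is_a(k):
    """NFA for L_k: strings over {a, b} whose kth-last letter is 'a'.

    Returns (start, finals, delta) with delta[(state, symbol)] -> set of states.
    State 0 loops on every letter and guesses, on an 'a', that this is the
    kth-last letter; states 1..k then count the remaining letters.
    """
    delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
    for i in range(1, k):
        delta[(i, 'a')] = {i + 1}
        delta[(i, 'b')] = {i + 1}
    return 0, {k}, delta


def count_dfa_states(nfa, alphabet='ab'):
    """Run the subset construction and count the reachable DFA states."""
    start, _finals, delta = nfa
    first = frozenset([start])
    seen, todo = {first}, [first]
    while todo:
        subset = todo.pop()
        for sym in alphabet:
            nxt = frozenset(t for s in subset for t in delta.get((s, sym), ()))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)


for k in range(1, 6):
    # The NFA has k + 1 states; the reachable DFA has 2**k.
    print(k, k + 1, count_dfa_states(kth_last_is_a(k)))
```

Each reachable subset records which of the last k letters were 'a', which is why every one of the 2**k combinations shows up.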

By the way, you should definitely use OpenFst. You can create automata as text files and play around with operations like minimization, intersection, etc. in order to see how efficient they are for your problem. There already exist open source regexp->NFA->DFA compilers (I remember a Perl module); modify one to output OpenFst automata files and play around.

Fortunately, it's possible to avoid the subset-of-states explosion, and intersect two NFAs directly using the same construction as for DFAs:

if A ->a B (in one NFA, you can go from state A to B reading, or equivalently generating, the letter 'a')

and X ->a Y (in the other NFA)

then (A,X) ->a (B,Y) in the intersection

(C,Z) is final iff C is final in the one NFA and Z is final in the other.

To start the process off, begin at the pair of start states of the two NFAs, e.g. (A,X); this is the start state of the intersection NFA. The first time you visit a state, generate an arc by the above rule for every pair of arcs leaving the two component states, and then visit all the (new) states those arcs reach. Record which states you have already expanded (e.g. in a hash table) so that each is expanded only once; you end up exploring exactly the pair states reachable from the start, which is at most the product of the two NFAs' state counts rather than anything exponential.
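The arc rule and the exploration just described might look like this in Python. It is a sketch under assumptions: each NFA is a hypothetical (start, finals, arcs) tuple with arcs[state] a list of (symbol, target) pairs, there are no epsilon arcs, and all names are illustrative.

```python
def intersect_nfas(nfa_a, nfa_b):
    """Intersect two epsilon-free NFAs by exploring reachable state pairs.

    Each NFA is (start, finals, arcs), where arcs[state] is a list of
    (symbol, target) pairs. This layout is an assumption of the sketch.
    """
    start_a, finals_a, arcs_a = nfa_a
    start_b, finals_b, arcs_b = nfa_b
    start = (start_a, start_b)
    arcs, finals = {}, set()   # arcs doubles as the "already expanded" table
    todo = [start]
    while todo:
        pair = todo.pop()
        if pair in arcs:       # expanded before, nothing to do
            continue
        p, q = pair
        arcs[pair] = []
        if p in finals_a and q in finals_b:
            finals.add(pair)
        # One product arc for every pair of arcs that agree on the symbol.
        for sym_a, p2 in arcs_a.get(p, []):
            for sym_b, q2 in arcs_b.get(q, []):
                if sym_a == sym_b:
                    arcs[pair].append((sym_a, (p2, q2)))
                    todo.append((p2, q2))
    return start, finals, arcs


# Example: hand-built NFAs for ab* and a*b; the only common string is "ab".
ab_star = ('A0', {'A1'}, {'A0': [('a', 'A1')], 'A1': [('b', 'A1')]})
a_star_b = ('X', {'Y'}, {'X': [('a', 'X'), ('b', 'Y')]})
start, finals, arcs = intersect_nfas(ab_star, a_star_b)
print(finals)  # {('A1', 'Y')}
```

Because a final pair is reachable from the start pair, the two languages overlap; if `finals` came back empty, they would not.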

If you allow epsilon transitions (arcs that consume no letter), that's fine:

if A ->epsilon B in the first NFA, then for every state (A,Y) you reach, add the arc (A,Y) ->epsilon (B,Y) and similarly for epsilons in the second-position NFA.
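Here is one way the epsilon rule could be folded into the pair exploration, as a Python sketch. The assumptions: epsilon is represented by None, each NFA is a hypothetical (start, finals, arcs) tuple with arcs[state] a list of (symbol, target) pairs, and the names are invented for this example.

```python
EPS = None  # marker for an epsilon arc (a convention chosen for this sketch)


def intersect_eps_nfas(nfa_a, nfa_b):
    """Intersect two NFAs that may contain epsilon arcs.

    An epsilon arc in either machine moves that component alone;
    letter arcs must still be taken by both machines together.
    """
    start_a, finals_a, arcs_a = nfa_a
    start_b, finals_b, arcs_b = nfa_b
    start = (start_a, start_b)
    arcs, finals = {}, set()
    todo = [start]
    while todo:
        pair = todo.pop()
        if pair in arcs:            # already expanded
            continue
        p, q = pair
        out = arcs[pair] = []
        if p in finals_a and q in finals_b:
            finals.add(pair)
        for sym_a, p2 in arcs_a.get(p, []):
            if sym_a is EPS:
                out.append((EPS, (p2, q)))      # first machine moves alone
                continue
            for sym_b, q2 in arcs_b.get(q, []):
                if sym_b == sym_a:
                    out.append((sym_a, (p2, q2)))
        for sym_b, q2 in arcs_b.get(q, []):
            if sym_b is EPS:
                out.append((EPS, (p, q2)))      # second machine moves alone
        todo.extend(target for _sym, target in out)
    return start, finals, arcs


# Example: (a|b) built with epsilon arcs, intersected with the single string "b".
a_or_b = ('S', {'F'}, {'S': [(EPS, 'S1'), (EPS, 'S2')],
                       'S1': [('a', 'F')], 'S2': [('b', 'F')]})
just_b = ('X', {'Y'}, {'X': [('b', 'Y')]})
start, finals, arcs = intersect_eps_nfas(a_or_b, just_b)
print(finals)  # {('F', 'Y')}
```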

Epsilon transitions are useful (but not necessary) in taking the union of two NFAs when translating a regexp to an NFA; whenever you have alternation regexp1|regexp2|regexp3, you take the union: an NFA whose start state has an epsilon transition to each of the NFAs representing the regexps in the alternation.
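As an illustration of that union step, the following Python sketch joins several NFAs under a fresh start state. It assumes epsilon is represented by None and each NFA is a hypothetical (start, finals, arcs) tuple with arcs[state] a list of (symbol, target) pairs; `union_nfas`, `accepts`, and `letter` are names made up for the example.

```python
EPS = None  # marker for an epsilon arc (a convention chosen for this sketch)


def union_nfas(nfas, new_start='union_start'):
    """Union of NFAs: a fresh start state with an epsilon arc to each start.

    States of the i-th NFA are renamed to (i, state) so names cannot collide.
    """
    arcs = {new_start: []}
    finals = set()
    for i, (start, fin, sub_arcs) in enumerate(nfas):
        for state, out in sub_arcs.items():
            arcs[(i, state)] = [(sym, (i, tgt)) for sym, tgt in out]
        finals |= {(i, f) for f in fin}
        arcs[new_start].append((EPS, (i, start)))
    return new_start, finals, arcs


def accepts(nfa, text):
    """Simulate the NFA by searching over (state, position) pairs."""
    start, finals, arcs = nfa
    todo, seen = [(start, 0)], {(start, 0)}
    while todo:
        state, i = todo.pop()
        if i == len(text) and state in finals:
            return True
        for sym, tgt in arcs.get(state, []):
            if sym is EPS:
                nxt = (tgt, i)              # epsilon: stay at the same position
            elif i < len(text) and text[i] == sym:
                nxt = (tgt, i + 1)          # letter: consume one character
            else:
                continue
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return False


def letter(ch):
    """Two-state NFA matching exactly the one-letter string ch."""
    return 0, {1}, {0: [(ch, 1)]}


a_b_c = union_nfas([letter('a'), letter('b'), letter('c')])  # like a|b|c
print(accepts(a_b_c, 'b'), accepts(a_b_c, 'd'))  # True False
```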

Deciding emptiness for an NFA is easy: if you ever reach a final state during a depth-first search from the start state, its language is not empty; if the search finishes without reaching one, it is.
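A depth-first search like that can also hand back an example string, which is handy for reporting which input two regexes both match. The Python sketch below assumes a hypothetical (start, finals, arcs) tuple with arcs[state] a list of (symbol, target) pairs and None standing for epsilon; `witness` is an illustrative name.

```python
def witness(nfa):
    """Return some string the NFA accepts, or None if its language is empty.

    Depth-first search from the start state; the string is built from the
    symbols along the path. Epsilon arcs (symbol None) add nothing to it.
    """
    start, finals, arcs = nfa
    stack, seen = [(start, "")], {start}
    while stack:
        state, text = stack.pop()
        if state in finals:
            return text
        for sym, tgt in arcs.get(state, []):
            if tgt not in seen:
                seen.add(tgt)
                stack.append((tgt, text + (sym or "")))
    return None


# A final state is reachable via "ab", so the language is non-empty.
reachable = ('s', {'f'}, {'s': [('a', 'm')], 'm': [('b', 'f')]})
# Here the only final state cannot be reached from the start.
unreachable = ('s', {'f'}, {'s': [('a', 's')], 'x': [('b', 'f')]})

print(witness(reachable))    # ab
print(witness(unreachable))  # None
```

Compare the result against None rather than testing its truthiness, since the empty string is a legitimate witness when the start state is itself final.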

This NFA-intersection is similar to finite state transducer composition (a transducer is an NFA whose arcs carry pairs of symbols; concatenating each component along a path gives an input string and an output string, so it can match a pair of strings or transform a given input into an output).

170

answered Oct 03 '22 00:10

Jonathan Graehl

Tags:

c++

python

regex

algorithm

overlap
