Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find a "minimal spanning set" for a collection of regular expressions?

CONTEXT:

I have a smallish (currently less than 100) but growing collection of Regular Expressions, and I want to optimize the process of determining for a given text string which of the REs in my collection match the text string.

Some of the REs have an ordering relationship - for example if I know that the string $t matches /windows/i then I also know that $t matches /windows.*2000/i. So when testing $t against the REs in my collection I can skip testing /windows/i if I've already tested $t against /windows.*2000/i and found a match (although if /windows.*2000/i does not match then of course I cannot skip the test against /windows/i).

Note that none of the REs in my collection are entirely equivalent (for any pair of REs there is at least one text string which matches one and does not match the other).

STRATEGY:

I want to build a directed graph G with a node for each RE in my collection and a directed edge for each pair of REs with an ordering relationship (A -> B means "match against A implies match against B"), and find a "minimal spanning set" of nodes for the graph (minimal set of nodes S such that every node in G lies on a directed path which originates in S).

THE EASY PART:

There are lots of freely available algorithms for working with Directed Acyclic Graphs. So once the graph G is built for my collection of REs (which being distinct should guarantee that G is acyclic) I don't expect to have much difficulty finding an appropriate algorithm for finding a minimal spanning set for G.

WHERE I NEED HELP:

I'd like to find an efficient way to find all the ordering relationships between the REs in my collection - and perhaps also to ensure that no two REs in the collection are equivalent (I will need a way to automatically verify this as new REs are added).

My (essentially random) web searches have thus turned up at least one plausible claim that a reasonable way to compute what (if any) ordering relationship exists between two REs does indeed exist, but have not yet turned up any descriptions of a complete algorithm.

Does anyone know of an existing implementation (for comparing REs) which is reasonably efficient, freely available, and (ideally) implemented either in one of the popular scripting languages or C/C++?

like image 491
Peter Avatar asked May 02 '11 18:05

Peter


1 Answers

I am not sure if you have flexibility in terms of the regular expression library that you need to use, but you could look at RE2 whose Set interface can match multiple regexes simultaneously. Note that RE2 uses primarily a DFA approach, and does not support all of the regex features that other, mostly backtracking, implementations do.

like image 172
e.dan Avatar answered Oct 06 '22 02:10

e.dan