Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Simplifying regex OR patterns

I was asked today if there was a library to take a list of strings and to compute the most efficient regex to match only those strings. I think it's an NP Complete problem by itself, but I think we can refine the scope a bit.

How would I generate and simplify a regex to match a subset of hosts from a larger set of all hosts on my network? (Knowing that I might not get the most efficient regex.)

The first step is easy. From the following list;

  • appserver1.domain.tld
  • appserver2.domain.tld
  • appserver3.domain.tld

I can concatenate and escape them into

appserver1\.domain\.tld|appserver2\.domain\.tld|appserver3\.domain\.tld

And I know how to manually simplify the regex into

appserver[123]\.domain\.tld

From there I can test that pattern against the full list of hosts and verify that it only matches the selected 3 hosts. What I don't know is how to automate the simplifying process. Are there any libraries (in Perl, Javascript or C#) or common practices?

Thanks

Update I got some awesome perl modules but I would love a front end solution as well. That means Javascript. I've searched around but nobody has ported the perl modules to JS and I'm unsuccessful in finding the language to search for this type of library.

like image 723
reconbot Avatar asked Sep 24 '10 15:09

reconbot


3 Answers

Regexp::Assemble::Compressed / Regexp::Assemble know far more tricks than PreSuf. R::A comes with the command-line tool assemble (not installed by default) which makes building regexes even easier.

like image 164
daxim Avatar answered Oct 10 '22 16:10

daxim


The Regex::PreSuf module is designed to do exactly this.

To quote the Synopsis:

use Regex::PreSuf;

my $re = presuf(qw(foobar fooxar foozap));

# $re should be now 'foo(?:zap|[bx]ar)'
like image 43
Zaid Avatar answered Oct 10 '22 16:10

Zaid


The Perl regex compiler builds a branching trie data structure out of patterns with parts in common across alternatives:

 $ perl -Mre=debug -ce '"whatever" =~ /appserver1\.domain\.tld|appserver2\.domain\.tld|appserver3\.domain\.tld/'
Compiling REx "appserver1\.domain\.tld|appserver2\.domain\.tld|appserver3\."...
Final program:
   1: EXACT <appserver> (5)
   5: TRIEC-EXACT[123] (25)
      <1.domain.tld> 
      <2.domain.tld> 
      <3.domain.tld> 
  25: END (0)
anchored "appserver" at 0 (checking anchored) minlen 21 
-e syntax OK
Freeing REx: "appserver1\.domain\.tld|appserver2\.domain\.tld|appserver3\."...
like image 3
tchrist Avatar answered Oct 10 '22 16:10

tchrist