Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generating the shortest regex to match an arbitrary word list

I'm hoping someone might know of a script that can take an arbitrary word list and generated the shortest regex that could match that list exactly (and nothing else).

For example, suppose my list is

1231
1233
1234
1236
1238
1247
1256
1258
1259

Then the output should be:

12(3[13468]|47|5[589])
like image 982
Asmor Avatar asked Feb 03 '23 13:02

Asmor


1 Answers

This is an old post, but for the benefit of those finding it through web searches as I did, there is a Perl module that does this, called Regexp::Optimizer, here: http://search.cpan.org/~dankogai/Regexp-Optimizer-0.23/lib/Regexp/Optimizer.pm

It takes a regular expression as input, which can consist just of the list of input strings separated with |, and outputs an optimal regular expression.

For example, this Perl command-line:

perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/1231|1233|1234|1236|1238|1247|1256|1258|1259/)"

generates this output:

(?^:(?^:12(?:3[13468]|5[689]|47)))

(assuming you have installed Regex::Optimizer), which matches the OP's expectation quite well.

Here's another example:

perl -mRegexp::Optimizer -e "print Regexp::Optimizer->new->optimize(qr/314|324|334|3574|384/)"

And the output:

(?^:(?^:3(?:[1238]|57)4))

For comparison, an optimal trie-based version would output 3(14|24|34|574|84). In the above output, you can also search and replace (?: and (?^: with just ( and eliminate redundant parentheses, to obtain this:

3([1238]|57)4
like image 126
Sei Lisa Avatar answered Feb 06 '23 11:02

Sei Lisa