Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How many regular expressions can I chain together using alternation?

Tags:

regex

perl

I have some large files (hundreds of MB) that I need to search for several thousand ~20-character unique strings.

I've found that using the pipe alternation metacharacter for matching regular expressions like (string1|string2|string3) speeds up the search process a lot (versus searching for one string at a time).

What's the limit to how well this will scale? How many expressions can I chain together like this? Will it cause some kind of overflow at some point? Is there a better way to do this?

EDIT

In an effort to keep my question brief, I didn't emphasize the fact that I've already implemented code using this alternation approach and I found it to be helpful: On a test case with a typical data set, running time was reduced from 87 minutes down to 18 seconds--a 290x speedup, apparently with O(n) instead of O(n*m).

My question relates to how this approach can be expected to work when other users run this code in the future using much larger data sets with bigger files and more search terms. The original O(n*m) code was existing code that's been in use for 13 years, and its slowness was pointed out recently as the genome-related data sets it operates on have recently gotten much bigger.

like image 623
rmtheis Avatar asked Feb 26 '12 22:02

rmtheis


2 Answers

If you have a simple regular expression like (word1|word2|...|wordn), the regex engine will construct a state machine that can just pass over the input once to find whether the string matches.

Side-note: in theoretical computer science, "regular expressions" are defined in a way such that a single pass is always sufficient. However, practical regex implementation add features that allow construction of regex patterns which can't be always implemented as a single pass (see this example).

Again, for your pattern of regular expressions, the engine will almost certainly use a single pass. That is likely going to be faster than reading the data from memory multiple times ... and almost definitely a lot faster than reading the data multiple times from disk.

like image 80
Igor ostrovsky Avatar answered Oct 19 '22 05:10

Igor ostrovsky


If you are just going to have regular expression of the form (word1|word2|....|wordn), why not just create an associated array of booleans. That should be very fast.

EDIT

# before the loop, set up the hash

%words = (
   cat => 1,
   dog => 1,
   apple => 1,
    .... etc
);

# A the loop to check a sentence

foreach $aword (split(/ /, $sentence))
   if ($words{$aword}) print "Found $aword\n";
like image 37
Ed Heal Avatar answered Oct 19 '22 07:10

Ed Heal