
How can I match against multiple regexes in Perl?

Tags: regex, perl

I've seen this previous post about matching against multiple regexes: "How can I match against multiple regexes in Perl?"

I'm looking for the fastest way to match all the values contained in an array against a very big file (500 MB).

The patterns are read from stdin and may contain special characters that must keep their regex meaning (anchors, character classes, etc.). A row counts as a match only when every pattern matches it.

Currently I'm using a nested for loop, but I'm not very satisfied with the speed.

Thanks for your suggestions.

user764169 asked May 21 '11 17:05


2 Answers

Try Regexp::Assemble, as suggested in the post you linked to, and compare it against an iterative approach such as grep. Regexp::Assemble should produce the fastest solution, since Perl can optimize the joined regexes rather than scanning the whole line once per pattern. Since you don't know your input beforehand, your mileage may vary.
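One thing to keep in mind: as far as I know, Regexp::Assemble joins patterns into an alternation, so the result matches when any one pattern matches, while the question needs every pattern to match. A different way to get a single "all of them" regex is to wrap each pattern in a lookahead. This is only a sketch of that swapped-in technique, not something Regexp::Assemble does for you; build_all_of is a made-up helper name, and the sample patterns and lines are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build one regex that succeeds only if every pattern
# matches somewhere in the line (logical AND expressed with lookaheads).
sub build_all_of {
    my @patterns = @_;
    # Each pattern becomes (?=.*?(?:PATTERN)), tried from the start of the line.
    my $combined = join '', map { "(?=.*?(?:$_))" } @patterns;
    return qr/\A$combined/;
}

# Illustrative patterns only; the real ones would come from stdin.
my $all = build_all_of('^foo', 'ba[rz]', '\d+$');

for my $line ("foo bar 42", "foo qux 42", "bar foo 42") {
    printf "%-12s %s\n", $line, ($line =~ $all ? 'match' : 'no match');
}
```

Patterns that use numbered backreferences would need care here, because wrapping them shifts nothing but combining several of them does change the group numbering. Whether this beats a plain loop depends on the patterns and the data, so measure it.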

Which version of Perl you're using will affect performance. 5.10 introduced a lot of optimizations for exactly this purpose (see "tries"). One of the biggest use cases is spam scanners like SpamAssassin which build a big regex of all the patterns they scan for, just like Regexp::Assemble.

Finally, since your input is so large, it may be worthwhile to assemble the regex into a file and then run grep -P -f $regex_file $big_file. -P tells grep to use Perl compatible regular expressions. The file is used to avoid shell quoting or command size limits. grep may blow the doors off Perl.

In the end, you're going to have to do the benchmarking.
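For the benchmarking itself, the core Benchmark module is usually enough. Below is a rough sketch that compares counting matches with grep against a loop that bails out at the first failing pattern. The patterns, the sample lines, and the sub names are all made up for illustration; swap in a slice of the real file and the real patterns before trusting any numbers.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Illustrative data only: stand-ins for the real patterns and file rows.
my @regexps = map { qr/$_/ } ('^foo', 'ba[rz]', '\d+$');
my @lines   = ("foo bar 42\n", "foo qux 42\n", "bar foo 42\n") x 1000;

# Approach 1: count every regex that matches, then compare with the total.
sub count_with_grep {
    my $hits = 0;
    for my $line (@lines) {
        my $matched = grep { $line =~ $_ } @regexps;
        $hits++ if $matched == @regexps;
    }
    return $hits;
}

# Approach 2: give up on a line as soon as one regex fails.
sub count_with_loop {
    my $hits = 0;
    LINE: for my $line (@lines) {
        for my $re (@regexps) {
            next LINE unless $line =~ $re;
        }
        $hits++;
    }
    return $hits;
}

# Run each approach for at least one CPU second and print a comparison table.
cmpthese(-1, {
    grep_all      => \&count_with_grep,
    short_circuit => \&count_with_loop,
});
```

Putting the pattern that fails most often first should help the short-circuit version, but that is a guess until it is measured on real data.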

Schwern answered Oct 04 '22 06:10


Did you try using grep?

while (my $line = <>) {
    # grep in scalar context returns how many regexes matched this line
    if (@regexps == grep { $line =~ /$_/ } @regexps) {
        # ... All matched
    }
}
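If @regexps holds plain strings, Perl most likely has to recompile the pattern every time /$_/ sees a different string, which adds up over a 500 MB file. Compiling each pattern once with qr// avoids that, and wrapping the compile in eval catches bad patterns coming from stdin. A minimal sketch of that idea, keeping the same grep test; compile_patterns and matches_all are hypothetical helper names and the sample patterns are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: compile each pattern string once, skipping invalid ones.
sub compile_patterns {
    my @compiled;
    for my $pattern (@_) {
        my $re = eval { qr/$pattern/ };   # dies on a malformed pattern
        if (defined $re) { push @compiled, $re }
        else             { warn "Skipping invalid pattern '$pattern': $@" }
    }
    return @compiled;
}

# Hypothetical helper: true only if every compiled regex matches the line.
sub matches_all {
    my ($line, @regexps) = @_;
    return @regexps == grep { $line =~ $_ } @regexps;
}

# Illustrative patterns only; the real ones would be read from stdin.
my @regexps = compile_patterns('^foo', 'ba[rz]', '\d+$');
print matches_all("foo bar 42\n", @regexps) ? "all matched\n" : "not all matched\n";
```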
Dov Grobgeld answered Oct 04 '22 07:10