I've seen the previous post about matching against multiple regexes, "How can I match against multiple regexes in Perl?".
I'm looking for the fastest way to match all the values contained in an array against a very big file (500 MB).
The patterns are read from stdin and may contain special characters that must be interpreted as regex syntax (anchors, character classes, etc.). A row counts as a match only when it matches all of the patterns.
Currently I'm using a nested for loop, but I'm not very satisfied with the speed.
Thanks for your suggestions.
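Try Regexp::Assemble, as suggested in the post you linked to, and compare it to an iterative approach like grep. Regexp::Assemble should produce the fastest solution, since Perl can optimize the joined regexes rather than scanning the whole line once per pattern. Keep in mind that the assembled regex matches a line when any one of the patterns matches, so on its own it does not check that all of them are present. Since you don't know your input beforehand, YMMV.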
Which version of Perl you're using will affect performance. 5.10 introduced a lot of optimizations for exactly this purpose (see "tries"). One of the biggest use cases is spam scanners like SpamAssassin which build a big regex of all the patterns they scan for, just like Regexp::Assemble.
Finally, since your input is so large, it may be worthwhile to assemble the regex into a file and then run grep -P -f $regex_file $big_file. -P tells grep to use Perl-compatible regular expressions. The file is used to avoid shell quoting or command size limits. grep may blow the doors off Perl.
In the end, you're going to have to do the benchmarking.
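One way to run that comparison is Perl's core Benchmark module. The sketch below is a minimal setup, assuming a small set of in-memory sample lines stands in for the real file. The patterns, the sample data, and the sub names (count_with_grep, short_circuit) are hypothetical and not taken from the original post.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: substitute the real patterns and a sample of the real file.
my @regexps = map { qr/$_/ } ('^foo', 'ba[rz]', '\d+$');
my @lines   = ("foo bar 42", "foo qux 42", "bar foo 42") x 1000;

# Variant 1: count the matching patterns with grep, then compare to the total.
sub count_with_grep {
    my $hits = 0;
    for my $line (@lines) {
        my $matched = grep { $line =~ $_ } @regexps;
        $hits++ if $matched == @regexps;
    }
    return $hits;
}

# Variant 2: give up on a line as soon as one pattern fails.
sub short_circuit {
    my $hits = 0;
    LINE: for my $line (@lines) {
        for my $re (@regexps) {
            next LINE unless $line =~ $re;
        }
        $hits++;
    }
    return $hits;
}

# Run each variant 500 times and print a comparison table.
cmpthese(500, {
    grep_count    => \&count_with_grep,
    short_circuit => \&short_circuit,
});
```

The numbers only mean something when the sample lines and patterns resemble the real workload, so it is worth feeding in a slice of the actual 500 MB file rather than synthetic data.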
Did you try using grep?
while (my $line = <>) {
    # grep in scalar context returns how many patterns matched this line
    if (scalar(grep($line =~ /$_/, @regexps)) == scalar(@regexps)) {
        # ... All matched
    }
}
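The grep version tests every pattern against every line, even after one pattern has already failed. A minimal sketch of a short-circuiting variant follows, assuming the patterns are compiled once with qr// before the loop. The helper name matches_all and the sample patterns and lines are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample patterns; in the question they are read from STDIN.
my @patterns = ('^foo', 'ba[rz]', '\d+$');

# Compile each pattern once, outside the per-line loop.
my @regexps = map { qr/$_/ } @patterns;

# Returns 1 if every regex matches $line, or 0 as soon as one fails.
sub matches_all {
    my ($line, $regexps) = @_;
    for my $re (@$regexps) {
        return 0 unless $line =~ $re;
    }
    return 1;
}

# Try the helper on a few sample lines.
for my $line ("foo bar 42", "foo qux 42", "bar foo 42") {
    printf "%-12s %s\n", $line,
        matches_all($line, \@regexps) ? "match" : "no match";
}
```

Putting the pattern most likely to fail first in the list lets most lines be rejected after a single test, though whether that helps depends on the data.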