I noticed some extreme delays in my Ruby (1.9) scripts and after some digging it boiled down to regular expression matching. I'm using the following test scripts in Perl and in Ruby:
Perl:
$fname = shift(@ARGV);
open(FILE, "<$fname" );
while (<FILE>) {
    if ( /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/ ) {
        print "$1: $2\n";
    }
}
Ruby:
f = File.open( ARGV.shift )
while ( line = f.gets )
  if /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/.match(line)
    puts "#{$1}: #{$2}"
  end
end
I use the same input for both scripts, a file with only 44290 lines. The timing for each one is:
Perl:
xenofon@cpm:~/bin/local/project$ time ./try.pl input >/dev/null
real 0m0.049s
user 0m0.040s
sys 0m0.000s
Ruby:
xenofon@cpm:~/bin/local/project$ time ./try.rb input >/dev/null
real 1m5.106s
user 1m4.910s
sys 0m0.010s
I guess I'm doing something awfully stupid, any suggestions?
Thank you
A regular expression (regex, regexp, or RE) in Perl is a special text string that describes a search pattern within a given text. Perl's regex syntax is built into the host language and is not identical to that of PHP, Python, etc. It is sometimes referred to as "Perl 5 Compatible Regular Expressions".
A regular expression is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings. Ruby regular expressions (Ruby regex for short) help us find particular patterns inside a string. Two common uses of Ruby regex are validation and parsing.
In the 1980s, the more complicated regexes arose in Perl, which originally derived from a regex library written by Henry Spencer (1986), who later wrote an implementation of Advanced Regular Expressions for Tcl.
regex = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
f = File.open( ARGV.shift ).each do |line|
  if regex.match(line)
    puts "#{$1}: #{$2}"
  end
end
Or
regex = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
f = File.open( ARGV.shift )
f.each_line do |line|
  if regex.match(line)
    puts "#{$1}: #{$2}"
  end
end
One possible difference is the amount of backtracking being performed. Perl might do a better job of pruning the search tree when backtracking (i.e. noticing when part of a pattern can't possibly match). Its regex engine is highly optimised.
First, adding a leading «^» could make a huge difference. If the pattern doesn't match starting at position 0, it's not going to match at starting position 1 either! So don't try to match at position 1.
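To see the effect of the anchor for yourself, something like the following rough sketch could be used. The two sample lines are made up (the question doesn't show the real log format), and the timings will vary with the Ruby version, so treat it as a way to compare the two patterns rather than as a definitive benchmark. Note that in Ruby «^» matches at the start of every line within the string; «\A» is the stricter anchor if you only ever want position 0.

```ruby
require "benchmark"

UNANCHORED = /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/
ANCHORED   = /^(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/

# Hypothetical log lines; the real input format is an assumption.
HIT  = "2011-03-01 12:00:00 | INFO SENDING REQUEST to host TID=12345, done\n"
MISS = "2011-03-01 12:00:01 | " + "x" * 2000 + "\n"

# Returns the two captured groups, or nil when the line does not match.
def captures(regex, line)
  m = regex.match(line)
  m && m.captures
end

# A non-matching line is the interesting case: without the anchor the
# engine may retry the whole pattern from every start position.
Benchmark.bm(12) do |bm|
  bm.report("unanchored:") { 10.times { UNANCHORED.match(MISS) } }
  bm.report("anchored:")   { 10.times { ANCHORED.match(MISS) } }
end
```

Both patterns return the same captures on a matching line; only the amount of work done on lines that fail should differ.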
Along the same lines, «.*?» isn't as limiting as you might think, and replacing each instance of it with a more limiting pattern could prevent a lot of backtracking.
Why don't you try:
/
^
(.*?) [ ]\|
(?:(?!SENDING[ ]REQUEST).)* SENDING[ ]REQUEST
(?:(?!TID=).)* TID=
([^,]*) ,
/x
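In Ruby, that pattern could be dropped in roughly like this. It's only a sketch: the helper name extract_requests and the sample lines are made up for illustration, and the real log format is assumed, not known.

```ruby
# The pattern from above, using Ruby's /x (extended) flag.
TEMPERED = /
  ^
  (.*?) [ ]\|
  (?:(?!SENDING[ ]REQUEST).)* SENDING[ ]REQUEST
  (?:(?!TID=).)* TID=
  ([^,]*) ,
/x

# Hypothetical helper: returns "prefix: tid" for every matching line.
# `lines` can be any Enumerable of strings.
def extract_requests(lines)
  lines.map do |line|
    m = TEMPERED.match(line)
    "#{m[1]}: #{m[2]}" if m
  end.compact
end

# Made-up sample input, one matching line and one that doesn't match.
SAMPLE = [
  "2011-03-01 12:00:00 | INFO SENDING REQUEST to host TID=12345, done\n",
  "2011-03-01 12:00:01 | INFO nothing interesting here\n",
]

puts extract_requests(SAMPLE)  # prints "2011-03-01 12:00:00: 12345"
```

For a real file you could pass File.foreach(path) instead of SAMPLE, although for very large inputs printing inside the loop avoids building the whole result array.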
(Not sure if it was safe to replace the first «.*?» with «[^|]*», so I didn't.)
(At least for patterns that match a single string, (?:(?!PAT).) is to PAT as [^CHAR] is to CHAR.)
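A tiny illustration of that analogy, using a made-up input string: the negated character class stops at the first occurrence of a character, while the lookahead-guarded dot stops at the first occurrence of a whole string.

```ruby
# Made-up input for illustration only.
line = "key=a TID=7,"

# [^=]* consumes everything up to the first "=" character.
up_to_char = line[/\A[^=]*/]

# (?:(?!TID=).)* consumes everything up to the first "TID=" string.
up_to_string = line[/\A(?:(?!TID=).)*/]

p up_to_char    # => "key"
p up_to_string  # => "key=a "
```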
Using /s could possibly speed things up if «.» is allowed to match newlines, but I think it's pretty minor. (That's the Perl flag; in Ruby, the flag that lets «.» match newlines is /m.)
Using «\space» instead of «[space]» (where "space" stands for a literal space character) to match a space under /x might be slightly faster in Ruby. (They're the same in recent versions of Perl.) I used the latter because it's far more readable.
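For reference, here is a small check of how the two spellings behave under /x in Ruby. It only shows that they match the same thing; it doesn't measure any speed difference, and the constant names are just for illustration.

```ruby
ESCAPED   = /a\ b/x   # backslash-space: a literal space
BRACKETED = /a[ ]b/x  # space inside a character class: a literal space
BARE      = /a b/x    # unescaped space is ignored under /x, same as /ab/

p ESCAPED.match?("a b")    # => true
p BRACKETED.match?("a b")  # => true
p BARE.match?("a b")       # => false
p BARE.match?("ab")        # => true
```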