
Regular expression - Ruby vs Perl

Tags: regex, ruby, perl

I noticed some extreme delays in my Ruby (1.9) scripts and after some digging it boiled down to regular expression matching. I'm using the following test scripts in Perl and in Ruby:

Perl:

$fname = shift(@ARGV);
open(FILE, "<$fname" );
while (<FILE>) {
    if ( /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/ ) {
        print "$1: $2\n";
    }
}

Ruby:

f = File.open( ARGV.shift )
while ( line = f.gets )
    if /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/.match(line)
        puts "#{$1}: #{$2}"
    end
end

I use the same input for both scripts, a file with only 44290 lines. The timing for each one is:

Perl:

xenofon@cpm:~/bin/local/project$ time ./try.pl input >/dev/null

real    0m0.049s
user    0m0.040s
sys     0m0.000s

Ruby:

xenofon@cpm:~/bin/local/project$ time ./try.rb input >/dev/null

real    1m5.106s
user    1m4.910s
sys     0m0.010s

I guess I'm doing something awfully stupid. Any suggestions?

Thank you

asked Apr 20 '12 09:04 by xpapad



2 Answers

regex = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
f = File.open( ARGV.shift ).each do |line|
    if regex.match(line)
        puts "#{$1}: #{$2}"
    end
end

Or

regex = Regexp.new(/(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/)
f = File.open( ARGV.shift )
f.each_line do |line|
    if regex.match(line)
        puts "#{$1}: #{$2}"
    end
end
answered Oct 06 '22 23:10 by LaGrandMere


One possible difference is the amount of backtracking being performed. Perl might do a better job of pruning the search tree when backtracking (i.e. noticing when part of a pattern can't possibly match). Its regex engine is highly optimised.

First, adding a leading «^» could make a huge difference. If the pattern doesn't match starting at position 0, it's not going to match at starting position 1 either! So don't try to match at position 1.
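
To make the anchoring point concrete, here is a minimal Ruby sketch. The sample lines (HIT, MISS) are made up, since the question does not show the real log format, and the timings it prints will vary by Ruby version. In Ruby, «^» anchors at the start of any line and «\A» at the start of the string; for input read one line at a time either should do. Both patterns capture the same fields on a matching line; the difference shows up on lines that fail to match, where the unanchored pattern is retried at every start position.

```ruby
require 'benchmark'

UNANCHORED = /(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/
ANCHORED   = /^(.*?) \|.*?SENDING REQUEST.*?TID=(.*?),/

# Hypothetical sample lines; the real log format is not shown in the question.
HIT  = "2012-04-20 09:04:01 | worker-3 SENDING REQUEST to backend TID=12345, size=10\n"
MISS = "2012-04-20 09:04:02 | worker-3 RECEIVED RESPONSE " + ("x" * 500) + "\n"

# On a matching line, both patterns capture the same two fields.
m = ANCHORED.match(HIT)
puts "#{m[1]}: #{m[2]}"
puts "same captures: #{UNANCHORED.match(HIT).captures == m.captures}"

# Rough timing on non-matching lines only.
lines = [MISS] * 100
Benchmark.bm(12) do |bm|
  bm.report('unanchored') { lines.each { |l| UNANCHORED.match(l) } }
  bm.report('anchored')   { lines.each { |l| ANCHORED.match(l) } }
end
```

If the anchored version is not noticeably faster on your Ruby, the engine is probably already skipping the useless retries on its own.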

Along the same lines, «.*?» isn't as limiting as you might think, and replacing each instance of it with a more limiting pattern could prevent a lot of backtracking.

Why don't you try:

/
    ^
    (.*?)                       [ ]\|
    (?:(?!SENDING[ ]REQUEST).)* SENDING[ ]REQUEST
    (?:(?!TID=).)*              TID=
    ([^,]*)                     ,
/x

(Not sure if it was safe to replace the first «.*?» with «[^|]*», so I didn't.)

(At least for patterns that match a single string, (?:(?!PAT).) is to PAT as [^CHAR] is to CHAR.)
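
As a rough illustration, here is the pattern above dropped into Ruby. The `extract` helper and the sample lines are hypothetical, not from the question. Ruby's /x mode keeps a space inside «[ ]» significant, so the pattern should carry over as written. The last line checks the analogy for a single-character PAT: on a string with no newlines, «(?:(?!,).)*» and «[^,]*» pick out the same text.

```ruby
# The tightened pattern, written with Ruby's /x (extended) flag.
TIGHT = /
    ^
    (.*?)                       [ ]\|
    (?:(?!SENDING[ ]REQUEST).)* SENDING[ ]REQUEST
    (?:(?!TID=).)*              TID=
    ([^,]*)                     ,
/x

# Hypothetical helper: returns [prefix, tid] for a matching line, nil otherwise.
def extract(line)
  m = TIGHT.match(line)
  m && [m[1], m[2]]
end

# Hypothetical sample lines; the real input format is not shown in the question.
p extract("2012-04-20 09:04:01 | worker-3 SENDING REQUEST to backend TID=12345, size=10")
p extract("2012-04-20 09:04:02 | worker-3 RECEIVED RESPONSE TID=12345,")

# Tempered dot vs. negated character class for a one-character PAT.
TEMPERED = /\A(?:(?!,).)*/
NEGATED  = /\A[^,]*/
p "12345, size=10"[TEMPERED] == "12345, size=10"[NEGATED]
```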

Using /s could possibly speed things up if «.» is allowed to match newlines, but I think it's pretty minor.
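
One naming wrinkle when trying this in Ruby: Perl's /s (dot matches newline) is spelled /m in Ruby, and a trailing «s» on a Ruby regex literal is an unrelated encoding flag. A tiny sketch, using a made-up two-line string:

```ruby
TEXT = "SENDING\nREQUEST"

# Without /m, «.» does not match a newline in Ruby.
DOT_DEFAULT = /SENDING.REQUEST/
# With /m, «.» also matches a newline (what Perl calls /s).
DOT_ALL = /SENDING.REQUEST/m

p(DOT_DEFAULT =~ TEXT)  # nil: no match
p(DOT_ALL =~ TEXT)      # 0: match at offset 0
```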

Using «\space» instead of «[space]» to match a space under /x might be slightly faster in Ruby. (They're the same in recent versions of Perl.) I used the latter because it's far more readable.
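
If you want to check that on your own Ruby, a small sketch along these lines may help. The test line and the iteration count are arbitrary assumptions, and any speed difference it reports is likely to be small and version-dependent. It also shows why one of the two spellings is needed at all: a bare space is dropped under /x.

```ruby
require 'benchmark'

BRACKET = /SENDING[ ]REQUEST/x   # space inside a character class
ESCAPED = /SENDING\ REQUEST/x    # backslash-escaped space
BARE    = /SENDING REQUEST/x     # bare space is ignored under x

# Hypothetical test line.
LINE = "2012-04-20 09:04:01 | worker-3 SENDING REQUEST to backend TID=12345,"

p(BRACKET =~ LINE)          # offset of the match
p(ESCAPED =~ LINE)          # same offset
p(BARE =~ LINE)             # nil, the space was not part of the pattern
p(BARE =~ "SENDINGREQUEST") # 0

n = 100_000
Benchmark.bm(10) do |bm|
  bm.report('[ ]') { n.times { BRACKET =~ LINE } }
  bm.report('\\ ') { n.times { ESCAPED =~ LINE } }
end
```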

answered Oct 07 '22 01:10 by ikegami