Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What's the fastest way to count the number of words in a string in Perl?

Tags:

perl

I have a few functions that I'm running over a million times on various texts, which means small improvements in these functions translate to big gains overall. Currently, I've noticed that all of my functions which involve word counts take drastically longer to run than everything else, so I'm thinking I want to try doing word count in a different way.

Basically, what my function does is grab a number of objects that have text associated with them, verify that that text doesn't match certain patterns, and then count the number of words in that text. A basic version of the function is:

my $num_words = 0;
for (my $i=$begin_pos; $i<=$end_pos; $i++) {
   my $text = $self->_getTextFromNode($i);
   #If it looks like a node full of bogus text, or just a number, remove it.
   if ($text =~ /^\s*\<.*\>\s*$/ && $begin_pos == $end_pos) { return 0; }
   if ($text =~ /^\s*(?:Page\s*\d+)|http/i && $begin_pos == $end_pos) { return 0; }
   if ($text =~ /^\s*\d+\s*$/ && $begin_pos == $end_pos) { return 0; }
   my @text_words = split(/\s+/, $text);
   $num_words += scalar(@text_words);
   if ($num_words > 30) { return 30; }
}
return $num_words;
}

I'm doing plenty of text comparisons similar to what I'm doing here elsewhere in my code, so I'm guessing my problem must be with my word counting. Is there a faster way to do it than splitting on \s+? If so, what is it and why is it faster (so I can understand what I'm doing wrong and can apply that knowledge to similar problems later on).

like image 440
Eli Avatar asked May 19 '11 19:05

Eli


2 Answers

Using a while loop with a regex is the fastest way that I have found to count words:

my $text = 'asdf asdf asdf asdf asdf';

sub count_array {
   my @text_words = split(/\s+/, $text);
   scalar(@text_words);
}

sub count_list {
    my $x =()= $text =~ /\S+/g;       #/
}

sub count_while {
    my $num; 
    $num++ while $text =~ /\S+/g;     #/
    $num
}

say count_array; # 5
say count_list;  # 5
say count_while; # 5

use Benchmark 'cmpthese';

cmpthese -2 => {
    array => \&count_array,
    list  => \&count_list,
    while => \&count_while,
}

#          Rate  list array while
# list  303674/s    --  -22%  -55%
# array 389212/s   28%    --  -42%
# while 675295/s  122%   74%    --

The while loop is faster because memory does not need to be allocated for each of the found words. Also the regex is in boolean context which means it does not need to extract the actual match from the string.

like image 79
Eric Strom Avatar answered Jan 04 '23 11:01

Eric Strom


If words are seperated only by single spaces, counting spaces is fast.

sub count1
{
    my $str = shift;
    return 1 + ($str =~ tr{ }{ });
}

updated benchmark:

my $text = 'asdf asdf asdf asdf asdf';

sub count_array {
   my @text_words = split(/\s+/, $text);
   scalar(@text_words);
}

sub count_list {
   my $x =()= $text =~ /\S+/g;       #/
}

sub count_while {
   my $num; 
   $num++ while $text =~ /\S+/g;     #/
   $num
}

sub count_tr {
    1 + ($text =~ tr{ }{ });
}

say count_array; # 5
say count_list;  # 5
say count_while; # 5
say count_tr; # 5

use Benchmark 'cmpthese';

cmpthese -2 => {
    array => \&count_array,
    list  => \&count_list,
    while => \&count_while,
    tr    => \&count_tr,
}

#            Rate  list while array    tr
# list   220911/s    --  -24%  -44%  -94%
# while  291225/s   32%    --  -26%  -92%
# array  391769/s   77%   35%    --  -89%
# tr    3720197/s 1584% 1177%  850%    --
like image 33
hexcoder Avatar answered Jan 04 '23 12:01

hexcoder