What's the fastest way to count the number of words in a string in Perl?

Question

I have a few functions that I'm running over a million times on various texts, which means small improvements in these functions translate to big gains overall. Currently, I've noticed that all of my functions which involve word counts take drastically longer to run than everything else, so I'm thinking I want to try doing word count in a different way.

Basically, what my function does is grab a number of objects that have text associated with them, verify that that text doesn't match certain patterns, and then count the number of words in that text. A basic version of the function is:

my $num_words = 0;
for (my $i=$begin_pos; $i<=$end_pos; $i++) {
   my $text = $self->_getTextFromNode($i);
   #If it looks like a node full of bogus text, or just a number, remove it.
   if ($text =~ /^\s*\<.*\>\s*$/ && $begin_pos == $end_pos) { return 0; }
   if ($text =~ /^\s*(?:Page\s*\d+)|http/i && $begin_pos == $end_pos) { return 0; }
   if ($text =~ /^\s*\d+\s*$/ && $begin_pos == $end_pos) { return 0; }
   my @text_words = split(/\s+/, $text);
   $num_words += scalar(@text_words);
   if ($num_words > 30) { return 30; }
}
return $num_words;
}

I'm doing plenty of text comparisons similar to what I'm doing here elsewhere in my code, so I'm guessing my problem must be with my word counting. Is there a faster way to do it than splitting on \s+? If so, what is it and why is it faster (so I can understand what I'm doing wrong and can apply that knowledge to similar problems later on).

Eric Strom · Accepted Answer

Using a while loop with a regex is the fastest way that I have found to count words:

my $text = 'asdf asdf asdf asdf asdf';

sub count_array {
   my @text_words = split(/\s+/, $text);
   scalar(@text_words);
}

sub count_list {
    my $x =()= $text =~ /\S+/g;       #/
}

sub count_while {
    my $num; 
    $num++ while $text =~ /\S+/g;     #/
    $num
}

say count_array; # 5
say count_list;  # 5
say count_while; # 5

use Benchmark 'cmpthese';

cmpthese -2 => {
    array => \&count_array,
    list  => \&count_list,
    while => \&count_while,
}

#          Rate  list array while
# list  303674/s    --  -22%  -55%
# array 389212/s   28%    --  -42%
# while 675295/s  122%   74%    --

The while loop is faster because memory does not need to be allocated for each of the found words. Also the regex is in boolean context which means it does not need to extract the actual match from the string.

hexcoder · Answer

If words are seperated only by single spaces, counting spaces is fast.

sub count1
{
    my $str = shift;
    return 1 + ($str =~ tr{ }{ });
}

updated benchmark:

my $text = 'asdf asdf asdf asdf asdf';

sub count_array {
   my @text_words = split(/\s+/, $text);
   scalar(@text_words);
}

sub count_list {
   my $x =()= $text =~ /\S+/g;       #/
}

sub count_while {
   my $num; 
   $num++ while $text =~ /\S+/g;     #/
   $num
}

sub count_tr {
    1 + ($text =~ tr{ }{ });
}

say count_array; # 5
say count_list;  # 5
say count_while; # 5
say count_tr; # 5

use Benchmark 'cmpthese';

cmpthese -2 => {
    array => \&count_array,
    list  => \&count_list,
    while => \&count_while,
    tr    => \&count_tr,
}

#            Rate  list while array    tr
# list   220911/s    --  -24%  -44%  -94%
# while  291225/s   32%    --  -26%  -92%
# array  391769/s   77%   35%    --  -89%
# tr    3720197/s 1584% 1177%  850%    --

What's the fastest way to count the number of words in a string in Perl?

Tags:

perl

Eli

2 Answers

Eric Strom

hexcoder

Recent Activity

Donate For Us

What's the fastest way to count the number of words in a string in Perl?

Tags:

perl

Eli

2 Answers

Eric Strom

hexcoder

Related questions

Recent Activity

Donate For Us