I have a few functions that I'm running over a million times on various texts, which means small improvements in these functions translate to big gains overall. Currently, I've noticed that all of my functions which involve word counts take drastically longer to run than everything else, so I'm thinking I want to try doing word count in a different way.
Basically, what my function does is grab a number of objects that have text associated with them, verify that that text doesn't match certain patterns, and then count the number of words in that text. A basic version of the function is:
my $num_words = 0;
for (my $i=$begin_pos; $i<=$end_pos; $i++) {
my $text = $self->_getTextFromNode($i);
#If it looks like a node full of bogus text, or just a number, remove it.
if ($text =~ /^\s*\<.*\>\s*$/ && $begin_pos == $end_pos) { return 0; }
if ($text =~ /^\s*(?:Page\s*\d+)|http/i && $begin_pos == $end_pos) { return 0; }
if ($text =~ /^\s*\d+\s*$/ && $begin_pos == $end_pos) { return 0; }
my @text_words = split(/\s+/, $text);
$num_words += scalar(@text_words);
if ($num_words > 30) { return 30; }
}
return $num_words;
}
I'm doing plenty of text comparisons similar to what I'm doing here elsewhere in my code, so I'm guessing my problem must be with my word counting. Is there a faster way to do it than splitting on \s+
? If so, what is it and why is it faster (so I can understand what I'm doing wrong and can apply that knowledge to similar problems later on).
Using a while loop with a regex is the fastest way that I have found to count words:
my $text = 'asdf asdf asdf asdf asdf';
sub count_array {
my @text_words = split(/\s+/, $text);
scalar(@text_words);
}
sub count_list {
my $x =()= $text =~ /\S+/g; #/
}
sub count_while {
my $num;
$num++ while $text =~ /\S+/g; #/
$num
}
say count_array; # 5
say count_list; # 5
say count_while; # 5
use Benchmark 'cmpthese';
cmpthese -2 => {
array => \&count_array,
list => \&count_list,
while => \&count_while,
}
# Rate list array while
# list 303674/s -- -22% -55%
# array 389212/s 28% -- -42%
# while 675295/s 122% 74% --
The while loop is faster because memory does not need to be allocated for each of the found words. Also the regex is in boolean context which means it does not need to extract the actual match from the string.
If words are seperated only by single spaces, counting spaces is fast.
sub count1
{
my $str = shift;
return 1 + ($str =~ tr{ }{ });
}
updated benchmark:
my $text = 'asdf asdf asdf asdf asdf';
sub count_array {
my @text_words = split(/\s+/, $text);
scalar(@text_words);
}
sub count_list {
my $x =()= $text =~ /\S+/g; #/
}
sub count_while {
my $num;
$num++ while $text =~ /\S+/g; #/
$num
}
sub count_tr {
1 + ($text =~ tr{ }{ });
}
say count_array; # 5
say count_list; # 5
say count_while; # 5
say count_tr; # 5
use Benchmark 'cmpthese';
cmpthese -2 => {
array => \&count_array,
list => \&count_list,
while => \&count_while,
tr => \&count_tr,
}
# Rate list while array tr
# list 220911/s -- -24% -44% -94%
# while 291225/s 32% -- -26% -92%
# array 391769/s 77% 35% -- -89%
# tr 3720197/s 1584% 1177% 850% --
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With