I'm writing a Perl script in which I need to loop over each character of a string. There are a lot of strings, and each is 100 characters long (they're short DNA sequences, in case you're wondering).
So, is it faster to use substr to extract each character one at a time, or is it faster to split the string into an array and then iterate over the array?
While I'm waiting for an answer, I suppose I'll go read up on how to benchmark things in Perl.
substr() in Perl returns a substring out of the string passed to the function starting from a given index up to the length specified. This function by default returns the remaining part of the string starting from the given index if the length is not specified.
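As a quick illustration of both forms (with and without a length), here is a minimal sketch. The sample sequence in $dna is made up for the example.

```perl
use strict;
use warnings;

my $dna = 'GATTACA';              # hypothetical sample sequence

my $one  = substr($dna, 2, 1);    # one character at index 2 (indexes are 0-based)
my $some = substr($dna, 1, 3);    # three characters starting at index 1
my $rest = substr($dna, 4);       # no length given: everything from index 4 onward

print "$one $some $rest\n";       # prints "T ATT ACA"
```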
If you need to split a string into characters, you can do this: @array = split(//, $string); (with no second argument, split operates on $_). After this statement executes, @array will be an array of characters. split recognizes the empty pattern as a request to make every character into a separate array element.
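A minimal sketch of the empty-pattern split, assuming a made-up four-letter sequence in $dna:

```perl
use strict;
use warnings;

my $dna   = 'GATC';               # hypothetical sample sequence
my @chars = split //, $dna;       # empty pattern: one array element per character

print scalar(@chars), " characters: @chars\n";   # prints "4 characters: G A T C"
```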
A string is split on the delimiter specified by the pattern. By default, whitespace is assumed as the delimiter. The split syntax is: split /pattern/, $variable.
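For comparison with the character-by-character case, here is a small sketch of splitting on a delimiter. The record and line contents are invented for illustration.

```perl
use strict;
use warnings;

my $record = "sample1\tGATC\t100";     # hypothetical tab-separated record
my @fields = split /\t/, $record;      # split on an explicit delimiter pattern

my $line  = "  GATC   TTAG CCGA ";     # hypothetical whitespace-padded line
my @words = split ' ', $line;          # ' ' splits on runs of whitespace and skips leading whitespace

print scalar(@fields), " fields, ", scalar(@words), " words\n";   # prints "3 fields, 3 words"
```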
First, define a string. Next, create a for-loop where the loop variable will start from index 0 and end at the length of the given string. Print the character present at every index in order to separate each individual character. For better visualization, separate each individual character by space.
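The steps just described could look roughly like this in Perl. It is only a sketch, and the string in $dna is a made-up example.

```perl
use strict;
use warnings;

my $dna = 'GATC';                          # hypothetical sample sequence
my @seen;

# Walk every valid index: 0 up to length - 1
for my $i (0 .. length($dna) - 1) {
    push @seen, substr($dna, $i, 1);       # the single character at index $i
}

print join(' ', @seen), "\n";              # prints "G A T C"
```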
It really depends on exactly what you're doing with your data -- but hey, you're headed the right way with your last question! Don't guess, benchmark.
Perl provides the Benchmark module for exactly this kind of thing, and using it is really pretty straightforward. Here's a little sample code to get started with:
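#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build a random 100-character DNA string to test against
my $dna;
$dna .= [qw(G A T C)]->[rand 4] for 1 .. 100;

# Count characters per position using substr
sub frequency_substr {
    my $length = length $dna;
    my %hist;
    for my $pos (0 .. $length - 1) {    # the last valid index is length - 1
        $hist{$pos}{substr $dna, $pos, 1} ++;
    }
    \%hist;
}

# Count characters per position by splitting into a list first
sub frequency_split {
    my %hist;
    my $pos = 0;
    for my $char (split //, $dna) {
        $hist{$pos ++}{$char} ++;
    }
    \%hist;
}

# Count characters per position with a global regex match
sub frequency_regmatch {
    my %hist;
    while ($dna =~ /(.)/g) {
        $hist{pos($dna)}{$1} ++;    # pos() is the offset after the match, so keys run 1..100
    }
    \%hist;
}

cmpthese(-5, # Run each for at least 5 CPU seconds
    {
        substr => \&frequency_substr,
        split  => \&frequency_split,
        regex  => \&frequency_regmatch
    }
);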
And a sample result:
         Rate  regex  split substr
regex  6254/s     --   -26%   -32%
split  8421/s    35%     --    -9%
substr 9240/s    48%    10%     --
Turns out substr is surprisingly fast. :)
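Before trusting a comparison like this, it can be worth checking that the candidates really compute the same thing. Here is a rough sketch of such a check, using a fixed made-up sequence and hypothetical helper names (by_substr, by_split, by_match, flatten). Note that pos() reports the offset just past the match, so the sketch subtracts 1 to get 0-based positions.

```perl
use strict;
use warnings;

my $dna = 'GATTACA';    # made-up test sequence

sub by_substr {
    my ($s) = @_;
    my %hist;
    $hist{$_}{ substr($s, $_, 1) }++ for 0 .. length($s) - 1;
    return \%hist;
}

sub by_split {
    my ($s) = @_;
    my %hist;
    my $pos = 0;
    $hist{ $pos++ }{$_}++ for split //, $s;
    return \%hist;
}

sub by_match {
    my ($s) = @_;
    my %hist;
    while ($s =~ /(.)/g) {
        $hist{ pos($s) - 1 }{$1}++;    # shift pos() back to a 0-based index
    }
    return \%hist;
}

# Flatten a histogram into a canonical string so two results can be compared
sub flatten {
    my ($h) = @_;
    my @out;
    for my $p (sort { $a <=> $b } keys %$h) {
        for my $c (sort keys %{ $h->{$p} }) {
            push @out, "${p}:${c}=" . $h->{$p}{$c};
        }
    }
    return join ',', @out;
}

my $from_substr = flatten(by_substr($dna));
my $from_split  = flatten(by_split($dna));
my $from_match  = flatten(by_match($dna));

print $from_substr eq $from_split && $from_split eq $from_match
    ? "all agree\n"
    : "mismatch\n";
```

With the positions aligned this way, all three helpers should produce the same histogram, so any speed difference reflects the extraction method rather than differing work.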
Here is what I would do instead of first trying to choose between substr and split:
#!/usr/bin/perl
use strict; use warnings;

my %dist;

while ( my $s = <> ) {
    while ( $s =~ /(.)/g ) {
        ++ $dist{ pos($s) }{ $1 };
    }
}
My curiosity got the best of me. Here is a benchmark:
#!/usr/bin/perl
use strict; use warnings;
use Benchmark qw( cmpthese );

my @chars = qw(A C G T);

# 1,000 random 100-character sequences, copied once per candidate
my @to_split = my @to_substr = my @to_match = map {
    join '', map $chars[rand @chars], 1 .. 100
} 1 .. 1_000;

# Run each candidate for at least 1 CPU second
cmpthese -1, {
    'split'  => \&bench_split,
    'substr' => \&bench_substr,
    'match'  => \&bench_match,
};

sub bench_split {
    my %dist;
    for my $s ( @to_split ) {
        my @s = split //, $s;
        for my $i ( 0 .. $#s ) {
            ++ $dist{ $i }{ $s[$i] };
        }
    }
}

sub bench_substr {
    my %dist;
    for my $s ( @to_substr ) {
        my $u = length($s) - 1;
        for my $i (0 .. $u) {
            ++ $dist{ $i }{ substr($s, $i, 1) };
        }
    }
}

sub bench_match {
    my %dist;
    for my $s ( @to_match ) {
        while ( $s =~ /(.)/g ) {
            ++ $dist{ pos($s) }{ $1 };    # pos() is 1-based here (offset after the match)
        }
    }
}
Output:
         Rate  split  match substr
split  4.93/s     --   -31%   -65%
match  7.11/s    44%     --   -49%
substr 14.0/s   184%    97%     --