Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

When accessing individual characters in a string in Perl, is substr or splitting to an array faster?

I'm writing a Perl script in which I need to loop over each character of a string. There's a lot of strings, and each is 100 characters long (they're short DNA sequences, in case you're wondering).

So, is it faster to use substr to extract each character one at a time, or is it faster to split the string into an array and then iterate over the array?

While I'm waiting for an answer, I suppose I'll go read up on how to benchmark things in Perl.

like image 698
Ryan C. Thompson Avatar asked Oct 21 '10 00:10

Ryan C. Thompson


People also ask

What does Substr do in Perl?

substr() in Perl returns a substring out of the string passed to the function starting from a given index up to the length specified. This function by default returns the remaining part of the string starting from the given index if the length is not specified.

How do I split a string into a character in Perl?

If you need to split a string into characters, you can do this: @array = split(//); After this statement executes, @array will be an array of characters. split recognizes the empty pattern as a request to make every character into a separate array element.

How do I split a string into multiple strings in Perl?

A string is splitted based on delimiter specified by pattern. By default, it whitespace is assumed as delimiter. split syntax is: Split /pattern/, variableName.

How do you separate the characters in a string?

First, define a string. Next, create a for-loop where the loop variable will start from index 0 and end at the length of the given string. Print the character present at every index in order to separate each individual character. For better visualization, separate each individual character by space.


2 Answers

It really depends on exactly what you're doing with your data -- but hey, you're headed the right way with your last question! Don't guess, benchmark.

Perl provides the Benchmark module for exactly this kind of thing, and using it is really pretty straightforward. Here's a little sample code to get started with:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $dna;
$dna .= [qw(G A T C)]->[rand 4] for 1 .. 100;

sub frequency_substr {
  my $length = length $dna;
  my %hist;

  for my $pos (0 .. $length) {
    $hist{$pos}{substr $dna, $pos, 1} ++;
  }

  \%hist;
}

sub frequency_split {
  my %hist;
  my $pos = 0;
  for my $char (split //, $dna) {
    $hist{$pos ++}{$char} ++;
  }

  \%hist;
}

sub frequency_regmatch {
  my %hist;

  while ($dna =~ /(.)/g) {
    $hist{pos($dna)}{$1} ++;
  }

  \%hist;
}


cmpthese(-5, # Run each for at least 5 seconds
  { 
    substr => \&frequency_substr,
    split => \&frequency_split,
    regex => \&frequency_regmatch
  }
);

And a sample result:

         Rate  regex  split substr
regex  6254/s     --   -26%   -32%
split  8421/s    35%     --    -9%
substr 9240/s    48%    10%     --

Turns out substr is surprisingly fast. :)

like image 56
hobbs Avatar answered Nov 15 '22 15:11

hobbs


Here is what I would do instead of first trying to choose between substr and split:

#!/usr/bin/perl

use strict; use warnings;

my %dist;
while ( my $s = <> ) {
    while ( $s =~ /(.)/g ) {
        ++ $dist{ pos($s) }{ $1 };
    }
}

Update:

My curiosity got the best of me. Here is a benchmark:

#!/usr/bin/perl

use strict; use warnings;
use Benchmark qw( cmpthese );

my @chars = qw(A C G T);
my @to_split = my @to_substr = my @to_match = map {
    join '', map $chars[rand @chars], 1 .. 100
} 1 .. 1_000;

cmpthese -1, {
    'split'  => \&bench_split,
    'substr' => \&bench_substr,
    'match'  => \&bench_match,
};

sub bench_split {
    my %dist;
    for my $s ( @to_split ) {
        my @s = split //, $s;
        for my $i ( 0 .. $#s ) {
            ++ $dist{ $i }{ $s[$i] };
        }
    }
}

sub bench_substr {
    my %dist;
    for my $s ( @to_substr ) {
        my $u = length($s) - 1;
        for my $i (0 .. $u) {
            ++ $dist{ $i }{ substr($s, $i, 1) };
        }
    }
}

sub bench_match {
    my %dist;
    for my $s ( @to_match ) {
        while ( $s =~ /(.)/g ) {
            ++ $dist{ pos($s) }{ $1 };
        }
    }
}

Output:

         Rate  split  match substr
split  4.93/s     --   -31%   -65%
match  7.11/s    44%     --   -49%
substr 14.0/s   184%    97%     --
like image 32
Sinan Ünür Avatar answered Nov 15 '22 15:11

Sinan Ünür