Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C-like arrays in perl

Tags:

arrays

perl

pdl

I want to create and manipulate large arrays of (4 byte) integers in-memory. By large, I mean on the order of hundreds of million. Each cell in the array will act as a counter for a position on a chromosome. All I need is for it to fit in memory, and have fast (O(1)) access to elements. The thing I'm counting is not a sparse feature, so I can't use a sparse array.

I can't do this with a regular perl list, because perl (at least on my machine) uses 64 bytes per element, so the genomes of most of the organisms I work with are just too big. I've tried storing the data on-disk via SQLite and hash tying, and though they work, are very slow, especially on ordinary drives. (It works reasonably ok when I run on 4-drive raid 0's).

I thought I could use PDL arrays, b/c PDL stores its arrays just as C does, using only 4 bytes per element. However, I found that update speed to be excruciatingly slow compared to perl lists:

use PDL;
use Benchmark qw/cmpthese/;

my $N = 1_000_000;
my @perl = (0 .. $N - 1);
my $pdl = zeroes $N;

cmpthese(-1,{ 
    perl => sub{
        $perl[int(rand($N))]++;
    },
    pdl => sub{
        # note that I'm not even incrementing here just setting to 1
        $pdl->set(int(rand($N)), 1);
    }
});

Returns:

          Rate  pdl perl
pdl   481208/s   -- -87%
perl 3640889/s 657%   --    

Does anyone know how to increase pdl set() performance, or know of a different module that can accomplish this?

like image 641
user1481 Avatar asked Mar 16 '12 01:03

user1481


6 Answers

I cannot tell what sort of performance you will get, but I recommend using the vec function, documented here, to split a string into bit fields. I have experimented and found that my Perl will tolerate a string up to 500_000_000 characters long. which corresponds to 125,000,000 32-bit values.

my $data = "\0" x 500_000_000;
vec($data, 0, 32)++;            # Increment data[0]
vec($data, 100_000_000, 32)++;  # Increment data[100_000_000]

If this isn't enough there may be something in the build of Perl that controls the limit. Alternatively if you think you can get smaller fields - say 16-bit counts - vec will accept field widths of any power of 2 up to 32.

Edit: I believe the string size limit is related to the 2GB maximum private working set on 32-bit Windows processes. If you are running Linux or have a 64-bit perl you may be luckier than me.


I have added to your benchmark program like this

my $vec = "\0" x ($N * 4);

cmpthese(-3,{ 
    perl => sub{
        $perl[int(rand($N))]++;
    },
    pdl => sub{
        # note that I'm not even incrementing here just setting to 1
        $pdl->set(int(rand($N)), 1);
    },
    vec => sub {
        vec($vec, int(rand($N)), 32)++; 
    },
});

giving these results

          Rate  pdl  vec perl
pdl   472429/s   -- -76% -85%
vec  1993101/s 322%   -- -37%
perl 3157570/s 568%  58%   --

so using vec is two-thirds the speed of a native array. Presumably that's acceptable.

like image 154
Borodin Avatar answered Nov 18 '22 07:11

Borodin


The PDL command you want is indadd. (Thanks to Chris Marshall, PDL Pumpking, for pointing that out to me elsewhere.)

PDL is designed for what I call "vectorized" operations. Compared with C operations, Perl operations are quite slow, so you want to keep the number of PDL method invocations to a minimum, and have each invocation do a lot of work. For example, this benchmark lets you specify the number of updates to perform in a single go (as a command-line parameter). The perl side has to loop, but the PDL side only performs five or so function calls:

use PDL;
use Benchmark qw/cmpthese/;

my $updates_per_round = shift || 1;

my $N = 1_000_000;
my @perl = (0 .. $N - 1);
my $pdl = zeroes $N;

cmpthese(-1,{ 
    perl => sub{
        $perl[int(rand($N))]++ for (1..$updates_per_round);
    },
    pdl => sub{
        my $to_update = long(random($updates_per_round) * $N);
        indadd(1,$to_update,$pdl);
    }
});

When I run this with an argument of 1, I get even worse performance than when using set, which is what I expected:

$ perl script.pl 1
          Rate   pdl  perl
pdl    21354/s    --  -98%
perl 1061925/s 4873%    --

That's a lot of ground to make up! But hold in there. If we do 100 iterations per round, we get an improvement:

$ perl script.pl 100
        Rate  pdl perl
pdl  16906/s   -- -18%
perl 20577/s  22%   --

And with 10,000 updates per round, PDL outperforms Perl by a factor of four:

$ perl script.pl 10000
      Rate perl  pdl
perl 221/s   -- -75%
pdl  881/s 298%   --

PDL continues to perform roughly 4x faster than plain Perl for even larger values.

Note that PDL's performance can degrade for more complex operations. This is because PDL will allocate and tear down large but temporary workspaces for intermediate operations. In that case, you may want to consider using Inline::Pdlpp. However, that's not a beginner's tool, so don't jump there until you've determined it's really the best thing for you.

Another alternative to all of this is to use Inline::C like so:

use PDL;
use Benchmark qw/cmpthese/;

my $updates_per_round = shift || 1;

my $N = 1_000_000;
my @perl = (0 .. $N - 1);
my $pdl = zeroes $N;
my $inline = pack "d*", @perl;
my $max_PDL_per_round = 5_000;

use Inline 'C';

cmpthese(-1,{ 
    perl => sub{
        $perl[int(rand($N))]++ for (1..$updates_per_round);
    },
    pdl => sub{
        my $to_update = long(random($updates_per_round) * $N);
        indadd(1,$to_update,$pdl);
    },
    inline => sub{
        do_inline($inline, $updates_per_round, $N);
    },
});


__END__

__C__

void do_inline(char * packed_data, int N_updates, int N_data) {
    double * actual_data = (double *) packed_data;
    int i;
    for (i = 0; i < N_updates; i++) {
        int index = rand() % N_data;
        actual_data[index]++;
    }
}

For me, the Inline function consistently beats both Perl and PDL. For largish values of $updates_per_round, say 1,000, I get Inline::C's version roughly 5x faster than pure Perl and between 1.2x and 2x faster than PDL. Even when $updates_per_round is just 1, where Perl beats PDL handily, the Inline code is 2.5x faster than the Perl code.

If this is all you need to accomplish, I recommend using Inline::C.

But if you need to perform many manipulations to your data, you're best sticking with PDL for its power, flexibility, and performance. See below for how you can use vec() with PDL data.

like image 43
David Mertens Avatar answered Nov 18 '22 07:11

David Mertens


PDL::set() and PDL::get() are intended more as a learning aid than anything else. They constitute the pessimal way to access PDL variables. You would be far better off using some of the built-in bulk access routines. The PDL constructor itself accepts Perl lists:

$pdl = pdl(@list)

and is reasonably fast. You can also load your data directly from an ASCII file using PDL::rcols, or from a binary file using one of many IO routines. If you have the data as a packed string in machine order, you can even gain access to the PDL memory directly:

$pdl = PDL->new_from_specification(long,$elements);
$dr = $pdl->get_dataref;
$$dr = get_my_packed_string();
$pdl->upd_data;

Also note that you can "have your cake and eat it too" by using PDL objects to hold your arrays of integers, PDL computations (such as indadd) for large-scale manipulation of the data, but also use vec() directly on the PDL data as a string which you can get via the get_dataref method:

vec($$dr,int(rand($N)),32);

You'll will need to bswap4 the data if you are on a little-endian system:

$pdl->bswap4;
$dr = $pdl->get_dataref;
vec($$dr,int(rand($N)),32)++;
$pdl->upd_data;
$pdl->bswap4;

Et, voila!

like image 30
Craig DeForest Avatar answered Nov 18 '22 09:11

Craig DeForest


PDL wins when the operations can be threaded, apparently its not optimized for random access and assignment. Perhaps someone with more PDL knowledge could help.

like image 2
Joel Berger Avatar answered Nov 18 '22 08:11

Joel Berger


Packed::Array on CPAN might help.

Packed::Array provides a packed signed integer array class. Arrays built using Packed::Array can only hold signed integers that match your platform-native integers, but take only as much memory as is actually needed to hold those integers. So, for 32-bit systems, rather than taking about 20 bytes per array entry, they take only 4.

like image 2
Ilion Avatar answered Nov 18 '22 09:11

Ilion


since were using Integers, which should be ok for use with chromosomes try this

use PDL;
use Benchmark qw/cmpthese/;

my $N =  1_000_000;
my @perl;
@perl = (0 .. $N - 1);
my $pdl;
$pdl = (zeroes($N));

cmpthese(-1,{ 
perl => sub{
    $perl[int(rand($N))]++;
},
pdl2 => sub{
    # note that I'm not even incrementing here just setting to 1
    $pdl->set(int(rand($N)), 1);
    $pdl2 = pack "w*", $pdl;
}
});

and the out put I got from this was ...

           Rate  pdl2  perl
pdl2   46993/s    --  -97%
perl 1641607/s 3393%    --

which shows a big performance difference from when I first tried this code with out adding in my 2 cents I got

          Rate  pdl perl
pdl   201972/s   -- -86%
perl 1472123/s 629%   -- 
like image 2
Mark R Baker Avatar answered Nov 18 '22 07:11

Mark R Baker