I want to create and manipulate large arrays of (4 byte) integers in-memory. By large, I mean on the order of hundreds of million. Each cell in the array will act as a counter for a position on a chromosome. All I need is for it to fit in memory, and have fast (O(1)) access to elements. The thing I'm counting is not a sparse feature, so I can't use a sparse array.
I can't do this with a regular perl list, because perl (at least on my machine) uses 64 bytes per element, so the genomes of most of the organisms I work with are just too big. I've tried storing the data on-disk via SQLite and hash tying, and though they work, are very slow, especially on ordinary drives. (It works reasonably ok when I run on 4-drive raid 0's).
I thought I could use PDL arrays, b/c PDL stores its arrays just as C does, using only 4 bytes per element. However, I found that update speed to be excruciatingly slow compared to perl lists:
use PDL;
use Benchmark qw/cmpthese/;
my $N = 1_000_000;
my @perl = (0 .. $N - 1);
my $pdl = zeroes $N;
cmpthese(-1,{
perl => sub{
$perl[int(rand($N))]++;
},
pdl => sub{
# note that I'm not even incrementing here just setting to 1
$pdl->set(int(rand($N)), 1);
}
});
Returns:
Rate pdl perl
pdl 481208/s -- -87%
perl 3640889/s 657% --
Does anyone know how to increase pdl set() performance, or know of a different module that can accomplish this?
I cannot tell what sort of performance you will get, but I recommend using the vec
function, documented here, to split a string into bit fields. I have experimented and found that my Perl will tolerate a string up to 500_000_000
characters long. which corresponds to 125,000,000 32-bit values.
my $data = "\0" x 500_000_000;
vec($data, 0, 32)++; # Increment data[0]
vec($data, 100_000_000, 32)++; # Increment data[100_000_000]
If this isn't enough there may be something in the build of Perl that controls the limit. Alternatively if you think you can get smaller fields - say 16-bit counts - vec
will accept field widths of any power of 2 up to 32.
Edit: I believe the string size limit is related to the 2GB maximum private working set on 32-bit Windows processes. If you are running Linux or have a 64-bit perl you may be luckier than me.
I have added to your benchmark program like this
my $vec = "\0" x ($N * 4);
cmpthese(-3,{
perl => sub{
$perl[int(rand($N))]++;
},
pdl => sub{
# note that I'm not even incrementing here just setting to 1
$pdl->set(int(rand($N)), 1);
},
vec => sub {
vec($vec, int(rand($N)), 32)++;
},
});
giving these results
Rate pdl vec perl
pdl 472429/s -- -76% -85%
vec 1993101/s 322% -- -37%
perl 3157570/s 568% 58% --
so using vec
is two-thirds the speed of a native array. Presumably that's acceptable.
The PDL command you want is indadd
. (Thanks to Chris Marshall, PDL Pumpking, for pointing that out to me elsewhere.)
PDL is designed for what I call "vectorized" operations. Compared with C operations, Perl operations are quite slow, so you want to keep the number of PDL method invocations to a minimum, and have each invocation do a lot of work. For example, this benchmark lets you specify the number of updates to perform in a single go (as a command-line parameter). The perl side has to loop, but the PDL side only performs five or so function calls:
use PDL;
use Benchmark qw/cmpthese/;
my $updates_per_round = shift || 1;
my $N = 1_000_000;
my @perl = (0 .. $N - 1);
my $pdl = zeroes $N;
cmpthese(-1,{
perl => sub{
$perl[int(rand($N))]++ for (1..$updates_per_round);
},
pdl => sub{
my $to_update = long(random($updates_per_round) * $N);
indadd(1,$to_update,$pdl);
}
});
When I run this with an argument of 1, I get even worse performance than when using set
, which is what I expected:
$ perl script.pl 1
Rate pdl perl
pdl 21354/s -- -98%
perl 1061925/s 4873% --
That's a lot of ground to make up! But hold in there. If we do 100 iterations per round, we get an improvement:
$ perl script.pl 100
Rate pdl perl
pdl 16906/s -- -18%
perl 20577/s 22% --
And with 10,000 updates per round, PDL outperforms Perl by a factor of four:
$ perl script.pl 10000
Rate perl pdl
perl 221/s -- -75%
pdl 881/s 298% --
PDL continues to perform roughly 4x faster than plain Perl for even larger values.
Note that PDL's performance can degrade for more complex operations. This is because PDL will allocate and tear down large but temporary workspaces for intermediate operations. In that case, you may want to consider using Inline::Pdlpp
. However, that's not a beginner's tool, so don't jump there until you've determined it's really the best thing for you.
Another alternative to all of this is to use Inline::C
like so:
use PDL;
use Benchmark qw/cmpthese/;
my $updates_per_round = shift || 1;
my $N = 1_000_000;
my @perl = (0 .. $N - 1);
my $pdl = zeroes $N;
my $inline = pack "d*", @perl;
my $max_PDL_per_round = 5_000;
use Inline 'C';
cmpthese(-1,{
perl => sub{
$perl[int(rand($N))]++ for (1..$updates_per_round);
},
pdl => sub{
my $to_update = long(random($updates_per_round) * $N);
indadd(1,$to_update,$pdl);
},
inline => sub{
do_inline($inline, $updates_per_round, $N);
},
});
__END__
__C__
void do_inline(char * packed_data, int N_updates, int N_data) {
double * actual_data = (double *) packed_data;
int i;
for (i = 0; i < N_updates; i++) {
int index = rand() % N_data;
actual_data[index]++;
}
}
For me, the Inline function consistently beats both Perl and PDL. For largish values of $updates_per_round
, say 1,000, I get Inline::C
's version roughly 5x faster than pure Perl and between 1.2x and 2x faster than PDL. Even when $updates_per_round
is just 1, where Perl beats PDL handily, the Inline code is 2.5x faster than the Perl code.
If this is all you need to accomplish, I recommend using Inline::C
.
But if you need to perform many manipulations to your data, you're best sticking with PDL for its power, flexibility, and performance. See below for how you can use vec()
with PDL data.
PDL::set()
and PDL::get()
are intended more as a learning aid than anything else. They constitute the pessimal way to access PDL variables. You would be far better off using some of the built-in bulk access routines. The PDL constructor itself accepts Perl lists:
$pdl = pdl(@list)
and is reasonably fast. You can also load your data directly from an ASCII file using PDL::rcols
, or from a binary file using one of many IO routines. If you have the data as a packed string in machine order, you can even gain access to the PDL memory directly:
$pdl = PDL->new_from_specification(long,$elements);
$dr = $pdl->get_dataref;
$$dr = get_my_packed_string();
$pdl->upd_data;
Also note that you can "have your cake and eat it too" by using PDL objects to hold your arrays of integers, PDL computations (such as indadd
) for large-scale manipulation of the data, but also use vec()
directly on the PDL data as a string which you can get via the get_dataref
method:
vec($$dr,int(rand($N)),32);
You'll will need to bswap4
the data if you are on a little-endian system:
$pdl->bswap4;
$dr = $pdl->get_dataref;
vec($$dr,int(rand($N)),32)++;
$pdl->upd_data;
$pdl->bswap4;
Et, voila!
PDL wins when the operations can be threaded, apparently its not optimized for random access and assignment. Perhaps someone with more PDL knowledge could help.
Packed::Array on CPAN might help.
Packed::Array provides a packed signed integer array class. Arrays built using Packed::Array can only hold signed integers that match your platform-native integers, but take only as much memory as is actually needed to hold those integers. So, for 32-bit systems, rather than taking about 20 bytes per array entry, they take only 4.
since were using Integers, which should be ok for use with chromosomes try this
use PDL;
use Benchmark qw/cmpthese/;
my $N = 1_000_000;
my @perl;
@perl = (0 .. $N - 1);
my $pdl;
$pdl = (zeroes($N));
cmpthese(-1,{
perl => sub{
$perl[int(rand($N))]++;
},
pdl2 => sub{
# note that I'm not even incrementing here just setting to 1
$pdl->set(int(rand($N)), 1);
$pdl2 = pack "w*", $pdl;
}
});
and the out put I got from this was ...
Rate pdl2 perl
pdl2 46993/s -- -97%
perl 1641607/s 3393% --
which shows a big performance difference from when I first tried this code with out adding in my 2 cents I got
Rate pdl perl
pdl 201972/s -- -86%
perl 1472123/s 629% --
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With