I am writing some Perl scripts to manipulate large amounts of data (about 42 million rows in total, though not all in one hit) in two PostgreSQL databases.
For some of my queries it makes good sense to use fetchall_hashref because I have synthetic keys. However, in other instances, I'm going to have to use an array of three columns as the unique key.
This has got me wondering about performance differences between fetchall_arrayref and fetchall_hashref. I know that in both cases everything goes into memory, so selecting several GB of data probably isn't a good idea, but other than that there appears to be very little guidance in the documentation when it comes to performance.
My googling has been unsuccessful, so if anyone can point me in the direction of some general performance studies I'd be grateful.
(I know I could benchmark this myself, but unfortunately for dev purposes I don't have access to a machine with hardware identical to production, which is why I'm looking for general guidelines or even best practices.)
Most of the choices between fetch methods depend on what format you want the data to end up in and how much of the work for that you want DBI to do for you.
My recollection is that iterating with fetchrow_arrayref and using bind_columns is the fastest (least DBI overhead) way to read through returned data.
The first question is whether you really need to use a fetchall in the first place. If you don't need all 42 million rows in memory at once, then don't read them all in at once! bind_columns and fetchrow_arrayref are generally the way to go whenever possible, as ysth already pointed out.
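A minimal sketch of that row-at-a-time pattern follows. To keep it runnable without a database, it assumes DBD::Sponge (a small in-memory driver bundled with DBI) in place of the real PostgreSQL connection; the table, column names, and rows are made up for illustration.

```perl
use strict;
use warnings;
use DBI;

# Assumption: DBD::Sponge stands in for the real PostgreSQL handle so the
# sketch runs anywhere DBI is installed. Column names and rows are invented.
my $dbh = DBI->connect('dbi:Sponge:', '', '', { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT id, name FROM demo', {
    NAME => [qw(id name)],
    rows => [ [ 1, 'alpha' ], [ 2, 'beta' ], [ 3, 'gamma' ] ],
});
# With a real driver you would call $sth->execute before fetching.

# Bind one Perl variable per column; each fetch overwrites them in place.
$sth->bind_columns(\my ($id, $name));

my ($count, $last_name) = (0, undef);
while ($sth->fetch) {        # fetch is an alias for fetchrow_arrayref
    $count++;
    $last_name = $name;      # process the row here; only one row is held at a time
}
```

Because the bound variables are reused on every iteration, memory use stays flat no matter how many rows the query returns.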
Assuming that fetchall really is needed, my gut feeling is that fetchall_arrayref will be marginally faster, since an array is a simpler data structure and doesn't need to compute hashes of the inserted keys, but the savings in time would be dwarfed by database read times, so it's unlikely to be significant.
Memory requirements are another matter entirely, though. The structure returned by fetchall_hashref is a hash of id => row, with each row being represented as a hash of field name => field value. If you get 42 million rows, that means your list of field names is repeated in each of 42 million sets of hash keys... That's going to require a good deal more memory to store than the array of arrays (one array per row) returned by fetchall_arrayref. (Unless DBI is doing some magic with tie to optimize the fetchall_hashref structure, I suppose.)
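To make the two shapes concrete for the three-column key case from the question, here is a rough sketch. It again assumes DBD::Sponge (bundled with DBI) as a stand-in for PostgreSQL, and the column names, rows, and the demo_sth helper are hypothetical. fetchall_hashref accepts an array reference of key columns and returns one level of nested hash per key; the last part shows an alternative that keeps the compact array rows and indexes them by a joined key.

```perl
use strict;
use warnings;
use DBI;

# Assumption: DBD::Sponge replaces the real database; all names and data are invented.
my $dbh   = DBI->connect('dbi:Sponge:', '', '', { RaiseError => 1 });
my @names = qw(region year code total);
my @data  = ( [ 'eu', 2020, 'a', 10 ], [ 'eu', 2021, 'a', 12 ], [ 'us', 2020, 'b', 7 ] );

# Hypothetical helper: build a fresh statement handle over a copy of the rows,
# since the Sponge driver consumes the row list as it fetches.
sub demo_sth {
    return $dbh->prepare('SELECT region, year, code, total FROM demo', {
        NAME => \@names,
        rows => [ map { [@$_] } @data ],
    });
}

# Array of arrays: one array ref per row, values addressed by position only.
my $aoa = demo_sth()->fetchall_arrayref;

# Nested hashes, one level per key column; every leaf row is a full hash.
my $hoh = demo_sth()->fetchall_hashref([qw(region year code)]);

# Alternative: keep the compact rows and index them by a joined composite key.
my %by_key = map { join("\0", @{$_}[0 .. 2]) => $_ } @$aoa;
```

With this data, `$hoh->{eu}{2021}{a}{total}` and `$by_key{join("\0", 'eu', 2021, 'a')}[3]` refer to the same value; the second form avoids building a per-row hash, at the cost of positional column access.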