How can I access the data in many large CSV files quickly from Perl?

I have a number of scripts that currently read in a lot of data from some .CSV files. For efficiency, I use the Text::CSV_XS module to read them in and then create a hash using one of the columns as an index. However, I have a lot of files and they are quite large. And each of the scripts needs to read in the data all over again.
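The read-and-index step looks roughly like this (a minimal sketch; the file name, the indexed column, and the parser options are placeholders, not the actual scripts):

use strict;
use warnings;
use Text::CSV_XS;

# Sketch only -- file name and indexed column are placeholders.
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

open my $fh, '<', 'data.csv' or die "data.csv: $!";

my %by_key;
while (my $row = $csv->getline($fh)) {
    $by_key{ $row->[0] } = $row;    # index the whole row by the chosen column
}
close $fh;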

The question is: How can I have persistent storage of these Perl hashes so that all of them can be read back in with a minimum of CPU?

Combining the scripts is not an option. I wish...

I applied the second rule of optimization and used profiling to find that the vast majority of the CPU time (about 90%) was spent in:

Text::CSV_XS::fields
Text::CSV_XS::Parse
Text::CSV_XS::parse

So, I made a test script that read in all the .CSV files (with Text::CSV_XS), dumped them using the Storable module, and then read them back in with Storable. I profiled this so I could see the CPU times:

$ c:/perl/bin/dprofpp.bat
Total Elapsed Time = 1809.397 Seconds
  User+System Time = 950.5560 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 25.6   243.6 243.66    126   1.9338 1.9338  Storable::pretrieve
 20.5   194.9 194.92 893448   0.0002 0.0002  Text::CSV_XS::fields
 9.49   90.19 90.198 893448   0.0001 0.0001  Text::CSV_XS::Parse
 7.48   71.07 71.072    126   0.5641 0.5641  Storable::pstore
 4.45   42.32 132.52 893448   0.0000 0.0001  Text::CSV_XS::parse
 (everything else was 0.07% or less and can be ignored)

So, loading back in with Storable costs about 25.6% of the CPU time, compared to about 35% for Text::CSV_XS. Not a lot of savings...
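
For reference, the dump/reload step amounted to roughly this, continuing the %by_key sketch above (the file name is a placeholder, not the actual test script):

use Storable qw(store retrieve);

store \%by_key, 'data.storable';           # write the parsed hash to disk once
my $cached = retrieve('data.storable');    # later runs reload it instead of re-parsing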

Has anybody got a suggestion on how I can read in these data more efficiently?

Thanks for your help.

asked Jul 24 '09 by Harold Bamford


2 Answers

The easiest way to put a very large hash on disk, IMHO, is with BerkeleyDB. It's fast, time-tested and rock-solid, and the CPAN module provides a tied API. That means you can continue using your hash as if it were an in-memory data structure, but it will automatically read and write through BerkeleyDB to disk.
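
A minimal sketch of the tied-hash setup (the file name is a placeholder; see the BerkeleyDB module's documentation for cache-size and other tuning flags):

use strict;
use warnings;
use BerkeleyDB;

# Placeholder file name; reads and writes on %by_key go through the on-disk DB.
tie my %by_key, 'BerkeleyDB::Hash',
    -Filename => 'data.bdb',
    -Flags    => DB_CREATE
    or die "Cannot open data.bdb: $BerkeleyDB::Error";

$by_key{'some_key'} = 'some value';
print $by_key{'some_key'}, "\n";

Note that a plain tied hash stores flat string values; keeping whole row arrayrefs on disk would need an extra serialization layer such as MLDBM or Storable's freeze/thaw.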

answered by friedo


Parse the data once and put it in an SQLite db. Query using DBI.
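
A sketch of that approach, with placeholder file, table and column names (requires DBD::SQLite):

use strict;
use warnings;
use DBI;
use Text::CSV_XS;

# Hypothetical one-time loader -- file, table and column names are placeholders.
my $dbh = DBI->connect('dbi:SQLite:dbname=data.sqlite', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do('CREATE TABLE IF NOT EXISTS rows (id TEXT PRIMARY KEY, col1 TEXT, col2 TEXT)');

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', 'data.csv' or die "data.csv: $!";

my $ins = $dbh->prepare('INSERT INTO rows (id, col1, col2) VALUES (?, ?, ?)');
while (my $row = $csv->getline($fh)) {
    $ins->execute(@{$row}[0 .. 2]);
}
close $fh;
$dbh->commit;

# Later scripts fetch only the rows they need instead of re-parsing every CSV:
my $hit = $dbh->selectrow_hashref('SELECT * FROM rows WHERE id = ?', undef, 'some_key');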

answered by Sinan Ünür