
Very fast lookups in Perl: is reloading a hash possible?

Tags:

hash

perl

I have about 100 million rows such as:

A : value of A
B : value of B
|
|
|
Z : value of Z    (up to 100 million unique entries)

Currently, each time I run my program I load the entire file into a hash, which takes some time. At run time I need to look up the value for a given key (A, B, and so on).

I am wondering whether I can build the hash once and store it as a binary data structure, or index the file instead. What is possible in Perl with the least programming?

Thanks! -Abhi

Abhi asked Dec 04 '22 05:12



1 Answer

I suggest an on-disk key/value database. Thanks to Perl's tie function, such a database can be used exactly like a normal in-memory hash. For a very large hash it can be faster than Perl's built-in hashes for reading and writing, and it is saved to and loaded from disk automatically.

BerkeleyDB is an old favourite:

use BerkeleyDB;
# Make %db an on-disk database stored in database.dbm. Create file if needed
tie my %db, 'BerkeleyDB::Hash', -Filename => "database.dbm", -Flags => DB_CREATE
    or die "Couldn't tie database: $BerkeleyDB::Error";

$db{foo} = 1;            # set value
print $db{foo}, "\n";    # get value
for my $key (keys %db) {
    print "$key -> $db{$key}\n";  # iterate over all key/value pairs
}

%db = ();  # wipe

Changes to the database are automatically saved to disk and will persist through multiple invocations of your script.
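
Applied to the data in the question, this could look like the sketch below: build the database once from the text file, then tie the existing file on later runs instead of reloading everything. It assumes each input line has the form "key : value" as shown in the question. The sub names build_db and lookup, and the file names in the usage comment, are hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;
use BerkeleyDB;

# One-time build: stream the text file into the on-disk hash.
# Assumes lines of the form "key : value" (format taken from the question).
sub build_db {
    my ($text_file, $db_file) = @_;
    tie my %db, 'BerkeleyDB::Hash', -Filename => $db_file, -Flags => DB_CREATE
        or die "Couldn't tie database: $BerkeleyDB::Error";
    open my $in, '<', $text_file or die "Couldn't open $text_file: $!";
    while (my $line = <$in>) {
        chomp $line;
        # Split on the first colon only, so values may contain colons
        my ($key, $value) = split /\s*:\s*/, $line, 2;
        next unless defined $value;    # skip lines without a separator
        $db{$key} = $value;
    }
    close $in;
    untie %db;    # close the database so everything is written out
}

# Later runs: tie the existing file read-only and fetch a single key.
# Returns undef if the key is not present.
sub lookup {
    my ($db_file, $key) = @_;
    tie my %db, 'BerkeleyDB::Hash', -Filename => $db_file, -Flags => DB_RDONLY
        or die "Couldn't tie database: $BerkeleyDB::Error";
    my $value = $db{$key};
    untie %db;
    return $value;
}

# Hypothetical usage: perl lookup.pl data.txt lookup.db A
if (@ARGV == 3) {
    my ($text_file, $db_file, $key) = @ARGV;
    build_db($text_file, $db_file) unless -e $db_file;
    my $value = lookup($db_file, $key);
    print defined $value ? "$value\n" : "not found\n";
}
```

In a long-running program you would normally tie the database once and keep %db around for all lookups, rather than re-tying on every call as lookup does here for brevity.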

Check the BerkeleyDB perldoc for the full list of options, but the most important are:

# Increase memory allocation for database (increases performance), e.g. 640 MB
tie my %db, 'BerkeleyDB::Hash', -Filename => $filename, -Cachesize => 640*1024*1024;

# Open database in readonly mode
tie my %db, 'BerkeleyDB::Hash', -Filename => $filename, -Flags => DB_RDONLY;

Tokyo Cabinet is a more complex but much faster database library, and there are of course many other options (this is Perl, after all).

rjh answered May 31 '23 15:05
