Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can I use a list as a hash in R? If so, why is it so slow?

Tags:

r

hash

perl

Before using R, I used quite a bit of Perl. In Perl, I would often use hashes, and lookups of hashes are generally regarded as fast in Perl.

For example, the following code will populate a hash with up to 10000 key/value pairs, where the keys are random letters and the values are random integers. Then, it does 10000 random lookups in that hash.

#!/usr/bin/perl -w use strict;  my @letters = ('a'..'z');  print @letters . "\n"; my %testHash;  for(my $i = 0; $i < 10000; $i++) {     my $r1 = int(rand(26));     my $r2 = int(rand(26));     my $r3 = int(rand(26));     my $key = $letters[$r1] . $letters[$r2] . $letters[$r3];     my $value = int(rand(1000));     $testHash{$key} = $value; }  my @keyArray = keys(%testHash); my $keyLen = scalar @keyArray;  for(my $j = 0; $j < 10000; $j++) {     my $key = $keyArray[int(rand($keyLen))];     my $lookupValue = $testHash{$key};     print "key " .  $key . " Lookup $lookupValue \n"; } 

Now that increasingly, I am wanting to have a hash-like data structure in R. The following is the equivalent R code:

testHash <- list()  for(i in 1:10000) {   key.tmp = paste(letters[floor(26*runif(3))], sep="")   key <- capture.output(cat(key.tmp, sep=""))   value <- floor(1000*runif(1))   testHash[[key]] <- value }  keyArray <- attributes(testHash)$names keyLen = length(keyArray);  for(j in 1:10000) {   key <- keyArray[floor(keyLen*runif(1))]   lookupValue = testHash[[key]]   print(paste("key", key, "Lookup", lookupValue)) } 

The code seem to be doing equivalent things. However, the Perl one is much faster:

>time ./perlHashTest.pl real    0m4.346s user    **0m0.110s** sys 0m0.100s 

Comparing to R:

time R CMD BATCH RHashTest.R  real    0m8.210s user    **0m7.630s** sys 0m0.200s 

What explains the discrepancy? Are lookups in R lists just not good?

Increasing to 100,000 list length and 100,000 lookups only exaggerates the discrepancy? Is there a better alternative for the hash data structure in R than the native list()?

like image 596
stevejb Avatar asked Aug 12 '10 17:08

stevejb


People also ask

When should we not use hashes?

There are some operations which are not efficiently supported by hash tables, such as iterating over all the elements whose keys are within a certain range, finding the element with the largest key or smallest key, and so on.

What is a hash in R?

HASH VALUES can be any R value, vector or object. PASS-BY REFERENCE. Environments and hashes are special objects in R because only one copy exists globally. When provide as an argument to a function, no local copy is made and any changes to the hash in the functions are reflected globally.

Does R have hash maps?

R Language Hashmaps Environments as hash maps Note: in the subsequent passages, the terms hash map and hash table are used interchangeably and refer to the same concept, namely, a data structure providing efficient key lookup through use of an internal hash function.

What should hash function return?

The hash() function returns a 128-bit, 160-bit, or 256-bit hash of the input data, depending on the algorithm that you select. This function provides 2128- 2256 distinct return values and is intended for cryptographic purposes. The function almost never produces the same output for two different inputs.


2 Answers

The underlying reason is that R lists with named elements are not hashed. Hash lookups are O(1), because during insert the key is converted to an integer using a hash function, and then the value put in the space hash(key) % num_spots of an array num_spots long (this is a big simplification and avoids the complexity of dealing with collisions). Lookups of the key just require hashing the key to find the value's position (which is O(1), versus a O(n) array lookup). R lists use name lookups which are O(n).

As Dirk says, use the hash package. A huge limitation with this is that it uses environments (which are hashed) and overriding of [ methods to mimic hash tables. But an environment cannot contain another environment, so you cannot have nested hashes with the hash function.

A while back I worked on implementing a pure hash table data structure in C/R that could be nested, but it went on my project back burner while I worked on other things. It would be nice to have though :-)

like image 193
Vince Avatar answered Oct 05 '22 12:10

Vince


You could try environments and/or the hash package by Christopher Brown (which happens to use environments under the hood).

like image 44
Dirk Eddelbuettel Avatar answered Oct 05 '22 12:10

Dirk Eddelbuettel