
R environment / hash table slows down as it grows to millions

Tags: r, hash

I am using an environment as a hash table. The keys are words coming from regular text documents; each value is a single integer (an index into some other structure).
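For illustration, the access pattern is just this (a minimal sketch with made-up keys):

h <- new.env(hash = TRUE, parent = emptyenv())
h[["apple"]]  <- 1L   # word -> index into the other structure
h[["banana"]] <- 2L
h[["apple"]]          # lookup, returns 1L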

As I load millions of elements, both the updates and lookups slow to a crawl. Below is some code to show the behavior.

It looks like the behavior is closer to O(n) than O(1) right from the start, even though the hash table is lightly populated and collisions are minimal.

Any suggestions on why the update and lookup times keep increasing, and how to get better performance?

Thanks.

testhInt <- function(I = 100L, N = 100000L) {
    # Pre-size the hash table to the final number of elements
    cat("create initial table: ", I * N,
        system.time(h <<- new.env(hash = TRUE, size = I * N, parent = emptyenv())), "\n")
    for (i in 1L:I) {
        # Insert the next batch of N key/value pairs, timing the whole batch
        cat("Updated", i * N, ": ",
            system.time(for (j in ((i - 1L) * N + 1L):(i * N)) {
                h[[as.character(j)]] <- j
            }),
            "\n")
        # Report table occupancy: bucket count, non-empty chains, chain lengths
        p <- env.profile(h)
        cat(sprintf("Hash size: %i, nchains: %i, 1:200:%s\n",
                    p$size, p$nchains, toString(p$counts[1:200])))
        # Time 1000 random lookups over everything inserted so far
        cat("Lookup 1000  hash:",
            system.time(resultv <<- sapply(sample(1L:(i * N), 1000L),
                                           function(i) h[[as.character(i)]])),
            "\n")
    }
}
testhInt()
create initial table:  10000000 0.089 0.081 0.169 
Updated 100000 :  2.352 0.045 2.392 
Hash size: 10000000, nchains: 100000, 1:200:[text removed]
Lookup 1000  hash: 0.016 0.001 0.018 
Updated 200000 :  3.587 0.057 3.622 
Hash size: 10000000, nchains: 200000, 1:200:[text removed]
Lookup 1000  hash: 0.014 0.002 0.017 
Updated 300000 :  4.649 0.064 4.695 
Hash size: 10000000, nchains: 300000, 1:200:[text removed]
Lookup 1000  hash: 0.024 0.003 0.027 
Updated 400000 :  5.76 0.076 5.8 
Hash size: 10000000, nchains: 400000, 1:200:[text removed]
Lookup 1000  hash: 0.023 0.003 0.026 
...
Updated 1200000 :  12.299 0.167 12.469 
Hash size: 10000000, nchains: 1200000, 1:200:[text removed]
Lookup 1000  hash: 0.071 0.01 0.084 
...
Updated 2600000 :  28.537 0.273 28.836 
Hash size: 10000000, nchains: 2600000, 1:200:[text removed]
Lookup 1000  hash: 0.138 0.02 0.158 
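For reference, the profile lines come from utils::env.profile: size is the number of hash buckets, nchains the number of non-empty buckets, and counts the per-bucket chain lengths. Chain statistics confirming that collisions are minimal can be checked directly, e.g.:

p <- env.profile(h)
sum(p$counts > 0)              # non-empty buckets; should equal p$nchains
max(p$counts)                  # worst-case chain length
mean(p$counts[p$counts > 0])   # average chain length among used buckets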
asked Mar 18 '14 by tjean

1 Answer

The environment updates and lookups are not the problem. What becomes slower as you load millions of elements is the work you are timing alongside them: 1) generating the sequences, 2) converting them to character, and 3) drawing the random samples.

If you move those operations outside of your timings, so that only the hash table updates and lookups are measured, you'll see that the hash table's performance does not degrade.

testhInt2 <- function(I = 100L, N = 100000L) {
    t1 <- system.time(h <<- new.env(hash = TRUE, size = I * N, parent = emptyenv()))
    cat("create initial table: ", I * N, t1, "\n")

    for (i in 1L:I) {
        # Generate the keys and values outside the timed region
        jSeq <- ((i - 1L) * N + 1L):(i * N)
        jName <- as.character(jSeq)
        # Time only the environment updates
        t2 <- system.time(for (j in seq_along(jSeq)) h[[jName[j]]] <- jSeq[j])
        cat("Updated", i * N, ": ", t2, "\n")
        p <- env.profile(h)
        cat(sprintf("Hash size: %i, nchains: %i, 1:200: %s\n",
                    p$size, p$nchains, "..."))
        # Draw and convert the sample keys before timing the lookups
        v <- as.character(sample(1L:(i * N), 1000L))
        t3 <- system.time(resultv <<- sapply(v, function(k) h[[k]]))
        cat("Lookup 1000  hash:", t3, "\n")
    }
}
testhInt2()
create initial table:  10000000 0.012 0.028 0.04 0 0 
Updated 100000 :  6.148 0.004 6.174 0 0 
Hash size: 10000000, nchains: 1, 1:200: ...
Lookup 1000  hash: 0.084 0 0.086 0 0 
Updated 200000 :  6.872 0 6.9 0 0 
Hash size: 10000000, nchains: 2, 1:200: ...
Lookup 1000  hash: 0.088 0 0.089 0 0 
...
Updated 2500000 :  5.528 0.012 5.557 0 0 
Hash size: 10000000, nchains: 25, 1:200: ...
Lookup 1000  hash: 0.052 0 0.052 0 0 
Updated 2600000 :  4.844 0 4.863 0 0 
Hash size: 10000000, nchains: 26, 1:200: ...
Lookup 1000  hash: 0.052 0 0.052 0 0
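Two quick sketches (my additions, not part of the original test) make the point concrete. The first times just the sample() call that the question kept inside its lookup timing; on R versions current at the time, 1:(i*N) was fully materialized, so this alone grows with i, independent of any hash table. The second uses base R's mget to fetch a whole batch of keys in one call, keeping the per-element sapply overhead out of the measurement (it assumes h is populated as above):

# cost of drawing the sample, no hash table involved
for (i in c(1L, 10L, 50L, 100L)) {
    cat("i =", i, ":",
        system.time(sample(1L:(i * 100000L), 1000L))[["elapsed"]], "\n")
}

# batch lookup of 1000 keys in a single call
keys <- as.character(sample(1L:2600000L, 1000L))
system.time(res <- mget(keys, envir = h, ifnotfound = list(NA)))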
answered Nov 06 '22 by Joshua Ulrich