I am using an environment as a hash table. The keys are words taken from regular text documents, and the values are single integers (an index into another structure).
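For reference, a minimal sketch of the usage pattern (the words and index values here are invented):
h <- new.env(hash = TRUE, parent = emptyenv())
h[["apple"]]  <- 1L    # word -> index into another structure
h[["banana"]] <- 2L
h[["apple"]]           # lookup: returns 1L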
As I load millions of elements, both updates and lookups slow to a crawl. Below is some code that shows the behavior.
From the start, the behavior looks closer to O(n) than O(1), even though the hash table is lightly populated and collisions are minimal: the table is created with size = I*N = 10,000,000, so even at 2.6 million entries the load factor is only about 0.26.
Any suggestions on why the update and lookup times keep increasing, and how to get better performance?
Thanks.
testhInt <- function(I = 100L, N = 100000L) {
  cat("create initial table: ", I * N,
      system.time(h <<- new.env(hash = TRUE, size = I * N, parent = emptyenv())), "\n")
  for (i in 1L:I) {
    # Time N insertions, keying each integer by its decimal string.
    cat("Updated", i * N, ": ",
        system.time(for (j in ((i - 1L) * N + 1):(i * N)) {
          h[[as.character(j)]] <- j
        }),
        "\n")
    p <- env.profile(h)
    cat(sprintf("Hash size: %i, nchains: %i, 1:200:%s\n",
                p$size, p$nchains, toString(p$counts[1:200])))
    # Time 1000 random lookups over everything inserted so far.
    cat("Lookup 1000 hash:",
        system.time(resultv <<- sapply(sample(1L:(i * N), 1000L),
                                       function(i) h[[as.character(i)]])),
        "\n")
  }
}
testhInt()
create initial table: 10000000 0.089 0.081 0.169
Updated 100000 : 2.352 0.045 2.392
Hash size: 10000000, nchains: 100000, 1:200:[text removed]
Lookup 1000 hash: 0.016 0.001 0.018
Updated 200000 : 3.587 0.057 3.622
Hash size: 10000000, nchains: 200000, 1:200:[text removed]
Lookup 1000 hash: 0.014 0.002 0.017
Updated 300000 : 4.649 0.064 4.695
Hash size: 10000000, nchains: 300000, 1:200:[text removed]
Lookup 1000 hash: 0.024 0.003 0.027
Updated 400000 : 5.76 0.076 5.8
Hash size: 10000000, nchains: 400000, 1:200:[text removed]
Lookup 1000 hash: 0.023 0.003 0.026
...
Updated 1200000 : 12.299 0.167 12.469
Hash size: 10000000, nchains: 1200000, 1:200:[text removed]
Lookup 1000 hash: 0.071 0.01 0.084
...
Updated 2600000 : 28.537 0.273 28.836
Hash size: 10000000, nchains: 2600000, 1:200:[text removed]
Lookup 1000 hash: 0.138 0.02 0.158
The environment updates and lookups are not the problem. What becomes slower as you load millions of elements is everything else inside your timed expressions: 1) generating the integer sequences, 2) converting them to character vectors, and 3) drawing the random samples.
If you move those operations outside of your timings, so that only the hash-table updates and lookups are measured, you'll see that the performance of the hash table does not degrade.
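To illustrate the principle in isolation, here is a rough sketch (the table size and timings are illustrative, not from the original post): hoisting the as.character() conversion out of the timed expression leaves only the environment lookups on the clock.
h <- new.env(hash = TRUE, size = 1000000L, parent = emptyenv())
for (j in 1:1000000) h[[as.character(j)]] <- j
idx <- sample(1000000L, 1000L)
# Conversion inside the timing: measures as.character() plus the lookups.
system.time(r1 <- sapply(idx, function(i) h[[as.character(i)]]))
# Conversion hoisted out: the timing now covers only the hash lookups.
keys <- as.character(idx)
system.time(r2 <- sapply(keys, function(k) h[[k]]))
The rewrite below applies the same idea to your benchmark.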
testhInt2 <- function(I = 100L, N = 100000L) {
  t1 <- system.time(h <<- new.env(hash = TRUE, size = I * N, parent = emptyenv()))
  cat("create initial table: ", I * N, t1, "\n")
  for (i in 1L:I) {
    # Generate the sequence and its character keys before starting the clock.
    jSeq <- ((i - 1L) * N + 1):(i * N)
    jName <- as.character(jSeq)
    t2 <- system.time(for (j in seq_along(jSeq)) h[[jName[j]]] <- jSeq[j])
    cat("Updated", i * N, ": ", t2, "\n")
    p <- env.profile(h)
    cat(sprintf("Hash size: %i, nchains: %i, 1:200: %s\n",
                p$size, p$nchains, "..."))
    # Draw and convert the sample keys before timing the lookups.
    v <- as.character(sample(1L:(i * N), 1000L))
    t3 <- system.time(resultv <<- sapply(v, function(k) h[[k]]))
    cat("Lookup 1000 hash:", t3, "\n")
  }
}
testhInt2()
create initial table: 10000000 0.012 0.028 0.04 0 0
Updated 100000 : 6.148 0.004 6.174 0 0
Hash size: 10000000, nchains: 1, 1:200: ...
Lookup 1000 hash: 0.084 0 0.086 0 0
Updated 200000 : 6.872 0 6.9 0 0
Hash size: 10000000, nchains: 2, 1:200: ...
Lookup 1000 hash: 0.088 0 0.089 0 0
...
Updated 2500000 : 5.528 0.012 5.557 0 0
Hash size: 10000000, nchains: 25, 1:200: ...
Lookup 1000 hash: 0.052 0 0.052 0 0
Updated 2600000 : 4.844 0 4.863 0 0
Hash size: 10000000, nchains: 26, 1:200: ...
Lookup 1000 hash: 0.052 0 0.052 0 0