I am using h2o for some modelling and, having tuned the model, I would now like to use it to make a large number of predictions: roughly 6 billion rows, each needing 80 columns of data.
I have already broken the input dataset down into about 500 chunks of 12 million rows each, with the relevant 80 columns.
However, uploading a data.table of 12 million rows by 80 columns to h2o takes quite a long time, and doing it 500 times is prohibitively slow. I think this is because the object is parsed before it is uploaded.
The prediction part is relatively quick in comparison.
Are there any suggestions for speeding this part up? Would changing the number of cores help?
Below is a reproducible example of the issue.
# Load libraries
library(h2o)
library(data.table)
# start up h2o using all cores...
localH2O = h2o.init(nthreads=-1,max_mem_size="16g")
# create a test input dataset
temp <- CJ(v1=seq(20),
v2=seq(7),
v3=seq(24),
v4=seq(60),
v5=seq(60))
temp <- do.call(cbind,lapply(seq(16),function(y){temp}))
colnames(temp) <- paste0('v',seq(80))
# this is the part that takes a long time!!
system.time(tmp.obj <- as.h2o(localH2O,temp,key='test_input'))
#|======================================================================| 100%
# user system elapsed
#357.355 6.751 391.048
Since you are running H2O locally, you want to save that data as a file and then use:
h2o.importFile(localH2O, file_path, key='test_input')
This will have each thread read its part of the file in parallel. If you run H2O on a separate server, you would need to copy the data to a location that the server can read from (most people don't set up their servers to read from the file system on their laptops).
as.h2o() serially uploads the file to H2O. With h2o.importFile(), the H2O server finds the file and reads it in parallel.
It looks like you are using version 2 of H2O. The same commands will work in H2Ov3, but some of the parameter names have changed a little. The new parameter names are here: http://cran.r-project.org/web/packages/h2o/h2o.pdf
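For H2O v3, a minimal sketch of the write-then-import approach might look like the following. It assumes a local H2O instance that can read R's temporary directory. The helper name import_via_disk is hypothetical, and fwrite from data.table is used for the CSV write, although write.csv would also work.

```r
library(data.table)
library(h2o)

# Hypothetical helper (not part of h2o): write an in-memory table to a
# temporary CSV, then let the H2O server parse that file in parallel.
# Assumes H2O runs on the same machine, so it can see the temp file.
import_via_disk <- function(dt, frame_id) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))   # delete the temporary file once the import returns
  fwrite(dt, path)        # data.table's multi-threaded CSV writer
  # In H2O v3 the frame name is passed as destination_frame (v2 used key)
  h2o.importFile(path, destination_frame = frame_id)
}

# Usage, assuming `temp` is the data.table built in the question:
# h2o.init(nthreads = -1, max_mem_size = "16g")
# tmp.obj <- import_via_disk(temp, "test_input")
```

If H2O runs on a separate server, the file would instead need to be written to a location that the server can read.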
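Having also struggled with this problem, I ran some tests and found that for objects in R memory (i.e. you don't have the luxury of already having them available in .csv or .txt form), by far the quickest way to load them (~21x) is to use the fwrite function in data.table to write a csv to disk and then read it using h2o.importFile.
The four approaches I tried:
1. as.h2o()
2. write.csv then h2o.importFile()
3. Split the dataset in half, load both halves with as.h2o(), then combine them with h2o.rbind()
4. fwrite then h2o.importFile()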
I performed the tests on a data.frame of varying size, and the results seem pretty clear.
The code, if anyone is interested in reproducing, is below.
library(h2o)
library(data.table)
h2o.init()
testdf <-as.data.frame(matrix(nrow=4000000,ncol=100))
testdf[1:1000000,] <-1000 # R won't let me assign the whole thing at once
testdf[1000001:2000000,] <-1000
testdf[2000001:3000000,] <-1000
testdf[3000001:4000000,] <-1000
resultsdf <-as.data.frame(matrix(nrow=20,ncol=5))
names(resultsdf) <-c("subset","method 1 time","method 2 time","method 3 time","method 4 time")
for(i in 1:20){
subdf <- testdf[1:(200000*i),]
resultsdf[i,1] <-nrow(subdf) # record the actual subset size (200000*i rows)
# 1: use as.h2o()
start <-Sys.time()
as.h2o(subdf)
stop <-Sys.time()
resultsdf[i,2] <-as.numeric(stop)-as.numeric(start)
# 2: use write.csv then h2o.importFile()
start <-Sys.time()
write.csv(subdf,"hundredsandthousands.csv",row.names=FALSE)
h2o.importFile("hundredsandthousands.csv")
stop <-Sys.time()
resultsdf[i,3] <-as.numeric(stop)-as.numeric(start)
# 3: Split dataset in half, load both halves, then merge
start <-Sys.time()
length_subdf <-dim(subdf)[1]
h2o1 <-as.h2o(subdf[1:(length_subdf/2),])
h2o2 <-as.h2o(subdf[(1+length_subdf/2):length_subdf,])
h2o.rbind(h2o1,h2o2)
stop <-Sys.time()
resultsdf[i,4] <- as.numeric(stop)-as.numeric(start)
# 4: use fwrite then h2o.importFile()
start <-Sys.time()
fwrite(subdf,file="hundredsandthousands.csv",row.names=FALSE)
h2o.importFile("hundredsandthousands.csv")
stop <-Sys.time()
resultsdf[i,5] <-as.numeric(stop)-as.numeric(start)
plot(resultsdf[,1],resultsdf[,2],xlim=c(0,4000000),ylim=c(0,900),xlab="rows",ylab="time/s",main="Scaling of different methods of h2o frame loading")
for (j in 1:3){ # j, so the outer loop index i is not overwritten
points(resultsdf[,1],resultsdf[,(j+2)],col=j+1)
}
legendtext <-c("as.h2o","write.csv then h2o.importFile","Split in half, as.h2o and rbind","fwrite then h2o.importFile")
legend("topleft",legend=legendtext,col=c(1,2,3,4),pch=1)
print(resultsdf)
flush.console()
}
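Applied to the original question's workflow of many chunks, the fwrite plus h2o.importFile approach could be wrapped up roughly as below. This is only a sketch under assumptions: score_chunk, model, chunk and the output path are hypothetical names, H2O is assumed to run locally, and predictions are written out server-side with h2o.exportFile instead of being pulled back into R.

```r
library(data.table)
library(h2o)

# Hypothetical helper: score one in-memory chunk with a fitted H2O model
# and write the predictions to out_path.
score_chunk <- function(model, dt, out_path) {
  in_path <- tempfile(fileext = ".csv")
  on.exit(unlink(in_path))                 # clean up the temporary input file
  fwrite(dt, in_path)                      # fast CSV write from R memory
  frame <- h2o.importFile(in_path)         # parallel parse on the H2O side
  preds <- h2o.predict(model, frame)       # predictions stay in H2O
  h2o.exportFile(preds, out_path, force = TRUE)  # force = TRUE overwrites
  h2o.rm(frame)                            # free cluster memory before the
  h2o.rm(preds)                            # next chunk is loaded
  invisible(out_path)
}

# Usage, assuming `model` is a trained H2O model and `chunk` a data.table:
# score_chunk(model, chunk, "/tmp/predictions_001.csv")
```

Removing each frame after scoring keeps memory use on the H2O side roughly constant across chunks.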