What my question isn't:
Hardware/Space:
Problem:
It is taking hours to upload my data into h2o. This isn't any special processing, only "as.h2o(...)".
It takes less than a minute using "fread" to get the text into the space, and then I make a few row/col transformations (diffs, lags) and try to import.
The total R memory is ~56GB before trying any sort of "as.h2o", so the 128GB allocated shouldn't be too crazy, should it?
Question:
What can I do to make this take less than an hour to load into h2o? It should take from a minute to a few minutes, no longer.
What I have tried:
Update:
So it looks like my only option is to make a giant text file and then use "h2o.importFile(...)" for it. I'm up to 15GB written.
Update2:
It is a hideous csv file, at ~22GB (~2.4M rows, ~2300 cols). For what it's worth, it took from 12:53pm until 2:44pm to write the csv file. Importing it was substantially faster, after it was written.
Think of as.h2o() as a convenience function that does these steps:
1. Writes your data to a temp file on local disk, using data.table::fwrite() if available (*), otherwise write.csv().
2. Calls h2o.uploadFile() on that temp file.
As your updates say, writing huge data files to disk can take a while. But the other pain point here is using h2o.uploadFile() instead of the quicker h2o.importFile(). The decision of which to use comes down to visibility:
h2o.uploadFile(): your client has to be able to see the file.
h2o.importFile(): your cluster has to be able to see the file.
When your client is running on the same machine as one of your cluster nodes, your data file is visible to both client and cluster, so always prefer h2o.importFile(). (It does a multi-threaded import.)
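As a rough sketch of doing those steps by hand, with h2o.importFile() in place of the upload: the small `dt` table, the temp-file location and the "my_data" frame name below are stand-ins chosen for illustration, and the code assumes the client and the H2O cluster run on the same machine so both can see the file.

```r
library(data.table)
library(h2o)

h2o.init()  # assumes a local cluster, so client and cluster share a filesystem

# Small stand-in for the large table already built in the R session
dt <- data.table(id = 1:1000, x = rnorm(1000), y = rnorm(1000))

# Step 1: write the data to disk ourselves, using the multi-threaded writer
csv_path <- tempfile(fileext = ".csv")  # hypothetical location
fwrite(dt, csv_path)

# Step 2: have the cluster read the file directly instead of uploading it
hf <- h2o.importFile(path = csv_path, destination_frame = "my_data")

# Step 3: remove the temp file once the frame is parsed
unlink(csv_path)

dim(hf)
```

For the real data set you would skip the stand-in table and point fwrite() at your own object.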
Another couple of tips: only bring data into the R session that you actually need there. And remember both R and H2O are column-oriented, so cbind can be quick. If you just need to process 100 of your 2300 columns in R, have them in one csv file, and keep the other 2200 columns in another csv file. Then h2o.cbind() them after loading each into H2O.
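A minimal sketch of that split-file idea, using a tiny made-up table: the file names, column names and the diff transformation are hypothetical, and it assumes both files keep their rows in the same order and that the cluster can see the files.

```r
library(data.table)
library(h2o)

h2o.init()  # assumes a local cluster that can see the files below

# Stand-in for the two source files (hypothetical names and columns)
work_dir <- tempfile("split_")
dir.create(work_dir)
needed_path <- file.path(work_dir, "needed.csv")  # columns processed in R
rest_path   <- file.path(work_dir, "rest.csv")    # columns R never touches
full <- data.table(a = rnorm(500), b = rnorm(500), c = rnorm(500), d = rnorm(500))
fwrite(full[, .(a, b)], needed_path)
fwrite(full[, .(c, d)], rest_path)

# Only the columns that need processing enter the R session
needed <- fread(needed_path)
needed[, a_diff := a - shift(a)]  # example diff against the previous row
fwrite(needed, needed_path)

# Load each file into H2O separately, then bind the columns there
hf_needed <- h2o.importFile(needed_path)
hf_rest   <- h2o.importFile(rest_path)
hf <- h2o.cbind(hf_needed, hf_rest)

dim(hf)
```

The larger file goes from disk straight into H2O, so R only pays for the columns it transforms.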
*: Use h2o:::as.h2o.data.frame (without parentheses) to see the actual code. For data.table writing you need to first do options(h2o.use.data.table = TRUE); you can also optionally switch it on/off with the h2o.fwrite option.
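For reference, the footnote's settings might be used like this. It is only a sketch: it assumes the data.table package is installed and a local cluster is running, and it uses the built-in iris data as a stand-in.

```r
library(h2o)

h2o.init()

# Ask as.h2o() to write its temp file with data.table
options(h2o.use.data.table = TRUE)

# Print the function body (no parentheses) to see which writer it picks
h2o:::as.h2o.data.frame

# Small stand-in data set; as.h2o() still goes through a temp file and an upload
hf <- as.h2o(iris)
dim(hf)
```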