Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Loading data bigger than the memory size in h2o

I am experimenting with loading data bigger than the memory size in h2o.

H2o blog mentions: A note on Bigger Data and GC: We do a user-mode swap-to-disk when the Java heap gets too full, i.e., you’re using more Big Data than physical DRAM. We won’t die with a GC death-spiral, but we will degrade to out-of-core speeds. We’ll go as fast as the disk will allow. I’ve personally tested loading a 12Gb dataset into a 2Gb (32bit) JVM; it took about 5 minutes to load the data, and another 5 minutes to run a Logistic Regression.

Here is the R code to connect to h2o 3.6.0.8:

h2o.init(max_mem_size = '60m') # alloting 60mb for h2o, R is running on 8GB RAM machine

gives

java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)

.Successfully connected to http://127.0.0.1:54321/ 

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 561 milliseconds 
    H2O cluster version:        3.6.0.8 
    H2O cluster name:           H2O_started_from_R_RILITS-HWLTP_tkn816 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   0.06 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  2 
    H2O cluster healthy:        TRUE 

Note:  As started, H2O is limited to the CRAN default of 2 CPUs.
       Shut down and restart H2O as shown below to use all your CPUs.
           > h2o.shutdown()
           > h2o.init(nthreads = -1)

IP Address: 127.0.0.1 
Port      : 54321 
Session ID: _sid_b2e0af0f0c62cd64a8fcdee65b244d75 
Key Count : 3

I tried to load a 169 MB csv into h2o.

dat.hex <- h2o.importFile('dat.csv')

which threw an error,

Error in .h2o.__checkConnectionHealth() : 
  H2O connection has been severed. Cannot connect to instance at http://127.0.0.1:54321/
Failed to connect to 127.0.0.1 port 54321: Connection refused

which is indicative of out of memory error.

Question: If H2o promises loading a data set larger than its memory capacity(swap to disk mechanism as the blog quote above says), is this the correct way to load the data?

like image 779
talegari Avatar asked Dec 04 '15 07:12

talegari


1 Answers

Swap-to-disk was disabled by default awhile ago, because performance was so bad. The bleeding-edge (not latest stable) has a flag to enable it: "--cleaner" (for "memory cleaner").
Note that your cluster has an EXTREMELY tiny memory: H2O cluster total memory: 0.06 GB That's 60MB! Barely enough to start a JVM with, much less run H2O. I would be surprised if H2O could come up properly there at all, never mind the swap-to-disk. Swapping is limited to swapping the data alone. If you're trying to do a swap-test, up your JVM to 1 or 2 Gigs ram, and then load datasets that sum more than that.

Cliff

like image 135
Cliff Click Avatar answered Nov 12 '22 00:11

Cliff Click