
H2 database: load CSV data faster

Tags:

h2

I want to load about 2 million rows from a CSV file into a database, run some SQL statements for analysis, and then remove the data. The file is 2 GB and contains web server log messages. I did some research and found that the H2 in-memory database seems to be faster, since it keeps the data in memory. When I tried to load the data, I got an OutOfMemory error because of 32-bit Java. I am planning to try with 64-bit Java.

I am looking for all the optimization options to load the data quickly and run the SQL.

test.sql

CREATE TABLE temptable (
  f1 varchar(250) NOT NULL DEFAULT '',
  f2 varchar(250) NOT NULL DEFAULT '',
  f3 reponsetime NOT NULL DEFAULT ''
  ) as select * from CSVREAD('log.csv');

I am running it like this in 64-bit Java:

java -Xms256m -Xmx4096m -cp h2*.jar org.h2.tools.RunScript -url 'jdbc:h2:mem:test;LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0' -script test.sql

If any other database is available for use on AIX, please let me know.

Thanks

sfgroups asked Apr 03 '13 01:04


1 Answer

If the CSV file is 2 GB, then it will need more than 4 GB of heap memory when using a pure in-memory database. The exact memory requirements depend a lot on how redundant the data is. If the same values appear over and over again, then the database will need less memory, as common objects are re-used (no matter whether it's a string, long, timestamp, ...).
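
As a rough illustration of that rule of thumb, here is a minimal Java sketch that compares the JVM's maximum heap with the CSV file size before attempting a pure in-memory load. The class name `HeapCheck`, its methods, and the factor of 2 are assumptions made for this example (the factor is simply read off the "2 GB file, more than 4 GB heap" remark); the real requirement depends on how redundant the data is.

```java
// Rough pre-check: is the JVM heap likely large enough for a pure in-memory load?
// HeapCheck is a hypothetical helper, not part of H2.
final class HeapCheck {
    // Assumption: the heap must exceed 2x the CSV size (2 GB file -> more than 4 GB heap).
    static final long HEAP_FACTOR = 2;

    // Heap size that has to be exceeded for a CSV of the given size.
    static long minHeapBytes(long csvBytes) {
        return csvBytes * HEAP_FACTOR;
    }

    // True only if the heap is strictly larger than the estimated minimum.
    static boolean mayFit(long csvBytes, long maxHeapBytes) {
        return maxHeapBytes > minHeapBytes(csvBytes);
    }

    public static void main(String[] args) {
        // "log.csv" is the file name from the question; pass another path as the first argument.
        java.io.File csv = new java.io.File(args.length > 0 ? args[0] : "log.csv");
        long maxHeap = Runtime.getRuntime().maxMemory(); // roughly the -Xmx value
        System.out.println("CSV bytes:      " + csv.length());
        System.out.println("Max heap bytes: " + maxHeap);
        System.out.println("May fit in mem: " + mayFit(csv.length(), maxHeap));
    }
}
```

With this estimate, a 2 GB file and -Xmx4096m would be reported as not fitting, so either a larger heap or one of the more memory-efficient modes would be worth trying.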

Please note that LOCK_MODE=0, UNDO_LOG=0, and LOG=0 are not needed when using CREATE TABLE AS SELECT. In addition, CACHE_SIZE does not help when using the mem: prefix (but it helps for in-memory file systems).

I suggest trying the in-memory file system first (memFS: instead of mem:), which is slightly slower than mem: but usually needs less memory:

jdbc:h2:memFS:test;CACHE_SIZE=65536

If this is not enough, try the compressed in-memory mode (memLZF:), which is again slower but uses even less memory:

jdbc:h2:memLZF:test;CACHE_SIZE=65536

If this is still not enough, I suggest trying the regular persistent mode to see how fast it is:

jdbc:h2:~/data/test;CACHE_SIZE=65536
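
The order of things to try can be written down as a small Java sketch that only builds the URL strings, from fastest and most memory-hungry to slowest and least memory-hungry. The class name `H2UrlFallback` and its method are hypothetical, made up for this example; the URLs are the ones given above, and the plain mem: URL carries no extra options since they do not help there.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that lists the H2 URLs in the order suggested above.
final class H2UrlFallback {
    // Cache setting taken from the URLs above; it does not help for plain mem:.
    static final String CACHE = ";CACHE_SIZE=65536";

    // dbName is the in-memory database name, filePath the location for persistent mode.
    static List<String> candidates(String dbName, String filePath) {
        return Arrays.asList(
            "jdbc:h2:mem:" + dbName,               // pure in-memory: fastest, most heap
            "jdbc:h2:memFS:" + dbName + CACHE,     // in-memory file system: slightly slower
            "jdbc:h2:memLZF:" + dbName + CACHE,    // compressed in-memory: slower, less memory
            "jdbc:h2:" + filePath + CACHE);        // regular persistent mode on disk
    }

    public static void main(String[] args) {
        // Prints the four URLs, one per line, for use with RunScript's -url option.
        for (String url : candidates("test", "~/data/test")) {
            System.out.println(url);
        }
    }
}
```

Each printed URL can be passed to the RunScript command from the question in place of the original -url value, moving to the next one only if the previous one runs out of memory.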
Thomas Mueller answered Sep 23 '22 07:09