 

Best way to store and retrieve large data sets with Java

I am currently working on a Java project where I must perform several Information Retrieval and Classification tasks over a very large dataset. A small collection has about 10K documents, and from each document I extract roughly 100 vectors of 150 doubles each, so about 1M vectors (150M doubles) per collection. After storing them I need to recall all of them, or some percentage of them, and run clustering (e.g. k-means). Actual collections have many more documents (I am currently dealing with 200K).

Of course I have hit OutOfMemoryError several times, and my last workaround was storing everything in 10 huge XML files totalling over 5GB. It had to be 10 files because the DOM writer ran out of memory on anything larger. For reading I used a SAX parser, which did the job without loading the files into memory. Still, storing a double as text multiplies its size on disk and adds the computational cost of parsing and converting it back. Finally, clustering algorithms are usually iterative, so they need the same data again and again; my method cached nothing and just re-read everything from disk on every pass.

I am now searching for a more compact way of storing any amount of data in binary form (a database, a raw binary file, etc.) and an efficient way of reading it back. Does anyone have any ideas to propose?
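For reference, the kind of raw binary layout I have in mind could look like this (class and file names are just a sketch; a real implementation would need long offsets or chunked mappings for files over 2 GB):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;

public class VectorFile {
    static final int DIM = 150; // vector dimensionality from the question

    // Dump vectors back-to-back as raw 8-byte doubles: no parsing,
    // no text inflation, file size = count * DIM * 8 bytes.
    static void write(String path, double[][] vectors) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
             FileChannel ch = raf.getChannel()) {
            ByteBuffer buf = ByteBuffer.allocate(vectors.length * DIM * 8);
            for (double[] v : vectors)
                for (double x : v) buf.putDouble(x);
            buf.flip();
            ch.write(buf);
        }
    }

    // Memory-map the file and read one vector by index. For iterative
    // algorithms like k-means the OS page cache keeps hot pages in RAM,
    // so repeated passes avoid re-parsing and most re-reading.
    // Note: a single mapping is limited to 2 GB and the int index
    // arithmetic below overflows on very large files (sketch only).
    static double[] read(String path, int index) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel ch = raf.getChannel()) {
            DoubleBuffer db = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
                                .asDoubleBuffer();
            double[] v = new double[DIM];
            db.position(index * DIM);
            db.get(v);
            return v;
        }
    }
}
```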

asked Oct 03 '12 by Lazaros Tsochatzidis

2 Answers

Use an embedded database or key-value store; there are plenty of them, e.g. JDBM3. And storing this in XML is a strange choice: you could simply dump the arrays to a file using standard Java serialization.
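For example, a minimal sketch of the standard-serialization approach (class and file names here are illustrative, not from any particular library):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SerializedVectors {
    // Dump the whole vector array in one writeObject call:
    // 8 bytes per double plus a small header, instead of the
    // inflated text representation you get with XML.
    static void save(String path, double[][] vectors) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            out.writeObject(vectors);
        }
    }

    // Read everything back in one call; no parsing or conversion cost.
    static double[][] load(String path)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            return (double[][]) in.readObject();
        }
    }
}
```

This only works if the whole array fits in heap at once; for anything larger, split it into several files or switch to the embedded key-value store mentioned above.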

answered Sep 21 '22 by Alexei Kaigorodov


I am not so sure about your exact case, but for our "large data handling" needs we used a NoSQL database and it worked quite well.

answered Sep 21 '22 by jakub.petr