Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best practice for holding huge lists of data in Java

I'm writing a small system in Java in which i extract n-gram feature from text files and later need to perform Feature Selection process in order to select the most discriminators features.

The Feature Extraction process for a single file return a Map which contains for each unique feature, its occurrences in the file. I merge all the file's Maps (Map) into one Map that contain the Document Frequency (DF) of all unique features extracted from all the files. The unified Map can contain above 10,000,000 entries.

Currently the Feature Extraction process is working great and i want to perform Feature Selection in which i need to implement Information Gain or Gain Ratio. I will have to sort the Map first, perform computations and save the results in order to finally get a list of (for each feature, its Feature Selection score)

My question is: What is the best practice and the best data structure to hold this large amount of data (~10M) and perform computations?

like image 807
Aviadjo Avatar asked Jan 14 '15 13:01

Aviadjo


People also ask

Which is the best data structure to store and access a large amount of information?

To store huge data donot use arrays where implementation is easy but insertion and deletion is very difficult. Hence use dynamic like linked list where insertion and deletion is easy. Implement your problem using linked list where the members of linked list are arrays of 100 characters size.

How does list store data in Java?

As you already noticed, the List interface can not store data. But the ArrayList can and stores the data in an array. A LinkedList stores that information as linked list. But you can use both of them interchangeable when you only use the List interface.


2 Answers

This is a very broad question, so the answer is going to broad too. The solution depends on (at least) these three things:

  1. The size of your entries

Storing 10,000,000 integers will require about 40MiB of memory, while storing 10,000,000 x 1KiB records will require more than 9GiB. These are two different problems. Ten million integers are trivial to store in memory in any stock Java collection, while keeping 9GiB in memory will force you to tweak and tune the Java Heap and garbage collector. If the entries are even larger, say 1MiB, then you can forget about in-memory storage entirely. Instead, you'll need to focus on finding a good disk backed data structure, maybe a database.

  1. The hardware you're using

Storing ten million 1KiB records on a machine with 8 GiB of ram is not the same as storing them on a server with 128GiB. Things that are pretty much impossible with the former machine are trivial with the latter.

  1. The type of computation(s) you want to do

You've mentioned sorting, so things like TreeMap or maybe PriorityQueue come to mind. But is that the most intensive computation? And what is the key you're using to sort them? Do you plan on locating (getting) entities based on other properties that aren't the key? If so, that requires separate planning. Otherwise you'd need to iterate over all ten million entries.

Do your computations run in a single thread or multiple threads? If you might have concurrent modifications of your data, that requires a separate solution. Data structures such as TreeMap and PriorityQueue would have to be either locked or replaced with concurrent structures such as ConcurrentLinkedHashMap or ConcurrentSkipListMap.

like image 139
Malt Avatar answered Sep 21 '22 13:09

Malt


You can use a caching system, check MapDB it's very efficient and has a tree map implementation (so you can have your data ordered without any effort). Also, it provides data stores to save your data to disk when it cannot be held on memory.

// here a sample that uses the off-heap memory to back the map
Map<String, String> map = DBMaker.newMemoryDirectDB().make().getTreeMap("words");

//put some stuff into map
map.put("aa", "bb");
map.put("cc", "dd");
like image 26
bachr Avatar answered Sep 21 '22 13:09

bachr