Best way of loading a large text file in Java

I have a text file, with a sequence of integers per line:

47202 1457 51821 59788 
49330 98706 36031 16399 1465
...

The file has 3 million lines in this format. I have to load this file into memory, extract 5-grams out of it, and do some statistics on it. I have a memory limitation (8 GB of RAM). I tried to minimize the number of objects I create (only 1 class with 6 float variables, and some methods). Each line of that file basically generates a number of objects of this class, proportional to the size of the line in terms of number of words. I'm starting to feel that Java is not a good way to do these things when C++ is around.

Edit: Assume that each line produces (n-1) objects of that class, where n is the number of tokens in that line separated by spaces (e.g. 1457). Considering an average size of 10 words per line, each line gets mapped to 9 objects on average, so there will be 9 × 3×10^6 objects. The memory needed is therefore: 9 × 3×10^6 × (8-byte object header + 6 × 4-byte floats), plus a map(String, Objects) and another map(Integer, ArrayList(Objects)). I need to keep everything in memory, because there will be some mathematical optimization happening afterwards.
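
Working those figures through: 9 × 3×10^6 = 2.7×10^7 objects, and at (8 + 6 × 4) = 32 bytes per object that is roughly 8.6×10^8 bytes, i.e. about 860 MB for the objects alone, before counting the two maps. (On a 64-bit HotSpot JVM the real header is 12-16 bytes plus alignment padding, so the true footprint would be higher still.)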

asked Oct 02 '14 by user3639557


People also ask

Is BufferedReader faster?

BufferedReader is a bit faster than Scanner, because Scanner parses the input data while BufferedReader simply reads a sequence of characters.

How do I read large files?

To be able to open such large CSV files, you need to download and use a third-party application. If all you want is to view such files, then Large Text File Viewer is the best choice for you. For actually editing them, you can try a feature-rich text editor like Emacs, or go for a premium tool like CSV Explorer.

What is the best way to read a file in Java?

You can use BufferedReader to read large files line by line. If you want to read a file that has its content separated by a delimiter, use the Scanner class. You can also use the Java NIO Files class to read both small and large files.


1 Answer

Reading/Parsing the file:

The best way to handle large files, in any language, is to try NOT to load them into memory.

In Java, have a look at MappedByteBuffer. It allows you to map a file into process memory and access its contents without loading the whole thing into your heap.
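
Here is a minimal sketch of that approach; the file name numbers.txt and the handleToken hook are placeholders, and since a single MappedByteBuffer is capped at Integer.MAX_VALUE bytes, a file larger than 2 GB would need to be mapped in chunks:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedScan {
        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(Paths.get("numbers.txt"),
                                                        StandardOpenOption.READ)) {
                // map the whole file read-only; nothing is copied into the heap
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY,
                                                   0, channel.size());
                long current = 0;
                boolean inNumber = false;
                while (buf.hasRemaining()) {
                    byte b = buf.get();
                    if (b >= '0' && b <= '9') {        // accumulate a decimal digit
                        current = current * 10 + (b - '0');
                        inNumber = true;
                    } else if (inNumber) {             // space/newline ends the token
                        handleToken(current);
                        current = 0;
                        inNumber = false;
                    }
                }
                if (inNumber) handleToken(current);    // file may not end with a newline
            }
        }

        private static void handleToken(long value) {
            // placeholder: feed each integer into the 5-gram statistics here
        }
    }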

You might also try reading the file line by line, discarding each line once you have processed it, again to avoid holding the entire file in memory at once.
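
A sketch of the line-by-line variant (again, numbers.txt is a placeholder); each line is parsed and then becomes garbage before the next one is read:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class StreamLines {
        public static void main(String[] args) throws IOException {
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("numbers.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] tokens = line.split(" ");
                    // a line with n tokens yields (n-1) consecutive pairs
                    for (int i = 0; i + 1 < tokens.length; i++) {
                        int first = Integer.parseInt(tokens[i]);
                        int second = Integer.parseInt(tokens[i + 1]);
                        // build the per-line objects / update statistics here
                    }
                    // once this iteration ends, the line is eligible for GC
                }
            }
        }
    }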

Handling the resulting objects

For dealing with the objects you produce while parsing, there are several options:

  1. Same as with the file itself: if you can perform whatever computation you need without keeping all of the objects in memory (while "streaming" the file), that is the best solution. You didn't describe the problem you're trying to solve, so I don't know whether that's possible.

  2. Compression of some sort: switch from wrapper objects (Float) to primitives (float), use something like the flyweight pattern to store your data in giant float[] arrays and only construct short-lived objects to access it (see the sketch after this list), or find some pattern in your data that allows you to store it more compactly.

  3. Caching/offload: if your data still doesn't fit in memory, "page it out" to disk. This can be as simple as extending Guava to page out to disk, or bringing in a library like Ehcache or the like.
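
To make option 2 concrete, here is a minimal sketch of the flyweight idea (the class name FloatStore is made up): all records live in one flat float[], six floats per record, so the heap holds a single array instead of millions of small objects:

    public class FloatStore {
        private static final int FIELDS = 6;   // six float fields per record
        private final float[] data;

        public FloatStore(int capacity) {
            this.data = new float[capacity * FIELDS];
        }

        public void set(int record, int field, float value) {
            data[record * FIELDS + field] = value;
        }

        public float get(int record, int field) {
            return data[record * FIELDS + field];
        }
    }

At 2.7×10^7 records that works out to 2.7×10^7 × 6 × 4 ≈ 648 MB in one allocation, with no per-object headers at all.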

A note on Java collections, and maps in particular

For small objects, Java collections, and maps in particular, incur a large memory penalty (due mostly to everything being wrapped as Objects and to the Map.Entry instances created per mapping). At the cost of a slightly less elegant API, you should probably look at the GNU Trove collections if memory consumption is an issue.
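
For example (a sketch assuming Trove 3 is on the classpath), an int-keyed Trove map keeps keys primitive, avoiding both Integer boxing and per-mapping Map.Entry objects:

    import gnu.trove.list.array.TFloatArrayList;
    import gnu.trove.map.hash.TIntObjectHashMap;

    public class TroveDemo {
        public static void main(String[] args) {
            // keys stay primitive ints: no Integer boxing, no Map.Entry instances
            TIntObjectHashMap<TFloatArrayList> byToken = new TIntObjectHashMap<>();
            TFloatArrayList values = new TFloatArrayList();
            values.add(0.5f);                              // primitive floats, unboxed
            byToken.put(1457, values);
            System.out.println(byToken.get(1457).get(0));  // prints 0.5
        }
    }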

answered Oct 14 '22 by radai