Best way of loading a large text file in Java

Tags: java, string, memory

I have a text file with a sequence of integers on each line:

47202 1457 51821 59788 
49330 98706 36031 16399 1465
...

The file has 3 million lines in this format. I have to load it into memory, extract 5-grams from it, and compute some statistics on them. I have a memory limit (8 GB RAM). I tried to minimize the number of objects I create (there is only one class, with 6 float fields and some methods). Each line of the file generates a number of objects of this class proportional to the length of the line in words. I am starting to feel that Java is not a good fit for this kind of work when C++ is around.

Edit: Assume that each line produces (n-1) objects of that class, where n is the number of space-separated tokens in that line (e.g. 1457 is one token). With an average of 10 words per line, each line maps to 9 objects on average, so there will be 9*3*10^6 objects. The memory needed is therefore 9*3*10^6*(8-byte object header + 6 x 4-byte floats), plus a map (String -> objects) and another map (Integer -> ArrayList of objects). I need to keep everything in memory, because some mathematical optimization happens afterwards.
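For reference, the object part of that estimate can be worked through as below. This is only a rough sketch: the 8-byte header is the figure from the question, while the 12-byte header and the 8-byte alignment are assumptions about a typical 64-bit JVM, and the two maps are not included at all. `MemoryEstimate` and `payloadBytes` are names made up for the illustration.

```java
// Back-of-the-envelope estimate using the figures stated in the question.
final class MemoryEstimate {
    /** Bytes for `objects` instances, each holding a header plus six 4-byte floats. */
    static long payloadBytes(long objects, int headerBytes) {
        long raw = headerBytes + 6L * Float.BYTES; // header + 6 floats
        long aligned = (raw + 7) / 8 * 8;          // assumption: objects are padded to 8 bytes
        return objects * aligned;
    }

    public static void main(String[] args) {
        long objects = 9L * 3_000_000L; // 9 objects per line, 3 million lines
        // Header size as stated in the question.
        System.out.println(payloadBytes(objects, 8));  // 864000000
        // Assumed header size on a 64-bit JVM with compressed pointers.
        System.out.println(payloadBytes(objects, 12)); // 1080000000
    }
}
```

Either way the objects themselves come to roughly 1 GB, so the per-entry overhead of the two maps is likely to matter at least as much as the objects.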


asked Oct 02 '14 07:10

user3639557

1 Answer

Reading/Parsing the file:

The best way to handle large files, in any language, is to try NOT to load them into memory.

In Java, have a look at MappedByteBuffer. It allows you to map a file into process memory and access its contents without loading the whole thing into your heap.
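A minimal sketch of the mapping approach, assuming the file contains only non-negative ASCII integers separated by whitespace, as in the sample above. The class and method names are hypothetical, and summing the tokens just stands in for whatever per-token work is needed.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedIntReader {
    /** Sums every whitespace-separated non-negative integer in the file via a memory mapping. */
    static long sumTokens(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // One mapping covers at most Integer.MAX_VALUE bytes; a larger file needs several.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            int current = 0;
            boolean inToken = false;
            while (buf.hasRemaining()) {
                byte b = buf.get();
                if (b >= '0' && b <= '9') {
                    // Accumulate digits without creating a String per token.
                    current = current * 10 + (b - '0');
                    inToken = true;
                } else if (inToken) {
                    // Any non-digit byte (space, newline) ends the current token.
                    sum += current;
                    current = 0;
                    inToken = false;
                }
            }
            if (inToken) sum += current; // last token when the file has no trailing newline
            return sum;
        }
    }
}
```

Parsing digits straight from the buffer avoids allocating a String for every line and token, which is where much of the garbage comes from when a file is read naively.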

You might also try reading the file line by line and discarding each line once you have processed it, again to avoid holding the entire file in memory at once.
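As a rough illustration of the line-by-line approach, the sketch below counts 5-grams while streaming, so only the current line is ever held. It assumes a 5-gram is five consecutive tokens within one line; `FiveGramCounter` is a hypothetical name, and the map of boxed lists is used only to keep the example short, not because it is memory-efficient.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FiveGramCounter {
    static final int N = 5;

    /** Counts each run of N consecutive tokens within a line; lines are never retained. */
    static Map<List<Integer>, Integer> count(Reader source) throws IOException {
        Map<List<Integer>, Integer> counts = new HashMap<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) continue;
                String[] tokens = trimmed.split("\\s+");
                // Lines with fewer than N tokens contribute nothing.
                for (int i = 0; i + N <= tokens.length; i++) {
                    Integer[] gram = new Integer[N];
                    for (int j = 0; j < N; j++) {
                        gram[j] = Integer.valueOf(tokens[i + j]);
                    }
                    counts.merge(Arrays.asList(gram), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

For a file on disk, the reader could come from `Files.newBufferedReader(path)`.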

Handling the resulting objects

For dealing with the objects you produce while parsing, there are several options:

1. Same as with the file itself: if you can do whatever you need without keeping all of them in memory (while "streaming" the file), that is the best solution. You didn't describe the problem you're trying to solve, so I don't know if that's possible.
2. Compression of some sort: switch from wrapper objects (Float) to primitives (float), use something like the flyweight pattern to store your data in giant float[] arrays and only construct short-lived objects to access it, or find some pattern in your data that allows you to store it more compactly.
3. Caching/offload: if your data still doesn't fit in memory, "page it out" to disk. This can be as simple as extending Guava to page out to disk, or bringing in a library like Ehcache.
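The float[] idea from the second option could look roughly like this. It is a sketch under the assumption that every record is exactly 6 floats, as in the question; `StatsStore` and its methods are invented for the example. A record is identified by an int index rather than an object reference, so there is no per-record header at all.

```java
import java.util.Arrays;

final class StatsStore {
    static final int FIELDS = 6; // six floats per record, as in the question

    private float[] data;
    private int size;

    StatsStore(int initialCapacity) {
        data = new float[Math.max(1, initialCapacity) * FIELDS];
    }

    /** Appends a zeroed record and returns its index. */
    int add() {
        if ((size + 1) * FIELDS > data.length) {
            // Doubling is fine at this scale; a single array tops out near 2^31 elements.
            data = Arrays.copyOf(data, data.length * 2);
        }
        return size++;
    }

    float get(int record, int field) {
        return data[offset(record, field)];
    }

    void set(int record, int field, float value) {
        data[offset(record, field)] = value;
    }

    int size() {
        return size;
    }

    private int offset(int record, int field) {
        if (record < 0 || record >= size || field < 0 || field >= FIELDS) {
            throw new IndexOutOfBoundsException("record " + record + ", field " + field);
        }
        return record * FIELDS + field;
    }
}
```

With 27 million records this is one array of about 648 MB (27M x 6 x 4 bytes) instead of 27 million separate objects. A short-lived view object wrapping a store and an index can be layered on top if an object-style API is preferred.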

A note on Java collections, and maps in particular

For small objects, Java collections, and maps in particular, incur a large memory penalty (due mostly to everything being wrapped as Objects and the existence of the Map.Entry inner class instances). At the cost of a slightly less elegant API, you should probably look at GNU Trove collections if memory consumption is an issue.
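To show where the saving comes from, here is a hand-rolled int-to-int map that keeps keys and values in two primitive arrays. It is not Trove's API, only an illustration of the general technique such libraries rely on, and it assumes `Integer.MIN_VALUE` never occurs as a key (the tokens in the question are non-negative). Removal is left out for brevity.

```java
import java.util.Arrays;

final class IntIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // assumption: never a real key

    private int[] keys;
    private int[] values;
    private int size;

    IntIntMap(int expectedSize) {
        int cap = 16;
        while (cap < expectedSize * 2L && cap < (1 << 30)) cap <<= 1; // power of two
        keys = new int[cap];
        Arrays.fill(keys, EMPTY);
        values = new int[cap];
    }

    /** Linear probing: returns the slot holding `key`, or the empty slot where it belongs. */
    private int slot(int key) {
        int mask = keys.length - 1;
        int i = ((key * 0x9E3779B9) >>> 7) & mask; // cheap multiplicative hash
        while (keys[i] != EMPTY && keys[i] != key) i = (i + 1) & mask;
        return i;
    }

    int get(int key, int defaultValue) {
        if (key == EMPTY) return defaultValue;
        int i = slot(key);
        return keys[i] == key ? values[i] : defaultValue;
    }

    void put(int key, int value) {
        if (key == EMPTY) throw new IllegalArgumentException("reserved key");
        int i = slot(key);
        if (keys[i] == EMPTY) {
            keys[i] = key;
            size++;
        }
        values[i] = value;
        if (size * 2 > keys.length) resize(); // keep the table at most half full
    }

    int size() {
        return size;
    }

    private void resize() {
        int[] oldKeys = keys;
        int[] oldValues = values;
        keys = new int[oldKeys.length * 2];
        Arrays.fill(keys, EMPTY);
        values = new int[keys.length];
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != EMPTY) put(oldKeys[j], oldValues[j]);
        }
    }
}
```

Each entry costs 8 bytes of array space (times the slack from the load factor), with no Integer boxes and no Map.Entry objects. In practice a maintained primitive-collections library is a better choice than maintaining this by hand.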


answered Oct 14 '22 04:10

radai
