Which API in Java to use for file reading to have best performance?

At the place where I work, we regularly handle files with more than a million rows per file. Even though the server has more than 10 GB of memory, with 8 GB allocated to the JVM, the server sometimes hangs for a few moments and chokes the other tasks.

I profiled the code and found that while reading the file, memory use frequently climbs into the gigabytes (1 GB to 3 GB) and then suddenly comes back down to normal. It seems that this frequent rise and fall of memory use is what hangs my servers. Of course, this was due to garbage collection.

Which API should I use to read the files for better performance?

Right now I am using BufferedReader(new FileReader(...)) to read these CSV files.

Process: how I am reading the files

  1. I read the files line by line.
  2. Every line has a few columns. Based on the type, I parse each one correspondingly (the cost column as double, the visit column as int, the keyword column as String, etc.).
  3. I push the eligible content (visit > 0) into a HashMap and finally clear that Map at the end of the task (see the sketch after this list).
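For reference, a minimal sketch of the reading loop described above, assuming a hypothetical comma-separated layout of keyword, visits, cost; the actual column order and the value stored per keyword may differ:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CsvLoader {
    public static Map<String, Double> load(String path) throws IOException {
        Map<String, Double> eligible = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Hypothetical layout: keyword,visits,cost
                String[] cols = line.split(",");
                String keyword = cols[0];
                int visits = Integer.parseInt(cols[1]);
                double cost = Double.parseDouble(cols[2]);
                if (visits > 0) {          // keep only eligible rows
                    eligible.put(keyword, cost);
                }
            }
        }
        return eligible;
    }
}
```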

Update

I read 30 or 31 files this way (one month's data) and store the eligible rows in a Map. Later this Map is used to find some culprits in different tables, so reading the files and storing that data are both necessary. I have since switched the HashMap part to BerkeleyDB, but the issue at file-reading time is the same or even worse.

asked Dec 05 '22 by DKSRathore

1 Answer

BufferedReader is one of the two best APIs to use for this. If you really had trouble with file reading, an alternative might be to use the stuff in NIO to memory-map your files and then read the contents directly out of memory.
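As a rough illustration of that alternative, here is a hedged sketch using java.nio's FileChannel and MappedByteBuffer. It assumes a file small enough to fit in a single mapping (a MappedByteBuffer is limited to 2 GB) and ASCII content, and the byte-by-byte line splitting is deliberately simplified:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedReader {
    public static void readLines(String path) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            // Map the whole file into memory; the OS pages it in on demand.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            StringBuilder line = new StringBuilder();
            while (buffer.hasRemaining()) {
                char c = (char) buffer.get();   // OK for ASCII data
                if (c == '\n') {
                    process(line.toString());   // handle one line
                    line.setLength(0);
                } else {
                    line.append(c);
                }
            }
            if (line.length() > 0) {
                process(line.toString());       // last line without trailing newline
            }
        }
    }

    private static void process(String line) {
        // parse and aggregate as needed
    }
}
```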

But your problem is not with the reader. Your problem is that every read operation creates a bunch of new objects, most likely in the stuff you do just after reading.

You should consider cleaning up your input processing with an eye on reducing the number and/or size of objects you create, or simply getting rid of objects more quickly once no longer needed. Would it be possible to process your file one line or chunk at a time rather than inhaling the whole thing into memory for processing?
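To illustrate that idea, a sketch that aggregates per keyword as it reads, keeping only small running totals instead of every eligible row; the column layout is again a hypothetical one, and the right aggregation depends on what the "culprit" lookups actually need:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class StreamingAggregator {
    public static Map<String, double[]> aggregate(String path) throws IOException {
        // keyword -> {total visits, total cost}; only the totals are retained in memory
        Map<String, double[]> totals = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");          // hypothetical: keyword,visits,cost
                int visits = Integer.parseInt(cols[1]);
                if (visits <= 0) {
                    continue;                             // discard ineligible rows immediately
                }
                double cost = Double.parseDouble(cols[2]);
                double[] t = totals.computeIfAbsent(cols[0], k -> new double[2]);
                t[0] += visits;
                t[1] += cost;
            }
        }
        return totals;
    }
}
```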

Another possibility would be to fiddle with garbage collection. You have two mechanisms:

  • Explicitly call the garbage collector every once in a while, say every 10 seconds or every 1000 input lines or something (a sketch follows this list). This will increase the total amount of work done by the GC, but each collection will take less time, your memory won't swell as much, and so hopefully there will be less impact on the rest of the server.

  • Fiddle with the JVM's garbage collector options. These differ between JVMs, but java -X should give you some hints.
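A hedged sketch of the first option; the 1000-line threshold is arbitrary and worth tuning, and System.gc() is only a hint, so whether this helps at all depends on the JVM:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class PeriodicGcReader {
    public static void read(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            long count = 0;
            while ((line = reader.readLine()) != null) {
                // ... parse and process the line ...
                if (++count % 1000 == 0) {
                    System.gc();   // hint: collect more often, in smaller increments
                }
            }
        }
    }
}
```

For the second option, heap-size flags such as -Xmx and -Xms and the collector-selection switches listed by your JVM's own documentation (start with java -X) are the usual places to experiment.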

Update: Most promising approach:

Do you really need the whole dataset in memory at one time for processing?

answered Feb 13 '23 by Carl Smotricz