 

Processing huge files in Java

Tags: java, file, nio

I have a huge file of around 10 GB. I have to perform operations such as sort and filter on this file in Java. Each operation can be done in parallel.

Is it a good idea to start 10 threads and read the file in parallel, with each thread reading 1 GB of the file? Is there any other way to handle extra-large files and process them as fast as possible? Is NIO well suited to such scenarios?

Currently, I am performing the operations serially, and it takes around 20 minutes to process such a file.

Thanks,

jumpa asked Mar 14 '12

2 Answers

Is it good to start 10 threads and read the file in parallel ?

Almost certainly not, although it depends. If the file is on an SSD (where there's effectively no seek time), then maybe. If it's on a traditional disk, definitely not: parallel reads of the same file force the disk head to seek back and forth, which is typically slower than a single sequential read.

That doesn't mean you can't use multiple threads though - you could potentially create one thread to read the file, performing only the most rudimentary tasks to get the data into processable chunks. Then use a producer/consumer queue to let multiple threads process the data.

Without knowing more than "sort, filter, etc" (which is pretty vague) we can't really tell how parallelizable the process is in the first place - but trying to perform the IO in parallel on a single file will probably not help.
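The producer/consumer approach described above can be sketched roughly as follows. This is a minimal illustration, not code from the answer: the chunk size, worker count, and the `contains("ERROR")` filter are placeholder assumptions standing in for whatever "sort, filter, etc." actually means. One thread reads sequentially and pushes line chunks onto a bounded `BlockingQueue`, which also gives you backpressure so the reader can't race far ahead of the workers.

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ProducerConsumerFile {
    // Sentinel chunk signalling end of input to the consumers.
    private static final List<String> POISON = new ArrayList<>();

    static long countMatches(Path input, int workers, int chunkLines) throws Exception {
        // Bounded queue: the single reader blocks when workers fall behind.
        BlockingQueue<List<String>> queue = new LinkedBlockingQueue<>(16);
        AtomicLong matches = new AtomicLong();

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        List<String> chunk = queue.take();
                        if (chunk == POISON) {
                            queue.put(POISON);   // pass the sentinel on to sibling workers
                            return;
                        }
                        // Placeholder "filter" stage: count lines containing a keyword.
                        for (String line : chunk) {
                            if (line.contains("ERROR")) matches.incrementAndGet();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // Single reader thread: sequential IO, chunked hand-off.
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>(chunkLines);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkLines) {
                    queue.put(chunk);
                    chunk = new ArrayList<>(chunkLines);
                }
            }
            if (!chunk.isEmpty()) queue.put(chunk);
        }
        queue.put(POISON);

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        return matches.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Matches: " + countMatches(Path.of(args[0]), 4, 10_000));
    }
}
```

The poison-pill sentinel is one common way to shut the workers down cleanly; each worker re-enqueues it so the remaining workers also see it.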

Jon Skeet answered Oct 13 '22


Try profiling the code to see where the bottlenecks are. Have you tried having one thread read the whole file (or as much as possible) and hand it off to 10 threads for processing? If file I/O is your bottleneck (which seems plausible), this should improve your overall run time.
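The single-reader/multi-worker hand-off could also be sketched with an `ExecutorService` and `Future`s, as below. This is an illustrative assumption about the workload: the non-blank-line count stands in for the real per-chunk processing, and the chunk size is arbitrary. Note that, unlike a bounded queue, this version keeps every pending chunk in memory, so for a genuinely 10 GB file you would want to cap how many chunks are outstanding at once.

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedProcessor {
    // Reads on the calling thread, processes chunks on a pool, merges partial results.
    static long process(Path input, int threads, int chunkLines) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> partials = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<String> chunk = new ArrayList<>(chunkLines);
            String line;
            while ((line = reader.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() == chunkLines) {
                    partials.add(submit(pool, chunk));
                    chunk = new ArrayList<>(chunkLines);
                }
            }
            if (!chunk.isEmpty()) partials.add(submit(pool, chunk));
        }
        long total = 0;
        for (Future<Long> f : partials) total += f.get();  // blocks until each chunk is done
        pool.shutdown();
        return total;
    }

    private static Future<Long> submit(ExecutorService pool, List<String> chunk) {
        // Placeholder per-chunk work: count non-blank lines.
        return pool.submit(() -> chunk.stream().filter(s -> !s.isBlank()).count());
    }
}
```

Merging per-chunk partial results at the end is what makes "filter" and counting-style operations easy to parallelize this way; a global sort would instead need a merge step over sorted chunks.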

Oleksi answered Oct 13 '22