I have a huge file of around 10 GB. I have to perform operations such as sort, filter, etc. on the file in Java. Each operation can be done in parallel.
Is it a good idea to start 10 threads and read the file in parallel, with each thread reading 1 GB of the file? Is there any other way to handle extra-large files and process them as fast as possible? Is NIO good for such scenarios?
Currently, I perform the operations serially, and it takes around 20 minutes to process such a file.
Thanks,
Is it good to start 10 threads and read the file in parallel?
Almost certainly not, although it depends. If the file is on an SSD (where there's effectively no seek time) then maybe. If it's on a traditional spinning disk, definitely not.
That doesn't mean you can't use multiple threads though - you could potentially create one thread to read the file, performing only the most rudimentary tasks to get the data into processable chunks. Then use a producer/consumer queue to let multiple threads process the data.
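As a rough illustration of that single-reader, multiple-worker layout, here is a minimal sketch. It assumes the file is line-oriented UTF-8 text and uses counting lines that pass a filter as a stand-in for the real work; the class name `SingleReaderPipeline`, the method `countMatching`, the batch size and the queue capacity are all made up for the example, not taken from the question.

```java
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

// Hypothetical example class: one thread reads, several threads process.
public class SingleReaderPipeline {
    // Sentinel ("poison pill") telling a worker that no more batches will arrive.
    private static final List<String> POISON = new ArrayList<>();

    /** Counts lines accepted by the filter; the calling thread is the only reader. */
    public static long countMatching(Path file, int workers, int batchSize,
                                     Predicate<String> filter) throws Exception {
        // Bounded queue: the reader blocks when workers fall behind, capping memory use.
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(workers * 2);
        AtomicLong matches = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            futures.add(pool.submit(() -> {
                while (true) {
                    List<String> batch = queue.take();
                    if (batch == POISON) {
                        return null; // identity check: only the sentinel ends the worker
                    }
                    long n = 0;
                    for (String line : batch) {
                        if (filter.test(line)) {
                            n++;
                        }
                    }
                    matches.addAndGet(n);
                }
            }));
        }
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            // The reader does only the rudimentary work: split into lines, group into batches.
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    queue.put(batch);
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                queue.put(batch);
            }
            for (int i = 0; i < workers; i++) {
                queue.put(POISON); // one sentinel per worker
            }
            for (Future<?> f : futures) {
                f.get(); // wait for completion and surface any worker exception
            }
        } finally {
            pool.shutdownNow(); // also interrupts workers if the reader failed early
        }
        return matches.get();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: SingleReaderPipeline <file>");
            return;
        }
        int workers = Runtime.getRuntime().availableProcessors();
        long count = countMatching(Paths.get(args[0]), workers, 10_000, s -> !s.isEmpty());
        System.out.println("non-empty lines: " + count);
    }
}
```

Filtering fits this shape directly because each batch is independent. A full sort does not: the workers could sort their own batches, but the sorted runs would still need a merge step afterwards, which this sketch leaves out.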
Without knowing more than "sort, filter, etc" (which is pretty vague) we can't really tell how parallelizable the process is in the first place - but trying to perform the IO in parallel on a single file will probably not help.
Try profiling the code to see where the bottlenecks are. Have you tried having one thread read the whole file (or as much of it as fits in memory) and hand the data off to 10 threads for processing? If file I/O is your bottleneck (which seems plausible), this should improve your overall run time, because the reads stay sequential while the processing runs alongside them.
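One cheap measurement to take before reaching for a full profiler is how long it takes just to read the file and throw the bytes away. The sketch below does that; the class name `ReadBaseline` and the 1 MiB buffer size are arbitrary choices for the example. If this alone accounts for most of the 20 minutes, the disk is the limit and extra processing threads will not help much; if it is far quicker, the time is going into the processing.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical example class: times a plain sequential read with no processing.
public class ReadBaseline {
    /** Reads the whole file, discards the bytes, and returns how many were read. */
    public static long readAll(Path file) throws IOException {
        byte[] buffer = new byte[1 << 20]; // 1 MiB; the size is an arbitrary choice
        long total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ReadBaseline <file>");
            return;
        }
        long start = System.nanoTime();
        long bytes = readAll(Paths.get(args[0]));
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("read %d bytes in %.1f s (%.1f MB/s)%n",
                bytes, seconds, bytes / 1e6 / seconds);
    }
}
```

Keep in mind that the operating system caches recently read files, so a second run over the same file can look much faster than the first if the file, or a large part of it, fits in memory.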