I'm trying to think how should I utilize threads in my program. Right now I have a single threaded program that reads a single huge file. Very simple program, just reads line by line and collects some statistics about the words. Now, I would like to use multi threads to make it faster. I'm not sure how to approach this.
One solution is to separate the data into X pieces in advance, then have X threads, each runs on one piece simultaneously, with one sync method to write the stats to memory. Is there a better approach? specifically, I would like to avoid separating the data in advance.
Thanks!
First of all, do some profiling to make sure that your process is actually compute-bound and not I/O bound. That is, that your statistics collection is slower than accessing the file. Otherwise, multi-threading will slow your program, not speed it, particularly if you are running on a single-core CPU (or an ancient JVM).
Also consider: if your file resides on a hard drive: how will you schedule reads? You risk adding hard drive seek delays otherwise, stalling all threads that have managed to finish their chunk of work, while one thread is asking the hard drive to seek to position 0x03457000...
You could have a look at the producer-consumer approach. It is a classical threading problem where one thread produces data (in your case the one that reads the file) and writes it to a shared buffer from where another thread reads that data (consumer) which is your calculation thread (some Java examples).
Also have a look at Javas non-blocking IO.
The assumption that multithreaded disk access is faster might be wrong, as disguessed here: Directory walker on modern operating systems slower when it's multi-threaded?
Performance improvement could be achieved by splitting reading and processing of data in separate threads.
But wait, reading files line-by-line? That doesn't sounds optimal. Better read them as stream of characters (using FileReader
).
See this sun tutorial.
if your problem is I/O Bound, maybe you can consider splitting your data into multiple files and putting it into a distributed filesystem such as Hadoop Filesystem (HDFS) and then run Map/Reduce operation on it?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With