Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using Multithreading in Java to read data

I'm trying to think how should I utilize threads in my program. Right now I have a single threaded program that reads a single huge file. Very simple program, just reads line by line and collects some statistics about the words. Now, I would like to use multi threads to make it faster. I'm not sure how to approach this.

One solution is to separate the data into X pieces in advance, then have X threads, each runs on one piece simultaneously, with one sync method to write the stats to memory. Is there a better approach? specifically, I would like to avoid separating the data in advance.

Thanks!

like image 561
user247866 Avatar asked Mar 02 '10 08:03

user247866


4 Answers

First of all, do some profiling to make sure that your process is actually compute-bound and not I/O bound. That is, that your statistics collection is slower than accessing the file. Otherwise, multi-threading will slow your program, not speed it, particularly if you are running on a single-core CPU (or an ancient JVM).

Also consider: if your file resides on a hard drive: how will you schedule reads? You risk adding hard drive seek delays otherwise, stalling all threads that have managed to finish their chunk of work, while one thread is asking the hard drive to seek to position 0x03457000...

like image 197
Pontus Gagge Avatar answered Oct 25 '22 07:10

Pontus Gagge


You could have a look at the producer-consumer approach. It is a classical threading problem where one thread produces data (in your case the one that reads the file) and writes it to a shared buffer from where another thread reads that data (consumer) which is your calculation thread (some Java examples).

Also have a look at Javas non-blocking IO.

like image 25
Daff Avatar answered Oct 25 '22 07:10

Daff


The assumption that multithreaded disk access is faster might be wrong, as disguessed here: Directory walker on modern operating systems slower when it's multi-threaded?

Performance improvement could be achieved by splitting reading and processing of data in separate threads.

But wait, reading files line-by-line? That doesn't sounds optimal. Better read them as stream of characters (using FileReader).

See this sun tutorial.

like image 39
java.is.for.desktop Avatar answered Oct 25 '22 08:10

java.is.for.desktop


if your problem is I/O Bound, maybe you can consider splitting your data into multiple files and putting it into a distributed filesystem such as Hadoop Filesystem (HDFS) and then run Map/Reduce operation on it?

like image 25
Lydon Ch Avatar answered Oct 25 '22 06:10

Lydon Ch