Reading a single file with multiple threads: should it speed up?

I'm reading a file which contains 500,000 rows. I'm testing to see how multiple threads speed up the process.

private void multiThreadRead(int num){

    for(int i=1; i<= num; i++) { 
        new Thread(readIndivColumn(i),""+i).start(); 
     } 
}

private Runnable readIndivColumn(final int colNum){
    return new Runnable(){
        @Override
        public void run() {
            // TODO Auto-generated method stub
            try {

                long startTime = System.currentTimeMillis();
                System.out.println("From Thread no:"+colNum+" Start time:"+startTime);

                // Open read-only; try-with-resources closes the file when the loop ends.
                try (RandomAccessFile raf = new RandomAccessFile("./src/test/test1.csv","r")) {
                    String line;
                    while((line = raf.readLine()) != null){
                        //System.out.println(StatUtils.getCellValue(line, colNum));
                    }
                }


                long elapsedTime = System.currentTimeMillis() - startTime;

                String formattedTime = String.format("%d min, %d sec",  
                        TimeUnit.MILLISECONDS.toMinutes(elapsedTime), 
                        TimeUnit.MILLISECONDS.toSeconds(elapsedTime) -  
                        TimeUnit.MINUTES.toSeconds(TimeUnit.MILLISECONDS.toMinutes(elapsedTime)) 
                    );

                System.out.println("From Thread no:"+colNum+" Finished Time:"+formattedTime);
            } 
            catch (Exception e) {
                // TODO Auto-generated catch block
                System.out.println("From Thread no:"+colNum +"===>"+e.getMessage());

                e.printStackTrace();
            }
        }
    };
}

private void sequentialRead(int num){
    try{
        long startTime = System.currentTimeMillis();
        System.out.println("Start time:"+startTime);

        for(int i =0; i < num; i++){
            // try-with-resources closes the file after each full pass.
            try (RandomAccessFile raf = new RandomAccessFile("./src/test/test1.csv","r")) {
                String line;
                while((line = raf.readLine()) != null){
                    //System.out.println(line);
                }
            }
        }

        long elapsedTime = System.currentTimeMillis() - startTime;

        String formattedTime = String.format("%d min, %d sec",  
                TimeUnit.MILLISECONDS.toMinutes(elapsedTime), 
                TimeUnit.MILLISECONDS.toSeconds(elapsedTime) -  
                TimeUnit.MINUTES.toSeconds(TimeUnit.MILLISECONDS.toMinutes(elapsedTime)) 
            );

        System.out.println("Finished Time:"+formattedTime);
    }
    catch (Exception e) {
        e.printStackTrace();
        // TODO: handle exception
    }

}
public TesterClass() {
    sequentialRead(1);
    this.multiThreadRead(1);
}

For num = 1 I get the following result:

Start time:1326224619049

Finished Time:2 min, 14 sec

Sequential read ENDS...........

Multi-Thread read starts:

From Thread no:1 Start time:1326224753606

From Thread no:1 Finished Time:2 min, 13 sec

Multi-Thread read ENDS.....

For num = 5 I get the following result:

    formatted Time:10 min, 20 sec

Sequential read ENDS...........

Multi-Thread read starts:

From Thread no:1 Start time:1326223509574
From Thread no:3 Start time:1326223509574
From Thread no:4 Start time:1326223509574
From Thread no:5 Start time:1326223509574
From Thread no:2 Start time:1326223509574
From Thread no:4 formatted Time:5 min, 54 sec
From Thread no:2 formatted Time:6 min, 0 sec
From Thread no:3 formatted Time:6 min, 7 sec
From Thread no:5 formatted Time:6 min, 23 sec
From Thread no:1 formatted Time:6 min, 23 sec
Multi-Thread read ENDS.....

My question is: shouldn't the multi-threaded read take approximately 2 min 13 sec, the same as a single read? Can you please explain why the multi-threaded solution takes so much longer?

Thanks in advance.

Hasan asked Jan 10 '12 20:01


3 Answers

You see a slowdown when reading in parallel because the head of a magnetic hard disk must seek to the next read position for each thread, and each seek takes about 5 ms. Reading with multiple threads therefore bounces the head back and forth between positions, which slows the whole read down. To read a file from a single disk, the recommended approach is to read it sequentially with one thread.
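A minimal sketch of the single-threaded sequential read, not part of the original answer. It assumes Java 7 or later and swaps the question's `RandomAccessFile` for a `BufferedReader`, because `RandomAccessFile.readLine` is unbuffered and reads the file byte by byte. The class name `SequentialReadSketch` and the helper `countLines` are hypothetical; the default path is the one from the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SequentialReadSketch {

    // Reads the file front to back on the calling thread and returns the row count.
    // Assumption: the file is UTF-8 text; adjust the charset if it is not.
    static long countLines(Path file) throws IOException {
        long rows = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                rows++;
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Path taken from the question; pass another path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "./src/test/test1.csv");
        long start = System.nanoTime();
        long rows = countLines(file);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Read " + rows + " rows in " + elapsedMs + " ms");
    }
}
```

Timing this against the question's `sequentialRead(1)` on the same file would show how much of the measured time comes from unbuffered reads rather than from the disk itself.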

Tudor answered Nov 10 '22 17:11


Since file reading is mainly waiting for disk I/O, you have the problem that the disk won't spin faster just because it's used by many threads :)

Joachim Isaksson answered Nov 10 '22 17:11


Reading from a file is an inherently serial process (assuming no caching), so there is a limit to how fast you can retrieve data from it. Even without file locks (i.e. with the file opened read-only), every thread after the first just blocks on the disk read. All the other threads wait, and whichever one is active when the data becomes available processes the next block.
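If the goal is to process each column on its own thread, as the question's `readIndivColumn` suggests, one way to avoid the contention described here is to read the file once and let the threads share the in-memory lines. The following is only a sketch under stated assumptions: Java 8 or later, a file small enough to hold in memory, comma-separated cells, and at least one column. The names `ReadOnceProcessInParallel`, `processColumns` and `countNonEmpty` are hypothetical, and counting non-empty cells stands in for whatever per-column work is actually needed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ReadOnceProcessInParallel {

    // Hypothetical per-column work: count the non-empty cells in one column.
    static long countNonEmpty(List<String> lines, int col) {
        long n = 0;
        for (String line : lines) {
            String[] cells = line.split(",", -1); // -1 keeps trailing empty cells
            if (col < cells.length && !cells[col].isEmpty()) {
                n++;
            }
        }
        return n;
    }

    // Reads the file once, then processes numCols columns in parallel (numCols must be > 0).
    static List<Long> processColumns(Path file, int numCols)
            throws IOException, InterruptedException, ExecutionException {
        // Single sequential pass over the disk; all later work is done in memory.
        final List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        ExecutorService pool = Executors.newFixedThreadPool(numCols);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int c = 0; c < numCols; c++) {
                final int col = c;
                Callable<Long> task = () -> countNonEmpty(lines, col);
                futures.add(pool.submit(task));
            }
            List<Long> results = new ArrayList<>();
            for (Future<Long> f : futures) {
                results.add(f.get()); // waits for that column's task to finish
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Path taken from the question; 5 columns mirrors the num = 5 run.
        System.out.println(processColumns(Paths.get("./src/test/test1.csv"), 5));
    }
}
```

With this layout the disk sees one sequential read, and the threads only compete for CPU while working on data that is already in memory.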

Kelly S. French answered Nov 10 '22 18:11