
Java Parallel File Processing

I have the following code:

import java.io.*;
import java.util.concurrent.* ;
public class Example{
public static void main(String args[]) {
    try {
        FileOutputStream fos = new FileOutputStream("1.dat");
        DataOutputStream dos = new DataOutputStream(fos);

        for (int i = 0; i < 200000; i++) {
            dos.writeInt(i);
        }
        dos.close(); // first of the two sample files is written

        FileOutputStream fos1 = new FileOutputStream("2.dat");
        DataOutputStream dos1 = new DataOutputStream(fos1);

        for (int i = 200000; i < 400000; i++) {
            dos1.writeInt(i);
        }
        dos1.close();

        Exampless.createArray(200000); //Create a shared array
        Exampless ex1 = new Exampless("1.dat");
        Exampless ex2 = new Exampless("2.dat");
        ExecutorService executor = Executors.newFixedThreadPool(2); // run in parallel to count the number of matches in the two files
        long startTime = System.nanoTime();
        long endTime;
        Future<Integer> future1 = executor.submit(ex1);
        Future<Integer> future2 = executor.submit(ex2);
        int count1 = future1.get();
        int count2 = future2.get();
        endTime = System.nanoTime();
        long duration = endTime - startTime;
        System.out.println("duration with threads:"+duration);
        executor.shutdown();
        System.out.println("Matches: " + (count1 + count2));

        startTime = System.nanoTime();
        ex1.call();
        ex2.call();
        endTime = System.nanoTime();
        duration = endTime - startTime;
        System.out.println("duration without threads:"+duration);

    } catch (Exception e) {
        System.err.println("Error: " + e.getMessage());
    }
}
}

class Exampless implements Callable<Integer> {

public static int[] arr = new int[20000];
public String _name;

public Exampless(String name) {
    this._name = name;
}

static void createArray(int z) {
    for (int i = z; i < z + 20000; i++) { //shared array
        arr[i - z] = i;
    }
}

public Integer call() {
    int cnt = 0;
    // try-with-resources closes the stream even if reading fails
    try (DataInputStream din = new DataInputStream(new FileInputStream(_name))) {
        // read the file and count the values that match the shared array
        for (int i = 0; i < 20000; i++) {
            int c = din.readInt();
            if (c == arr[i]) {
                cnt++;
            }
        }
        return cnt;
    } catch (Exception e) {
        System.err.println("Error: " + e.getMessage());
    }
    return -1;
}

}

I am trying to count the number of matches between an array and two files. Although I run it on two threads, the code does not perform well, because:

(time to read file 1, then file 2, on a single thread) < (time to read file 1 and file 2 in parallel on two threads).

Can anyone help me solve this? I have a 2-core CPU and the file size is approx. 1.5 GB.

Arpssss asked Jul 31 '12 16:07





1 Answer

In the first case you are reading one file sequentially, byte by byte, block by block. This is as fast as disk I/O can be, provided the file is not heavily fragmented. When you are done with the first file, the disk/OS finds the beginning of the second file and continues with the same efficient, linear read.

In the second case you are constantly switching between the first and the second file, forcing the disk to seek from one place to another. Each seek costs extra time (approximately 10 ms), and that overhead is the root of your confusion: it is why the two-thread version ends up slower.
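To make the seek overhead concrete, here is a rough back-of-the-envelope model, not a measurement. It assumes total time is transfer time plus a fixed cost per seek. The class name `SeekCostEstimate`, the 100 MB/s throughput, the reading of "approx. 1.5 GB" as the size of each file, and the count of 1,000 switches between files are all illustrative assumptions; only the ~10 ms seek time comes from the answer above.

```java
class SeekCostEstimate {

    /** Estimated seconds to read totalBytes: transfer time plus a fixed cost per seek. */
    static double estimateSeconds(long totalBytes, double bytesPerSecond,
                                  long seeks, double seekSeconds) {
        return totalBytes / bytesPerSecond + seeks * seekSeconds;
    }

    public static void main(String[] args) {
        long totalBytes = 2L * 1_500_000_000L; // ASSUMPTION: two files of ~1.5 GB each
        double throughput = 100e6;             // ASSUMPTION: 100 MB/s sequential read speed
        double seekTime = 0.010;               // ~10 ms per seek, as stated in the answer
        long switches = 1_000;                 // ASSUMPTION: how often the head jumps between files

        // Sequential: a single seek to reach the second file.
        double sequential = estimateSeconds(totalBytes, throughput, 1, seekTime);
        // Interleaved: one seek every time the disk switches files.
        double interleaved = estimateSeconds(totalBytes, throughput, switches, seekTime);

        System.out.printf("sequential:  %.1f s%n", sequential);
        System.out.printf("interleaved: %.1f s%n", interleaved);
    }
}
```

With these assumed numbers the transfer itself takes about 30 s and the 1,000 seeks add about 10 s on top. The real number of seeks depends on OS read-ahead and the disk's scheduler, so treat this only as an illustration of why the seek term matters.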

Also, disk access is effectively single-threaded and your task is I/O-bound, so splitting it across multiple threads cannot help as long as you are reading from the same physical disk. Your approach could only be justified if:

  • each thread, apart from reading from a file, was also performing some CPU-intensive or blocking operations, slower than the I/O by an order of magnitude.

  • the files are on different physical drives (a different partition is not enough) or on certain RAID configurations

  • you are using an SSD
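One way to combine the points above is to let a single thread do all the disk reads, one file after another, and hand only the CPU work to a thread pool. The following is a minimal sketch under that assumption, not a tuned implementation: the class `SingleReaderPipeline`, its helper methods and the 64 KB buffer size are hypothetical names and values chosen for illustration, and the demo uses tiny temporary files rather than the 1.5 GB files from the question.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

class SingleReaderPipeline {

    /** CPU-side work: count positions where the chunk equals the expected values. */
    static int countMatches(int[] chunk, int[] expected) {
        int n = Math.min(chunk.length, expected.length);
        int matches = 0;
        for (int i = 0; i < n; i++) {
            if (chunk[i] == expected[i]) matches++;
        }
        return matches;
    }

    /** Reads up to maxInts ints from the start of a file through a buffered stream. */
    static int[] readInts(Path file, int maxInts) throws IOException {
        int[] buf = new int[maxInts];
        int n = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file), 1 << 16))) {
            while (n < maxInts) {
                buf[n] = in.readInt();
                n++;
            }
        } catch (EOFException endOfFile) {
            // The file held fewer than maxInts values; keep what was read.
        }
        return Arrays.copyOf(buf, n);
    }

    /** The calling thread reads the files in order; the pool only does the comparisons. */
    static int countAll(List<Path> files, int[] expected, int workers)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (Path file : files) {
                int[] chunk = readInts(file, expected.length); // sequential disk access
                Callable<Integer> task = () -> countMatches(chunk, expected);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Demo helper: writes the ints from..to-1 to a file. */
    static void writeInts(Path file, int from, int to) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (int i = from; i < to; i++) out.writeInt(i);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical demo data: two small temp files standing in for 1.dat and 2.dat.
        Path a = Files.createTempFile("demo1", ".dat");
        Path b = Files.createTempFile("demo2", ".dat");
        writeInts(a, 0, 1000);
        writeInts(b, 1000, 2000);
        int[] expected = new int[1000];
        for (int i = 0; i < expected.length; i++) expected[i] = 1000 + i;
        System.out.println("Matches: " + countAll(Arrays.asList(a, b), expected, 2));
    }
}
```

For work as cheap as an integer comparison the pool adds nothing, which is the point of the answer: the threads only pay off once the per-chunk processing is much slower than the read itself. The buffered stream is a separate, independent improvement, since an unbuffered `DataInputStream.readInt()` on a `FileInputStream` issues very small reads.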

Tomasz Nurkiewicz answered Sep 18 '22 13:09
