Concurrent reading of a File (java preferred)

I have a large file that takes multiple hours to process, so I am thinking of estimating chunk boundaries and reading the chunks in parallel.

Is it possible to do concurrent reads on a single file? I have looked at both RandomAccessFile and nio.FileChannel, but based on other posts I am not sure whether this approach would work.

asked Aug 08 '12 by user1132593

3 Answers

The most important question here is what the bottleneck is in your case.

If the bottleneck is your disk IO, then there isn't much you can do in software. Parallelizing the computation will only make things worse, because reading different parts of the file simultaneously forces the disk to seek back and forth, degrading performance.

If the bottleneck is processing power, and you have multiple CPU cores, then you can take advantage of this by starting multiple threads to work on different parts of the file. You can safely create several InputStreams or Readers to read different parts of the file in parallel (as long as you don't go over your operating system's limit for the number of open files). You could separate the work into tasks and run them in parallel, like in this example:

import java.io.*;
import java.util.*;
import java.util.concurrent.*;

public class Split {
    private final File file;

    public Split(File file) {
        this.file = file;
    }

    // Processes the given portion of the file.
    // Called simultaneously from several threads.
    // Use your custom return type as needed; String is just an example.
    public String processPart(long start, long end)
        throws Exception
    {
        try (InputStream is = new FileInputStream(file)) {
            // skip() may skip fewer bytes than requested, so loop
            // until we have actually reached the start of our chunk
            long skipped = 0;
            while (skipped < start) {
                long n = is.skip(start - skipped);
                if (n <= 0)
                    throw new IOException("Could not skip to offset " + start);
                skipped += n;
            }
            // do a computation using the input stream,
            // checking that we don't read more than (end-start) bytes
            System.out.println("Computing the part from " + start + " to " + end);
            Thread.sleep(1000);
            System.out.println("Finished the part from " + start + " to " + end);
            return "Some result";
        }
    }

    // Creates a task that will process the given portion of the file
    // when executed.
    public Callable<String> processPartTask(final long start, final long end) {
        return new Callable<String>() {
            public String call()
                throws Exception
            {
                return processPart(start, end);
            }
        };
    }

    // Splits the computation into chunks of the given size,
    // creates the corresponding tasks and runs them using a
    // given number of threads.
    public void processAll(int noOfThreads, int chunkSize)
        throws Exception
    {
        int count = (int) ((file.length() + chunkSize - 1) / chunkSize);
        List<Callable<String>> tasks = new ArrayList<Callable<String>>(count);
        // cast to long before multiplying so the offsets don't overflow int
        for (int i = 0; i < count; i++)
            tasks.add(processPartTask((long) i * chunkSize,
                    Math.min(file.length(), (long) (i + 1) * chunkSize)));
        ExecutorService es = Executors.newFixedThreadPool(noOfThreads);

        // invokeAll blocks until every task has completed
        List<Future<String>> results = es.invokeAll(tasks);
        es.shutdown();

        // use the results for something
        for (Future<String> result : results)
            System.out.println(result.get());
    }

    public static void main(String[] argv)
        throws Exception
    {
        Split s = new Split(new File(argv[0]));
        s.processAll(8, 1000);
    }
}
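
Since the question also mentions nio.FileChannel: a FileChannel is documented as safe for use by multiple concurrent threads, and its positional read(ByteBuffer, long) variant neither uses nor modifies the channel's own position, so the worker threads could share a single channel instead of each opening an InputStream and skipping. A minimal sketch of reading one chunk that way (the chunk bounds here are placeholders):

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class PositionalRead {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]),
                StandardOpenOption.READ)) {
            long start = 0;                         // example chunk bounds
            long end = Math.min(ch.size(), 1000);
            ByteBuffer buf = ByteBuffer.allocate((int) (end - start));
            // read(dst, position) is independent of the channel's current
            // position, so many threads can issue such reads on one channel
            while (buf.hasRemaining()) {
                int n = ch.read(buf, start + buf.position());
                if (n < 0)
                    break;                          // hit end of file
            }
            buf.flip();
            // buf now holds the chunk; process it here
        }
    }
}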
answered Oct 16 '22 by Petr


You can parallelise reading a large file provided you have multiple independent spindles. E.g. if you have a RAID 0+1 striped file system, you can see a performance improvement by triggering multiple concurrent reads of the same file.

If, however, you have a combined file system like RAID 5 or 6, or a plain single disk, it is highly likely that reading the file sequentially is the fastest way to read from that disk. Note: the OS is smart enough to pre-fetch reads when it sees you are reading sequentially, so using an additional thread to do this is unlikely to help.

i.e. using multiple threads will not make your disk any faster.

If you want to read from disk faster, use a faster drive. A typical SATA HDD can read at about 60 MB/s and perform 120 IOPS. A typical SATA SSD can read at about 400 MB/s and perform 80,000 IOPS, and a typical PCIe SSD can read at 900 MB/s and perform 230,000 IOPS.
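
Whichever drive you have, a quick way to tell whether disk IO or processing is the bottleneck is to time a plain sequential read and compare that rate with your overall processing rate. A rough benchmark sketch (the 1 MB buffer size is an arbitrary choice):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadThroughput {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[1 << 20]; // 1 MB read buffer
        long bytes = 0;
        long t0 = System.nanoTime();
        try (InputStream in = new FileInputStream(args[0])) {
            int n;
            while ((n = in.read(buf)) > 0)
                bytes += n;
        }
        double seconds = (System.nanoTime() - t0) / 1e9;
        // if this rate is close to your overall processing rate,
        // the disk is the bottleneck and more threads won't help
        System.out.printf("Read %d bytes at %.1f MB/s%n",
                bytes, bytes / seconds / 1e6);
    }
}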

answered Oct 16 '22 by Peter Lawrey


If you're reading a file from a hard drive, then the fastest way to get the data is to read the file from start to end, that is, not concurrently.

Now if it's the processing that takes time, then that might benefit from having several threads process different chunks of data concurrently, but that has nothing to do with how you read the file.
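
One pattern that follows from this is a single reader thread streaming the file sequentially while handing chunks to a worker pool, so the disk access stays sequential and the processing runs in parallel. A minimal sketch, with the chunk size and the per-chunk computation as placeholders:

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.*;

public class ReadThenProcess {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Integer>> results = new ArrayList<Future<Integer>>();
        try (InputStream in = new FileInputStream(args[0])) {
            byte[] buf = new byte[1 << 20]; // 1 MB chunks, arbitrary
            int n;
            // the single reader keeps disk access sequential...
            while ((n = in.read(buf)) > 0) {
                final byte[] chunk = Arrays.copyOf(buf, n);
                // ...while the CPU-bound work runs concurrently
                results.add(pool.submit(new Callable<Integer>() {
                    public Integer call() {
                        int sum = 0; // placeholder computation
                        for (byte b : chunk)
                            sum += b;
                        return sum;
                    }
                }));
            }
        }
        long total = 0;
        for (Future<Integer> f : results)
            total += f.get();
        pool.shutdown();
        System.out.println("total = " + total);
    }
}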

answered Oct 16 '22 by Buhb