
Reading and writing multiple files in parallel

I need to write a program in Java that reads a relatively large number of files (~50,000) in a directory tree, processes the data, and writes the processed data to a separate (flat) directory.

Currently I have something like this:

private void crawlDirectoryAndProcessFiles(File directory) {
  for (File file : directory.listFiles()) {
    if (file.isDirectory()) {
      crawlDirectoryAndProcessFiles(file);
    } else { 
      Data d = readFile(file);
      ProcessedData p = d.process();
      writeFile(p,file.getAbsolutePath(),outputDir);
    }
  }
}

Suffice it to say that each of those methods has been trimmed down for ease of reading, but they all work fine. The whole process works, except that it is slow. The data is processed by a remote service, which takes between 5 and 15 seconds per file. Multiply that by 50,000...

I've never done anything multi-threaded before, but I figure I can get a pretty good speed-up if I do. Can anyone give me some pointers on how to parallelise this method effectively?

Trasvi asked Jan 05 '12 04:01





1 Answer

I would use a ThreadPoolExecutor to manage the threads. You can do something like this:

private class Processor implements Runnable {
    private final File file;

    public Processor(File file) {
        this.file = file;
    }

    @Override
    public void run() {
        Data d = readFile(file);
        ProcessedData p = d.process();
        writeFile(p,file.getAbsolutePath(),outputDir);
    }
}

private void crawlDirectoryAndProcessFiles(File directory, Executor executor) {
    for (File file : directory.listFiles()) {
        if (file.isDirectory()) {
            crawlDirectoryAndProcessFiles(file, executor);
        } else {
            executor.execute(new Processor(file));
        }
    }
}

You would obtain an Executor using:

ExecutorService executor = Executors.newFixedThreadPool(poolSize);

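where poolSize is the maximum number of threads you want running at once. (It's important to pick a reasonable number here; 50,000 threads isn't a good idea. A reasonable number might be 8.) Note that after you've queued all the files, your main thread can wait until everything is done by calling executor.shutdown() followed by executor.awaitTermination. The shutdown call is needed first: awaitTermination only returns before its timeout once a shutdown has been requested and all queued tasks have finished.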
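To tie the pieces together, here is a minimal end-to-end sketch of the same idea, not a drop-in replacement for the code in the question. The class name ParallelFileProcessor, the method processTree, and the process stub (which just upper-cases the text in place of the remote service call) are all made up for illustration. It also assumes file names are unique across the tree, since the output directory is flat; real code would need a naming scheme to avoid collisions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelFileProcessor {

    // Hypothetical stand-in for the slow remote processing step.
    static String process(String input) {
        return input.toUpperCase();
    }

    // Walks inputRoot, processes every regular file on a fixed-size pool,
    // and writes each result into the flat outputDir.
    // Returns the number of files that were queued.
    static int processTree(Path inputRoot, Path outputDir, int poolSize)
            throws IOException, InterruptedException {
        Files.createDirectories(outputDir);

        // Collect the file list first so the directory stream is closed
        // before any worker starts writing output.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(inputRoot)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        for (Path file : files) {
            executor.execute(() -> {
                try {
                    String data = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                    // Assumption: file names are unique, so a flat output is safe.
                    Path target = outputDir.resolve(file.getFileName());
                    Files.write(target, process(data).getBytes(StandardCharsets.UTF_8));
                } catch (IOException e) {
                    // One failed file should not stop the rest of the batch.
                    System.err.println("Failed to process " + file + ": " + e);
                }
            });
        }

        // Stop accepting new tasks, then wait for the queued ones to finish.
        executor.shutdown();
        if (!executor.awaitTermination(1, TimeUnit.HOURS)) {
            executor.shutdownNow(); // give up after the (arbitrary) timeout
        }
        return files.size();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ParallelFileProcessor <inputDir> <outputDir>");
            return;
        }
        int count = processTree(Paths.get(args[0]), Paths.get(args[1]), 8);
        System.out.println("Queued " + count + " files");
    }
}
```

Because each task here spends most of its time waiting (on a remote service, in the question's case), the pool size is worth measuring rather than guessing; 8 is only the starting point suggested above.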

Ted Hopp answered Oct 04 '22 03:10
