
Dealing with a very large number of files

I am currently working on a research project that involves indexing a large number of files (240k). They are mostly html, xml, doc, xls, zip, rar, pdf, and text files, with sizes ranging from a few KB to more than 100 MB.

With all the zip and rar files extracted, I get a final total of one million files.

I am using Visual Studio 2010, C# and .NET 4.0 with support for TPL Dataflow and Async CTP V3. To extract the text from these files I use Apache Tika (converted with IKVM), and I use Lucene.Net 2.9.4 as the indexer. I would like to use the new TPL Dataflow library and asynchronous programming.

I have a few questions:

  1. Would I get performance benefits if I use TPL? It is mainly an I/O process and from what I understand, TPL doesn't offer much benefit when you heavily use I/O.

  2. Would a producer/consumer approach be the best way to deal with this type of file processing, or are there other models that are better? I was thinking of creating one producer with multiple consumers using BlockingCollection instances.

  3. Would the TPL dataflow library be of any use for this type of process? It seems TPL Dataflow is best used in some sort of messaging system...

  4. Should I use asynchronous programming or stick to synchronous in this case?

Martijn asked May 05 '12 14:05


1 Answer

async/await definitely helps when dealing with external resources - typically web requests, file system or db operations. The interesting problem here is that you need to fulfill multiple requirements at the same time:

  • consume as little CPU as possible (this is where async/await will help, because no thread sits blocked while an I/O operation is pending)
  • perform multiple operations at the same time, in parallel
  • control the number of tasks that are started (!) - if you do not take this into account, you will likely run out of threads when dealing with many files.
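
The third requirement, capping how many operations are in flight, can be sketched with a SemaphoreSlim. This is only a minimal illustration, not code from the project linked in this answer: the Throttled class and its ForEachAsync method are hypothetical names, and the sketch assumes .NET 4.5 or later (for SemaphoreSlim.WaitAsync) rather than .NET 4.0 with the Async CTP.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of any library mentioned here.
public static class Throttled
{
    // Runs processAsync for every item, allowing at most
    // maxConcurrency operations to be in flight at any time.
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> processAsync)
    {
        var gate = new SemaphoreSlim(maxConcurrency);
        var running = new List<Task>();

        foreach (var item in items)
        {
            // Wait (without blocking a thread) until a slot is free.
            await gate.WaitAsync();
            running.Add(RunOneAsync(item, processAsync, gate));
        }

        // Wait for the operations that are still in progress;
        // any exception thrown by processAsync surfaces here.
        await Task.WhenAll(running);
    }

    private static async Task RunOneAsync<T>(
        T item, Func<T, Task> processAsync, SemaphoreSlim gate)
    {
        try
        {
            await processAsync(item);
        }
        finally
        {
            // Free the slot even if processing failed.
            gate.Release();
        }
    }
}
```

A call might look like `await Throttled.ForEachAsync(Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories), 10, path => IndexFileAsync(path));`, where IndexFileAsync stands in for your own indexing routine. Note that the sketch keeps one Task object per item until the end, which may matter when the input is a million files.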

You may take a look at a small project I published on GitHub:

Parallel tree walker

It is able to enumerate any number of files in a directory structure efficiently. You can define the async operation to perform on every file (in your case indexing it) while still controlling the maximum number of files that are processed at the same time.

For example:

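await TreeWalker.WalkAsync(root, new TreeWalkerOptions
{
    // Upper bound on the number of elements processed at the same time.
    MaxDegreeOfParallelism = 10,
    ProcessElementAsync = async (element) =>
    {
        var el = element as FileSystemElement;
        var path = el.Path;
        var isDirectory = el.IsDirectory;

        // DoStuffAsync is a placeholder for your own per-file work (here: indexing).
        await DoStuffAsync(el);
    }
});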

(If you cannot use the tool directly as a DLL, you may still find some useful examples in the source code.)

Miklós Tóth answered Oct 21 '22 13:10