I am currently working on a research project which involves indexing a large number of files (240k); they are mostly html, xml, doc, xls, zip, rar, pdf, and text with filesizes ranging from a few KB to more than 100 MB.
With all the zip and rar files extracted, I get a final total of one million files.
I am using Visual Studio 2010, C# and .NET 4.0 with support for TPL Dataflow and Async CTP V3. To extract the text from these files I use Apache Tika (converted with ikvm) and I use Lucene.net 2.9.4 as indexer. I would like the use the new TPL dataflow library and asynchronous programming.
I have a few questions:
Would I get performance benefits if I use TPL? It is mainly an I/O process and from what I understand, TPL doesn't offer much benefit when you heavily use I/O.
Would a producer/consumer approach be the best way to deal with this type of file processing or are there any other models that are better? I was thinking of creating one producer with multiple consumers using blockingcollections.
Would the TPL dataflow library be of any use for this type of process? It seems TPL Dataflow is best used in some sort of messaging system...
Should I use asynchronous programming or stick to synchronous in this case?
async/await definitely helps when dealing with external resources - typically web requests, file system or db operations. The interesting problem here is that you need to fulfill multiple requirements at the same time:
You may take a look at a small project I published on github:
Parallel tree walker
It is able to enumerate any number of files in a directory structure efficiently. You can define the async operation to perform on every file (in your case indexing it) while still controlling the maximum number of files that are processed at the same time.
For example:
await TreeWalker.WalkAsync(root, new TreeWalkerOptions
{
MaxDegreeOfParallelism = 10,
ProcessElementAsync = async (element) =>
{
var el = element as FileSystemElement;
var path = el.Path;
var isDirectory = el.IsDirectory;
await DoStuffAsync(el);
}
});
(if you cannot use the tool directly as a dll, you may still find some useful examples in the source code)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With