Multi threaded file processing with .NET

There is a folder that contains thousands of small text files. I aim to parse and process all of them while more files are being added to the folder. I want to multithread this operation, as the single-threaded prototype took six minutes to process 1000 files.

I'd like to have reader and writer thread(s) as follows. While the reader thread(s) are reading the files, I'd like the writer thread(s) to process them. Once a reader has started reading a file, I'd like to mark it as being processed, for example by renaming it. Once it has been read, I'd rename it again to mark it as completed.

How do I approach such a multithreaded application?

Is it better to use a distributed hash table or a queue?

Which data structure do I use that would avoid locks?

Is there a better approach to this scheme?

asked May 11 '10 02:05

DarthVader

2 Answers

Since there's curiosity on how .NET 4 works with this in comments, here's that approach. Sorry, it's likely not an option for the OP. Disclaimer: This is not a highly scientific analysis, just showing that there's a clear performance benefit. Based on hardware, your mileage may vary widely.

Here's a quick test (it's just a simple example, so if you see a big mistake in it, please comment and we can fix it to be more useful/accurate). For this, I just dropped 12,000 ~60 KB files into a directory as a sample (fire up LINQPad; you can play with it yourself, for free! Be sure to get LINQPad 4 though):

// LINQPad "C# Statements" snippet. Outside LINQPad, add using System.Diagnostics, System.IO
// and System.Linq, and replace .Dump(...) with Console.WriteLine(...).
var files = Directory.GetFiles("C:\\temp", "*.*", SearchOption.AllDirectories).ToList();

var sw = Stopwatch.StartNew(); //start timer
files.ForEach(f => File.ReadAllBytes(f).GetHashCode()); //do work - serial
sw.Stop(); //stop
sw.ElapsedMilliseconds.Dump("Run MS - Serial"); //display the duration

sw.Restart(); //reset the timer and start it again
files.AsParallel().ForAll(f => File.ReadAllBytes(f).GetHashCode()); //do work - parallel
sw.Stop();
sw.ElapsedMilliseconds.Dump("Run MS - Parallel");

Slightly changing your loop to parallelize the query is all that's needed in most simple situations. By "simple" I mostly mean that the result of one action doesn't affect the next. Something to keep in mind is that some collections, for example our handy List<T>, are not thread safe, so writing to one from a parallel query isn't a good idea :) Luckily, thread-safe concurrent collections were added in .NET 4. Also keep in mind that if you're using a locking collection, this may be a bottleneck as well, depending on the situation.
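To illustrate the thread-safety point, here is a minimal sketch that collects per-file results from a parallel query into a ConcurrentDictionary instead of a List<T>. The MeasureFiles helper, the sample file names and the temp directory are assumptions made up for this example, and it is written as a modern C# top-level program rather than the LINQPad snippet above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (not from the answer): maps each file path to its size in bytes.
static ConcurrentDictionary<string, int> MeasureFiles(IEnumerable<string> files)
{
    var sizes = new ConcurrentDictionary<string, int>();
    // Several threads write here at once. ConcurrentDictionary tolerates that;
    // calling List<T>.Add from the same lambda would not be safe.
    files.AsParallel().ForAll(f => sizes[f] = File.ReadAllBytes(f).Length);
    return sizes;
}

// Build a throwaway directory with 50 small files so the sketch runs anywhere.
var dir = Directory.CreateDirectory(
    Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())).FullName;
for (int i = 0; i < 50; i++)
    File.WriteAllText(Path.Combine(dir, $"sample{i}.txt"), new string('x', i));

var result = MeasureFiles(Directory.GetFiles(dir));
Console.WriteLine($"Measured {result.Count} files");

Directory.Delete(dir, recursive: true); // clean up the sample files
```

Whether a concurrent collection or a per-thread buffer merged at the end is faster depends on how much contention there is, so it is worth measuring both.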

This uses the .AsParallel<T>(IEnumerable<T>) and .ForAll<T>(ParallelQuery<T>) extensions available in .NET 4.0. The .AsParallel() call wraps the IEnumerable<T> in a ParallelEnumerableWrapper<T> (internal class), which derives from ParallelQuery<T>. This now allows you to use the parallel extension methods; in this case we're using .ForAll().

.ForAll() internally creates a ForAllOperator<T>(query, action) and runs it synchronously. This handles the threading and the merging of the threads once it's running... There's quite a bit going on in there; I'd suggest starting here if you want to learn more, including additional options.
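One of those additional options is WithDegreeOfParallelism, which caps how many tasks PLINQ runs concurrently for a query. The sketch below is only an illustration: the counters, the cap of 2 and the Thread.Sleep stand-in for file work are assumptions, not part of the test above.

```csharp
using System;
using System.Linq;
using System.Threading;

// Track how many work items are running at the same moment.
int running = 0, peak = 0;
object gate = new object();

Enumerable.Range(0, 64)
    .AsParallel()
    .WithDegreeOfParallelism(2)   // assumption: allow at most two concurrent workers
    .ForAll(_ =>
    {
        lock (gate) { running++; peak = Math.Max(peak, running); }
        Thread.Sleep(5);          // stand-in for reading and hashing one file
        lock (gate) { running--; }
    });

Console.WriteLine($"Peak concurrency: {peak}");
```

On an I/O-bound workload, trying a few different caps and timing each is probably more informative than relying on the default.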

The results (Computer 1 - Physical Hard Disk):

Serial: 1288 - 1333 ms
Parallel: 461 - 503 ms

Computer specs - for comparison:

Quad Core i7 920 @ 2.66 GHz
12 GB RAM (DDR 1333)
300 GB 10k rpm WD VelociRaptor

The results (Computer 2 - Solid State Drive):

Serial: 545 - 601 ms
Parallel: 248 - 278 ms

Computer specifications - for comparison:

Core 2 Quad Q9100 @ 2.26 GHz
8 GB RAM (DDR 1333)
120 GB OCZ Vertex SSD (Standard Version - 1.4 Firmware)

I don't have links for the CPU/RAM this time; these came installed. This is a Dell M6400 laptop (here's a link to the M6500... Dell's own links to the 6400 are broken).

These numbers are from 10 runs, taking the min/max of the inner 8 results (removing the original min/max for each as possible outliers). We hit an I/O bottleneck here, especially on the physical drive, but think about what the serial method does. It reads, processes, reads, processes, rinse, repeat. With the parallel approach, you are (even with an I/O bottleneck) reading and processing simultaneously. In the worst bottleneck situation, you're processing one file while reading the next. That alone (on any current computer!) should result in some performance gain. You can see in the results above that we can get a bit more than one going at a time, giving us a healthy boost.
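The "inner 8 of 10 runs" reporting can be sketched roughly as follows. InnerRange is a hypothetical helper written for this illustration (it is not the code used to produce the numbers above), and Thread.Sleep stands in for the serial or parallel file loop.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;

// Hypothetical helper: drop the single fastest and slowest sample, then
// return the min and max of what remains. Needs at least three samples.
static (long Min, long Max) InnerRange(IEnumerable<long> samples)
{
    var sorted = samples.OrderBy(ms => ms).ToList();
    var inner = sorted.Skip(1).Take(sorted.Count - 2).ToList();
    return (inner.First(), inner.Last());
}

// Ten timed runs of a placeholder workload.
var timings = new List<long>();
for (int run = 0; run < 10; run++)
{
    var sw = Stopwatch.StartNew();
    Thread.Sleep(5);              // stand-in for the file-processing loop being measured
    sw.Stop();
    timings.Add(sw.ElapsedMilliseconds);
}

var range = InnerRange(timings);
Console.WriteLine($"Run MS: {range.Min} - {range.Max}");
```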

Another disclaimer: Quad core + .NET 4 parallel isn't going to give you four times the performance; it doesn't scale linearly... There are other considerations and bottlenecks in play.

I hope this was of interest in showing the approach and possible benefits. Feel free to criticize or improve... This answer exists solely for those curious as indicated in the comments :)

answered Oct 18 '22 13:10

Nick Craver

Design

The Producer/Consumer pattern will probably be the most useful for this situation. You should create enough threads to maximize the throughput.

Here are some questions about the Producer/Consumer pattern to give you an idea of how it works:

C# Producer/Consumer pattern
C# producer/consumer

You should use a blocking queue: the producer adds files to the queue while the consumers process the files from the queue. The blocking queue requires no explicit locking in your own code, so it's about the most efficient way to solve your problem.

If you're using .NET 4.0 there are several concurrent collections that you can use out of the box:

ConcurrentQueue: http://msdn.microsoft.com/en-us/library/dd267265%28v=VS.100%29.aspx
BlockingCollection: http://msdn.microsoft.com/en-us/library/dd267312%28VS.100%29.aspx
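As a rough sketch of the hand-off described above, the following uses BlockingCollection<string> with one producer thread and one consumer thread. The temp directory, the file contents, the bounded capacity of 8 and the character-counting "processing" are all assumptions chosen just to make the example runnable.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

// Throwaway input: 20 files containing "hello" (5 characters each).
var dir = Directory.CreateDirectory(
    Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())).FullName;
for (int i = 0; i < 20; i++)
    File.WriteAllText(Path.Combine(dir, $"in{i}.txt"), "hello");

// Bounded so the producer cannot run arbitrarily far ahead of the consumer.
var queue = new BlockingCollection<string>(boundedCapacity: 8);
long totalChars = 0;

var producer = new Thread(() =>
{
    foreach (var path in Directory.GetFiles(dir))
        queue.Add(File.ReadAllText(path)); // blocks while the queue is full
    queue.CompleteAdding();                // signals that no more items will arrive
});

var consumer = new Thread(() =>
{
    // Blocks while the queue is empty; the loop ends once adding is complete
    // and the queue has been drained.
    foreach (var text in queue.GetConsumingEnumerable())
        totalChars += text.Length;         // placeholder for real parsing
});

producer.Start();
consumer.Start();
producer.Join();
consumer.Join();

Directory.Delete(dir, recursive: true);
Console.WriteLine($"Processed {totalChars} characters");
```

Note that neither thread takes a lock explicitly; the collection does its own synchronization internally.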

Threading

A single producer thread will probably be the most efficient way to load the files from disk and push them onto the queue; multiple consumers then pop items off the queue and process them. I would suggest that you try 2-4 consumer threads per core and take some performance measurements to determine which is optimal (i.e. the number of threads that gives you the maximum throughput). I would not recommend the use of a ThreadPool for this specific example.
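One way to take those measurements is a small harness like the one below: a single producer, a configurable number of dedicated (non-pool) consumer threads, and a stopwatch around each run. RunPipeline, the item counts and the Thread.Sleep stand-in for parsing are hypothetical; real numbers will depend on your files and hardware.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Linq;
using System.Threading;

// Hypothetical harness: pushes itemCount fake work items through a queue that
// is drained by consumerCount dedicated threads; returns how many were processed.
static int RunPipeline(int itemCount, int consumerCount)
{
    var queue = new BlockingCollection<int>();
    int processed = 0;

    var consumers = Enumerable.Range(0, consumerCount)
        .Select(_ => new Thread(() =>
        {
            foreach (var item in queue.GetConsumingEnumerable())
            {
                Thread.Sleep(1);                      // stand-in for parsing one file
                Interlocked.Increment(ref processed); // counter shared by all consumers
            }
        }))
        .ToList();
    consumers.ForEach(t => t.Start());

    // Single producer: here simply the calling thread.
    for (int i = 0; i < itemCount; i++)
        queue.Add(i);
    queue.CompleteAdding();

    consumers.ForEach(t => t.Join());
    return processed;
}

// Try a few consumer counts and compare the timings.
foreach (var n in new[] { 1, 2, 4, 8 })
{
    var sw = Stopwatch.StartNew();
    int done = RunPipeline(100, n);
    Console.WriteLine($"{n} consumer(s): {done} items in {sw.ElapsedMilliseconds} ms");
}
```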

P.S. I don't understand what the concern is with a single point of failure and the use of distributed hash tables. I know DHTs sound like a really cool thing to use, but I would try the conventional methods first unless you have a specific problem in mind that you're trying to solve.

answered Oct 18 '22 13:10

Kiril

Tags: c#, .net, architecture, multithreading