
How to correctly write to a file using Parallel.ForEach?

I have a task that reads a large file line by line, applies some logic to each line, and returns a string that I need to write to a file. The order of the output does not matter. However, when I try the code below, it stalls or becomes very slow after reading 15-20k lines of my file.

public static Object FileLock = new Object();
...
Parallel.ForEach(System.IO.File.ReadLines(inputFile), (line, _, lineNumber) =>
{
    var output = MyComplexMethodReturnsAString(line);
    lock (FileLock)
    {
        using (var file = System.IO.File.AppendText(outputFile))
        {
            file.WriteLine(output);
        }
    }
});

Why does my program slow down after running for a while? Is there a more correct way to perform this task?

justindao asked Feb 12 '16 22:02



2 Answers

You've essentially serialized your loop by having every thread contend for the lock and reopen the file for each line. Instead, compute the output lines in parallel and write them from a single place as the results become available.

var processedLines = File.ReadLines(inputFile).AsParallel()
    .Select(l => MyComplexMethodReturnsAString(l));
File.AppendAllLines(outputFile, processedLines);

If you need to flush the data as it comes, open a stream and enable auto flushing (or flush manually):

var processedLines = File.ReadLines(inputFile).AsParallel()
    .Select(l => MyComplexMethodReturnsAString(l));
using (var output = File.AppendText(outputFile))
{
    output.AutoFlush = true;
    foreach (var processedLine in processedLines)
        output.WriteLine(processedLine);
}
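One detail worth knowing if results really must reach the file as soon as they are computed: by default PLINQ may buffer a batch of results before handing them to the consuming foreach. The sketch below asks for unbuffered merging with WithMergeOptions. The class name PlinqStreamingSketch and the Transform method are hypothetical; Transform is only a stand-in for MyComplexMethodReturnsAString, whose body is not shown in the question.

```csharp
using System.IO;
using System.Linq;

static class PlinqStreamingSketch
{
    // Hypothetical stand-in for MyComplexMethodReturnsAString.
    static string Transform(string line) => line.ToUpperInvariant();

    public static void Run(string inputFile, string outputFile)
    {
        var processedLines = File.ReadLines(inputFile)
            .AsParallel()
            // Yield each result to the consumer as soon as it is ready
            // instead of letting PLINQ accumulate an output buffer first.
            .WithMergeOptions(ParallelMergeOptions.NotBuffered)
            .Select(Transform);

        // The file is opened once and written from this single thread.
        using (var output = File.AppendText(outputFile))
        {
            output.AutoFlush = true;
            foreach (var processedLine in processedLines)
                output.WriteLine(processedLine);
        }
    }
}
```

Unbuffered merging trades some throughput for lower latency, so it is probably only worth it when something is watching the output file while the job runs.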
Jeff Mercado answered Oct 21 '22 19:10


This has to do with how Parallel.ForEach's internal load balancer works. When it sees that your threads spend a lot of time blocking, it reasons that it can speed things up by throwing more threads at the problem, leading to higher parallel overheads, contention for your FileLock and overall performance degradation.

Why is this happening? Because Parallel.ForEach is not meant for IO work.

How can you fix this? Use Parallel.ForEach for CPU work only and perform all IO outside of the parallel loop.
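As a rough sketch of what "IO outside of the parallel loop" can look like while still using Parallel.ForEach: the loop body only computes and enqueues results, and one dedicated writer drains the queue into the file. The class name SingleWriterSketch and the Transform method are hypothetical placeholders (Transform stands in for MyComplexMethodReturnsAString), and the queue is left unbounded on the assumption that the results fit in memory.

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class SingleWriterSketch
{
    // Hypothetical stand-in for MyComplexMethodReturnsAString.
    static string Transform(string line) => line.Trim();

    public static void Run(string inputFile, string outputFile)
    {
        // Unbounded queue: Add never blocks, so the parallel loop body stays CPU-only.
        using (var results = new BlockingCollection<string>())
        {
            // Single consumer on its own thread; the only code that touches the output file.
            var writer = Task.Factory.StartNew(() =>
            {
                using (var output = File.AppendText(outputFile))
                {
                    foreach (var result in results.GetConsumingEnumerable())
                        output.WriteLine(result);
                }
            }, TaskCreationOptions.LongRunning);

            try
            {
                // No lock and no file access inside the loop.
                Parallel.ForEach(File.ReadLines(inputFile), line =>
                {
                    results.Add(Transform(line));
                });
            }
            finally
            {
                // Let the writer finish draining, then wait for it before disposing the queue.
                results.CompleteAdding();
                writer.Wait();
            }
        }
    }
}
```

The file is opened once rather than once per line, and the worker threads never wait on a lock, so the scheduler has less reason to inject extra threads.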

A quick workaround is to limit the number of threads Parallel.ForEach is allowed to enlist, by using the overload which accepts ParallelOptions, like so:

Parallel.ForEach(
    System.IO.File.ReadLines(inputFile),
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    (line, _, lineNumber) =>
    {
        ...
    });
Kirill Shlenskiy answered Oct 21 '22 19:10