I have a task which reads a large file line by line, does some logic with each line, and returns a string that I need to write to a file. The order of the output does not matter. However, when I run the code below, it stops or gets very slow after reading 15-20k lines of my file.
public static Object FileLock = new Object();
...
Parallel.ForEach(System.IO.File.ReadLines(inputFile), (line, _, lineNumber) =>
{
    var output = MyComplexMethodReturnsAString(line);
    lock (FileLock)
    {
        using (var file = System.IO.File.AppendText(outputFile))
        {
            file.WriteLine(output);
        }
    }
});
Why does my program slow down after running for a while? Is there a more correct way to perform this task?
Parallel.ForEach is like the foreach loop in C#, except that the foreach loop runs on a single thread and processes items sequentially, while Parallel.ForEach runs on multiple threads and processes items in parallel.
Use Parallel.ForEach for the simplest use case, where you just need to perform an action for each item in a collection. Use the PLINQ methods when you need to do more, e.g. query the collection or stream the data.
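As a small illustration of the difference, a sequential foreach, a Parallel.ForEach, and a PLINQ query over the same data might look like the sketch below (the Square helper is illustrative, not from the question):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelDemo
{
    static int Square(int x) => x * x;

    static void Main()
    {
        var numbers = Enumerable.Range(1, 10).ToArray();

        // Sequential foreach: one thread, items processed in order.
        foreach (var n in numbers)
            Console.WriteLine(Square(n));

        // Parallel.ForEach: multiple threads, completion order not guaranteed.
        Parallel.ForEach(numbers, n => Console.WriteLine(Square(n)));

        // PLINQ: a parallel query whose results you can consume further.
        var squares = numbers.AsParallel().Select(Square).ToArray();
        Console.WriteLine(squares.Sum()); // 385
    }
}
```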
The short answer is no, you should not just use Parallel. ForEach or related constructs on each loop that you can. Parallel has some overhead, which is not justified in loops with few, fast iterations. Also, break is significantly more complex inside these loops.
Parallel.ForEach supports cancellation through the use of cancellation tokens. In a parallel loop, you supply the CancellationToken to the method in the ParallelOptions parameter and then enclose the parallel call in a try-catch block.
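A minimal sketch of that pattern (the workload and timings are illustrative): the token goes into ParallelOptions, and the loop throws OperationCanceledException once cancellation is observed.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class CancellationDemo
{
    static void Main()
    {
        var cts = new CancellationTokenSource();
        var options = new ParallelOptions { CancellationToken = cts.Token };

        // Request cancellation shortly after the loop starts.
        cts.CancelAfter(100);

        try
        {
            Parallel.ForEach(Enumerable.Range(0, 1_000_000), options, i =>
            {
                // Simulate work; the loop checks the token between iterations.
                Thread.SpinWait(10_000);
            });
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("Loop was canceled.");
        }
    }
}
```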
You've essentially serialized your query by having all threads contend to write to the file. Instead, compute everything that needs to be written in parallel, then write it out at the end.
var processedLines = File.ReadLines(inputFile).AsParallel()
    .Select(l => MyComplexMethodReturnsAString(l));
File.AppendAllLines(outputFile, processedLines);
If you need to flush the data as it comes, open a stream and enable auto flushing (or flush manually):
var processedLines = File.ReadLines(inputFile).AsParallel()
    .Select(l => MyComplexMethodReturnsAString(l));
using (var output = File.AppendText(outputFile))
{
    output.AutoFlush = true;
    foreach (var processedLine in processedLines)
        output.WriteLine(processedLine);
}
This has to do with how Parallel.ForEach's internal load balancer works. When it sees that your threads spend a lot of time blocking, it reasons that it can speed things up by throwing more threads at the problem, leading to higher parallel overheads, contention for your FileLock and overall performance degradation.
Why is this happening? Because Parallel.ForEach is not meant for IO work.
How can you fix this? Use Parallel.ForEach for CPU work only and perform all IO outside of the parallel loop.
A quick workaround is to limit the number of threads Parallel.ForEach is allowed to enlist, by using the overload which accepts ParallelOptions, like so:
Parallel.ForEach(
    System.IO.File.ReadLines(inputFile),
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    (line, _, lineNumber) =>
    {
        ...
    });
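Putting the advice above together, one way to keep all IO outside of the parallel loop is to buffer results in a thread-safe collection and write them in one pass afterwards. This is a sketch under assumptions: the file names are illustrative, and ToUpperInvariant stands in for the question's CPU-bound MyComplexMethodReturnsAString.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class SeparateCpuFromIo
{
    // Stand-in for the question's CPU-bound method (illustrative).
    static string MyComplexMethodReturnsAString(string line) => line.ToUpperInvariant();

    static void Main()
    {
        const string inputFile = "input.txt";
        const string outputFile = "output.txt";
        File.WriteAllLines(inputFile, new[] { "alpha", "beta", "gamma" }); // demo input

        // CPU-bound work runs in parallel; results buffer in a thread-safe bag.
        var results = new ConcurrentBag<string>();
        Parallel.ForEach(File.ReadLines(inputFile), line =>
            results.Add(MyComplexMethodReturnsAString(line)));

        // All IO happens once, on a single thread, outside the loop.
        File.WriteAllLines(outputFile, results);
        Console.WriteLine(File.ReadAllLines(outputFile).Length); // 3
    }
}
```

Since the order of the output does not matter here, a ConcurrentBag is sufficient; if order mattered, PLINQ with AsOrdered would be a better fit.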