We have up to 30 GB of gzipped log files per day. Each file holds 100,000 lines and is between 6 MB and 8 MB when compressed. The simplified code below, with the parsing logic stripped out, uses a Parallel.ForEach loop.
The line-processing throughput peaks at a MaxDegreeOfParallelism of 8 on a two-NUMA-node, 32-logical-CPU box (Intel Xeon E7-2820 @ 2 GHz):
using System;
using System.Collections.Concurrent;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Threading.Tasks;

namespace ParallelLineCount
{
    public class ScriptMain
    {
        static void Main(String[] args)
        {
            int maxMaxDOP = (args.Length > 0) ? Convert.ToInt32(args[0]) : 2;
            string fileLocation = (args.Length > 1) ? args[1] : "C:\\Temp\\SomeFiles";
            string filePattern = (args.Length > 2) ? args[2] : "*2012-10-30.*.gz";
            string fileNamePrefix = (args.Length > 3) ? args[3] : "LineCounts";

            Console.WriteLine("Start: {0}", DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ"));
            Console.WriteLine("Processing file(s): {0}", filePattern);
            Console.WriteLine("Max MaxDOP to be used: {0}", maxMaxDOP.ToString());
            Console.WriteLine("");
            Console.WriteLine("MaxDOP,FilesProcessed,ProcessingTime[ms],BytesProcessed,LinesRead,SomeBookLines,LinesPer[ms],BytesPer[ms]");

            for (int maxDOP = 1; maxDOP <= maxMaxDOP; maxDOP++)
            {
                // Construct ConcurrentStacks for resulting strings and counters
                ConcurrentStack<Int64> TotalLines = new ConcurrentStack<Int64>();
                ConcurrentStack<Int64> TotalSomeBookLines = new ConcurrentStack<Int64>();
                ConcurrentStack<Int64> TotalLength = new ConcurrentStack<Int64>();
                ConcurrentStack<int> TotalFiles = new ConcurrentStack<int>();

                DateTime FullStartTime = DateTime.Now;
                string[] files = System.IO.Directory.GetFiles(fileLocation, filePattern);
                var options = new ParallelOptions() { MaxDegreeOfParallelism = maxDOP };

                // Overload used: Parallel.ForEach(IEnumerable<TSource> source, ParallelOptions parallelOptions, Action<TSource> body)
                Parallel.ForEach(files, options, currentFile =>
                {
                    string filename = System.IO.Path.GetFileName(currentFile);
                    DateTime fileStartTime = DateTime.Now;

                    using (FileStream inFile = File.Open(fileLocation + "\\" + filename, FileMode.Open))
                    {
                        Int64 lines = 0, someBookLines = 0, length = 0;
                        String line = "";

                        using (var reader = new StreamReader(new GZipStream(inFile, CompressionMode.Decompress)))
                        {
                            while (!reader.EndOfStream)
                            {
                                line = reader.ReadLine();
                                lines++;                                      // total lines
                                length += line.Length;                        // total line length
                                if (line.Contains("book")) someBookLines++;   // some special lines that need to be parsed later
                            }
                            TotalLines.Push(lines); TotalSomeBookLines.Push(someBookLines); TotalLength.Push(length);
                            TotalFiles.Push(1);                               // silly way to count processed files :)
                        }
                    }
                });

                TimeSpan runningTime = DateTime.Now - FullStartTime;

                // MaxDOP,FilesProcessed,ProcessingTime[ms],BytesProcessed,LinesRead,SomeBookLines,LinesPer[ms],BytesPer[ms]
                Console.WriteLine("{0},{1},{2},{3},{4},{5},{6},{7}",
                    maxDOP.ToString(),
                    TotalFiles.Sum().ToString(),
                    Convert.ToInt32(runningTime.TotalMilliseconds).ToString(),
                    TotalLength.Sum().ToString(),
                    TotalLines.Sum(),
                    TotalSomeBookLines.Sum().ToString(),
                    Convert.ToInt64(TotalLines.Sum() / runningTime.TotalMilliseconds).ToString(),
                    Convert.ToInt64(TotalLength.Sum() / runningTime.TotalMilliseconds).ToString());
            }

            Console.WriteLine();
            Console.WriteLine("Finish: " + DateTime.UtcNow.ToString("yyyy-MM-ddTHH:mm:ss.fffffffZ"));
        }
    }
}
Here's a summary of the results, with a clear peak at MaxDegreeOfParallelism = 8:
The CPU load (shown aggregated here; most of the load stayed on a single NUMA node, even when DOP was in the 20 to 30 range):
The only way I've found to make the CPU load cross the 95% mark was to split the files across 4 different folders and execute the same command 4 times, each one targeting a subset of all files.
Can someone find the bottleneck?
Gzip compresses by between 50% and 90% in most cases, which means that decompressing a buffer of compressed input returns anywhere between 16 KB and 80 KB at most.
pigz, which stands for parallel implementation of gzip, is a fully functional replacement for gzip that exploits multiple processors and multiple cores to the hilt when compressing data.
Gzip compression is a CPU-dependent process that has different compression levels. Higher compression levels result in smaller files but are more CPU-intensive. Developers can choose how much to compress – as well as what to compress – based on the needs of the site or application they are responsible for.
It's likely that one problem is the small buffer size used by the default FileStream constructor. I suggest you use a larger input buffer, such as:
using (FileStream infile = new FileStream(
name, FileMode.Open, FileAccess.Read, FileShare.None, 65536))
The default buffer size is 4 kilobytes, which means the thread makes many calls to the I/O subsystem to fill its buffer. A 64 KB buffer means those calls are made much less frequently.
I've found that a buffer size between 32 KB and 256 KB gives the best performance, with 64 KB being the "sweet spot" when I did some detailed testing a while back. A buffer larger than 256 KB actually begins to reduce performance.
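Applied to the code above, that might look something like this (a sketch only; the 64 KB figures are the suggested buffer sizes, and giving the StreamReader a larger buffer as well is my assumption rather than something you tested):
    // Sketch: 64 KB buffers on both the FileStream and the StreamReader.
    // FileShare.Read (instead of FileShare.None) is just a choice that lets
    // other readers open the file; pick whichever sharing mode you need.
    using (var inFile = new FileStream(currentFile, FileMode.Open,
                                       FileAccess.Read, FileShare.Read,
                                       bufferSize: 65536))
    using (var gzip = new GZipStream(inFile, CompressionMode.Decompress))
    using (var reader = new StreamReader(gzip, System.Text.Encoding.UTF8,
                                         detectEncodingFromByteOrderMarks: true,
                                         bufferSize: 65536))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // ... same counting logic as before ...
        }
    }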
Also, although this is unlikely to have a major effect on performance, you probably should replace those ConcurrentStack instances with 64-bit integers and use Interlocked.Add or Interlocked.Increment to update them. It simplifies your code and removes the need to manage the collections.
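A minimal sketch of that change, assuming the counters become plain long/int variables shared by the loop body (names are illustrative):
    // Sketch: shared 64-bit counters updated with Interlocked instead of
    // pushing onto ConcurrentStack<Int64>. Requires using System.Threading;
    long totalLines = 0, totalSomeBookLines = 0, totalLength = 0;
    int totalFiles = 0;

    Parallel.ForEach(files, options, currentFile =>
    {
        long lines = 0, someBookLines = 0, length = 0;
        // ... read and count the file exactly as before ...

        Interlocked.Add(ref totalLines, lines);
        Interlocked.Add(ref totalSomeBookLines, someBookLines);
        Interlocked.Add(ref totalLength, length);
        Interlocked.Increment(ref totalFiles);
    });

    // After the loop, the totals can be read directly; there is nothing to Sum().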
Update:
Re-reading your problem description, I was struck by this statement:
The only way I've found to make the CPU load cross the 95% mark was to split the files across 4 different folders and execute the same command 4 times, each one targeting a subset of all files.
That, to me, points to a bottleneck in opening files, as though the OS were using a mutual-exclusion lock on the directory. Even if all the data is in the cache and no physical I/O is required, processes still have to wait on that lock. It's also possible that the file system is writing to the disk; remember, it has to update the Last Access Time for a file whenever the file is opened.
If I/O really is the bottleneck, then you might consider having a single thread that does nothing but load files and stuff them into a BlockingCollection or similar data structure, so that the processing threads don't have to contend with each other for a lock on the directory. Your application becomes a producer/consumer application with one producer and N consumers.
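A rough sketch of that shape (the bounded capacity of 8 and the decision to hand consumers the raw compressed bytes of each file are my assumptions, not something from your code):
    // Sketch only: one producer opens and reads the (still compressed) files,
    // N consumers decompress and count from memory.
    // Requires: System, System.Collections.Concurrent, System.IO,
    // System.IO.Compression, System.Linq, System.Threading, System.Threading.Tasks.
    static long CountLinesProducerConsumer(string fileLocation, string filePattern)
    {
        var queue = new BlockingCollection<byte[]>(boundedCapacity: 8); // assumed capacity
        long totalLines = 0;

        // Single producer: the only thread that touches the directory and opens files.
        var producer = Task.Factory.StartNew(() =>
        {
            foreach (var path in Directory.GetFiles(fileLocation, filePattern))
                queue.Add(File.ReadAllBytes(path)); // sequential opens, no directory contention
            queue.CompleteAdding();
        });

        // N consumers: decompress and count entirely in memory.
        var consumers = Enumerable.Range(0, Environment.ProcessorCount)
            .Select(_ => Task.Factory.StartNew(() =>
            {
                foreach (var compressed in queue.GetConsumingEnumerable())
                {
                    long lines = 0;
                    using (var reader = new StreamReader(
                               new GZipStream(new MemoryStream(compressed),
                                              CompressionMode.Decompress)))
                    {
                        while (reader.ReadLine() != null) lines++;
                    }
                    Interlocked.Add(ref totalLines, lines);
                }
            }))
            .ToArray();

        Task.WaitAll(consumers);
        producer.Wait();
        return totalLines;
    }
With one producer, the directory (if that really is where the contention lies) is only ever touched by a single thread, and the bounded capacity keeps memory use to a handful of compressed files at a time.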