I have coded a very simple "Word Count" program that reads a file and counts each word's occurrence in the file. Here is a part of the code:
class Alaki
{
    private static List<string> input = new List<string>();

    private static void exec(int threadcount)
    {
        ParallelOptions options = new ParallelOptions();
        options.MaxDegreeOfParallelism = threadcount;
        Parallel.ForEach(Partitioner.Create(0, input.Count), options, (range) =>
        {
            var dic = new Dictionary<string, List<int>>();
            for (int i = range.Item1; i < range.Item2; i++)
            {
                // make some delay!
                //for (int x = 0; x < 400000; x++) ;

                var tokens = input[i].Split();
                foreach (var token in tokens)
                {
                    if (!dic.ContainsKey(token))
                        dic[token] = new List<int>();
                    dic[token].Add(1);
                }
            }
        });
    }

    public static void Main(String[] args)
    {
        StreamReader reader = new StreamReader(@"c:\txt-set\agg.txt");
        while (true)
        {
            var line = reader.ReadLine();
            if (line == null)
                break;
            input.Add(line);
        }

        DateTime t0 = DateTime.Now;
        exec(Environment.ProcessorCount);
        Console.WriteLine("Parallel: " + (DateTime.Now - t0));

        t0 = DateTime.Now;
        exec(1);
        Console.WriteLine("Serial: " + (DateTime.Now - t0));
    }
}
It is simple and straightforward. I use a dictionary to count each word's occurrences. The style is roughly based on the MapReduce programming model. As you can see, each task uses its own private dictionary, so there are NO shared variables; just a bunch of tasks that count words by themselves. Here is the output when the code is run on a quad-core i7 CPU:
Parallel: 00:00:01.6220927
Serial: 00:00:02.0471171
The speedup is about 1.25, which is a tragedy! But when I add some delay to the processing of each line, I can reach speedup values of about 4.
In the original parallel execution with no delay, CPU utilization hardly reaches 30%, and therefore the speedup is not promising. But when we add some delay, CPU utilization reaches 97%.
At first I thought the cause was the IO-bound nature of the program (though inserting into a dictionary is, to some extent, CPU-intensive), and that seemed logical because all of the threads are reading data over a shared memory bus. However, the surprising point is this: when I run 4 instances of the serial program (with no delays) simultaneously, CPU utilization rises to nearly 100% and all four instances finish in about 2.3 seconds!
This means that when the code is run in a multiprocessing configuration it reaches a speedup of about 3.5, but when it is run in a multithreading configuration the speedup is only about 1.25.
What is your idea? Is there anything wrong with my code? I think there is no shared data at all, so the code should not experience any contention. Is there a flaw in .NET's runtime?
Thanks in advance.
Since the work in your parallel function is very small, the overhead of managing the parallelism becomes significant, slowing down the overall work.
Parallel.For doesn't divide the input into n pieces (where n is the MaxDegreeOfParallelism); instead it creates many small batches and makes sure that at most n are being processed concurrently. (This is so that if one batch takes a very long time to process, Parallel.For can still be running work on other threads. See Parallelism in .NET - Part 5, Partitioning of Work for more details.)
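You can see this batching directly with a small sketch (the exact range sizes are an implementation detail that varies with the input length and machine, so no particular output is guaranteed):

```csharp
using System;
using System.Collections.Concurrent;

class PartitionDemo
{
    static void Main()
    {
        // Partitioner.Create(0, 1000) yields many small
        // (fromInclusive, toExclusive) ranges rather than one
        // contiguous chunk per worker thread.
        var partitioner = Partitioner.Create(0, 1000);
        foreach (var range in partitioner.GetDynamicPartitions())
            Console.WriteLine("[" + range.Item1 + ", " + range.Item2 + ")");
    }
}
```

Each of those ranges corresponds to one invocation of your lambda, and therefore one fresh Dictionary in your code.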
Due to this design, your code is creating and throwing away dozens of Dictionary objects, hundreds of List objects, and thousands of String objects. This is putting enormous pressure on the garbage collector.
Running PerfMonitor on my computer reports that 43% of the total run time is spent in GC. If you rewrite your code to use fewer temporary objects, you should see the desired 4x speedup. Some excerpts from the PerfMonitor report follow:
Over 10% of the total CPU time was spent in the garbage collector. Most well tuned applications are in the 0-10% range. This is typically caused by an allocation pattern that allows objects to live just long enough to require an expensive Gen 2 collection.
This program had a peak GC heap allocation rate of over 10 MB/sec. This is quite high. It is not uncommon that this is simply a performance bug.
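One way to cut the allocations is a sketch along these lines (not your original code: it swaps the per-token List<int> of 1s for a plain int count, and uses the localInit/localFinally overload of Parallel.ForEach so each worker reuses one dictionary across all of its batches instead of allocating one per batch):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class WordCount
{
    public static Dictionary<string, int> Count(IList<string> lines)
    {
        var totals = new ConcurrentDictionary<string, int>();
        Parallel.ForEach(
            Partitioner.Create(0, lines.Count),
            () => new Dictionary<string, int>(),  // localInit: one dictionary per worker
            (range, state, dic) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                    foreach (var token in lines[i].Split())
                    {
                        int n;
                        dic.TryGetValue(token, out n);
                        dic[token] = n + 1;       // count in place; no List<int> per token
                    }
                return dic;                        // same dictionary flows to the next batch
            },
            dic =>                                 // localFinally: merge once per worker
            {
                foreach (var kv in dic)
                    totals.AddOrUpdate(kv.Key, kv.Value, (_, n) => n + kv.Value);
            });
        return new Dictionary<string, int>(totals);
    }

    static void Main()
    {
        var counts = Count(new[] { "the quick fox", "the fox" });
        Console.WriteLine(counts["the"]); // prints 2
    }
}
```

The point of the design is that the number of dictionaries allocated is bounded by the number of worker threads, not by the number of batches, which is what takes the pressure off the garbage collector.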
Edit: As per your comment, I will attempt to explain the timings you reported. On my computer, with PerfMonitor, I measured between 43% and 52% of time spent in GC. For simplicity, let's assume that 50% of the CPU time is work, and 50% is GC. Thus, if we make the work 4× faster (through multi-threading) but keep the amount of GC the same (this will happen because the number of batches being processed happened to be the same in the parallel and serial configurations), the best improvement we could get is 62.5% of the original time, or 1.6×.
However, we only see a 1.25× speedup because GC isn't multithreaded by default (in workstation GC). As per Fundamentals of Garbage Collection, all managed threads are paused during a Gen 0 or Gen 1 collection. (Concurrent and background GC, in .NET 4 and .NET 4.5, can collect Gen 2 on a background thread.) Your program experiences only a 1.25× speedup (and you see 30% CPU usage overall) because the threads spend most of their time being paused for GC (because the memory allocation pattern of this test program is very poor).
If you enable server GC, it will perform garbage collection on multiple threads. If I do this, the program runs 2× faster (with almost 100% CPU usage).
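For a classic .NET Framework executable, server GC is opted into via the application's App.config (the <gcServer> element is the standard switch in the .NET runtime configuration schema):

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <runtime>
    <!-- Server GC: one GC heap per core, collections run on multiple threads -->
    <gcServer enabled="true"/>
  </runtime>
</configuration>
```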
When you run four instances of the program simultaneously, each has its own managed heap, and the garbage collection for the four processes can execute in parallel. This is why you see 100% CPU usage (each process is using 100% of one CPU). The slightly longer overall time (2.3s for all vs 2.05s for one) is possibly due to inaccuracies in measurement, contention for the disk, time taken to load the file, having to initialise the threadpool, overhead of context switching, or some other environment factor.