I was playing with C# and wanted to speed up a program. I made changes and was able to do so. However, I need help understanding why the change made it faster.
I've attempted to reduce the code to something easier to understand in a question. Score1 and Report1 is the slower way. Score2 and Report2 is the faster way. The first method first stores a string and an int in a struct in parallel. Next, in a serial loop, it loops through an array of those structs and writes their data to a buffer. The second method first writes the data to a string buffer in parallel. Next, in a serial loop, it writes the string data to a buffer. Here are some sample run times:
Run 1 Total Average Time = 0.492087 sec Run 2 Total Average Time = 0.273619 sec
When I was working with an earlier non-parallel version of this, the times were almost the same. Why the difference with the parallel version?
Even if I reduce the loop in Report1 to write a single line of output to the buffer it is still slower (total time about .42 sec).
Here is the simplified code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Threading.Tasks;
using System.IO;
namespace OptimizationQuestion
{
class Program
{
struct ValidWord
{
public string word;
public int score;
}
ValidWord[] valid;
StringBuilder output;
int total;
public void Score1(string[] words)
{
valid = new ValidWord[words.Length];
for (int i = 0; i < words.Length; i++)
{
StringBuilder builder = new StringBuilder();
foreach (char c in words[i])
{
if (c != 'U')
builder.Append(c);
}
if (words[i].Length == 3)
{
valid[i] = new ValidWord
{ word = builder.ToString(), score = words[i].Length };
}
}
}
public void Report1(StringBuilder outputBuffer)
{
int total = 0;
foreach (ValidWord wordInfo in valid)
{
if (wordInfo.score > 0)
{
outputBuffer.AppendLine(String.Format("{0} {1}", wordInfo.word.ToString(), wordInfo.score));
total += wordInfo.score;
}
}
outputBuffer.AppendLine(string.Format("Total = {0}", total));
}
public void Score2(string[] words)
{
output = new StringBuilder();
total = 0;
for (int i = 0; i < words.Length; i++)
{
StringBuilder builder = new StringBuilder();
foreach (char c in words[i])
{
if (c != 'U')
builder.Append(c);
}
if (words[i].Length == 3)
{
output.AppendLine(String.Format("{0} {1}", builder.ToString(), words[i].Length));
total += words[i].Length;
}
}
}
public void Report2(StringBuilder outputBuffer)
{
outputBuffer.Append(output.ToString());
outputBuffer.AppendLine(string.Format("Total = {0}", total));
}
static void Main(string[] args)
{
Program[] program = new Program[100];
for (int i = 0; i < program.Length; i++)
program[i] = new Program();
string[] words = File.ReadAllLines("words.txt");
Stopwatch stopwatch = new Stopwatch();
const int TIMING_REPETITIONS = 20;
double averageTime1 = 0.0;
StringBuilder output = new StringBuilder();
for (int i = 0; i < TIMING_REPETITIONS; ++i)
{
stopwatch.Reset();
stopwatch.Start();
output.Clear();
Parallel.ForEach<Program>(program, p =>
{
p.Score1(words);
});
for (int k = 0; k < program.Length; k++)
program[k].Report1(output);
stopwatch.Stop();
averageTime1 += stopwatch.Elapsed.TotalSeconds;
GC.Collect();
}
averageTime1 /= (double)TIMING_REPETITIONS;
Console.WriteLine(string.Format("Run 1 Total Average Time = {0:0.000000} sec", averageTime1));
double averageTime2 = 0.0;
for (int i = 0; i < TIMING_REPETITIONS; ++i)
{
stopwatch.Reset();
stopwatch.Start();
output.Clear();
Parallel.ForEach<Program>(program, p =>
{
p.Score2(words);
});
for (int k = 0; k < program.Length; k++)
program[k].Report2(output);
stopwatch.Stop();
averageTime2 += stopwatch.Elapsed.TotalSeconds;
GC.Collect();
}
averageTime2 /= (double)TIMING_REPETITIONS;
Console.WriteLine(string.Format("Run 2 Total Average Time = {0:0.000000} sec", averageTime2));
Console.ReadLine();
}
}
}
While C is one of the more difficult languages to learn, it's still an excellent first language pick up because almost all programming languages are implemented in it. This means that once you learn C, it'll be simple to learn more languages like C++ and C#.
C is what is called a compiled language. This means that once you write your C program, you must run it through a C compiler to turn your program into an executable that the computer can run (execute).
First of all, you are parallelizing the repeated runs. This will improve your benchmark time, but may not help out your real production run time very much. To accurately measure how long it will take to actually run through one word list, you need to have exactly one word list going at a time. Otherwise, the individual threads processing all the lists are competing with each other to some extent for system resources and the time per list suffers, even if the time to do all the lists in total is faster.
To speed up the time to process one word list, you want to processes the individual words in the list in parallel, for exactly one list at a time. To get enough definition/size for a good measurement, either make the list very long or process the list many times in serial.
In your case, this gets a bit tricky because the stringbuilder needed for your final product is not documented as being thread-safe. It's not that bad, though. Here's an example of calling parallel foreach for a single word list:
var locker = new Object(); //I'd actually make this static, but it should end up as a closure and so still work
var OutputBuffer = new StringBuilder(); // you can improve things futher if you can make a good estimate for the final size and force it allocate all the memory it will need up front
int score = 0;
Parallel.ForEach(words, w =>
{
// We want to push as much of the work to the individual threads as possible.
// If run in 1 thread, a stringbuilder per word would be bad.
// Run in parallel, it allows us to do a little more of the work outside of locked code.
var buf = new StringBuilder(w.Length + 5);
string word = buf.Append(w.Where(c=>c!='U').Concat(' ').ToArray()).Append(w.Length).ToString();
lock(locker)
{
OutputBuffer.Append(word);
score += w.Length;
}
});
OutputBuffer.Append("Total = ").Append(score);
Just call that 20 times in a normal sequentially processed for loop. Again, it might finish the benchmarks a little slower, but I think it will perform real-world a little faster because of a flaw in your benchmark. Also note that I typed this right into the reply window — I've never event tried to compile it, and so it's not likely to be perfect right out of the gate.
After fixing your benchmark to more accurately reflect how parallel code will impact your real-world processing time, the next step is to do some profiling to see where your program is actually spending it's time. This is how you know what areas to look at for improvement.
Out of curiosity, I'd also like to know how this version performs:
var agg = new {score = 0, OutputBuffer = new StringBuilder()};
agg = words.Where(w => w.Length == 3)
.Select(w => new string(w.Where(c => c!='U').ToArray())
.Aggregate(agg, (a, w) => {a.OutputBuffer.AppendFormat("{0} {1}\n", w, w.Length); score += w.Length;});
agg.OutputBuffer.Append("Total = ").Append(score);
The size of a struct should typically be less than that of a pointer (if performance is the primary issue. Microsoft says that anything less than 16 bytes performs better as a struct if reference type semantics aren't needed), else the overhead for passing it around increases (because it is pass by value) and would be more than it would have been for just passing a pointer. Your struct contains a pointer and an int (making it more than a pointer) so you would be experiencing overhead because of this.
See the When to use structs section of this article.
I tried running it through a profiler, but I'm not trusting the results I got. (Run1 takes less time than Run2 in it.) So there aren't any concrete answers there, but my suspicion is that the valid[] array is the culprit:
That's a potentially large memory allocation that Run1 is doing and Run2 isn't. Allocating big chunks of memory can be time-consuming.
It's possible that array is ending up far from any other working data in physical memory. At the very least, it's big enough to end up in the large object heap, whereas it looks like most everything else is going to end up on the stack or small object heap. That might mean that the Score1 function is having to deal with more cache misses than the Score2 function.
It might be a much smaller issue in the serial code, where you've only got that happening once at any given time. When it's happening for a lot of threads simultaneously, though, the problem might compound so that what originally just caused an extra cache miss or two is now causing page faults.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With