C# performance varying due to memory

Hope this is a valid post here; it's a combination of C# issues and hardware.

I am benchmarking our server because we have found problems with the performance of our quant library (written in C#). I have simulated the same performance issues with some simple C# code that performs very heavy memory usage.

The code below is in a function that is spawned from a thread pool, up to a maximum of 32 threads (because our server has 4 CPUs with 8 cores each).

This is all on .NET 3.5.

The problem is that we are getting wildly differing performance. I run the function below 1000 times. The average time taken for the code to run could be, say, 3.5s, but the fastest run will be only 1.2s and the slowest will be 7s, for the exact same function!

I have graphed the memory usage against the timings and there doesn't appear to be any correlation with the GC kicking in.

One thing I did notice is that when running in a single thread the timings are identical and there is no wild deviation. I have also tested CPU-bound algorithms and the timings are identical too. This has made us wonder if the memory bus just cannot cope.

I was wondering: could this be another .NET or C# problem, or is it something related to our hardware? Would I see the same behaviour if I had used C++ or Java? We are using 4x Intel X7550 with 32GB RAM. Is there any way around this problem in general?

// Requires: using System.Collections.Generic; using System.Diagnostics; using System.Linq;
Stopwatch watch = new Stopwatch();
watch.Start();
List<byte> list1 = new List<byte>();
List<byte> list2 = new List<byte>();
List<byte> list3 = new List<byte>();

int Size1 = 10000000;
int Size2 = 2 * Size1;
int Size3 = Size1;

// Fill list1 with 10 million bytes (the List repeatedly grows and reallocates).
for (int i = 0; i < Size1; i++)
{
    list1.Add(57);
}

// Fill list2 with another 10 million bytes (the loop counts by 2 up to 2 * Size1).
for (int i = 0; i < Size2; i = i + 2)
{
    list2.Add(56);
}

// Read elements back and swap bytes between the lists to generate memory traffic.
for (int i = 0; i < Size3; i++)
{
    byte temp = list1.ElementAt(i);
    byte temp2 = list2.ElementAt(i);
    list3.Add(temp);
    list2[i] = temp;
    list1[i] = temp2;
}
watch.Stop();

(the code is just meant to stress the memory)

I would include the threadpool code, but we used a non-standard threadpool library.

EDIT: I have reduced Size1 to 100000, which basically doesn't use much memory, and I still get a lot of jitter. This suggests it's not the amount of memory being transferred, but rather the frequency of the memory grabs?

asked Apr 03 '12 by mezamorphic


1 Answer

There isn't enough to go on, but here are some areas to start looking:

  • The variability is the result of internal GC state. The GC dynamically manages the sizes of the various pools. If you start with different pool sizes, you'll get different GC behavior during runs. (A sketch of how to count collections per run follows this list.)
  • Moiré patterns in the thread scheduling. Depending on random variations in the sequencing of the threads, you could have more or less favorable patterns of contention. If there's any periodicity, that may lead to an amplified effect akin to constructive interference.
  • False sharing. If you have two threads that both hit memory addresses close enough to be colocated in the processor cache, you'll see a marked decrease in performance as the processors have to spend a lot of time re-syncing their caches. Depending on how you organize your data and allocate threads to process it, you may get patterns of false sharing based on variations at the start. (See the second sketch after this list.)
  • Another process on the system is taking up processor time. You might want to measure the process's user-mode time instead of wall time. (There's an accessor for that on the Process class; see the third sketch after this list.)
  • The machine is running close to its full physical memory limit. Swapping to disk is occurring in a more-or-less random pattern.
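
Here is a minimal sketch (not the poster's code) of how GC activity could be logged alongside each timed run, using the standard GC.CollectionCount API; RunBenchmark is a hypothetical stand-in for the memory-heavy function being benchmarked. If the slow runs line up with jumps in the gen 2 count, the GC is a likely culprit; if they don't, that strengthens the memory-bus theory.

using System;
using System.Diagnostics;

class GcTimingProbe
{
    // Hypothetical stand-in for the memory-heavy function being benchmarked.
    static void RunBenchmark() { /* ... */ }

    static void Main()
    {
        for (int run = 0; run < 1000; run++)
        {
            int gen0Before = GC.CollectionCount(0);
            int gen1Before = GC.CollectionCount(1);
            int gen2Before = GC.CollectionCount(2);

            Stopwatch watch = Stopwatch.StartNew();
            RunBenchmark();
            watch.Stop();

            // Log elapsed time alongside how many collections happened during
            // this run, so slow runs can be matched against GC activity.
            Console.WriteLine("{0}ms  gen0={1} gen1={2} gen2={3}",
                watch.ElapsedMilliseconds,
                GC.CollectionCount(0) - gen0Before,
                GC.CollectionCount(1) - gen1Before,
                GC.CollectionCount(2) - gen2Before);
        }
    }
}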
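
To illustrate the false-sharing point, here is a small self-contained sketch (again not the poster's code, and it assumes a 64-byte cache line): two threads increment counters that sit next to each other in the same array, hence on the same cache line, and then increment counters padded 128 bytes apart. The adjacent case is typically noticeably slower.

using System;
using System.Diagnostics;
using System.Threading;

class FalseSharingDemo
{
    const long Iterations = 50000000;

    // Two counters almost certainly on the same cache line...
    static long[] shared = new long[2];
    // ...versus two counters 16 longs (128 bytes) apart.
    static long[] padded = new long[32];

    static long Run(long[] counters, int indexA, int indexB)
    {
        Stopwatch watch = Stopwatch.StartNew();
        Thread a = new Thread(() => { for (long i = 0; i < Iterations; i++) counters[indexA]++; });
        Thread b = new Thread(() => { for (long i = 0; i < Iterations; i++) counters[indexB]++; });
        a.Start(); b.Start();
        a.Join(); b.Join();
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    static void Main()
    {
        Console.WriteLine("adjacent: {0}ms", Run(shared, 0, 1));
        Console.WriteLine("padded:   {0}ms", Run(padded, 0, 16));
    }
}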
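
And for measuring user-mode CPU time rather than wall time, Process.GetCurrentProcess() exposes UserProcessorTime (and TotalProcessorTime); a rough sketch:

using System;
using System.Diagnostics;

class CpuTimeProbe
{
    static void Main()
    {
        Process self = Process.GetCurrentProcess();

        TimeSpan userBefore = self.UserProcessorTime;
        Stopwatch watch = Stopwatch.StartNew();

        // ... run the benchmarked work here ...

        watch.Stop();
        self.Refresh();  // re-read the cached process counters
        TimeSpan userElapsed = self.UserProcessorTime - userBefore;

        // If wall time jumps but user-mode CPU time stays flat, something
        // else (another process or the OS) was taking the processor.
        Console.WriteLine("wall: {0}ms, user CPU: {1}ms",
            watch.ElapsedMilliseconds, userElapsed.TotalMilliseconds);
    }
}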
answered Sep 27 '22 by Kennet Belenky