
Need thoughts on profiling of multi-threading in C on Linux

My application scenario is like this: I want to evaluate the performance gain one can achieve on a quad-core machine when processing the same amount of data. I have the following two configurations:

i) 1-Process: A program without any threading that processes data from 1M to 1G, with the system assumed to use only one of its 4 cores.

ii) 4-threads-Process: A program with 4 threads (all threads performing the same operation), each processing 25% of the input data.

To create the 4 threads, I used pthread's default options (i.e., without any specific pthread_attr_t). I believe the performance gain of the 4-thread configuration compared to the 1-Process configuration should be close to 400% (or somewhere between 350% and 400%).

I profiled the time spent creating the threads as shown below:

timer_start(&threadCreationTimer); 
pthread_create( &thread0, NULL, fun0, NULL );
pthread_create( &thread1, NULL, fun1, NULL );
pthread_create( &thread2, NULL, fun2, NULL );
pthread_create( &thread3, NULL, fun3, NULL );
threadCreationTime = timer_stop(&threadCreationTimer);

pthread_join(thread0, NULL);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
pthread_join(thread3, NULL);

Since an increase in the size of the input data may also increase the memory requirement of each thread, loading all the data in advance is definitely not a workable option. Therefore, to keep each thread's memory requirement from growing, each thread reads data in small chunks: it reads a chunk, processes it, reads the next chunk, processes it, and so on. Hence, the code of the functions run by my threads is structured like this:

timer_start(&threadTimer[i]);
while(!dataFinished[i])
{
    threadTime[i] += timer_stop(&threadTimer[i]);
    data_source();
    timer_start(&threadTimer[i]);
    process();
}
threadTime[i] += timer_stop(&threadTimer[i]);

The variable dataFinished[i] is set to true by process() once it has received and processed all the needed data. process() knows when to do that :-)
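To see whether the time lost at larger inputs goes into fetching data or into processing it, the loop above could keep two accumulators instead of one. The following is only a sketch under assumptions: it supposes the data comes from a `FILE *`, and `chunk_times`, `now_sec`, `process_chunk` (a byte-summing placeholder) and `timed_chunk_loop` are hypothetical names, not the original data_source() and process().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical result record: wall time spent reading vs. processing. */
struct chunk_times {
    double read_sec;
    double process_sec;
    size_t bytes;
};

/* Current CLOCK_MONOTONIC time in seconds. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Placeholder for process(): sums the bytes of one chunk. */
static unsigned long process_chunk(const unsigned char *buf, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

/* Reads `in` in chunks of chunk_size bytes and times reads and processing
 * separately. Returns 0 on success, -1 on allocation or read error. */
static int timed_chunk_loop(FILE *in, size_t chunk_size,
                            struct chunk_times *out, unsigned long *checksum)
{
    assert(in != NULL && out != NULL && checksum != NULL);
    if (chunk_size == 0)
        return -1;

    unsigned char *buf = malloc(chunk_size);  /* allocated once, reused */
    if (buf == NULL)
        return -1;

    out->read_sec = 0.0;
    out->process_sec = 0.0;
    out->bytes = 0;
    *checksum = 0;

    for (;;) {
        double t0 = now_sec();
        size_t n = fread(buf, 1, chunk_size, in);
        double t1 = now_sec();
        out->read_sec += t1 - t0;
        if (n == 0)
            break;                            /* end of input or read error */
        *checksum += process_chunk(buf, n);
        out->process_sec += now_sec() - t1;
        out->bytes += n;
    }

    free(buf);
    return ferror(in) ? -1 : 0;
}
```

Each thread would call `timed_chunk_loop` on its own stream. If `read_sec` grows much faster than `process_sec` as the input grows, the data source is the more likely place to look than the threading itself.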

In the main function, I calculate the time taken by the 4-threaded configuration as below:

execTime4Thread = max(threadTime[0], threadTime[1], threadTime[2], threadTime[3]) + threadCreationTime

And the performance gain is calculated simply as

gain = execTime1process / execTime4Thread * 100
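As a cross-check on this formula, the whole threaded run can also be timed with a single wall clock, from just before the first pthread_create to just after the last pthread_join. A sketch under assumptions: `worker` is an empty placeholder for fun0..fun3, and `elapsed_sec`, `run_threads`, `gain_percent` and `MAX_THREADS` are hypothetical names introduced only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

#define MAX_THREADS 16  /* arbitrary limit for this sketch */

/* Seconds elapsed between two CLOCK_MONOTONIC samples a (earlier) and b. */
static double elapsed_sec(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) +
           (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

/* Placeholder for fun0..fun3: does nothing. */
static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

/* Starts n threads, joins them, and returns the wall time in seconds from
 * before the first create to after the last join. The time spent in the
 * create calls alone is stored in *create_sec. Returns -1.0 on failure. */
static double run_threads(int n, double *create_sec)
{
    assert(create_sec != NULL);
    if (n < 1 || n > MAX_THREADS)
        return -1.0;

    pthread_t tid[MAX_THREADS];
    struct timespec t0, t1, t2;
    int started = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (started < n &&
           pthread_create(&tid[started], NULL, worker, NULL) == 0)
        started++;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);       /* takes pthread_t by value */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    if (started < n)
        return -1.0;
    *create_sec = elapsed_sec(t0, t1);
    return elapsed_sec(t0, t2);
}

/* Same formula as in the question: gain in percent. */
static double gain_percent(double exec_time_1process, double exec_time_threads)
{
    return exec_time_1process / exec_time_threads * 100.0;
}
```

Note that threads may already be running while later pthread_create calls are still executing, so the interval measured around the create calls can include time in which the main thread was competing with workers for a core.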

Issue: On small data sizes, around 1M to 4M, the performance gain is generally good (between 350% and 400%). However, the performance gain decreases exponentially as the input size increases. It keeps decreasing up to a data size of 50M or so, and then becomes stable at around 200%. Once it reaches that point, it remains almost stable even for 1GB of data.

My question is: can anybody suggest the main reason for this behaviour (i.e., the performance drop at the start that then remains stable)?

And any suggestions on how to fix it?

For your information, I also investigated the behaviour of threadCreationTime and threadTime for each thread to see what's happening. For 1M of data the values of these variables are small, but as the data size increases both variables increase exponentially (whereas threadCreationTime should remain almost the same regardless of data size, and threadTime should increase at a rate corresponding to the data being processed). After increasing until 50M or so, threadCreationTime becomes stable (just as the performance drop becomes stable), and threadTime keeps increasing at a constant rate corresponding to the increase in data to be processed (which is understandable).

Do you think increasing the stack size of each thread, changing the process priority, or setting custom values for other parameters such as the scheduler type (using pthread_attr_init) can help?

PS: The results were obtained while running the programs in Linux's fail-safe mode as root (i.e., a minimal OS running without GUI and networking).
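For reference, trying a non-default stack size would look roughly like the sketch below. Whether it changes the measured gain can only be settled by measuring. `create_with_stack` and `mark_done` are hypothetical names used for illustration; the pthread_attr_* calls are the standard POSIX ones.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Starts fn(arg) in a new thread with the requested stack size.
 * Returns 0 on success, or the error number reported by pthreads
 * (e.g. EINVAL if stack_bytes is below the system minimum). */
static int create_with_stack(pthread_t *tid, size_t stack_bytes,
                             void *(*fn)(void *), void *arg)
{
    assert(tid != NULL && fn != NULL);

    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0)
        rc = pthread_create(tid, &attr, fn, arg);

    pthread_attr_destroy(&attr);  /* attr is no longer needed after create */
    return rc;
}

/* Hypothetical worker used only to show the call: sets *arg to 1. */
static void *mark_done(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}
```

A call such as `create_with_stack(&thread0, 1024 * 1024, fun0, NULL)` would replace the corresponding pthread_create line in the question.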

user1082170 asked Dec 08 '11 19:12


1 Answer

"Since an increase in the size of the input data may also increase the memory requirement of each thread, loading all the data in advance is definitely not a workable option. Therefore, to keep each thread's memory requirement from growing, each thread reads data in small chunks: it reads a chunk, processes it, reads the next chunk, processes it, and so on."

This alone can cause a drastic slowdown.

If there is sufficient memory, reading one large chunk of input data is generally faster than reading it in many small chunks, especially when each thread does its own reads. Any I/O benefit from large reads (caching effects) disappears when you break the input into small pieces. Likewise, allocating one big block of memory is much cheaper than allocating small blocks many, many times.

As a sanity check, you can run htop to confirm that all your cores are being maxed out during the run. If not, your bottleneck could be outside of your multi-threading code.
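A minimal sketch of the "one large read" idea, assuming the input is a regular, seekable file that fits in memory (the question says this may not hold for the largest inputs). `read_whole_file` is a hypothetical helper, not code from the question.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads a whole seekable stream with one allocation and one fread.
 * On success returns a malloc'ed buffer (caller frees) and stores its
 * length in *len. Returns NULL on any seek, allocation or read failure. */
static unsigned char *read_whole_file(FILE *in, size_t *len)
{
    assert(in != NULL && len != NULL);

    if (fseek(in, 0, SEEK_END) != 0)
        return NULL;
    long size = ftell(in);
    if (size < 0 || fseek(in, 0, SEEK_SET) != 0)
        return NULL;

    /* Allocate at least one byte so an empty file still yields a buffer. */
    unsigned char *buf = malloc(size > 0 ? (size_t)size : 1);
    if (buf == NULL)
        return NULL;

    size_t got = fread(buf, 1, (size_t)size, in);
    if (got != (size_t)size) {
        free(buf);
        return NULL;
    }

    *len = got;
    return buf;
}
```

Each worker thread could then be handed a pointer and length for its quarter of the buffer, so that no thread performs I/O or allocation inside its timed loop.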

Within the threading,

  • threading context switches due to many threads can cause sub-optimal speedup
  • as mentioned by others, a cold cache due to not reading memory contiguously can cause slowdowns

But re-reading your OP, I suspect the slowdown has something to do with your data input/memory allocation. Where exactly are you reading your data from? Some kind of socket? Are you sure you need to allocate memory more than once in your thread?

Some algorithm in your worker threads is likely to be suboptimal/expensive.

kfmfe04 answered Nov 15 '22 06:11