The short version of my problem:
I have a machine with two AMD Opteron 6272 sockets (16 cores each) and 64 GB of RAM.
When I run one multithreaded program on all 32 cores, it is about 15% slower than when I run two copies of the program, each pinned to one 16-core socket.
How do I make the one-program version as fast as the two-program version?
More details:
I have a large number of tasks and want to fully load all 32 cores of the system.
So I pack the tasks into groups of 1000. Such a group needs about 120 MB of input data and takes about 10 seconds to complete on one core. To make the test ideal, I copy these groups 32 times and use Intel TBB's parallel_for
loop to distribute the tasks between the 32 cores.
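For context, the body classes shown below are driven roughly like this (a simplified sketch; the TaskBody name and the grain size of 1 are illustrative, not taken from my real code):

#include <cstddef>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

struct TaskBody {
    void operator()(const tbb::blocked_range<size_t> &range) const {
        for (size_t i = range.begin(); i != range.end(); ++i) {
            // pin the thread and run the tasks of slot i, as shown below
        }
    }
};

int main() {
    // one sub-range per core: 32 slots, grain size 1
    tbb::parallel_for(tbb::blocked_range<size_t>(0, 32, 1), TaskBody());
}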
I use pthread_setaffinity_np
to ensure that the system does not make my threads jump between cores, and that all 32 cores are used, one thread per core.
I use mlockall(MCL_FUTURE)
to ensure that the system does not move my memory between sockets.
So the code looks like this:
void operator()(const blocked_range<size_t> &range) const
{
    for (size_t i = range.begin(); i != range.end(); ++i) {
        // pin this worker thread to one fixed core
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(threadNumberToCpuMap[i], &cpuset);
        int s = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
        if (s != 0) { /* handle the error */ }

        // lock virtual memory to stay at the physical address where it was allocated
        mlockall(MCL_FUTURE);

        TaskManager manager;
        for (int j = 0; j < fNTasksPerThr; j++) {
            manager.SetData(&(InpData->fInput[j]));
            manager.Run();
        }
    }
}
Only the computing time is important to me, so I prepare the input data in a separate parallel_for
loop and do not include the preparation time in the measurements.
void operator()(const blocked_range<size_t> &range) const
{
    for (size_t i = range.begin(); i != range.end(); ++i) {
        // same pinning as above, so each thread first-touches its own copy
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(threadNumberToCpuMap[i], &cpuset);
        int s = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
        if (s != 0) { /* handle the error */ }

        // lock virtual memory to stay at the physical address where it was allocated
        mlockall(MCL_FUTURE);

        InpData[i].fInput = new ProgramInputData[fNTasksPerThr];
        for (int j = 0; j < fNTasksPerThr; j++) {
            InpData[i].fInput[j] = InpDataPerThread.fInput[j];
        }
    }
}
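To verify that the per-thread copies really land on the local node, the kernel can be queried with move_pages(2) and a NULL node list, which fills status with the NUMA node of each page (a diagnostic sketch, not part of the program; link with -lnuma):

#include <numaif.h>   // move_pages
#include <unistd.h>   // sysconf
#include <cstdint>
#include <cstdio>

void printNumaNodeOf(void *data)
{
    long pageSize = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)data & ~(uintptr_t)(pageSize - 1));
    int status = -1;
    // with nodes == NULL, move_pages only reports where the page currently is;
    // status receives -ENOENT if the page has not been touched yet
    if (move_pages(0 /* this process */, 1, &page, NULL, &status, 0) == 0)
        printf("%p is on NUMA node %d\n", data, status);
}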
Now I run all of this on 32 cores and see a speed of ~1600 tasks per second.
Then I create two versions of the program, and with taskset
and the pthread affinity calls
ensure that the first runs on the 16 cores of the first socket and the second on the second socket. I run them side by side, simply using &
in the shell:
program1 & program2 &
Each of these programs achieves a speed of ~900 tasks/s. In total that is >1800 tasks/s, about 15% more than the one-program version.
What am I missing?
I suspect the problem may be in the libraries, which are loaded into the memory of the master thread only. Can this be a problem? Can I copy the library data so that it is available independently on both sockets?
I would guess that it's the STL/Boost memory allocation that's spreading the memory for your collections etc. across NUMA nodes, because those allocators are not NUMA-aware and you have threads in the program running on each node.
Custom allocators for all of the STL/Boost things that you use might help (but that is likely a huge job).
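If you can take over those allocations, a sketch of the idea (assumed and simplified; uses libnuma, link with -lnuma) is to allocate each thread's buffers explicitly on the node that thread runs on, instead of wherever the default allocator put them:

#include <numa.h>     // numa_alloc_onnode, numa_node_of_cpu
#include <sched.h>    // sched_getcpu
#include <cstdlib>

void *allocOnLocalNode(size_t bytes)
{
    if (numa_available() < 0)
        return malloc(bytes);                    // no NUMA support: fall back
    int node = numa_node_of_cpu(sched_getcpu()); // node of the calling thread
    return numa_alloc_onnode(bytes, node);       // pages bound to that node
}
// buffers from numa_alloc_onnode must be released with numa_free(ptr, bytes)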
You might be suffering from a bad case of cache false sharing: http://en.wikipedia.org/wiki/False_sharing
Your threads probably share access to the same data structure through the blocked_range reference. If speed is all you need, you might want to pass a copy to each thread. If your data is too large to fit on the call stack, you could dynamically allocate a copy of each range on different cache lines (i.e. just make sure the copies are far enough apart).
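For illustration, one common way to keep per-thread data on separate cache lines (the names are made up; 64 bytes is the usual x86 cache-line size):

#include <cstddef>

struct alignas(64) PerThreadSlot {  // each slot starts on its own cache line
    long tasksDone;                 // hot, frequently-written field
};
static_assert(sizeof(PerThreadSlot) == 64, "one cache line per slot");

PerThreadSlot slots[32];            // adjacent slots can no longer false-share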
Or maybe I need to see the rest of the code to understand better what you are doing.