I'm currently parallelizing a program using OpenMP on a 4-core Phenom II. However, I noticed that my parallelization does not do anything for the performance. Naturally I assumed I had missed something (false sharing, serialization through locks, ...), but I was unable to find anything like that. Furthermore, from the CPU utilization it seemed like the program was executed on only one core. From what I found, sched_getcpu() should give me the id of the core that the thread executing the call is currently scheduled on. So I wrote the following test program:
#include <iostream>
#include <sstream>
#include <random>
#include <omp.h>
#include <sched.h>  // declares sched_getcpu() (glibc extension)

int main() {
    #pragma omp parallel
    {
        std::default_random_engine rand;  // one engine per thread
        int num = 0;                      // per-thread sum, only there to keep the loop busy
        #pragma omp for
        for (size_t i = 0; i < 1000000000; ++i) num += rand();
        // Build the whole line first so output from different threads does not interleave.
        std::ostringstream os;
        os << "Thread " << omp_get_thread_num() << " on cpu " << sched_getcpu()
           << " num " << num << "\n";
        std::cout << os.str() << std::flush;
    }
}
On my machine this gives the following output (the random numbers will vary, of course):
Thread 2 on cpu 0 num 127392776
Thread 0 on cpu 0 num 1980891664
Thread 3 on cpu 0 num 431821313
Thread 1 on cpu 0 num -1976497224
From this I assume that all threads execute on the same core (the one with id 0). To be more certain I also tried the approach from this answer. The results were the same. Additionally, using #pragma omp parallel num_threads(1) didn't make the execution slower (it was slightly faster, in fact), lending credibility to the theory that all threads use the same CPU; however, the fact that the CPU is always displayed as 0 makes me kind of suspicious. I also checked GOMP_CPU_AFFINITY, which was initially not set, so I tried setting it to 0 1 2 3, which should bind each thread to a different core from what I understand. However, that didn't make a difference.
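For comparison, binding can also be requested from inside the program rather than through GOMP_CPU_AFFINITY. The following is only a minimal sketch, assuming Linux with glibc; pin_current_thread and first_allowed_cpu are hypothetical helper names made up for this illustration, not part of OpenMP or libgomp.

```cpp
#include <sched.h>  // sched_setaffinity, sched_getaffinity, cpu_set_t, CPU_* macros (glibc)

// Hypothetical helper: asks the kernel to run the calling thread only on `cpu`.
// Returns true if the kernel accepted the request.
bool pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    // pid 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}

// Hypothetical helper: lowest CPU id the calling thread may currently run on,
// or -1 if the affinity mask could not be read.
int first_allowed_cpu() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask)) return cpu;
    }
    return -1;
}
```

One possible use would be to call pin_current_thread(omp_get_thread_num()) at the top of the parallel region and check the return value; a false result for every core except one would itself hint that something outside the program restricts where it may run.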
Since I develop on a Windows system, I use Linux in VirtualBox for my development. So I thought that maybe the virtual system couldn't access all cores. However, checking the VirtualBox settings showed that the virtual machine should get all 4 cores, and executing my test program 4 times at the same time seems to use all 4 cores, judging from the CPU utilization (and the fact that the system was getting very unresponsive).
So my question is basically: what exactly is going on here? More to the point: is my deduction that all threads use the same core correct? If it is, what could be the reasons for that behaviour?
After some experimentation I found out that the problem was that I was starting my program from inside the Eclipse IDE, which seemingly set the affinity to use only one core. I thought I got the same problems when starting from outside of the IDE, but a repeated test showed that the program works just fine when started from the terminal instead of from inside the IDE.
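A restriction like this can be spotted directly by asking the kernel which CPUs the process is allowed to use, since the affinity mask is normally inherited from the launching process. This is a minimal sketch, assuming Linux with glibc; allowed_cpus and describe_allowed_cpus are hypothetical helper names introduced only for this example.

```cpp
#include <sched.h>  // sched_getaffinity, cpu_set_t, CPU_* macros (glibc)
#include <string>
#include <vector>

// Hypothetical helper: ids of the CPUs the calling thread is allowed to run on.
// Returns an empty vector if the mask could not be read.
std::vector<int> allowed_cpus() {
    std::vector<int> cpus;
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &mask)) cpus.push_back(cpu);
    }
    return cpus;
}

// Hypothetical helper: formats the allowed set as one line of text for logging.
std::string describe_allowed_cpus() {
    std::string line = "allowed cpus:";
    for (int cpu : allowed_cpus()) line += " " + std::to_string(cpu);
    return line;
}
```

Printing describe_allowed_cpus() at the start of main, once when launched from the IDE and once from a terminal, should show whether the two environments hand the program different masks. If the list contains a single id when all four cores were expected, the limit comes from the launcher and not from the OpenMP code.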