I have a function evaluation which is somewhat slow. I'm trying to speed it up by using threading, since there are three things which can be done in parallel. The single-threaded version is
return dEdx_short(E) + dEdx_long(E) + dEdx_quantum(E);
where evaluation of those functions takes ~250us, ~250us, and ~100us respectively. So I implemented a three-thread solution:
double ret_short, ret_long, ret_quantum; // return values for the terms
auto shortF = [this,&E,&ret_short] () {ret_short = this->dEdx_short(E);};
std::thread t1(shortF);
auto longF = [this,&E,&ret_long] () {ret_long = this->dEdx_long(E);};
std::thread t2(longF);
auto quantumF = [this,&E,&ret_quantum] () {ret_quantum = this->dEdx_quantum(E);};
std::thread t3(quantumF);
t1.join();
t2.join();
t3.join();
return ret_short + ret_long + ret_quantum;
I expected this to take ~300us, yet it actually takes ~600us - basically the same as the single-threaded version! These functions are all inherently thread-safe, so there are no waits for locks. I checked the thread creation time on my system and it's ~25us. I'm not using all of my cores, so I'm a bit baffled as to why the parallel solution is so slow. Is it something to do with the lambda creation?
I tried to bypass the lambda, e.g.:
std::thread t1(&StopPow_BPS::dEdx_short, this, E, ret_short);
after rewriting the function being called to take the result as an extra argument, but that gave me the error "attempt to use a deleted function".
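A likely cause of that error: std::thread copies its arguments into internal storage and passes those copies to the callable, so a plain variable cannot bind to a non-const reference parameter. Wrapping the variable in std::ref is the usual fix. The sketch below assumes the rewritten function takes the result by reference; the class StopPowSketch, its signature, and the arithmetic are hypothetical stand-ins, since the rewritten function is not shown.

```cpp
#include <functional>
#include <thread>

// Hypothetical stand-in for the rewritten member function: it writes its
// result into an output parameter instead of returning it.
struct StopPowSketch {
    void dEdx_short(double E, double& out) const { out = 2.0 * E; }
};

double run_short(double E) {
    StopPowSketch model;
    double ret_short = 0.0;
    // std::ref passes a reference wrapper, so the thread writes into
    // ret_short itself rather than into a private copy.
    std::thread t1(&StopPowSketch::dEdx_short, &model, E, std::ref(ret_short));
    t1.join();
    return ret_short;
}
```

Passing ret_short without std::ref is what fails to compile here, because the copied argument cannot bind to double&.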
...
Every thread carries some overhead and consumes system resources, so creating threads also costs performance. Another problem is the so-called "thread explosion", when more threads are created than there are cores on the system. And having threads that do nothing but wait for other threads to finish is the worst case for multithreading.
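One way to act on this is to start only two extra tasks and compute the cheapest term on the calling thread, so that no thread exists purely to wait. This is a minimal sketch, not a measured fix: the three free functions are hypothetical stand-ins for the member functions in the question, and std::async with std::launch::async still pays a thread start cost per call.

```cpp
#include <future>

// Hypothetical stand-ins for the three terms from the question.
double dEdx_short(double E)   { return 1.0 * E; }
double dEdx_long(double E)    { return 2.0 * E; }
double dEdx_quantum(double E) { return 3.0 * E; }

double dEdx_parallel(double E) {
    // Run the two expensive terms asynchronously.
    std::future<double> f_short = std::async(std::launch::async, dEdx_short, E);
    std::future<double> f_long  = std::async(std::launch::async, dEdx_long, E);
    // Compute the cheapest term on the calling thread instead of idling.
    const double quantum = dEdx_quantum(E);
    // get() blocks until each result is ready.
    return f_short.get() + f_long.get() + quantum;
}
```

If the function is called many times, reusing long-lived worker threads instead of starting new ones per call would avoid paying the creation cost repeatedly.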
General rule of thumb for threading an application: 1 thread per CPU core. On a quad-core PC that means 4. As was noted, however, the Xbox 360 has 3 cores with 2 hardware threads each, so 6 threads in that case.
With simultaneous multithreading (for example Intel Hyper-Threading), a single CPU core can typically run up to 2 hardware threads. For example, a dual-core CPU (i.e., 2 cores) with this feature exposes 4 hardware threads.
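Rather than hard-coding the count, the number of hardware threads can be queried at run time. A small sketch, where pick_worker_count is a hypothetical helper name: std::thread::hardware_concurrency is only a hint and may return 0 when the value cannot be determined, so the code falls back to 1.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: cap the number of workers at the number of
// hardware threads reported by the implementation.
unsigned pick_worker_count(unsigned tasks) {
    unsigned hw = std::thread::hardware_concurrency(); // 0 means "unknown"
    if (hw == 0) {
        hw = 1; // conservative fallback
    }
    return std::min(tasks, hw);
}
```

For the three terms in the question, pick_worker_count(3) would give at most 3 workers, and fewer on a machine that reports fewer hardware threads.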
Perhaps you are experiencing false sharing. To verify, store the return values in a type that uses an entire cache line (size depends on CPU).
const int cacheLineSize = 64; // bytes
union CacheFriendly
{
double value;
char dummy[cacheLineSize];
} ret_short, ret_long, ret_quantum; // return values for the terms
// ...
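The same experiment can be written with alignas, which also forces each value to start on a cache-line boundary. This is a sketch under assumptions: 64 bytes is typical for x86-64 but not universal, the names PaddedDouble and sum_on_separate_lines are hypothetical, and the arithmetic inside the lambdas is only a placeholder for the real terms.

```cpp
#include <cstddef>
#include <thread>

constexpr std::size_t kCacheLineSize = 64; // assumption: depends on the CPU

// Each instance is aligned to, and therefore occupies, a full cache line.
struct alignas(kCacheLineSize) PaddedDouble {
    double value = 0.0;
};
static_assert(sizeof(PaddedDouble) == kCacheLineSize,
              "each result should fill exactly one cache line");

double sum_on_separate_lines(double E) {
    PaddedDouble results[3]; // three results on three distinct cache lines
    // Placeholder work; the real code would call the three dEdx terms.
    std::thread t1([&] { results[0].value = 1.0 * E; });
    std::thread t2([&] { results[1].value = 2.0 * E; });
    std::thread t3([&] { results[2].value = 3.0 * E; });
    t1.join();
    t2.join();
    t3.join();
    return results[0].value + results[1].value + results[2].value;
}
```

If timing does not change with the padded type, false sharing of the return values is probably not the bottleneck.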