Multithreading on Intel much slower than on AMD

I want to parallelize the code below:

for(int c=0; c<n; ++c) {
    Work(someArray, c);
}

I've done it this way:

#include <future>
#include <thread>
#include <vector>

auto iterationsPerCore = n/numCPU;
std::vector<std::future<void>> futures;

for(auto th = 0; th < numCPU; ++th) {
    for(auto n = th * iterationsPerCore; n < (th+1) * iterationsPerCore; ++n) {
        auto ftr = std::async( std::launch::deferred | std::launch::async,
            [n, iterationsPerCore, someArray]()
            {
                for(auto m = n; m < n + iterationsPerCore; ++m)
                    Work(someArray, m);
            }
        );
        futures.push_back(std::move(ftr));
    }

    for(auto& ftr : futures)
        ftr.wait();
}

// rest of iterations: n%iterationsPerCore
for(auto r = numCPU * iterationsPerCore; r < n; ++r)
    Work(someArray, r);

The problem is that it runs only 50% faster on Intel CPUs, while on AMD it runs 300% faster. I ran it on three Intel CPUs (Nehalem 2-core + HT, Sandy Bridge 2-core + HT, Ivy Bridge 4-core + HT). The AMD processor is a Phenom II X2 with 4 cores unlocked. On the 2-core Intel processors it runs 50% faster with 4 threads. On the 4-core it also runs only 50% faster with 4 threads. I'm testing with VS2012 on Windows 7.

When I try it with 8 threads, it is 8x slower than the serial loop on Intel. I suppose this is caused by HT.

What do you think? What is the reason for this behavior? Is my code perhaps incorrect?

asked Dec 12 '12 by Michal


2 Answers

I'd suspect false sharing. This is what happens when two variables share the same cache line. Effectively, every operation on them must be expensively synchronized between cores even though the threads never actually share a variable, because the cache can only operate on whole cache lines of a fixed size, no matter how fine-grained your accesses are. I would suspect that the AMD hardware is simply more resilient to this, or has a different hardware design to cope with it.

To test this, change the code so that each core works only on chunks that are multiples of 64 bytes. That should avoid any cache-line sharing, since the Intel CPUs have a 64-byte cache line.
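A minimal sketch of what that chunking might look like, assuming someArray is a plain int array that starts on a cache-line boundary and that Work(someArray, i) only touches someArray[i]; the names ParallelWork and the stand-in Work body are illustrative, not the asker's actual code:

#include <algorithm>
#include <future>
#include <vector>

// Stand-in for the real work; assumed to write only to someArray[i].
void Work(int* someArray, int i)
{
    someArray[i] = i * i;
}

void ParallelWork(int* someArray, int n, int numCPU)
{
    // Give each thread a chunk whose size is a multiple of 64 bytes
    // (16 ints here), so no two threads write into the same cache line.
    // This assumes someArray itself is 64-byte aligned.
    const int intsPerLine = static_cast<int>(64 / sizeof(int));
    const int chunk = ((n / numCPU + intsPerLine - 1) / intsPerLine) * intsPerLine;

    std::vector<std::future<void>> futures;
    for (int th = 0; th < numCPU; ++th) {
        const int begin = th * chunk;
        const int end = std::min(n, begin + chunk);
        if (begin >= end)
            break;                              // remaining threads have no work
        futures.push_back(std::async(std::launch::async, [=]() {
            for (int i = begin; i < end; ++i)
                Work(someArray, i);
        }));
    }
    for (auto& f : futures)
        f.wait();
}

int main()
{
    std::vector<int> data(1000);
    ParallelWork(data.data(), static_cast<int>(data.size()), 4);
}

Note that this launches one task per core over a contiguous, cache-line-sized-multiple range, rather than many overlapping tasks as in the question's nested loops.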

answered Oct 05 '22 by Puppy


I would say you need to change your compiler settings so that the compiled code minimizes the number of branches. The two CPU families have different look-ahead behavior. You need to set the compiler optimization options to match the target CPU, not the CPU on which the code is compiled.
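For example (an illustration of that advice, not a guaranteed fix; the best switches depend on your toolchain and target machine), MSVC's x64 compiler lets you pick a tuning target, and GCC can tune for a specific CPU:

cl /O2 /favor:INTEL64 main.cpp    (MSVC x64: optimize, tune instruction scheduling for Intel)
cl /O2 /favor:AMD64 main.cpp      (MSVC x64: tune instruction scheduling for AMD)
g++ -O2 -mtune=native main.cpp    (GCC: tune for the build machine's CPU without changing the instruction set)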

answered Oct 05 '22 by Zagrev