Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Simple C++ Loop Not Benefitting from Multithreading

I have some extremely simple C++ code that I was certain would run 3x faster with multithreading but somehow only runs 3% faster (or less) on both GCC and MSVC on Windows 10.

There are no mutex locks and no shared resources. And I can't see how false sharing or cache thrashing could be at play since each thread only modifies a distinct segment of the array, which has over a billion int values. I realize there are many questions on SO like this but I haven't found any that seem to solve this particular mystery.

One hint might be that moving the array initialization into the loop of the add() function does make the function 3x faster when multithreaded vs single-threaded (~885ms vs ~2650ms).

Note that only the add() function is being timed and takes ~600ms on my machine. My machine has 4 hyperthreaded cores, so I'm running the code with threadCount set to 8 and then to 1.

Any idea what might be going on? Is there any way to turn off (when appropriate) the features in processors that cause things like false sharing (and possibly like what we're seeing here) to happen?

#include <chrono>
#include <iostream>
#include <thread>

void startTimer();
void stopTimer();
void add(int* x, int* y, int threadIdx);

namespace ch = std::chrono;
auto start = ch::steady_clock::now();
const int threadCount = 8;
int itemCount = 1u << 30u; // ~1B items
int itemsPerThread = itemCount / threadCount;

int main() {
    int* x = new int[itemCount];
    int* y = new int[itemCount];

    // Initialize arrays
    for (int i = 0; i < itemCount; i++) {
        x[i] = 1;
        y[i] = 2;
    }

    // Call add() on multiple threads
    std::thread threads[threadCount];
    startTimer();
    for (int i = 0; i < threadCount; ++i) {
        threads[i] = std::thread(add, x, y, i);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    stopTimer();

    // Verify results
    for (int i = 0; i < itemCount; ++i) {
        if (y[i] != 3) {
            std::cout << "Error!";
        }
    }

    delete[] x;
    delete[] y;
}

void add(int* x, int* y, int threadIdx) {
    int firstIdx = threadIdx * itemsPerThread;
    int lastIdx = firstIdx + itemsPerThread - 1;

    for (int i = firstIdx; i <= lastIdx; ++i) {
        y[i] = x[i] + y[i];
    }
}

void startTimer() {
    start = ch::steady_clock::now();
}

void stopTimer() {
    auto end = ch::steady_clock::now();
    auto duration = ch::duration_cast<ch::milliseconds>(end - start).count();
    std::cout << duration << " ms\n";
}
like image 337
Gumby The Green Avatar asked Dec 07 '22 11:12

Gumby The Green


1 Answers

You may be simply hitting the memory transfer rate of your machine, you are doing 8GB of reads and 4GB of writes.

On my machine your test completes in about 500ms which is 24GB/s (which is similar to the results given by a memory bandwidth tester).

As you hit each memory address with a single read and a single write the caches aren't much use as you aren't reusing memory.

like image 170
Alan Birtles Avatar answered Dec 09 '22 23:12

Alan Birtles