Simple C++ Loop Not Benefitting from Multithreading

Question

I have some extremely simple C++ code that I was certain would run 3x faster with multithreading but somehow only runs 3% faster (or less) on both GCC and MSVC on Windows 10.

There are no mutex locks and no shared resources. And I can't see how false sharing or cache thrashing could be at play since each thread only modifies a distinct segment of the array, which has over a billion int values. I realize there are many questions on SO like this but I haven't found any that seem to solve this particular mystery.

One hint might be that moving the array initialization into the loop of the add() function does make the function 3x faster when multithreaded vs single-threaded (~885ms vs ~2650ms).

Note that only the add() function is being timed and takes ~600ms on my machine. My machine has 4 hyperthreaded cores, so I'm running the code with threadCount set to 8 and then to 1.

Any idea what might be going on? Is there any way to turn off (when appropriate) the features in processors that cause things like false sharing (and possibly like what we're seeing here) to happen?

#include <chrono>
#include <iostream>
#include <thread>

void startTimer();
void stopTimer();
void add(int* x, int* y, int threadIdx);

namespace ch = std::chrono;
auto start = ch::steady_clock::now();
const int threadCount = 8;
int itemCount = 1u << 30u; // ~1B items
int itemsPerThread = itemCount / threadCount;

int main() {
    int* x = new int[itemCount];
    int* y = new int[itemCount];

    // Initialize arrays
    for (int i = 0; i < itemCount; i++) {
        x[i] = 1;
        y[i] = 2;
    }

    // Call add() on multiple threads
    std::thread threads[threadCount];
    startTimer();
    for (int i = 0; i < threadCount; ++i) {
        threads[i] = std::thread(add, x, y, i);
    }
    for (auto& thread : threads) {
        thread.join();
    }
    stopTimer();

    // Verify results
    for (int i = 0; i < itemCount; ++i) {
        if (y[i] != 3) {
            std::cout << "Error!";
        }
    }

    delete[] x;
    delete[] y;
}

void add(int* x, int* y, int threadIdx) {
    int firstIdx = threadIdx * itemsPerThread;
    int lastIdx = firstIdx + itemsPerThread - 1;

    for (int i = firstIdx; i <= lastIdx; ++i) {
        y[i] = x[i] + y[i];
    }
}

void startTimer() {
    start = ch::steady_clock::now();
}

void stopTimer() {
    auto end = ch::steady_clock::now();
    auto duration = ch::duration_cast<ch::milliseconds>(end - start).count();
    std::cout << duration << " ms\n";
}

Alan Birtles · Accepted Answer

You may simply be hitting the memory bandwidth limit of your machine: the loop does 8GB of reads (x and y) and 4GB of writes (y).

On my machine, your test completes in about 500ms, which works out to 24GB/s (12GB moved in 0.5s) and is similar to the result reported by a memory bandwidth tester.
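The arithmetic behind those figures can be sketched as follows. This is only an illustration: the names `Traffic`, `estimateTraffic`, and `bandwidthGiBPerSec` are made up for this example, and it assumes a 4-byte `int`, as on typical GCC and MSVC targets.

```cpp
#include <cstdint>

// Bytes moved by one pass of y[i] = x[i] + y[i] over itemCount ints.
struct Traffic {
    std::uint64_t bytesRead;     // x and y are each read once
    std::uint64_t bytesWritten;  // y is written once
};

// Hypothetical helper: estimates memory traffic for the loop in the question.
inline Traffic estimateTraffic(std::uint64_t itemCount) {
    const std::uint64_t arrayBytes = itemCount * sizeof(int);  // assumes sizeof(int) == 4
    return Traffic{2 * arrayBytes, arrayBytes};
}

// Hypothetical helper: effective bandwidth in GiB/s for a given run time.
inline double bandwidthGiBPerSec(Traffic t, double milliseconds) {
    const double gib = static_cast<double>(t.bytesRead + t.bytesWritten)
                       / (1024.0 * 1024.0 * 1024.0);
    return gib / (milliseconds / 1000.0);
}
```

With `itemCount = 1u << 30`, `estimateTraffic` gives 8 GiB read and 4 GiB written, and `bandwidthGiBPerSec` at 500 ms gives 24 GiB/s, matching the rounded numbers above.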

Because you hit each memory address with a single read and a single write, the caches aren't much use: no data is ever reused.
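One way to check this on your own machine is to measure the effective throughput of the same loop at different thread counts. The sketch below is not the code from the question: `AddResult` and `measureAdd` are hypothetical names, it uses `std::vector` instead of `new[]`, and it assumes `threadCount` is at least 1. Like the original, it includes thread start-up in the timed region.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

struct AddResult {
    double gibPerSec;  // effective bandwidth: bytes read plus written, per second
    bool correct;      // true if every y[i] ended up as 3
};

// Hypothetical helper: runs y[i] += x[i] over itemCount ints on threadCount
// threads and reports the effective memory throughput of the timed loop.
AddResult measureAdd(std::size_t itemCount, unsigned threadCount) {
    std::vector<int> x(itemCount, 1), y(itemCount, 2);
    const std::size_t chunk = itemCount / threadCount;

    std::vector<std::thread> threads;
    const auto begin = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t first = t * chunk;
        // The last thread also takes any remainder.
        const std::size_t last = (t + 1 == threadCount) ? itemCount : first + chunk;
        threads.emplace_back([&x, &y, first, last] {
            for (std::size_t i = first; i < last; ++i) y[i] += x[i];
        });
    }
    for (auto& th : threads) th.join();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - begin;

    bool correct = true;
    for (int v : y) correct = correct && (v == 3);

    // Two arrays read and one written: 3 * itemCount * sizeof(int) bytes in total.
    const double gib = 3.0 * itemCount * sizeof(int) / (1024.0 * 1024.0 * 1024.0);
    return {gib / elapsed.count(), correct};
}
```

If `measureAdd(1u << 30, 1).gibPerSec` and `measureAdd(1u << 30, 8).gibPerSec` come out close to each other and close to what a memory bandwidth tester reports, the loop is most likely limited by memory bandwidth rather than by the CPU cores.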

Tags: c++, multithreading

Gumby The Green
