Multithreaded Program slower than singlethreaded

Question

I have created a very simple test program that calculates c*A of some scalar c and matrix A. You can either run it online here here or paste the following code in your favorite text editor:

#include <iostream>
#include <time.h>
#include <chrono>
#include <thread>

void fill_rand_matrix(double* mat, int n){
    
    for (int i=0;i<n;i++){
        mat[i]=static_cast <double> (rand()) / static_cast <double> (RAND_MAX)*20-10;
    }
}

void test(size_t m, size_t n, double alpha, double* X) {

        for (int j = 0; j < m; ++j) {
            for (int i = 0; i < n; ++i) {
                X[i+ j*n] *= alpha;
            }
        }       
}   

int main()
{
    int m=10000;
    int n=10000;
    double res_scaling=0.5;
    
    double* res=new double[m*n];
    fill_rand_matrix(res,n*m);
    
    auto begin1 = std::chrono::steady_clock::now();
    std::thread t1(test,0.5*m,n,res_scaling,res);
    std::thread t2(test,0.5*m,n,res_scaling,(double*)(res+(m/2)*n));
    t1.join();
    t2.join();
    auto end1= std::chrono::steady_clock::now();
    std::cout << "Time taken multithreaded = " << std::chrono::duration_cast<std::chrono::milliseconds>(end1 - begin1).count() << "[ms]" << std::endl;

    auto begin2 = std::chrono::steady_clock::now();
    test(m,n,res_scaling,res);
    auto end2= std::chrono::steady_clock::now();
    std::cout << "Time taken singlethreaded = " << std::chrono::duration_cast<std::chrono::milliseconds>(end2 - begin2).count() << "[ms]" << std::endl;

    return 0;
}

When I am running this code multiple times, the multithreaded version is either only faster marginally or even slower than the singlethreaded variant. This even happens, if I add more than 2 threads. There appears to be just no benefit in multithreading even though the problem should almost scale perfectly with the number of cores. In addition, the higher I set the matrix size, the more wildly the running times oscillate, sometimes by as much as a factor of twenty.

Do you have any idea what is happening here?

Elliott · Accepted Answer

I'm not knowledgeable enough to give a definitive answer, but as there's no answer so far I'll do my best:

Short answer:

Multiple sequential iterations over an array of size L (where L is bigger than the size of your largest cache) won't be able to take advantage of any of the caches for a new cache-line access of the array (because caches use variants of LRU). Quickly iterating over an array of size L with fast calculations means that accessing the (main) memory is the bottleneck, and will hog the shared memory bus for all running processes/threads. Adding more threads that are also bound by main-memory access will just cause competition between them.

(If very clever, your caches might just ditch each new array data before it makes its way to the L2 cache, realising there's no way you'll be able to use it before it gets pushed out. This wouldn't effect this process, but would leave more cache-space for others.)

More info:

For me, using g++ -std=c++17 -O3 -pthread -S -fverbose-asm

...gives the assembly output with movups instruction twice associated with the line:

X[i+ j*n] *= alpha; // line 17

movups is a x86 parallelisation (SIMD) instruction. SIMD instructions like these typically put 4 doubles onto a giant register to do the calculations very quickly (~10 clock cycles all up, but please correct me if I'm wrong). Multiply this by 4 to get the work done on a cache line: ~40 cycles. If you're not using x86, then you're probably using a CPU with similar parallelisation instructions.

Main memory access is slow (roughly 100-200 clock cycles to get a cache line from main memory [cache line = chunk of 64 bytes ~= 16 doubles]).

Your CPU will try to help you by pre-fetching data, however, because you're always requesting data from main memory at a faster rate than the data bus can deliver, then pre-fetching may only help to reduce your wait time from say ~100 to ~60, if your lucky. Either way, waiting for main memory access is still the bottleneck.

Note that this works in the other direction too, you can fill up your caches with updated array values, but after 8MB or so you'll need to be continuously sending that data back to the main memory. So really our above estimates are optimistic.

Nitpick:

The test function has a small bug:

j < m and i < n are singed/unsigned comparisons.

Multithreaded Program slower than singlethreaded

Tags:

c++

multithreading

c++17

Theodor Johnson

1 Answers

Short answer:

More info:

Nitpick:

Elliott

Recent Activity

Donate For Us

Multithreaded Program slower than singlethreaded

Tags:

c++

multithreading

c++17

Theodor Johnson

1 Answers

Short answer:

More info:

Nitpick:

Elliott

Related questions

Recent Activity

Donate For Us