
16 cores, yet performance plateaus when computing inner product with >= 4 threads. What's happening?

I'd greatly appreciate anyone's insight on this question. My machine has 16 cores (Intel i9-12900HX) and I want to better understand how to directly leverage them.

To do this, I ran a simple experiment that computes the inner product of two 1-D vectors of doubles, where each vector has $10^9$ elements.

First, I compute the inner product using a simple loop (i.e. in series), and then I compute it in parallel using threads, first by using 1 thread, then by using 2 threads,...,up to and including 8 threads.

Here are the results I obtained for one execution instance of the calculations. The numbers are similar from one execution instance to the next (the self-contained code is below):

 N Threads   Plel DotProd   Plel Time(ms)   Series DotProd   Series Time(ms)

     1       166666666.67      714.55        166666666.67        694.82
     2       166666666.67      416.97
     3       166666666.67      387.20
     4       166666666.67      316.38
     5       166666666.67      315.09
     6       166666666.67      293.89
     7       166666666.67      286.43
     8       166666666.67      277.68

As a check, notice that all the dot product results are the same, and the time for the series calculation (top row, far right) is in line with the time for the "parallel" calculation using only 1 thread (top row, middle). For the parallel calculations in general, though, the speedup seems reasonable going from 1 to 2 threads, weaker than I expected going from 2 to 3 and from 3 to 4, and there is no material benefit beyond 4 threads.
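
To put a number on the plateau: defining the speedup as $S(N) = T(1)/T(N)$, the measured value at 8 threads is $S(8) = 714.55 / 277.68 \approx 2.6$, far short of the ideal factor of 8.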

Are all these threads running on only 3 (maybe 4) of the 16 cores? Is it possible to know how many cores are actually being used, or to guarantee how many cores will be used in a calculation? Does the problem lie elsewhere? Have I not accounted for something? Thanks very much in advance for anyone's thoughts. Here's the code (it takes about 6 seconds to run on my ThinkPad):

#include <vector>
#include <thread>
#include <chrono>
#include <iostream>
#include <iomanip>

using std::chrono::high_resolution_clock;
using std::chrono::duration;

//The functor being parallelized
void innerProduct(const int  threadID,
                  const long start,
                  const long end,
                  const std::vector<double>& a,
                  const std::vector<double>& b,
                  std::vector<double>& innerProductResult //ith element holds result from ith thread
                 )
{
    innerProductResult[threadID] = 0.0;
    for (long i = start; i < end; i++)
        innerProductResult[threadID] += a[i] * b[i];
}


int main(void)
{
    //Create and populate
    const int DIMENSION = 1'000'000'000;
    std::vector<double> a(DIMENSION);
    std::vector<double> b(DIMENSION);

    //Use values that won't explode the result or invoke any multiplication optimizations
    for (int i = 0; i < DIMENSION; i++)
    {
        double arg = i / (double)DIMENSION;
        a[i] = arg;
        b[i] = 1.0 - arg;
    }

    //Compute inner product in SERIES
    double seriesInnerProduct = 0.0;
    auto t1 = high_resolution_clock::now();

    for (int i = 0; i < DIMENSION; i++)
        seriesInnerProduct += a[i] * b[i];

    auto t2 = high_resolution_clock::now();
    duration<double, std::milli> millisecs_duration = t2 - t1;   //milliseconds as double
    const double seriesTime = millisecs_duration.count();

    //Compute inner product in PARALLEL
    const int MAX_NUMBER_OF_THREADS = 8;
    std::vector<double> parallelTimings(MAX_NUMBER_OF_THREADS);
    std::vector<double> parallelInnerProducts(MAX_NUMBER_OF_THREADS, 0.0); //initialize elements to zero

    //Compute the inner product using different numbers of threads
    for (int numThreadsUsed = 1; numThreadsUsed <= MAX_NUMBER_OF_THREADS; numThreadsUsed++)
    {
        //Create batches for subjobs
        const long batch_size = DIMENSION / numThreadsUsed;   //but there may be a remainder, so...
        const long sizeOfLastBatch = batch_size + DIMENSION % numThreadsUsed; //include any remainder

        std::vector<double> innerProductByThread(numThreadsUsed);
        std::vector< std::thread > threads(numThreadsUsed);

        t1 = high_resolution_clock::now();

        //Spread the work over the number of threads being used
        for (int i = 0; i < numThreadsUsed; i++)
        {
            long start = i * batch_size;
            long end = start + (i < numThreadsUsed - 1 ? batch_size : sizeOfLastBatch);

            threads[i] = std::thread(innerProduct, i, start, end, std::ref(a), std::ref(b), std::ref(innerProductByThread));
        }

        // Wait for everyone to finish   
        for (int i = 0; i < numThreadsUsed; i++)
            threads[i].join();

        //Aggregate the (sub) inner product results from the threads used
        for (int i = 0; i < numThreadsUsed; i++)
            parallelInnerProducts[numThreadsUsed - 1] += innerProductByThread[i];

        t2 = high_resolution_clock::now();
        millisecs_duration = t2 - t1;

        parallelTimings[numThreadsUsed - 1] = millisecs_duration.count();
    }

    //Output to screen
    std::cout << "   Num      Plel        Plel     Series     Series " << std::endl;
    std::cout << " Threads  DotProd     Time(ms)   DotProd   Time(ms)" << std::endl << std::endl;

    for (int numThreadsUsed = 1; numThreadsUsed <= MAX_NUMBER_OF_THREADS; numThreadsUsed++)
    {
        std::cout << std::setw(5) << numThreadsUsed
                  << std::fixed << std::setw(14) << std::setprecision(2) << parallelInnerProducts[numThreadsUsed - 1]
                  << std::fixed << std::setw(9) << std::setprecision(2) << parallelTimings[numThreadsUsed - 1];

        if (numThreadsUsed == 1)
            std::cout << std::fixed << std::setw(13) << std::setprecision(2) << seriesInnerProduct
                      << std::setw(8) << std::setprecision(2) << seriesTime;
        
        std::cout << std::endl;

            
    }
  
  
    std::cout << std::endl << "Press Enter to exit...";
    std::cin.get();
    return 0;
}
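
As a side note on the "is it possible to know or guarantee how many cores are used" part of the question: the standard library only reports a hint via std::thread::hardware_concurrency(), and pinning a thread to a particular core is platform-specific. A minimal Linux-only sketch (pthread_setaffinity_np assumes a glibc/pthreads build, and the core index 3 is an arbitrary choice for illustration):

#include <pthread.h>   //pthread_setaffinity_np (Linux/glibc specific)
#include <sched.h>     //cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>
#include <iostream>

int main()
{
    //Number of hardware threads the implementation reports (0 means "unknown")
    std::cout << "hardware_concurrency() = "
              << std::thread::hardware_concurrency() << std::endl;

    std::thread worker([] { /* ...thread work... */ });

    //Pin the worker to core 3 (arbitrary choice, for illustration only)
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(3, &cpus);
    if (pthread_setaffinity_np(worker.native_handle(), sizeof(cpu_set_t), &cpus) != 0)
        std::cerr << "pthread_setaffinity_np failed" << std::endl;

    worker.join();
    return 0;
}
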
asked by RickB88

1 Answer

Below are results and timings that explain what is happening in the problem described in the question above. I present them in the order I investigated them, and I hope folks find this helpful.

In what follows are timings of a dot product of 2 vectors, each of length $10^7$. Both std::thread() and OpenMP were used to run the tests. For both libraries, the work assigned to each thread was repeated an additional number of times using an outer FOR loop, in order to investigate behavior that depends on runtime, as follows:

The parallelized functor for std::thread() is modified relative to the prior post so that it now contains the extra FOR loop, and it looks like this (with iterations hardwired here to 1,000 although the number varies across tests):

void innerProduct2(const int  threadID,
                   const long start,
                   const long end,
                   double* a,
                   double* b,
                   std::vector<double>& innerProductResult
                   )
{
    double res = 0.0;
              
    for (int j = 0; j < 1000; j++) //the (new) outer for  
        for (long i = start; i < end; i++) 
            res += a[i] * b[i];
              
    innerProductResult[threadID] = res;
}    

Note that the local "res" variable in the above slice is used so that each thread accumulates into its own local variable rather than repeatedly writing to adjacent elements of innerProductResult, which would thrash (falsely share) cache lines in the L1 cache.
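
For comparison, another common way to avoid that problem (not used in these tests) is to pad the per-thread result slots so that adjacent slots can never share a cache line. A minimal sketch, assuming C++17 and a 64-byte cache line:

#include <vector>

//Each per-thread accumulator gets its own (assumed 64-byte) cache line,
//so concurrent writes by different threads cannot falsely share a line
struct alignas(64) PaddedDouble {
    double value = 0.0;
};

void innerProductPadded(const int threadID,
                        const long start,
                        const long end,
                        const double* a,
                        const double* b,
                        std::vector<PaddedDouble>& results) //one padded slot per thread
{
    double res = 0.0;                  //still accumulate locally, as above
    for (long i = start; i < end; i++)
        res += a[i] * b[i];
    results[threadID].value = res;     //single write into the padded slot
}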

And when using OpenMP, the modified slice looks like this:

omp_set_dynamic(0);
omp_set_num_threads(numThreadsUsed);
                    
double result = 0;
#pragma omp parallel
{
    #pragma omp for reduction(+:result)
              
    for (int j = 0; j < 1000; j++) //the (new) for loop 
        for (int i = 0; i < DIMENSION; i++) 
            result += a[i] * b[i];                  
    
}    

In the graph below of Timing Benchmarks, some of the curves use std::thread() denoted by "THD" in the legend, and others use OpenMP, indicated by "OMP" in the legend. The number of times the outer FOR loop was executed is also indicated in the legend: "x10^2 OMP" means OpenMP was used and the dot product (for each thread) was run a total of 100 times before returning its sub-result.

FIGURE 1:

[Image: benchmark parallel timings of the dot product]

The curves are normalized as follows: The points on any given THD curve are normalized relative to that curve's respective time for a single thread. Each THD curve (and the 1/N theoretical curve) therefore all start at the point (1,1).

To compare OMP (OpenMP) curves to THD curves, each OMP curve is normalized by the single thread value of the corresponding THD curve. So the time measurements on the "x1 OMP" curve are normalized by the single thread time of the "x1 THD" curve, and the "x10 OMP" points are normalized by the single thread time of the "x10 THD" curve, etc. Some curves therefore don't begin exactly at (1,1). This keeps the plateaus between corresponding THD and OMP curves comparable to one another and avoids the bias that would otherwise arise from normalizing with the slightly different respective single thread times they have. (It turns out not to make much of a difference.)
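
In symbols (notation mine): writing $t^{THD}_k(N)$ and $t^{OMP}_k(N)$ for the times measured with $N$ threads and repetition factor $k$, each plotted value is $t^{THD}_k(N) / t^{THD}_k(1)$ on a THD curve and $t^{OMP}_k(N) / t^{THD}_k(1)$ on the corresponding OMP curve.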

The top curve is OpenMP with no extra iterations (x1 OMP). It shows no scaling and is similar to the behavior I noted in the post above when operating on vectors of length $10^9$. For this case (x1 OMP), overhead is swamping the scaling because the calculations are too short in duration on each thread. It is only after artificially inflating the runtime with the outer FOR loop that the (approximate) 1/N scaling starts to emerge -- these OMP curves appear just above the theoretical 1/N curve. This behavior helps put my prior observation in context and makes better sense. OpenMP ultimately produces a small shelf or plateau above the 1/N curve as N increases.
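
To get a feel for how large that per-thread overhead is on a particular machine, one can time empty std::thread create/join cycles (the repetition count below is arbitrary, and the result is machine-dependent):

#include <thread>
#include <chrono>
#include <iostream>

int main()
{
    using clock = std::chrono::high_resolution_clock;
    const int REPS = 1000;        //arbitrary number of empty spawn/join cycles

    auto t1 = clock::now();
    for (int i = 0; i < REPS; i++)
    {
        std::thread t([] {});     //empty task: only the spawn/join cost is measured
        t.join();
    }
    auto t2 = clock::now();

    std::chrono::duration<double, std::micro> us = t2 - t1;
    std::cout << "average create+join overhead: " << us.count() / REPS << " us" << std::endl;
    return 0;
}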

The THD curves are still a bit annoying though because of their elevated plateau. I also observed this plateau with preliminary tests in my prior post. As with the OMP curves, the shapes of the THD curves do smooth out and stabilize as the number of outer FOR iterations increases, but their plateau remains almost triple that of the OMP curves. What can be said, however, is that the plateau is not arising because of a nonlocal variable in the functor inducing cache line thrashing in the L1 cache.

However, additional insight is gained if the outer FOR loop in the OpenMP code slice is moved so that the slice now looks like this:

omp_set_dynamic(0);
omp_set_num_threads(numThreadsUsed);
                    
double result = 0;

for (int j = 0; j < 1000; j++) //MOVED for loop
{
    #pragma omp parallel
    {
        #pragma omp for reduction(+:result)
              
        //for (int j = 0; j < 1000; j++) //the (new) for loop 
        for (int i = 0; i < DIMENSION; i++) 
            result += a[i] * b[i];                  
     } 
}    

The above slice runs 1,000 parallelized dot product calculations in a row, rather than parallelizing 1,000 repeated dot products. Here are the results:

FIGURE 2:

[Image: normalized timings for OpenMP before and after moving the FOR loop, compared with std::thread()]

The lowest curve corresponds to the OpenMP calculation prior to moving the FOR loop, and also corresponds to the lower plateau in FIGURE 1. But the OpenMP calculation after the FOR loop is moved now exhibits the same plateau as in the curve generated from std::thread().

This suggests that the extra overhead introduced by moving the FOR loop may be of the same type as the overhead present in the std::thread() approach. Perhaps, even though no non-local variable is causing excessive L1 cache line thrashing in the std::thread() functor, there is still a good deal of cache refreshing: each pass of the outer FOR loop streams through every index of the thread's slice before the next pass begins, so the data must be re-fetched into cache on every pass. Could this be enough overhead to explain the std::thread() plateau?

To find out, the order of the FOR loops in the functor can be interchanged. This shows to what degree that extra refreshing has an effect, because after the switch the artificial repetitions of the exercising FOR run while the current vector elements are still in cache, before moving on to the next indices, which reduces the number of cache updates. To this end, the following modified functor with the interchanged FOR loops is used:

void innerProduct2(const int  threadID,
                   const long start,
                   const long end,
                   double* a,
                   double* b,
                   std::vector<double>& innerProductResult
                   )
{
    double res = 0.0;

    for (long i = start; i < end; i++)  //the index FOR is moved here
        for (int j = 0; j < 1000; j++)  //the exercising FOR is now innermost
        //for (long i = start; i < end; i++)
            res += a[i] * b[i];

    innerProductResult[threadID] = res;
}

And FIGURE 3 includes the output of this modified functor slice along with the same information in FIGURE 2 for comparison:

FIGURE 3:

[Image: timings for the functor with the interchanged FOR loops, together with the curves from FIGURE 2]

That did the trick. The timings for the updated functor appear as the lowest curve in the graph. The plateau is gone.

Conclusions:

  1. The original plateaus I observed (in the question and early tests) were being caused by excessively short runtimes on each thread, for both OpenMP and std::thread(). They were not (in this case) being caused by cache line thrashing in L1.

  2. To investigate the plateaus, the execution time on each thread was lengthened, for both libraries, with an outer FOR loop. OpenMP handled this well, but std::thread(), despite using a local accumulator variable, did not: each pass of the outer FOR loop in the functor repeated the cache updates needed to traverse the thread's slice of the vectors, and these repeated updates were significant enough to produce a noticeable plateau (although this only became apparent in (3) below).

  3. Moving the FOR loop outside the parallel region in OpenMP made for less efficient execution and reproduced the plateau generated by std::thread(), which hinted that the remaining cache updates in std::thread() might be the source of the plateau. This prompted interchanging the FOR loops in the std::thread() functor, which removed the plateau.

  4. It's interesting that OpenMP was able to handle the outer FOR loop very well without me having to interchange the FORs as I needed to do in the functor slice. Any insight on this from anyone?

Understanding what is going on with the L1 cache is essential to successful parallelization. Throwing threads at a problem is not the way forward. By the way, I did not refer in this post to all the side streets I investigated regarding CPU cache management that everyone's suggestions pointed me to, so thank you very much! In the end, I hope that some found this investigation useful and not too basic.

answered by RickB88