
Code runs 6 times slower with 2 threads than with 1

Original Problem:

So I have written some code to experiment with threads and do some testing.

The code should create some numbers and then find the mean of those numbers.

I think it is easier to just show you what I have so far. I was expecting the code to run about 2 times as fast with two threads. Measuring it with a stopwatch, I think it runs about 6 times slower! EDIT: I am now using the computer and the clock() function to measure the time.

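#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

// Adds the sum of data[start .. start + length) to *result
void findmean(std::vector<double>*, std::size_t, std::size_t, double*);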


int main(int argn, char** argv)
{

    // Program entry point
    std::cout << "Generating data..." << std::endl;

    // Create a vector containing many variables
    std::vector<double> data;
    for(uint32_t i = 1; i <= 1024 * 1024 * 128; i ++) data.push_back(i);

    // Calculate mean using 1 core
    double mean = 0;
    std::cout << "Calculating mean, 1 Thread..." << std::endl;
    findmean(&data, 0, data.size(), &mean);
    mean /= (double)data.size();

    // Print result
    std::cout << "  Mean=" << mean << std::endl;

    // Repeat, using two threads
    std::vector<std::thread> thread;
    std::vector<double> result;
    result.push_back(0.0);
    result.push_back(0.0);
    std::cout << "Calculating mean, 2 Threads..." << std::endl;

    // Run threads
    uint32_t halfsize = data.size() / 2;
    uint32_t A = 0;
    uint32_t B, C, D;
    // Split the data into two blocks
    if(data.size() % 2 == 0)
    {
        B = C = D = halfsize;
    }
    else if(data.size() % 2 == 1)
    {
        B = C = halfsize;
        D = halfsize + 1;
    }

    // Run with two threads
    thread.push_back(std::thread(findmean, &data, A, B, &(result[0])));
    thread.push_back(std::thread(findmean, &data, C, D , &(result[1])));

    // Join threads
    thread[0].join();
    thread[1].join();

    // Calculate result
    mean = result[0] + result[1];
    mean /= (double)data.size();

    // Print result
    std::cout << "  Mean=" << mean << std::endl;

    // Return
    return EXIT_SUCCESS;
}


void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
    for(uint32_t i = 0; i < length; i ++) {
        *result += (*datavec).at(start + i);
    }
}

I don't think this code is exactly wonderful; if you could suggest ways of improving it, I would be grateful for that as well.

Register Variable:

Several people have suggested making a local variable for the function 'findmean'. This is what I have done:

void findmean(std::vector<double>* datavec, std::size_t start, std::size_t length, double* result)
{
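    // Accumulate in a local variable so the loop does not write through the pointer on every iteration.
    // Note: `register` is only a hint; it was deprecated in C++11 and removed in C++17.
    register double holding = *result;
    for(uint32_t i = 0; i < length; i ++) {
        holding += (*datavec).at(start + i);
    }
    *result = holding;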
}

I can now report: with two threads, the code runs in almost the same time as with a single thread. That is a big improvement over being 6x slower, but surely there must be a way to make it nearly twice as fast?

Register Variable and O2 Optimization:

I have set the optimization to 'O2' - I will create a table with the results.

Results so far:

Original Code with no optimization or register variable: 1 thread: 4.98 seconds, 2 threads: 29.59 seconds

Code with added register variable: 1 Thread: 4.76 seconds, 2 Threads: 4.76 seconds

With register variable and -O2 optimization: 1 Thread: 0.43 seconds, 2 Threads: 0.6 seconds. 2 threads are now slower?

With Dameon's suggestion, which was to put a large block of memory in between the two result variables: 1 Thread: 0.42 seconds, 2 Threads: 0.64 seconds

With TAS's suggestion of using iterators to access contents of the vector: 1 Thread: 0.38 seconds, 2 Threads: 0.56 seconds

Same as above on Core i7 920 (single channel memory 4GB): 1 Thread: 0.31 seconds, 2 Threads: 0.56 seconds

Same as above on Core i7 920 (dual channel memory 2x2GB): 1 Thread: 0.31 seconds, 2 Threads: 0.35 seconds


asked Jun 27 '13 16:06

FreelanceConsultant

1 Answer

Why are 2 threads 6x slower than 1 thread?

You are getting hit by a bad case of false sharing.

After getting rid of the false sharing, why are 2 threads not faster than 1 thread?

You are bottlenecked by your memory bandwidth.

False Sharing:

The problem here is that each thread is accessing the result variable at adjacent memory locations. It's likely that they fall on the same cacheline so each time a thread accesses it, it will bounce the cacheline between the cores.

Each thread is running this loop:

for(uint32_t i = 0; i < length; i ++) {
    *result += (*datavec).at(start + i);
}

And you can see that the result variable is being accessed very often (each iteration). So each iteration, the threads are fighting for the same cacheline that's holding both values of result.

Normally, the compiler should put *result into a register thereby removing the constant access to that memory location. But since you never turned on optimizations, it's very likely the compiler is indeed still accessing the memory location and thus incurring false-sharing penalties at every iteration of the loop.
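A minimal sketch of one way to avoid the false sharing described here, independent of compiler flags: each thread sums into a local value and writes its result once, and the two result slots are padded onto separate cache lines. The 64-byte line size is an assumption (common on x86, not guaranteed elsewhere), and `PaddedSum`, `sum_range`, and `mean_two_threads` are hypothetical names, not the asker's code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

// Assumption: 64 bytes is the cache-line size; adjust for other targets.
struct alignas(64) PaddedSum {
    double value = 0.0;
};

// Sums data[start, start + length) locally and writes the result exactly once.
void sum_range(const std::vector<double>& data, std::size_t start,
               std::size_t length, double& result) {
    assert(start + length <= data.size());
    const auto first = data.begin() + static_cast<std::ptrdiff_t>(start);
    result = std::accumulate(first, first + static_cast<std::ptrdiff_t>(length), 0.0);
}

// Hypothetical helper: computes the mean of `data` using two threads.
double mean_two_threads(const std::vector<double>& data) {
    assert(!data.empty());
    const std::size_t half = data.size() / 2;
    PaddedSum sums[2];  // each slot starts on its own 64-byte boundary

    // The second range takes whatever is left, so odd sizes need no special case.
    std::thread first(sum_range, std::cref(data), std::size_t{0}, half,
                      std::ref(sums[0].value));
    std::thread second(sum_range, std::cref(data), half, data.size() - half,
                       std::ref(sums[1].value));
    first.join();
    second.join();

    return (sums[0].value + sums[1].value) / static_cast<double>(data.size());
}
```

The local accumulation is what removes the per-iteration traffic on the shared line; the padding only matters for the few remaining writes, which is consistent with the question's report that separating the two results made little difference once the register variable was in place.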

Memory Bandwidth:

Once you have eliminated the false sharing and gotten rid of the 6x slowdown, you see no further improvement because you've maxed out your memory bandwidth.

Sure, your processor may have 4 cores, but they all share the same memory bandwidth. Your particular task of summing up an array does very little (computational) work for each memory access. A single thread is already enough to max out your memory bandwidth. Therefore, going to more threads is not likely to get you much improvement.

In short, no, you won't be able to make summing an array significantly faster by throwing more threads at it.
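To put a number on the bandwidth argument, the measured time can be converted into an effective read rate and compared with what the memory system is rated for. This is a rough sketch: `effective_gib_per_second` is a hypothetical helper, and it assumes every element is read from main memory exactly once.

```cpp
#include <cstddef>

// Effective read rate in GiB/s for summing `count` doubles in `seconds`.
// Assumes each element is fetched from main memory exactly once.
double effective_gib_per_second(std::size_t count, double seconds) {
    const double bytes = static_cast<double>(count) * sizeof(double);
    const double gib = bytes / (1024.0 * 1024.0 * 1024.0);
    return gib / seconds;
}
```

With 8-byte doubles, the question's vector of 1024 * 1024 * 128 elements occupies 1 GiB, so the reported 0.31 seconds corresponds to roughly 3.2 GiB/s. Whether that is close to the platform's limit has to be checked against the hardware's specification or a dedicated bandwidth benchmark.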


answered Oct 12 '22 03:10

Mysticial


Tags:

c++

performance

optimization

multithreading