I was playing around with std::thread and something weird popped up:
#include <thread>
int k = 0;
int main() {
    std::thread t1([]() { while (k < 1000000000) { k = k + 1; }});
    std::thread t2([]() { while (k < 1000000000) { k = k + 1; }});
    t1.join();
    t2.join();
    return 0;
}
When compiling the above code with no optimizations using clang++, I got the following benchmarks:
real 0m2.377s  
user 0m4.688s  
sys  0m0.005s
I then changed my code to the following: (Now using only 1 thread)
#include <thread>
int k = 0;
int main() {
    std::thread t1([]() { while (k < 1000000000) { k = k + 1; }});
    t1.join();
    return 0;
}
And these were the new benchmarks:
real 0m2.304s
user 0m2.298s
sys  0m0.003s
Why is the code utilizing 2 threads slower than the code utilizing 1?
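Multithreading is not always faster than serial execution. Dispatching a CPU-heavy task into multiple threads won't necessarily speed it up; on the contrary, it might degrade overall performance. Imagine it like this: if you have 10 tasks and each takes 10 seconds, serial execution will take 100 seconds in total. Running them on threads only beats that if the tasks can actually run at the same time, that is, on separate cores and without waiting on each other.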
Every thread needs some overhead and system resources, so creating one has a cost of its own. Another problem is the so-called "thread explosion", when more threads are created than there are cores on the system. And having some threads wait for other threads to finish is the worst idea for multithreading.
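As a rough illustration of avoiding thread explosion, here is a minimal sketch that caps a requested thread count at the number of hardware threads. The helper name `pick_thread_count` is made up for this example. `std::thread::hardware_concurrency()` is the standard call, and it is allowed to return 0 when the value can't be determined, so the sketch assumes a fallback of 1 in that case.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: clamp a requested worker count to [1, hardware threads].
unsigned pick_thread_count(unsigned requested) {
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1; // assumption: treat "unknown" as a single hardware thread
    return std::max(1u, std::min(requested, hw));
}
```

For CPU-bound work, passing the user's requested count through something like this keeps the number of runnable threads from exceeding what the machine can execute at once.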
In many cases, multithreading gives excellent results for I/O-bound applications, because you can do multiple things in parallel rather than blocking your entire app waiting for a single I/O operation. This is also the most common case where using more threads than CPU cores is beneficial.
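A small sketch of the I/O-bound case, under the assumption that a `sleep_for` is a fair stand-in for a blocking I/O call. The names `fake_io` and `run_overlapped` are invented for illustration. Each simulated operation gets its own thread, and the function reports how long the whole batch took.

```cpp
#include <chrono>
#include <thread>
#include <vector>

// Stand-in for a blocking I/O call (assumption: a sleep models the wait).
void fake_io(std::chrono::milliseconds delay) {
    std::this_thread::sleep_for(delay);
}

// Runs `tasks` fake I/O operations, one per thread, and returns the
// elapsed wall-clock time in milliseconds.
long long run_overlapped(unsigned tasks, std::chrono::milliseconds delay) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < tasks; ++i)
        threads.emplace_back(fake_io, delay);
    for (auto& t : threads)
        t.join();
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}
```

On a typical machine, `run_overlapped(8, std::chrono::milliseconds(50))` should come back in roughly 50 ms rather than 400 ms, even with fewer than 8 cores, because sleeping threads don't compete for the CPU.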
In fact, multithreading can be slower due to the overhead of creating the threads and context switching between them. For example, a program that creates 100 threads and forces them all to wait on a single mutex can easily perform worse than its serial version.
You have two threads fighting over the same variable, k. So you are spending time where the processors say "Processor 1: Hey, do you know what value k has? Processor 2: Sure, here you go!", ping-ponging back and forth every few updates. Since k isn't atomic, there's also no guarantee that thread 2 doesn't write an "old" value of k, so that the next time thread 1 reads the value, it jumps back 1, 2, 10 or 100 steps and has to do it over again. In theory that could lead to neither of the loops ever finishing, but that would require quite a bit of bad luck. (Formally, unsynchronized writes to k from two threads are a data race, which is undefined behavior in C++.)
This should really be a comment in reply to Mats Petersson's answer, but I wanted to supply code examples.
The problem is contention for a specific resource, and also for a cache line.
Alternative 1:
#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>
static const uint64_t ITERATIONS = 10000000000ULL;
int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }
    std::vector<std::thread> threads;
    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
       threads.emplace_back([&k]() { // capture k by reference so we all use the same k.
           while (k < ITERATIONS) {
               k++;
           }
       });
    }
    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }
    return 0;
}
Here the threads contend for a single variable, performing both reads and writes, which forces it to ping-pong between cores. That contention makes the single-threaded case the most efficient.
Alternative 2:
#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>
#include <atomic>
static const uint64_t ITERATIONS = 10000000000ULL;
int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }
    std::vector<std::thread> threads;
    std::atomic<uint64_t> k{0}; // brace-init: copy-initialization of an atomic only compiles from C++17 on
    for (size_t t = 0; t < numThreads; ++t) {
       threads.emplace_back([&]() {
           // Imperfect division of labor, we'll fall short in some cases.
           for (size_t i = 0; i < ITERATIONS / numThreads; ++i) {
               k++;
           }
       });
    }
    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }
    return 0;
}
Here we divide the labor deterministically (we fall afoul of cases where numThreads is not a divisor of ITERATIONS, but it's close enough for this demonstration). Unfortunately, we still encounter contention for access to the shared element in memory.
Alternative 3:
#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>
#include <atomic>
static const uint64_t ITERATIONS = 10000000000ULL;
int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }
    std::vector<std::thread> threads;
    std::vector<uint64_t> ks(numThreads, 0); // one counter per thread; must be sized before the threads index into it
    for (size_t t = 0; t < numThreads; ++t) {
       threads.emplace_back([=, &ks]() {
           auto& k = ks[t];
           // Imperfect division of labor, we'll fall short in some cases.
           for (size_t i = 0; i < ITERATIONS / numThreads; ++i) {
               k++;
           }
       });
    }
    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
        k += ks[t];
    }
    return 0;
}
Again, this is deterministic about the distribution of the workload, and we spend a small amount of effort at the end to collate the results. However, we did nothing to ensure that the counters are laid out in a way that favors healthy CPU distribution: adjacent elements of ks can share a cache line. For that:
Alternative 4:
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>
#include <stdlib.h>
static const uint64_t ITERATIONS = 10000000000ULL;
#define CACHE_LINE_SIZE 128
int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }
    std::vector<std::thread> threads;
    std::mutex kMutex;
    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
       // std::mutex is not copyable, so it has to be captured by reference.
       threads.emplace_back([=, &k, &kMutex]() {
           alignas(CACHE_LINE_SIZE) uint64_t myK = 0;
           // Imperfect division of labor, we'll fall short in some cases.
           for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
               myK++;
           }
           // Only the final merge into the shared total needs the lock.
           std::lock_guard<std::mutex> lock(kMutex);
           k += myK;
       });
    }
    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }
    return 0;
}
Here we avoid contention between threads down to the cache line level, except for the single case at the end where we use a mutex to control synchronization. For this trivial workload, the mutex is going to have one hell of a relative cost. Alternatively, you could use alignas to provide each thread with its own storage at the outer scope and summarize the results after the joins, eliminating the need for the mutex. I leave that as an exercise to the reader.
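One possible take on that exercise, as a sketch rather than a definitive answer: give each thread its own cache-line-aligned counter at the outer scope and add them up after the joins, so no mutex is needed. The names `PaddedCounter`, `kCacheLineSize` and `sum_with_padded_counters` are invented here, the 128-byte line size is the same assumption the listing above makes, and storing an over-aligned type in a `std::vector` assumes a C++17 compiler (for aligned allocation).

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 128 bytes, matching CACHE_LINE_SIZE in the listing above.
constexpr std::size_t kCacheLineSize = 128;

// Each counter is aligned (and therefore padded) to a full cache line,
// so no two threads write to the same line.
struct alignas(kCacheLineSize) PaddedCounter {
    std::uint64_t value = 0;
};

// Hypothetical helper: each thread increments its own counter `perThread`
// times; the totals are summed only after every thread has been joined.
std::uint64_t sum_with_padded_counters(std::size_t numThreads, std::uint64_t perThread) {
    std::vector<PaddedCounter> counters(numThreads); // sized before any thread starts
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([&counters, t, perThread]() {
            std::uint64_t& myK = counters[t].value;
            for (std::uint64_t i = 0; i < perThread; ++i) {
                myK++;
            }
        });
    }
    std::uint64_t k = 0;
    for (std::size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
        k += counters[t].value; // safe to read: join() synchronizes with the thread's completion
    }
    return k;
}
```

Calling `sum_with_padded_counters(numThreads, ITERATIONS / numThreads)` from `main` would slot into the same command-line handling as the listings above.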
Seems to me like the more important question than "why didn't this work?" is "How do I get this to work?" For the task at hand, I think std::async (despite significant shortcomings) is really a better tool than using std::thread directly.
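#include <future>
#include <iostream>
#include <thread>
#include <vector>
int k = 0;
// hardware_concurrency() may return 0 if it can't tell, so fall back to 1 task.
unsigned tasks = std::thread::hardware_concurrency() ? std::thread::hardware_concurrency() : 1;
unsigned reps = 1000000000 / tasks;
int main() {
    std::vector<std::future<int>> f;
    for (unsigned i=0; i<tasks; i++)
        f.emplace_back(std::async(std::launch::async, 
                                  [](){unsigned j; for (j=0; j<reps; j++); return static_cast<int>(j);})
                      );
    for (unsigned i=0; i<tasks; i++) {
        // get() blocks until the result is ready, so no separate wait() is needed.
        k += f[i].get();
    }
    std::cout << k << "\n";
    return 0;
}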