
When performing a calculation - how many threads should I open?

I am writing a program that performs some long computation, which I can break into as many tasks as I want. For the sake of discussion, let's suppose I am writing an algorithm to determine whether a number p is prime by trying to divide it by every number between 2 and p-1. This task can obviously be split across many threads.

I actually wrote a sample app that does just that. As parameters, I give it the number I want to check and the number of threads to use (each thread gets an equal-sized range of numbers to try dividing p by; together they cover the entire range).

My machine has 8 cores. I started running the program with a large number that I know is prime (2971215073), using 1, 2, 3 threads and so on up to 8 - each time the program ran faster than before, which was what I expected. However, when I used more than 8 threads, the computation time actually kept getting smaller (even if only by a little)!

There's no I/O or anything like that in my threads, just pure CPU computation. I was expecting the run time to get worse past 8 threads, since there would be more context switching while the number of threads actually running in parallel stays at 8. It is hard to say where the peak is, as the differences are very small and change from one run to another, but it is clear that, e.g., 50 threads somehow run faster than 8 (by ~300 ms)...

My guess is that since I have so many threads, I get more running time because my threads make up a larger portion of the system's ready queue, so they get selected more often. However, it doesn't seem to make sense that the more threads I create, the faster the program runs (otherwise why doesn't everyone create 1000 threads??).

Can anyone offer an explanation, and perhaps a best practice for how many threads to create relative to the number of cores on the machine?

Thanks.


My code, for whoever's interested (compiled on Windows with VS2012):

#include <Windows.h>
#include <iostream>
#include <thread>
#include <vector>

using namespace std;

// Work item handed to each thread: the candidate plus that thread's
// half-open divisor range [rangeStart, rangeEnd).
struct param_t
{
    unsigned int primeCandidate;
    unsigned int rangeStart;
    unsigned int rangeEnd;
};


// Thread entry point: scan the range and return 1 as soon as a divisor
// of primeCandidate is found, 0 otherwise.
DWORD WINAPI isDivisible(LPVOID p)
{
    param_t* param = reinterpret_cast<param_t*>(p);

    for (unsigned int d = param->rangeStart; d < param->rangeEnd; ++d)
    {
        if (param->primeCandidate % d == 0)
        {
            cout << param->primeCandidate << " is divisible by " << d << endl;
            return 1;
        }
    }

    return 0;
}

bool isPrime(unsigned int primeCandidate, unsigned int numOfCores)
{
    vector<HANDLE> handles(numOfCores);
    vector<param_t> params(numOfCores);
    for (unsigned int i = 0; i < numOfCores; ++i)
    {
        // Give each thread an equal-sized slice of [2, primeCandidate).
        params[i].primeCandidate = primeCandidate;
        params[i].rangeStart = static_cast<unsigned int>((primeCandidate - 2) * (static_cast<double>(i) / numOfCores)) + 2;
        params[i].rangeEnd = static_cast<unsigned int>((primeCandidate - 2) * (static_cast<double>(i + 1) / numOfCores)) + 2;
        HANDLE h = CreateThread(nullptr, 0, isDivisible, &params[i], 0, nullptr);
        if (NULL == h)
        {
            cout << "ERROR creating thread: " << GetLastError() << endl;
            throw exception();
        }
        handles[i] = h;
    }

    // Note: WaitForMultipleObjects can wait on at most MAXIMUM_WAIT_OBJECTS (64) handles.
    DWORD ret = WaitForMultipleObjects(numOfCores, handles.data(), TRUE, INFINITE);
    if (WAIT_FAILED == ret)
    {
        cout << "ERROR waiting on threads: " << GetLastError() << endl;
        throw exception();
    }

    bool prime = true;
    for (unsigned int i = 0; i < numOfCores; ++i)
    {
        DWORD exitCode = 0;
        if (0 == GetExitCodeThread(handles[i], &exitCode))
        {
            cout << "Failed to get thread's exit code: " << GetLastError() << endl;
            throw exception();
        }

        if (1 == exitCode)
        {
            prime = false;  // some thread found a divisor
        }

        CloseHandle(handles[i]);  // don't leak the thread handles
    }

    return prime;
}

int main()
{
    unsigned int primeCandidate = 1;
    unsigned int numOfCores = 1;

    cout << "Enter prime candidate: ";
    cin >> primeCandidate;
    cout << "Enter # of cores (0 means all): ";
    cin >> numOfCores;
    while (primeCandidate > 0)
    {
        if (0 == numOfCores) numOfCores = thread::hardware_concurrency();

        DWORD start = GetTickCount();
        bool res = isPrime(primeCandidate, numOfCores);
        DWORD end = GetTickCount();
        cout << "Time: " << end-start << endl;
        cout << primeCandidate << " is " << (res ? "" : "not ") << "prime!" << endl;

        cout << "Enter prime candidate: ";
        cin >> primeCandidate;
        cout << "Enter # of cores (0 means all): ";
        cin >> numOfCores;
    }

    return 0;
}
asked Jun 15 '13 by E.K.


2 Answers

Yes. Here is a small extract of some tests I did on my i7/Vista 64 box (4 'real' cores + hyperthreading):

8 tests,
400 tasks,
counting to 10000000,
using 8 threads:
Ticks: 2199
Ticks: 2184
Ticks: 2215
Ticks: 2153
Ticks: 2200
Ticks: 2215
Ticks: 2200
Ticks: 2230
Average: 2199 ms

8 tests,
400 tasks,
counting to 10000000,
using 32 threads:
Ticks: 2137
Ticks: 2121
Ticks: 2153
Ticks: 2138
Ticks: 2137
Ticks: 2121
Ticks: 2153
Ticks: 2137
Average: 2137 ms

..showing that, as in your tests, 'over-subscribing' threads results in a marginal 2-3% improvement in overall execution time. My tests submitted simple 'count up an integer' CPU-intensive tasks to a thread pool with varying numbers of threads.
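(For reference, a rough C++ re-creation of that kind of test might look like the following - this is an illustrative sketch, not the actual harness used above, and it spawns raw std::threads instead of using a pool:)

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// One CPU-bound task: count to 10,000,000. The volatile local keeps the
// compiler from collapsing the loop into a single assignment.
void countTask()
{
    volatile unsigned long long n = 0;
    for (unsigned long long i = 0; i < 10000000ULL; ++i) ++n;
}

// Run numTasks tasks spread across numThreads threads; return elapsed ms.
long long runTest(unsigned numThreads, unsigned numTasks)
{
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t)
    {
        // Distribute the tasks as evenly as possible across the threads.
        unsigned tasks = numTasks / numThreads + (t < numTasks % numThreads ? 1 : 0);
        threads.emplace_back([tasks] { for (unsigned j = 0; j < tasks; ++j) countTask(); });
    }
    for (auto& th : threads) th.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

int main()
{
    for (unsigned threads : {8u, 32u})
        std::cout << threads << " threads: " << runTest(threads, 400) << " ms\n";
}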

My conclusion at the time was that the minor improvement came from the larger number of threads taking up a larger percentage of the 'base load' on my box - the 1-4% of load generated by the 1000-odd nearly-always-idle threads of Firefox, uTorrent, Word, the taskbar, etc. that happened to run a bit during the tests.

It would appear that, in my test, the context-switching overhead of using, say, 64 threads instead of 8 is negligible and can be ignored.

This only applies when the data used by the tasks is very small, though. I later repeated a similar batch of tests where each task used an 8K array - the size of the L1 cache. In that 'worst case' scenario, using more threads than cores caused a very noticeable slowdown: at 16 threads and above, performance dropped by 40% as the threads swapped the whole cache in and out. Above about 20 threads the slowdown got no worse, since no matter how many threads ran the tasks, the cache still got swapped out of every core at the same rate.
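(A sketch of the kind of task that shows this effect - again illustrative, with made-up sizes and thread counts, not the original test code:)

#include <chrono>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// Each thread repeatedly walks its own ~8 KB buffer. With more threads
// than cores, the per-thread working sets keep evicting each other.
void cacheTask(const std::vector<int>& buf, int iterations, long long& out)
{
    long long sum = 0;
    for (int it = 0; it < iterations; ++it)
        for (int v : buf) sum += v;
    out = sum;  // keep the result live so the loops aren't optimized out
}

int main()
{
    const unsigned numThreads = 16;                       // try 4, 8, 16, 32...
    const std::size_t bufInts = 8 * 1024 / sizeof(int);   // ~8 KB of ints per thread

    std::vector<std::vector<int>> buffers(numThreads, std::vector<int>(bufInts, 1));
    std::vector<long long> results(numThreads, 0);

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < numThreads; ++i)
        threads.emplace_back(cacheTask, std::cref(buffers[i]), 100000, std::ref(results[i]));
    for (auto& t : threads) t.join();
    auto end = std::chrono::steady_clock::now();

    std::cout << numThreads << " threads: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms" << std::endl;
}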

Note also that I had plenty of RAM and so very few page faults.

answered by Martin James


You're assuming that each thread has an equal amount of work to perform, which may not actually be the case. What you should look at is the exit time of each of your threads. If one or more of them exit significantly earlier than the rest, it will make sense that adding more threads speeds things up: when one stops early, a core is left unused, and having extra threads breaks up the load more fairly.
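(One quick, hypothetical way to check this in the question's program: add a DWORD exitTick field to param_t - not in the original code - and have each worker stamp it on the way out, then compare the values after the join:)

// Variant of the question's worker that records its finish time.
// Assumes param_t has gained a `DWORD exitTick;` field (an addition,
// not part of the original code).
DWORD WINAPI isDivisibleTimed(LPVOID p)
{
    param_t* param = reinterpret_cast<param_t*>(p);
    DWORD result = 0;

    for (unsigned int d = param->rangeStart; d < param->rangeEnd; ++d)
    {
        if (param->primeCandidate % d == 0)
        {
            result = 1;  // found a divisor - this thread exits early
            break;
        }
    }

    param->exitTick = GetTickCount();  // stamp the finish time
    return result;
}

If the stamped times differ a lot between threads, the fixed equal-sized split is leaving cores idle towards the end of the run.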

There are several reasons why each thread may take a different amount of time. I don't know the underlying instruction timings of your code, but they may well be variable (integer division, for instance, can take a different number of cycles depending on the operands). It's also likely that each thread benefits differently from CPU optimizations such as branch prediction. One thread may simply lose its timeslice to the OS, or be momentarily stalled waiting on memory. Suffice to say there are numerous factors that could make one slower than another.

Which count is best is hard to say. In general you'd like to keep the CPUs loaded, so you are generally correct about N threads for N cores. However, be aware of things like hyperthreading, where you don't actually have extra cores - unless you have a lot of memory access, which you don't, hyperthreading will just get in the way. On AMD's newer chips there are half as many FPUs as integer cores, so your integer instructions are fine, but floating point could stall.

If you wish to keep every CPU loaded, the only way to really do it is with a job-based framework. Break your calculation into smaller units (as you do), but still have only one thread per core. As a thread finishes its current job, it takes the next available job. This way it doesn't matter if some jobs are longer or shorter; the freed-up CPUs just move on to the next job.
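(A minimal sketch of that idea applied to the primality test, using C++11 threads instead of the Windows API - the function name and chunk size here are illustrative:)

#include <algorithm>
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Job-based primality test: N threads (N = core count) pull fixed-size
// chunks of the divisor range from a shared atomic counter, so a thread
// that finishes a chunk early simply grabs the next one.
bool isPrimeJobBased(unsigned int p)
{
    const unsigned long long chunkSize = 1 << 16;   // divisors per job
    std::atomic<unsigned long long> nextStart(2);
    std::atomic<bool> divisorFound(false);

    auto worker = [&] {
        while (!divisorFound.load(std::memory_order_relaxed))
        {
            unsigned long long start = nextStart.fetch_add(chunkSize);
            if (start >= p) return;  // no jobs left
            unsigned long long end = std::min(start + chunkSize,
                                              static_cast<unsigned long long>(p));
            for (unsigned long long d = start; d < end; ++d)
                if (p % d == 0) { divisorFound = true; return; }
        }
    };

    unsigned int cores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> threads;
    for (unsigned int i = 0; i < cores; ++i) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return !divisorFound;
}

int main()
{
    std::cout << std::boolalpha << isPrimeJobBased(2971215073u) << std::endl;  // prints true
}

Because the chunks are small, a thread that draws a cheap chunk just goes back for another one, so the cores stay busy until the whole range is done.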

This of course only makes sense if the calculation is long. If the total time is only a few seconds, the overhead of the jobs might cause a slight slowdown, but even starting at 4-5 seconds you should see gains. Also, make sure you turn off CPU frequency scaling when doing small timing tests; otherwise the ramp-up/down of each CPU's clock will basically give you random results.

answered by edA-qa mort-ora-y