
What is the best way to determine the number of threads to fire off in a machine with n cores? (C++)

I have a vector<int> with 10,000,000 (10 million) elements, and my workstation has four cores. There is a function, called ThrFunc, that operates on an integer. Assume that the runtime of ThrFunc is roughly the same for each integer in the vector<int>.

How should I determine the optimal number of threads to fire off? Is the answer as simple as the number of elements divided by the number of cores? Or is there a more subtle computation?

Edit (to provide extra information):

  • No need for blocking; each function invocation needs only read-only access
asked Jan 17 '12 by Shredderroy



5 Answers

The optimal number of threads is likely to be either the number of cores in your machine or the number of cores times two.

In more abstract terms, you want the highest possible throughput. Getting the highest throughput requires the fewest contention points between the threads (since the original problem is trivially parallelizable). The number of contention points is likely to be the number of threads sharing a core, or twice that, since a core can run either one or two logical threads (two with hyperthreading).

If your workload makes use of a resource of which you have fewer than four available (ALUs on Bulldozer? Hard disk access?) then the number of threads you should create will be limited by that.

The best way to find the correct answer is, as with all hardware questions, to test and find out.
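
As a minimal sketch of such a test (ThrFunc here is a trivial stand-in, and the candidate thread counts are just examples, not anything specified in the question):

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Trivial stand-in for the real per-element work (an assumption for this sketch).
    void ThrFunc(int &x) { x = x * 2 + 1; }

    // Process the whole vector with nthreads threads, each taking a contiguous chunk,
    // and return the elapsed wall-clock time in seconds.
    double run_with(std::vector<int> &data, unsigned nthreads)
    {
        const auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        const std::size_t chunk = data.size() / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            const std::size_t begin = t * chunk;
            const std::size_t end = (t + 1 == nthreads) ? data.size() : begin + chunk;
            pool.emplace_back([&data, begin, end] {
                for (std::size_t i = begin; i < end; ++i)
                    ThrFunc(data[i]);
            });
        }
        for (auto &th : pool)
            th.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    }

    int main()
    {
        std::vector<int> data(10000000, 1);
        for (unsigned n : {1u, 2u, 4u, 8u, 16u})
            std::printf("%2u threads: %.3f s\n", n, run_with(data, n));
    }

Built with something like g++ -O2 -pthread and run a few times per thread count, this should trace out the throughput curve described above.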

answered by Borealid


Borealid's answer includes "test and find out", which is impossible to beat as advice goes.

But there's perhaps more to testing this than you might think: you want your threads to avoid contention for data wherever possible. If the data is entirely read-only, then you might see best performance if your threads are accessing "similar" data -- making sure to walk through the data in small blocks at a time, so each thread is accessing data from the same pages over and over again. If the data is completely read-only, then there is no problem if each core gets its own copy of the cache lines. (Though this might not make the most use of each core's cache.)

If the data is in any way modified, then you will see significant performance enhancements if you keep the threads away from each other, by a lot. Most caches store data along cache lines, and you desperately want to keep each cache line from bouncing among CPUs for good performance. In that case, you might want to keep the different threads running on data that is actually far apart to avoid ever running into each other.

So: if you're updating the data while working on it, I'd recommend having N or 2*N threads of execution (for N cores), giving thread M the starting point SIZE/N*M, for M from 0 through N-1 (0, 1000, 2000, 3000 for four threads and 4000 data objects). This will give you the best chance of feeding different cache lines to each core and allowing updates to proceed without cache-line bouncing:

+--------------+---------------+--------------+---------------+--- ...
| first thread | second thread | third thread | fourth thread | first ...
+--------------+---------------+--------------+---------------+--- ...

If you're not updating the data while working on it, you might wish to start N or 2*N threads of execution (for N cores), starting them at 0, 1, 2, 3, etc., and moving each one forward by N or 2*N elements with each iteration. This will allow the cache system to fetch each page from memory once, populate the CPU caches with nearly identical data, and hopefully keep each core populated with fresh data.

+-----------------------------------------------------+
| 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 ... |
+-----------------------------------------------------+
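
A minimal sketch of that strided layout, assuming a trivial read-only stand-in for ThrFunc and a per-thread results vector (both assumptions for illustration):

    #include <thread>
    #include <vector>

    // Assumed stand-in for the real read-only work on one element.
    long long ThrFunc(int x) { return x * 2; }

    // Thread tid reads elements tid, tid + nthreads, tid + 2*nthreads, ...
    // so all threads walk through the same pages at roughly the same time.
    void process_strided(const std::vector<int> &data, unsigned nthreads,
                         std::vector<long long> &results)  // pre-sized to nthreads
    {
        std::vector<std::thread> pool;
        for (unsigned tid = 0; tid < nthreads; ++tid) {
            pool.emplace_back([&data, &results, tid, nthreads] {
                long long local = 0;                       // keep writes thread-local
                for (std::size_t i = tid; i < data.size(); i += nthreads)
                    local += ThrFunc(data[i]);
                results[tid] = local;                      // single write at the end
            });
        }
        for (auto &th : pool)
            th.join();
    }

The blocked layout from the first diagram is the same code with thread M instead iterating over the contiguous range [M*SIZE/N, (M+1)*SIZE/N).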

I also recommend using sched_setaffinity(2) directly in your code to force the different threads to their own processors. In my experience, Linux aims to keep each thread on its original processor so much that it will not migrate tasks to other cores that are otherwise idle.
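
With std::thread on Linux, one way to do this is through the thread's native handle and pthread_setaffinity_np, the per-thread counterpart of sched_setaffinity(2). A Linux-specific sketch (the choice of core 0 is an arbitrary assumption):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE            // needed for cpu_set_t, CPU_ZERO, CPU_SET
    #endif
    #include <pthread.h>
    #include <sched.h>
    #include <cstdio>
    #include <thread>

    // Pin a std::thread to a single core (Linux-specific).
    void pin_to_core(std::thread &t, int core)
    {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(core, &cpuset);    // allow this thread to run on `core` only
        int rc = pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
        if (rc != 0)
            std::fprintf(stderr, "pthread_setaffinity_np failed: %d\n", rc);
    }

    int main()
    {
        std::thread worker([] { /* ... loop calling ThrFunc ... */ });
        pin_to_core(worker, 0);    // keep this worker on core 0
        worker.join();
    }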

answered by sarnold


Assuming ThrFunc is CPU-bound, you probably want one thread per core, with the elements divided between them.

If there's an I/O element to the function then the answer is more complicated, because you can have one or more threads per core waiting for I/O while another is executing. Do some tests and see what happens.

answered by Andrew Cooper


I agree with the previous comments. You should run tests to determine what number yields the best performance. However, this will only yield the best performance for the particular system you're optimizing for. In most scenarios, your program will be run on other people's machines, on the architecture of which you should not make too many assumptions.

A good way to numerically determine the number of threads to start would be to use

std::thread::hardware_concurrency()

This is part of C++11 and should yield the number of logical cores in the current system. Logical cores means either the number of physical cores, in case the processor does not support hardware threads (i.e. HyperThreading), or the number of hardware threads.

There's also a Boost function that does the same; see Programmatically find the number of cores on a machine.
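
A minimal usage sketch (the fallback value of 4 is an arbitrary assumption; the standard allows hardware_concurrency() to return 0 when the count is not computable):

    #include <iostream>
    #include <thread>

    int main()
    {
        unsigned n = std::thread::hardware_concurrency();
        if (n == 0)    // the standard permits 0 when the count cannot be determined
            n = 4;     // arbitrary fallback for this sketch
        std::cout << "launching " << n << " worker threads\n";
    }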

answered by jupp0r


The optimal number of threads should equal the number of cores; in that situation the computation capacity of each core will be fully utilized, provided the computation on each element is independent.

answered by ciphor