I've implemented a version of PageRank in a multithreaded version. I'm running it on a 4-core Q6600. When I run it set to create 4 threads, I get:
real 6.968s
user 26.020s
sys 0.050s
When I run with 128 threads I get:
real 0.545s
user 1.330s
sys 0.040s
This makes no sense to me. The basic algorithm is a sum-reduce:
Profiling hasn't helped. I'm not sure what data would be helpful to understand my code - please just ask.
It really has me puzzled.
The advantage of having several cores is that each core can handle a different data thread simultaneously, allowing for a much quicker transfer of data at any given time. A high clock speed means faster processor.
Threads are not as fast as Cores. Hyperthreading or SMT allows tasks to be scheduled more efficiently, meaning they can use parts of the Core that are currently not actively processing a task. In a best-case scenario, Threads provide around 50% more performance compared to a physical core.
A larger number of threads or cores means that more tasks can be completed at the same time, but this does not mean that those tasks will complete faster. The more threads and cores, the better! The more you have, the faster your computer will be. This is because tasks can be split up and processed in parallel.
If there is only one thread then it is not possible to divide the processes into smaller tasks that different processors can perform. Single threaded process can run only on one processor regardless of how many processors are available. Multi-threading on a multiple CPU machine increases parallelism.
Deliberately creating more threads than processors is a standard technique used to make use of "spare cycles" where a thread is blocked waiting for something, whether that's I/O, a mutex, or something else by providing some other useful work for the processor to do.
If your threads are doing I/O then this is a strong contender for the speed-up: as each thread blocks waiting for the I/O, the processor can run the other threads until they too block for I/O, hopefully by which time the data for the first thread is ready, and so forth.
Another possible cause of the speed up is that your threads are experiencing false sharing. If you have two threads writing data to different values on the same cache line (e.g. adjacent elements of an array) then this will block the CPU whilst the cache line is transferred back and forth. By adding more threads you decrease the likelihood that they are operating on adjacent elements, and thus reduce the chance of false sharing. You can easily test this by adding extra padding to your data elements so they are each at least 64 bytes in size (the typical cache line size). If your 4-thread code speeds up, this was the problem.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With