Accelerate program with multiple processors

I have found that it is sometimes faster to divide one loop into two or more:

for (int i = 0; i < AMT; i++) {
    a[i] += c[i];
    b[i] += d[i];
}
     ||
     \/
for (int i = 0; i < AMT; i++) {
    //a[i] += c[i];
    b[i] += d[i];
}
for (int i = 0; i < AMT; i++) {
    a[i] += c[i];
    //b[i] += d[i];
}

On my desktop (Windows 7, AMD Phenom(tm) X6 1055T), the two-loop version runs faster, taking about one third less time.

But if I am dealing with assignment,

for (int i = 0; i < AMT; i++) {
    b[i] = rand()%100;
    c[i] = rand()%100;
}

dividing the assignments to b and c into two loops is no faster than doing both in one loop.

I think there are some rules the OS uses to determine whether certain code can be run on multiple processors.

I want to ask whether my guess is right, and if so, what the rules or occasions are under which multiple processors are used automatically (without thread programming) to speed up my programs.

asked Apr 02 '13 by Robert Bean


1 Answer

It is possible that your compiler is vectorizing the simpler loops. In the assembler output you would see this as the compiled program using SIMD instructions (like Intel's SSE) to process larger chunks of data than one number at a time. Automatic vectorization is a hard problem, and it's plausible that the compiler could not vectorize the loop that updates both a and b at the same time. This could partially explain why breaking the complex loop into two would be faster.
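For illustration, here is a minimal sketch of how you could check this yourself (the file name, AMT value, and array types below are my assumptions, not taken from your program). With a reasonably recent GCC, the -fopt-info-vec flag reports which loops were vectorized:

/* vec_demo.c -- a minimal sketch; compile with:
 *   gcc -O3 -fopt-info-vec -c vec_demo.c
 * and look for "loop vectorized" messages. In the disassembly
 * (objdump -d vec_demo.o) vectorized code shows SSE instructions
 * such as paddd, which adds four 32-bit integers at once. */
#include <stddef.h>

#define AMT 1000000

int a[AMT], b[AMT], c[AMT], d[AMT];

void split_loops(void)
{
    /* Each loop reads and writes only two arrays; this simple
     * pattern is an easy target for auto-vectorization. */
    for (size_t i = 0; i < AMT; i++)
        a[i] += c[i];
    for (size_t i = 0; i < AMT; i++)
        b[i] += d[i];
}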

In the "assignment" loops, each invocation to rand() depends on the output of the previous invocations, which means that vectorization is inherently impossible. Breaking the loop into two would not make it benefit from SIMD instructions like in the first case, so you wouldn't see it run any faster. Looking at the assembler code the compiler generates would tell you what optimizations the compiler performed and what instructions it used.

Even if the compiler vectorizes the loop, the program is not using more than one CPU or thread; there is no concurrency. What happens is that the one CPU there is runs the single thread of execution on multiple data points in parallel. The distinction between parallel and concurrent programming is subtle but important.
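To make that concrete, here is a sketch using SSE2 intrinsics (my own example, not necessarily what your compiler emits): a single thread issues one instruction that adds four 32-bit integers at a time.

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Adds c[] into a[] four ints at a time, all on ONE thread: data
 * parallelism, not concurrency. For brevity this assumes n is a
 * multiple of 4 and the pointers are 16-byte aligned. */
void add4(int *a, const int *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_load_si128((const __m128i *)&a[i]);
        __m128i vc = _mm_load_si128((const __m128i *)&c[i]);
        _mm_store_si128((__m128i *)&a[i], _mm_add_epi32(va, vc));
    }
}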

Cache locality might also explain why breaking the first loop into two makes it run faster, but not why breaking the "assignment" loop into two doesn't. It is possible that b and c in the "assignment" loop are small enough to fit into the cache, which would mean that the loop already has optimal performance and breaking it further brings no benefit. If this were the case, making b and c larger would force the loop to start thrashing the cache, and breaking the loop into two would then have the expected benefit.
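You could test that hypothesis with a rough timing harness like the one below (a sketch: clock() is coarse, and the size range is a guess). If cache is the limiting factor, the gap between the one-loop and two-loop versions should change once the arrays no longer fit.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    for (size_t amt = 1u << 12; amt <= 1u << 24; amt <<= 2) {
        int *b = malloc(amt * sizeof *b);
        int *c = malloc(amt * sizeof *c);
        if (!b || !c)
            return 1;

        clock_t t0 = clock();
        for (size_t i = 0; i < amt; i++) {   /* one combined loop */
            b[i] = rand() % 100;
            c[i] = rand() % 100;
        }
        clock_t t1 = clock();
        for (size_t i = 0; i < amt; i++)     /* split into two loops */
            b[i] = rand() % 100;
        for (size_t i = 0; i < amt; i++)
            c[i] = rand() % 100;
        clock_t t2 = clock();

        printf("AMT=%zu  one loop: %.3fs  two loops: %.3fs\n", amt,
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(b);
        free(c);
    }
    return 0;
}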

answered Sep 27 '22 by Joni