Associativity gives us parallelizability. But what does commutativity give?

Alexander Stepanov notes in one of his brilliant lectures at A9 (highly recommended, by the way) that the associative property gives us parallelizability – an extremely useful and important trait these days that the compilers, CPUs and programmers themselves can leverage:

// expressions in parentheses can be done in parallel
// because matrix multiplication is associative
Matrix X = (A * B) * (C * D);
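For instance, with OpenMP sections the two bracketed products could literally run on different threads (just a sketch, with Matrix standing for whatever matrix type is in use, as above):

Matrix AB, CD;
#pragma omp parallel sections
{
    #pragma omp section
    AB = A * B;      // one thread computes A * B
    #pragma omp section
    CD = C * D;      // another thread computes C * D
}
Matrix X = AB * CD;  // combine the two independent results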

But what, if anything, does the commutative property give us? Reordering? Out of order execution?

asked Feb 16 '16 at 21:02 by Lmn

People also ask

What is the difference between commutativity and associativity?

The commutative property deals with the order of certain mathematical operations. For a binary operation, we can express it as a + b = b + a. On the other hand, the associative property deals with the grouping of numbers in an operation.

Does associativity follow from commutativity?

Associativity is not the same as commutativity, which addresses whether the order of two operands affects the result. For example, the order does not matter in the multiplication of real numbers, that is, a × b = b × a , so we say that the multiplication of real numbers is a commutative operation.

Does commutativity imply associativity, or vice versa?

A bonus: people often mistakenly believe that commutativity implies associativity, but that's in fact not true! Consider a∗b = ab + 1. This is obviously commutative, but it isn't associative: (a∗b)∗c = (ab+1)c + 1 = abc + c + 1, whereas a∗(b∗c) = a(bc+1) + 1 = abc + a + 1, so the two differ whenever a ≠ c.

What does the associative property tell us?

The associative property is a math rule that says that the way in which factors are grouped in a multiplication problem does not change the product.


1 Answer

Here is a more abstract answer with less emphasis on instruction level parallelism and more on thread level parallelism.

A common objective in parallelism is to reduce a collection of values to a single result. A simple example is the dot product of two arrays:

for(int i=0; i<N; i++) sum += x[i]*y[i];

If the operation is associative then we can have each thread calculate a partial sum. The final sum is then the sum of the partial sums.

If the operation is also commutative, the partial sums can be combined in any order. Otherwise they have to be combined in thread order.

One problem is that multiple threads can't write to the final sum at the same time, otherwise it creates a race condition; when one thread writes to the final sum, the others have to wait. Summing in any order can therefore be more efficient, because the threads rarely finish in thread order.


Let's choose an example. Let's say there are two threads and therefore two partial sums.

If the operation is commutative we could have this case

thread2 finishes its partial sum
sum += thread2's partial sum
thread2 finishes writing to sum   
thread1 finishes its partial sum
sum += thread1's partial sum

However if the operation does not commute we would have to do

thread2 finishes its partial sum
thread2 waits for thread1 to write to sum
thread1 finishes its partial sum
sum += thread1's partial sum
thread2 waits for thread1 to finish writing to sum    
thread1 finishes writing to sum   
sum += thread2's partial sum
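In code, the commutative case could look something like this (just a sketch, assuming x, y and N are defined as in the dot product above): each thread accumulates a private partial sum and then adds it to sum inside a critical section, in whatever order the threads happen to finish.

float sum = 0;
#pragma omp parallel
{
    float sum_partial = 0;
    #pragma omp for schedule(static) nowait
    for(int i=0; i<N; i++) sum_partial += x[i]*y[i];
    // the critical section prevents a race on sum; because + commutes,
    // it doesn't matter which thread gets there first
    #pragma omp critical
    sum += sum_partial;
}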

Here is an example of the dot product with OpenMP's reduction clause:

#pragma omp parallel for reduction(+: sum)
for(int i=0; i<N; i++) sum += x[i]*y[i];

The reduction clause assumes the operation (+ in this case) is commutative. Most people take this for granted.

If the operation is not commutative we would have to do something like this

#include <omp.h>   // for omp_get_num_threads()

float sum = 0;
#pragma omp parallel
{
    float sum_partial = 0;
    #pragma omp for schedule(static) nowait
    for(int i=0; i<N; i++) sum_partial += x[i]*y[i];
    // one iteration per thread; the ordered construct forces the
    // updates of sum to happen in increasing thread order
    #pragma omp for schedule(static) ordered
    for(int i=0; i<omp_get_num_threads(); i++) {
        #pragma omp ordered
        sum += sum_partial;
    }
}

The nowait clause tells OpenMP not to wait for each partial sum to finish. The ordered clause tells OpenMP to only write to sum in order of increasing thread number.

This method does the final sum linearly. However, it could be done in log2(omp_get_num_threads()) steps.

For example if we had four threads we could do the reduction in three sequential steps

  1. calculate four partial sums in parallel: s1, s2, s3, s4
  2. calculate in parallel: s5 = s1 + s2 with thread1 and s6 = s3 + s4 with thread2
  3. calculate sum = s5 + s6 with thread1

That's one advantage of using the reduction clause: since it's a black box, it may do the reduction in log2(omp_get_num_threads()) steps. OpenMP 4.0 allows defining custom reductions, but it still assumes the operations are commutative, so it's not good for e.g. chained matrix multiplication. I'm not aware of an easy way with OpenMP to do the reduction in log2(omp_get_num_threads()) steps when the operations don't commute.
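One can do the pairwise combination by hand, though: store one partial result per thread and combine neighbours over log2(omp_get_num_threads()) rounds. Here is a sketch (using + and the same x, y and N as above to keep it short); it only relies on associativity, because neighbouring partials are always combined in left-to-right order.

#include <stdlib.h>   // calloc/free
#include <omp.h>

float sum = 0;
int nthreads;
float *partial;

#pragma omp parallel
{
    #pragma omp single
    {
        nthreads = omp_get_num_threads();
        partial = calloc(nthreads, sizeof(float));  // one slot per thread
    }                                               // implicit barrier here
    int tid = omp_get_thread_num();
    float sum_partial = 0;
    #pragma omp for schedule(static)
    for(int i=0; i<N; i++) sum_partial += x[i]*y[i];
    partial[tid] = sum_partial;

    // pairwise tree: each round combines partial[tid] with its right
    // neighbour partial[tid+stride], preserving left-to-right order,
    // so only associativity is needed
    for(int stride = 1; stride < nthreads; stride *= 2) {
        #pragma omp barrier
        if(tid % (2*stride) == 0 && tid + stride < nthreads)
            partial[tid] += partial[tid + stride];
    }
}
sum = partial[0];
free(partial);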

answered Sep 25 '22 at 03:09 by Z boson