Threading nested for loops

I've been searching for a while for a similar question, without any success. I don't know how to optimize some Cocoa code to use all available CPU cores (I don't want to use the GPU at the moment). Below is a simple sample of the kind of code I mean:

int limA = 1000;
int limB = 1000;
unsigned short tmp;
for (int i = 0; i < 10000; i++) {
  for (int a = 0; a < limA; a++) {
    for (int b = 0; b < limB; b++) {
      tmp = [[array objectAtIndex:(a*b)] unsignedShortValue];
      c_array[a*limB+b] += tmp;
    }
  }
}

Assume that array and c_array are properly initialized, etc. But as you can see, with this many iterations (10^10 in this case) the code takes some time to execute. I thought it might be possible to execute this code in a few threads, but then how do I synchronize c_array? What is the best way to improve the execution time of this kind of code in Objective-C? Maybe iterations 0-2499 of the outermost for loop could be executed in thread 1, 2500-4999 in thread 2, and so on? I know this is a naive approach, but I don't need "real-time" performance... any ideas?

asked Apr 10 '26 by zioolek

2 Answers

A few suggestions:

Do an initial pass over the array to extract all the shorts from their object wrappers:

unsigned short *tmp_array = calloc(limA * limB, sizeof(unsigned short));
int tmp_idx = 0;
for (NSNumber *num in array) {
    tmp_array[tmp_idx++] = [num unsignedShortValue];
}

This has several benefits. You go from 10^10 method calls to 10^6, your inner loop stops being opaque to the compiler (it can't "see through" method calls), and your inner loop gets smaller and more likely to fit in the instruction cache.

Try to linearize access patterns. Right now you're doing 'strided' access, since the index is being multiplied each time. If you can rearrange the data in tmp_array so that elements that are processed sequentially are also sequential in the array, you should get much better performance (since each access to the array is loading a full cache line, which is 64 bytes on most processors).
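The reordering idea can be sketched in plain C (a minimal sketch: `tmp_array` and the limits come from the question, but the function names and the one-time copy pass are mine, added for illustration):

```c
#include <stdlib.h>

/* For a fixed a, the original access tmp_array[a*b] walks memory with
 * stride a, so most of each 64-byte cache line it loads goes unused. */
unsigned long sum_strided(const unsigned short *tmp_array, int limA, int limB) {
    unsigned long sum = 0;
    for (int a = 0; a < limA; a++)
        for (int b = 0; b < limB; b++)
            sum += tmp_array[a * b];              /* stride-a access */
    return sum;
}

/* A one-time reordering pass copies the values into the order the
 * loops consume them; the hot loop then reads consecutive memory. */
unsigned long sum_linearized(const unsigned short *tmp_array, int limA, int limB) {
    unsigned short *linear = malloc((size_t)limA * limB * sizeof *linear);
    for (int a = 0; a < limA; a++)                /* paid once, up front */
        for (int b = 0; b < limB; b++)
            linear[a * limB + b] = tmp_array[a * b];

    unsigned long sum = 0;
    for (int i = 0; i < limA * limB; i++)         /* sequential, cache-friendly */
        sum += linear[i];

    free(linear);
    return sum;
}
```

Both functions compute the same result; the second trades one extra copy pass for a hot loop that the prefetcher can actually keep fed.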

Getting a benefit out of parallelism is likely to be tricky. You could try replacing the outer loop with:

dispatch_apply(10000, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t i) {
    // body of the original outer loop goes here, using i as the iteration index
});

and the += in the inner loop with OSAtomicAdd, but my suspicion is that your speed is going to be dominated by memory accesses anyway, and adding more processors to the mix will just lead to them stepping on each other's toes (i.e. processor 0 loads c_array[1500] so that it knows what to add tmp to, which actually loads the cache line covering [1500-1531], then processor 1 writes to c_array[1512], invalidating that entire cache line and forcing it to be re-read). Also, I'm pretty sure you'd need to store 32 bit values in c_array to do that, since you'd be using OSAtomicAdd32 (there's no OSAtomicAdd16).

At the very least, if you're going to parallelize, then you need to figure out how to divide the work into non-overlapping chunks of 32 elements of c_array (i.e. 64 bytes), so that you can avoid contention. Dividing up the ranges of the array should also let you avoid needing to use atomic add operations.
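The chunking arithmetic can be sketched in plain C (a minimal sketch; the helper names are mine, not from any library, and it assumes c_array holds 2-byte shorts so that 32 elements span one 64-byte cache line):

```c
#include <stddef.h>

/* 32 unsigned-short elements * 2 bytes = 64 bytes = one cache line,
 * so workers that own whole chunks never write into a line another
 * worker is writing to, and no atomic adds are needed. */
enum { CHUNK = 32 };

size_t chunk_count(size_t n) {
    return (n + CHUNK - 1) / CHUNK;               /* round up */
}

/* Each worker (e.g. one dispatch_apply iteration) handles exactly
 * the half-open range [chunk_begin, chunk_end) and nothing else. */
size_t chunk_begin(size_t chunk) {
    return chunk * CHUNK;
}

size_t chunk_end(size_t chunk, size_t n) {
    size_t end = (chunk + 1) * CHUNK;
    return end < n ? end : n;                     /* clamp the last chunk */
}
```

With GCD this would pair with something like `dispatch_apply(chunk_count(n), queue, ^(size_t chunk) { ... })`, where each block iterates only over its own `[chunk_begin(chunk), chunk_end(chunk, n))` slice of c_array.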

(edit)

Check out an0's answer for some practical suggestions for parallelizing this, rather than this discussion of why the naive parallelization won't work :)

answered Apr 13 '26 by Catfish_Man

First, follow @Catfish_Man's suggestion, except for the parallelism part.

For the parallelism, here are my ideas:

  1. The outermost loop is redundant: every one of its 10000 iterations adds the same values, so you can drop it and add 10000 * tmp instead of tmp.
  2. Since the segments of the target array being written to are strictly disjoint for different a values, the second-level loop can be safely parallelized. The same actually holds for b, but if we also parallelize over b, the unit of work left in the body becomes too small for the splitting overhead to pay off.

Code:

int limA = 1000;
int limB = 1000;
unsigned short *tmp_array = calloc(limA * limB, sizeof(unsigned short));
int tmp_idx = 0;
for (NSNumber *num in array) {
    tmp_array[tmp_idx++] = [num unsignedShortValue];
}
dispatch_apply(limA, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^(size_t a) {
    for (int b = 0; b < limB; b++) {
        c_array[a*limB+b] += 10000 * tmp_array[a*b];
    }
});
free(tmp_array);

answered Apr 13 '26 by an0