The code without fission looks like this:
int check(int * res, char * map, int n, int * keys){
int ret = 0;
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += map[hash(keys[i])]
}
return ret;
}
With fission:
int check(int * res, char * map, int n, int * keys){
int ret = 0;
for(int i = 0; i < n; ++i){
tmp[i] = map[hash(keys[i])];
}
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += tmp[i];
}
return ret;
}
Notes:
The bottleneck is map[hash(keys[i])]
which accesses memory randomly.
normally, it would be if(tmp[i]) res[ret++] = i;
to avoid the if, I'm using ret += tmp[i]
.
map[..]
is always 0 or 1
The fission version is usually significantly faster and I am trying to explain why. My best guess is that ret += map[..]
still introduces some dependency and that prevents speculative execution.
I would like to hear if anyone has a better explanation.
Loop fission is the opposite of loop fusion: a loop is split into two or more loops. This optimization is appropriate if the number of computations in a loop becomes excessive, leading to register spills that degrade performance. Loop fission can also come into play if a loop contains conditional statements.
Loop fusion is a type of programming technique that combines two or more loops into one, complying with principles of programming efficiency or compiler optimization. Loop fusion is also known as loop jamming.
From my tests, I get roughly 2x speed difference between the fused and split loops. This speed difference is very consistent no matter how I tweak the loop.
Fused: 1.096258 seconds
Split: 0.562272 seconds
(Refer to bottom for the full test code.)
Although I'm not 100% sure, I suspect that this is due to a combination of two things:
map[gethash(keys[i])]
.It's obvious that map[gethash(keys[i])]
will result in a cache miss nearly every time. In fact, it is probably enough to saturate the entire load-store buffer.
Now let's look at the added dependency. The issue is the ret
variable:
int check_fused(int * res, char * map, int n, int * keys){
int ret = 0;
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += map[gethash(keys[i])];
}
return ret;
}
The ret
variable is needed for address resolution of the the store res[ret] = i;
.
ret
is coming from a sure cache miss.ret
is coming tmp[i]
- which is much faster.This delay in address resolution of the fused loop case likely causes res[ret] = i
to store to clog up the load-store buffer along with map[gethash(keys[i])]
.
Since the load-store buffer has a fixed size, but you have double the junk in it:
You are only able to overlap the cache misses half as much as before. Thus 2x slow-down.
Suppose if we changed the fused loop to this:
int check_fused(int * res, char * map, int n, int * keys){
int ret = 0;
for(int i = 0; i < n; ++i){
res[i] = i; // Change "res" to "i"
ret += map[gethash(keys[i])];
}
return ret;
}
This will break the address resolution dependency.
(Note that it's not the same anymore, but it's just to demonstrate the performance difference.)
Then we get similar timings:
Fused: 0.487477 seconds
Split: 0.574585 seconds
Here's the complete test code:
#define SIZE 67108864
unsigned gethash(int key){
return key & (SIZE - 1);
}
int check_fused(int * res, char * map, int n, int * keys){
int ret = 0;
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += map[gethash(keys[i])];
}
return ret;
}
int check_split(int * res, char * map, int n, int * keys, int *tmp){
int ret = 0;
for(int i = 0; i < n; ++i){
tmp[i] = map[gethash(keys[i])];
}
for(int i = 0; i < n; ++i){
res[ret] = i;
ret += tmp[i];
}
return ret;
}
int main()
{
char *map = (char*)calloc(SIZE,sizeof(char));
int *keys = (int*)calloc(SIZE,sizeof(int));
int *res = (int*)calloc(SIZE,sizeof(int));
int *tmp = (int*)calloc(SIZE,sizeof(int));
if (map == NULL || keys == NULL || res == NULL || tmp == NULL){
printf("Memory allocation failed.\n");
system("pause");
return 1;
}
// Generate Random Data
for (int i = 0; i < SIZE; i++){
keys[i] = (rand() & 0xff) | ((rand() & 0xff) << 16);
}
printf("Start...\n");
double start = omp_get_wtime();
int ret;
ret = check_fused(res,map,SIZE,keys);
// ret = check_split(res,map,SIZE,keys,tmp);
double end = omp_get_wtime();
printf("ret = %d",ret);
printf("\n\nseconds = %f\n",end - start);
system("pause");
}
I don't think it's the array indexing, but the call to the function hash()
that may cause a pipeline stall and prevent optimal instruction reordering.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With