The code without fission looks like this: <pre class="prettyprint"><code>int check(int * res, char * map, int n, int * keys){ int ret = 0; for(int i = 0; i < n; ++i){ res[ret] = i; ret += map[hash(keys[i])] } return ret; } </code></pre> With fission: <pre class="prettyprint"><code>int check(int * res, char * map, int n, int * keys){ int ret = 0; for(int i = 0; i < n; ++i){ tmp[i] = map[hash(keys[i])]; } for(int i = 0; i < n; ++i){ res[ret] = i; ret += tmp[i]; } return ret; } </code></pre> Notes: <ul> <li>The bottleneck is <code>map[hash(keys[i])]</code> which accesses memory randomly.</li> <li>normally, it would be <code>if(tmp[i]) res[ret++] = i;</code> to avoid the if, I'm using <code>ret += tmp[i]</code>.</li> <li><code>map[..]</code> is always 0 or 1</li> </ul> The fission version is usually significantly faster and I am trying to explain why. My best guess is that <code>ret += map[..]</code> still introduces some dependency and that prevents speculative execution. I would like to hear if anyone has a better explanation.

I don't think it's the array indexing, but the call to the function <code>hash()</code> that may cause a pipeline stall and prevent optimal instruction reordering.

Why does loop fission make sense in this case?

Tags:

performance

c

optimization

memory

The code without fission looks like this:

int check(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[hash(keys[i])]
    }
    return ret;
}

With fission:

int check(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        tmp[i] = map[hash(keys[i])];
    }
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}

Notes:

The bottleneck is map[hash(keys[i])] which accesses memory randomly.
normally, it would be if(tmp[i]) res[ret++] = i; to avoid the if, I'm using ret += tmp[i].
map[..] is always 0 or 1

The fission version is usually significantly faster and I am trying to explain why. My best guess is that ret += map[..] still introduces some dependency and that prevents speculative execution.

I would like to hear if anyone has a better explanation.

235

asked Jun 20 '12 16:06

user16367

2 Answers

From my tests, I get roughly 2x speed difference between the fused and split loops. This speed difference is very consistent no matter how I tweak the loop.

Fused: 1.096258 seconds
Split: 0.562272 seconds

(Refer to bottom for the full test code.)

Although I'm not 100% sure, I suspect that this is due to a combination of two things:

Saturation of the load-store buffer for memory disambigutation due to the cache misses from map[gethash(keys[i])].
An added dependency in the fused loop version.

It's obvious that map[gethash(keys[i])] will result in a cache miss nearly every time. In fact, it is probably enough to saturate the entire load-store buffer.

Now let's look at the added dependency. The issue is the ret variable:

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[gethash(keys[i])];
    }
    return ret;
}

The ret variable is needed for address resolution of the the store res[ret] = i;.

In the fused loop, ret is coming from a sure cache miss.
In the split loop, ret is coming tmp[i] - which is much faster.

This delay in address resolution of the fused loop case likely causes res[ret] = i to store to clog up the load-store buffer along with map[gethash(keys[i])].

Since the load-store buffer has a fixed size, but you have double the junk in it:
You are only able to overlap the cache misses half as much as before. Thus 2x slow-down.

Suppose if we changed the fused loop to this:

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[i] = i;    //  Change "res" to "i"
        ret += map[gethash(keys[i])];
    }
    return ret;
}

This will break the address resolution dependency.

^{(Note that it's not the same anymore, but it's just to demonstrate the performance difference.)}

Then we get similar timings:

Fused: 0.487477 seconds
Split: 0.574585 seconds

Here's the complete test code:

#define SIZE 67108864

unsigned gethash(int key){
    return key & (SIZE - 1);
}

int check_fused(int * res, char * map, int n, int * keys){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += map[gethash(keys[i])];
    }
    return ret;
}
int check_split(int * res, char * map, int n, int * keys, int *tmp){
    int ret = 0;
    for(int i = 0; i < n; ++i){
        tmp[i] = map[gethash(keys[i])];
    }
    for(int i = 0; i < n; ++i){
        res[ret] = i;
        ret += tmp[i];
    }
    return ret;
}


int main()
{
    char *map = (char*)calloc(SIZE,sizeof(char));
    int *keys =  (int*)calloc(SIZE,sizeof(int));
    int *res  =  (int*)calloc(SIZE,sizeof(int));
    int *tmp  =  (int*)calloc(SIZE,sizeof(int));
    if (map == NULL || keys == NULL || res == NULL || tmp == NULL){
        printf("Memory allocation failed.\n");
        system("pause");
        return 1;
    }

    //  Generate Random Data
    for (int i = 0; i < SIZE; i++){
        keys[i] = (rand() & 0xff) | ((rand() & 0xff) << 16);
    }

    printf("Start...\n");

    double start = omp_get_wtime();
    int ret;

    ret = check_fused(res,map,SIZE,keys);
//    ret = check_split(res,map,SIZE,keys,tmp);

    double end = omp_get_wtime();

    printf("ret = %d",ret);
    printf("\n\nseconds = %f\n",end - start);

    system("pause");
}

131

answered Nov 16 '22 03:11

Mysticial

I don't think it's the array indexing, but the call to the function hash() that may cause a pipeline stall and prevent optimal instruction reordering.

answered Nov 16 '22 02:11

Ben Voigt

Related questions
                            
                                Triple fault in home grown kernel
                            
                                fopen does not return
                            
                                Writing to frame buffer
                            
                                How to use kernel libcrc32c (or same functions) in userspace programmes?
                            
                                Why I can't print out environment variables in gdb?
                            
                                How to implement (char-ready?) in C
                            
                                Please explain this hardcore macro that does casting and typechecking
                            
                                pinvoke: How to free a malloc'd string?
                            
                                ftok() collisions
                            
                                Why can extern be applied to definitions?
                            
                                can c/c++ do preemeptive multitasking in a single thread? [closed]
                            
                                Questions about register_chrdev_region() in linux device driver
                            
                                Binary representation in C
                            
                                Is it always required to bind a socket?
                            
                                In-place run length decoding?
                            
                                Is this legal C/C++? `int* p = (int[]) {1,2,3} ;`
                            
                                Can you detect a debugger attached to your process using Div by Zero
                            
                                How would I write a bootloader to load a c kernel?
                            
                                Get the current time, in milliseconds, in C?
                            
                                Am I forced to use pthread_cond_broadcast (over pthread_cond_signal) in order to guarantee that *my* thread is woken up?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With