In C, I reduced the total number of loop executions by nearly 3 times, but through testing the execution time, I found that there is almost no improvement in doing so. All optimization levels have been tested, and the results are basically the same(including O0, O1, O2 and O3). I guess it’s a problem with the compiler, but I want to know what causes this situation. And what to do to make the results meet expectations.
The code is as follow:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#define Len 10000000
// Two variables that count the number of loops
int count1 = 0;
int count2 = 0;
int main(int argc, const char * argv[]) {
    srandom((unsigned)time(NULL));
    
    // An array to increase the index,
    // the range of its elements is 1-256
    int rand_arr[128];
    for (int i = 0; i < 128; ++i)
        rand_arr[i] = random()%256+1;
    
    // A random text, the range of its elements is 0-127
    char *tex = malloc((sizeof *tex) * Len);
    for (int i = 0; i < Len; ++i)
        tex[i] = random()%128;
    
    // The first testing
    clock_t start = clock();
    for (int i = 0; i < Len; i += rand_arr[tex[i]])
        count1++;
    printf("No.1: %lf s\n", ((double)(clock() - start)) / CLOCKS_PER_SEC);
    
    // The second testing (x3)
    start = clock();
    for (int i = 0; i < Len; i += rand_arr[tex[i]]+256)
        count2++;
    printf("No.2: %lf s\n", ((double)(clock() - start)) / CLOCKS_PER_SEC);
    
    printf("count1: %d\n", count1);
    printf("count2: %d\n", count2);
    
    return 0;
}
The print result(average) is as follows:
No.1: 0.002213 s
No.2: 0.002209 s
count1: 72661
count2: 25417
The problem comes from the processor itself and not the compiler. This is a complex problem related to the behaviour of the CPU caches, CPU prefetching units and the random access pattern.
Both codes read the tex array based i value that cannot be easily predicted ahead of time by the processor because of the random increments stored in the rand_arr. Because tex is relatively big, it will likely not be fully stored in L1 cache (nor in an intermediate L2 cache if any) but rather in the last-level cache (LLC) or even in RAM. As a result, tex need to be reloaded in each loops from the LLC cache or the RAM. The latency of the LLC cache and the RAM are relatively big nowadays. This thing is that the second loop is harder to predict and less cache-friendly than the first although the number of iteration is smaller!
On x86 CPU, caches pack values by block of 64 bytes called a cache line. When a value is fetched from the main memory or another cache (typically due to a cache miss), a full cache line is fetched. The following accesses to the same cache line are faster because the CPU do not need to fetch it again (as long as the cache line is not invalidated).
In the first loop, the average increment of i is 128 (because the mean of rand_arr is 128). This means that the average stride between two fetched item from tex is 128. In the worst case, the stride is 256. In the second loop, the average stride is 256+128=384 and in the worst case it is 256+256=512. When the stride is smaller than 64, there is a high probability that is will be already fetched in the first case while this is never the case in the second loop. Moreover, the prefetching units can prefetch cache line when several access are contiguous or close each other. This enable the processor to most items of the tex array ahead of time in the first loop. Meanwhile, in the second loop the prefetcher will likely fail to recognize any cache line fetching access. The prefetching units will probably not prefetch anything (because it is too expensive to do that) and the result is many cache misses with a high latency that cannot be mitigated because the accesses are inherently sequential and unpredictible. If the prefeteching units decide to prefetch all the cache lines, then the second loop should not be faster than the first (because the two loops are both bound by the memory hierarchy).
Note that random and srandom are not standard functions.
Also, be aware that clock is not always precise on all platforms. Actually, it as a precision of 1 ms on my Windows (using GCC and MinGW). This can also be seen on some Linux systems.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With