Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does my code cause instruction-cache misses?

According to cachegrind this checksum calculation routine is one of the greatest contributors to instruction-cache load and instruction-cache misses in the entire application:

#include <stdint.h>

namespace {

uint32_t OnesComplementSum(const uint16_t * b16, int len)  {
    uint32_t sum = 0;

    uint32_t a = 0;
    uint32_t b = 0;
    uint32_t c = 0;
    uint32_t d = 0;

    // helper for the loop unrolling
    auto run8 = [&] {
        a += b16[0];
        b += b16[1];
        c += b16[2];
        d += b16[3];
        b16 += 4;
    };

    for (;;) {
        if (len > 32) {
            run8();
            run8();
            run8();
            run8();
            len -= 32;
            continue;
        }

        if (len > 8) {
            run8();
            len -= 8;
            continue;
        }
        break;
    }

    sum += (a + b) + (c + d);

    auto reduce = [&]() {
        sum = (sum & 0xFFFF) + (sum >> 16);
        if (sum > 0xFFFF) sum -= 0xFFFF;
    };

    reduce();

    while ((len -= 2) >= 0) sum += *b16++;

    if (len == -1) sum += *(const uint8_t *)b16; // add the last byte

    reduce();

    return sum;
}    

} // anonymous namespace     

uint32_t get(const uint16_t* data, int length)
{
    return OnesComplementSum(data, length);
}

See asm output here.

Maybe the it's caused by the loop unrolling, but the generated object code doesn't seem too excessive.

How can I improve the code?

Update

  • Because the checksum function was in an anonymous namespace it was inlined and duplicated by two functions that resided in the same cpp file.
  • The loop unrolling is still beneficial. Removing it slowed down the code.
  • Improving the infinite loop speeds up the code (but for some reason I get opposite results on my mac).

    • Before fixes: here you can see the two checksums and 17210 L1 IR misses
    • After fixes: after fixing the inlining problem and fixing the infinite loop the L1 instruction cache misses dropped to 8324.
    • "InstructionFetch" is higher in the fixed example. I'm not sure how to interpret that. Does it simply mean that's where most activity occurred? Or does it hint at a problem?
like image 733
StackedCrooked Avatar asked Oct 20 '22 14:10

StackedCrooked


1 Answers

replace the main loop with just:

const int quick_len=len/8;
const uint16_t * const the_end=b16+quick_len*4;
len -= quick_len*8;
for (; b16+4 <= the_end; b16+=4)
{
    a += b16[0];
    b += b16[1];
    c += b16[2];
    d += b16[3];
}

There seems no need to manually loop unroll if you use -O3

Also, the test case allowed for too much optimization since the input was static and the results unused, also printing out the result helps verify optimized versions don't break anything

Full test I used:

int main(int argc, char *argv[])
{

    using namespace std::chrono;
    auto start_time = steady_clock::now();
    int ret=OnesComplementSum((const uint8_t*)(s.data()+argc), s.size()-argc, 0);
    auto elapsed_ns = duration_cast<nanoseconds>(steady_clock::now() - start_time).count();

    std::cout << "loop=" << loop << " elapsed_ns=" << elapsed_ns << " = " << ret<< std::endl;

    return ret;
}

Comparison with theis (CLEAN LOOP) and your improved version (UGLY LOOP) and a longer test string:

loop=CLEAN_LOOP  elapsed_ns=8365  =  14031
loop=CLEAN_LOOP  elapsed_ns=5793  =  14031
loop=CLEAN_LOOP  elapsed_ns=5623  =  14031
loop=CLEAN_LOOP  elapsed_ns=5585  =  14031
loop=UGLY_LOOP   elapsed_ns=9365  =  14031
loop=UGLY_LOOP   elapsed_ns=8957  =  14031
loop=UGLY_LOOP   elapsed_ns=8877  =  14031
loop=UGLY_LOOP   elapsed_ns=8873  =  14031

Verification here: http://coliru.stacked-crooked.com/a/52d670039de17943

EDIT:

In fact the whole function can be reduced to:

uint32_t OnesComplementSum(const uint8_t* inData, int len, uint32_t sum)
{
  const uint16_t * b16 = reinterpret_cast<const uint16_t *>(inData);
  const uint16_t * const the_end=b16+len/2;

  for (; b16 < the_end; ++b16)
  {
    sum += *b16;
  }

  sum = (sum & uint16_t(-1)) + (sum >> 16);
  return (sum > uint16_t(-1)) ? sum - uint16_t(-1) : sum;
}

Which does better than the OPs with -O3 but worse with -O2:

http://coliru.stacked-crooked.com/a/bcca1e94c2f394c7

loop=CLEAN_LOOP  elapsed_ns=5825  =  14031
loop=CLEAN_LOOP  elapsed_ns=5717  =  14031
loop=CLEAN_LOOP  elapsed_ns=5681  =  14031
loop=CLEAN_LOOP  elapsed_ns=5646  =  14031
loop=UGLY_LOOP   elapsed_ns=9201  =  14031
loop=UGLY_LOOP   elapsed_ns=8826  =  14031
loop=UGLY_LOOP   elapsed_ns=8859  =  14031
loop=UGLY_LOOP   elapsed_ns=9582  =  14031

So mileage may vary, and unless the exact architecture is known, I'd just go simpler

like image 95
Glenn Teitelbaum Avatar answered Oct 22 '22 21:10

Glenn Teitelbaum