How to decrease the time spent on one instruction?

I am trying to optimize code written in C, and it seems that a single instruction is taking about 22% of the time.

The code is compiled with gcc 8.2.0. The flags are -O3 -DNDEBUG -g, plus -Wall -Wextra -Weffc++ -pthread -lrt.

    509529.517218      task-clock (msec)         #    0.999 CPUs utilized
            6,234      context-switches          #    0.012 K/sec
               10      cpu-migrations            #    0.000 K/sec
        1,305,885      page-faults               #    0.003 M/sec
1,985,640,853,831      cycles                    #    3.897 GHz                      (30.76%)
1,897,574,410,921      instructions              #    0.96  insn per cycle           (38.46%)
  229,365,727,020      branches                  #  450.152 M/sec                    (38.46%)
   13,027,677,754      branch-misses             #    5.68% of all branches          (38.46%)
  604,340,619,317      L1-dcache-loads           # 1186.076 M/sec                    (38.46%)
   47,749,307,910      L1-dcache-load-misses     #    7.90% of all L1-dcache hits    (38.47%)
   19,724,956,845      LLC-loads                 #   38.712 M/sec                    (30.78%)
    3,349,412,068      LLC-load-misses           #   16.98% of all LL-cache hits     (30.77%)
  <not supported>      L1-icache-loads                                          
      129,878,634      L1-icache-load-misses                                         (30.77%)
  604,482,046,140      dTLB-loads                # 1186.353 M/sec                    (30.77%)
    4,596,384,416      dTLB-load-misses          #    0.76% of all dTLB cache hits   (30.77%)
        2,493,696      iTLB-loads                #    0.005 M/sec                    (30.77%)
       21,356,368      iTLB-load-misses          #  856.41% of all iTLB cache hits   (30.76%)
  <not supported>      L1-dcache-prefetches                                     
  <not supported>      L1-dcache-prefetch-misses                                

    509.843595752 seconds time elapsed

    507.706093000 seconds user
      1.839848000 seconds sys

VTune Amplifier gives me a hint to a function: https://pasteboard.co/IagrLaF.png

The cmpq instruction appears to account for about 22% of the total time, while the other instructions take negligible time.

perf gives me a somewhat different picture, yet I think the results are consistent:

 Percent│       bool mapFound = false;
   0.00 │       movb   $0x0,0x7(%rsp)
        │     goDownBwt():
        │       bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
   0.00 │       lea    0x20(%rsp),%r12
        │         newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
   0.00 │       lea    (%rax,%rax,2),%rax
   0.00 │       shl    $0x3,%rax
   0.00 │       mov    %rax,0x18(%rsp)
   0.01 │       movzwl %dx,%eax
   0.00 │       mov    %eax,(%rsp)
   0.00 │     ↓ jmp    d6
        │       nop
        │       if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
   0.30 │ 88:   mov    (%rax),%rsi
   8.38 │       mov    0x10(%rsi),%rcx
   0.62 │       test   %rcx,%rcx
   0.15 │     ↓ je     1b0
        │         newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
   2.05 │       add    0x18(%rsp),%rcx
        │         ++stats->nDownPreprocessed;
   0.25 │       addq   $0x1,0x18(%rdx)
        │         newState->trace                = PREPROCESSED;
   0.98 │       movb   $0x10,0x30(%rsp)
        │         return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
  43.36 │       mov    0x8(%rcx),%rax
   2.61 │       cmp    %rax,(%rcx)
        │         newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
   0.05 │       mov    %rcx,0x20(%rsp)
        │         return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
   3.47 │       setbe  %dl

The function is

inline bool goDownBwt (state_t *previousState, unsigned short nucleotide, state_t *newState) {
  ++stats->nDown;
  if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
    ++stats->nDownPreprocessed;
    newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
    newState->trace                = PREPROCESSED;
    return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
  }
  bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
  newState->interval.k = bwt->L2[nucleotide] + newState->interval.k + 1;
  newState->interval.l = bwt->L2[nucleotide] + newState->interval.l;
  newState->trace = 0;
  return (newState->interval.k <= newState->interval.l);
}

state_t is defined as

struct state_t {
  union {
    bwtinterval_t interval;
    preprocessedInterval_t *preprocessedInterval;
  };
  unsigned char trace;
  struct state_t *previousState;
};

preprocessedInterval_t is:

struct preprocessedInterval_t {
  bwtinterval_t interval;
  preprocessedInterval_t *firstChild;
};

There are few (~1000) state_t structures. However, there are many (350k) preprocessedInterval_t objects, allocated somewhere else.

The first if is true about 15 billion times out of 19 billion calls.

Sampling mispredicted branches with perf record -e branches,branch-misses mytool gives me, for this function:

Available samples
2M branches                                                                                                                                                                                                       
1M branch-misses  

Can I assume that branch misprediction is responsible for this slow down? What would be the next step to optimize my code?

The code is available on GitHub


Edit 1

valgrind --tool=cachegrind gives me:

I   refs:      1,893,716,274,393
I1  misses:            4,702,494
LLi misses:              137,142
I1  miss rate:              0.00%
LLi miss rate:              0.00%

D   refs:        756,774,557,235  (602,597,601,611 rd   + 154,176,955,624 wr)
D1  misses:       39,489,866,187  ( 33,583,272,379 rd   +   5,906,593,808 wr)
LLd misses:        3,483,920,786  (  3,379,118,877 rd   +     104,801,909 wr)
D1  miss rate:               5.2% (            5.6%     +             3.8%  )
LLd miss rate:               0.5% (            0.6%     +             0.1%  )

LL refs:          39,494,568,681  ( 33,587,974,873 rd   +   5,906,593,808 wr)
LL misses:         3,484,057,928  (  3,379,256,019 rd   +     104,801,909 wr)
LL miss rate:                0.1% (            0.1%     +             0.1%  )

Edit 2

I compiled with -O3 -DNDEBUG -march=native -fprofile-use, and used the command perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss ./a.out

    508322.348021      task-clock (msec)         #    0.998 CPUs utilized
           21,592      context-switches          #    0.042 K/sec
               33      cpu-migrations            #    0.000 K/sec
        1,305,885      page-faults               #    0.003 M/sec
1,978,382,746,597      cycles                    #    3.892 GHz                      (44.44%)
  228,898,532,311      branches                  #  450.302 M/sec                    (44.45%)
   12,816,920,039      branch-misses             #    5.60% of all branches          (44.45%)
1,867,947,557,739      instructions              #    0.94  insn per cycle           (55.56%)
2,957,085,686,275      uops_issued.any           # 5817.343 M/sec                    (55.56%)
2,864,257,274,102      uops_executed.thread      # 5634.726 M/sec                    (55.56%)
    2,490,571,629      mem_load_uops_retired.l3_miss #    4.900 M/sec                    (55.55%)
   12,482,683,638      mem_load_uops_retired.l2_miss #   24.557 M/sec                    (55.55%)
   18,634,558,602      mem_load_uops_retired.l1_miss #   36.659 M/sec                    (44.44%)

    509.210162391 seconds time elapsed

    506.213075000 seconds user
      2.147749000 seconds sys

Edit 3

I selected the results of perf record -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss a.out that mentioned my function:

Samples: 2M of event 'task-clock', Event count (approx.): 517526250000
Overhead  Command     Shared Object       Symbol
  49.76%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 917K of event 'cycles', Event count (approx.): 891499601652
Overhead  Command     Shared Object       Symbol
  49.36%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 911K of event 'branches', Event count (approx.): 101918042567
Overhead  Command     Shared Object       Symbol
  43.01%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 877K of event 'branch-misses', Event count (approx.): 5689088740
Overhead  Command     Shared Object       Symbol
  50.32%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 1M of event 'instructions', Event count (approx.): 1036429973874
Overhead  Command     Shared Object       Symbol
  34.85%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 824K of event 'uops_issued.any', Event count (approx.): 1649042473560
Overhead  Command     Shared Object       Symbol
  42.19%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 802K of event 'uops_executed.thread', Event count (approx.): 1604052406075
Overhead  Command     Shared Object       Symbol
  48.14%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 13K of event 'mem_load_uops_retired.l3_miss', Event count (approx.): 1350194507
Overhead  Command     Shared Object      Symbol
  33.24%  srnaMapper  srnaMapper         [.] addState
  31.00%  srnaMapper  srnaMapper         [.] mapWithoutError

Samples: 142K of event 'mem_load_uops_retired.l2_miss', Event count (approx.): 7143448989
Overhead  Command     Shared Object       Symbol
  40.79%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 84K of event 'mem_load_uops_retired.l1_miss', Event count (approx.): 8451553539
Overhead  Command     Shared Object       Symbol
  39.11%  srnaMapper  srnaMapper          [.] mapWithoutError

(Using perf record --period 10000 triggers Workload failed: No such file or directory.)

asked Apr 15 '19 at 15:04 by unamourdeswann

1 Answer

Was the sample rate the same for branches and branch-misses? A 50% mispredict rate would be extremely bad.

https://perf.wiki.kernel.org/index.php/Tutorial#Period_and_rate explains that the kernel dynamically adjusts the period for each counter so events fire often enough to get enough samples even for rare events. But you can set the period (how many raw counts trigger a sample); I think that's what perf record --period 10000 does, but I haven't used that myself.

Use perf stat to get hard numbers. Update: yup, your perf stat results confirm your branch mispredict rate is "only" 5%, not 50%, at least for the program as a whole. That's still higher than you'd like (branches are usually frequent and mispredicts are expensive) but not insane.


Also check the cache miss rate for L1d, and maybe mem_load_retired.l3_miss (and/or l2_miss and l1_miss), to see if it's really that load that's missing, e.g.

perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,\
uops_issued.any,uops_executed.thread,\
mem_load_retired.l3_miss,mem_load_retired.l2_miss,mem_load_retired.l1_miss  ./a.out

You can use any of these events with perf record to get some statistical samples on which instructions are causing cache misses. Those are precise events (using PEBS), so should accurately map to the correct instruction (not like "cycles" where counts get attributed to some nearby instruction, often the one that stalls waiting for an input with the ROB full, instead of the one that was slow to produce it.)

And they don't have the skew of non-PEBS events, which should "blame" a single instruction but don't always interrupt at exactly the right place.


If you're optimizing for your local machine and don't need it to run anywhere else, you might use -O3 -march=native. Not that that will help with cache misses.

GCC profile-guided optimization can help it choose branchy vs. branchless. (gcc -O3 -march=native -fprofile-generate / run it with some realistic input data to generate profile outputs / gcc -O3 -march=native -fprofile-use)


Can I assume that branch misprediction is responsible for this slow down?

No, cache misses might be more likely. You have a significant number of L3 misses, and going all the way to DRAM costs hundreds of core clock cycles. Branch prediction can hide some of that if it predicts correctly.

What would be the next step to optimize my code?

Compact your data structures if possible so more of them fit in cache, e.g. 32-bit pointers (Linux x32 ABI: gcc -mx32) if you don't need more than 4GiB of virtual address space. Or maybe try using a 32-bit unsigned index into a large array instead of raw pointers, but that has slightly worse load-use latency (by a couple cycles on Sandybridge-family.)

And/or improve your access pattern so you're mostly accessing the structures in sequential order, letting hardware prefetch bring them into cache before you need to read them.

I'm not familiar enough with the Burrows–Wheeler transform (https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform) or its application in sequence alignment to know if it's possible to make it more efficient, but data compression is inherently problematic because you very often need data-dependent branching and scattered data accesses. It's often worth the tradeoff vs. even more cache misses, though.

answered Oct 20 '22 at 10:10 by Peter Cordes