Branch Predictor Entries Invalidation upon program finishes?

Tags:

intel

I am trying to understand when branch predictor entries are invalidated.

Here are the experiments I have done:

Code1:

start_measure_branch_mispred()
while(X times):
 if(something something):
  do_useless()
 endif
endwhile
end_measurement()
store_difference()

So, I am running this code a number of times. I can see that after the first run, the misprediction rates go lower. The branch predictor learns how to predict correctly. But, if I run this experiment again and again (i.e. by writing ./experiment to the terminal), all the first iterations are starting from high misprediction rates. So, at each execution, the branch prediction units for those conditional branches are invalidated. I am using nokaslr and I have disabled ASLR. I also run this experiment on an isolated core. I have run this experiment a couple of times to make sure this is the behavior (i.e. not because of the noise).

My question is: Does CPU invalidate branch prediction units after the program stops its execution? Or what is the cause of this?

The second experiment I have done is:

Code 2:

do:
    start_measure_branch_mispred()
    while(X times):
      if(something something):
        do_useless()
      endif
    endwhile
    end_measurement()
    store_difference()
while(cpu core == 1)

In this experiment, I am running the different processes from two different terminals. The first one is pinned to the core 1 so that it will run on the core 1 and it will do this experiment until I stop it (by killing it). Then, I am running the second process from another terminal and I am pinning the process to different cores. As this process is in a different core, it will only execute the do-while loop 1 time. If the second process is pinned to the sibling core of the first one (same physical core), I see that in the first iteration, the second process guess almost correctly. If I pin the second process another core which is not the sibling of the first one, then the first iteration of the second process makes higher mispredictions. This is expected results because virtual cores on the same physical core share the same branch prediction units (that is my assumption). So, the second process benefits the trained branch prediction units as they have the same virtual address and map to the same branch prediction unit entry.

As far as I understand, since the CPU is not done with the first process (core 1 process that does the busy loop), the branch prediction entries are still there and the second process can benefit from this. But, in the first one, from run to run, I get higher mispredictions.

EDIT: As the other user asked for the code, here it is. You need to download performance events header code from here

To compile: $(CXX) -std=c++11 -O0 main.cpp -lpthread -o experiment

The code:

#include "linux-perf-events.h"

#include <algorithm>
#include <climits>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

// some array
int arr8[8] = {1,1,0,0,0,1,0,1};

int pin_thread_to_core(int core_id){            
    int retval;     
    int num_cores = sysconf(_SC_NPROCESSORS_ONLN);      
    if (core_id < 0 || core_id >= num_cores)            
        retval = EINVAL;                                
    cpu_set_t cpuset;                                   
    CPU_ZERO(&cpuset);                                  
    CPU_SET(core_id, &cpuset);                          
    retval = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    return retval;
}

void measurement(int cpuid, uint64_t howmany, int* branch_misses){

    int retval = pin_thread_to_core(cpuid);
    if(retval){
        printf("Affinity error: %s\n", strerror(errno));
        return;
    }

    std::vector<int> evts;
    evts.push_back(PERF_COUNT_HW_BRANCH_MISSES); // You might have a different performance event!

    LinuxEvents<PERF_TYPE_HARDWARE> unified(evts, cpuid); // You need to change the constructor in the performance counter so that it will count the events in the given cpuid

    uint64_t *buffer = new uint64_t[howmany + 1];
    uint64_t *buffer_org; // for restoring
    buffer_org = buffer;
    uint64_t howmany_org = howmany; // for restoring

    std::vector<unsigned long long> results;
    results.resize(evts.size());

    do{
        for(size_t trial = 0; trial < 10; trial++) {

            unified.start();
            // the while loop will be executed innerloop times
            int res;
            while(howmany){
                res = arr8[howmany & 0x7]; // do the sequence howmany/8 times
                if(res){
                    *buffer++ = res;
                }       
                howmany--;
            }
            unified.end(results);
            // store misses
            branch_misses[trial] = results[0];
            // restore for next iteration
            buffer = buffer_org;
            howmany = howmany_org;
        }
    }while(cpuid == 5); // the core that does busy loop

    // get rid of optimization
    howmany = (howmany + 1) * buffer[3];
    branch_misses[10] = howmany; // last entry is reserved for this dummy operation

    delete[] buffer;

}
void usage(){
    printf("Run with ./experiment X \t where X is the core number\n");
}
int main(int argc, char *argv[]) {
    // as I have 11th core isolated, set affinity to that
    if(argc == 1){
        usage();
        return 1;
    }

    int exp = 16; // howmany

    int results[11];
    int cpuid = atoi(argv[1]); 

    measurement(cpuid, exp, results);

    printf("%d measurements\n", exp);

    printf("Trial\t\t\tBranchMiss\n");
    for (size_t trial = 0; trial < 10; trial++)
    {
        printf("%zu\t\t\t%d\n", trial, results[trial]);
    }
    return 0;
}

If you want to try the first code, just run ./experiment 1 twice. It will have the same execution as the first code.

If you want to try the second code, open two terminals, run ./experiment X in the first one, and run ./experiment Y in the second one, where X and Y are cpuid's.

Note that, you might not have the same performance event counter. Also, note that you might need to change the cpuid in the busyloop.

833

asked Dec 02 '19 16:12

yzb74714

2 Answers

So, I have conducted more experiments to reduce the effect of noise (either from _start until main() functions or from syscalls and interrupts that can happen between two program execution which (syscalls and interrupts) can corrupt the branch predictors.

Here is the pseudo-code of the modified experiment:

int main(int arg){ // arg is the iteration
   pin_thread_to_isolated_core()
   for i=0 to arg:
     measurement()
     std::this_thread::sleep_for(std::chrono::milliseconds(1)); // I put this as it is
   endfor
   printresults() // print after all measurements are completed
}

void measurement(){
   initialization()
   for i=0 to 10:
      start_measurement()
      while(X times) // for the results below, X is 32
        a = arr8[an element] //sequence of 8,
        if(a is odd)
           do_sth()
        endif
      endwhile
      end_measurement()
      store_difference()
   endfor
}

And, these are the results:

For example, I give iteration as 3

Trial           BranchMiss
RUN:1
    0           16
    1           28
    2           3
    3           1
    ....  continues as 1
RUN:2
    0           16   // CPU forgets the sequence
    1           30
    2           2
    3           1
    ....  continues as 1
RUN:3
    0           16
    1           27
    2           4
    3           1
    ....  continues as 1

So, even a millisecond sleep can disturb the branch prediction units. Why is that the case? If I don't put a sleep between those measurements, the CPU can correctly guess, i.e. the Run2 and Run3 will look like below:

RUN:2
    0           1   
    1           1
    ....  continues as 1
RUN:3
    0           1
    1           1
    ....  continues as 1

I believe I diminish the branch executions from _start to the measurement point. Still, the CPU forgets the trained thing.

138

answered Sep 29 '22 22:09

yzb74714

Does CPU invalidate branch prediction units after the program stops its execution?

No, the CPU has no idea if/when a program stops execution.

The branch prediction data only makes sense for one virtual address space, so when you switch to a different virtual address space (or when the kernel switches to a different address space, rips the old virtual address space apart and converts its page tables, etc. back into free RAM, then constructs an entirely new virtual address space when you start the program again) all of the old branch predictor data is no longer valid for the new (completely different and unrelated, even if the contents happen to be the same) virtual address space.

If the second process is pinned to the sibling core of the first one (same physical core), I see that in the first iteration, the second process guess almost correctly.

This is expected results because virtual cores on the same physical core share the same branch prediction units (that is my assumption).

In a perfect world; a glaring security vulnerability (branch predictor state, that can be used to infer information about the data that caused it, being leaked from a victim's process on one logical processor to an attacker's process on a different logical processor in the same core) is not what I'd expect.

The world is somewhat less than perfect. More specifically, in a perfect world branch predictor entries would have "tags" (meta-data) containing which virtual address space and the full virtual address (and which CPU mode) the entry is valid for, and all of this information would be checked by the CPU before using the entry to predict a branch; however that's more expensive and slower than having smaller tags with less information, accidentally using branch predictor entries that are not appropriate, and ending up with "spectre-like" security vulnerabilities.

Note that this is a known vulnerability that the OS you're using failed to mitigate, most likely because you disabled the first line of defense against this kind of vulnerability (ASLR).

answered Sep 29 '22 22:09

Brendan

Related questions
                            
                                Can I get rid of a sign-extend between CTZ and addition to a pointer?
                            
                                Pass values from C program to Assembly language
                            
                                Strange initialization behavior for an array constructor with implied do in gfortran
                            
                                Minimum SIMD vector width data type
                            
                                Achieving Numba's performance with Cython
                            
                                Using carry flag after shift operation
                            
                                Get number of MPI Communicators in use
                            
                                how to load .JSON file to RealSense D435 camera using C++ and intel API
                            
                                Why is this specpoline not working on Kaby lake?
                            
                                Which is generally faster to test for zero in x86 ASM: "TEST EAX, EAX" versus "TEST AL, AL"?
                            
                                Clang generates worse code for 7 comparisons than for 8 comparisons
                            
                                X86 instructions to power off computer in real mode?
                            
                                Is it okay to mix legacy SSE encoded instructions and VEX encoded ones in the same code path?
                            
                                What unit testing frameworks are available for x86 assembler?
                            
                                Optimizing an arithmetic coder
                            
                                Assembly ADC (Add with carry) to C++
                            
                                How to make GCC generate bswap instruction for big endian store without builtins?
                            
                                .net Framework Error (HRESULT 0x8007000B)
                            
                                What is the 'shadow space' in x64 assembly?
                            
                                What does ordered / unordered comparison mean?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Branch Predictor Entries Invalidation upon program finishes?

Tags:

intel

yzb74714

People also ask

2 Answers

yzb74714

Brendan

Recent Activity

Donate For Us