The problem: I'm trying to figure out how to write a code (C preffered, ASM only if there is no other solution) that would make the branch prediction miss in 50% of the cases. So it has to be a piece of code that "is imune" to compiler optimizations related to branching and also all the HW branch prediction should not go better than 50% (tossing a coin). Even a greater challenge is being able to run the code on multiple CPU architectures and get the same 50% miss ratio. I managed to write a code that goes to 47% branch miss ratio on an x86 platform. I'm suspecting the missing could 3% come from: <ul> <li>Program launch overhead that has branching in it (very small though)</li> <li>Profiler overhead - Basically for each counter read an interrupt is raised so this might add additional predictable branches.</li> <li>System calls running in the background that contain loops and predictable branching</li> </ul> I written my own random number generator to avoid calls to a rand whose implementation might have hidden predictable branches. It can use also rdrand when available. Latency does not matter for me. The questions: <ol> <li>Can I do better than my version of code? Better means getting a higher branch misspredict and same results for all CPU architectures.</li> <li>Can this code be predicated? What would that mean?</li> </ol> The code: <pre class="prettyprint"><code>#include <stdio.h> #include <time.h> #define RDRAND #define LCG_A 1103515245 #define LCG_C 22345 #define LCG_M 2147483648 #define ULL64 unsigned long long ULL64 generated; ULL64 rand_lcg(ULL64 seed) { #ifdef RDRAND ULL64 result = 0; asm volatile ("rdrand %0;" : "=r" (result)); return result; #else return (LCG_A * seed + LCG_C) % LCG_M; #endif } ULL64 rand_rec1() { generated = rand_lcg(generated) % 1024; if (generated < 512) return generated; else return rand_rec1(); } ULL64 rand_rec2() { generated = rand_lcg(generated) % 1024; if (!(generated >= 512)) return generated; else return rand_rec2(); } #define BROP(num, sum) \ num = rand_lcg(generated); \ asm volatile("": : :"memory"); \ if (num % 2) \ sum += rand_rec1(); \ else \ sum -= rand_rec2(); #define BROP5(num, sum) BROP(num, sum) BROP(num, sum) BROP(num, sum) BROP(num, sum) BROP(num, sum) #define BROP25(num, sum) BROP5(num, sum) BROP5(num, sum) BROP5(num, sum) BROP5(num, sum) BROP5(num, sum) #define BROP100(num, sum) BROP25(num, sum) BROP25(num, sum) BROP25(num, sum) BROP25(num, sum) int main() { int i = 0; int iterations = 500000; ULL64 num = 0; ULL64 sum = 0; generated = rand_lcg(0) % 54321; for (i = 0; i < iterations; i++) { BROP100(num, sum); // ... repeat the line above 10 times } printf("Sum = %llu\n", sum); } </code></pre> Update v1: Following the suggestion of usr, I generated various patterns by varying the LCG_C parameter from the command line in a script. I was able to go to 49.67% BP miss. That is enough for my purpose and I have the methodology to produce this on various architectures.

If you know how the branch predictor works you can get to 100% misprediction. Just take the expected prediction of the predictor each time and do the opposite. The problem is that we don't know how it is implemented. I have read that typical predictors are able to predict patters such as <code>0,1,0,1</code> and so on. But I'm sure there is a limit to how long the pattern can be. My suggestion would be to try each and every pattern of a given length (such as 4) and see which one comes closest to your target percentage. You should be able to target both 50% and 100% and come very close. This profiling needs to be done for each platform once or at runtime. I doubt that 3% of the total number of branches are in system code like you said. The kernel does not take 3% overhead on purely CPU bound user code. Increase the scheduling priority to the maximum. You can take the RNG out of the game by generating random data once and iterating over the same data many times. The branch predictor is unlikely to detect this (although it clearly could). I would implement this by filling a <code>bool[1 << 20]</code> with a zero-one pattern like I described. Then, you can run the following loop over it many times: <pre class="prettyprint"><code>int sum0 = 0, sum1 = 0; for (...) { //unroll this a lot if (array[i]) sum0++; else sum1++; } //print both sums here to make sure the computation is not being optimized out </code></pre> You'll need to examine the disassembly to make sure that the compiler did not do anything clever. I don't see why the complicated setup that you have right now is necessary. The RNG can be taken out of the question and I don't see why more than this simple loop is needed. If the compiler is playing tricks you might need to mark the variables as <code>volatile</code> which makes the compiler (better: most compilers) treat them as if they were external function calls. Since the RNG now no longer matters since it is almost never called you can even invoke the cryptographic RNG of your OS to get numbers that are indistinguishable (to any human) from true random numbers.

Fill an array with bytes, and write a loop that checks each byte and branches depending on the value of the byte. Now examine the architecture of your processor and its branch prediction very carefully. Fill the initial bytes of the array so that after examining them, the processor is in a predictable known state. From that known state, you can find out whether the next branch is predicted taken or not. Set the next byte so that the prediction is wrong. Again, find out whether the next branch is predicted taken or not, and set the next byte so that the prediction is wrong and so on. If you disable interrupts as well (which could change the branch prediction), you can come close to 100% mispredicted branches. As a simple case, on an old PowerPC processor with strong/weak prediction, after three taken branches it will always be in the state "strong taken" and one branch not taken changes it to "weak taken". If you now have a sequence of alternating not taken / taken branches, the prediction is always wrong and switches between weak not taken and weak taken. This will of course only work with that particular processor. Most modern processors would see that sequence as almost 100% predictable. For example, they might use two separate predictors; one for the case "last branch was taken" and one for the case "last branch was not taken". But for such a processor, a different sequence of bytes will give the same 100% misprediction rate.

Is there a code that results in 50% branch prediction miss?

Tags:

The problem:

I'm trying to figure out how to write a code (C preffered, ASM only if there is no other solution) that would make the branch prediction miss in 50% of the cases.

So it has to be a piece of code that "is imune" to compiler optimizations related to branching and also all the HW branch prediction should not go better than 50% (tossing a coin). Even a greater challenge is being able to run the code on multiple CPU architectures and get the same 50% miss ratio.

I managed to write a code that goes to 47% branch miss ratio on an x86 platform. I'm suspecting the missing could 3% come from:

Program launch overhead that has branching in it (very small though)
Profiler overhead - Basically for each counter read an interrupt is raised so this might add additional predictable branches.
System calls running in the background that contain loops and predictable branching

I written my own random number generator to avoid calls to a rand whose implementation might have hidden predictable branches. It can use also rdrand when available. Latency does not matter for me.

The questions:

Can I do better than my version of code? Better means getting a higher branch misspredict and same results for all CPU architectures.
Can this code be predicated? What would that mean?

The code:

#include <stdio.h> #include <time.h>  #define RDRAND #define LCG_A   1103515245 #define LCG_C   22345 #define LCG_M   2147483648 #define ULL64   unsigned long long  ULL64 generated;  ULL64 rand_lcg(ULL64 seed) { #ifdef RDRAND     ULL64 result = 0;     asm volatile ("rdrand %0;" : "=r" (result));     return result; #else     return (LCG_A * seed + LCG_C) % LCG_M; #endif }  ULL64 rand_rec1() {     generated = rand_lcg(generated) % 1024;      if (generated < 512)         return generated;     else return rand_rec1(); }  ULL64 rand_rec2() {     generated = rand_lcg(generated) % 1024;      if (!(generated >= 512))         return generated;     else return rand_rec2(); }  #define BROP(num, sum)                  \     num = rand_lcg(generated);          \     asm volatile("": : :"memory");      \     if (num % 2)                        \         sum += rand_rec1();             \     else                                \         sum -= rand_rec2();  #define BROP5(num, sum)     BROP(num, sum) BROP(num, sum) BROP(num, sum) BROP(num, sum) BROP(num, sum) #define BROP25(num, sum)    BROP5(num, sum) BROP5(num, sum) BROP5(num, sum) BROP5(num, sum) BROP5(num, sum) #define BROP100(num, sum)   BROP25(num, sum) BROP25(num, sum) BROP25(num, sum) BROP25(num, sum)  int main() {     int i = 0;     int iterations = 500000;         ULL64 num = 0;     ULL64 sum = 0;      generated = rand_lcg(0) % 54321;      for (i = 0; i < iterations; i++)     {         BROP100(num, sum);         // ... repeat the line above 10 times     }      printf("Sum = %llu\n", sum); }

Update v1:

Following the suggestion of usr, I generated various patterns by varying the LCG_C parameter from the command line in a script. I was able to go to 49.67% BP miss. That is enough for my purpose and I have the methodology to produce this on various architectures.

883

asked Mar 10 '15 10:03

VAndrei

2 Answers

If you know how the branch predictor works you can get to 100% misprediction. Just take the expected prediction of the predictor each time and do the opposite. The problem is that we don't know how it is implemented.

I have read that typical predictors are able to predict patters such as 0,1,0,1 and so on. But I'm sure there is a limit to how long the pattern can be. My suggestion would be to try each and every pattern of a given length (such as 4) and see which one comes closest to your target percentage. You should be able to target both 50% and 100% and come very close. This profiling needs to be done for each platform once or at runtime.

I doubt that 3% of the total number of branches are in system code like you said. The kernel does not take 3% overhead on purely CPU bound user code. Increase the scheduling priority to the maximum.

You can take the RNG out of the game by generating random data once and iterating over the same data many times. The branch predictor is unlikely to detect this (although it clearly could).

I would implement this by filling a bool[1 << 20] with a zero-one pattern like I described. Then, you can run the following loop over it many times:

int sum0 = 0, sum1 = 0; for (...) {  //unroll this a lot  if (array[i]) sum0++;  else sum1++; } //print both sums here to make sure the computation is not being optimized out

You'll need to examine the disassembly to make sure that the compiler did not do anything clever.

I don't see why the complicated setup that you have right now is necessary. The RNG can be taken out of the question and I don't see why more than this simple loop is needed. If the compiler is playing tricks you might need to mark the variables as volatile which makes the compiler (better: most compilers) treat them as if they were external function calls.

Since the RNG now no longer matters since it is almost never called you can even invoke the cryptographic RNG of your OS to get numbers that are indistinguishable (to any human) from true random numbers.

174

answered Oct 12 '22 23:10

usr

Fill an array with bytes, and write a loop that checks each byte and branches depending on the value of the byte.

Now examine the architecture of your processor and its branch prediction very carefully. Fill the initial bytes of the array so that after examining them, the processor is in a predictable known state. From that known state, you can find out whether the next branch is predicted taken or not. Set the next byte so that the prediction is wrong. Again, find out whether the next branch is predicted taken or not, and set the next byte so that the prediction is wrong and so on.

If you disable interrupts as well (which could change the branch prediction), you can come close to 100% mispredicted branches.

As a simple case, on an old PowerPC processor with strong/weak prediction, after three taken branches it will always be in the state "strong taken" and one branch not taken changes it to "weak taken". If you now have a sequence of alternating not taken / taken branches, the prediction is always wrong and switches between weak not taken and weak taken.

This will of course only work with that particular processor. Most modern processors would see that sequence as almost 100% predictable. For example, they might use two separate predictors; one for the case "last branch was taken" and one for the case "last branch was not taken". But for such a processor, a different sequence of bytes will give the same 100% misprediction rate.

answered Oct 12 '22 21:10

gnasher729

Related questions
                            
                                How to programmatically check the applicability rules of a Windows update?
                            
                                Web API Authentication in ASP.NET 5
                            
                                Is it possible to merge coverage data from two executables with gcov/gcovr?
                            
                                In VSCode when exporting functions: "Individual declarations must be all exported or all local"
                            
                                BeautifulSoup like scraper for nodejs [closed]
                            
                                PowerMock: stub methods from parent class
                            
                                Google Cloud Platform Project Limit
                            
                                @EventListener with @Async in Spring
                            
                                Passing asyncio loop by argument or using default asyncio loop
                            
                                How to work with Typescript in .vue files using VS Code?
                            
                                Google Cloud Endpoints with Google Cloud Functions [closed]
                            
                                How do I get Sonarcloud to run on pull requests from forks with Travis, Maven & github

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With