I am wondering whether checking a 32-bit variable before setting it is faster than just setting it. E.g. variable a is a uint32:
if (a != 0)
{
    a = 0;
}
or
a = 0;
The code runs in a loop that executes many times, so I want to reduce its running time. Note that a will be 0 most of the time, so the question can be shortened to whether it is faster to check a 32-bit variable or to set it. Thank you in advance!
Edit: Thank you to all who commented on the question. I wrote a for loop and tested both assigning and if-ing 100 thousand times. It turns out assigning is faster (54 ms for if-ing and 44 ms for assigning).
What you describe is called a "silent store" optimization.
PRO: unnecessary stores are avoided.
This can reduce pressure on the store-to-load forwarding buffers, a component of a modern out-of-order CPU that is quite expensive in hardware and, as a result, is often undersized and therefore a performance bottleneck. On Intel x86 CPUs there are event monitoring (EMON) performance counters that you can use to investigate whether this is a problem in your program.
Interestingly, it can also reduce the number of loads that your program does. First, SW: if the stores are not eliminated, the compiler may be unable to prove that they do not write to the memory occupied by a different variable (the so-called address and pointer disambiguation problem), so the compiler may generate unnecessary reloads of such possibly but not actually conflicting memory locations. Eliminate the stores, and some of these loads may also be eliminated. Second, HW: most modern CPUs have store-to-load dependency predictors, and fewer stores increase their accuracy. If a dependency is predicted, the load may not actually be performed by hardware, and may be converted into a register-to-register move. This was the subject of the recent patent lawsuits that the University of Wisconsin asserted against Intel and Apple, with awards exceeding hundreds of millions of dollars.
But the most important reason to eliminate the unnecessary stores is to avoid unnecessarily dirtying the cache. A dirty cache line eventually has to be written to memory, even if unchanged, which wastes power. In many systems it will eventually be written to flash or SSD, wasting power and consuming the limited write cycles of the device.
These considerations have motivated academic research in silent stores, such as http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.28.8947&rep=rep1&type=pdf. However, a quick Google Scholar search shows these papers are mainly from 2000-2004, and I am aware of no modern CPUs implementing true silent store elimination, that is, actually having hardware read the old value. I suspect, however, that this lack of deployment of silent stores is mainly because CPU design went on pause for more than a decade, as focus changed from desktop PCs to cell phones. Now that cell phones have almost caught up to the sophistication of 2000-era desktop CPUs, it may arise again.
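To illustrate the software half of that argument, here is a minimal C sketch. The function names and the `weight` parameter are hypothetical, chosen only for this example. Because `a` and `weight` have the same pointed-to type and neither is marked `restrict`, the compiler has to assume they might overlap. Whether a particular compiler actually keeps `*weight` in a register in the checked version is not guaranteed; inspect the generated assembly to find out.

```c
#include <stddef.h>
#include <stdint.h>

/* Unconditional store: a[i] might alias *weight, so the compiler generally
   has to assume *weight changed and reload it after every store. */
uint32_t clear_always(uint32_t *a, size_t n, const uint32_t *weight)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        a[i] = 0;
        sum += *weight;
    }
    return sum;
}

/* Checked store: on iterations where a[i] is already 0 no store happens,
   so a previously loaded *weight is still known to be valid on that path. */
uint32_t clear_if_nonzero(uint32_t *a, size_t n, const uint32_t *weight)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != 0)
            a[i] = 0;
        sum += *weight;
    }
    return sum;
}
```

Both functions compute the same result, including when `weight` really does point into `a`; only the amount of store and reload traffic may differ.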
CON: Eliminating the silent store in software takes more instructions. Worse, it takes a branch. If the branch is not very predictable, the resulting branch mispredictions will consume any savings. Some machines have instructions that allow you to eliminate such stores without a branch, e.g. Intel's LRBni vector store instructions with a conditional vector mask. I believe AVX has these instructions. If you or your compiler can use such instructions, then the cost is just the load of the old value and a vector compare; if the old value is already in a register, it is just the compare.
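As a rough sketch of the branch-free variant, assuming an x86 target with AVX2 (compile with `-mavx2` or similar), the masked store intrinsic `_mm256_maskstore_epi32` from `<immintrin.h>` can write only the lanes whose old value is non-zero. The function name `clear_masked` is hypothetical. Without AVX2 the code falls back to the scalar check. Whether a fully masked-off store really avoids dirtying the cache line depends on the microarchitecture, so treat that as an assumption to verify by measurement.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Zero the array, writing only the elements that are currently non-zero. */
void clear_masked(uint32_t *a, size_t n)
{
    size_t i = 0;
#if defined(__AVX2__)
    const __m256i zero = _mm256_setzero_si256();
    const __m256i ones = _mm256_set1_epi32(-1);
    for (; i + 8 <= n; i += 8) {
        /* Load the old values and build a per-lane "non-zero" mask. */
        __m256i old     = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i is_zero = _mm256_cmpeq_epi32(old, zero);
        __m256i nonzero = _mm256_xor_si256(is_zero, ones);
        /* Only lanes whose mask has the top bit set are written. */
        _mm256_maskstore_epi32((int *)(a + i), nonzero, zero);
    }
#endif
    /* Scalar tail, and the whole array when AVX2 is unavailable. */
    for (; i < n; i++) {
        if (a[i] != 0)
            a[i] = 0;
    }
}
```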
By the way, you can get some benefit without completely eliminating the store, by redirecting it to a safe address. Instead of
if a[i] != 0 then a[i] := 0
do
ptr := a+i; if *ptr == 0 then ptr := &safe; *ptr := 0
You are still doing the store, but not dirtying so many cache lines. I have used this way of faking a conditional store instruction a lot. It is very unlikely that a compiler will do this sort of optimization.
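In C, the redirect trick above might look like the following sketch. The names `safe_slot` and `clear_redirected` are hypothetical. `safe_slot` is deliberately given external linkage so the compiler cannot simply discard the stores to it, but a compiler is still free to turn the pointer selection into either a conditional move or a branch, so check the generated code.

```c
#include <stddef.h>
#include <stdint.h>

/* Dummy target for redirected stores; it occupies a single cache line
   that stays hot, so writing to it repeatedly is cheap. */
uint32_t safe_slot;

void clear_redirected(uint32_t *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t *ptr = &a[i];
        if (*ptr == 0)
            ptr = &safe_slot;  /* already zero: redirect the store */
        *ptr = 0;              /* the store always executes */
    }
}
```

Every iteration still issues one store, but elements that are already zero no longer dirty their own cache lines.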
So, unfortunately, the answer is "it depends". If you are on a vector mask machine or a GPU, and silent stores are very common, say more than 30%, it is worth thinking about. In scalar code, you probably need more like 90% silent.
Ideally, measure it yourself. Although it can be hard to make realistic measurements.
I would start with what is probably the best case for this optimization:
char a[1024*1024*1024]; // zero filled
const int cachelinesize = 64;
for(char*p=a; p
Every store is eliminated here - make sure that the compiler still emits them. Good branch prediction, etc.
If this limit case shows no benefit, your realistic code is unlikely to.
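A minimal sketch of such a limit-case benchmark could look like this. It is not the original code from this answer: the function names, the `CACHE_LINE` constant, and the use of `clock()` are assumptions made for illustration. The buffer is accessed through a `volatile` pointer so the compiler cannot drop or merge the stores.

```c
#include <stddef.h>
#include <time.h>

enum { CACHE_LINE = 64 };  /* assumed cache line size in bytes */

/* Unconditionally store to one byte per cache line.
   Returns the number of stores performed. */
size_t store_always(volatile char *buf, size_t len)
{
    size_t stores = 0;
    for (size_t i = 0; i < len; i += CACHE_LINE) {
        buf[i] = 0;
        stores++;
    }
    return stores;
}

/* Store only when the byte is non-zero, so silent stores are skipped.
   Returns the number of stores actually performed. */
size_t store_if_nonzero(volatile char *buf, size_t len)
{
    size_t stores = 0;
    for (size_t i = 0; i < len; i += CACHE_LINE) {
        if (buf[i] != 0) {
            buf[i] = 0;
            stores++;
        }
    }
    return stores;
}

/* Time one pass in CPU seconds; the store count is returned via *stores. */
double time_pass(size_t (*pass)(volatile char *, size_t),
                 volatile char *buf, size_t len, size_t *stores)
{
    clock_t t0 = clock();
    *stores = pass(buf, len);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

To use it, allocate a zero-filled buffer much larger than the last-level cache (for example with `calloc`), then call `time_pass` once with each function and compare. On many systems freshly allocated zero pages are mapped lazily, so it is probably worth running one untimed `store_always` pass first so that page faults do not dominate the result.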
Come to think of it, I ran such a benchmark back in the last century. The silent store code was 2x faster, since it was totally memory bound, and the silent stores generate no dirty cache lines on a write-back cache. Recheck this, and then try it on a more realistic workload.
But first, measure whether you are memory bottlenecked or not.
By the way: if hardware implementations of silent store elimination become common, then you will never want to do it in software.
But at the moment I am aware of no hardware implementations of silent store elimination in commercially available CPUs.
As ECC becomes more common, silent store elimination becomes almost free - since you have to read the old bytes anyway to recalculate ECC in many cases.