I have this loop written in C++, that compiled with MSVC2010 takes a long time to run. (300ms)
for (int i=0; i<h; i++) {
for (int j=0; j<w; j++) {
if (buf[i*w+j] > 0) {
const int sy = max(0, i - hr);
const int ey = min(h, i + hr + 1);
const int sx = max(0, j - hr);
const int ex = min(w, j + hr + 1);
float val = 0;
for (int k=sy; k < ey; k++) {
for (int m=sx; m < ex; m++) {
val += original[k*w + m] * ds[k - i + hr][m - j + hr];
}
}
heat_map[i*w + j] = val;
}
}
}
It seemed a bit strange to me, so I did some tests then changed a few bits to inline assembly: (specifically, the code that sums "val")
for (int i=0; i<h; i++) {
for (int j=0; j<w; j++) {
if (buf[i*w+j] > 0) {
const int sy = max(0, i - hr);
const int ey = min(h, i + hr + 1);
const int sx = max(0, j - hr);
const int ex = min(w, j + hr + 1);
__asm {
fldz
}
for (int k=sy; k < ey; k++) {
for (int m=sx; m < ex; m++) {
float val = original[k*w + m] * ds[k - i + hr][m - j + hr];
__asm {
fld val
fadd
}
}
}
float val1;
__asm {
fstp val1
}
heat_map[i*w + j] = val1;
}
}
}
Now it runs in half the time, 150ms. It does exactly the same thing, but why is it twice as quick? In both cases it was run in Release mode with optimizations on. Am I doing anything wrong in my original C++ code?
I suggest you try different floating-point calculation models supported by the compiler - precise
, strict
or fast
(see /fp
option) - with your original code before making any conclusions. I suspect that your original code was compiled with some overly restrictive floating-point model (not followed by your assembly in the second version of the code), which is why the original is much slower.
In other words, if the original model was indeed too restrictive, then you were simply comparing apples to oranges. The two versions didn't really do the same thing, even though it might seem so at the first sight.
Note, for example, that in the first version of the code the intermediate sum is accumulated in a float
value. If it was compiled with precise
model, the intermediate results would have to be rounded to the precision of float
type, even if the variable val
was optimized away and the internal FPU register was used instead. In your assembly code you don't bother to round the accumulated result, which is what could have contributed to its better performance.
I'd suggest you compile both versions of the code in /fp:fast
mode and see how their performances compare in that case.
A few things to check out:
You need to check that is actually is the same code. As in, are your inline assembly statements exactly the same as those generated by the compiler? I can see three potential differences (potential because they may be optimised out). The first is the initial setting of val
to zero, the second is the extra variable val1
(unlikely since it will most likely just change the constant subtraction of the stack pointer), the third is that your inline assembly version may not put the interim results back into val
.
You need to make sure your sample space is large. You didn't mention whether you'd done only one run of each version or a hundred runs but, the more runs, the better, so as to remove the effect of "noise" in your statistics.
An even better measurement would be CPU time rather than elapsed time. Elapsed time is subject to environmental changes (like your virus checker or one of your services deciding to do something at the time you're testing). The large sample space will alleviate, but not necessarily solve, this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With