
Cast from float to double produces different results—same code, same compiler, same OS

Edit: See the end of the question for an update on the answer.

I have spent several weeks tracking down a very odd bug in a piece of software I maintain. Long story short, there is an old piece of software that is in distribution, and a new piece of software that needs to match the output of the old. The two rely (in theory) on a common library.[1] However, I cannot duplicate the results generated by the original version of the library, even though the source for the two versions matches. The actual code in question is very simple. The original version looked like this (the "voodoo" comment isn't mine):[2]

// float rstr[101] declared and initialized elsewhere as a global

void my_function() {
    // I have elided several declarations not used until later in the function
    double tt, p1, p2, t2;
    char *ptr;

    ptr = NULL;
    p2 = 0.0;
    t2 = 0.0; /* voooooodoooooooooo */

    tt = (double) rstr[20];
    p1 = (double) rstr[8];

    // The code goes on and does lots of other things ...
}

The last statement I have included is where the different behavior crops up. In the original program, rstr[8] has the value 101325., and after casting it to double[3] and assigning it, p1 has the value 101324.65625. Similarly, tt ends up with the value 373.149999999996. I have confirmed these values with both debug prints and by examining the values in the debugger (including checking the hex values). This is not surprising in any sense; it is exactly what one expects with floating-point values.

In a test wrapper around the same version of the library (as well as in any call to a refactored version of the library), the first assignment (to tt) produces the same results. However, p1 ends up as 101325.0, matching the original value in rstr[8]. This difference, while small, sometimes produces substantial variations in calculations that depend on the value of p1.

My test wrapper was simple, and matched the inclusion pattern of the original exactly, but eliminated all other context:

#include "the_header.h"

float rstr[101];
int main() {
    rstr[8] = 101325.;
    rstr[20] = 373.15;

    my_function();
}

Out of desperation, I have even gone to the trouble of looking at the disassembly generated by VC6.

4550:   tt = (double) rstr[20];
0042973F   fld         dword ptr [rstr+50h (006390a8)]
00429745   fstp        qword ptr [ebp-0Ch]
4551:   p1 = (double) rstr[8];
00429748   fld         dword ptr [rstr+20h (00639078)]
0042974E   fstp        qword ptr [ebp-14h]

The version generated by VC6 for the same library function when called by the test code wrapper (which matches the version generated by VC6 for my refactored version of the library):

60:       tt = (double) rstr[20];
00408BC8   fld         dword ptr [_rstr+50h (0045bc88)]
00408BCE   fstp        qword ptr [ebp-0Ch]
61:       p1 = (double) rstr[8];
00408BD1   fld         dword ptr [_rstr+20h (0045bc58)]
00408BD7   fstp        qword ptr [ebp-14h]

The only difference I can see, besides where in memory the array is stored and how far along through the program this is occurring, is the leading _ on the reference to rstr in the second. In general, VC6 uses a leading underscore for name-mangling with functions, but I cannot find any documentation of it doing the same with global arrays. Nor can I see why these would produce different results in any case, unless that name-mangling somehow changes how the data behind the pointers is read.

The only other difference I can identify between the two (apart from calling context) is that the original is an MFC-based Win32 application, while the latter is a non-MFC console application. The two are otherwise configured the same way, and they are built with identical compilation flags and against the same C runtime.

Any suggestions would be much appreciated.


Edit: the solution, as several answers very helpfully pointed out, was to examine the binary/hex values and compare them to make sure the things I thought were exactly the same in fact were the same. This proved not to be the case—my strong protestations to the contrary notwithstanding.

Here I get to eat some humble pie and admit that while I thought I had checked those values, I had in fact checked some other, closely related values—a point I discovered only when I went back to look at the data again. As it turned out, the values being set in rstr[8] were very slightly different, and so the conversion to double highlighted the very slight differences, and these differences then propagated throughout the program in just the way I noted.

The discrepancy in the initialization I can explain based on the way the two programs work. Specifically, in one case rstr[8] is specified based on user input to a GUI (and is in this case also the product of a conversion calculation), whereas in the other, it is read in from a file where it has been stored with some loss of precision. Interestingly, in neither case was it actually exactly 101325.0, even in the case in which it was read from a file where it had been stored as 1.01325e5.

This will teach me to double check my double checking of these sorts of things. Many thanks to Eric Postpischil and unwind for prompting me to check it again and for the prompt feedback. It was very helpful.


Footnotes

  1. In actuality, the original "library" was a header file with all the implementations done inline. The header was pulled in via #include and the functions referenced via extern statements. I have fixed this in a refactored version of the library that is actually a library, but see the rest of the question.
  2. Note that the variable names aren't mine, and are terrible. Likewise with the use of global variables, which is rampant in this piece of software. I left in the /* voooooodoooooooooo */ comment because it illustrates the… unusual… programming practices of my predecessor. I think that element is present because this was originally translated from Fortran and the developer had used it as a means of dealing with some sort of memory bug. The line has no effect whatsoever on the actual behavior of the code.
  3. I am well aware that there doesn't actually need to be a cast here, but this is how the original library worked, and I cannot modify it.
Chris Krycho asked Jan 12 '23 20:01

1 Answer

This:

In the original program, rstr[8] has the value 101325., and after casting it to double[3] and assigning it, p1 has the value 101324.65625

implies that the float value is not, in fact, exactly 101325.0, so when you convert to double you see more of the precision. I would (highly) suspect the method by which you inspect the float value: automatic (implicit and silent) rounding when printing is very common with floats. Inspect the bit pattern and decode it using the known format of the float on your system, to make sure you're not being tricked.

unwind answered Jan 29 '23 09:01