My Ph.D. student and I have encountered a problem in a physics data analysis context that I could use some insight on. We have code that analyzes data from one of the LHC experiments and gives irreproducible results. In particular, the results of calculations obtained from the same binary, run on the same machine, can differ between successive executions. We are aware of the many different sources of irreproducibility, but have excluded the usual suspects.
We have tracked the problem down to irreproducibility of (double precision) floating point comparison operations when comparing two numbers that nominally have the same value. This can happen occasionally as a result of prior steps in the analysis. We just found an example that tests whether a number is less than 0.3 (note that we NEVER test for equality between floating point values). It turns out that, due to the geometry of the detector, the calculation could occasionally produce a result that is exactly 0.3 (or its closest double precision representation).
We are well aware of the pitfalls in comparing floating point numbers, and also of the potential for excess precision in the FPU to affect the results of the comparison. The question I would like to have answered is "why are the results irreproducible?" Is it because the FPU register load or other FPU instructions are not clearing the excess bits, so that "leftover" bits from previous calculations are affecting the results? (This seems unlikely.) I saw a suggestion on another forum that context switches between processes or threads could also change floating point comparison results, because the contents of the FPU are stored on the stack and thus truncated. Any comments on these or other possible explanations would be appreciated.
At a guess, what's happening is that your computations are normally being carried out to a few extra bits of precision inside the FPU, and only rounded at specific points (e.g., when you assign a result to a variable).
When there's a context switch, however, the state of the FPU has to be saved and restored -- and there's at least a pretty fair chance that those extra bits are not being saved and restored in the context switch. When it happens, that probably wouldn't cause a major change, but if (for example) you later subtract off a fixed amount from each and multiply what's left, the difference would be multiplied as well.
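To illustrate the "subtract a fixed amount and multiply what's left" point, here is a rough sketch. The helper names, the offset 0.25 and the scale 1000 are made up for illustration and are not taken from the analysis code in the question. It takes two doubles that differ only in the last bit near 0.3 and shows how far apart they end up after the same offset-and-scale step.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: the next representable double above x.
double one_ulp_above(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity());
}

// Hypothetical helper: apply the same "subtract an offset, then scale" step
// to two inputs and return the absolute distance between the two results.
double gap_after_offset_and_scale(double x, double y, double offset, double scale) {
    assert(std::isfinite(x) && std::isfinite(y));  // sketch only handles finite inputs
    const double fx = (x - offset) * scale;
    const double fy = (y - offset) * scale;
    return std::fabs(fy - fx);
}
```

With `x = 0.3` and `y = one_ulp_above(x)`, the inputs differ by one unit in the last place, while `gap_after_offset_and_scale(x, y, 0.25, 1000.0)` should come out roughly a thousand times larger. The relative difference grows as well, because the subtraction removes most of the leading digits the two values had in common.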
To be clear: I doubt that "left over" bits would be the culprit. Rather, it would be loss of extra bits causing rounding at slightly different points in the computation.
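Here is a small sketch of how the rounding point alone can flip a "less than" test. It is a simulation, not the asker's code: `float` stands in for the declared type and `double` stands in for the wider FPU register, so that the effect shows up on any machine regardless of whether extended precision is in use. All function names are hypothetical.

```cpp
#include <cassert>

// Intermediate result stays in the wider type until after the comparison.
bool below_threshold_wide(float a, float b, float threshold) {
    const double sum = static_cast<double>(a) + static_cast<double>(b);
    return sum < static_cast<double>(threshold);
}

// Intermediate result is rounded to the narrow type before the comparison.
// 'volatile' forces a store to memory, so the rounding really happens even
// on a build that would otherwise keep the value in a wider register.
bool below_threshold_narrow(float a, float b, float threshold) {
    volatile float sum = a + b;
    return sum < threshold;
}

// Same inputs, same threshold, opposite answers.
void demo_rounding_point() {
    assert(below_threshold_wide(0.1f, 0.2f, 0.3f));     // wide sum is slightly below 0.3f
    assert(!below_threshold_narrow(0.1f, 0.2f, 0.3f));  // rounded sum is exactly 0.3f
}
```

The exact sum of `0.1f` and `0.2f` lies just under `0.3f`, but the nearest `float` to that sum is `0.3f` itself. So the wide version reports "below threshold" and the narrow one does not. If something outside your control decides whether a value gets rounded before the comparison, the result of the comparison would change with it.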
What platform?
Most FPUs can internally store more precision than the IEEE double representation, to avoid rounding error in intermediate results. There is often a compiler switch to trade speed for accuracy; see http://msdn.microsoft.com/en-us/library/e7s85ffb(VS.80).aspx
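As a starting point for answering the platform question, something like the following can report how a given build evaluates floating point expressions. `FLT_EVAL_METHOD` and `std::numeric_limits` are standard; the function names and the wording of the descriptions are just my own for this sketch.

```cpp
#include <cassert>
#include <cfloat>
#include <limits>
#include <string>

// Value of the standard FLT_EVAL_METHOD macro for this build.
int current_eval_method() {
    return FLT_EVAL_METHOD;
}

// Human-readable summary of the standard FLT_EVAL_METHOD values.
std::string describe_eval_method(int method) {
    switch (method) {
        case 0:  return "each operation is evaluated in its declared type";
        case 1:  return "float and double operations are evaluated as double";
        case 2:  return "all operations are evaluated as long double";
        default: return "indeterminate or implementation-defined";
    }
}

// How many more significand bits long double carries than double
// (0 on platforms where the two types have the same format).
int extra_significand_bits() {
    const int extra = std::numeric_limits<long double>::digits
                    - std::numeric_limits<double>::digits;
    assert(extra >= 0);  // long double is never narrower than double
    return extra;
}
```

Printing `describe_eval_method(current_eval_method())` and `extra_significand_bits()` from the same build as the analysis code should indicate whether intermediate results can carry excess precision at all. A value of 0 suggests that excess precision is not in play and the cause lies elsewhere.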