Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to hunt a Heisenbug

Recently, we received a bug report from one of our users: something on the screen was displayed incorrectly in our software. Somehow, we could not reproduce this in our development environment (Delphi 2007).

After some further study, it appears that this bug only manifests itself when "Code optimization" is turned on.

Are there any people here with experience in hunting down such a Heisenbug? Any specific constructs or coding bugs that commonly cause such an issue in Delphi software? Any places you would start looking?

I'll also just start debugging the whole thing in the usual way, but any tips specific to Optimization-related bugs (*) would be more than welcome!

(*) Note: I don't mean to say that the bug is caused by the optimizer; I think it's much more likely some wonky construct in the code is somehow pushed "over the edge" by the optimizer.

Update

It seems the bug boils down to a record being fully initialized with zeros when there's no code optimization, and the same record containing some random data when there is optimization. In this case, the random data seems to cause an enum type to contain invalid data (to my great surprise!).

Solution

The solution turned out to involve an unitialized local record variable somewhere deep in the code. Apparently, without optimization the record was reset (heap?), and with optimization turned on, the record was filled with the usual garbage. Thanks to you all for your contributions --- I learned a lot along the way!

like image 581
onnodb Avatar asked Oct 15 '09 18:10

onnodb


4 Answers

Typically bugs of this form are caused by invalid memory access (reading uninitialised data, reading off the end of a buffer...) or thread race conditions.

The former will be affected by optimisations causing data layout to be rearranged in memory, and/or possibly by debug code that initialises newly allocated memory to some value; causing the incorrect code to "accidentally work".

The latter will be affected due to timings changing between optimisation levels. The former is generally much more likely.

If you have some automated way of making freshly allocated memory be filled with some constant value before it is passed to the program, and this makes the crash go away or become reproducible in the debug build, that'll provide a good point to start chasing things.

like image 160
moonshadow Avatar answered Sep 20 '22 14:09

moonshadow


Could very well be a memory vs register issue: you programm running fine relying on memory persistence after a free.
I would recommend running your application with FastMM4 in full debug mode to be sure of your memory management.
Another (not free) tool which can be very useful in a case like this is Eurekalog.

Another thing that I've seen: a crash with the FPU registers being botched when calling some outside code (DLL, COM...) while with the debugger everything was OK.

like image 35
Francesca Avatar answered Sep 20 '22 14:09

Francesca


A record that contains different data according to different compiler settings tells me one thing: That the record is not being explicitly initialised.

You may find that the setting of the compiler optimization flag is only one factor that might affect the content of that record - with any uninitialised data structures the one thing that you can rely on is that you can't rely on the initial content of the structure.

In simple terms:

  • class member data is initialised (to zero's) for new instances of the class

  • local variables (in functions and procedures) and unit variables are NOT initialised except in a few specific cases: interface references, dynamic arrays and strings and I think (but would need to check) records if they contain one or more fields of those types that would be initialised (strings, interface references etc).

The question as stated is now a little misleading because it seems you found your "Heisenberg" fairly easily enough. Now the issue is how to deal with it, and the answer is simply to explicitly initialise your record so that you aren't reliant on whatever behaviour or side-effect of the compiler is sometimes taking care of that for you and sometimes not.

like image 38
Deltics Avatar answered Sep 19 '22 14:09

Deltics


Especially in purely native languages, like Delphi, you should be more than careful not to abuse the freedom to be able to cast anything to anything. IOW: One thing, I have seen is that someone copies the definition of a class (e.g. from the implementation section in RTL or VCL) into his own code and then cast instances of the original class to his copy. Now, after upgrading the library where the original class came from, you might experience all kinds of weird stuff. Like jumping into the wrong methods or bufferoverflows.

There's also the habit of using signed integer as pointers and vice-versa. (Instead of cardinal) this works perfectly fine as long as your process has only 2GB of address space. But boot with the /3GB switch and you will see a lot of apps that start acting crazy. Those made the assumption of "pointer=signed integer" at least somewhere. Your customer uses a 64Bit Windows? Chances are, he might have a larger address space for 32Bit apps. Pretty tough to debug w/o having such a test system available.

Then, there's race conditions. Like having 2 threads, where one is very, very slow. So that you instinctively assume it will always be the last one and so there's no code that handles the scenario where "Captn slow" finishes first. Changes in the underlying technologies can make these assumptions very wrong, very fast indeed. Take a look at the upcoming breed of Flash-based super-mega-fast server storage. Systems that can read and write Gigabytes per second. Applications that assume the IO stuff to be significantly slower than some calculations on in-memory values will easily fail on this kind of fast storage.

I could go on and on, but I gotta run right now... Cheers

like image 30
Robert Giesecke Avatar answered Sep 20 '22 14:09

Robert Giesecke