
Describing and finding a state-corrupting bug which causes seemingly random crashes

Tags:

c++

debugging

I am currently facing one of the most evil bugs I have ever encountered in a large, complex project my team is working on. We are using C++ as our programming language and currently Visual Studio for development, although the end product is intended to run cross-platform.

The bug:

There is a bug in our system which triggers crashes at seemingly random points of execution. The crashes are usually read access violations at addresses which change every time the program is executed. Sometimes we get heap corruption errors too. The call stacks lead us to varying points in our codebase, and rarely into some external libraries (Lua in our case), where the bug clearly doesn't lie.

It seems as if this bug has been developing over the last 4 months. Roughly that long ago, some of my team members saw the frontend program crash in ways and at locations very similar to what happens now.

Some more details:

Our codebase is roughly 800K lines of pure C++ (comments excluded) and was developed over the course of 3 years. The current project weighs roughly 300K. We have used extensive unit testing and other measures to eliminate bugs before they happen, such as assertions, smart pointers, and so on.

The others and I have been trying to find this bug (or these bugs) for over 2 weeks now. It is becoming more than a nightmare for me. In such a complex project, even good old printf debugging seems to fail in the face of the complexity involved.

My questions

  • What kind of bug are we facing here? Is there even a name for this? Does this kind of bug occur more or less often in other, large projects?

  • What can we do to find and eliminate it after having spent 2 weeks of fruitless debugging using various utilities, on various platforms and with various build settings?

(My previous question was closed, so I am trying to formulate it better and with more details this time, link: https://stackoverflow.com/questions/7154645/how-is-this-kind-of-bug-called)

Thaddeus asked Aug 22 '11 23:08



1 Answer

The symptoms you describe are typical of heap corruption (not all heap corruptions are reported as such with an error message!). You will need to audit the lifetime of all objects in your program: make sure you're not freeing things twice or using them after freeing them, and make sure you're not overflowing any buffers. You may want to take this opportunity to use smart pointers such as std::unique_ptr and std::shared_ptr (or the Boost equivalents, e.g. boost::scoped_ptr and boost::shared_ptr) to automate parts of your heap management.
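As a rough illustration of what "automating heap management" can look like, here is a minimal sketch. The `Texture` type and the function names are hypothetical and not taken from the project in question; the only library facilities used are the standard `std::unique_ptr`, `std::shared_ptr` and `std::weak_ptr`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical resource type, used only for illustration.
struct Texture {
    std::string name;
    explicit Texture(std::string n) : name(std::move(n)) {}
};

// Sole ownership: the Texture is destroyed exactly once, when the owning
// unique_ptr goes out of scope or is reset. No caller has to remember to
// delete it, which removes one common source of double frees.
std::unique_ptr<Texture> load_texture(const std::string& name) {
    assert(!name.empty());  // precondition chosen for this sketch
    return std::unique_ptr<Texture>(new Texture(name));
}

// Non-owning observer: a weak_ptr can be asked whether the object is still
// alive, so a stale reference becomes a condition you can check instead of
// a use-after-free.
bool is_alive(const std::weak_ptr<Texture>& observer) {
    return !observer.expired();
}
```

A `std::weak_ptr` obtained from a `std::shared_ptr` reports `expired()` once the last owner has released the object, which is the kind of check a raw pointer cannot offer. Smart pointers do nothing about buffer overflows, so those still need a separate audit.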

If you're on Linux or Mac OS, try running your program under Valgrind. It will detect many heap errors (invalid reads and writes, use after free, double frees), although it is considerably weaker at catching overruns of stack arrays. On Windows, use Application Verifier; it can help make the errors cause a crash closer to the point where they really occurred.

If you are using threads, a race condition leading to heap corruption is another possibility. Audit your locking mechanisms as well.
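To make the locking point concrete, here is a small sketch of a container shared between threads. The `SharedLog` class and `fill_concurrently` helper are made up for this example. Without the mutex, concurrent `push_back` calls on the same `std::vector` are a data race and can corrupt the heap when the vector reallocates.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared container, used only for illustration.
class SharedLog {
public:
    void append(int value) {
        // Every access to entries_ goes through the same mutex.
        std::lock_guard<std::mutex> lock(mutex_);
        entries_.push_back(value);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return entries_.size();
    }

private:
    mutable std::mutex mutex_;  // mutable so size() can lock while const
    std::vector<int> entries_;
};

// Starts `threads` workers that each append `per_thread` values, waits for
// all of them, and returns the final number of entries.
std::size_t fill_concurrently(SharedLog& log, int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&log, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                log.append(i);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return log.size();
}
```

When auditing, the thing to look for is any path that touches the shared data without taking the lock, including read-only accessors like `size()` above.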

If you can easily reproduce this bug, and have a source control system in place, consider a bisection to determine exactly when it was introduced as well. That is, perform a binary search on your source code history to find the first commit with the bug. Git has a tool to do this automatically - git-bisect - you can import a copy of your repository into git to run this tool if you're not using git already.
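The idea behind bisection is an ordinary binary search over revisions. The sketch below is not how git implements it; it only shows the principle. It assumes revisions are numbered in order, that revision `good` does not show the bug, that revision `bad` does, and that once the bug appears it stays in every later revision. The `is_bad` callback is a placeholder for "check out that revision, build it, and see whether it crashes".

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Returns the first revision for which is_bad() is true, searching between
// a known-good and a known-bad revision.
std::size_t first_bad_revision(std::size_t good, std::size_t bad,
                               const std::function<bool(std::size_t)>& is_bad) {
    assert(good < bad);
    // Invariant: `good` is known good, `bad` is known bad.
    while (bad - good > 1) {
        const std::size_t mid = good + (bad - good) / 2;
        if (is_bad(mid)) {
            bad = mid;   // the bug was introduced at or before mid
        } else {
            good = mid;  // the bug was introduced after mid
        }
    }
    return bad;
}
```

Each test halves the remaining range, so about a thousand revisions need roughly ten builds. For a crash that only shows up intermittently, `is_bad` would have to run the program several times before declaring a revision good, otherwise the search can be led astray.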

Also, see if you can disable parts of your program (prevent the code in question from being called at all) in an attempt to narrow down the problem. Note that this may have false positives - if you disable module X and it stops crashing, it could mean that module X is corrupting the heap, or it could mean that module W corrupted the heap and module X just happens to be good at noticing it.

bdonlan answered Oct 19 '22 20:10
