<p>The question says it all. If you have a bug that multiple users report, but there is no record of the bug occurring in the log, and the bug cannot be repeated no matter how hard you try, how do you fix it? Can you even fix it?</p> <p>I am sure this has happened to many of you out there. What did you do in this situation, and what was the final outcome?</p> <hr> <p>Edit: I am more interested in what was done about an <em>unfindable</em> bug, not an unresolvable bug. With an unresolvable bug, you at least know that there is a problem and, in most cases, have a starting point for searching for it. In the case of an unfindable one, what do you do? Can you even do anything at all?</p>

<h3>Language</h3> <p>Different programming languages have their own flavour of bugs.</p> <h3>C</h3> <p>Adding debug statements can make the problem impossible to duplicate, because the debug statement itself can shift pointers far enough to avoid a SEGFAULT. Bugs that disappear when you try to observe them are known as Heisenbugs. Pointer issues are arduous to track and replicate, but debuggers such as GDB and DDD can help.</p> <h3>Java</h3> <p>A multi-threaded application might only show its bugs under a very specific timing or sequence of events. Improper concurrency implementations can cause deadlocks in situations that are difficult to replicate.</p> <h3>JavaScript</h3> <p>Some web browsers are notorious for memory leaks. JavaScript code that runs fine in one browser might behave incorrectly in another. Using third-party libraries that have been rigorously tested by thousands of users can help avoid certain obscure bugs.</p> <h3>Environment</h3> <p>Depending on the complexity of the environment in which the buggy application is running, the only recourse might be to simplify the environment. 
Does the application run:</p> <ul> <li>on a server?</li> <li>on a desktop?</li> <li>in a web browser?</li> </ul> <p>In what environment does the application produce the problem?</p> <ul> <li>development?</li> <li>test?</li> <li>production?</li> </ul> <p>Exit extraneous applications, kill background tasks, stop all scheduled events (cron jobs), eliminate plug-ins, and uninstall browser add-ons.</p> <h3>Networking</h3> <p>Networking is essential to so many applications, so check the following:</p> <ul> <li>Ensure stable network connections, including wireless signals.</li> <li>Does the software reconnect robustly after network failures?</li> <li>Are all connections closed properly so that file descriptors are released?</li> <li>Are people using the machine who shouldn't be?</li> <li>Are rogue devices interacting with the machine's network?</li> <li>Are there factories or radio towers nearby that can cause interference?</li> <li>Do packet sizes and frequency fall within nominal ranges?</li> <li>Are packets being monitored for loss?</li> <li>Are all network devices adequate for heavy bandwidth usage?</li> </ul> <h3>Consistency</h3> <p>Eliminate as many unknowns as possible:</p> <ul> <li>Isolate architectural components.</li> <li>Remove non-essential or possibly conflicting elements.</li> <li>Deactivate different application modules.</li> </ul> <p>Remove all differences between production, test, and development. Use the same hardware. Follow exactly the same steps to set up the computers. Consistency is key.</p> <h3>Logging</h3> <p>Use liberal amounts of logging to correlate when events happened. 
Examine logs for any obvious errors, timing issues, etc.</p> <h3>Hardware</h3> <p>If the software seems okay, consider hardware faults:</p> <ul> <li>Are the physical network connections solid?</li> <li>Are there any loose cables?</li> <li>Are chips seated properly?</li> <li>Do all cables have clean connections?</li> <li>Is the working environment clean and free of dust?</li> <li>Have any hidden devices or cables been damaged by rodents or insects?</li> <li>Are there bad blocks on drives?</li> <li>Are the CPU fans working?</li> <li>Can the motherboard power all components? (CPU, network card, video card, drives, etc.)</li> <li>Could electromagnetic interference be the culprit?</li> </ul> <p>And mostly for embedded:</p> <ul> <li>Insufficient supply bypassing?</li> <li>Board contamination?</li> <li>Bad solder joints / bad reflow?</li> <li>CPU not reset when supply voltages are out of tolerance?</li> <li>Bad resets because supply rails are back-powered from I/O ports and don't fully discharge?</li> <li>Latch-up?</li> <li>Floating input pins?</li> <li>Insufficient (sometimes negative) noise margins on logic levels?</li> <li>Insufficient (sometimes negative) timing margins?</li> <li> Tin whiskers?</li> <li>ESD damage?</li> <li>ESD upsets?</li> <li>Chip errata?</li> <li>Interface misuse (e.g. I2C off-board or in the presence of high-power signals)?</li> <li>Race conditions?</li> <li>Counterfeit components?</li> </ul> <h3>Network vs. Local</h3> <p>What happens when you run the application locally (i.e., not across the network)? Are other servers experiencing the same issues? Is the database remote? 
Can you use a local database?</p> <h3>Firmware</h3> <p>In between hardware and software is firmware.</p> <ul> <li>Is the computer BIOS up-to-date?</li> <li>Is the BIOS battery working?</li> <li>Are the BIOS clock and system clock synchronized?</li> </ul> <h3>Time and Statistics</h3> <p>Timing issues are difficult to track:</p> <ul> <li>When does the problem happen?</li> <li>How frequently?</li> <li>What other systems are running at that time?</li> <li>Is the application time-sensitive (e.g., will leap days or leap seconds cause issues)?</li> </ul> <p>Gather hard numerical data on the problem. A problem that might, at first, appear random, might actually have a pattern.</p> <h3>Change Management</h3> <p>Sometimes problems appear after a system upgrade.</p> <ul> <li>When did the problem first start?</li> <li>What changed in the environment (hardware and software)?</li> <li>What happens after rolling back to a previous version?</li> <li>What differences exist between the problematic version and good version?</li> </ul> <h3>Library Management</h3> <p>Different operating systems have different ways of distributing conflicting libraries:</p> <ul> <li>Windows has <em>DLL Hell</em>.</li> <li>Unix can have numerous broken symbolic links.</li> <li>Java library files can be equally nightmarish to resolve.</li> </ul> <p>Perform a fresh install of the operating system, and include only the supporting software required for your application.</p> <h3>Java</h3> <p>Make sure every library is used only once. Sometimes application containers have a different version of a library than the application itself. This might not be possible to replicate in the development environment.</p> <p>Use a library management tool such as Maven or Ivy.</p> <h3>Debugging</h3> <p>Code a detection method that triggers a notification (e.g., log, e-mail, pop-up, pager beep) when the bug happens. Use automated testing to submit data into the application. Use random data. 
Use data that covers known and possible edge cases. Eventually the bug should reappear.</p> <h3>Sleep</h3> <p>It is worth reiterating what others have mentioned: sleep on it. Spend time away from the problem and finish other tasks (like documentation). Be physically distant from computers and get some exercise.</p> <h3>Code Review</h3> <p>Walk through the code, line by line, and describe what every line does to yourself, a co-worker, or a rubber duck. This may lead to insights on how to reproduce the bug.</p> <h3>Cosmic Radiation</h3> <p>Cosmic rays can flip bits. This is not as big a problem as it was in the past, thanks to modern error checking of memory. Software for hardware that leaves Earth's protection is subject to issues that simply cannot be replicated, due to the randomness of cosmic radiation.</p> <h3>Tools</h3> <p>Sometimes, albeit infrequently, the compiler will introduce a bug, especially with niche tools (e.g. a C micro-controller compiler suffering from a symbol table overflow). Is it possible to use a different compiler? Could any other tool in the tool-chain be introducing issues?</p>
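<p>The advice under Debugging above (automated random and edge-case input, plus a notification when the bug fires) might be sketched roughly as follows. This is only an illustration under assumptions: <code>parse_quantity</code> is a hypothetical stand-in for whatever code is under suspicion, and the edge cases and seed value are invented. Logging the seed with each failing input means a failure seen once can be regenerated exactly.</p>

```python
import logging
import random

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("bughunt")


def parse_quantity(text):
    """Hypothetical function under test; stands in for the suspect code path."""
    return int(text.strip())


def hunt(seed, rounds=500):
    """Feed random and edge-case inputs; return (seed, round, input) for each failure."""
    rng = random.Random(seed)  # private generator: the seed fully determines the inputs
    edge_cases = ["0", "-1", " 42 ", "", "1e3"]  # illustrative edge cases only
    failures = []
    for i in range(rounds):
        if rng.randrange(5) == 0:  # roughly one input in five is a known edge case
            text = rng.choice(edge_cases)
        else:
            text = str(rng.randint(-10**6, 10**6))
        try:
            parse_quantity(text)
        except Exception as exc:
            # The "notification": here a log line, but it could be an e-mail or a pager call.
            log.warning("seed=%d round=%d input=%r raised %r", seed, i, text, exc)
            failures.append((seed, i, text))
    return failures


failures = hunt(seed=12345)
```

<p>Calling <code>hunt</code> again with a seed taken from the log replays the same inputs in the same order, which turns a one-off failure into something that can be stepped through in a debugger.</p>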

<p>If it's a GUI app, it's <strong>invaluable</strong> to watch the customer generate the error (or try to). They'll no doubt be doing something you'd never have guessed they were doing (not wrongly, just differently).</p> <p>Otherwise, concentrate your logging in that area. Log almost everything (you can pull it out later) and get your app to dump its environment as well, e.g. machine type, VM type, encoding used.</p> <p>Does your app report a version number, a build number, etc.? You need this to determine precisely which version you're debugging (or not!).</p> <p>If you can instrument your app (e.g. by using JMX if you're in the Java world), then instrument the area in question. Store stats, e.g. requests and their parameters, the time each was made, etc. Make use of buffers to store the last 'n' requests/responses/object versions/whatever, and dump them out when the user reports an issue.</p>
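<p>A minimal sketch of the "last 'n' requests" buffer described above, assuming a single-process application. The class name <code>RecentEvents</code>, its methods, and the sample request fields are all hypothetical and chosen only for illustration.</p>

```python
import json
import time
from collections import deque


class RecentEvents:
    """Keep only the most recent `capacity` events in memory."""

    def __init__(self, capacity=100):
        # A deque with maxlen discards the oldest entry once it is full.
        self._events = deque(maxlen=capacity)

    def record(self, kind, **details):
        """Store one event with a timestamp and arbitrary details."""
        self._events.append({"time": time.time(), "kind": kind, "details": details})

    def dump(self):
        """Return the buffered events, oldest first, as a JSON string."""
        return json.dumps(list(self._events), indent=2, default=str)


recent = RecentEvents(capacity=3)
for request_id in range(5):
    recent.record("request", id=request_id, path="/orders")

# Only the three most recent requests are still held in the buffer.
print(recent.dump())
```

<p>Because the buffer is bounded, it can stay switched on in production at little cost, and <code>dump()</code> can be wired to whatever "report a problem" action the application already has.</p>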

How do you fix a bug you can't replicate?

Tags:

debugging

replicate

939

asked Aug 12 '09 19:08

cdeszaq

2 Answers

104

answered Sep 20 '22 06:09

26 revs, 2 users 91%

answered Sep 18 '22 06:09

Brian Agnew
