How do you debug a bug that only appears under heavy load?

We are currently developing cluster manager software in C. When only a few nodes connect to the manager, it works perfectly, but when we use tools to simulate 1000 nodes connecting to the manager, it sometimes behaves in unexpected ways.

How can one debug this kind of bug, which only appears when the load (number of connections/nodes) is large?

If I step through the code in gdb, the application never malfunctions.

asked Oct 03 '17 09:10

Sato

2 Answers

How to debug this kind of bug?

In general, you want to use at least these techniques:

Make sure the code compiles and links without warnings. -Wall is a good start; adding -Wextra is better.
Make sure the application has designed-in logging and tracing that can be turned on or off, has enough detail to debug these kinds of issues, and has low overhead.
Make sure the code has good unit-test coverage.
Make sure the tests are sanitizer-clean.
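
As a minimal sketch of what "designed-in logging that can be turned on or off" might look like, the macro below checks a level threshold before evaluating its arguments, so disabled log statements cost little more than one atomic load. All names here (`LOG`, `log_set_level`, `log_enabled`, `log_write`, the `LOG_*` levels) are hypothetical and not taken from the asker's code; the sketch assumes a C11 compiler for `<stdatomic.h>`.

```c
#include <stdarg.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical log levels, ordered from least to most verbose. */
enum log_level { LOG_OFF = 0, LOG_ERROR, LOG_INFO, LOG_TRACE };

/* Current threshold. Atomic so it can be changed while worker threads run. */
static atomic_int g_log_level = LOG_ERROR;

static void log_set_level(enum log_level lvl)
{
    atomic_store(&g_log_level, lvl);
}

static int log_enabled(enum log_level lvl)
{
    return lvl <= atomic_load_explicit(&g_log_level, memory_order_relaxed);
}

/* Formats the whole line into a local buffer and writes it with one
 * fprintf call, so a line is handed to stdio in a single piece. */
static void log_write(const char *file, int line, const char *fmt, ...)
{
    char buf[512];
    int n = snprintf(buf, sizeof buf, "%s:%d: ", file, line);
    if (n < 0)
        return;
    if ((size_t)n < sizeof buf) {
        va_list ap;
        va_start(ap, fmt);
        vsnprintf(buf + n, sizeof buf - (size_t)n, fmt, ap);
        va_end(ap);
    }
    fprintf(stderr, "%s\n", buf);
}

/* The arguments are only evaluated when the level is enabled. */
#define LOG(lvl, ...)                                      \
    do {                                                   \
        if (log_enabled(lvl))                              \
            log_write(__FILE__, __LINE__, __VA_ARGS__);    \
    } while (0)
```

With this in place, a call such as `LOG(LOG_TRACE, "node %d connected", id)` can stay in the code permanently, and `log_set_level(LOG_TRACE)` can be called (for example from a signal handler flag or an admin command) only while reproducing the problem under load.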

there's also no warning in valgrind check.

It's not clear whether you've simply run the target application under Valgrind, or whether you also have unit tests and those tests are Valgrind-clean. It's also not clear whether you've observed the application misbehaving under Valgrind or not.

Valgrind used to be the best tool available for heap and uninitialized memory problems, but in 2017 this is no longer the case.

Compiler-based Address, Thread and Memory sanitizers catch a significantly wider class of errors (e.g. global and stack overflows, and data races), and you should run your unit tests under all of them.
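
To illustrate the kind of bug these tools target, here is a small sketch of a shared counter that is incremented either with or without a mutex. The names (`struct counter`, `worker`, `run_counter`, `MAX_THREADS`) are made up for this example. The build commands in the comment use the `-fsanitize=thread` and `-fsanitize=address` options of GCC and Clang; the file name is an assumption.

```c
/* Possible builds (race.c is a hypothetical file name):
 *   cc -g -O1 -fsanitize=thread  -pthread race.c    (ThreadSanitizer)
 *   cc -g -O1 -fsanitize=address -pthread race.c    (AddressSanitizer)
 */
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64

struct counter {
    long value;
    pthread_mutex_t lock;
    int use_lock;       /* 0 = racy increment, 1 = mutex-protected */
    long iterations;
};

static void *worker(void *arg)
{
    struct counter *c = arg;
    for (long i = 0; i < c->iterations; i++) {
        if (c->use_lock) {
            pthread_mutex_lock(&c->lock);
            c->value++;
            pthread_mutex_unlock(&c->lock);
        } else {
            c->value++; /* data race: unsynchronized read-modify-write */
        }
    }
    return NULL;
}

/* Runs nthreads workers and returns the final count, or -1 on failure. */
static long run_counter(int nthreads, long iterations, int use_lock)
{
    pthread_t tids[MAX_THREADS];
    struct counter c = { .value = 0, .use_lock = use_lock,
                         .iterations = iterations };
    int started = 0;
    long result;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    if (pthread_mutex_init(&c.lock, NULL) != 0)
        return -1;

    for (; started < nthreads; started++)
        if (pthread_create(&tids[started], NULL, worker, &c) != 0)
            break;

    /* Join only the threads that were actually created. */
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    result = (started == nthreads) ? c.value : -1;
    pthread_mutex_destroy(&c.lock);
    return result;
}
```

Calling `run_counter(8, 100000, 0)` from `main` in a ThreadSanitizer build should produce a data race report on `value`, even on runs where the final count happens to come out right. That is the property that makes the sanitizers useful for bugs that only show up occasionally under load.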

When all of the above still fails to find the problem, you may be able to run the real application instrumented with sanitizers.

Lastly, there are tools like GDB tracing and SystemTap. They are harder to learn, but give you significant power. Overview here.

answered Nov 05 '22 04:11

Employed Russian

Sadly, the debugger is less useful for debugging concurrency/load issues.

Keep adding logs/printfs, trigger the issue with load testing, then try to narrow it down with more logs/printfs. Repeat.

The faster you can trigger the bug, the faster this will converge. Also prefer the classic "bisection" / "binary search" technique when adding logs: try to cut the area you're looking at by at least half every time.
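
One way to apply the bisection idea is sketched below: a checkpoint macro tests an invariant at chosen points and logs the first location where it is found broken. If the invariant still holds at a checkpoint in the middle of the suspect path, the corruption happens after it; if not, before it. Everything here (`CHECKPOINT`, `checkpoint_fail`, `struct node_table`, `table_ok`, `handle_connect`) is hypothetical and only stands in for whatever state and code path the real application has.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Total number of failed checkpoints, and a flag so that only the
 * first failure that gets reported is printed. */
static atomic_int g_failures;
static atomic_flag g_reported = ATOMIC_FLAG_INIT;

static int checkpoint_fail(const char *file, int line, const char *expr)
{
    atomic_fetch_add(&g_failures, 1);
    if (!atomic_flag_test_and_set(&g_reported))
        fprintf(stderr, "first broken invariant at %s:%d: %s\n",
                file, line, expr);
    return 0;
}

/* Evaluates to 1 if the invariant holds, 0 (after recording it) if not. */
#define CHECKPOINT(cond) \
    ((cond) ? 1 : checkpoint_fail(__FILE__, __LINE__, #cond))

/* Hypothetical application state with a simple invariant. */
struct node_table { int used; int capacity; };

static int table_ok(const struct node_table *t)
{
    return t->used >= 0 && t->used <= t->capacity;
}

static void handle_connect(struct node_table *t)
{
    /* ...first half of the suspect path would go here... */
    CHECKPOINT(table_ok(t));    /* midpoint: if this holds, look below */
    t->used++;                  /* ...second half: register the node... */
    CHECKPOINT(table_ok(t));    /* end of path */
}
```

After each load-test run, move the midpoint checkpoint into whichever half contained the first reported failure and repeat. Because the check is cheap and prints only once, it tends to disturb timing less than logging every step.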

answered Nov 05 '22 06:11

orip
