We are currently developing a cluster manager software in C. If several nodes connect to the manager, it works perfect, but if we use some tools to simulate 1000 nodes to connect the manager, it will sometimes work in unexpected ways.
How can one debug this kind of bug? It only appears when the load(connection/nodes) is large?
If I use gdb
to debug step by step, the app never malfunctions.
Debuggers allow programmers to create "breakpoints" in code. When they run code with a breakpoint in it, the code will stop running at the breakpoint. The developer can then step through the code line by line and examine the variables in each step to determine what went wrong.
There are two types of debugging techniques: reactive debugging and preemptive debugging. Most debugging is reactive — a defect is reported in the application or an error occurs, and the developer tries to find the root cause of the error to fix it.
How to debug this kind of bug?
In general, you want to use at least these techniques:
-Wall
is a good start, but -Wextra
is better.there's also no warning in valgrind check.
It's not clear whether you've simply ran the target application under Valgrind, or whether you also have the unit tests, and the tests are Valgrind-clean. It's also not clear whether you've observed the application mis-behavior under Valgrind or not.
Valgrind used to be the best tool available for heap and unintialized memory problems, but in 2017 this is no longer the case.
Compiler-based Address, Thread and Memory sanitizers catch significantly wider class of errors (e.g. global and stack overflows, and data races), and you should run your unit tests under all of them.
When all of the above still fails to find the problem, you may be able to run the real application instrumented with sanitizers.
Lastly, there are tools like GDB tracing and systemtap -- they are harder to learn, but give you significant power. Overview here.
Sadly the debugger is less useful for debugging concurrency/load issues.
Keep adding logs/printfs, trigger the issue with load testing, then try to narrow it down with more logs/printfs. Repeat.
The faster it is to trigger the bug the faster this will converge. Also prefer the classic "bisection" / "binary search" technique when adding logs - try to narrow down the areas you're looking at by at least half every time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With