Locating a segmentation fault in a multithreaded program running on a cluster

Tags: c++, multithreading, segmentation-fault, cluster-computing, hpc

It's quite straightforward to use gdb to locate a segmentation fault when running a simple program interactively. But consider a multithreaded program, written with pthreads, that is submitted to a cluster node with the qsub command, so there is no interactive session.

How can we locate the segmentation fault? I am looking for a general approach, a program, or a test tool. I cannot provide a reproducible example because the program is really big and crashes on the cluster only in some unknown situations.

I need to track the problem down in this difficult setting because the program runs correctly on my local machine with any number of threads.

asked Nov 14 '12 22:11

Ali

1 Answer

The "normal" approach is to have the environment produce a core file and to get hold of it. If this isn't an option, you might want to try installing a signal handler for SIGSEGV which, at the very least, dumps a stack trace somewhere. Of course, this immediately leads to the question "how to get a stack trace", but that is answered elsewhere.

The easiest approach is probably to get hold of a core file. Assuming you have a similar machine where the core file can be read, you can use gdb program corefile to debug the program program that produced the core file corefile. You should be able to look at the different threads (for example with info threads and thread apply all bt), their data (to some extent), etc. If you don't have a suitable machine, it may be necessary to cross-compile a gdb matching the hardware of the machine where the program was run.

I'm a bit confused about the statement that the core files are empty: you can set the limit for core files using ulimit in the shell (ulimit -c). If the size for cores is set to zero, no core file should be produced at all; producing an empty one seems odd. However, if you cannot change the limits for your program, you are probably down to installing a signal handler and dumping a stack trace from the offending thread.
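If the job script cannot call ulimit itself, the program can try to raise its own soft core limit at startup. The following is only a sketch using the POSIX getrlimit/setrlimit calls; the helper name enable_core_dumps is invented here, and it only helps under the assumption that the batch system has left the hard limit above zero.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_CORE

// Hypothetical helper: raise the soft core-file limit to the hard limit.
// Returns true if core files are allowed (limit non-zero) afterwards.
bool enable_core_dumps() {
    struct rlimit lim {};
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;
    }
    // An unprivileged process may raise its soft limit up to the hard limit.
    lim.rlim_cur = lim.rlim_max;
    if (setrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;
    }
    return lim.rlim_cur != 0;
}
```

Calling this at the top of main() and logging the result tells you whether a core file can be expected from that node at all. Where the file ends up still depends on the node's configuration.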

Thinking about it, you may be able to put the program to sleep in the signal handler and attach to it using a debugger, assuming you can run a debugger on the corresponding machine. You would determine the process ID (using, e.g., ps -elf | grep program) and then attach to it using

gdb program pid

I'm not sure how to put a program to sleep from within the program, though (possibly by having the SIGSEGV handler stop the process with SIGSTOP...).

That said, I assume you tried running your program on your local machine...? Some problems are fundamental enough to show up without a distributed system of many threads running on each node. This is, obviously, not a replacement for the approach above.

answered Oct 12 '22 19:10

Dietmar Kühl
