Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Locating segmentation fault for multithread program running on cluster

It's quite straightforward to use gdb in order to locate a segmentation fault while running a simple program in interactive mode. But consider we have a multithread program - written by pthread - submitted to a cluster node (by qsub command). So we don't have an interactive operation.

How we can locate the segmentation fault? I am looking for a general approach, a program or test tool. I can not provide a reproducible example as the program is really big and crashes on the cluster in some unknown situations.

I need to find a problem in such hard situation because the program runs correctly on the local machine with any number of threads.

like image 392
Ali Avatar asked Nov 14 '12 22:11

Ali


People also ask

How do you determine where a segmentation fault is coming from?

Use debuggers to diagnose segfaults Start your debugger with the command gdb core , and then use the backtrace command to see where the program was when it crashed. This simple trick will allow you to focus on that part of the code.

How do you handle a segmentation fault?

Troubleshooting the problem: Check EVERY place in your program that uses pointers, subscripts an array, or uses the address operator (&) and the dereferencing operator (*). Each is a candidate for being the cause of a segmentation violation. Make sure that you understand the use of pointers and the related operators.

What causes SIGSEGV?

SIGSEGV and SIGABRT are two Unix signals that can cause a process to terminate. SIGSEGV is triggered by the operating system, which detects that a process is carrying out a memory violation, and may terminate it as a result. SIGABRT (signal abort) is a signal triggered by a process itself.


1 Answers

The "normal" approach is to have the environment produce a core file and get hold of those. If this isn't an option, you might want to try installing a signal handler for SIGSEGV which obtains, at least, a stack trace dumped somewhere. Of course, this immediately leads to the question "how to get a stack trace" but this is answered elsewhere.

The easiest approach is probably to get hold of a core file. Assuming you have a similar machine where the core file can be read, you can use gdb program corefile to debug the program program which produced the core file corefile: You should be able to look at the different threads, their data (to some extend), etc. If you don't have a suitable machine it may be necessary to cross-compile gdb matching the hardware of the machine where it was run.

I'm a bit confused about the statement that the core files are empty: You can set the limits for core files using ulimit on the shell. If the size for cores is set to zero it shouldn't produce any core file. Producing an empty one seems odd. However, if you cannot change the limits on your program you are probably down to installing a signal handler and dumping out a stack trace from the offending thread.

Thinking of it, you may be able to put the program to sleep in the signal handler and attach to it using a debugger, assuming you can run a debugger on the corresponding machine. You would determine the process ID (using, e.g., ps -elf | grep program) and then attach to it using

gdb program pid

I'm not sure how to put a program to sleep from within the program, though (possibly installing the handler for SIGSTOP for SIGSEGV...).

That said, I assume you tried running your program on your local machine...? Some problems are more fundamental than needing a distributed system of many threads running on each node. This is, obviously, not a replacement for the approach above.

like image 152
Dietmar Kühl Avatar answered Oct 12 '22 19:10

Dietmar Kühl