
Program stalls during long runs

Fixed:

Well, this seems a bit silly. It turns out that top was not displaying correctly and the program was actually still running. Perhaps the CPU time became too large to display? Either way, the program seems to be working fine, so this whole question is moot.
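For anyone who suspects top in a similar situation, one way to cross-check is to read the CPU counters from /proc directly. The following is a minimal sketch, assuming a Linux system with /proc mounted; the function names are hypothetical and not part of the original setup.

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split on the last ')' before splitting on whitespace.
    _, _, rest = stat_line.rpartition(")")
    fields = rest.split()
    # 'rest' starts at field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def cpu_seconds(pid):
    """Total CPU time the process has used so far, in seconds (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        ticks = parse_cpu_ticks(f.read())
    return ticks / os.sysconf("SC_CLK_TCK")


def is_making_progress(pid, interval=5.0):
    """True if the process accumulated CPU time during 'interval' seconds."""
    before = cpu_seconds(pid)
    time.sleep(interval)
    return cpu_seconds(pid) > before
```

Calling is_making_progress(pid) with the PID of the simulation should return True as long as the process is still being scheduled, regardless of what top shows.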

Thanks (and sorry for the silly question).

Original Q:

I am running a simulation on a computer running Ubuntu Server 10.04.3. Short runs (<24 hours) run fine, but long runs eventually stall. By stall, I mean that the program no longer gets any CPU time, but it still holds all of its data in memory. To run these simulations, I SSH in, start the program with nohup, and redirect any output to a file.

Miscellaneous information:

The system is definitely not running out of RAM. The program does not need to read or write to the hard drive until completion; the computation is done completely in memory. The program is not killed, as it still has a PID after it stalls. I am using OpenMP, but have increased the max number of processes, and the max time is unlimited. I am finding the largest eigenvalues of a matrix using the ARPACK Fortran library.

Any thoughts on what is causing this behavior or how to resume my currently stalled program?

Thanks

user779810 asked Oct 16 '11 16:10


1 Answer

I assume this is an OpenMP program from your tags, though you never actually state this. Is ARPACK thread-safe?

It sounds like you are hitting a deadlock (more common in MPI programs than OpenMP, but it's definitely possible). The first thing to do is to compile with debugging flags on; then, the next time this problem occurs, attach a debugger to the running process and find out what the various threads are doing. For gdb, for instance, some instructions for switching between threads are shown here.
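As a rough illustration of the attach step, here is a minimal sketch that asks gdb for a backtrace of every thread in a running process. It assumes gdb is installed and that you are allowed to attach to the process (typically as its owner or as root); the helper names are hypothetical.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to 'pid', prints a
    backtrace for every thread, and then exits."""
    # -p attaches to a running process, -batch exits after the commands
    # have run, and -ex supplies the gdb command to execute.
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_thread_backtraces(pid, timeout=60):
    """Run gdb against a live process and return its combined output."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr
```

If every thread turns out to be waiting at a lock or barrier, that would be consistent with a deadlock; threads that are still inside numerical routines suggest the program is simply still computing. When gdb exits it should detach, leaving the process running.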

Jonathan Dursi answered Oct 12 '22 06:10