These are some general questions I have while designing the error handling for an algorithm that is supposed to run in parallel using MPI (in C++):
In an ideal world, you can use them to do what you ask. By "ideal world" I mean one where you have your choice of MPI implementation and are able to administer it yourself (rather than having to convince the cluster owner to reconfigure it for you). The minimal configuration for exceptions will include the --with-exceptions flag, and possibly a few more.
I've used LAM most often, and by default exceptions are disabled. I believe this is the default for other implementations as well.
They work in the same vein as 'vanilla' C++ exceptions, and they do work inside code executed in parallel.
At some point in your startup code, you want to enable them:
MPI::COMM_WORLD.Set_errhandler ( MPI::ERRORS_THROW_EXCEPTIONS );
(if your library isn't configured to allow exceptions, this is probably a bad idea -- behaviour "undefined" according to LAM)
And then:
try {
    /* something that can fail */
}
catch ( MPI::Exception &e ) {
    std::cout << "Oops: " << e.Get_error_string()
              << " (code " << e.Get_error_code() << ")" << std::endl;
    MPI::COMM_WORLD.Abort ( -1 );
}
As for whether it is good or bad practice, I can't really say. I haven't seen extensive use of exceptions in code written by hardened MPI hackers, but that may be because, in my experience, the code is generally more C than C++.
A middle ground between error codes and exceptions may be error handlers: in a nutshell, you can assign functions that will be called when a particular error (designated by its code) occurs. This might be an option if you can't get your administrator on board with enabling exceptions; see the sketch below.
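As an illustration, here is a minimal sketch of registering one via the standard C API (MPI_Comm_create_errhandler and MPI_Comm_set_errhandler); the handler name and what it does inside are just my example, not a prescribed pattern:

#include <mpi.h>
#include <cstdio>

/* Called by the MPI library whenever an error occurs on the communicator. */
static void my_errhandler ( MPI_Comm *comm, int *errcode, ... )
{
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string ( *errcode, msg, &len );
    std::fprintf ( stderr, "MPI error %d: %s\n", *errcode, msg );
    MPI_Abort ( *comm, *errcode );   /* or attempt recovery instead */
}

/* During startup, after MPI_Init: */
MPI_Errhandler handler;
MPI_Comm_create_errhandler ( my_errhandler, &handler );
MPI_Comm_set_errhandler ( MPI_COMM_WORLD, handler );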
Exceptions work the same in MPI code as in serial code, but you have to be extremely careful: if it is possible for an exception to be raised on some processes in a communicator but not on others, you can easily end up with deadlock.
MPI_Barrier(comm);   /* or any synchronous/collective call */
if (rank == 0)
    throw std::runtime_error("early exit on rank 0");
MPI_Barrier(comm);   /* ranks > 0 deadlock here: rank 0 never arrives */
All error handling methods have this problem: it is difficult to recover from errors that do not occur consistently across a communicator. In the case above, you could perform an MPI_Allreduce so that all ranks choose the same branch.
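A minimal sketch of that agreement step, assuming the fallible local operation reports failure through a nonzero int (the helper name any_rank_failed is mine):

#include <mpi.h>

/* Turn a local failure into a communicator-wide decision so that
   every rank takes the same branch afterwards. */
bool any_rank_failed ( int local_err, MPI_Comm comm )
{
    int global_err = 0;
    /* MPI_MAX: the result is nonzero if any rank reported an error */
    MPI_Allreduce ( &local_err, &global_err, 1, MPI_INT, MPI_MAX, comm );
    return global_err != 0;
}

Every rank then tests the same global value, so collective cleanup (or a coordinated MPI_Abort) is safe and nobody is left waiting in a barrier.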
My preference is for calling error handlers and propagating errors up the stack, since this tends to give the most useful/verbose error message, and it's easy to catch with a breakpoint (or the error handler can attach a debugger to itself and send it to your workstation in an xterm).
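As a sketch of that last trick (assuming a POSIX system with xterm and gdb on the PATH, and DISPLAY set so the window lands on your workstation; this is a common debugging hack, not part of MPI itself), the handler can do something like:

#include <unistd.h>
#include <cstdio>
#include <cstdlib>

/* Pop up an xterm running gdb attached to this rank's process. */
static void attach_debugger ( void )
{
    char cmd[128];
    std::snprintf ( cmd, sizeof cmd, "xterm -e gdb -p %d &", (int) getpid() );
    std::system ( cmd );
    sleep ( 60 );   /* give gdb time to attach before continuing */
}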