 

What is the best way to avoid overloading a parallel file-system when running embarrassingly parallel jobs?

We have a problem which is embarrassingly parallel - we run a large number of instances of a single program with a different data set for each; we do this simply by submitting the application many times to the batch queue with different parameters each time.

However with a large number of jobs, not all of them complete. It does not appear to be a problem in the queue - all of the jobs are started.

The issue appears to be that with a large number of instances of the application running, lots of jobs finish at roughly the same time and thus all try to write out their data to the parallel file-system at pretty much the same time.

The issue then seems to be that either the program is unable to write to the file-system and crashes in some manner, or just sits there waiting to write until the batch-queue system kills the job for waiting too long. (From what I have gathered, most, if not all, of the jobs that fail to complete do not leave core files.)

What is the best way to schedule disk-writes to avoid this problem? I mention that our program is embarrassingly parallel to highlight the fact that each process is not aware of the others - they cannot talk to each other to schedule their writes in some manner.

Although I have the source code for the program, we'd like to solve the problem without modifying it if possible, as we don't maintain or develop it (plus most of the comments are in Italian).

I have had some thoughts on the matter:

  1. Each job writes to the local (scratch) disk of the node at first. We can then run another job which periodically checks which jobs have completed and moves the files from the local disks to the parallel file-system.
  2. Use an MPI wrapper around the program in a master/slave system, where the master manages a queue of jobs and farms these out to each slave; the slave wrapper runs the application and catches the exception (could I do this reliably for a file-system timeout in C++, or possibly Java?), then sends a message back to the master to re-run the job. (Minimal sketches of both ideas follow this list.)
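A minimal sketch of idea 1, assuming an SGE-style setup where `$TMPDIR` points at node-local scratch; the solver path and directory layout are hypothetical:

```python
#!/usr/bin/env python
# Hypothetical wrapper for idea 1: run the unmodified program against
# node-local scratch, then copy the results to the parallel file-system
# in one go, retrying if the file-system is momentarily overloaded.
# Assumes SGE exports TMPDIR as node-local scratch.
import os
import shutil
import subprocess
import sys
import time

SOLVER = "/opt/apps/solver"   # hypothetical path to the unmodified binary
PFS_DIR = sys.argv[1]         # destination on the parallel file-system
DATASET = sys.argv[2]         # input data set for this job instance

scratch = os.path.join(os.environ["TMPDIR"],
                       "job-" + os.environ.get("JOB_ID", "local"))
os.makedirs(scratch)

# Run the program with its working directory on local disk.
subprocess.check_call([SOLVER, DATASET], cwd=scratch)

# Copy back with a few retries, so a busy file-system delays this job
# instead of killing it.
dest = os.path.join(PFS_DIR, os.path.basename(scratch))
for attempt in range(5):
    try:
        shutil.rmtree(dest, ignore_errors=True)   # clear any partial copy
        shutil.copytree(scratch, dest)
        break
    except (OSError, IOError):
        time.sleep(60 * (attempt + 1))            # back off, then retry
else:
    sys.exit("giving up: could not copy results to " + dest)
```

And a minimal sketch of idea 2 using mpi4py (an assumption; the same structure works in C++ or Java). Rather than catching a file-system-timeout exception, it just checks the exit status and re-queues the data set on any failure:

```python
# Hypothetical master/slave wrapper: rank 0 farms data sets out to the
# other ranks; a slave that reports failure gets its data set put back
# on the queue for another attempt. Assumes at least as many data sets
# as slaves.
from mpi4py import MPI
import subprocess
import sys

comm = MPI.COMM_WORLD
TAG_WORK, TAG_RESULT, TAG_STOP = 1, 2, 3
SOLVER = "/opt/apps/solver"              # hypothetical, as above

if comm.Get_rank() == 0:                 # master
    queue = list(sys.argv[1:])           # one entry per data set
    workers = comm.Get_size() - 1
    status = MPI.Status()
    for dest in range(1, comm.Get_size()):   # prime every slave
        comm.send(queue.pop(0), dest=dest, tag=TAG_WORK)
    while workers:
        dataset, ok = comm.recv(source=MPI.ANY_SOURCE, tag=TAG_RESULT,
                                status=status)
        if not ok:
            queue.append(dataset)        # failed: try it again later
        if queue:
            comm.send(queue.pop(0), dest=status.Get_source(), tag=TAG_WORK)
        else:
            comm.send(None, dest=status.Get_source(), tag=TAG_STOP)
            workers -= 1
else:                                    # slave
    status = MPI.Status()
    while True:
        dataset = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == TAG_STOP:
            break
        ok = subprocess.call([SOLVER, dataset]) == 0
        comm.send((dataset, ok), dest=0, tag=TAG_RESULT)
```

This would be launched with something like `mpirun -np 16 python wrapper.py dataset1 dataset2 ...` inside a single SGE parallel-environment job, so the batch queue sees one job rather than hundreds.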

In the meantime I need to pester my supervisors for more information on the error itself - I've never run into it personally, but I haven't had to use the program for a very large number of datasets (yet).

In case it's useful: we run Solaris on our HPC system with the SGE (Sun GridEngine) batch queue system. The file-system is NFS4, and the storage servers also run Solaris. The HPC nodes and storage servers communicate over fibre channel links.

asked Apr 11 '11 by Joseph Earl


2 Answers

Most parallel file systems, particularly those at supercomputing centres, are targeted at HPC applications rather than serial-farm type stuff. As a result, they're painstakingly optimized for bandwidth, not for IOPs (I/O operations per sec) - that is, they are aimed at big (1000+ process) jobs writing a handful of mammoth files, rather than zillions of little jobs outputting octillions of tiny little files. It is all too easy for users to take something that runs fine(ish) on their desktop, naively scale it up to hundreds of simultaneous jobs, and starve the system of IOPs, hanging their jobs and typically others on the same system.

The main thing you can do here is aggregate, aggregate, aggregate. It would be best if you could tell us where you're running so we can get more information on the system. But some tried-and-true strategies:

  1. If you are outputting many files per job, change your output strategy so that each job writes out one file which contains all the others. If you have local ramdisk, you can do something as simple as writing them to ramdisk and then tar-gzipping them out to the real filesystem.
  2. Write in binary, not in ASCII. Big data never goes in ASCII. Binary formats are ~10x faster to write, somewhat smaller, and you can write big chunks at a time rather than a few numbers in a loop, which leads to:
  3. Big writes are better than little writes. Every IO operation is something the file system has to do. Make few, big, writes rather than looping over tiny writes.
  4. Similarly, don't write in formats which require you to seek around to write in different parts of the file at different times. Seeks are slow and useless.
  5. If you're running many jobs on a node, you can use the same ramdisk trick as above (or local disk) to tar up all the jobs' outputs and send them all out to the parallel file system at once. (A small sketch follows this list.)
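To make points 2, 3, and 5 concrete, here is a small sketch (NumPy and a Linux-style `/dev/shm` ramdisk are assumptions; substitute whatever local scratch your nodes have):

```python
# Contrast one big binary write with many tiny formatted writes, then
# aggregate the output into a single archive before it leaves the node.
# NumPy and the /dev/shm ramdisk path are assumptions.
import tarfile
import numpy as np

data = np.random.rand(10**7)

# Slow: millions of tiny formatted writes, each an I/O operation.
with open("/dev/shm/results.txt", "w") as f:
    for x in data:
        f.write("%.17g\n" % x)

# Fast: one big binary write, roughly 10x quicker and smaller on disk.
data.tofile("/dev/shm/results.bin")

# Aggregate: bundle the outputs into one archive on ramdisk, then make
# a single copy of that archive out to the parallel file-system.
with tarfile.open("/dev/shm/job-output.tar.gz", "w:gz") as tar:
    tar.add("/dev/shm/results.bin", arcname="results.bin")
```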

The above suggestions will benefit the I/O performance of your code everywhere, not just on parallel file systems. I/O is slow everywhere, and the more you can do in memory and the fewer actual I/O operations you execute, the faster it will go. Some systems may be more sensitive than others, so you may not notice it so much on your laptop, but it will help.

Similarly, having fewer big files rather than many small files will speed up everything from directory listings to backups on your filesystem; it is good all around.

answered Nov 03 '22 by Jonathan Dursi


It is hard to decide if you don't know what exactly causes the crash. If you think it is an error related to file-system performance, you can try a distributed filesystem: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html

If you want to implement a master/slave system, maybe Hadoop can be the answer.

But first of all I would try to find out what causes the crash...

answered Nov 03 '22 by Binus