How to parallelise this python script using mpi4py?

I apologise if this has already been asked, but I've read a heap of documentation and am still not sure how to do what I would like to do.

I would like to run a Python script over multiple cores simultaneously.

I have 1800 .h5 files in a directory, with names 'snapshots_s1.h5', 'snapshots_s2.h5', etc., each about 30 MB in size. This Python script:

  1. Reads in the h5py files one at a time from the directory.
  2. Extracts and manipulates the data in the h5py file.
  3. Creates plots of the extracted data.

Once this is done, the script reads in the next h5py file from the directory and follows the same procedure. Hence, none of the processes needs to communicate with any other whilst doing this work.

The script is as follows:

import os

import h5py
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import cmocean

from mpi4py import MPI  # imported, but not yet used below

# de.logging_setup.rootlogger.setLevel('ERROR')  # leftover Dedalus call; 'de' is never imported

# Plot writes

count = 1
for filename in os.listdir('directory'):  # applied to ~1800 .h5 files
    with h5py.File('directory/{}'.format(filename), 'r') as file:

        # Manipulate 'filename' data (each file is ~30 MB).
        ...

        # Plot 'filename' data; the figures are written to disk here.
        ...

    count = count + 1

Ideally, I would like to use mpi4py to do this (for various reasons), though I am open to other options such as multiprocessing.Pool (which I couldn't actually get to work; I tried following the approach outlined here).

So, my question is: What commands do I need to put in the script to parallelise it using mpi4py? Or, if this option isn't possible, how else could I parallelise the script?
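
For reference, the usual mpi4py pattern for an embarrassingly parallel loop like this is a rank-based split of the file list: each MPI process handles every size-th file, so no communication is needed. A minimal sketch, assuming a launch like mpiexec -n 4 python script.py and keeping the 'directory' placeholder from the script above:

from mpi4py import MPI
import h5py
import os

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # index of this process: 0 .. size-1
size = comm.Get_size()   # total number of MPI processes

# Sort so every rank sees the same ordering, then take every size-th file.
filenames = sorted(os.listdir('directory'))
for filename in filenames[rank::size]:
    with h5py.File(os.path.join('directory', filename), 'r') as file:
        ...  # extract, manipulate, and plot, exactly as in the serial loop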

asked Oct 13 '17 by mattos


1 Answer

You should go with multiprocessing, and Javier's example should work, but I would like to break it down so you can understand the steps too.

In general, when working with pools you create a pool of processes that idle until you pass them some work. The ideal way to do it is to create a function that each process will execute separately.

def worker(fn):
    # Runs in a child process: open one file, process it, return the result.
    with h5py.File(fn, 'r') as f:
        # process data...
        return result

It's that simple: each process runs this function and returns its result to the parent process.

Now that you have a worker function that does the work, let's create the input data for it. It takes a filename, so we need a list of all the files:

full_fns = [os.path.join('directory', filename)
            for filename in os.listdir('directory')]

Next, initialize the process pool:

import multiprocessing as mp

pool = mp.Pool(4)  # pass the number of processes you want
results = pool.map(worker, full_fns)

# pool.map takes the worker function and the input data; it blocks until
# all the subprocesses have finished their work, so you never act on
# partial data.

pool.close()
pool.join()

Now you can access your data through results.

for r in results:
    print(r)
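
Putting the pieces together, a complete sketch might look like this, using the Pool as a context manager so it is cleaned up automatically (pool.map already blocks until every file is processed); worker and 'directory' are the placeholders from above:

import os
import multiprocessing as mp

import h5py

def worker(fn):
    # Open one file, process it, and return the result to the parent.
    with h5py.File(fn, 'r') as f:
        result = ...  # extract, manipulate, and plot as in the original script
    return result

if __name__ == '__main__':  # required when processes are spawned (e.g. on Windows)
    full_fns = [os.path.join('directory', fn)
                for fn in os.listdir('directory')]
    with mp.Pool(4) as pool:
        results = pool.map(worker, full_fns)
    for r in results:
        print(r)

If you do not care about the order of the results, pool.imap_unordered(worker, full_fns) yields each result as soon as its file is done, which can be handy with 1800 files.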

Let me know in the comments how this worked out for you.

answered Sep 30 '22 by Chen A.