 

parallelize 'for' loop in Python 3

I am trying to do some analysis of MODIS satellite data. My code reads a large number of files (806), each of dimension 1200 by 1200 (806 × 1200 × 1200 values in total). It does this with a for loop and performs mathematical operations on the data.

Following is the general way in which I read files.

import numpy as np
import xarray as xray

mindex = np.zeros((1200, 1200))
for i in range(1200):
    var1 = xray.open_dataset('filename.nc')['variable'][:, i, :].data
    for j in range(1200):
        var2 = var1[:, j]
        ## Mathematical calculations to find var3[i, j] ##
        mindex[i, j] = var3[i, j]

Since there is a lot of data to handle, the process is very slow, and I was considering parallelizing it. I tried doing something with joblib, but I have not been able to make it work.
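For reference, the general pattern of parallelizing a per-file loop can be sketched with the standard library's concurrent.futures (the `process_file` function and the file list here are hypothetical stand-ins for the real per-file computation):

```python
from concurrent.futures import ProcessPoolExecutor

def process_file(path):
    # Stand-in for the real per-file computation; returns
    # (path, result) so results can be matched back to files.
    return (path, len(path))

if __name__ == '__main__':
    paths = ['a.nc', 'b.nc', 'c.nc']  # hypothetical file list
    # Each file is processed in a separate worker process;
    # map() preserves the input order of the paths.
    with ProcessPoolExecutor(max_workers=2) as ex:
        results = list(ex.map(process_file, paths))
```

The same shape works with joblib's `Parallel`/`delayed` or `multiprocessing.Pool`; the key point is that the unit of work is one whole file, not one row.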

I am unsure how to tackle this problem.

asked Jul 13 '18 by Nirav L Lekinwala



1 Answer

My guess is that you want to work on several files at the same time. The best way to do that (in my opinion) is to use multiprocessing. To use it, you need to define an elementary per-file step, which is already present in your code.

import os
import multiprocessing as mp

import numpy as np
import xarray as xray

def f(path):
    mindex = np.zeros((1200, 1200))
    for i in range(1200):
        var1 = xray.open_dataset(path)['variable'][:, i, :].data
        for j in range(1200):
            var2 = var1[:, j]
            ## Mathematical calculations to find var3[i, j] ##
            mindex[i, j] = var3[i, j]
    return (path, mindex)


if __name__ == '__main__':
    N = mp.cpu_count()

    folder = '.'  # directory containing the .nc files
    files = [entry.path for entry in os.scandir(folder)
             if entry.name.endswith('.nc')]

    with mp.Pool(processes=N) as p:
        results = p.map(f, files)

This returns a list, results, in which each element is a tuple containing the file path and the corresponding mindex matrix. With this, you can work on multiple files at the same time. It is particularly efficient when the computation on each file is long.
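Since p.map preserves the input order, the returned list can then be turned into a lookup table keyed by file name, or stacked into a single array for further analysis. A small sketch with dummy 2 × 2 matrices standing in for the 1200 × 1200 results:

```python
import numpy as np

# Hypothetical results as returned by p.map: (filename, mindex) tuples.
results = [('a.nc', np.zeros((2, 2))), ('b.nc', np.ones((2, 2)))]

# Index the per-file matrices by name...
by_file = dict(results)

# ...or stack them into one array of shape (n_files, rows, cols).
stacked = np.stack([m for _, m in results])
```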

answered Oct 10 '22 by Mathieu