
Python - How to parallel consume and operate on files in a directory

Current scenario: I have 900 files in a directory called directoryA. The files are named file0.txt through file899.txt, each 15MB in size. I loop through the files sequentially in Python. I load each file as a list, do some operations, and write an output file to directoryB. When the loop ends I have 900 files in directoryB, named out0.csv through out899.csv.

Problem: The processing of each file takes 3 minutes, making the script run for more than 40 hours. I would like to run the process in a parallel manner as all the files are independent of each other (do not have any inter-dependencies). I have 12 cores in my machine.

The script below runs sequentially. Please help me run it in parallel. I have looked at some of the parallel processing modules in Python via related Stack Overflow questions, but they are difficult for me to understand as I don't have much exposure to Python. Thanks a billion.

Pseudo Script

    from os import listdir 
    import csv

    mypath = "some/path/"

    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    for filename in listdir(inputDir):
        # load the text file as a list using the csv module
        # run a bunch of operations
        # regex the int from the filename, e.g. file1.txt returns 1, file42.txt returns 42
        # write a corresponding csv file to dirB, e.g. input file99.txt is written as out99.csv
asked Aug 06 '15 by user5199564




1 Answer

To fully utilize your hardware cores, it's better to use the multiprocessing library.

from multiprocessing import Pool

from os import listdir 
import csv

def process_file(file):
    # load the text file as a list using the csv module
    # run a bunch of operations
    # regex the int from the filename, e.g. file1.txt returns 1, file42.txt returns 42
    # write a corresponding csv file to dirB, e.g. input file99.txt is written as out99.csv
    pass

if __name__ == '__main__':
    mypath = "some/path/"

    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    p = Pool(12)
    p.map(process_file, listdir(inputDir))
    p.close()
    p.join()
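The placeholder body of process_file could be fleshed out roughly like this. This is only a sketch: the directory paths are taken from the question, the "operations" step is left as a pass-through for you to replace with your real logic, and it assumes Python 3 (drop the newline='' arguments on Python 2):

```python
import csv
import os
import re

def process_file(filename, input_dir='some/path/dirA/', output_dir='some/path/dirB/'):
    # Extract the integer from the filename, e.g. 'file42.txt' -> 42
    index = int(re.search(r'\d+', filename).group())

    # Load the text file as a list of rows using the csv module
    with open(os.path.join(input_dir, filename), newline='') as f:
        rows = list(csv.reader(f))

    # ... run a bunch of operations on rows here ...

    # Write the corresponding csv file, e.g. file99.txt -> out99.csv
    with open(os.path.join(output_dir, 'out%d.csv' % index), 'w', newline='') as f:
        csv.writer(f).writerows(rows)
```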

multiprocessing documentation: https://docs.python.org/2/library/multiprocessing.html
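As a quick illustration of how Pool.map fans work out across processes (square here is just a stand-in for the per-file worker; this assumes Python 3.3+, where Pool can be used as a context manager):

```python
from multiprocessing import Pool, cpu_count

def square(x):
    # Stand-in for the per-file worker. It must be defined at module
    # level so child processes can find it by name.
    return x * x

if __name__ == '__main__':
    # Pool() with no argument also defaults to cpu_count() workers.
    with Pool(cpu_count()) as p:
        print(p.map(square, range(5)))  # prints [0, 1, 4, 9, 16]
```

Pool.map returns results in input order. If per-file runtimes vary a lot, imap_unordered with a chunksize argument can keep all cores busy toward the end of the run.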

answered Sep 23 '22 by ruijin
