Current scenario: I have 900 files in a directory called directoryA. The files are named file0.txt through file899.txt, each 15MB in size. I loop through the files sequentially in Python. I load each file as a list, do some operations, and write out an output file in directoryB. When the loop ends I have 900 files in directoryB, named out0.csv through out899.csv.
Problem: Processing each file takes 3 minutes, so the script runs for more than 40 hours. I would like to run the processing in parallel, since the files are independent of each other. I have 12 cores in my machine.
The script below runs sequentially. Please help me run it in parallel. I have looked at some of Python's parallel processing modules via related Stack Overflow questions, but they are difficult for me to understand as I don't have much exposure to Python. Thanks a billion.
Pseudo Script
    from os import listdir 
    import csv
    mypath = "some/path/"
    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'
    for files in listdir(inputDir):
        #load the text file as list using csv module 
        #run a bunch of operations
        #regex the int from the filename. for ex file1.txt returns 1, and file42.txt returns 42
        #write out a corresponding csv file in dirB. For example input file file99.txt is written as out99.csv
The Pool class can be used for parallel execution of a function across different input data. The multiprocessing.Pool() class spawns a set of processes called workers, and tasks can be submitted to them using the methods apply/apply_async and map/map_async.
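To make the difference between the two families of methods concrete, here is a minimal sketch. The function `square` and the two wrapper functions are hypothetical stand-ins for real work, not part of the question's script.

```python
from multiprocessing import Pool


def square(n):
    # Hypothetical stand-in for the real per-item work.
    return n * n


def squares_with_map(numbers, workers=2):
    # map blocks until every result is ready and returns them in input order.
    with Pool(workers) as pool:
        return pool.map(square, numbers)


def squares_with_apply_async(numbers, workers=2):
    # apply_async submits one call at a time and returns immediately;
    # get() then waits for that particular result.
    with Pool(workers) as pool:
        pending = [pool.apply_async(square, (n,)) for n in numbers]
        return [result.get() for result in pending]


if __name__ == "__main__":
    print(squares_with_map([1, 2, 3]))
    print(squares_with_apply_async([1, 2, 3]))
```

For a batch of independent inputs such as the 900 files here, map is usually the simpler fit; apply_async is more useful when tasks are submitted one by one or take different arguments.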
The best and most reliable way to open a file that's in the same directory as the currently running Python script is to use sys.path[0].
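A small sketch of that idea follows. The helper name `path_next_to_script` is made up for illustration, and the fallback to the current working directory is an assumption added here because sys.path[0] is an empty string when Python runs interactively or reads the script from standard input.

```python
import os
import sys


def path_next_to_script(filename):
    # sys.path[0] is normally the directory of the script that was launched.
    # It can be '' (interactive session, script read from stdin), so fall
    # back to the current working directory in that case.
    base = sys.path[0] or os.getcwd()
    return os.path.join(os.path.abspath(base), filename)


if __name__ == "__main__":
    print(path_next_to_script("file0.txt"))
```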
To fully utilize your CPU cores, it's better to use the multiprocessing library.
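    import csv
    import re
    from multiprocessing import Pool
    from os import listdir

    # Defined at module level so the worker processes can see them as well
    mypath = "some/path/"
    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    def process_file(file):
        # listdir returns bare names, so prepend the directory when opening
        with open(inputDir + file, newline='') as f:
            rows = list(csv.reader(f))  # load the text file as a list using the csv module
        # run a bunch of operations on rows
        # regex the int from the filename, e.g. file1.txt gives 1 and file42.txt gives 42
        num = re.search(r'\d+', file).group()
        # write out the corresponding csv file in dirB, e.g. file99.txt is written as out99.csv
        with open(outputDir + 'out' + num + '.csv', 'w', newline='') as f:
            csv.writer(f).writerows(rows)

    if __name__ == '__main__':
        with Pool(12) as p:  # one worker process per core
            p.map(process_file, listdir(inputDir))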
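
One possible variation on the script above, offered as a sketch rather than a definitive version: it passes the directories to the worker explicitly and reports progress as files finish, which helps on a job that runs for hours. The helper names `output_name` and `run_parallel`, the strict `fileN.txt` pattern, and the pass-through of rows (in place of the real operations) are all assumptions made for illustration.

```python
import csv
import os
import re
from functools import partial
from multiprocessing import Pool

INPUT_DIR = "some/path/dirA/"   # placeholder paths taken from the question
OUTPUT_DIR = "some/path/dirB/"


def output_name(filename):
    # Map 'file42.txt' to 'out42.csv' by pulling the integer out of the name.
    match = re.fullmatch(r"file(\d+)\.txt", filename)
    if match is None:
        raise ValueError("unexpected filename: " + filename)
    return "out" + match.group(1) + ".csv"


def process_file(filename, input_dir, output_dir):
    with open(os.path.join(input_dir, filename), newline="") as src:
        rows = list(csv.reader(src))
    # The real operations from the question would go here (assumed no-op).
    out_path = os.path.join(output_dir, output_name(filename))
    with open(out_path, "w", newline="") as dst:
        csv.writer(dst).writerows(rows)
    return out_path


def run_parallel(input_dir, output_dir, workers=12):
    os.makedirs(output_dir, exist_ok=True)
    names = sorted(os.listdir(input_dir))
    job = partial(process_file, input_dir=input_dir, output_dir=output_dir)
    with Pool(workers) as pool:
        # imap_unordered yields each result as soon as its worker finishes.
        for done, out_path in enumerate(pool.imap_unordered(job, names), start=1):
            print(done, "of", len(names), "finished:", out_path)


if __name__ == "__main__":
    if os.path.isdir(INPUT_DIR):
        run_parallel(INPUT_DIR, OUTPUT_DIR)
    else:
        print("input directory not found:", INPUT_DIR)
```

Because each output name is derived from its own input name, the order in which workers finish does not matter, which is why imap_unordered is safe here.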
Documentation for multiprocessing: https://docs.python.org/2/library/multiprocessing.html
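
As a rough sanity check on what 12 workers could buy, the numbers from the question (900 files, 3 minutes each) can be plugged into a back-of-the-envelope estimate. This assumes the work is CPU-bound, each worker gets a full core, and there is no startup or disk contention, so it is a best case rather than a prediction. The function name `estimated_hours` is made up for this sketch.

```python
import math


def estimated_hours(n_files, minutes_per_file, workers):
    # Files are processed in "rounds" of up to `workers` files at a time.
    rounds = math.ceil(n_files / workers)
    return rounds * minutes_per_file / 60


print(estimated_hours(900, 3, 1))    # sequential: 45.0 hours
print(estimated_hours(900, 3, 12))   # 12 workers, ideal case: 3.75 hours
```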