
Python - How to parallel consume and operate on files in a directory

Current scenario: I have 900 files in a directory called directoryA. The files are named file0.txt through file899.txt, each 15MB in size. I loop through the files sequentially in Python. I load each file as a list, do some operations, and write an output file to directoryB. When the loop ends I have 900 files in directoryB, named out0.csv through out899.csv.

Problem: The processing of each file takes 3 minutes, making the script run for more than 40 hours. I would like to run the process in a parallel manner as all the files are independent of each other (do not have any inter-dependencies). I have 12 cores in my machine.

The script below runs sequentially. Please help me run it in parallel. I have looked at some of the parallel processing modules in Python via related Stack Overflow questions, but they are difficult for me to understand as I don't have much exposure to Python. Thanks a billion.

Pseudo Script

    from os import listdir 
    import csv

    mypath = "some/path/"

    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    for filename in listdir(inputDir):
        # load the text file as a list using the csv module
        # run a bunch of operations
        # regex the int from the filename, e.g. file1.txt returns 1, file42.txt returns 42
        # write a corresponding csv file to dirB, e.g. input file99.txt is written as out99.csv
asked Aug 06 '15 by user5199564




1 Answer

To fully utilize your hardware cores, it's better to use the multiprocessing library.

from multiprocessing import Pool

from os import listdir 
import csv

def process_file(file):
    # load the text file as a list using the csv module
    # run a bunch of operations
    # regex the int from the filename, e.g. file1.txt returns 1, file42.txt returns 42
    # write a corresponding csv file to dirB, e.g. input file99.txt is written as out99.csv
    pass

if __name__ == '__main__':
    mypath = "some/path/"

    inputDir = mypath + 'dirA/'
    outputDir = mypath + 'dirB/'

    p = Pool(12)
    p.map(process_file, listdir(inputDir))
    p.close()
    p.join()
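The placeholder body of process_file could be fleshed out roughly like this. This is only a sketch: the directory paths are taken from the question, the "operations" step is left as a pass-through for you to replace with your real logic, and it assumes Python 3 (drop the newline='' arguments on Python 2):

```python
import csv
import os
import re

def process_file(filename, input_dir='some/path/dirA/', output_dir='some/path/dirB/'):
    # Extract the integer from the filename, e.g. 'file42.txt' -> 42
    index = int(re.search(r'\d+', filename).group())

    # Load the text file as a list of rows using the csv module
    with open(os.path.join(input_dir, filename), newline='') as f:
        rows = list(csv.reader(f))

    # ... run a bunch of operations on rows here ...

    # Write the corresponding csv file, e.g. file99.txt -> out99.csv
    with open(os.path.join(output_dir, 'out%d.csv' % index), 'w', newline='') as f:
        csv.writer(f).writerows(rows)
```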

multiprocessing documentation: https://docs.python.org/2/library/multiprocessing.html
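As a quick illustration of how Pool.map fans work out across processes (square here is just a stand-in for the per-file worker; this assumes Python 3.3+, where Pool can be used as a context manager):

```python
from multiprocessing import Pool, cpu_count

def square(x):
    # Stand-in for the per-file worker. It must be defined at module
    # level so child processes can find it by name.
    return x * x

if __name__ == '__main__':
    # Pool() with no argument also defaults to cpu_count() workers.
    with Pool(cpu_count()) as p:
        print(p.map(square, range(5)))  # prints [0, 1, 4, 9, 16]
```

Pool.map returns results in input order. If per-file runtimes vary a lot, imap_unordered with a chunksize argument can keep all cores busy toward the end of the run.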

answered Sep 23 '22 by ruijin
