How do I run os.walk in parallel in Python?


I wrote a simple app in Java that takes a list of paths and generates a file listing every file path under them.

If I have paths.txt that has:

c:\folder1\
c:\folder2\
...
...
c:\folder1000\

My app runs the recursive function on each path multithreaded, and returns a file with all the file paths under these folders.

Now I want to write this app in Python.

I've written a simple app that uses os.walk() to run through a given folder and print the filepaths to output.
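A script of that kind might look like the sketch below. The function name `list_files` is only illustrative and is not the original code; it assumes the goal is the full path of every file under one root folder.

```python
import os


def list_files(root):
    """Yield the full path of every file under root, recursively."""
    # os.walk visits root and each subdirectory, giving the directory path,
    # its subdirectory names, and its file names.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)
```

Looping over `list_files(some_folder)` and printing each item gives one file path per line.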

Now I want to run it in parallel, and I've seen that Python has some modules for this: threading and multiprocessing.

What is the best way to do this? And with that approach, how is it done?

1 Answer

Here is a multiprocessing solution:

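import os
from multiprocessing import JoinableQueue, Pool


def explore_path(path):
    """Write the file names in one directory to a text file; return its subdirectories."""
    directories = []
    nondirectories = []
    for filename in os.listdir(path):
        fullname = os.path.join(path, filename)
        if os.path.isdir(fullname):
            directories.append(fullname)
        else:
            nondirectories.append(filename)
    # One output file per directory, named after the directory's path.
    # The drive colon is replaced too, since it is not valid in Windows file names.
    outputfile = path.replace(os.sep, '_').replace(':', '_') + '.txt'
    with open(outputfile, 'w') as f:
        for filename in nondirectories:
            print(filename, file=f)
    return directories


def init_worker(queue):
    # Hand each worker process the shared queue. A plain module-level global
    # is not inherited when processes are spawned (the default on Windows).
    global unsearched
    unsearched = queue


def parallel_worker():
    while True:
        path = unsearched.get()
        try:
            for newdir in explore_path(path):
                unsearched.put(newdir)
        except OSError:
            pass  # skip directories that cannot be read
        finally:
            # Always mark the item done, otherwise unsearched.join() never returns.
            unsearched.task_done()


if __name__ == '__main__':
    # acquire the list of paths
    with open('paths.txt') as f:
        paths = f.read().split()

    unsearched = JoinableQueue()
    for path in paths:
        unsearched.put(path)

    with Pool(5, initializer=init_worker, initargs=(unsearched,)) as pool:
        for i in range(5):
            pool.apply_async(parallel_worker)
        # Wait inside the with block: leaving it terminates the workers.
        unsearched.join()
    print('Done')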
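Since the question also asks about threads: listing directories is mostly waiting on the file system, so a thread pool may be enough and is simpler to set up. The following is a minimal sketch, not a benchmarked solution. It assumes each top-level path can be walked independently, so it only parallelizes across the starting folders, not inside a single deep folder. `walk_root` and `walk_all` are illustrative names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def walk_root(root):
    """Return the full paths of all files under one root folder."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found


def walk_all(roots, max_workers=5):
    """Walk each root in a worker thread and return one combined list of file paths."""
    all_files = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as roots
        for files in executor.map(walk_root, roots):
            all_files.extend(files)
    return all_files
```

With the list read from paths.txt, `walk_all(paths)` returns every file path, which can then be written to a single output file.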
