How do I run os.walk in parallel in Python?

I wrote a simple app in Java that takes a list of paths and generates a file with all the file paths under that original list.

If I have paths.txt that has:

c:\folder1\
c:\folder2\
...
...
c:\folder1000\

My app runs a recursive function on each path in multiple threads, and returns a file with all the file paths under these folders.

Now I want to write this app in Python.

I've written a simple app that uses os.walk() to run through a given folder and print the filepaths to output.
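For reference, the sequential version described above might look like the following (a minimal sketch; `list_files` is an illustrative name, not from the question):

```python
import os

def list_files(root):
    """Yield the full path of every file under root, top-down."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

if __name__ == "__main__":
    for path in list_files("."):
        print(path)
```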

Now I want to run it in parallel, and I've seen that Python has some modules for this: threading and multiprocessing.

What is the best way to do this, and how is it done?

Asked Aug 12 '12 by user1251654



1 Answer

Here is a multiprocessing solution:

from multiprocessing.pool import Pool
from multiprocessing import JoinableQueue as Queue
import os

def explore_path(path):
    directories = []
    nondirectories = []
    for filename in os.listdir(path):
        fullname = os.path.join(path, filename)
        if os.path.isdir(fullname):
            directories.append(fullname)
        else:
            nondirectories.append(filename)
    # write this directory's files to its own output file
    outputfile = path.replace(os.sep, '_') + '.txt'
    with open(outputfile, 'w') as f:
        for filename in nondirectories:
            print(filename, file=f)
    return directories

def parallel_worker():
    while True:
        path = unsearched.get()
        dirs = explore_path(path)
        for newdir in dirs:
            unsearched.put(newdir)
        unsearched.task_done()

# acquire the list of paths
with open('paths.txt') as f:
    paths = f.read().split()

unsearched = Queue()
for path in paths:
    unsearched.put(path)

with Pool(5) as pool:
    for i in range(5):
        pool.apply_async(parallel_worker)

unsearched.join()
print('Done')
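Since walking a directory tree is largely I/O-bound, a thread-based variant is also worth considering, even with the GIL. Here is a sketch using `concurrent.futures`, under the assumption that one worker per root directory is sufficient (`walk_one` and `walk_many` are illustrative names, not from the answer):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def walk_one(root):
    """Collect every file path under a single root directory."""
    found = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found

def walk_many(roots, max_workers=5):
    """Walk each root in its own thread and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(walk_one, roots)
    return [path for paths in results for path in paths]
```

Unlike the multiprocessing version, this shares memory freely between workers, so there is no need for an inter-process queue; the trade-off is that any CPU-bound filtering you add later will not run in parallel.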
Answered Dec 26 '22 by Raymond Hettinger