I wrote a simple app in Java that takes a list of paths and generates a file containing all the file paths found under those folders.
If I have paths.txt that has:
c:\folder1\
c:\folder2\
...
...
c:\folder1000\
my app runs the recursive scan on each path using multiple threads, and returns a file with all the file paths under these folders.
Now I want to write this app in Python.
I've written a simple app that uses os.walk()
to run through a given folder and print the filepaths to output.
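
A minimal sketch of such a sequential version (the names paths.txt and output.txt are placeholders, not anything fixed by the question):

import os

def walk_tree(root, out):
    # walk one root and write every file path found under it
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            out.write(os.path.join(dirpath, name) + '\n')

with open('paths.txt') as f, open('output.txt', 'w') as out:
    for root in f.read().split():
        walk_tree(root, out)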
Now I want to run it in parallel, and I've seen that Python has modules for this: threading and multiprocessing.
What is the best way to do this? And with that approach, how is it done?
os.walk() generates the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames). The first element, dirpath (often named root in examples), is the directory currently being visited, starting from the one you specified.
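
For example (the path is a placeholder):

import os

for dirpath, dirnames, filenames in os.walk(r'c:\folder1'):
    print('directory:', dirpath)        # the directory currently being visited
    print('subdirectories:', dirnames)  # its immediate subdirectories
    print('files:', filenames)          # files directly inside it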
In fact, a Python process cannot run threads in parallel, but it can run them concurrently through context switching during I/O-bound operations. This limitation is enforced by the Global Interpreter Lock (GIL), which prevents threads within the same process from executing Python bytecode at the same time.
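
Because walking a directory tree is mostly I/O, threads can still give a real speedup despite the GIL. A minimal sketch using concurrent.futures.ThreadPoolExecutor (the roots and the worker count are assumptions):

import os
from concurrent.futures import ThreadPoolExecutor

def count_files(root):
    # I/O-bound: the GIL is released while the OS reads directory entries
    return sum(len(files) for _, _, files in os.walk(root))

roots = [r'c:\folder1', r'c:\folder2']  # placeholder roots
with ThreadPoolExecutor(max_workers=4) as executor:
    for root, total in zip(roots, executor.map(count_files, roots)):
        print(root, total)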
The os.listdir() function returns a list of every file and folder directly inside a single directory. The os.walk() function, by contrast, walks an entire file tree, yielding every directory and file under it.
To traverse a directory tree in Python, use the os.walk() function. It accepts four arguments and, for each directory, yields a 3-tuple of dirpath, dirnames, and filenames.
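
Those four arguments are top, topdown, onerror, and followlinks; only the first is required. A short sketch exercising them (the path and the error handler are placeholders):

import os

def log_error(err):
    # os.walk calls this with the OSError when a directory cannot be listed
    print('skipped:', err)

for dirpath, dirnames, filenames in os.walk(r'c:\folder1', topdown=True,
                                            onerror=log_error, followlinks=False):
    print(dirpath, len(dirnames), len(filenames))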
Here is a multiprocessing solution:
from multiprocessing.pool import Pool
from multiprocessing import JoinableQueue as Queue
import os

def explore_path(path):
    # list one directory, write its files to a per-directory output file,
    # and return its subdirectories for further exploration
    directories = []
    nondirectories = []
    for filename in os.listdir(path):
        fullname = os.path.join(path, filename)
        if os.path.isdir(fullname):
            directories.append(fullname)
        else:
            nondirectories.append(fullname)  # record the full file path
    outputfile = path.replace(os.sep, '_') + '.txt'
    with open(outputfile, 'w') as f:
        for filename in nondirectories:
            print(filename, file=f)  # write one file path per line
    return directories

def parallel_worker():
    while True:
        path = unsearched.get()
        dirs = explore_path(path)
        for newdir in dirs:
            unsearched.put(newdir)  # newly found directories become new tasks
        unsearched.task_done()

# acquire the list of root paths to search
with open('paths.txt') as f:
    paths = f.read().split()

unsearched = Queue()
for path in paths:
    unsearched.put(path)

with Pool(5) as pool:
    for i in range(5):
        pool.apply_async(parallel_worker)
    unsearched.join()  # block until every queued directory has been processed

print('Done')
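A note on the design: the JoinableQueue lets workers discover new directories and feed them back in as further work items, and unsearched.join() only returns once every queued directory has been marked done. Sharing the queue as a module-level global like this relies on the fork start method (the default on Linux); on platforms that spawn fresh processes, you would need to hand the queue to the workers explicitly. Since the work is I/O-bound anyway, swapping Pool for multiprocessing.pool.ThreadPool should sidestep that issue, as threads share the queue naturally.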