I need to generate a list of files with paths that contain a certain string by recursively searching. I'm doing this currently like this:
for i in iglob(starting_directory+'/**/*', recursive=True):
if filemask in i.split('\\')[-1]: # ignore directories that contain the filemask
filelist.append(i)
This works, but when crawling a large directory tree, it's woefully slow (~10 minutes). We're on Windows, so doing an external call to the unix find command isn't an option. My understanding is that glob is faster than os.walk.
Is there a faster way of doing this?
Try any one of the following commands to see recursive directory listing: ls -R : Use the ls command to get recursive directory listing on Linux. find /dir/ -print : Run the find command to see recursive directory listing in Linux. du -a . : Execute the du command to view recursive directory listing on Unix.
Use glob. glob() to search for specific files in subdirectories in Python. Call glob. glob(pathname, recursive=True) with pathname as a path to a directory and recursive as True to enable recursively searching through existing subdirectories.
Maybe not the answer you were hoping for, but I think these timings are useful. Run on a directory with 15,424 directories totalling 102,799 files (of which 3059 are .py files).
Python 3.6:
import os
import glob
def walk():
pys = []
for p, d, f in os.walk('.'):
for file in f:
if file.endswith('.py'):
pys.append(file)
return pys
def iglob():
pys = []
for file in glob.iglob('**/*', recursive=True):
if file.endswith('.py'):
pys.append(file)
return pys
def iglob2():
pys = []
for file in glob.iglob('**/*.py', recursive=True):
pys.append(file)
return pys
# I also tried pathlib.Path.glob but it was slow and error prone, sadly
%timeit walk()
3.95 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit iglob()
5.01 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit iglob2()
4.36 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using GNU find (4.6.0) on cygwin (4.6.0-1)
Edit: The below is on WINDOWS, on LINUX I found find
to be about 25% faster
$ time find . -name '*.py' > /dev/null
real 0m8.827s
user 0m1.482s
sys 0m7.284s
Seems like os.walk
is as good as you can get on windows.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With