
What's the fastest way to recursively search for files in python?

I need to generate a list of files with paths that contain a certain string by recursively searching. I'm doing this currently like this:

from glob import iglob

filelist = []
for i in iglob(starting_directory+'/**/*', recursive=True):
    if filemask in i.split('\\')[-1]: # test only the last path component, so directories that contain the filemask are ignored
        filelist.append(i)

This works, but crawling a large directory tree is woefully slow (~10 minutes). We're on Windows, so an external call to the Unix find command isn't an option. My understanding is that glob is faster than os.walk.

Is there a faster way of doing this?

Noise in the street asked Jun 20 '18 12:06



1 Answer

Maybe not the answer you were hoping for, but I think these timings are useful. They were run on a directory tree containing 15,424 directories and 102,799 files (3,059 of which are .py files).

Python 3.6:

import os
import glob

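def walk():
    pys = []
    # p: current directory path, d: its subdirectory names, f: its file names
    for p, d, f in os.walk('.'):
        for file in f:
            if file.endswith('.py'):
                # stores the bare file name; os.path.join(p, file) would give the path
                pys.append(file)
    return pys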

def iglob():
    pys = []
    for file in glob.iglob('**/*', recursive=True):
        if file.endswith('.py'):
            pys.append(file)
    return pys

def iglob2():
    pys = []
    for file in glob.iglob('**/*.py', recursive=True):
        pys.append(file)
    return pys

# I also tried pathlib.Path.glob but it was slow and error prone, sadly

%timeit walk()
3.95 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit iglob()
5.01 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit iglob2()
4.36 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
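
%timeit is an IPython magic, so the timings above cannot be reproduced verbatim from a plain script. A rough equivalent can be sketched with the standard timeit module. The helper name time_call is made up for this example, and its numbers will not match %timeit output exactly:

```python
import statistics
import timeit

def time_call(func, repeat=7, number=1):
    """Return (mean, stdev) in seconds per call over `repeat` runs of `number` calls."""
    # timeit.repeat accepts a zero-argument callable and returns one total per run
    runs = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in runs]
    return statistics.mean(per_call), statistics.stdev(per_call)
```

For example, mean, sd = time_call(walk) would time the walk() function defined above with the same 7 runs of 1 loop each.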

Using GNU find (4.6.0) on Cygwin (4.6.0-1):

Edit: The timing below is from Windows. On Linux, I found find to be about 25% faster.

$ time find . -name '*.py' > /dev/null

real    0m8.827s
user    0m1.482s
sys     0m7.284s

It seems that os.walk is about as good as you can get on Windows.
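
Applied to the task in the question, an os.walk version might look like the sketch below. The function name find_files and its parameters are illustrative and not taken from the question. It has not been timed against the functions above, and it assumes that filemask should match the file name only and that full paths are wanted in the result:

```python
import os

def find_files(starting_directory, filemask):
    """Return paths of files under starting_directory whose name contains filemask."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(starting_directory):
        for name in filenames:
            # Test only the file name, so directories containing filemask do not match
            if filemask in name:
                matches.append(os.path.join(dirpath, name))
    return matches
```

Because os.walk already yields file names separately from directory names, no path splitting is needed to ignore directories that happen to contain the filemask.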

FHTMitchell answered Sep 28 '22 12:09