 

Most efficient way to traverse file structure Python

Is using os.walk as shown below the least time-consuming way to recursively search through a folder and return all the files that end with .tnt?

import os

for root, dirs, files in os.walk('C:\\data'):
    print("Now in root %s" % root)
    for f in files:
        if f.endswith('.tnt'):
            print(os.path.join(root, f))  # or collect/process the match
asked Sep 19 '12 by georges

2 Answers

Yes, using os.walk is indeed the best way to do that.
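For completeness, here is the pattern from the question filled out into a whole function. The name `find_tnt` is just illustrative; any top directory works in place of the asker's `C:\\data`:

```python
import os

# Walk the tree under `top` and collect every path ending in .tnt.
def find_tnt(top):
    matches = []
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.endswith('.tnt'):
                matches.append(os.path.join(root, name))
    return matches
```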

answered Nov 15 '22 by Martijn Pieters

As everyone has said, os.walk is almost certainly the best way to do it.

If you actually have a performance problem, and profiling has shown that it's caused by os.walk (and/or iterating the results with .endswith), your best answer is probably to step outside Python. Replace all of the code above with:

import sys

for f in sys.argv[1:]:
    pass  # do your per-file work here

Now you need some outside tool that can gather the paths and run your script. (Ideally batching as many paths as possible into each script execution.)

If you can rely on Windows Desktop Search having indexed the drive, it should only need to do a quick database operation to find all files under a certain path with a certain extension. I have no idea how to write a batch file that runs that query and gets the results as a list of arguments to pass to a Python script (or a PowerShell file that runs the query and passes the results to IronPython without serializing it into a list of arguments), but it would be worth researching this before anything else.

If you can't rely on your platform's desktop search index, on any POSIX platform, it would almost certainly be fastest and simplest to use this one-liner shell script:

find /my/path -name '*.tnt' -exec myscript.py {} +

Unfortunately, you're not on a POSIX platform, you're on Windows, which doesn't come with the find tool, which is the thing that's doing all the heavy lifting here.

There are ports of find to native Windows, but you'll have to work out the command-line intricacies to get everything quoted right, format the path correctly, and so on, so you can write the one-liner batch file. Alternatively, you could install Cygwin and use the exact same shell script you'd use on a POSIX system. Or you could find a more Windows-y tool that does what you need.

This could conceivably be slower rather than faster: Windows isn't designed to launch lots of small processes with minimal overhead, and I believe it has smaller command-line length limits than platforms like Linux or OS X, so you may spend more time waiting for the interpreter to start and exit than you save. You'd have to test to see. In fact, you probably want to test both the native and Cygwin versions (with both native and Cygwin Python, in the latter case).

You don't actually have to move the find invocation into a batch/shell script; it's probably the simplest answer, but there are others, such as using subprocess to call find from within Python. This might solve performance problems caused by starting the interpreter too many times.
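A hedged sketch of that subprocess route: let find do the directory walk in C, and read the matching paths back over a single pipe. The throwaway temporary directory here is only for demonstration; substitute your real search root:

```python
import os
import subprocess
import tempfile

# Demo setup: a scratch directory with one matching and one
# non-matching file (stand-ins for a real tree).
root = tempfile.mkdtemp()
open(os.path.join(root, 'notes.tnt'), 'w').close()
open(os.path.join(root, 'other.txt'), 'w').close()

# Run find once; -print0 gives NUL-separated paths that are safe to
# split even if filenames contain spaces or newlines.
out = subprocess.check_output(['find', root, '-name', '*.tnt', '-print0'])
paths = [p.decode() for p in out.split(b'\0') if p]
print(paths)  # only notes.tnt is listed
```

This only works where a find binary is on the PATH (POSIX, Cygwin, or a Windows port), and whether it beats os.walk is exactly the kind of thing you'd have to measure.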

Getting the right amount of parallelism may also help: spin off each invocation of your script into the background and don't wait for it to finish. (I believe on Windows the shell isn't involved in this; instead there's a tool named something like "run" that kicks off a process detached from the shell. But I don't remember the details.)
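The fire-and-forget pattern looks like this in Python, sketched with echo standing in for the real myscript.py and made-up batches of paths: start all the workers first, then reap them, instead of running each one to completion before starting the next:

```python
import subprocess

# Hypothetical batches of .tnt paths; in practice these would come
# from the directory scan.
batches = [['a.tnt', 'b.tnt'], ['c.tnt']]

# Start every worker without waiting (Popen returns immediately).
procs = [subprocess.Popen(['echo'] + batch, stdout=subprocess.PIPE)
         for batch in batches]

# Now collect all of them; the workers ran concurrently.
outputs = [p.communicate()[0] for p in procs]
```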

If none of this works out, you may have to write a custom C extension that does the fastest possible Win32 or .NET thing (which also means you have to do the research to find out what that is…) so you can call that from within Python.

answered Nov 15 '22 by abarnert