I am using os.walk to build a map of a data-store (this map is used later in the tool I am building).
This is the code I currently use:
    import os

    def find_children(tickstore):
        children = []
        dir_list = os.walk(tickstore)
        for i in dir_list:
            # i is a (dirpath, dirnames, filenames) tuple; keep only the path
            children.append(i[0])
        return children
I have done some analysis on it: dir_list = os.walk(tickstore) runs instantly, and if I do nothing with dir_list then this function completes instantly. It is iterating over dir_list that takes a long time. Even if I don't append anything, just iterating over it is what takes the time.
Tickstore is a big datastore, with ~10,000 directories. Currently it takes approximately 35 minutes to complete this function.
Is there any way to speed it up?
I've looked at alternatives to os.walk, but none of them seemed to provide much of an advantage in terms of speed.
Yes: use Python 3.5 (which is currently still a release candidate, but should be out shortly). In Python 3.5, os.walk was rewritten to be more efficient. This work was done as part of PEP 471.
Extracted from the PEP:
Python's built-in os.walk() is significantly slower than it needs to be, because -- in addition to calling os.listdir() on each directory -- it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.

But the underlying system calls -- FindFirstFile / FindNextFile on Windows and readdir on POSIX systems -- already tell you whether the files returned are directories or not, so no further system calls are needed. Further, the Windows system calls return all the information for a stat_result object on the directory entry, such as file size and last modification time.

In short, you can reduce the number of system calls required for a tree function like os.walk() from approximately 2N to N, where N is the total number of files and directories in the tree. (And because directory trees are usually wider than they are deep, it's often much better than this.)

In practice, removing all those extra system calls makes os.walk() about 8-9 times as fast on Windows, and about 2-3 times as fast on POSIX systems. So we're not talking about micro-optimizations. See more benchmarks here.
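To make the idea in the PEP concrete, here is a minimal sketch of the same directory map built directly on os.scandir(), which is available from Python 3.5. The function name find_child_dirs is made up for illustration. It assumes you only need directory paths, and it does not descend into symlinked directories, which mirrors the default behaviour of os.walk. Whether it helps on a given datastore depends on the filesystem, so treat it as something to measure rather than a guaranteed fix.

```python
import os


def find_child_dirs(root):
    """Return root plus every directory below it (hypothetical helper)."""
    children = [root]
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            entries = os.scandir(current)
        except OSError:
            # Unreadable directory: skip it, as os.walk does by default.
            continue
        for entry in entries:
            # is_dir() normally answers from the directory listing itself,
            # so no extra stat() call is needed for most entries.
            if entry.is_dir(follow_symlinks=False):
                children.append(entry.path)
                stack.append(entry.path)
    return children
```

Calling find_child_dirs(tickstore) should return the same set of paths as the find_children function in the question, though not necessarily in the same order.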
To optimize it in Python 2.7, use scandir.walk() instead of os.walk(). The parameters are exactly the same.
    import scandir

    directory = "/tmp"
    res = scandir.walk(directory)
    for item in res:
        # each item is a (dirpath, dirnames, filenames) tuple
        print(item)
PS: Just as @recoup mentioned in a comment, scandir needs to be installed before use in Python 2.7.
os.walk is currently quite slow because it first lists the directory and then does a stat on each entry to see if it is a directory or a file.
An improvement is proposed in PEP 471 and should be coming soon in Python 3.5. In the meantime you could use the scandir package to get the same benefits in Python 2.7.
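If the same code has to run on both old and new interpreters, one possible approach is to prefer the scandir backport when it is importable and fall back to the standard library otherwise. This is only a sketch: it assumes the third-party scandir package was installed separately (for example with pip), and it reuses the find_children name from the question.

```python
try:
    # Third-party backport; assumed to be installed separately.
    from scandir import walk
except ImportError:
    # Python 3.5+ already has the faster implementation built in.
    from os import walk


def find_children(tickstore):
    """Return the path of every directory under tickstore, including itself."""
    return [dirpath for dirpath, _dirnames, _filenames in walk(tickstore)]
```

Because scandir.walk() takes the same parameters as os.walk(), the rest of the tool should not need to change.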