I need to walk through a folder containing approximately ten thousand files. My old VBScript is very slow at this. Since I've started using Ruby and Python in the meantime, I benchmarked the three scripting languages to see which would be the best fit for the job.
The results of the tests below, on a subset of 4500 files on a network share, are:

Python: 106 seconds
Ruby: 5 seconds
VBScript: 124 seconds
That VBScript would be the slowest was no surprise, but I can't explain the difference between Ruby and Python. Is my Python test suboptimal? Is there a faster way to do this in Python?
The check for thumbs.db is just for the benchmark; in reality there are more tests to perform.
I needed something that checks every file on the path and doesn't produce too much output, so as not to disturb the timing. The results differ a bit between runs, but not by much.
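For comparison, newer Python versions walk directories much faster: os.scandir (Python 3.5+, also available as the third-party scandir package for 2.7) returns file-type information together with each name, avoiding an extra stat() call per entry. A rough sketch of a scandir-based search, assuming Python 3.5+ (the function name find_files is my own, not part of the benchmarked scripts):

```python
import os

def find_files(path, target="thumbs.db"):
    """Iteratively walk path with os.scandir, collecting files whose
    basename matches target (case-insensitive)."""
    matches = []
    stack = [path]
    while stack:
        current = stack.pop()
        for entry in os.scandir(current):
            if entry.is_dir(follow_symlinks=False):
                # entry.is_dir() usually answers from cached directory
                # metadata, so no extra stat() call is needed
                stack.append(entry.path)
            elif entry.name.lower() == target:
                matches.append(entry.path)
    return matches
```

This is a sketch, not a drop-in replacement for the timed scripts below.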
#python2.7.0
import os

def recurse(path):
    for (path, dirs, files) in os.walk(path):
        for file in files:
            if file.lower() == "thumbs.db":
                print(path + '/' + file)

if __name__ == '__main__':
    import timeit
    path = '//server/share/folder/'
    print(timeit.timeit('recurse("' + path + '")', setup="from __main__ import recurse", number=1))
'vbscript5.7
set oFso = CreateObject("Scripting.FileSystemObject")
const path = "\\server\share\folder"
start = Timer
myLCfilename = "thumbs.db"

sub recurse(folder)
    for each file in folder.Files
        if lCase(file.name) = myLCfilename then
            wscript.echo file
        end if
    next
    for each subfolder in folder.SubFolders
        call Recurse(subfolder)
    next
end sub

set folder = oFso.getFolder(path)
recurse(folder)
wscript.echo Timer-start
#ruby1.9.3
require 'benchmark'

def recursive(path, bench)
  bench.report(path) do
    Dir["#{path}/**/**"].each { |file| puts file if File.basename(file).downcase == "thumbs.db" }
  end
end

path = '//server/share/folder/'
Benchmark.bm { |bench| recursive(path, bench) }
EDIT: Since I suspected the printing caused a delay, I tested the scripts both printing all 4500 matching files and printing none. The difference remains: R:5 P:107 in the first case and R:4.5 P:107 in the latter.
EDIT2: Based on the answers and comments, here is a Python version that in some cases can run faster by skipping folders:
import os

def recurse(path):
    for (path, dirs, files) in os.walk(path):
        for file in files:
            if file.lower() == "thumbs.db":
                print(path + '/' + file)

def recurse2(path):
    for (path, dirs, files) in os.walk(path):
        # iterate over a copy: removing from dirs while iterating over it
        # would skip entries; os.walk only honors in-place changes to dirs
        for dir in dirs[:]:
            if dir in ('comics',):
                dirs.remove(dir)
        for file in files:
            if file.lower() == "thumbs.db":
                print(path + '/' + file)

if __name__ == '__main__':
    import timeit
    path = 'f:/'
    print(timeit.timeit('recurse("' + path + '")', setup="from __main__ import recurse", number=1))
    #6.20102692
    print(timeit.timeit('recurse2("' + path + '")', setup="from __main__ import recurse2", number=1))
    #2.73848228
    #ruby 5.7
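A note on the pruning in recurse2: removing items from a list while iterating over it can skip entries, and os.walk only honors in-place changes to dirs. The usual idiom is a slice assignment that filters dirs in place. A minimal sketch (the function name walk_pruned and its skip default are my own; 'comics' is just the folder name from the example above):

```python
import os

def walk_pruned(path, skip=("comics",)):
    """Yield file paths under path, pruning any directory whose name is
    in skip. The slice assignment mutates the very list object that
    os.walk holds, so the skipped folders are never descended into."""
    for root, dirs, files in os.walk(path):
        dirs[:] = [d for d in dirs if d not in skip]
        for name in files:
            yield os.path.join(root, name)
```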
The Ruby implementation for Dir is in C (the file dir.c, according to this documentation). However, the Python equivalent is implemented in Python.
It's not surprising that Python is less performant than C, but the approach used in Python gives a little more flexibility: for example, you could skip entire subtrees named e.g. '.svn', '.git', '.hg' while traversing a directory hierarchy.
Most of the time, the Python implementation is fast enough.
Update: The skipping of files/subdirs doesn't affect the traversal rate at all, but the overall time taken to process a directory tree could certainly be reduced because you avoid having to traverse potentially large subtrees of the main tree. The time saved is of course proportional to how much you skip. In your case, which looks like folders of images, it's unlikely you would save much time (unless the images were under revision control, when skipping subtrees owned by the revision control system might have some impact).
Additional update: Skipping folders is done by changing the dirs value in place:

for root, dirs, files in os.walk(path):
    for skip in ('.hg', '.git', '.svn', '.bzr'):
        if skip in dirs:
            dirs.remove(skip)
    # Now process other stuff at this level, i.e.
    # in directory "root". The skipped folders
    # won't be recursed into.
I set up a directory structure locally with the following:

for i in $(seq 1 4500); do
    if [[ $i -lt 100 ]]; then
        dir="$(for j in $(seq 1 $i); do echo -n $i/; done)"
        mkdir -p "$dir"
        touch "${dir}$i"
    else
        touch $i
    fi
done
This creates 99 files with paths that are 1-99 levels deep and 4401 files in the root of the directory structure.
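The same layout can be reproduced in Python, which may be handy on Windows where the shell loop above isn't available. A sketch of the equivalent (the helper name build_tree is my own):

```python
import os

def build_tree(base, total=4500):
    """Recreate the test layout: files 1..99 live at paths that are
    1..99 levels deep (e.g. 5/5/5/5/5), the rest sit in the root."""
    for i in range(1, total + 1):
        if i < 100:
            d = os.path.join(base, *([str(i)] * i))
            os.makedirs(d)
            open(os.path.join(d, str(i)), "w").close()
        else:
            open(os.path.join(base, str(i)), "w").close()
```

Note that the deepest paths (99 levels) can exceed Windows' legacy MAX_PATH limit.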
I used the following ruby script:
#!/usr/bin/env ruby
require 'benchmark'

def recursive(path, bench)
  bench.report(path) do
    Dir["#{path}/**/**"]
  end
end

path = 'files'
Benchmark.bm { |bench| recursive(path, bench) }
I got the following result:
user system total real
files/ 0.030000 0.090000 0.120000 ( 0.108562)
I used the following Python script based on os.walk:
#!/usr/bin/env python
import os
import timeit

def path_recurse(path):
    for (path, dirs, files) in os.walk(path):
        for folder in dirs:
            yield '{}/{}'.format(path, folder)
        for filename in files:
            yield '{}/{}'.format(path, filename)

if __name__ == '__main__':
    path = 'files'
    print(timeit.timeit('[i for i in path_recurse("' + path + '")]', setup="from __main__ import path_recurse", number=1))
I got the following result:
0.250478029251
So it looks like Ruby is still performing better. It would be interesting to see how this one performs on your fileset on the network share.
It would probably also be interesting to see this script run on Python 3, and with Jython and maybe even PyPy.
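For a Python 3 run, pathlib offers a more compact spelling of the same search (Python 3.5+ for rglob; the helper name find_thumbs is my own, and this is not necessarily faster than os.walk):

```python
from pathlib import Path

def find_thumbs(path):
    """Recursively collect entries whose basename is thumbs.db,
    matching case-insensitively as in the scripts above."""
    return [p for p in Path(path).rglob("*") if p.name.lower() == "thumbs.db"]
```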