Benchmarks: does Python have a faster way of walking a network folder?

I need to walk through a folder with approximately ten thousand files. My old VBScript is very slow at this. Since I have started using Ruby and Python in the meantime, I benchmarked the three scripting languages to see which would be the best fit for this job.

The results of the tests below, on a subset of 4500 files on a network share, are:

Python: 106 seconds
Ruby: 5 seconds
VBScript: 124 seconds

That VBScript would be slowest was no surprise, but I can't explain the difference between Ruby and Python. Is my test for Python not optimal? Is there a faster way to do this in Python?

The check for thumbs.db is only there for the benchmark; in reality there are more checks to do.

I needed something that checks every file on the path and doesn't produce so much output that it disturbs the timing. The results differ a bit from run to run, but not by much.

#python2.7.0
import os

def recurse(path):
  for (path, dirs, files) in os.walk(path):
    for file in files:
      if file.lower() == "thumbs.db":
        print (path+'/'+file)

if __name__ == '__main__':
  import timeit
  path = '//server/share/folder/'
  print(timeit.timeit('recurse("'+path+'")', setup="from __main__ import recurse", number=1))

'vbscript5.7
set oFso = CreateObject("Scripting.FileSystemObject")
const path = "\\server\share\folder"
start = Timer
myLCfilename="thumbs.db"

sub recurse(folder)
  for each file in folder.Files
    if lCase(file.name) = myLCfilename then
      wscript.echo file
    end if
  next
  for each subfolder in folder.SubFolders
    call Recurse(subfolder)
  next
end Sub

set folder = oFso.getFolder(path)
recurse(folder)
wscript.echo Timer-start

#ruby1.9.3
require 'benchmark'

def recursive(path, bench)
  bench.report(path) do
    Dir["#{path}/**/**"].each{|file| puts file if File.basename(file).downcase == "thumbs.db"}
  end
end

path = '//server/share/folder/'
Benchmark.bm {|bench| recursive(path, bench)}

EDIT: Since I suspected the printing caused a delay, I tested the scripts both printing all 4500 files and printing none. The difference remains: Ruby 5 s and Python 107 s in the first case, Ruby 4.5 s and Python 107 s in the second.

EDIT2: Based on the answers and comments, here is a Python version that in some cases could run faster by skipping folders:

import os

def recurse(path):
  for (path, dirs, files) in os.walk(path):
    for file in files:
      if file.lower() == "thumbs.db":
        print (path+'/'+file)

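def recurse2(path):
    for (path, dirs, files) in os.walk(path):
        # Prune in place so os.walk never descends into the skipped folders.
        # Note the trailing comma: ('comics') is just a string, not a tuple,
        # and removing items from dirs while looping over it can skip entries.
        dirs[:] = [d for d in dirs if d not in ('comics',)]
        for file in files:
            if file.lower() == "thumbs.db":
                print (path+'/'+file)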


if __name__ == '__main__':
  import timeit
  path = 'f:/'
  print(timeit.timeit('recurse("'+path+'")', setup="from __main__ import recurse", number=1))
  # 6.20102692
  print(timeit.timeit('recurse2("'+path+'")', setup="from __main__ import recurse2", number=1))
  # 2.73848228
  # Ruby, for comparison: 5.7

274

asked Oct 30 '12 11:10

peter

2 Answers

The Ruby implementation for Dir is in C (the file dir.c, according to this documentation). However, the Python equivalent, os.walk, is implemented in Python.

It's not surprising that Python is less performant than C, but the approach used in Python gives a little more flexibility - for example, you could skip entire subtrees named e.g. '.svn', '.git', '.hg' while traversing a directory hierarchy.

Most of the time, the Python implementation is fast enough.
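To illustrate where the time can go: in Python 2.7, os.walk lists each directory and then calls os.path.isdir on every entry, which can mean an extra round trip per file on a network share. The following is only a minimal sketch, assuming Python 3.6 or later (not the 2.7 used in the question), built on os.scandir, which can usually answer the is-directory check from the directory listing itself. The name find_named and its parameters are made up for this example, and error handling for unreadable folders is left out.

```python
import os

def find_named(top, target="thumbs.db"):
    """Yield paths under `top` whose file name equals `target`, ignoring case."""
    target = target.lower()
    stack = [top]                      # directories still to be listed
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # is_dir() can normally reuse data from the listing itself
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.name.lower() == target:
                    yield entry.path

if __name__ == '__main__':
    for hit in find_named('.'):
        print(hit)
```

Since Python 3.5, os.walk itself is built on os.scandir, so simply rerunning the original script on a recent Python 3 may already narrow the gap; whether it does on a particular share would have to be measured.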

Update: Skipping files or subdirectories doesn't affect the traversal rate at all, but the overall time taken to process a directory tree could certainly be reduced, because you avoid traversing potentially large subtrees of the main tree. The time saved is of course proportional to how much you skip. In your case, which looks like folders of images, it's unlikely you would save much time (unless the images were under revision control, in which case skipping the subtrees owned by the revision control system might have some impact).

Additional update: Skipping folders is done by changing the dirs value in place:

for root, dirs, files in os.walk(path):
    for skip in ('.hg', '.git', '.svn', '.bzr'):
        if skip in dirs:
            dirs.remove(skip)
    # Now process other stuff at this level, i.e.
    # in directory "root". The skipped folders
    # won't be recursed into.
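The in-place pruning above can be packaged as a small generator. This is a sketch rather than a definitive version: the names walk_pruned and SKIP_DIRS are invented for the example, and it relies on os.walk's default top-down order, since pruning has no effect when walking bottom-up.

```python
import os

# Directory names to skip; the same ones used in the snippet above.
SKIP_DIRS = {'.hg', '.git', '.svn', '.bzr'}

def walk_pruned(top, skip=SKIP_DIRS):
    """Yield file paths under `top`, never descending into directories named in `skip`."""
    for root, dirs, files in os.walk(top):
        # Slice assignment changes the very list os.walk will recurse into.
        dirs[:] = [d for d in dirs if d not in skip]
        for name in files:
            yield os.path.join(root, name)

if __name__ == '__main__':
    for path in walk_pruned('.'):
        print(path)
```

Rebinding the name instead (dirs = [...]) would not work, because os.walk keeps its own reference to the original list.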

195

answered Oct 04 '22 12:10

Vinay Sajip

I set up a directory structure locally with the following:

for i in $(seq 1 4500); do
    if [[ $i -lt 100 ]]; then
        dir="$(for j in $(seq 1 $i); do echo -n $i/;done)"
        mkdir -p "$dir"
        touch ${dir}$i
    else
        touch $i
    fi
done

This creates 99 files with paths that are 1-99 levels deep and 4401 files in the root of the directory structure.
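For anyone without a Unix shell (for example on the Windows machine from the question), the same layout could be produced from Python. This is a rough equivalent of the shell loop, assuming Python 3.2 or later; build_tree, count_files, and their parameters are names chosen for this sketch. Very deep paths may hit path-length limits on some systems.

```python
import os

def build_tree(root, total=4500, nested_below=100):
    """Create files named 1..total under `root`.

    File i is placed i directories deep when i < nested_below,
    otherwise directly in `root`.
    """
    os.makedirs(root, exist_ok=True)
    for i in range(1, total + 1):
        name = str(i)
        if i < nested_below:
            # e.g. i = 3 gives root/3/3/3/3 (three directories, then the file)
            directory = os.path.join(root, *([name] * i))
            os.makedirs(directory, exist_ok=True)
        else:
            directory = root
        open(os.path.join(directory, name), "w").close()

def count_files(root):
    """Count all regular directory entries reported as files by os.walk."""
    return sum(len(files) for _, _, files in os.walk(root))
```

Calling build_tree('files') in an empty working directory and then count_files('files') is a quick way to confirm that all 4500 files exist before benchmarking.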

I used the following Ruby script:

#!/usr/bin/env ruby
require 'benchmark'

def recursive(path, bench)
  bench.report(path) do
    Dir["#{path}/**/**"]
  end
end

path = 'files'
Benchmark.bm {|bench| recursive(path, bench)}

I got the following result:

           user     system      total        real
    files/  0.030000   0.090000   0.120000 (  0.108562)

I used the following Python script, based on os.walk:

#!/usr/bin/env python

import os
import timeit

def path_recurse(path):
    for (path, dirs, files) in os.walk(path):
        for folder in dirs:
            yield '{}/{}'.format(path, folder)
        for filename in files:
            yield '{}/{}'.format(path, filename)

if __name__ == '__main__':
    path = 'files'
    print(timeit.timeit('[i for i in path_recurse("'+path+'")]', setup="from __main__ import path_recurse", number=1))

I got the following result:

    0.250478029251

So it looks like Ruby is still performing better. It would be interesting to see how this one performs on your file set on the network share.

It would probably also be interesting to see this script run on Python 3, Jython, and maybe even PyPy.
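To make such reruns easier to compare across interpreters, a small harness like the one below might help. It is only a sketch: count_entries and best_time are names invented here, and 'files' is the test directory created earlier in this answer. Passing a callable to timeit avoids building a statement string around the path, and taking the minimum of several runs reduces run-to-run noise.

```python
import os
import timeit

def count_entries(path):
    """Count every directory and file below `path` using os.walk."""
    return sum(len(dirs) + len(files) for _, dirs, files in os.walk(path))

def best_time(func, path, repeat=3):
    """Return the fastest of `repeat` single runs of func(path), in seconds."""
    return min(timeit.repeat(lambda: func(path), number=1, repeat=repeat))

if __name__ == '__main__':
    # 'files' is the directory built for the benchmark above
    print(best_time(count_entries, 'files'))
```

One caveat: on a network share the first run may be slower than later ones because of caching, so the minimum reflects the warmed-up case rather than a cold start.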

answered Oct 04 '22 13:10

Wren T.

Tags: python, ruby, benchmarking, vbscript
