 

Quicker to os.walk or glob?

I'm messing around with file lookups in Python on a large hard disk. I've been looking at os.walk and glob. I usually use os.walk, as I find it much neater and it seems quicker (for typical directory sizes).

Has anyone got experience with both and could say which is more efficient? As I say, glob seems slower, but it supports wildcards etc., whereas with walk you have to filter the results yourself. Here is an example of looking up core dumps.

import os
import re

core = re.compile(r"core\.\d*")
for root, dirs, files in os.walk("/path/to/dir/"):
    for file in files:
        if core.search(file):
            path = os.path.join(root, file)
            print("Deleting: " + path)
            os.remove(path)

Or

import os
from glob import iglob

for file in iglob("/path/to/dir/core.*"):
    print("Deleting: " + file)
    os.remove(file)
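The two approaches are not mutually exclusive: the standard fnmatch module applies glob-style wildcard patterns to plain name lists, so you can keep os.walk's recursion and skip the regex. A minimal sketch (the find_core_files name is made up for illustration):

```python
import fnmatch
import os

def find_core_files(top):
    """Yield paths under *top* whose basenames match core.* (shell-style)."""
    for root, dirs, files in os.walk(top):
        # fnmatch.filter applies the glob-style pattern to the name list
        for name in fnmatch.filter(files, "core.*"):
            yield os.path.join(root, name)
```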
joedborg asked Jan 19 '12



2 Answers

I ran a small benchmark on a cache of web pages spread across 1000 directories. The task was to count the total number of files. The output:

os.listdir: 0.7268s, 1326786 files found
os.walk:    3.6592s, 1326787 files found
glob.glob:  2.0133s, 1326786 files found

As you can see, os.listdir is the quickest of the three, and glob.glob is still quicker than os.walk for this task.

The source:

import glob
import os
import time

n, t = 0, time.time()
for i in range(1000):
    n += len(os.listdir("./%d" % i))
t = time.time() - t
print("os.listdir: %.4fs, %d files found" % (t, n))

n, t = 0, time.time()
for root, dirs, files in os.walk("./"):
    n += len(files)
t = time.time() - t
print("os.walk: %.4fs, %d files found" % (t, n))

n, t = 0, time.time()
for i in range(1000):
    n += len(glob.glob("./%d/*" % i))
t = time.time() - t
print("glob.glob: %.4fs, %d files found" % (t, n))
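For what it's worth, Python 3.5+ also offers os.scandir(), whose DirEntry objects carry file-type information from the directory scan itself, avoiding the per-entry stat() call that makes os.walk slow. A sketch of the per-directory counting step (count_files is a made-up helper name, not from the benchmark above):

```python
import os

def count_files(path):
    """Count regular files directly inside *path* using os.scandir().

    DirEntry.is_file() can usually answer from data the directory scan
    already returned, so no extra stat() call per entry is needed.
    """
    return sum(1 for entry in os.scandir(path) if entry.is_file())
```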
a5kin answered Sep 23 '22


Don't waste your time on optimization before measuring/profiling. Focus on keeping your code simple and easy to maintain.

For example, in your code you precompile the RE, which does not give you any speed boost, because the re module keeps an internal cache (re._cache) of precompiled patterns.
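To illustrate: the module-level re functions compile a string pattern once and reuse the cached compiled object on later calls, so both forms below behave identically for a single pattern:

```python
import re

pattern = r"core\.\d+"

# Module-level call: after the first use, the compiled pattern sits in
# re's internal cache, so repeated calls do not recompile it.
m1 = re.search(pattern, "core.1234")

# Explicitly precompiled equivalent:
m2 = re.compile(pattern).search("core.1234")

assert m1.group() == m2.group() == "core.1234"
```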

  1. Keep it simple
  2. If it's slow, profile it
  3. Once you know exactly what needs to be optimized, make your tweaks and always document them

Note that an optimization made several years earlier can make code run slower than the "non-optimized" version. This applies especially to modern JIT-based languages.

Michał Šrajer answered Sep 20 '22