I tried to optimize a file-browsing function written in Python, on Windows, by using os.scandir() instead of os.listdir(). However, the run time is unchanged at about two and a half minutes, and I can't tell why. Below are both versions of the function, the original and the modified one:
os.listdir() version:
def browse(self, path, tree):
    # for each entry in the path
    for entry in os.listdir(path):
        entity_path = os.path.join(path, entry)
        # check if support by git or not
        if self.git_ignore(entity_path) is False:
            # if is a dir create a new level in the tree
            if os.path.isdir(entity_path):
                tree[entry] = Folder(entry)
                self.browse(entity_path, tree[entry])
            # if is a file add it to the tree
            if os.path.isfile(entity_path):
                tree[entry] = File(entity_path)
os.scandir() version:
def browse(self, path, tree):
    # for each entry in the path
    for dirEntry in os.scandir(path):
        entry_path = dirEntry.name
        entity_path = dirEntry.path
        # check if support by git or not
        if self.git_ignore(entity_path) is False:
            # if is a dir create a new level in the tree
            if dirEntry.is_dir(follow_symlinks=True):
                tree[entry_path] = Folder(entity_path)
                self.browse(entity_path, tree[entry_path])
            # if is a file add it to the tree
            if dirEntry.is_file(follow_symlinks=True):
                tree[entry_path] = File(entity_path)
In addition, here are the auxiliary functions used within this one:
def git_ignore(self, filepath):
    if '.git' in filepath:
        return True
    if '.ci' in filepath:
        return True
    if '.delivery' in filepath:
        return True
    child = subprocess.Popen(['git', 'check-ignore', str(filepath)],
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
    output = child.communicate()[0]
    status = child.wait()
    return status == 0
============================================================
class Folder(dict):
    def __init__(self, path):
        self.path = path
        self.categories = {}
============================================================
class File(object):
    def __init__(self, path):
        self.path = path
        self.filename, self.extension = os.path.splitext(self.path)
How can I make this function run faster? My assumption is that extracting the name and path at the start of each iteration makes it slower than it should be. Is that correct?
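os.walk() seems to call os.stat() more often than necessary, which would explain why it is slower than os.scandir(). (Since Python 3.5, os.walk() is itself implemented on top of os.scandir(), so the gap is smaller than it used to be.)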
In this case, I think the best way to improve the speed would be parallel processing, which can give a large speedup in some loops. There are several posts on the topic. Here is one: Parallel Processing in Python – A Practical Guide with Examples.
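As a rough illustration of the parallel approach, here is a minimal sketch that runs a slow per-path check in a thread pool with the standard concurrent.futures module. The names filter_entries and slow_check are made up for this example and do not come from the question; slow_check is only a stand-in for something like git_ignore. Threads tend to help when each call mostly waits on an external process, but whether it pays off in your case would need measuring.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_entries(paths, is_ignored, max_workers=8):
    """Return the paths for which is_ignored(path) is False.

    is_ignored is any callable that takes a path and returns a bool.
    The calls run concurrently in a thread pool; input order is kept.
    """
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as its input
        flags = list(pool.map(is_ignored, paths))
    return [p for p, ignored in zip(paths, flags) if not ignored]


def slow_check(path):
    # Hypothetical stand-in for a slow per-file check:
    # anything with ".git" in its path counts as ignored.
    return ".git" in path


kept = filter_entries(["src/a.py", ".git/config", "README.md"], slow_check)
print(kept)
```

The idea would be to collect the paths of one directory first, run the check on all of them at once, and only then build the tree from the paths that were kept.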
I have also wondered what the best use of these three options (scandir, listdir, walk) is. There is not much documentation comparing their performance, so the best approach is probably to test it yourself, as you did. Here are my conclusions:
os.listdir():
It doesn't seem to have any advantage over os.scandir() except that it is easier to understand. I still use it when I only need to list the files in a directory.
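For the simple case of listing the files in one directory, something like the following sketch is what I have in mind. The helper name list_files is just for illustration. Note that os.listdir() returns bare names, so each name needs a separate os.path.isfile() call to tell files from folders.

```python
import os


def list_files(directory):
    """Return the sorted names of regular files directly inside directory."""
    return sorted(
        name
        for name in os.listdir(directory)
        # os.listdir gives names only, so the type check is a separate call
        if os.path.isfile(os.path.join(directory, name))
    )


print(list_files("."))
```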
os.walk():
This is the most commonly used function when we need to fetch all the items in a directory and its subdirectories.
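A minimal sketch of that use, assuming you want every file path under a root folder while skipping some directory names. The helper all_files and its skip parameter are invented for this example.

```python
import os


def all_files(root, skip=(".git",)):
    """Return the paths of all files under root, including subdirectories."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending
        # into the directories that were removed.
        dirnames[:] = [d for d in dirnames if d not in skip]
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found
```

Pruning dirnames this way means a skipped folder is never read at all, which matters for large folders such as .git.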
os.scandir():
It seems to have (almost) the best of both worlds. It gives you the speed of plain os.listdir() plus extra features that let you simplify your loops, since you can avoid exiftool or other metadata tools when you need extra information about the files.
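To show what I mean by extra features, here is a small sketch that reads the name, type and size of each entry straight from the os.DirEntry objects. The helper name describe is made up for the example.

```python
import os


def describe(directory):
    """Return sorted (name, kind, size_in_bytes) tuples for each entry."""
    rows = []
    with os.scandir(directory) as entries:
        for entry in entries:
            kind = "dir" if entry.is_dir() else "file"
            # The stat result is cached on the DirEntry after the first call
            size = entry.stat().st_size
            rows.append((entry.name, kind, size))
    return sorted(rows)
```

Each entry already carries its name and full path, and the type check usually does not need another system call, so the loop stays short.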
So that's my view after reading a bit and using them. I'm happy to be corrected, so I can learn more about it.