 

Iterate over a very large number of files in a folder

Tags: python, windows

What is the fastest way to iterate over all files in a directory using NTFS and Windows 7, when the file count in the directory is bigger than 2,500,000? All files are flat under the top-level directory.

Currently I use

for root, subFolders, files in os.walk(rootdir):
    for file in files:
        f = os.path.join(root,file)
        with open(f) as cf:
            [...]

but it is very, very slow. The process has been running for about an hour, has still not processed a single file, yet its memory usage grows by about 2 kB per second.

asked Jun 10 '13 by reox


2 Answers

By default, os.walk walks the directory tree bottom-up. If you have a deep tree with many leaves, I guess this could lead to performance penalties -- or at least to an increased "start-up" time, since walk has to read lots of data before processing the first file.

All of this is speculative, but have you tried forcing a top-down exploration:

for root, subFolders, files in os.walk(rootdir, topdown=True):
    ...

EDIT:

As the files appear to be in a flat directory, maybe glob.iglob could lead to better performance by returning an iterator (whereas other methods like os.walk, os.listdir or glob.glob first build the list of all files). Could you try something like this:

import glob
import os

# ...
for infile in glob.iglob(os.path.join(rootdir, '*.*')):
    # ...
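
For completeness, here is a minimal, self-contained sketch of how iglob could drive the question's original loop; rootdir and the processing body are placeholders taken from the question, so adapt them to your setup:

import glob
import os

rootdir = r"C:\some\flat\folder"  # placeholder path

# iglob yields one match at a time instead of materialising the full list first
for f in glob.iglob(os.path.join(rootdir, '*.*')):
    with open(f) as cf:
        pass  # process the file here, as in the question's loop
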
answered Oct 08 '22 by Sylvain Leroux


I found that os.scandir (in the Python standard library since 3.5) seems to actually do the trick on Windows as well! (As noted in the comments, it does its job equally well on macOS.)

Consider the following example:
"retrieve 100 paths from a folder that contains millions of files"

os.scandir achieves this in a fraction of a second:

import os
from itertools import islice
from pathlib import Path
path = Path("path to a folder with a lot of files")

paths = [i.path for i in islice(os.scandir(path), 100)]

All the other tested options (iterdir, glob, iglob) somehow take a ridiculous amount of time even though they are supposed to return iterators...

import glob

paths = list(islice(path.iterdir(), 100))
paths = list(islice(path.rglob(""), 100))
paths = list(islice(glob.iglob(str(path / "*.*")), 100))
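
To tie this back to the question: a minimal sketch (the path is a placeholder and this is not benchmarked at the 2.5-million-file scale) of processing every file lazily with os.scandir instead of collecting paths first:

import os

rootdir = r"C:\some\flat\folder"  # placeholder path

# scandir yields DirEntry objects lazily, so processing can start immediately
with os.scandir(rootdir) as it:   # the context-manager form needs Python 3.6+
    for entry in it:
        if entry.is_file():
            with open(entry.path) as cf:
                pass  # process the file, as in the question's loop
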
answered Oct 08 '22 by raphael