
Python Walk, but Thread Lightly

Tags: python, os.walk

I'd like to recursively walk a directory, but I want Python to break out of any single listdir if it encounters a directory with more than 100 files. Basically, I'm searching for a .TXT file, but I want to avoid directories holding large DPX image sequences (usually 10,000 files). Since DPX files live in directories by themselves, with no subdirectories, I'd like to break out of that loop as soon as possible.

Long story short: if Python encounters a file matching ".DPX$", it should stop listing that subdirectory, back out, skip it, and continue the walk in the other subdirectories.

Is it possible to break out of a directory listing loop before all the results are returned?

Jamie asked May 04 '12 18:05


3 Answers

If by 'directory listing loop' you mean os.listdir(), then no: it returns the complete list in one call, so there is no loop to break out of. You could, however, use os.walk() or os.path.walk() (Python 2 only; it was removed in Python 3) and skip the directories that contain DPX files. With os.walk() walking top-down, you control which directories Python descends into by modifying the list of directory names in place. os.path.walk() lets you do the same from its visit callback, by modifying in place the names list it is passed.

Will answered Oct 10 '22 18:10


According to the documentation for os.walk:

When topdown is True, the caller can modify the dirnames list in-place (e.g., via del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, or to impose a specific order of visiting. Modifying dirnames when topdown is False is ineffective, since the directories in dirnames have already been generated by the time dirnames itself is generated.

So if you empty dirnames in place, os.walk will not recurse into any further subdirectories of the current directory. Note the comment about "...via del or slice assignment": you cannot simply write dirnames = [], because that only rebinds the local name to a new list and leaves the list that os.walk holds untouched. Use del dirnames[:] or dirnames[:] = [] instead.
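
A minimal sketch of that pruning, applied to the question's DPX case. It assumes Python 3.6+ and that a DPX directory can be recognised by peeking at a handful of its entries; the names looks_like_dpx_dir, find_txt_files and sample_size are made up for this illustration and are not part of os.walk.

```python
import os
from itertools import islice


def looks_like_dpx_dir(path, sample_size=5):
    """Peek lazily at up to sample_size entries; True if any ends in .dpx."""
    try:
        with os.scandir(path) as entries:
            return any(
                entry.name.lower().endswith(".dpx")
                for entry in islice(entries, sample_size)
            )
    except OSError:
        # Unreadable directory: do not prune it here. By default os.walk
        # silently skips directories it cannot list.
        return False


def find_txt_files(root):
    """Yield paths of .txt files under root, skipping DPX directories."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # Slice assignment mutates the list that os.walk holds on to.
        # A plain `dirnames = [...]` would only rebind the local name.
        dirnames[:] = [
            d for d in dirnames
            if not looks_like_dpx_dir(os.path.join(dirpath, d))
        ]
        for name in filenames:
            if name.lower().endswith(".txt"):
                yield os.path.join(dirpath, name)
```

Because the check happens before os.walk descends, a pruned directory is never listed in full. The sample size is a guess: directory order is arbitrary, so a directory that mixes a few DPX files with many other files could slip through.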

ldx.a.ldy.c answered Oct 10 '22 19:10


The right way to avoid allocating the full list of names that os.listdir builds is to call the OS-level functions directly, as @Charles Duffy said.

Inspired by this other post: List files in a folder as a stream to begin process immediately

I added a solution to the OP's specific question and used the re-entrant version of the function (readdir_r).

# Written for Python 3 on Linux; struct dirent differs on other platforms.
import os
from ctypes import (CDLL, POINTER, Structure, byref, c_byte, c_char, c_char_p,
                    c_int, c_long, c_ushort, get_errno)
from ctypes.util import find_library

class c_dir(Structure):
    """Opaque type for directory streams, corresponds to struct DIR"""
    pass

class c_dirent(Structure):
    """Directory entry"""
    # FIXME not sure these are the exactly correct types!
    # The field offsets are believed to match 64-bit Linux with glibc.
    _fields_ = (
        ('d_ino', c_long), # inode number
        ('d_off', c_long), # offset to the next dirent
        ('d_reclen', c_ushort), # length of this record
        ('d_type', c_byte), # type of file; not supported by all file system types
        ('d_name', c_char * 4096) # filename (deliberately oversized buffer)
        )
c_dirent_p = POINTER(c_dirent)
c_dirent_pp = POINTER(c_dirent_p)
c_dir_p = POINTER(c_dir)

# use_errno=True lets get_errno() report why opendir failed
c_lib = CDLL(find_library("c"), use_errno=True)
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

# readdir_r is deprecated since glibc 2.24 but is still exported
readdir_r = c_lib.readdir_r
readdir_r.argtypes = [c_dir_p, c_dirent_p, c_dirent_pp]
readdir_r.restype = c_int

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int

def listdirx(path):
    """
    A generator to return the names of files in the directory passed in
    """
    dir_p = opendir(os.fsencode(path))

    if not dir_p:
        err = get_errno()
        raise OSError(err, os.strerror(err), path)

    entry = c_dirent()     # buffer that readdir_r fills in
    result = c_dirent_p()  # set to NULL once the directory is exhausted

    try:
        while True:
            # readdir_r returns an error number directly, not via errno
            res = readdir_r(dir_p, byref(entry), byref(result))
            if res:
                raise OSError(res, os.strerror(res), path)
            if not result:
                break
            name = entry.d_name
            if name not in (b".", b".."):
                yield os.fsdecode(name)
    finally:
        # Runs when the generator finishes or is closed early
        closedir(dir_p)

if __name__ == '__main__':
    import sys
    path = sys.argv[1]
    max_per_dir = int(sys.argv[2])
    for idx, entry in enumerate(listdirx(path)):
        if idx >= max_per_dir:
            break
        print(entry)
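
On Python 3.5 and later, the standard library's os.scandir() returns a lazy iterator, so the same early exit is possible without ctypes. The sketch below is a portable alternative rather than a drop-in replacement; the context-manager form needs Python 3.6+, and iter_names and first_names are names invented for this example.

```python
import os
from itertools import islice


def iter_names(path):
    """Yield entry names one at a time instead of building a full list."""
    # os.scandir never yields "." or "..", so no filtering is needed.
    with os.scandir(path) as entries:
        for entry in entries:
            yield entry.name


def first_names(path, limit):
    """Return at most `limit` names from `path`, then stop reading."""
    return list(islice(iter_names(path), limit))
```

For example, first_names("/some/dir", 100) stops after 100 entries even if the directory holds 10,000 files ("/some/dir" is a placeholder path).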

fabrizioM answered Oct 10 '22 20:10