
Python os.walk memory issue

I wrote a scanner that looks for certain files on all hard drives of the system being scanned. Some of these systems are pretty old, running Windows 2000 with 256 or 512 MB of RAM, but their file system structure is complex because some of them serve as file servers.

I use os.walk() in my script to traverse all directories and files.

Unfortunately, we noticed that the scanner consumes a lot of RAM after it has been scanning for a while, and we found that os.walk() alone uses about 50 MB of RAM after 2 hours of walking the file system. The RAM usage keeps growing over time: we saw about 90 MB after 4 hours of scanning.

Is there a way to avoid this behaviour? We also tried betterwalk.walk() and scandir.walk(), with the same result. Do we have to write our own walk function that removes already-scanned directory and file objects from memory so that the garbage collector can reclaim them from time to time?

[Image: resource usage over time; the second row is memory]

Thanks

JohnGalt asked Jun 29 '14 07:06




2 Answers

Have you tried the glob module?

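import glob
import os


def globit(search_dir):
    # Print every entry directly under search_dir, then descend into subdirectories.
    # Note: the "*" pattern does not match names that start with a dot.
    pattern = os.path.join(search_dir, "*")
    for path in glob.glob(pattern):
        print(path)
        if os.path.isdir(path):
            globit(path)


if __name__ == '__main__':
    root = r'C:\working'
    globit(root)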
user3892766 answered Oct 23 '22 12:10


Inside the os.walk() loop, use del to drop every object you no longer need, and try calling gc.collect() at the end of each iteration.
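A minimal sketch of what that could look like. The function name count_matching_files and the suffix parameter are made up for illustration and stand in for whatever the scanner really does with each file. Whether this actually lowers memory use depends on what is holding the references, so treat it as something to try rather than a guaranteed fix.

```python
import gc
import os


def count_matching_files(root, suffix):
    """Walk root and count files whose names end with suffix (hypothetical task)."""
    matches = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                matches += 1
        # Drop our references to this directory's listing before moving on.
        del dirpath, dirnames, filenames
        # Ask the garbage collector to reclaim unreachable objects now.
        gc.collect()
    return matches
```

Calling gc.collect() on every directory can slow the scan down noticeably, so calling it only every few hundred directories may be a reasonable compromise.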

Roland Smith answered Oct 23 '22 14:10