
Fast folder size calculation in Python on Windows

I am looking for a fast way to calculate the size of a folder in Python on Windows. This is what I have so far:

import platform

import win32con
import win32file

# DIR_EXCLUDES (not shown) must contain at least '.' and '..'.

def get_dir_size(path):
  total_size = 0
  if platform.system() == 'Windows':
    try:
      items = win32file.FindFilesW(path + '\\*')
    except Exception:
      return 0

    # Add the size or perform recursion on folders.
    for item in items:
      attr = item[0]
      name = item[-2]
      # item[4] and item[5] hold the high and low 32 bits of the file size.
      size = (item[4] << 32) + item[5]

      if (attr & win32con.FILE_ATTRIBUTE_DIRECTORY) and \
         not (attr & win32con.FILE_ATTRIBUTE_SYSTEM):  # skip system dirs
        if name not in DIR_EXCLUDES:
          total_size += get_dir_size("%s\\%s" % (path, name))

      total_size += size

  return total_size

This is not fast enough when the folder is larger than 100 GB. Any ideas on how to improve it?

On a fast machine (2 GHz+, 5 GB of RAM), it took 72 seconds to go over 422 GB in 226,001 files and 12,043 folders. The same folder takes 40 seconds using the Explorer Properties dialog.

I know I am being a bit greedy, but I am hoping for a better solution.

Laurent Luce asked Dec 31 '09 21:12



3 Answers

A quick profiling of your code suggests that over 90% of the time is consumed in the FindFilesW() call alone. This means any improvements by tweaking the Python code would be minor.
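
A measurement like this can be reproduced with the standard library profiler. The following is only a sketch, assuming Python 3; `profile_call` is a hypothetical helper name, and the function you pass in would be your own `get_dir_size`:

```python
import cProfile
import io
import pstats


def profile_call(func, *args):
    """Run func(*args) under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    # Render the ten entries with the highest cumulative time.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()
```

Calling `profile_call(get_dir_size, some_path)` and printing the report should show how much of the cumulative time sits in the directory-listing call compared with the Python loop around it.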

Tiny tweaks (if you were to stick with FindFilesW) could include ensuring DIR_EXCLUDES is a set instead of a list, avoiding the repeated attribute lookups on other modules, indexing into item[] lazily, and moving the platform check outside the function. The version below incorporates these changes and others, but it won't give more than a 1-2% speedup.

import pywintypes
import win32con
import win32file

DIR_EXCLUDES = set(['.', '..'])
MASK = win32con.FILE_ATTRIBUTE_DIRECTORY | win32con.FILE_ATTRIBUTE_SYSTEM
REQUIRED = win32con.FILE_ATTRIBUTE_DIRECTORY
FindFilesW = win32file.FindFilesW  # bind once to avoid repeated module lookups

def get_dir_size(path):
    total_size = 0
    try:
        items = FindFilesW(path + r'\*')
    except pywintypes.error:
        return total_size

    for item in items:
        # item[4] and item[5] hold the high and low 32 bits of the file size.
        total_size += (item[4] << 32) + item[5]
        # Recurse only into directories that are not system directories.
        if (item[0] & MASK) == REQUIRED:
            name = item[8]
            if name not in DIR_EXCLUDES:
                total_size += get_dir_size(path + '\\' + name)

    return total_size

The only significant speedup would come from using a different API, or a different technique. You mentioned in a comment doing this in the background, so you could structure it to do an incremental update using one of the packages for monitoring changes in folders. Possibly the FindFirstChangeNotification API or something like it. You could set up to monitor the entire tree, or depending on how that routine works (I haven't used it) you might be better off registering multiple requests on various subsets of the full tree, if that reduces the amount of searching you have to do (when notified) to figure out what actually changed and what size it is now.
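
To make the incremental idea concrete, here is a rough sketch of the bookkeeping side only, assuming Python 3. `DirSizeCache` and its methods are hypothetical names, and the watcher that would call `refresh()` (FindFirstChangeNotification or a similar mechanism) is deliberately left out. It caches the size of the files directly inside each directory, so that a notification for one directory costs one directory listing instead of a full walk:

```python
import os


class DirSizeCache:
    """Caches, per directory, the total size of the files directly in it."""

    def __init__(self, root):
        self.root = root
        self.own_sizes = {}
        # One full walk up front; later updates touch single directories.
        for path, _dirs, _files in os.walk(root):
            self.own_sizes[path] = self._scan(path)

    @staticmethod
    def _scan(path):
        """Sum the sizes of regular files directly inside path."""
        total = 0
        try:
            with os.scandir(path) as entries:
                for entry in entries:
                    if entry.is_file(follow_symlinks=False):
                        total += entry.stat(follow_symlinks=False).st_size
        except OSError:
            pass  # unreadable or vanished directory counts as empty
        return total

    def refresh(self, path):
        """Call after a change notification for the directory at path."""
        if os.path.isdir(path):
            self.own_sizes[path] = self._scan(path)
        else:
            # Directory was removed: forget it and everything below it.
            prefix = path + os.sep
            stale = [k for k in self.own_sizes
                     if k == path or k.startswith(prefix)]
            for key in stale:
                del self.own_sizes[key]

    def total(self):
        """Current total size of the whole tree, from cached values."""
        return sum(self.own_sizes.values())
```

One limitation of this sketch: a newly created subtree is only picked up if `refresh()` is called for each new directory in it, so a real watcher would have to rescan below any directory it reports as new.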

Edit: I asked in a comment whether you were taking into account the heavy filesystem metadata caching that Windows XP and later do. I just checked performance of your code (and mine) against Windows itself, selecting all items in my C:\ folder and hitting Alt-Enter to bring up the properties window. After doing this once (using your code) and getting a 40s elapsed time, I now get 20s elapsed from both methods. In other words, your code is actually just as fast as Windows itself, at least on my machine.
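
If you want to see the caching effect yourself, timing the same call twice in a row is enough. A minimal sketch, assuming Python 3; `time_runs` is a hypothetical helper, not a library function:

```python
import time


def time_runs(func, path, runs=2):
    """Call func(path) several times and return (last result, timings in seconds)."""
    timings = []
    result = None
    for _ in range(runs):
        start = time.perf_counter()
        result = func(path)
        timings.append(time.perf_counter() - start)
    return result, timings
```

With a cold cache, the first entry in `timings` would be expected to be noticeably larger than the second, which is why a single measurement against Explorer can be misleading.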

Peter Hansen answered Sep 22 '22 01:09



You don't need to use a recursive algorithm if you use os.walk. Please check this question.

You should time both approaches, but this is supposed to be much faster:

import os

def get_dir_size(root):
    size = 0
    for path, dirs, files in os.walk(root):
        for f in files:
            size += os.path.getsize(os.path.join(path, f))
    return size
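
On Python 3.5 and later, os.walk is itself built on os.scandir, and on Windows the size of each entry normally comes back with the directory listing, so a separate getsize call per file can be avoided. The following is a sketch of that variant rather than a measured improvement; it assumes Python 3.6+, and skipping unreadable entries is a choice made here for illustration:

```python
import os


def get_dir_size(root):
    """Total size in bytes of regular files under root (symlinks not followed)."""
    total = 0
    stack = [root]  # explicit stack instead of recursion
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)
                        elif entry.is_file(follow_symlinks=False):
                            total += entry.stat(follow_symlinks=False).st_size
                    except OSError:
                        continue  # entry vanished or is not accessible
        except OSError:
            continue  # directory vanished or is not accessible
    return total
```

As with the os.walk version, you should time it against the FindFilesW code on your own tree before relying on it.
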
jbochi answered Sep 22 '22 01:09



I don't have a Windows box to test on at the moment, but the documentation states that win32file.FindFilesIterator is "similar to win32file.FindFiles, but avoid the creation of the list for huge directories". Does that help?
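
For what it's worth, an untested sketch of how that might look, assuming pywin32 and the WIN32_FIND_DATA tuple layout used in the question (attributes at index 0, size high/low at 4 and 5, name at 8). The `find_files` parameter is a hypothetical hook added here so the logic can be exercised without Windows:

```python
try:
    import pywintypes
    import win32file
    FIND_ERRORS = (pywintypes.error,)
except ImportError:  # pywin32 is only available on Windows
    win32file = None
    FIND_ERRORS = ()

FILE_ATTRIBUTE_SYSTEM = 0x04
FILE_ATTRIBUTE_DIRECTORY = 0x10
MASK = FILE_ATTRIBUTE_DIRECTORY | FILE_ATTRIBUTE_SYSTEM


def get_dir_size(path, find_files=None):
    """Sum file sizes under path, consuming directory entries lazily."""
    if find_files is None:
        if win32file is None:
            raise RuntimeError("pywin32 is required when find_files is not given")
        find_files = win32file.FindFilesIterator

    total = 0
    try:
        for item in find_files(path + "\\*"):
            # Combine the high and low 32 bits of the size.
            total += (item[4] << 32) + item[5]
            # Recurse into directories that are not system directories.
            if (item[0] & MASK) == FILE_ATTRIBUTE_DIRECTORY:
                name = item[8]
                if name not in (".", ".."):
                    total += get_dir_size(path + "\\" + name, find_files)
    except FIND_ERRORS:
        pass  # inaccessible directory: keep what was counted so far
    return total
```

Since the profiling in the other answer points at the listing call itself, the iterator may mainly save memory on huge directories rather than time.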

ephemient answered Sep 20 '22 01:09
