
Memory Error Python When Processing Files

I have a backup hard drive that I know has duplicate files scattered around, and I decided it would be a fun project to write a little Python script to find and remove them. I wrote the following code just to traverse the drive, calculate the MD5 sum of each file, and compare it to what I am going to call my "first encounter" list. If the MD5 sum does not yet exist, add it to the list. If the sum already exists, delete the current file.

import sys
import os
import hashlib

def checkFile(fileHashMap, file):
    fReader = open(file)
    fileData = fReader.read();
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

    if fileHash in fileHashMap:
        ### Duplicate file.
        fileHashMap[fileHash].append(file)
        return True
    else:
        fileHashMap[fileHash] = [file]
        return False


def main(argv):
    fileHashMap = {}
    fileCount = 0
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            fileCount += 1
            print("------------: " + str(fileCount))
            print(curDir + file)
            checkFile(fileHashMap, curDir + file)

if __name__ == "__main__":
    main(sys.argv)

The script processes about 10 GB worth of files and then throws a MemoryError on the line 'fileData = fReader.read()'. I thought that since I close the fReader and mark fileData for deletion after calculating the MD5 sum, I wouldn't run into this. How can I calculate the MD5 sums without hitting this memory error?

Edit: I was asked to remove the dictionary and watch the memory usage, to check whether there might be a leak in hashlib. Here is the code I ran.

import sys
import os
import hashlib

def checkFile(file):
    fReader = open(file)
    fileData = fReader.read();
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

def main(argv):
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            print("------: " + str(curDir + file))
            checkFile(curDir + file)

if __name__ == "__main__":
    main(sys.argv)

and I still get the memory crash.

asked Sep 07 '15 by JD951


1 Answer

Your problem is that you read each file into memory in its entirety; some of the files are too big for your system to load all at once, so the read raises the error.

As you can see in the official Python documentation, MemoryError is:

Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C’s malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.

For your purpose you can still use hashlib.md5(), but instead of hashing the whole file at once, read it in chunks of, say, 4096 bytes and feed each chunk to the hash object with update():

def md5(fname):
    file_hash = hashlib.md5()
    # Open in binary mode: md5 works on bytes, and b"" is the end-of-file sentinel.
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            file_hash.update(chunk)
    return file_hash.hexdigest()
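
For reference, here is one way your checkFile/main pair could be adapted to use that helper. This is only a sketch, not tested against your drive; it reuses the imports from your original script and also swaps the curDir + file concatenation for os.path.join, which is needed for files inside subdirectories:

def checkFile(fileHashMap, path):
    # Hash the file incrementally; only a 4096-byte buffer is ever held in memory.
    fileHash = md5(path)
    if fileHash in fileHashMap:
        ### Duplicate file.
        fileHashMap[fileHash].append(path)
        return True
    else:
        fileHashMap[fileHash] = [path]
        return False

def main(argv):
    fileHashMap = {}
    for curDir, subDirs, files in os.walk(argv[1]):
        for file in files:
            checkFile(fileHashMap, os.path.join(curDir, file))

With this, peak memory usage is bounded by the small read buffer plus the dictionary of hashes, no matter how large any individual file is.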

answered Sep 28 '22 by arodriguezdonaire