
Memory Error Python When Processing Files

I have a backup hard drive that I know has duplicate files scattered around, and I decided it would be a fun project to write a little Python script to find and remove them. I wrote the following code just to traverse the drive, calculate the MD5 sum of each file, and compare it to what I am going to call my "first encounter" list. If the MD5 sum does not yet exist, add it to the list. If the sum already exists, delete the current file.

import sys
import os
import hashlib

def checkFile(fileHashMap, file):
    fReader = open(file)
    fileData = fReader.read();
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

    if fileHash in fileHashMap:
        ### Duplicate file.
        fileHashMap[fileHash].append(file)
        return True
    else:
        fileHashMap[fileHash] = [file]
        return False


def main(argv):
    fileHashMap = {}
    fileCount = 0
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            fileCount += 1
            print("------------: " + str(fileCount))
            print(curDir + file)
            checkFile(fileHashMap, curDir + file)

if __name__ == "__main__":
    main(sys.argv)

The script processes about 10 GB worth of files and then throws a MemoryError on the line 'fileData = fReader.read()'. I thought that since I close the fReader and mark fileData for deletion after calculating the MD5 sum, I wouldn't run into this. How can I calculate the MD5 sums without hitting this memory error?

Edit: I was asked to remove the dictionary and watch the memory usage, to check whether there might be a leak in hashlib. Here is the code I ran.

import sys
import os
import hashlib

def checkFile(file):
    fReader = open(file)
    fileData = fReader.read();
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

def main(argv):
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            print("------: " + str(curDir + file))
            checkFile(curDir + file)

if __name__ == "__main__":
    main(sys.argv)

and I still get the memory crash.

asked Sep 07 '15 by JD951


1 Answer

Your problem is that you read each file into memory in its entirety; some of the files are too big for your system to load all at once, so the read raises the error.

As you can see in the official Python documentation, MemoryError is:

Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C’s malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.

For your purpose you can still use hashlib.md5(), but instead of hashing the whole file at once, read it in chunks of, say, 4096 bytes and feed each chunk to the hash object with update():

def md5(fname):
    file_hash = hashlib.md5()
    # Open in binary mode: md5 works on bytes, and b"" is the end-of-file sentinel.
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            file_hash.update(chunk)
    return file_hash.hexdigest()
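
For reference, here is one way your checkFile/main pair could be adapted to use that helper. This is only a sketch, not tested against your drive; it reuses the imports from your original script and also swaps the curDir + file concatenation for os.path.join, which is needed for files inside subdirectories:

def checkFile(fileHashMap, path):
    # Hash the file incrementally; only a 4096-byte buffer is ever held in memory.
    fileHash = md5(path)
    if fileHash in fileHashMap:
        ### Duplicate file.
        fileHashMap[fileHash].append(path)
        return True
    else:
        fileHashMap[fileHash] = [path]
        return False

def main(argv):
    fileHashMap = {}
    for curDir, subDirs, files in os.walk(argv[1]):
        for file in files:
            checkFile(fileHashMap, os.path.join(curDir, file))

With this, peak memory usage is bounded by the small read buffer plus the dictionary of hashes, no matter how large any individual file is.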

answered Sep 28 '22 by arodriguezdonaire