I'm trying to calculate the SHA-1 hash of a file.
I've written this script:
import hashlib

def hashfile(filepath):
    sha1 = hashlib.sha1()
    f = open(filepath, 'rb')
    try:
        sha1.update(f.read())
    finally:
        f.close()
    return sha1.hexdigest()
For a specific file I get this hash value: 8c3e109ff260f7b11087974ef7bcdbdc69a0a3b9
But when I calculate the value with git hash-object, I get this value: d339346ca154f6ed9e92205c3c5c38112e761eb7
How come they differ? Am I doing something wrong, or can I just ignore the difference?
Git uses hashes in two important ways. When you commit a file into your repository, Git calculates and remembers the hash of the contents of the file. When you later retrieve the file, Git can verify that the hash of the data being retrieved exactly matches the hash that was computed when it was stored.
In its simplest form, git hash-object would take the content you handed to it and merely return the unique key that would be used to store it in your Git database. The -w option then tells the command to not simply return the key, but to write that object to the database.
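For illustration, here's a minimal sketch of invoking that command from Python (the helper name git_hash_object is mine, not part of the answer); it assumes git is on your PATH and, when writing, that you're inside a repository:

import subprocess

def git_hash_object(filepath, write=False):
    # "git hash-object FILE" prints the object key; adding "-w" also writes
    # the blob into the repository's object database.
    cmd = ["git", "hash-object"] + (["-w"] if write else []) + [filepath]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()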
GIT strongly relies on SHA-1 for the identification and integrity checking of all file objects and commits. It is essentially possible to create two GIT repositories with the same head commit hash and different contents, say a benign source code and a backdoored one.
Git calculates blob hashes like this:
sha1("blob " + filesize + "\0" + data)
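To make the difference concrete, here's a small sketch of that formula in Python (the function name git_blob_hash is just illustrative), where filesize is the length of the data in bytes, written as a decimal string:

import hashlib

def git_blob_hash(filepath):
    # Git hashes a blob as sha1(b"blob " + <size in bytes, decimal> + b"\0" + data).
    with open(filepath, 'rb') as f:
        data = f.read()
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()

For the file from the question, this should reproduce the d339346... value that git hash-object reports, while hashing the raw contents alone gives the 8c3e10... value.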
For reference, here's a more concise version:
def sha1OfFile(filepath):
    import hashlib
    with open(filepath, 'rb') as f:
        return hashlib.sha1(f.read()).hexdigest()
On second thought: although I've never seen it, I think there's potential for f.read() to return less than the full file, or, for a many-gigabyte file, for f.read() to run out of memory. For everyone's edification, let's consider how to fix that. A first fix is to read the file line by line:
def sha1OfFile(filepath):
    import hashlib
    sha = hashlib.sha1()
    with open(filepath, 'rb') as f:
        for line in f:
            sha.update(line)
    return sha.hexdigest()
However, there's no guarantee that '\n' appears in the file at all, so the fact that the for loop only gives us blocks of the file that end in '\n' could leave us with the same problem we had originally. Sadly, I don't see any similarly Pythonic way to iterate over blocks of the file as large as possible, which, I think, means we are stuck with a while True: ... break loop and with a magic number for the block size:
def sha1OfFile(filepath):
    import hashlib
    sha = hashlib.sha1()
    with open(filepath, 'rb') as f:
        while True:
            block = f.read(2**20)  # Magic number: one-megabyte blocks.
            if not block:
                break
            sha.update(block)
    return sha.hexdigest()
Of course, who's to say we can store one-megabyte strings? We probably can, but what if we are on a tiny embedded computer?
I wish I could think of a cleaner way that is guaranteed to not run out of memory on enormous files and that doesn't have magic numbers and that performs as well as the original simple Pythonic solution.
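One possibility, as a sketch: iter() with a sentinel reads fixed-size blocks without the explicit while True / break, though the block size is still a magic number (here an assumed default of 64 KiB):

import hashlib

def sha1OfFile(filepath, blocksize=2**16):
    sha = hashlib.sha1()
    with open(filepath, 'rb') as f:
        # iter() with a sentinel keeps calling f.read(blocksize) until it returns b'' at EOF.
        for block in iter(lambda: f.read(blocksize), b''):
            sha.update(block)
    return sha.hexdigest()

On Python 3.11 or newer, hashlib.file_digest(f, 'sha1') does the chunked reading for you, so the loop disappears entirely.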