
Why is the Python calculated "hashlib.sha1" different from "git hash-object" for a file?

Tags:

git

python

hash

I'm trying to calculate the SHA-1 value of a file.

I've written this script:

    import hashlib

    def hashfile(filepath):
        sha1 = hashlib.sha1()
        f = open(filepath, 'rb')
        try:
            sha1.update(f.read())
        finally:
            f.close()
        return sha1.hexdigest()

For a specific file I get this hash value:
8c3e109ff260f7b11087974ef7bcdbdc69a0a3b9
But when I calculate the value with git hash-object, I get this value:
d339346ca154f6ed9e92205c3c5c38112e761eb7

How come they differ? Am I doing something wrong, or can I just ignore the difference?

Ikke asked Dec 08 '09 21:12



2 Answers

Git does not hash the raw file contents alone. It prepends a header to the data and hashes the result, like this:

sha1("blob " + filesize + "\0" + data) 

Reference
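As a minimal sketch of that formula in Python: the function names `git_blob_sha1` and `git_blob_sha1_of_file` and the `block_size` default are illustrative assumptions, not part of Git or hashlib. The header is the literal bytes `blob `, the size in decimal, and a NUL byte. The sketch also assumes Git applies no content filters (such as line-ending conversion) to the file; if it does, the result may still differ from git hash-object.

```python
import hashlib
import os


def git_blob_sha1(data: bytes) -> str:
    """Hash in-memory bytes the way the formula above describes."""
    # Header: "blob", a space, the size in decimal, then a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def git_blob_sha1_of_file(filepath, block_size=65536):
    """Same hash for a file on disk, read in blocks (block_size is arbitrary)."""
    size = os.path.getsize(filepath)
    sha = hashlib.sha1(b"blob " + str(size).encode("ascii") + b"\0")
    with open(filepath, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            sha.update(block)
    return sha.hexdigest()
```

For an empty input, `git_blob_sha1(b"")` gives `e69de29bb2d1d6434b8b29ae775ad8c2e48c5391`, which is the ID Git reports for an empty blob, whereas a plain `hashlib.sha1(b"")` gives `da39a3ee5e6b4b0d3255bfef95601890afd80709`.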

Brian R. Bondy answered Oct 09 '22 06:10


For reference, here's a more concise version:

    def sha1OfFile(filepath):
        import hashlib
        with open(filepath, 'rb') as f:
            return hashlib.sha1(f.read()).hexdigest()

On second thought: although I've never seen it happen, I think f.read() could return less than the full file or, for a many-gigabyte file, run out of memory. For everyone's edification, let's consider how to fix that. A first fix is:

    def sha1OfFile(filepath):
        import hashlib
        sha = hashlib.sha1()
        with open(filepath, 'rb') as f:
            for line in f:
                sha.update(line)
            return sha.hexdigest()

However, there's no guarantee that '\n' appears in the file at all, so the fact that the for loop will give us blocks of the file that end in '\n' could give us the same problem we had originally. Sadly, I don't see any similarly Pythonic way to iterate over blocks of the file as large as possible, which, I think, means we are stuck with a while True: ... break loop and with a magic number for the block size:

    def sha1OfFile(filepath):
        import hashlib
        sha = hashlib.sha1()
        with open(filepath, 'rb') as f:
            while True:
                block = f.read(2**20)  # Magic number: one-megabyte blocks (2**20 bytes).
                if not block:
                    break
                sha.update(block)
            return sha.hexdigest()

Of course, who's to say we can store one-megabyte strings? We probably can, but what if we are on a tiny embedded computer?

I wish I could think of a cleaner way that is guaranteed to not run out of memory on enormous files and that doesn't have magic numbers and that performs as well as the original simple Pythonic solution.
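One possible way to avoid the explicit while True loop is the two-argument form of the built-in iter(), which calls a function repeatedly until it returns a sentinel value. The sketch below is only an illustration: the name `sha1_of_file` and the 64 KiB default for `block_size` are assumptions, and the block size is still a tunable number rather than something chosen automatically.

```python
import hashlib
from functools import partial


def sha1_of_file(filepath, block_size=65536):
    """SHA-1 of a file's raw contents, read in fixed-size blocks."""
    sha = hashlib.sha1()
    with open(filepath, 'rb') as f:
        # iter(callable, sentinel) keeps calling f.read(block_size)
        # until it returns b'' (end of file).
        for block in iter(partial(f.read, block_size), b''):
            sha.update(block)
    return sha.hexdigest()
```

Memory use is bounded by `block_size` regardless of file size, and a caller on a constrained machine can pass a smaller value. Like the versions above, this hashes the raw contents only, so it still will not match git hash-object.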

Ben answered Oct 09 '22 05:10