Recently a team of researchers generated two files with the same SHA-1 hash (https://shattered.it/). Since Git uses this hash for its internal storage, how far does this kind of attack influence Git?


How does the newly found SHA-1 collision affect Git?

1 Answer

Edit, late December 2017: Git version 2.16 is gradually acquiring internal interfaces to allow for different hashes. There is a long way to go yet.

The short (but unsatisfying) answer is that the example files are not a problem for Git—but two other (carefully calculated) files could be.

I downloaded both of these files, shattered-1.pdf and shattered-2.pdf, and put them into a new empty repository:

    macbook$ shasum shattered-*
    38762cf7f55934b34d179ae6a4c80cadccbb7f0a  shattered-1.pdf
    38762cf7f55934b34d179ae6a4c80cadccbb7f0a  shattered-2.pdf
    macbook$ cmp shattered-*
    shattered-1.pdf shattered-2.pdf differ: char 193, line 8
    macbook$ git init
    Initialized empty Git repository in .../tmp/.git/
    macbook$ git add shattered-1.pdf
    macbook$ git add shattered-2.pdf
    macbook$ git status
    On branch master

    Initial commit

    Changes to be committed:
      (use "git rm --cached <file>..." to unstage)

        new file:   shattered-1.pdf
        new file:   shattered-2.pdf

Even though the two files have the same SHA-1 checksum (and display mostly the same, although one has a red background and the other has a blue background), they get different Git hashes:

    macbook$ git ls-files --stage
    100644 ba9aaa145ccd24ef760cf31c74d8f7ca1a2e47b0 0   shattered-1.pdf
    100644 b621eeccd5c7edac9b7dcba35a8d5afd075e24f2 0   shattered-2.pdf

Those are the two SHA-1 checksums for the files as stored in Git: one begins with ba9aa and the other with b621e. Neither begins with 38762c. But why?

The answer is that Git does not hash the bare file contents. It stores, and checksums, the string literal blob, a space, the size of the file in decimal, an ASCII NUL byte, and then the file data. Both files are exactly the same size:

    macbook$ ls -l shattered-?.pdf
    ...  422435 Feb 24 00:55 shattered-1.pdf
    ...  422435 Feb 24 00:55 shattered-2.pdf

so both are prefixed with the literal text blob 422435\0 (where \0 represents a single NUL byte, as in C or Python octal escapes in strings).
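The header described above is easy to reproduce outside Git. Here is a minimal Python sketch, assuming only the standard hashlib module; the function names git_blob_hash and plain_sha1 are made up for illustration and are not part of Git.

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with these contents."""
    # Header: the literal "blob", a space, the size in decimal, and a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def plain_sha1(data: bytes) -> str:
    """Return the SHA-1 of the bare contents, as shasum would report it."""
    return hashlib.sha1(data).hexdigest()


if __name__ == "__main__":
    sample = b"hello world\n"
    print(plain_sha1(sample))     # checksum of the file as-is
    print(git_blob_hash(sample))  # checksum Git would use for the blob
```

Reading shattered-1.pdf in binary mode and passing its bytes to git_blob_hash should give the ba9aaa... value that git ls-files --stage printed, while plain_sha1 should give the 38762c... value from shasum.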

Perhaps surprisingly (or not, if you know anything of how SHA-1 is calculated), adding the same prefix to two different files that previously produced the same checksum causes them to produce different checksums.

The reason this should become unsurprising is that if the final checksum result were not exquisitely sensitive to the position, as well as the value, of each input bit, it would be easy to produce collisions on demand by taking a known input file and merely re-arranging some of its bits. These two input files produce the same sum despite having a different byte at char 193, line 8, but this result was achieved, according to the researchers, by trying over 9 quintillion (short scale) inputs. To get that result, they put in carefully chosen blocks of raw data, at a position they controlled, that would affect the sums, until they found pairs of inputs that resulted in a collision.
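To get a feel for that sensitivity, here is a small Python sketch that counts how many of the 160 output bits change when the same bytes are rearranged or shifted by a header. The helper names and the sample text are arbitrary choices for illustration. For unrelated-looking digests one would expect roughly half the bits to differ, though the exact count depends on the input.

```python
import hashlib


def sha1_bits(data: bytes) -> int:
    """Return the SHA-1 digest of data as a 160-bit integer."""
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def differing_bits(a: bytes, b: bytes) -> int:
    """Count how many of the 160 digest bits differ between two inputs."""
    return bin(sha1_bits(a) ^ sha1_bits(b)).count("1")


if __name__ == "__main__":
    original = b"The quick brown fox"
    swapped = b"hTe quick brown fox"    # same bytes, first two swapped
    prefixed = b"blob 19\0" + original  # same bytes, shifted by a Git-style header
    print(differing_bits(original, swapped))
    print(differing_bits(original, prefixed))
```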

By adding the blob header, Git moved the position, destroying the 110-GPU-years of computation in a single more or less accidental burp.

Now, knowing that Git will do this, they could repeat their 110-GPU-years of computation with inputs that begin with blob 422435\0 (provided their sacrificial blocks don't get pushed around too much; and the actual number of GPU-years of computation needed would probably vary, as the process is a bit stochastic). They would then come up with two different files that could have the blob header stripped off. These two files would now have different SHA-1 checksums from each other, but when git add-ed, both would produce the same SHA-1 checksum.

In that particular case, the first file added would "win" the slot. (Let's assume it's named shattered-3.pdf.) A good-enough Git would notice that git add shattered-4.pdf, attempting to add the second file, collided with the first-but-different shattered-3.pdf, and would warn you and fail the git add step. (I'm not at all sure that the current Git is this good; see Ruben's experiment-based answer to "How would Git handle a SHA-1 collision on a blob?") In any case you would be unable to add both of these files to a single repository.
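The "good-enough" behavior can be sketched as a toy content-addressed store in Python. This is not how Git is implemented, and it makes no claim about what any Git version actually does on a collision. ToyObjectStore, CollisionError, and toy_object_id are hypothetical names, and the ID is deliberately cut down to the first two hex digits of the blob hash so that a collision can be found by brute force in a few hundred tries instead of GPU-years.

```python
import hashlib
from typing import Dict, Tuple


class CollisionError(Exception):
    """Raised when two different contents map to the same object ID."""


def toy_object_id(data: bytes) -> str:
    """Hypothetical weak ID: only the first 2 hex digits of the Git blob hash."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()[:2]


class ToyObjectStore:
    """In-memory stand-in for an object database; not how Git stores objects."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        object_id = toy_object_id(data)
        existing = self._objects.get(object_id)
        if existing is None:
            self._objects[object_id] = data  # the first file "wins" the slot
        elif existing != data:
            # Same ID but different bytes: refuse rather than keep the old one silently.
            raise CollisionError(f"{object_id} already holds different content")
        return object_id


def find_toy_collision() -> Tuple[bytes, bytes]:
    """Brute-force two different inputs with the same toy ID (cheap at 8 bits)."""
    seen: Dict[str, bytes] = {}
    counter = 0
    while True:
        candidate = f"file-{counter}".encode("ascii")
        object_id = toy_object_id(candidate)
        if object_id in seen:
            return seen[object_id], candidate
        seen[object_id] = candidate
        counter += 1
```

Adding the first result of find_toy_collision() succeeds, and adding the second raises CollisionError, which mirrors the "warn you and fail the git add step" outcome described above.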

But first, someone has to spend a lot more time and money to compute the new hash collision.

154

answered Oct 07 '22 22:10

torek


Tags:

git

sha1

Rudi
