Just for fun, I am trying to put around 85GB of mostly-around-6MB binary files into git. Git chugs along for a while but invariably fails about halfway through with the message "fatal: confused by unstable object source data" followed by a SHA1. Do you know why? Is there any way to fix it?
Short version: Git’s developers did not intend for it to be used on volatile files.
Due to the layout* that Git uses for “loose objects” and the limited filesystem semantics that it assumes**, Git must know the first byte (two hex characters) of the object name (SHA-1) of a new object before it can start storing that object.
* The objects/[0-9a-f][0-9a-f]/ directories. See gitrepository-layout.
** Specifically, it needs to be able to do “atomic” file renames. Certain filesystems (usually network filesystems; specifically AFS, I believe) only guarantee rename atomicity when the source and the destination of a rename are inside the same directory.
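To see the loose-object layout from the first footnote in action (a rough illustration; bigfile.bin is just a placeholder name), you can hash a file into the object store yourself and look at where the loose object ends up:
# Store a file as a loose object and print its SHA-1 object name.
sha=$(git hash-object -w bigfile.bin)
echo "$sha"
# The object lands in a directory named after the first two hex characters
# of that name, e.g. .git/objects/ab/cdef... for an object name starting "abcdef".
find .git/objects -type f -name "${sha#??}"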
Currently, Git does two SHA-1 passes over each new file. The first pass is used to check whether Git already knows about the contents of the file (whether its SHA-1 object name already exists in the object store). If the object already exists, the second pass is not made.
For new contents (an object that was not already in the object store), the file is read a second time while compressing it and computing the SHA-1 of the data being compressed. The compressed data is written to a temporary file that is only renamed to its final loose object name if the initial SHA-1 (the “already stored?” check) matches the later SHA-1 (the hash of the data that was compressed and written). If these SHA-1 hashes do not match, Git shows the error message you are seeing and aborts. This error checking was added in 748af44c63, which was first released in Git 1.7.0.2.
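As a rough sketch in plain shell (not Git’s actual C code, and with bigfile.bin as a placeholder), the check described above amounts to something like this:
# First pass: hash the file to get its would-be object name.
first=$(git hash-object bigfile.bin)
# ... Git then re-reads the file while compressing it, hashing it again ...
second=$(git hash-object bigfile.bin)
# If the file changed between the two reads, the hashes differ and Git aborts
# with the message from the question (reproduced here for illustration only).
if [ "$first" != "$second" ]; then
    echo "fatal: confused by unstable object source data for $first" >&2
    exit 1
fi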
There is another possibility, even if a remote one: a really big file (e.g. 3 GB or more). Put simply, Git is unable to handle it. We found that error while trying to create a repository from a directory structure containing huge files.
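If you suspect this is the case, a quick way to check (assuming GNU find and coreutils, and the roughly 3 GB threshold mentioned above) is to look for very large files before adding them:
# List files of 3 GB or more under the current directory, with their sizes.
find . -type f -size +3G -exec du -h {} + | sort -h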
From the source, the blob's SHA-1 is computed twice, in two code paths that are both called from write_sha1_file (there's also a path from force_object_loose, but it is used for repacks).
The first hash is used to check whether the object is already known (although Git tries its best to get the filesystem's reassurance that files are unmodified, a touch or similar would make it lose track); the second is the hash of the data that is actually fed to zlib for compression and then written.
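The first of those two steps can be seen from the command line (again with bigfile.bin as a placeholder): compute the object name without writing anything and ask Git whether that object already exists.
# First pass equivalent: hash only, nothing is written yet.
sha=$(git hash-object bigfile.bin)
# Does an object with this name already exist in the object store?
if git cat-file -e "$sha" 2>/dev/null; then
    echo "object $sha is already stored; no second pass is needed"
else
    # Second pass equivalent: hash again while compressing and writing.
    git hash-object -w bigfile.bin
fi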
The second hash might be a bit more expensive to compute due to zlib, which may explain why two hashes are computed at all (though that seems to be a historical accident, and I'm guessing the performance cost when adding a new object has more impact than the CPU saved when detecting spurious changes). Someone could add a fallback so that the write_changed_sha1 existence-checking logic is redone with the new SHA-1, so that such unstable files could still be added. That would be useful for backups, when a few of the files being added are open.
Two theories:
Something is writing to these files while you are trying to put them into git.
You have some sort of disk/memory failure causing data corruption.
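If the first theory applies, one possible workaround (a sketch only; the paths are placeholders, and /data/snapshot is assumed to already be a Git repository) is to take a quiescent copy of the data first and add that copy, so nothing is writing to the files Git actually reads:
# Copy the live data to a snapshot directory that nothing else writes to,
# then add and commit the snapshot instead of the live files.
rsync -a --delete /data/live/ /data/snapshot/
cd /data/snapshot
git add -A
git commit -m "backup snapshot"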
Although other responses have provided a very good explanation of why the error occurs, here is a possible solution to the problem:
Track down the problematic file; adding -v to your git add command will give you some clue about which file is causing trouble:
git add -Av
The problem might just be that the file is too large (a zipped source archive, some SQL data file): add it to .gitignore.
In fact, a good practice is to keep your .gitignore file configured to exclude compiled and compressed files, as in: https://gist.github.com/octocat/9257657
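For example (the patterns below are illustrative only; adjust them to your project), you can append a few such patterns from the shell:
# Ignore some common compressed and compiled artifacts.
cat >> .gitignore <<'EOF'
*.zip
*.gz
*.tar
*.o
*.class
EOF
git add .gitignore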