We have a big Git repository that I want to push to a self-hosted GitLab instance.
The problem is that the GitLab remote does not let me push my repo:
git push --mirror https://mygitlab/xy/myrepo.git
This gives me the following error:
Enumerating objects: 1383567, done.
Counting objects: 100% (1383567/1383567), done.
Delta compression using up to 8 threads
Compressing objects: 100% (207614/207614), done.
remote: error: object c05ac7f76dcd3e8fb3b7faf7aab9b7a855647867: duplicateEntries: contains duplicate file entries
remote: fatal: fsck error in packed object
So I did a git fsck:
error in tree c05ac7f76dcd3e8fb3b7faf7aab9b7a855647867: duplicateEntries: contains duplicate file entries
error in tree 0d7286cedf43c65e1ce9f69b74baaf0ca2b73e2b: duplicateEntries: contains duplicate file entries
error in tree 7f14e6474400417d11dfd5eba89b8370c67aad3a: duplicateEntries: contains duplicate file entries
The next thing I did was to check git ls-tree c05ac7f76dcd3e8fb3b7faf7aab9b7a855647867:
100644 blob c233c88b192acfc20548d9d9f0c81c48c6a05a66 fileA.cs
100644 blob 5d6096cb75d27780cdf6da8a3b4d357515f004e0 fileB.cs
100644 blob 5d6096cb75d27780cdf6da8a3b4d357515f004e0 fileB.cs
100644 blob d2a4248bcda39c0dc3827b495f7751b7cc06c816 fileC.xaml
Notice that fileB.cs is listed twice, with the same hash. I assume this is the problem, because why would a file appear twice in the same tree with the same file name and blob hash?
I googled the problem but could not find a way to fix it. One seemingly good resource I found was this: Tree contains duplicate file entries
However, it basically comes down to using git replace, which does not really fix the problem: git fsck still prints the error and the remote still rejects the push.
Then there is this answer, which seems to remove the file entirely (but I still need the file, just once instead of twice in the tree): https://stackoverflow.com/a/44672692/826244
Is there any other way to fix this? It really should be possible to fix it so that git fsck does not throw any errors, right? I am aware that I will need to rewrite the entire history after the corrupted commits. I could not even find a way to get the commits that point to the specific trees; otherwise I might be able to use rebase and patch the corrupted commits or something. Any help would be greatly appreciated!
UPDATE: Pretty sure I know what to do, but not yet how to do it:
git mktree <- done
git filter-branch -- --all <- should persist the replacements into the commits
Sadly I cannot just use git replace --edit on the bad tree and then run git filter-branch -- --all, because filter-branch seems to only work on commits but ignores tree replaces...
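For reference, the mktree part boils down to something like this; it only works this simply because the duplicate entries are byte-for-byte identical, so uniq can drop the extra line (with differing blob hashes you would have to pick the right line by hand):
git ls-tree c05ac7f76dcd3e8fb3b7faf7aab9b7a855647867 | uniq | git mktree
# prints the hash of a new, de-duplicated tree object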
The final solution was to write a tool that tackles this problem.
The first step was to git unpack-objects all packfiles.
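Roughly, that step looks like this; the packs have to be moved out of .git/objects/pack first, because git refuses to unpack objects it already has (paths are just examples):
mkdir /tmp/packs
mv .git/objects/pack/pack-*.pack .git/objects/pack/pack-*.idx /tmp/packs/
for p in /tmp/packs/*.pack; do git unpack-objects < "$p"; done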
Then I had to identify the commits that pointed to the trees with duplicate entries, by reading all refs and then walking back through history, checking all the trees.
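In plain git terms, that walk is roughly equivalent to the loop below (slow on 1.3 million objects, but it finds every commit whose root tree or any subtree is the broken tree; shown for the first bad tree only):
BAD=c05ac7f76dcd3e8fb3b7faf7aab9b7a855647867
git rev-list --all | while read commit; do
  if [ "$(git rev-parse "$commit^{tree}")" = "$BAD" ] ||
     git ls-tree -r -t "$commit" | grep -q "$BAD"; then
    echo "$commit"
  fi
done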
Once I had the tools for that, it was not hard to rewrite the trees of those commits and then rewrite all commits after them. After that I had to update the changed refs. This was the moment when I thoroughly tested the result, as nothing had been lost yet.
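For the simplest possible case, a corrupted commit at the tip of a single branch, that rewrite is roughly the following (the branch name is just an example, the fixed tree comes from the mktree step above, and the original author/committer/date would have to be carried over via the GIT_AUTHOR_*/GIT_COMMITTER_* environment variables, which is part of what the tool automates):
old=$(git rev-parse refs/heads/master)
fixed_tree=...   # tree hash printed by git mktree; if the bad tree is a subtree, its parent trees must be rebuilt the same way first
new=$(git log -1 --format=%B "$old" | git commit-tree "$fixed_tree" -p "$old^")
git update-ref refs/heads/master "$new" "$old"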
Finally, a git reflog expire --expire=now --all && git gc --prune=now --aggressive rewrote the pack and removed all loose objects that are no longer reachable.
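After that, git fsck --full should come back clean and the original push should go through:
git fsck --full
git push --mirror https://mygitlab/xy/myrepo.git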
When I have the time I will upload the source code to GitHub, as it performs really well and could be a template for similar problems. It ran for only a few minutes on a 3.7 GB repository (about 20 GB unpacked). By now I have also implemented reading directly from the packfiles, so there is no need to unpack anything anymore (which takes a lot of time and space).
Update: I worked a little more on the source and it now performs really well, even better than BFG for deleting a single file (no option switches yet). The source code is available here: https://github.com/TimHeinrich/GitRewrite. Be aware that this was only tested against a single repository, and only under Windows on a Core i7. It is highly unlikely that it will work on Linux or with any other processor architecture.
You can try running git fast-export to export your repository into a data file, and then run git fast-import to re-import the data file into a new repository. Git will remove any duplicate entries during the fast-import process, which will solve your problem.
Be aware that you may have to make a decision about how to handle signed tags and such when you export, by passing appropriate arguments to git fast-export; since you're rewriting history, you probably want to pass --signed-tags=strip.
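A minimal sketch of that round trip, with placeholder paths:
# export everything from the old repository, stripping tag signatures
git -C /path/to/broken-repo fast-export --all --signed-tags=strip > /tmp/repo.fi
# import the stream into a fresh repository
git init /path/to/clean-repo
git -C /path/to/clean-repo fast-import < /tmp/repo.fi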