Repair git broken link from commit to commit

I have a situation where a git fsck call returns several broken links. This is because, for this repository, a rm command was run and several write-protected files were deleted (mistakes were made). There is also no recent backup of this repository (again, mistakes were made). Because Git was being used the repository was not a total loss, but some of the history has been scrambled. This escaped notice until recently, when a re-sync with the upstream source failed because of the broken history.

I would like to repair this history (if possible) so that it can be merged with the upstream source. I recognize that I will not be able to get the full history back because some files are just gone, but I would like to keep as much of it as I can with things working correctly.

I've reviewed Linus' email, "How to recover a corrupted blob object," (MIT hosted copy) and have also looked at:

How to recover Git objects damaged by hard disk failure?

Repair corrupted Git repository

along with many others, but I'm not seeing much advice for the "broken link from commit to commit" errors. Note that I made a copy of this repository, so I'm not wiping anything.

The results of git fsck are

    $ git fsck
    broken link from commit <SHA1>
                  to commit <SHA2>
    broken link from   tree <SHA3>
                  to   blob <SHA4>
    ...
    dangling blob <SHA5>
    missing commit <SHA2>
    missing blob <SHA4>
    ...

When I go through the git history via git log eventually I get the error

error: Could not read <SHA2>
fatal: Failed to traverse parents of commit <SHA1>

which is near(ish) to where the last backup exists but not quite there so I have no overlapping coverage. I wanted to try traversing the history in reverse, thinking I could move through my log from the oldest commit to the newest but

$ git log --reverse
error: Could not read <SHA2>
fatal: Failed to traverse parents of commit <SHA1>

so I can't try and bound the commit on both sides (unless someone knows how to do that). I tried using git repair which seemed to be able to get past some of the issues but not all of them. It also seems to be corrupting things since now with git log

$ git log
...
error: Could not read <SHA6>
fatal: Failed to traverse parents of commit <SHA7>

which occurs much sooner in the history than the original failure. Interestingly, this commit DOES exist in my original un-repaired repository. Copying over its object file gets me past the failure, only for another one to crop up on an object which also exists in the original.

It suggested I run git repair --force but that ended up re-initializing the repository entirely which is not really what I wanted either.

What can I do to restore this repository to working order?

asked Dec 06 '20 00:12 by NateM


2 Answers

@LeGEC provided the final pieces for me to get this together but I think it's worthwhile to present the full approach I used. Note: I expect that a lot of the things I was able to do are specific to my case BUT there are some things that can be generalized.

When looking at the results of git fsck I found that there were several dangling commits. When I checked out those hashes, I found segments of good commits. So a repository which had an original structure of

(a)->(b)->(c)->(d)->(e)->(f)->(g)->(h)->(i)->(j)

after the, let's call it, "ill-advised" rm command might be left in a state like

(b)->(c) (e)->(f) (h)->(i)->(j)

As stated in the question, the backup was very old and had the form

(a)->(b)

but that's it. What one can do is use git replace to try and solve this problem. BE WARNED: git replace seems to be an excellent tool for truly destroying your repository. I did this on a copy of my original repository and I am VERY glad it wasn't the real deal!

We will build our new repository on a new (good) foundation. We first initialize a fresh repository from the backup we do have.

$ mkdir -p my/new/fixed/repository
$ cd my/new/fixed/repository
$ git init

Now, from our backup (which doesn't cover the full space of the corrupted repository) we will unpack the existing structure such as it is.

$ git remote add origin /path/to/backup/repository
$ git fetch origin
$ git checkout --track origin/my-broken-branch # This may not be necessary

To avoid messing anything up with our corrupted repository, we make a copy

$ cp -R /path/to/broken/repository /path/to/repository-copy # the target must not exist yet, or the copy is nested inside it
$ cd /path/to/repository-copy

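First things first, let's try to use our backup repository to fix what we can:

$ git remote add backup /path/to/backup/repository
$ # unpack-objects reads a single pack on stdin, so loop in case the backup has several
$ for pack in /path/to/backup/repository/.git/objects/pack/pack-*.pack; do git unpack-objects < "$pack"; done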

Okay, let's see what the damage is:

$ git fsck
broken link from  commit <SHA1>
              to  commit <SHA2>
broken link from    tree <SHA3>
              to    blob <SHA4>
...
dangling commit <SHA5>
...
missing commit <SHA2>
...
missing blob <SHA4>
...
dangling commit <SHA6>
...
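
With a long report it can help to tally what is missing and what is dangling before deciding what to stitch together. This is only a minimal sketch, assuming the report has the line shape shown above; `tally_fsck` is a hypothetical helper name, not a Git command.

```shell
# Hypothetical helper: read `git fsck` output on stdin and count the
# "missing" and "dangling" lines per object type.
tally_fsck() {
  awk '$1 == "missing" || $1 == "dangling" { n[$1 " " $2]++ }
       END { for (k in n) print n[k], k }' | sort -k2
}

# Usage, from inside the repository copy:
#   git fsck 2>&1 | tally_fsck
```

The "broken link" lines are skipped on purpose, since each of them is repeated further down as a "missing" line.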

Of interest are the dangling commits because those are likely to be the little sub-branches that we want to try and stitch back together. Note, these commits are NOT always in chronological order. For me the order happened to be (from oldest to newest) <SHA5>-<SHA6> but you will likely have your own knot to untangle. You can check the commit date/time by running

$ git show -s <SHAX>
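
If there are more than a handful of dangling commits, checking them one at a time gets tedious. The sketch below lists them oldest first, assuming `git fsck` prints `dangling commit <sha>` lines as above; `dangling_commits` and `dangling_by_date` are hypothetical helper names.

```shell
# Hypothetical helper: read `git fsck` output on stdin and print the SHA
# of every dangling commit.
dangling_commits() {
  awk '$1 == "dangling" && $2 == "commit" { print $3 }'
}

# Print "<unix time> <commit date> <sha> <subject>" for each dangling
# commit, sorted by commit time (oldest first).
dangling_by_date() {
  git fsck | dangling_commits |
    while read -r sha; do
      git show -s --format='%ct %ci %H %s' "$sha"
    done | sort -n
}

# Usage, from inside the repository copy:
#   dangling_by_date
```

Sorting on the numeric `%ct` timestamp avoids surprises when commits were made in different time zones.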

One thing to note at this point: if you are in the broken repository copy and run git log, you will be able to traverse the history until you run into <SHA1>, at which point you will get the error:

error: Could not read <SHA2>
fatal: Failed to traverse parents of commit <SHA1>

So we need to replace the parent of <SHA1> with a commit that is actually good. The pattern for this is called a graft, but doing a pure graft is no longer considered best practice (How do git grafts and replace differ? (Are grafts now deprecated?)) because of the new(er) best practice, git replace.

So I now make <SHA6> the parent of <SHA1>:

$ git replace --graft <SHA1> <SHA6>
$ git fsck
broken link from  commit <SHA1>
              to  commit <SHA2>
broken link from    tree <SHA3>
              to    blob <SHA4>
...
broken link from  commit <SHA7>
              to  commit <SHA8>

So a new broken link has appeared. If I investigate that commit using git log I find that the previous commit ended prior to the remaining dangling commit's commit time. So I'm going to graft those two together. Note, this may not be a safe thing to do if you have lots of people working on this repository but, in this case, I believe it to be okay.

$ git replace --graft <SHA7> <SHA5>
$ git fsck
broken link from  commit <SHA1>
              to  commit <SHA2>
broken link from    tree <SHA3>
              to    blob <SHA4>
...
broken link from  commit <SHA7>
              to  commit <SHA8>

No new dangling commits appeared and, in my case, I was able to connect to my backup repository. I imagine this will not always be true. If it isn't, you can eventually get to the point where you graft the head of the remote repository in as the parent of the remaining bad commit link.
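
Note that git fsck keeps listing the original broken links after a graft: according to the git replace documentation, fsck is one of the commands that does not apply replacement refs. To confirm a graft took effect, one option is to compare the parents recorded in the original commit object with the parents Git uses once replacements are applied. A minimal sketch; the two function names are hypothetical.

```shell
# Parents recorded in the original commit object (replace refs ignored).
original_parents() {
  git --no-replace-objects cat-file -p "$1" | sed -n '/^$/q; s/^parent //p'
}

# Parents as most Git commands see them, with replace refs applied.
effective_parents() {
  git cat-file -p "$1" | sed -n '/^$/q; s/^parent //p'
}

# Usage, after `git replace --graft <SHA1> <SHA6>`:
#   original_parents  <SHA1>   # still prints <SHA2>
#   effective_parents <SHA1>   # should print <SHA6>
#   git replace -l             # lists the commits that have a replacement
```

The `sed` expression stops at the first blank line so that a commit message line starting with "parent " is not mistaken for a header.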

Now we must deal with the missing blobs. You can try and repair them following Linus' method or, if you are willing to accept the missing history, you can use git replace again to excise them from the history. The general approach is

$ git ls-tree <SHA3>
...
100644 blob <SHA4>  my-magic-file
...
$ git log --raw --all --full-history -- subdirectory/my-magic-file | grep -B 20 -A 20 "<SHA4>" # May just need to use first few values from SHA4
# commit information after missing blob
# commit information for missing blob
# commit information before missing blob
$ git replace --graft <commit-after-missing-blob> <commit-before-missing-blob>

Repeat this until git rev-list --objects my/branch runs to completion.
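
Since this step is repeated many times, a small wrapper that prints only the first error can save some scrolling. This is a sketch under the assumption that a failing git rev-list reports the problem object on stderr; `check_history` is a hypothetical name.

```shell
# Hypothetical helper: succeed if every object reachable from the given
# ref can be read, otherwise print Git's error message and return 1.
check_history() {
  if err=$(git rev-list --objects "$1" 2>&1 >/dev/null); then
    echo "ok: every object reachable from $1 is present"
  else
    echo "incomplete: $err"
    return 1
  fi
}

# Usage:
#   check_history my/branch
```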

Now you need to remove the extraneous commits. Fortunately, a new tool has been developed to do just this: git-filter-repo. This tool will make our grafts permanent and rewrite the history.

$ git filter-repo --force
$ git fsck
Checking object directories: 100%...
Checking objects: 100%...

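Now let's see if we can successfully fetch our previously broken branch into the new repository.

$ cd /path/to/my/new/fixed/repository
$ git remote add broken /path/to/my/broken/repository # only needed if this remote is not defined yet
$ git fetch broken my/branch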
...
From /path/to/my/broken/repository
 * branch            my/branch        -> FETCH_HEAD
 * [new branch]      my/branch        -> broken/my/branch

And, because we have a common history with the remote we can now merge with our previously broken branches

$ git merge broken/my/branch

And the history is once again clean.

answered Sep 16 '22 21:09 by NateM


(From your comments: I'll assume you managed to build a branch with a history of commits that you find satisfactory.)

You can create a fresh clone next to your broken one, and iteratively pull what you can from broken on top of fresh, to both check that you are pulling in valid objects, and to work on a valid repo.

Start with a fresh clone:

# next to your broken 'myproject' directory:
git clone <url> fresh
cd fresh
git remote add broken ../myproject

See if you can fetch the branch you created in your original myproject directory:

# from fresh:
git fetch broken my/branch

If this action works, it means you pulled in only valid commits, pointing at valid trees and valid blobs, and you are in a stable state.

If, on the other hand, this action doesn't work, you will need to find out which commits have valid content.

For trees: run git ls-tree -r <commit> on all commits from "the last one that's in the remote" to the head of your branch. If a tree is invalid, git ls-tree -r will report an error.

For blobs: run git cat-file -p on all blobs mentioned by the git ls-tree -r commands above. Again, you will get an error if a blob is missing.
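
The two checks above can be combined into one loop. The sketch below is only one way to do it: it uses git cat-file -e (an existence check) instead of git cat-file -p so that blob contents are not printed, and `check_objects` is a hypothetical helper name.

```shell
# Hypothetical helper: for every commit in the given range, report trees
# that cannot be listed and blobs that cannot be found.
check_objects() {
  git rev-list "$1" | while read -r commit; do
    # ls-tree -r walks every tree of the commit and fails if one is unreadable.
    if ! listing=$(git ls-tree -r "$commit" 2>/dev/null); then
      echo "unreadable tree in commit $commit"
      continue
    fi
    # Each listing line is "<mode> <type> <sha><TAB><path>".
    printf '%s\n' "$listing" | while read -r mode type sha path; do
      [ "$type" = blob ] || continue   # skip submodule entries and blank lines
      git cat-file -e "$sha" 2>/dev/null ||
        echo "missing blob $sha ($path) in commit $commit"
    done
  done
}

# Usage, from the broken repository (the range shown is an assumption):
#   check_objects origin/my/branch..my/branch
```

This assumes the commits themselves are readable; if git rev-list stops on a missing commit, that link has to be repaired first.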

answered Sep 18 '22 21:09 by LeGEC