
git filter-branch led to a disconnected history: how to get rid of the old commits?

The scenario is the following:

I have a big CVS repository that I want to convert to 14 distinct git repositories. The cvs2git part of the process is fine and leads to a big repository repo.git.

For each of the 14 git repositories, I clone the main repo and run the following command:

git filter-branch -d /tmp/rep --tag-name-filter cat --prune-empty --subdirectory-filter "sub/directory" -- --all

However, for some of the repositories I first have to run another git filter-branch command, because I need to rewrite the commits to move a file from one directory to another. I use the --tree-filter option for this. Here is an example of the command line executed:

script_tree_filter="if test -f rep/to/my/file && test -d another/rep ; then echo Moving my file ; mv rep/to/my/file another/rep; fi"
git filter-branch -d /tmp/rep --tag-name-filter cat --prune-empty --tree-filter "$script_tree_filter" -- --all

At the end of the process (14500 commits, which takes about 1 hour!) I clean the refs and run git gc:

git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc --prune=now

I end up with a repository of 1.2 GB (which is obviously still too big), and by looking at the commits I can see that a lot of old ones are still present. They concern files and directories which should no longer be here after the --subdirectory-filter command.

In the history of the commits, there is a discontinuity between the unwanted commits and the good ones as seen in gitk --all:

[Image: discontinuity seen in gitk]

I am pretty certain that those commits are still present because some of them are tagged. If this is the case, is it possible to remove those tags without removing the ones on the good commits?

If the tags are not the reason, any other ideas?

For more information, the refs directory (in the git repository obtained by subdirectory-filter) contains only empty directories:

$ ls -R refs/
refs/:
heads  original  tags

refs/heads:

refs/original:
refs

refs/original/refs:
heads  tags

refs/original/refs/heads:

refs/original/refs/tags:

refs/tags:

I've found that the branches and tags are listed in the file packed-refs in the git repository:

d0c675d8f198ce08bb68f368b6ca83b5fea70a2b refs/tags/v03-rev-04
95c3f91a4e92e9bd11573ff4bb8ed4b61448d8f7 refs/tags/v03-rev-05

There are 817 tags and 219 branches listed in the file.
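
Since the refs live in packed-refs rather than as files under refs/, `ls` will not show them, but `git for-each-ref` reads both storage forms. A minimal sketch for counting them, where `count_refs` is a hypothetical helper name and not a git command:

```shell
# Count refs under a prefix as git sees them, whether they are stored
# as loose files under refs/ or as entries in the packed-refs file.
# count_refs is a hypothetical helper name used for illustration.
count_refs() {
    # $1: ref prefix, e.g. refs/tags, refs/heads or refs/original
    git for-each-ref --format='%(refname)' "$1" | wc -l | tr -d ' '
}

# Example usage, run inside the repository:
#   count_refs refs/tags
#   count_refs refs/heads
#   count_refs refs/original
```

If `count_refs refs/original` still reports entries after the cleanup step, the backup refs are keeping the old commits alive.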

Frodon asked Jul 26 '13 16:07


2 Answers

I managed to solve my problem by changing the way I used cvs2git: instead of converting the whole CVS base and then using the subdirectory-filter command, I converted each of the submodules I wanted separately. In my case, this meant running 18 different cvs2git commands:

Before

cvs2git --blobfile=blob --dump=dump /path/to/cvs/base
# Module 1
git filter-branch --tag-name-filter cat --prune-empty --subdirectory-filter "path/to/module1" -- --all
# Module 2
git filter-branch --tag-name-filter cat --prune-empty --subdirectory-filter "path/to/module2" -- --all

Now

# Module 1
cvs2git --blobfile=blob_module1 --dump=dump_module1 /path/to/cvs/base/path/to/module1
# Module 2
cvs2git --blobfile=blob_module2 --dump=dump_module2 /path/to/cvs/base/path/to/module2

Each repository now has a perfect history.
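
One rough way to check for the kind of discontinuity described in the question is to count root commits (commits with no parent) across all refs. This is only a sketch: `count_roots` is a hypothetical helper name, and more than one root is not always wrong, since a repository can legitimately contain unrelated histories.

```shell
# Count commits that have no parent, considering every ref.
# A history split into disconnected pieces shows up as more than one root.
# count_roots is a hypothetical helper name used for illustration.
count_roots() {
    git rev-list --max-parents=0 --all | wc -l | tr -d ' '
}

# Example usage, run inside a converted repository:
#   count_roots
```

A result greater than 1 is a hint to look at the output of `git rev-list --max-parents=0 --all` in gitk and see which refs lead to each root.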

Why didn't the previous method work? My guess is that cvs2git was confused by all the submodules (some of them had their directory renamed during their history).

@Michael @CharlesB Thank you for taking your time to answer and help me.

Frodon answered Oct 10 '22 02:10


I bet you are getting hit with this:

  • Differences between CVS and git branch/tag models: CVS allows a branch or tag to be created from arbitrary combinations of source revisions from multiple source branches. It even allows file revisions that were never contemporaneous to be added to a single branch/tag. Git, on the other hand, only allows the full source tree, as it existed at some instant in the history, to be branched or tagged as a unit. Moreover, the ancestry of a git revision makes implications about the contents of that revision. This difference means that it is fundamentally impossible to represent an arbitrary CVS history in a git repository 100% faithfully. cvs2git uses the following workarounds:

    • cvs2git tries to create a branch from a single source, but if it can't figure out how to, it creates the branch using a "merge" from multiple source branches. In pathological situations, the number of merge sources for a branch can be arbitrarily large. The resulting history implies that whenever any file was added to a branch, the entire source branch was merged into the destination branch, which is clearly incorrect. (The alternative, to omit the merge, would discard the information that some content was moved from one branch to the other.)

    • If cvs2git cannot determine that a CVS tag can be created from a single revision, then it creates a tag fixup branch named TAG.FIXUP, then tags this branch. (This is a necessary workaround for the fact that git only allows existing revisions to be tagged.) The TAG.FIXUP branch is created as a merge between all of the branches that contain file revisions included in the tag, which involves the same tradeoff described above for branches. The TAG.FIXUP branch is cleared at the end of the conversion, but (due to a technical limitation of the git fast-import file format) not deleted. There are some situations when a tag could be created from a single revision, but cvs2git does not realize it and creates a superfluous tag fixup branch. It is possible to delete superfluous tag fixup branches after the conversion by running the contrib/git-move-refs.py script within the resulting git repository.

  • There are no checks that CVS branch and tag names are legal git names. There are probably other git constraints that should also be checked. (Quoted from the cvs2git documentation.)

Are you showing the refs directory of the new split repositories or of the large repo after conversion? You could delete the tags in your single large export repo before you filter and split it.

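You can delete a tag in the large repo by just deleting its file under refs/tags, since it is just a reference to a SHA. If the tags are stored in the packed-refs file instead, as in your listing, there is no such file to delete, and git tag -d is the way to remove them.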
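
To pick out only the tags that sit on the unwanted commits, one possible approach is to list the tags whose commit is not contained in any local branch. This is a sketch under that assumption, not something the conversion tools provide: `tags_off_branches` is a hypothetical helper name, and a tag outside every branch is not necessarily unwanted, so the list needs to be reviewed before anything is deleted.

```shell
# Print the names of tags whose target commit is not contained in any
# local branch. tags_off_branches is a hypothetical helper name.
tags_off_branches() {
    git for-each-ref --format='%(refname)' refs/tags |
    while read -r ref; do
        # `git branch --contains` prints nothing when no local branch
        # includes the commit; ^{commit} peels annotated tags.
        if [ -z "$(git branch --contains "$ref^{commit}" 2>/dev/null)" ]; then
            printf '%s\n' "${ref#refs/tags/}"
        fi
    done
}

# Review the list first, then delete. `git tag -d` also removes tags
# that are stored in packed-refs.
#   tags_off_branches
#   tags_off_branches | xargs -n 1 git tag -d
```

After deleting the tags, the reflog expiry and `git gc --prune=now` steps from the question are still needed before the unreachable commits are actually dropped.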

Michael answered Oct 10 '22 01:10