Despite involving two subparts, I'm asking this as a combined question because the way it's broken down into parts isn't what's important. I'm open to different ways to achieve what I want as long as the end result retains all the meaningful history and ability to check out, study, and build/test historical versions. The goal is to retire hg and the subrepo model that's been used so far and move to a unified tree in git, but without sacrificing history.
What I'm starting with is a Mercurial repository that consists of some top-level code and a number of subrepositories where the bulk of interesting history lies. The subrepos have some branching/merges, but nothing too crazy. The final result I want to achieve is a single git repository, with no submodules, such that:
For each commit in the original top-level hg repo, there is a git commit that checks out exactly the same tree as you'd get checking out the corresponding hg commit with all its referenced subrepo commits.
These git commits corresponding to successive top-level hg commits are descendants of each other, with commits corresponding to all relevant subrepo commits in between.
The basic idea I have for how to achieve this is to iterate over all top-level hg commits, and for each top-level commit that changes .hgsubstate, also iterate over all paths from the old revision to the new revision for the subrepo (possibly involving branching). At each step, use git-write-tree and git-commit-tree to generate a commit with the desired parents, using authorship, date, and commit message from the corresponding hg commit.
Should this work? Is there a better way to achieve what I want, perhaps doing the subrepo collapse with hg first? The biggest thing I'm not clear on is how to perform the desired iteration, so practical advice for how to achieve it would be great.
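To make the first half of that iteration concrete, here is a rough sketch of detecting which subrepos a top-level commit moves, and between which revisions. It assumes each .hgsubstate line has the form "<hash> <path>", and that the two texts were obtained with something like `hg cat -r REV .hgsubstate`. The function names are made up for illustration.

```python
from typing import Dict, Optional, Tuple


def parse_hgsubstate(text: str) -> Dict[str, str]:
    """Map subrepo path -> pinned revision hash.

    Assumes one "<hash> <path>" entry per line.
    """
    state: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        node, path = line.split(" ", 1)
        state[path] = node
    return state


def changed_subrepos(
    old_text: str, new_text: str
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {path: (old_hash, new_hash)} for subrepos whose pin changed.

    A hash of None means the subrepo is absent on that side.
    """
    old = parse_hgsubstate(old_text)
    new = parse_hgsubstate(new_text)
    return {
        path: (old.get(path), new.get(path))
        for path in sorted(set(old) | set(new))
        if old.get(path) != new.get(path)
    }
```

Each (old_hash, new_hash) pair is then the range of subrepo history that has to be woven in between the two top-level commits.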
One additional constraint: the original repos involve content which can't be published (removing it is an additional git-filter-branch step once the basic conversion is done), so solutions that involve uploading the repo for processing by a third party are not viable.
It seems what I was missing from my question and discussion of possible solutions was a proper understanding of the graph theory involved. Ideas like "iterate over all paths from the old revision to the new revision" were not really well-defined, or at least didn't reflect what I expected them to reflect. Coming at it from a more rigorous standpoint, I think I have an approach that works.
To begin with, the problem: Subrepo revisions only represent the state of their own subtrees at a given point in history. I want to map them to revisions that represent the state of the whole combined tree. Then the subrepo DAGs can be merged with the top-level DAG in a meaningful way.
For a given subrepo revision R, we can ask what top-level-repo (or parent-repo, if we had multiple levels of subrepos) revisions include R or any descendant of R. Assuming a single root, this set of revisions has a Lowest Common Ancestor (or maybe more than one), which seems like a good candidate. Indeed, if the top-level revision S we use with R is not a common ancestor of revisions which use R or its descendants (but the mapping is otherwise reasonable), then R will have a descendant R' whose associated top-level revision S' is not a descendant of S. In other words, the history derived from the subrepo will have confusing/nonsensical jumps between revisions of the top-level tree.
Now, if we want to choose a common ancestor, the lowest one makes sense from a standpoint of making these revisions something that can be checked-out, built, and tested, and from a standpoint of giving a reasonable idea what the state of the top-level repo (and other subrepos) was at the time the changes in the subrepo were made. The root of the whole top-level DAG would of course also work, but it would not give meaningful, usable revisions that could be checked out; choosing the root would be equivalent (from a usability standpoint) to a naive repo-merge that has one root per subrepo and just merges from the subrepo histories whenever the top-level repo updates the revisions it's using.
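As a sanity check on the definition, here is a minimal sketch of computing T(R) on a toy graph. It assumes a single subrepo, DAGs given as dicts mapping each revision to its list of parents, and a dict `uses` mapping each top-level revision to the subrepo revision it pins. All names here are hypothetical, and the quadratic ancestor check is only meant for small examples.

```python
from typing import Dict, Iterable, List, Set


def reachable(edges: Dict[str, List[str]], start: str) -> Set[str]:
    """Return start plus everything reachable from it by following edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        rev = stack.pop()
        if rev in seen:
            continue
        seen.add(rev)
        stack.extend(edges.get(rev, ()))
    return seen


def invert(parents: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Turn a rev -> parents map into a rev -> children map."""
    children: Dict[str, List[str]] = {rev: [] for rev in parents}
    for rev, ps in parents.items():
        for p in ps:
            children.setdefault(p, []).append(rev)
    return children


def top_level_for(
    sub_rev: str,
    sub_parents: Dict[str, List[str]],
    top_parents: Dict[str, List[str]],
    uses: Dict[str, str],
) -> Set[str]:
    """Lowest common ancestor(s) of the top-level revisions that use
    sub_rev or any descendant of it. May contain more than one revision."""
    wanted = reachable(invert(sub_parents), sub_rev)  # R and its descendants
    users = [t for t, s in uses.items() if s in wanted]
    if not users:
        return set()
    common = set.intersection(*(reachable(top_parents, t) for t in users))
    # Keep only the lowest ones: drop any common ancestor that is itself
    # a proper ancestor of another common ancestor.
    return {
        c
        for c in common
        if not any(c != d and c in reachable(top_parents, d) for d in common)
    }


# Toy data: linear top-level history t0..t3, linear subrepo history s0..s2.
# The top level jumps straight from s0 to s2, so s1 is never pinned directly.
top_parents = {"t0": [], "t1": ["t0"], "t2": ["t1"], "t3": ["t2"]}
sub_parents = {"s0": [], "s1": ["s0"], "s2": ["s1"]}
uses = {"t0": "s0", "t1": "s0", "t2": "s2", "t3": "s2"}
```

In this toy case `top_level_for("s1", sub_parents, top_parents, uses)` gives `{"t2"}`: the only top-level revisions using s1 or a descendant are t2 and t3, and t2 is their lowest common ancestor.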
So, if we can use the LCA to assign a top-level revision T(R) to each subrepo revision R, how does that translate into
Whenever a subrepo revision R has T(R) distinct from T(P) for each parent P of R, it's effectively merging new changes from the top-level repo (and other subrepos) into the subrepo history. The conversion should represent this as two commits:
(1) The actual subrepo commit R, using an old top-level revision. If R has a single parent P (not a merge commit), this will be T(P). If R had multiple parents, it's not clear whether there's a perfect choice of which one to use, but T(P) for any parent P should be reasonable.
(2) A merge commit merging back the conversion C(T(R)) of the top-level-repo commit T(R) associated with R, where C(T(R)) itself just merged (1) above.
Aside from C(T(R)), which references (1) as a merge parent, all other references to R in the conversion should use (2). This includes the conversions of any descendants of T(R) in the top-level repo which use revision R of this subrepo, and the conversions of direct children of R itself.
I believe the above (albeit poorly worded) description specifies all that's needed for merging the top-level and subrepo DAGs. Each subrepo revision gets a full version of the tree, and ends up connected into a unified DAG for the converted repo via "merge commits" (when the subrepo merges a new associated top-level revision, and when the top-level merges subrepo revisions that have changed).
The final step of producing the git repo, then, is simply replaying the merged DAG, either in topologically sorted form or via a depth-first walk, such that each git commit-tree already has all the parent revisions it needs present.
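A minimal sketch of that replay, assuming the merged DAG is a dict mapping each revision to its list of parents. `make_commit` is a hypothetical callback; in a real conversion it would presumably run git commit-tree with one -p option per parent id and return the new commit hash. The ordering uses Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def replay(
    parents: Dict[str, List[str]],
    make_commit: Callable[[str, List[str]], str],
) -> Dict[str, str]:
    """Create commits in an order where every parent already exists.

    Returns a map from original revision to the id returned by make_commit.
    """
    created: Dict[str, str] = {}
    # TopologicalSorter takes node -> predecessors, which is exactly a
    # rev -> parents map; static_order() yields parents before children.
    for rev in TopologicalSorter(parents).static_order():
        parent_ids = [created[p] for p in parents.get(rev, [])]
        created[rev] = make_commit(rev, parent_ids)
    return created
```

Because the callback only ever receives ids of commits that were already created, the same loop works regardless of how the top-level and subrepo revisions interleave.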
What you have written might or might not solve the issue, but it isn't simple. The main issue is that you need the commits in order so that your subrepos and main repo are consistent. I recreated this problem on a small scale and was able to get consistency (between subrepos as well).
My solution:
Using the hg convert extension, I converted the main repo to a repo without subrepos (and related information).
cd main
awk '{ print $1}' .hgsub | xargs -n 1 echo 'exclude' > ../filemap
echo exclude .hgsub >> ../filemap
echo exclude .hgsubstate >> ../filemap
cd ..
hg convert --filemap filemap main mainConv
cd mainConv
hg update
Convert subrepo by using rename in --filemap.
cd ..
echo rename . subRepo > subFileMap
hg convert --filemap subFileMap main/subRepo subRepoConv
cd subRepoConv
hg update
Pull subrepos to converted main repo.
cd ../mainConv
hg pull -f ../subRepoConv
You will notice multiple heads in the repo while pulling (because each subrepo has its own head). Merge them:
hg heads
hg merge <RevID from subrepo (not main repo)>
hg ci -mMergeOfSubRepo
You have to repeat the pull and merge steps above for every subrepo.
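Since this has to be done once per subrepo, the per-subrepo inputs can be generated rather than typed by hand. The following is only a sketch: it assumes .hgsub lines have the form "path = source", that paths contain no spaces, and the helper names and the naming scheme for the converted repos and filemaps are made up. It only builds the filemap text and the argument lists for the same hg convert and hg pull commands used above; the merge still needs the head revision picked by hand.

```python
from typing import List


def subrepo_paths(hgsub_text: str) -> List[str]:
    """Extract subrepo paths from .hgsub content ("path = source" lines)."""
    paths = []
    for line in hgsub_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        paths.append(line.split("=", 1)[0].strip())
    return paths


def main_filemap(paths: List[str]) -> str:
    """Filemap that strips subrepos and their bookkeeping from the main repo."""
    lines = [f"exclude {p}" for p in paths]
    lines += ["exclude .hgsub", "exclude .hgsubstate"]
    return "\n".join(lines) + "\n"


def subrepo_filemap(path: str) -> str:
    """Filemap that moves a subrepo's files under its path in the main tree."""
    return f"rename . {path}\n"


def subrepo_commands(path: str) -> List[List[str]]:
    """Argument lists for converting one subrepo and pulling it into mainConv.

    Assumes the subrepo filemap was written to '<name>.filemap' and that
    the pull is run from inside mainConv.
    """
    name = path.replace("/", "_")
    return [
        ["hg", "convert", "--filemap", f"{name}.filemap", f"main/{path}", f"{name}Conv"],
        ["hg", "pull", "-f", f"../{name}Conv"],
    ]
```

Each argument list could then be run with subprocess.run from the appropriate directory, followed by the manual hg merge and hg ci.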
But commits won't be sorted. So put them in order as done here https://stackoverflow.com/a/16012597:
cd ..
hg clone -r 0 mainConv mainOrdered
cd mainOrdered
# Pull from the converted repo (mainConv), one revision at a time in date order
for REV in $(hg log -R ../mainConv -r 'sort(1:tip, date)' --template '{rev}\n')
do
hg pull ../mainConv -r "$REV"
done
Now convert this ordered mercurial repo to git using http://repo.or.cz/w/fast-export.git:
cd ..
git clone git://repo.or.cz/fast-export.git
git init mainGit
cd mainGit
../fast-export/hg-fast-export.sh -r ../mainOrdered
git checkout HEAD