 

Migration from svn to git. Which option is the best: giant trunk, submodules, subtrees

Tags:

git

svn

migration

I know there are many questions about this already, but I still need more information. I am investigating the possibility of migrating our SVN repo to git and trying to understand which approach (monolithic trunk, submodules, subtrees, etc.) will be best for our repo.

Here is some information about our project and SVN repository:

  • The project is a Java web application packaged as a war.
  • It is a modular application. Each module is developed by a separate team and packaged as a jar.
  • The war depends on these jars.

Basically our structure looks like:

repo
|-application(war)
|-module1 (for example, ui stuff)
|--module1Submodule1
|--module1Submodule2
|-module2 (for example, database access stuff)
|-...

Each module has its own tags and branches.

The size of the SVN repo on my local machine, with all branches, tags, etc., is:

  • over 2.5 million files
  • over 20 GB of space
  • 311,615 revisions
  • files are mostly source code, no large binary objects

Typical usecases:

  • 200+ devs and QA across the whole team
  • Different teams commit to their own modules/submodules. (Can this be a problem with a monolithic git repo? git requires you to pull all changes before pushing, while svn warns only about out-of-date files.)
  • Branching a module
  • Branching the application

Future usecases:

  • Gerrit
  • A developer commits, the commit is reviewed, tests are run against it; if green, the commit is approved for merge to the 'master' branch

The questions are:

  1. Can such a repo be considered large for git? (There are many posts noting that git scales badly for large repos, but what counts as 'large'?)
  2. What are the pros and cons of each of the approaches:
    • Monolithic repo (just use git like svn; is that an anti-pattern?)
    • Submodules
    • Subtrees (Am I right that every change in a module will require a commit in the subtree repo and then a pull of that change into the aggregating repo?)
    • Separate repos for each module
    • Any other..
  3. Can history from SVN be preserved for each of them?
  4. I need as many links as possible (I didn't find any official sources for 'slow for large repos')

Thank you in advance!

Asked by Vadim Kirilchuk on Nov 13 '13


1 Answer

History

History can be preserved with all of the approaches above by using git svn: http://git-scm.com/book/en/Git-and-Other-Systems-Migrating-to-Git. Even switching back to pre-migration commits is possible.

However, there were suggestions not to preserve history and just leave the SVN repository frozen for about 6 months while all new history accumulates in the git repo. I disagree with such advice because history is essential for our project; I bet no one would accept such a solution.

Giant trunk approach

  • You have to clone the whole big tree, even if you only plan to work on one subdirectory (the main use case)
  • Some git commands will be slow (for example git status, as it needs to check the whole tree)
  • Even if you tune Jenkins to trigger builds only for particular parts of the repo (this can be done with the "include" property of the Jenkins git plugin), the whole repo still has to be pulled to perform a build. This will heavily impact all the work, because a "clean" checkout will take a long time even when building small modules.

Concern: with 200+ devs and QA in the whole team, I suspect it will be quite hard to eventually push changes.

  • Changes are pushed to the master branch only after review is approved in Gerrit and tests have passed, so we won't have a continuous flow of pull-push-fail-pull-push
  • However, Gerrit may refuse to merge if the master branch has changed since the commit was pushed to Gerrit; this requires clicking the 'Rebase' button and rerunning the tests
  • The Linux kernel uses a monolithic repo because C/C++ has no dependency management like Java has: building a kernel tarball is not like building a war from jar dependencies.

Quiz

What are the steps, their cost, and the total cost of migration using this approach?

  • git svn clone SVN_URL REPO_NAME
  • Jenkins stuff

How can it support code gating? What changes are required from VCS / tools perspective? Suppose here that full CI run takes 15 minutes.

  • Jenkins should have an "include" filter in the SCM trigger to filter changes for a particular part of the project. It's not that hard, but it still requires some effort to set up and verify. For "wipe workspace before build" jobs, the whole repo will be cloned every time. This can increase the overall time from commit to "approved by tests", because the checkout will be quite slow.

What are efficient developer workflows?

  • Developers use local/remote feature branches
  • Push changes to gerrit
  • Gerrit verifies changes against tests
  • Change is merged to master branch
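The steps above can be sketched with plain git. A local bare repository stands in for the Gerrit-hosted one (pushing to refs/for/master on a plain repo simply creates that ref; real Gerrit intercepts it and opens a review):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Stand-in for the Gerrit-hosted repository.
git init --bare gerrit.git
git clone gerrit.git work
cd work
git symbolic-ref HEAD refs/heads/master
git config user.email dev@example.com && git config user.name Dev
git commit --allow-empty -m "initial"
git push origin master

# Developer works on a local feature branch...
git checkout -b feature/login-fix
echo "fix" > fix.txt && git add fix.txt && git commit -m "Fix login"

# ...and pushes it for review: Gerrit treats refs/for/master as a new change.
git push origin HEAD:refs/for/master
```

The branch name feature/login-fix and the file are hypothetical; the only Gerrit-specific part is the refs/for/master push target.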

Submodules

Most of the caveats are explained here http://git-scm.com/book/en/Git-Tools-Submodules and here http://codingkilledthecat.wordpress.com/2012/04/28/why-your-company-shouldnt-use-git-submodules/

The main issue is that you have to commit twice:

  • to the submodule itself
  • to the aggregating repo, to update the submodule reference. But this makes no sense: why would you ever need an aggregating repo at all if dependencies are managed through an artifact repository?

Submodules were actually created for the case where a library is reused across different projects, but you want to depend on a particular tag of the library with the ability to update the reference in the future. However, we are not going to tag each commit (only a release after each commit), and changing dependency versions (to released ones) in the war will be easier than maintaining the submodules approach. Java dependency management makes things simpler.

Pointing a submodule at a branch head is not recommended and leads to trouble, so this approach is a dead end for snapshot-style dependencies. And again, we don't need it, because Java dependency management will do everything for us.

Quiz

What are the steps, their cost, and the total cost of migration using this approach?

  • git svn clone SVN_URL REPO_NAME for each module
  • Create aggregating git repo
  • Add module repositories as submodules to aggregating repo
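The last two steps can be sketched with plain git. The module1 repo here is a hypothetical stand-in for one of the repos produced by git svn clone:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
export GIT_AUTHOR_NAME=Dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=Dev GIT_COMMITTER_EMAIL=dev@example.com

# Stand-in for a per-module repo produced by "git svn clone".
git init -q module1
git -C module1 commit -q --allow-empty -m "module1 import"

# The aggregating repo pins module1 at a specific commit.
git init -q aggregate
cd aggregate
git commit -q --allow-empty -m "initial"
# Newer git restricts file:// submodule URLs; allow it for this local demo.
git -c protocol.file.allow=always submodule add "$tmp/module1" module1
git commit -q -m "Add module1 as a submodule"
```

The aggregating repo now records only a commit hash for module1 (visible via .gitmodules plus the gitlink entry), which is exactly the double bookkeeping criticized below.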

How can it support code gating? What changes are required from VCS / tools perspective? Suppose here that full CI run takes 15 minutes.

  • Gerrit supports both merges and commits to submodules, so it should be OK.
  • Jenkins stuff: triggers on submodule changes and on aggregating repo changes (argh! no sense in having two places!)

What are efficient developer workflows? (the Gerrit process is omitted)

  • The developer commits into the submodule
  • Tags the commit
  • Goes into the aggregating repo
  • cds into the submodule and checks out the tag
  • Commits the aggregating repo with the changed submodule hash

Or

  • The developer changes the submodule
  • Pushes the submodule change so it isn't lost
  • Commits the aggregating repo with the changed submodule hash
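This double-commit workflow can be reproduced with local stand-in repos (all names and paths are hypothetical):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
export GIT_AUTHOR_NAME=Dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=Dev GIT_COMMITTER_EMAIL=dev@example.com

# Minimal setup: a module repo plus an aggregating repo that embeds it.
git init -q module1
git -C module1 symbolic-ref HEAD refs/heads/master
git -C module1 commit -q --allow-empty -m "module1 import"
git -C module1 config receive.denyCurrentBranch ignore  # accept demo pushes
git init -q aggregate
cd aggregate
git commit -q --allow-empty -m "initial"
git -c protocol.file.allow=always submodule add -q "$tmp/module1" module1
git commit -q -m "add submodule"

# Commit #1: inside the submodule, then push so the change isn't lost.
cd module1
echo change > file.txt && git add file.txt && git commit -q -m "module change"
git push -q origin HEAD:master

# Commit #2: record the new submodule hash in the aggregating repo.
cd ..
git add module1
git commit -q -m "Bump module1 to the new commit"
```

Every module change costs two commits (and a push) across two repos, which is the cumbersomeness described above.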

As you can see, the developer workflow is cumbersome (it always requires updating two places) and doesn't suit our needs.

Subtrees

The main issue is that you have to commit twice:

  • to the tree-merged subdirectory
  • pushing the change to the original repo

Subtrees are a better alternative to submodules: they are more robust and merge the source code of the modules into the aggregating repo instead of just referencing it, which makes the aggregating repo simpler to maintain. However, the problem is the same as with submodules: making double commits is pointless. You are not forced to push changes back to the original module repo and can commit them only in the aggregating repo, but that can lead to inconsistency between the repos...

The differences are explained quite well here: http://blogs.atlassian.com/2013/05/alternatives-to-git-submodule-git-subtree/

Quiz

What are the steps, their cost, and the total cost of migration using this approach?

  • git svn clone SVN_URL REPO_NAME for each module
  • Create aggregating repo
  • Perform subtree merge for each module
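The subtree merge step can be sketched with the classic read-tree recipe (pure git, no git-subtree helper needed); module1 is a hypothetical stand-in for a repo produced by git svn clone:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
export GIT_AUTHOR_NAME=Dev GIT_AUTHOR_EMAIL=dev@example.com
export GIT_COMMITTER_NAME=Dev GIT_COMMITTER_EMAIL=dev@example.com

# Stand-in for a per-module repo produced by "git svn clone".
git init -q module1
git -C module1 symbolic-ref HEAD refs/heads/master
( cd module1 && echo code > Module1.java && git add . \
  && git commit -q -m "module1 import" )

# Aggregating repo: merge module1's history under the module1/ prefix.
git init -q aggregate
cd aggregate
git commit -q --allow-empty -m "initial"
git remote add module1 "$tmp/module1"
git fetch -q module1
git merge -s ours --no-commit --allow-unrelated-histories module1/master
git read-tree --prefix=module1/ -u module1/master
git commit -q -m "Subtree-merge module1 under module1/"
```

Unlike a submodule, the module's files and full history now live inside the aggregating repo itself; the module1 remote is only needed for later pulls or pushes back.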

How can it support code gating? What changes are required from VCS / tools perspective? Suppose here that full CI run takes 15 minutes.

  • It looks like Gerrit does not support subtree merges very well (https://www.google.com/#q=Gerrit+subtrees)
  • But we can't be sure until we try
  • Jenkins stuff: triggers on the subtree repos and on aggregating repo changes (argh! no sense in having two places!)

What are efficient developer workflows? (the Gerrit process is omitted)

  • The developer changes something in a subtree (inside the aggregating repo)
  • The developer commits to the aggregating repo
  • The developer must not forget to push the change to the original repo (no sense!)
  • The developer must not forget NOT to mix subtree changes with aggregating repo changes in one commit

Again, as with submodules, there is no sense in having two places (repos) where the code/changes live. Not for our case.

Separate repos

Separate repos look like the best solution and follow git's original intention. The granularity of the repos can vary. The most fine-grained option is a repo per Maven release group, but that can lead to too many repos. We also need to consider how often a single SVN commit affects several modules or release groups: if a commit usually affects 3-4 release groups, then those groups should form one repo.

I also believe it's worth at least separating API modules from implementation modules.

Quiz

What are the steps, their cost, and the total cost of migration using this approach?

  • git svn clone SVN_URL REPO_NAME for each more-or-less fine-grained group of modules

How can it support code gating? What changes are required from VCS / tools perspective? Suppose here that full CI run takes 15 minutes.

  • Jenkins is triggered for each repo separately. No 'include' filters. Just checkout, build, deploy.

What are efficient developer workflows?

  • Developers use local/remote feature branches for each repo
  • Push changes to gerrit
  • Gerrit verifies changes against tests
  • Change is merged to master branch
Answered by Vadim Kirilchuk on Sep 30 '22