Transferring legacy code base from cvs to distributed repository (e.g. git or mercurial). Suggestions needed for initial repository design [closed]

Introduction and Background

We are in the process of changing our source control system and are currently evaluating git and mercurial. The total code base is around 6 million lines of code, so not massive but not really small either.

Let me first start off with a very brief introduction to how the current repository design looks.

We have one base folder for the complete code base, and beneath that level there are all sorts of modules used in several different contexts. For example “dllproject1” and “dllproject2” can be looked at as completely separate projects.

The software we are developing is something we call a configurator, which can be customized endlessly for different customer needs. In total we probably have 50 different versions of it. However, they have one thing in common: they all share a couple of mandatory modules (mandatory_module1 ..). These folders basically contain kernel/core code, common language resources, etc. Each customization can then be any combination of the other modules (module1 ..).

Since we currently are using cvs we've added aliases in the CVSROOT/modules file. They might look something like:

core -a mandatory_module1 mandatory_module2 mandatory_module3
project_x -a module1 module3 module5 core

So if someone decides to work on project_x, he/she can quickly check out the modules needed with:

base>cvs co project_x

Questions

Intuitively it just feels wrong to have the base folder as a single repository. As a programmer you should be able to check out the exact subset of code needed for the project you are currently working on. What are your thoughts on this?

On the other hand, it feels more right to have each of these modules in a separate repository. But this makes it harder for programmers to check out the modules that they need. You should be able to do this with a single command. So my question is: are there similar ways of defining aliases in git/mercurial?

Any other questions, suggestions, pointers are highly welcome!

PS. I have searched for similar questions but didn’t feel that any of them applied 100% to my situation.


asked May 22 '09 18:05

ralphtheninja

2 Answers

Just a quick comment to remind you that:

those migrations often offer the opportunity to reorganize the sources, not along module lines (one repository per module) but along a functional domain split (several modules of the same functional domain being put in the same repository).

Then submodules are to be used, as a way to define a configuration.

Git is all right, but by Linus's own admission, putting everything into one repository can be problematic.

[...] CVS, ie it really ends up being pretty much oriented to a "one file at a time" model.

Which is nice in that you can have a million files, and then only check out a few of them - you'll never even see the impact of the other 999,995 files.

Git fundamentally never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around.

So git scales really badly if you force it to look at everything as one huge repository. I don't think that part is really fixable, although we can probably improve on it.

And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know.

These two points argue for a more component-oriented approach for large systems (and large legacy repositories).

With Git submodules, you can check them out in your project (even if it is a two-step process). There are, however, tools that can make submodule management easier (git.rake, for instance).
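To make the "submodules as a configuration" idea concrete, here is a minimal sketch, not a recommended layout. It builds a few throwaway local repositories named after the modules in the question and a project_x repository that plays the role of the CVS alias. The sandbox directory, the helper function g, the demo identity and the checkout directory names are all assumptions made only so the sketch runs offline; a real setup would point at server URLs instead.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway sandbox so the sketch runs offline (hypothetical paths).
sandbox=$(mktemp -d)
cd "$sandbox"

# Hypothetical helper: demo identity plus permission to clone local paths
# as submodules. Neither setting is needed against real server URLs.
g() { git -c user.name=demo -c user.email=demo@example.com -c protocol.file.allow=always "$@"; }

# One small repository per module (module names taken from the question).
for m in mandatory_module1 module1 module3; do
    g init -q "$m"
    echo "$m" > "$m/README"
    g -C "$m" add README
    g -C "$m" commit -q -m "initial import of $m"
done

# The configuration repository: it only records which modules belong to
# project_x and at which commit each one is pinned.
g init -q project_x
for m in mandatory_module1 module1 module3; do
    g -C project_x submodule --quiet add "$sandbox/$m" "$m"
done
g -C project_x commit -q -m "project_x: define module configuration"

# A developer can get everything with a single command ...
g clone -q --recurse-submodules "$sandbox/project_x" checkout_one_step

# ... or with the two-step process mentioned above.
g clone -q "$sandbox/project_x" checkout_two_step
g -C checkout_two_step submodule --quiet update --init
```

The single-command clone is the closest equivalent to "cvs co project_x": the list of modules lives in the project_x repository (in its .gitmodules file) rather than in a central CVSROOT/modules file.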

When I'm thinking of fixing a bug in a module that's shared between several projects, I just fix the bug and commit it and all just do their updates

That is what I describe in the post Vendor Branch as the "system approach": everyone works on the latest (HEAD) of everything, and it is effective for a small number of projects.
For a large number of modules though, the notion of "module" is still very useful, but its management is not the same with DVCS:

for closely related modules (i.e. "in the same functional domain", like all modules related to "PNL" (Profit aNd Losses) or "Risk analysis" in a financial domain), you do need to work with the latest (HEAD) of all components involved.
That would be achieved with a subtree strategy, not so that you can publish (push) corrections to those other modules, but so that you can track work done by other teams.
Git allows that with the added bonus that this "tracking" does not have to take place between your repository and one "central" repository; it can also take place between you and the local repository of another team, allowing for very quick back-and-forth integration and testing between projects of a similar nature.
however, for modules which are not directly in your functional domain, submodules are a better option, because they refer to a fixed version of a module (a commit):
when a low-level framework changes, you do not want the change to propagate instantly, since it would impact all the other teams, who would then have to drop what they were doing to adapt their code to the new version (you do, however, want all the other teams to be aware of the new version, so that they do not forget to update that low-level component or "module").
That allows you to work only with official, stable, identified versions of other modules, and not with potentially unstable or not fully tested HEADs.
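The pinning behaviour described above can be sketched as follows. This is only an illustration under assumptions: the repository names framework and project, the VERSION file, the sandbox directory and the helper g are all hypothetical, and local paths stand in for real remotes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway sandbox; every path and name below is hypothetical.
sandbox=$(mktemp -d)
cd "$sandbox"
g() { git -c user.name=demo -c user.email=demo@example.com -c protocol.file.allow=always "$@"; }

# A low-level framework and a project that uses it as a submodule.
g init -q framework
echo v1 > framework/VERSION
g -C framework add VERSION
g -C framework commit -q -m "framework v1"

g init -q project
g -C project submodule --quiet add "$sandbox/framework" framework
g -C project commit -q -m "pin framework at v1"

# The framework team publishes a new version.
echo v2 > framework/VERSION
g -C framework commit -q -am "framework v2"

# A fresh clone of the project still checks out the pinned commit.
g clone -q --recurse-submodules "$sandbox/project" fresh_clone
cat fresh_clone/framework/VERSION

# Moving to the new version is an explicit step, recorded as a commit
# in the project's own history.
g -C project submodule --quiet update --remote framework
g -C project commit -q -am "move framework to v2"
cat project/framework/VERSION
```

The first cat shows the version the project was pinned to, even though the framework repository has already moved on; the second shows the new version only after the project has deliberately adopted it.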


answered Oct 20 '22 05:10

VonC

As for the Mercurial side, the recommendation is also to refactor large legacy CVS/SVN repositories into smaller components. Common code should be put into its own libraries, and the application code will then depend on those libraries in a similar way to how it depends on other libraries.

Mercurial has the forest extension, which allows you to manage a "forest" of "source trees". With that approach you combine several smaller repositories into a larger one. With CVS you do the opposite: you check out a smaller portion of a large repository.

I have not personally used the forest extension and its page says that one should use an updated version compared to the one bundled with Mercurial. However, it is used by a big organization like Sun in its OpenJDK project.

There is also work currently underway to add sub-repository support directly to the core of Mercurial, as per the design on the nested repositories page in the Mercurial wiki.

answered Oct 20 '22 06:10

Martin Geisler


Tags:

git

dvcs

mercurial

cvs