
When to break up a large Git repository into smaller ones?

I am working on doing a migration from SVN to Git. I have already used git-svn to get the history into a single git repository, and I already know how to use git-subtree to split that repository into smaller ones. This question is not about how to do the migration, it is about when to split and when not to split.

I want to split the large repository because some of the directories are self-contained libraries that are also shared with other projects. Previously, an svn checkout was done on the library alone, without needing to check out the entire project. During all of this I discovered that there are probably dozens of directories that would make sense as their own repositories because they are 1) independent and 2) shared across projects.

Once you get above a handful of git repositories, it seems prudent to use a tool that makes working with many repositories easier. Some examples are Google's repo, git submodules, git subtree, and creating a custom script (it appears that chromium does this). I have explored these various methods, and understand how to use them.

So the question is about direction for the transition from subversion.

Should I try to stick to one large git repository, splitting it into smaller pieces only when absolutely necessary, or should I split it into dozens or potentially hundreds of smaller repositories? Which would be easier to work with? Is there another solution that I have missed? If I go with many repositories, which tool should I use? What factors would make someone favor one method over another?

Note: The source needs to be checked out on Windows, MacOS, and Linux.

onionjake asked Feb 21 '14 17:02



1 Answer

That process can be guided by a component approach, where you identify coherent sets of files (an application, a project, a library).

In terms of history (in a source control tool), a coherent set is one that will be labelled, branched, or merged as a whole, independently of the other sets of files.

For a distributed version control system (like Git), each of those sets of files is a good candidate for a Git repo of its own, and you can then group the ones you need for a specific project in a parent repo with submodules.

I describe this approach, for instance, in:

  • "Git repository setup for a project that has a server and client" (server and client being two obvious coherent separate sets which benefit from having their own repo)
  • "What is Component-Driven Development?"
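
As a rough illustration of the parent repo with submodules mentioned above, here is a minimal Python sketch that only builds the git commands to run, without executing them. The repo name `product-a`, the component paths, and the `example.com` URLs are hypothetical placeholders; `git init`, `git -C`, and `git submodule add` are the standard commands.

```python
from typing import Dict, List


def submodule_commands(parent: str, components: Dict[str, str]) -> List[List[str]]:
    """Return the git commands that assemble `parent` from component repos.

    `components` maps a path inside the parent repo to the component's URL.
    Nothing is executed here; the caller decides whether to run the commands.
    """
    commands = [["git", "init", parent]]
    for path, url in sorted(components.items()):
        # `git -C <dir>` runs the command as if it had been started in <dir>.
        commands.append(["git", "-C", parent, "submodule", "add", url, path])
    # `git submodule add` stages .gitmodules and the new entries, so one
    # commit records the exact component versions used by this project.
    commands.append(["git", "-C", parent, "commit", "-m", "Add component submodules"])
    return commands


if __name__ == "__main__":
    # Hypothetical component repos, for illustration only.
    components = {
        "libs/netlib": "https://example.com/git/netlib.git",
        "libs/guilib": "https://example.com/git/guilib.git",
    }
    for cmd in submodule_commands("product-a", components):
        print(" ".join(cmd))
```

Each project gets its own parent repo of this kind, so a shared library lives in exactly one component repo and is referenced, at a fixed commit, from every project that needs it.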

The opposite (keeping everything in one repo) is called the "system-based approach", but it can lead to a huge Git repo, which, as I mentioned in "Performance for Git", isn't compatible with how Git is implemented.


The OP onionjake asks in the comments:

Could you please include more information on the subtleties of identifying components?

This process (of identifying "components", which in turn become Git repos) is guided by the software architecture of your system.
Any subset that acts as an independent set of files is a good candidate for its own repo. It can be a library or a DLL, but also part of an application (a GUI, a client vs. a server, a dispatcher, ...).

Each time you identify a group of tightly linked files (meaning that modifying one will likely affect the others), they should be part of the same component or, in Git terms, the same repo.
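
The architecture remains the main guide, but the existing history can give a hint about which files are tightly linked. The sketch below is a simple heuristic I am assuming here, not an established tool: it counts how often two top-level directories are modified in the same commit. It expects the history as lists of changed paths (which could be extracted with something like `git log --name-only`), and the directory names in the sample are made up.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def top_level_dir(path: str) -> str:
    """Return the first path segment, e.g. 'gui/main.c' -> 'gui'.

    A file at the repository root is treated as its own group.
    """
    return path.split("/", 1)[0]


def co_change_counts(commits: Iterable[List[str]]) -> Counter:
    """Count how often each pair of top-level directories changes in one commit."""
    counts: Counter = Counter()
    for changed_paths in commits:
        # Deduplicate and sort so each pair has one canonical ordering.
        dirs = sorted({top_level_dir(p) for p in changed_paths})
        for pair in combinations(dirs, 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Made-up history: each inner list holds the files one commit touched.
    history = [
        ["gui/window.c", "gui/menu.c"],
        ["gui/window.c", "dispatcher/queue.c"],
        ["gui/menu.c", "dispatcher/queue.c"],
        ["netlib/socket.c"],
    ]
    for (a, b), n in co_change_counts(history).most_common():
        print(f"{a} + {b}: changed together in {n} commits")
```

Directories that frequently change together are candidates for the same repo, while a directory that never shows up in a pair (like `netlib` in the sample) is a candidate for a repo of its own.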

VonC answered Sep 21 '22 14:09