
Splitting large git repository

Tags:

git

split

We have a large C++ repository, about 80 GB in size with nearly 200,000 files, containing multiple components.

The libraries (archives) are shared by many of the components and are tightly coupled to them.

As a result, all git operations, as well as compiling or building a particular component, take too long.

Please suggest how to divide this single repo into multiple repos.

user2463892 asked Jun 07 '13 14:06


2 Answers

First, 200,000 source files are likely to take far less than 80 GB of space (80 GB spread over 200,000 files would mean an average of about 400 KB of source per file!)

Update 2015: Git LFS (git-lfs) can actually manage that kind of volume.
See "Efficient storage of binary files in a git repository".


Original answer (2013)

A repo of that size most likely contains binaries, which means:

  • any generated binary needs to be excluded from the git repo
  • any large binary needs to be stored elsewhere (either in a Nexus-like artifact repository, or in any other storage space, like with git-annex)
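
To find which files are inflating the repo, a quick sketch like the following can list everything above a size threshold. The helper name list_large_files and the 400 KB threshold are illustrative assumptions, not part of git, and the k size suffix relies on GNU or BSD find.

```shell
#!/bin/sh
# list_large_files DIR MIN_KB  (hypothetical helper, not a git command)
# Prints every regular file under DIR larger than MIN_KB kibibytes,
# skipping git's own metadata (anything named .git).
list_large_files() {
    find "$1" -name .git -prune -o -type f -size +"$2"k -print
}
```

Running list_large_files . 400 from the top of the working tree gives a list of candidates to exclude from the repo or to move to an artifact repository.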

Second, git operations are only slow if we are talking about one huge repo.
Git is designed to manage multiple small repos (even the Linux kernel git repo is nowhere near the size and number of files you mention).

So you need:

  • to split the huge git repo around:

    • functional components (a component being a coherent group of files representing a major feature of your program: the GUI, a dispatcher, a launcher, anything that implements one of your program's main functional blocks)
    • technical components (all those common technical libraries, reused by multiple other components, providing features not visible to the end users, only used by the developers)
  • to speed up the compilation process, especially when doing unit or small integration tests, by using binary dependencies: instead of getting all the sources and recompiling everything, you could set up each project to use the binaries/exes produced by the other projects that it needs in order to compile and run.
    How far you can take this depends on how tightly coupled your libraries are with the other components.


The OP user2463892 adds in the comments:

I heard something about git submodules, which help in dividing or splitting a large code base.
I am not familiar with them. Can anyone help me with a few questions about this, as below?

1) How do git submodules work? Will they divide the huge code base into multiple repos? Can this solve the problem of git slowness?

A submodule is a git repo declared within another repo (which becomes a "parent" repo).

  • See the Pro Git book for a general presentation of submodules.
  • See my old answer about submodules regarding what you can do within a submodule.

The parent repo records a fixed, known reference (a specific commit) to each submodule repo as a special entry, which means:
when you clone a parent repo, you don't clone by default all the submodules declared in it.

And that could be interesting in your case, as you don't need to clone all the sources in order to make the kind of incremental compilation you mention.
Plus, multiple repos means smaller repos, with commands like checkout, log, diff and status going faster.
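
As a rough illustration of that behavior, the sketch below builds a throwaway parent repo with one submodule, clones the parent, and then fetches only that submodule. All repo names and paths (repoA, super, components/repoA) are hypothetical, and the protocol.file.allow setting is only there because the demo uses local paths as remotes; real URLs do not need it.

```shell
#!/bin/sh
set -e
# Throwaway demo directory; every name below is made up for illustration.
work=$(mktemp -d)
cd "$work"

# Demo-only identity and file-transport settings passed to each git call.
g() { git -c user.name=demo -c user.email=demo@example.com \
          -c protocol.file.allow=always "$@"; }

# 1. A component repo (stands in for RepoA).
g init -q repoA
echo 'int a(void) { return 1; }' > repoA/a.c
g -C repoA add a.c
g -C repoA commit -q -m "RepoA: initial commit"

# 2. The parent repo records RepoA as a submodule pinned to a commit.
g init -q super
echo 'parent repo' > super/README
g -C super add README
g -C super commit -q -m "Super: initial commit"
g -C super submodule --quiet add "$work/repoA" components/repoA
g -C super commit -q -m "Add RepoA as a submodule"

# 3. A plain clone of the parent leaves the submodule directory empty.
g clone -q super super-clone
ls -A super-clone/components/repoA

# 4. Fetch only the submodule that is needed for the current step.
g -C super-clone submodule --quiet update --init components/repoA
ls super-clone/components/repoA
```

Cloning with git clone --recurse-submodules instead would fetch every declared submodule in one go, which is the opposite trade-off.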

2) Assume we divided the main repo into multiple repos using submodules... will this solve the problem we face (dependencies between repos)?

Example: Assume we divide the main core repo into Super, RepoA, RepoB, RepoC etc...
Then will it be possible to compile all these repos together?
Can RepoA access the libraries from the other repos (Super, RepoB, RepoC etc) and vice versa?

The mutual dependencies will still be there, but you would be able to:

  • check out only the repos you need for a given step
  • store the compiled libraries outside of those repos, for repoB or repoC to use.

The goal is to switch from a source-only dependency to a (generated) binary dependency, where repoB can be compiled based on the binaries produced by repoA's compilation step.

VonC answered Sep 24 '22 18:09


You can extract a folder into its own repository (for example, to host it on GitHub) using the following command, where foldername is the folder to extract:

git filter-branch --prune-empty --subdirectory-filter foldername master

This assumes you have already identified which components to extract and have worked out how the build processes will run once the repositories are created.
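
A minimal end-to-end sketch of that command on a throwaway repo might look like this. The repo and folder names (mono, gui, core) are hypothetical, and the split is done on a fresh clone so the original history is left alone.

```shell
#!/bin/sh
set -e
# Throwaway demo; mono, gui and core are made-up names.
work=$(mktemp -d)
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# A stand-in for the monolithic repo, with two component folders.
g init -q "$work/mono"
# Force the branch name to master so it matches the command in the answer.
g -C "$work/mono" symbolic-ref HEAD refs/heads/master
mkdir "$work/mono/gui" "$work/mono/core"
echo 'int gui;'  > "$work/mono/gui/main.cpp"
echo 'int core;' > "$work/mono/core/lib.cpp"
g -C "$work/mono" add .
g -C "$work/mono" commit -q -m "Add gui and core"
echo 'int more;' >> "$work/mono/core/lib.cpp"
g -C "$work/mono" commit -q -a -m "Change core only"

# Split on a fresh clone so the original repository stays untouched.
g clone -q "$work/mono" "$work/gui-only"
# Recent git versions warn and pause before filter-branch; this skips that.
export FILTER_BRANCH_SQUELCH_WARNING=1
g -C "$work/gui-only" filter-branch --prune-empty --subdirectory-filter gui master

# The gui files now sit at the repository root, and only commits
# that touched gui remain in the history.
g -C "$work/gui-only" ls-tree --name-only master
g -C "$work/gui-only" log --oneline master
```

From there, the clone can be pointed at a new empty remote and pushed. Note that current git documentation recommends the separate git filter-repo tool over filter-branch, so this sketch mainly shows what the quoted command does.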

Reference:

  • Splitting a subfolder out into a new repository

bloudraak answered Sep 20 '22 18:09