 

Impact of large number of branches in a git repo?

Tags:

git

Does anyone know what the impact is of a git repo that has a lot of branches (2000+)? Does git pull or git fetch slow down due to having that many branches? Please provide benchmarks if there is a difference.

Asked Mar 04 '15 by ajma



4 Answers

As others have pointed out, branches and other refs are just files in the file system (except that's not quite true because of packed refs) and are pretty cheap, but that doesn't mean their number can't affect performance. See e.g. the Poor push performance with large number of refs thread on the Git mailing list for a recent (Dec 2014) example of Git performance being affected by having 20k refs in a repository.

If I recall correctly, some part of the ref processing was O(n²) a few years ago but that can very well have been fixed since. There's a repo-discuss thread from March 2012 that contains some potentially useful details, if perhaps dated and specific to JGit.

The also somewhat dated Scaling Gerrit article talks about (among other things) potential problems with high ref counts, but also notes that several sites have gits with over 100k refs. We have a git with ~150k refs and I don't think we're seeing any performance issues with it.

One aspect of having lots of refs is the size of the ref advertisement at the start of some Git transactions. The size of the advertisement for the aforementioned 150k-ref git is about 10 MB, i.e. every single git fetch operation is going to download that amount of data.
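If you want a rough feel for how large that advertisement is for your own remote, listing the remote refs gives a comparable amount of data (this is only an approximation, since the real advertisement adds capabilities and protocol framing; "origin" here is just the usual remote name):

git ls-remote origin | wc -c    # approximate bytes of ref data the server sends
git ls-remote origin | wc -l    # number of advertised refs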

So yes, don't ignore the issue completely but you shouldn't lose any sleep over a mere 2000 refs.

Answered by Magnus Bäck


March 2015: I don't have benchmarks, but one way to ensure a git fetch remains reasonable even if the upstream repo has a large set of branches would be to specify a less general refspec than the default one.

fetch = +refs/heads/*:refs/remotes/origin/*

You can add as many fetch refspecs to a remote as you want, effectively replacing the catch-all refspec above with more specific ones that include just the branches you actually need (even though the remote repo has thousands of them):

fetch = +refs/heads/master:refs/remotes/origin/master
fetch = +refs/heads/br*:refs/remotes/origin/br*
fetch = +refs/heads/mybranch:refs/remotes/origin/mybranch
....
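For example, assuming the remote is called origin and the branch names below are placeholders, the same narrowing can be done from the command line instead of editing .git/config by hand:

git config --unset-all remote.origin.fetch                                       # drop the catch-all refspec
git config --add remote.origin.fetch "+refs/heads/master:refs/remotes/origin/master"
git config --add remote.origin.fetch "+refs/heads/mybranch:refs/remotes/origin/mybranch"
git fetch origin                                                                 # now only fetches the listed branches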

April 2018: git fetch will improve with Git 2.18 (Q2 2018).

See commit 024aa46 (14 Mar 2018) by Takuto Ikuta (atetubou).
(Merged by Junio C Hamano -- gitster -- in commit 5d806b7, 09 Apr 2018)

fetch-pack.c: use oidset to check existence of loose object

When fetching from a repository with a large number of refs, 'git fetch' checks the existence of each ref against packed and loose objects in the local repository, and ends up doing a lot of lstat(2) calls for non-existing loose objects, which makes it slow.

Instead of making one lstat(2) call per ref the remote side advertised to see whether those objects exist in loose form, first enumerate all the existing loose objects into a hashmap and use it for the existence check, but only when the number of refs is larger than the number of loose objects.

With this patch, the number of lstat(2) calls in git fetch is reduced from 411412 to 13794 for the chromium repository, which has more than 480000 remote refs.

I measured the time of git fetch (when fetch-pack happens) for the chromium repository 3 times on Linux with an SSD.

* with this patch
8.105s
8.309s
7.640s
avg: 8.018s

* master
12.287s
11.175s
12.227s
avg: 11.896s

On my MacBook Air, which has slower lstat(2):

* with this patch
14.501s

* master
1m16.027s

git fetch on a slow disk is improved substantially.
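If you want to see the effect on your own machine, counting syscalls during a fetch is one way to do it (Linux only, and strace must be installed; note that a modern libc may issue newfstatat or statx rather than lstat, so look at the whole stat family in the summary):

strace -f -c git fetch origin 2>&1 | tail -n 30    # per-syscall call counts, including the stat family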


Note that the hashmap used in packfile.c is itself improved with Git 2.24 (Q4 2019).

See commit e2b5038, commit 404ab78, commit 23dee69, commit c8e424c, commit 8a973d0, commit 87571c3, commit 939af16, commit f23a465, commit f0e63c4, commit 6bcbdfb, commit 973d5ee, commit 26b455f, commit 28ee794, commit b6c5241, commit b94e5c1, commit f6eb6bd, commit d22245a, commit d0a48a0, commit 12878c8, commit e010a41 (06 Oct 2019) by Eric Wong (ele828).
Suggested-by: Phillip Wood (phillipwood).
(Merged by Junio C Hamano -- gitster -- in commit 5efabc7, 15 Oct 2019)

For example:

packfile: use hashmap_entry in delta_base_cache_entry

Signed-off-by: Eric Wong
Reviewed-by: Derrick Stolee

This hashmap_entry_init function is intended to take a hashmap_entry struct pointer, not a hashmap struct pointer.

This was not noticed because hashmap_entry_init takes a "void *" arg instead of "struct hashmap_entry *", and the hashmap struct is larger and can be cast into a hashmap_entry struct without data corruption.

This has the beneficial side effect of reducing the size of a delta_base_cache_entry from 104 bytes to 72 bytes on 64-bit systems.


Before Git 2.29 (Q4 2020), there was logic to estimate how many objects are in the repository, which is meant to run once per process invocation, but it ran every time the estimated value was requested.

This is faster with Git 2.29:

See commit 67bb65d (17 Sep 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 221b755, 22 Sep 2020)

packfile: actually set approximate_object_count_valid

Reported-by: Rasmus Villemoes
Signed-off-by: Jeff King

The approximate_object_count() function tries to compute the count only once per process. But ever since it was introduced in 8e3f52d778 (find_unique_abbrev: move logic out of get_short_sha1(), 2016-10-03, Git v2.11.0-rc0), we failed to actually set the "valid" flag, meaning we'd compute it fresh on every call.

This turns out not to be too bad, because we're only iterating through the packed_git list, and not making any system calls. But since it may get called for every abbreviated hash we output, even this can add up if you have many packs.

Here are before-and-after timings for a new perf test which just asks rev-list to abbreviate each commit hash (the test repo is linux.git, with commit-graphs):

Test                            origin              HEAD
----------------------------------------------------------------------------
5303.3: rev-list (1)            28.91(28.46+0.44)   29.03(28.65+0.38) +0.4%
5303.4: abbrev-commit (1)       1.18(1.06+0.11)     1.17(1.02+0.14) -0.8%
5303.7: rev-list (50)           28.95(28.56+0.38)   29.50(29.17+0.32) +1.9%
5303.8: abbrev-commit (50)      3.67(3.56+0.10)     3.57(3.42+0.15) -2.7%
5303.11: rev-list (1000)        30.34(29.89+0.43)   30.82(30.35+0.46) +1.6%
5303.12: abbrev-commit (1000)   86.82(86.52+0.29)   77.82(77.59+0.22) -10.4%
5303.15: load 10,000 packs      0.08(0.02+0.05)     0.08(0.02+0.06) +0.0%  

It doesn't help at all when we have 1 pack (5303.4), but we get a 10% speedup when there are 1000 packs (5303.12).
That's a modest speedup for a case that's already slow and we'd hope to avoid in general (note how slow it is even after, because we have to look in each of those packs for abbreviations). But it's a one-line change that clearly matches the original intent, so it seems worth doing.

The included perf test may also be useful for keeping an eye on any regressions in the overall abbreviation code.
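For reference, the benchmark boils down to asking rev-list to print abbreviated hashes, roughly along these lines (a rough reconstruction; the actual test is p5303 in Git's t/perf suite, and the pack counts are set up by the test harness):

git rev-list HEAD | head -n 5                               # full hashes, no abbreviation work
git rev-list --abbrev-commit --abbrev=7 HEAD | head -n 5    # forces a unique-abbreviation lookup per commit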

Answered by VonC


Yes, it does. Locally, it's not much of a problem, though it does still affect several local commands; in particular, when you are trying to describe a commit based on the available refs.

Over the network, Git does an initial ref advertisement when you connect to it for updates. You can learn about this in the pack protocol document. The problem here is that your network connection may be flaky or high-latency, and that initial advertisement can take a while as a result. There have been discussions about removing this requirement, but, as always, compatibility issues make it complicated. The most recent discussion about it is here.
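You can watch that exchange yourself; Git can dump the protocol packets it sends and receives, and the early packets show what the server advertises before any objects are transferred (origin is just the usual remote name):

GIT_TRACE_PACKET=1 git fetch origin 2>&1 | head -n 20    # first packets of the fetch conversation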

You probably want to look at a recent discussion about Git scaling too. There are many ways in which you may want Git to scale, and that thread has covered the majority of them so far. I think it gives you a good idea of what Git is good at, and where it could use some work. I'd summarize it for you, but I don't think I could do it justice. There's a lot of useful information there.

Answered by John Szakmeister


In order to answer your question, you should know how Git handles branches. What are branches?

A branch is only a reference to a commit in the local repo, so creating branches is very cheap. The .git directory contains the metadata that Git uses; when you create a branch, a reference to the new local branch is created along with its history log (reflog). In other words, creating branches means creating files and references, and the file system can easily handle 2000 files.
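A quick way to see this on disk (assuming a SHA-1 repository; "demo" is just a placeholder branch name):

git branch demo              # create a branch pointing at the current commit
cat .git/refs/heads/demo     # a loose ref: a tiny text file holding a commit hash
git pack-refs --all          # move loose refs into the single packed-refs file
head .git/packed-refs        # same information, now stored in one file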

I advise you to go through 3.1 Git Branching - Branches in a Nutshell; it contains information that might help you better understand how branches are handled.

Answered by Maroun