
WHAT operations become slow when git repos become large, and WHY?

Tags:

git

This question has been asked in various forms on SO and elsewhere, but no answer I was able to find has satisfied me, because none lists the problematic/non-problematic actions/commands, and none gives a thorough explanation of the technical reason for the speed hit.

For instance:

  • Why can't Git handle large files and large repos
  • Why git operations becomes slow when repo gets bigger
  • Git is really slow for 100,000 objects. Any fixes?

So, I am forced to ask again:

  1. Of the basic git actions (commit, push, pull, add, fetch, branch, merge, checkout), which become slower as the repo grows larger? (NOTE: repos, not files, for this question.)

And,

  2. Why does each action depend on repo size (or not)?

I don't care right now about how to fix that. I only care about which actions' performance gets hit, and why, given Git's current architecture.


Edit for clarification:

It is obvious that git clone, for instance, would be O(n) in the size of the repo.

However, it is not clear to me that git pull would be the same, because it is theoretically possible to look only at the differences.

Git does some non-trivial stuff behind the scenes, and I am not sure when and which.
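
A quick way to observe some of that work, assuming a POSIX shell and a stock Git install, is:

    git count-objects -v -H                         # how large the object database has grown
    GIT_TRACE=1 GIT_TRACE_PERFORMANCE=1 git pull    # trace the commands Git runs internally, with timings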


Edit2:

I found this article, stating

If you have large, undiffable files in your repo such as binaries, you will keep a full copy of that file in your repo every time you commit a change to the file. If many versions of these files exist in your repo, they will dramatically increase the time to checkout, branch, fetch, and clone your code.

I don't see why branching should take more than O(1) time, and I am also not sure the list is complete (for example, what about pulling?).

asked Jul 21 '19 by Gulzar


People also ask

How do I manage a large Git repository?

One way out of the problem of large files is to use submodules, which let you manage one Git repository within another. You can create a submodule that contains all your binary files, keep the rest of the code in the parent repository, and update the submodule only when necessary.
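
A minimal sketch of that setup (the URL and paths below are placeholders, not part of the original answer):

    git submodule add https://example.com/binary-assets.git assets   # separate repo holding the binaries
    git commit -m "Track binary assets as a submodule"

    # later, pull in newer assets only when actually needed:
    git submodule update --remote assets
    git commit -am "Bump assets submodule"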

What is a big Git repo?

A large Git repository can be a repository that contains a large number of files in its head commit. This can negatively affect the performance of virtually all local Git operations. A common mistake that leads to this problem is to add the source code of external libraries to a Git repository.
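
A quick, rough way to check this (sketch, assuming a POSIX shell):

    git ls-files | wc -l        # number of files tracked in the head commit
    git rev-list --count HEAD   # number of commits reachable from the current branch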




2 Answers

However it is not clear to me that git pull would be the same, because it is theoretically possible to only look at differences.

Since Git 2.23 (Q3 2019), it is not O(N), but O(n log(N)): see "Git fetch a branch once with a normal name, and once with capital letter".

The main cost is the history (commit graph) traversal: checking what we have and have not (or computing forced-update status).
That is why, for large repositories, recent Git releases have introduced (a short sketch of enabling some of these locally follows this list):

  • reachability bitmaps,
  • commit-graph files,
  • a loose object cache,
  • commit-graph chains,
  • and pack-file tree discovery for push commands.
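
For reference, a minimal sketch of turning some of these on locally, assuming Git 2.23 or newer (exact flags and config keys vary by version):

    git config core.commitGraph true          # let Git read commit-graph files
    git commit-graph write --reachable        # precompute the commit graph for faster history traversal
    git repack -a -d --write-bitmap-index     # write reachability bitmaps alongside the pack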

they will dramatically increase the time to checkout, branch, fetch, and clone

That is not because those operations become worse than O(1).
It has to do with the size of the many binaries that have to be transferred or copied around when doing those operations.
Creating a new branch remains very fast, but switching to it when you have to update those binary files can be slow, simply from an I/O perspective (copying, updating, and deleting large files).
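
For illustration (the branch names here are hypothetical), creating a branch only writes a tiny ref file, while checking out a commit that differs from the current one has to rewrite the working tree:

    git branch feature-x                 # effectively O(1): writes one small file under .git/refs/heads/
    cat .git/refs/heads/feature-x        # contains a single commit hash
    git checkout feature-x               # cheap here: same commit, so the working tree is untouched
    git checkout old-release             # can be slow: every file that differs must be written or removed on disk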

answered Oct 21 '22 by VonC


I see two major issues which you have opened for discussion. First, you are asking about which Git operations get slower as repos get larger. The answer is, most Git operations will get slower as the repo gets larger. But the operations which would make Git seem noticeably slower are those which involve interacting with the remote repository. It should be intuitive to you that if the repo bloats, then things like cloning, pulling, and pushing would take longer.
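
As a rough illustration (the URL is a placeholder), the operations that talk to the remote are the ones whose cost grows with how much data has to move:

    time git clone https://example.com/big-repo.git   # downloads every reachable object; grows with total history size
    cd big-repo
    time git fetch origin                             # once up to date, mostly ref negotiation; usually quick
    time git push origin HEAD                         # dominated by the size of the new objects being uploaded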

The other issue you have touched on concerns whether large binary files should even be committed in the first place. When you make a commit, each file in the commit is compressed and stored as an object in the repository. Binary files tend not to compress well, so adding large binary files can, over time, cause your repo to bloat. In fact, many teams configure their remote (e.g. GitHub) to block commits containing large binaries.
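
A small experiment makes this concrete (the file name is arbitrary; assumes a Linux shell with GNU coreutils): committing two versions of an incompressible 10 MB file stores two full blobs.

    dd if=/dev/urandom of=model.bin bs=1M count=10
    git add model.bin && git commit -m "binary v1"
    dd if=/dev/urandom of=model.bin bs=1M count=10
    git add model.bin && git commit -m "binary v2"
    git count-objects -v -H    # roughly 20 MB of loose objects; neither version compresses or deltas away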

answered Oct 21 '22 by Tim Biegeleisen