This question has been asked in various forms on SO and elsewhere, but no answer I was able to find has satisfied me, because none lists the problematic/non-problematic actions/commands, and none gives a thorough explanation of the technical reason for the speed hit.
For instance:
So, I am forced to ask again:
And,
I don't care right now about how to fix that. I only care about which actions' performance takes a hit, and the reasoning behind it according to the current Git architecture.
Edit for clarification:
It is obvious that git clone, for instance, would be O(n) in the size of the repo. However, it is not clear to me that git pull would be the same, because it is theoretically possible to only look at differences.
Git does some non-trivial work behind the scenes, and I am not sure when, or for which commands.
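For what it's worth, here is how I look at where a given command actually spends its time; this is just a measurement sketch, and the branch name is a placeholder:

# Print timing information for the internal phases of a command
GIT_TRACE_PERFORMANCE=1 git pull
GIT_TRACE_PERFORMANCE=1 git checkout some-branch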
Edit2:
I found this article, stating:
If you have large, undiffable files in your repo such as binaries, you will keep a full copy of that file in your repo every time you commit a change to the file. If many versions of these files exist in your repo, they will dramatically increase the time to checkout, branch, fetch, and clone your code.
I don't see why branching should take more than O(1) time, and I am also not sure the list is complete. (For example, what about pulling?)
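To make the distinction I have in mind concrete (the branch name below is just a placeholder):

# Creating a branch only writes a small ref file containing a commit hash
# (with the default ref storage); it does not touch the working tree at all
git branch topic
cat .git/refs/heads/topic

# Checking the branch out is what has to rewrite files in the working tree
git switch topic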
Using submodules: One way out of the problem of large files is to use submodules, which enable you to manage one Git repository within another. You can create a submodule which contains all your binary files, keeping the rest of the code separately in the parent repository, and update the submodule only when necessary.
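A minimal sketch of that layout, with a placeholder URL and path:

# Keep the binary assets in their own repository and mount it as a submodule
git submodule add https://example.com/assets.git assets
git commit -m "Track binary assets as a submodule"

# In a fresh clone, pull in the submodule content only when it is needed
git submodule update --init assets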
A large Git repository can be a repository that contains a large number of files in its head commit. This can negatively affect the performance of virtually all local Git operations. A common mistake that leads to this problem is to add the source code of external libraries to a Git repository.
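As a quick check, the number of files in the head commit can be counted with:

# Count every file tracked in HEAD, including any vendored library sources
git ls-tree -r --name-only HEAD | wc -l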
However, it is not clear to me that git pull would be the same, because it is theoretically possible to only look at differences.
Since Git 2.23 (Q3 2019), it is not O(N), but O(N log(N)): see "Git fetch a branch once with a normal name, and once with capital letter".
The main issue is the log graph traversal: checking what we already have and what we do not (or computing the forced-update status).
That is why, for large repositories, recent Git editions have introduced optimizations around fetch and push commands.
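For instance, here is a sketch of enabling two of those features (both are real Git options; this is illustrative, not a recommendation):

# Maintain a commit-graph file so commit walks do not have to parse every commit object
git config core.commitGraph true
git commit-graph write --reachable

# Skip the potentially expensive forced-update check during fetch (Git 2.23+)
git fetch --no-show-forced-updates
# or make it the default:
git config fetch.showForcedUpdates false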
they will dramatically increase the time to checkout, branch, fetch, and clone

That won't be because those operations are not O(1). It has to do with the size of the large number of binaries to transfer or copy around when doing those operations.
Creating a new branch remains very fast, but switching to it when you have to update those binary files can be slow, simply from an I/O perspective (copying, updating, or deleting large files).
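Two mitigations aimed precisely at that I/O cost, sketched here with a placeholder URL and paths:

# Partial clone: do not download file contents until they are actually needed
git clone --filter=blob:none https://example.com/big-repo.git

# Sparse checkout: only materialize the directories you actually work in
git sparse-checkout init --cone
git sparse-checkout set src docs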
I see two major issues that you have opened for discussion. First, you are asking which Git operations get slower as repos get larger. The answer is that most Git operations will get slower as the repo gets larger, but the operations that will make Git seem noticeably slower are those that interact with the remote repository. It should be intuitive that if the repo bloats, then things like cloning, pulling, and pushing will take longer.
The other issue you have touched on concerns whether or not large binary files should even be committed in the first place. When you make a commit, each file whose content changed is compressed and stored as a new blob object (unchanged files simply point to their existing blobs). Binary files tend not to compress or delta well, so adding large binary files can, over time, cause your repo to bloat. In fact, many teams configure their remote (e.g. GitHub) to block any commits containing large binaries.
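To see how much the object database has grown, and which objects are responsible, a common recipe (not an official Git subcommand, just a pipeline) is:

# Overall size of the object database
git count-objects -vH

# The ten largest objects in the repository, with the path they were added under
git rev-list --objects --all \
  | git cat-file --batch-check='%(objectsize) %(objecttype) %(objectname) %(rest)' \
  | sort -nr \
  | head -n 10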