Pushing to Github after a shallow clone is horribly slow

Tags: git, github

I've done a shallow clone of a large repo (git clone --depth 1 git@github.com:myOrg/myRepo.git). I can push new changes to the remote, but the first push is horribly slow. Subsequent pushes are fine. The output indicates that the first push writes a lot of data to the remote:

$ git checkout -b test && \
  touch tmp.txt && \
  git add tmp.txt && \
  git commit -m tmp && \
  git push origin test

Enumerating objects: 164300, done.
Counting objects: 100% (164300/164300), done.
Delta compression using up to 12 threads
Compressing objects: 100% (93987/93987), done.
Writing objects: 100% (164300/164300), 368.72 MiB | 139.00 KiB/s, done.
Total 164300 (delta 41183), reused 164297 (delta 41182)
remote: Resolving deltas: 100% (41183/41183), done.

This doesn't seem to write anything significant to the local repository, as the size of .git is mostly unchanged.

I'm curious to understand what is happening, and if this process can be improved without increasing the size of the local clone significantly.

Note

This is different from the situation discussed in this question from 2012 where pushing was just not working.

Guig asked Feb 04 '23 14:02


1 Answer

TL;DR: use --depth 2. Read on for why.

Shallow clones can, but do not always, defeat an important optimization. In your case this happens for the first push, but not for subsequent pushes. Other situations can defeat it too, so other pushes might also be slow.

We start with the fact that Git is really all about commits,¹ which are shaped into a Directed Acyclic Graph. The graph has vertices or nodes (whichever term you prefer) that are numbered by commit hash IDs. Relatively immaterial here, but helpful for concreteness, is the fact that the edges / arcs between the nodes are stored as part of the nodes themselves, rather than being kept separately. Each node stores the hash IDs of its predecessor nodes.
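To make the "edges live inside the nodes" point concrete, here is a minimal sketch. It assumes a throwaway repository in a temporary directory; the commit messages and the demo identity are made up for illustration.

```shell
# Throwaway repository with two empty commits (all names here are illustrative).
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name demo
git -C "$repo" config user.email demo@example.com
git -C "$repo" commit -q --allow-empty -m "first"
git -C "$repo" commit -q --allow-empty -m "second"

first=$(git -C "$repo" rev-parse HEAD~1)

# Print the tip commit object; it contains a "parent <hash>" line.
git -C "$repo" cat-file -p HEAD

# Extract the stored parent ID and compare it with the first commit's ID.
parent=$(git -C "$repo" cat-file -p HEAD | sed -n 's/^parent //p')
[ "$parent" = "$first" ] && echo "the edge is stored inside the node"
```

The parent line is part of the commit object itself, which is why there is no separate edge table to consult.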

A repository is, at heart, a database of these commit objects. A complete—non-shallow—repository has the entire graph, from every root to every tip commit. A single-branch clone potentially drops some part of the graph, but never has any "gaps" in the graph. For instance, given:

           node--node--tip1
          /
root--node
          \
           node--node--tip2

we can drop either tip and the nodes on that row, but not the nodes and root on the middle row. In all of these cases, then, we can—as Git always does—start at the tip and work backwards and eventually arrive at the root.
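As a rough illustration of walking from a tip back to the root, the sketch below rebuilds the drawing above in a scratch repository. The branch names tip1 and tip2 come from the drawing; the messages a1, a2, b1, b2 and the helper mk are hypothetical names for this example.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name demo
git -C "$repo" config user.email demo@example.com
# Hypothetical helper: make one empty commit with the given message.
mk() { git -C "$repo" commit -q --allow-empty -m "$1"; }

# Shared middle row: root--node
mk root; mk node
git -C "$repo" branch tip1
git -C "$repo" branch tip2

# Upper row: node--node--tip1
git -C "$repo" checkout -q tip1
mk a1; mk a2; mk tip1

# Lower row: node--node--tip2
git -C "$repo" checkout -q tip2
mk b1; mk b2; mk tip2

# Walk backwards from tip1: the lower row is never visited.
git -C "$repo" log --format=%s tip1
reachable=$(git -C "$repo" rev-list --count tip1)
# The commit with no parents that the walk ends at.
root_id=$(git -C "$repo" rev-list --max-parents=0 tip1)
echo "$reachable commits reachable from tip1, ending at root $root_id"
```

Dropping tip2 and its row would leave this walk unchanged, which is what makes a single-branch clone gap-free.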

Now, there are two properties of each node that are important here:

  • The number is unique. It's a universally unique ID. No node in any other Git repository (that we'll meet anyway) can re-use that ID.

  • The data in the node are strictly read-only. That includes the outgoing edge links.
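Both properties follow from the ID being a hash of the node's contents. A small sketch, assuming a scratch repository with a single made-up commit, shows that re-hashing the commit's bytes reproduces its ID:

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name demo
git -C "$repo" config user.email demo@example.com
git -C "$repo" commit -q --allow-empty -m "first"

id=$(git -C "$repo" rev-parse HEAD)

# Feed the commit's raw contents back through the hash function.
rehash=$(git -C "$repo" cat-file commit HEAD |
         git -C "$repo" hash-object -t commit --stdin)

echo "$id"
echo "$rehash"
```

Changing any byte, including a parent link, would produce a different ID and therefore a different commit, so an existing commit can never be edited in place.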

What this means is that if we have a gap-free repository—one that's either totally complete, or at least as complete as required for the tip commits it contains—on each side of a sender-to-receiver operation, we can have the sending repository simply enumerate for us some set of commits, by their numbers. If we, the receiving repository, lack that commit, we ask the sender to send it and also to tell us the numbers of its parent commit(s). The sender does that, and we see if we have those commits. For any that we lack, we ask the sender to send those and also tell us the numbers of their parent commits, and so on.

This means that if we have:

           node--node--tip1
          /
root--node
          \
           node--node--tip2

and they have:

           node--node--tip1--new
          /
root--node

then they will announce to us the hash ID of commit new. We don't have that one, so they should add it to a pile of commits to send, and announce to us the hash ID of commit tip1 as well. We do have tip1, so we just tell them: "We already have tip1: you do not need to send it."

Here's the optimization: We just told them about every commit we have from tip1 all the way back to root. Our "no thanks, we have that one" reply to their offer "I can send you tip1" tells them about not just the one commit and its files, but also every predecessor commit and all of their files as well.

They now know that when they send us commit new, they need only send tree and blob ("file") objects that do not appear in predecessor commits. Moreover, they can compress these tree and blob objects against tree and blob objects in any previous commit, from tip1 all the way back to root. So the sender can send far less data than would otherwise be required to send the entire commit, with its full snapshot of every file.
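The effect of having a known common commit can be approximated locally with git rev-list --objects, which lists the objects a sender would have to consider. This is only a sketch of the idea, not a trace of the actual push protocol; the file names and the lightweight tag tip1 are made up for the example.

```shell
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.name demo
git -C "$repo" config user.email demo@example.com

# "tip1": a commit with five files.
for i in 1 2 3 4 5; do echo "content $i" > "$repo/file$i.txt"; done
git -C "$repo" add .
git -C "$repo" commit -q -m tip1
git -C "$repo" tag tip1

# "new": one extra file on top of tip1.
echo tmp > "$repo/tmp.txt"
git -C "$repo" add tmp.txt
git -C "$repo" commit -q -m new

# Everything reachable from new: what must be sent when no common commit is known.
all=$(git -C "$repo" rev-list --objects HEAD | wc -l | tr -d ' ')
# Only what is not reachable from tip1: what is left once the receiver says "I have tip1".
delta=$(git -C "$repo" rev-list --objects HEAD --not tip1 | wc -l | tr -d ' ')

echo "no common commit: $all objects; tip1 in common: $delta objects"
```

In this toy repository the difference is 10 objects versus 3 (one commit, one tree, one blob). In a large repository the same ratio is what separates a 368 MiB push from a tiny one.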


¹Tree objects and blob objects are primarily found via commits. Annotated tag objects add one small wrinkle to the picture, and having an annotated tag that points directly to a tree or blob adds another, but neither defeats the standard optimization.


Compare to when the sender is a shallow repository

A shallow repository is one in which some commit(s) are artificially marked as being root commits. The commit object actually has the right parent hash IDs, but the Git in which this commit object lives has a file² that says: "We don't know anything about the parents of commit tip1. Don't try to look for any: pretend, instead, that tip1 is a root commit, with no parents."
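You can inspect this marking yourself. The sketch below assumes a scratch "origin" repository with three made-up commits and clones it over a file:// URL, since --depth is ignored for plain local paths. The --is-shallow-repository option assumes Git 2.15 or later.

```shell
work=$(mktemp -d)
git init -q "$work/origin"
git -C "$work/origin" config user.name demo
git -C "$work/origin" config user.email demo@example.com
for m in root node tip1; do
    git -C "$work/origin" commit -q --allow-empty -m "$m"
done

# Shallow clone; the file:// URL makes Git honor --depth.
git clone -q --depth 1 "file://$work/origin" "$work/shallow"

# Reports "true" for a shallow repository.
git -C "$work/shallow" rev-parse --is-shallow-repository

# The file listing the commits that are treated as parentless.
boundary=$(cat "$work/shallow/.git/shallow")
echo "graft point: $boundary"

# The commit object itself still records its real parent...
git -C "$work/shallow" cat-file -p HEAD | grep '^parent'

# ...but a history walk stops at the graft point.
visible=$(git -C "$work/shallow" rev-list --count HEAD)
echo "commits visible in the shallow clone: $visible"
```

The parent line is still there, but the clone does not have that parent object, so Git treats the listed commit as if it were a root.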

This means the sending Git, instead of having:

           node--node--tip1
          /
root--node

has just:

      slightly-mangled-tip1

In this repository, we add our new commit:

      slightly-mangled-tip1--new

and now we have our Git call up their Git and offer it new commits. First we offer new. They say "I don't have that one, what else can you send?" We would offer slightly-mangled-tip1, but we can't do that because when we read it in, we mangle it. So we say: "Sorry, that's all we have for you."

They say: "Okay then, send us commit new."

On our end, then, we look at commit new. It has a full snapshot of every file. We don't know if they have any of these files. So we pack up the entire thing and send all of it.

They receive it all, unpack it, find that we've duplicated 99% of what they already have, ignore the extra copies, and take the new commit and put it in their repository:

           node--node--tip1--new
          /
root--node
          \
           node--node--tip2

The next time we run git push, we have this:

      slightly-mangled-tip1--new--new2

We offer them new2; they say "I don't have that one, what's its parent?" and we say "new" and they say "oh, I have that one, don't bother sending it." This time, we see that they have almost every file already, so we don't bother sending all those tree and blob objects, and we can compress any new tree and blob objects based on what's in commit new. (We still can't use the slightly-mangled tip1 commit, nor of course any of the missing previous commits, but just being able to eliminate all unchanged files is huge.)


²Or some other mechanism, but currently it is a file named .git/shallow.


What you can do about this

Given that you intend to run git push, you'll get a lot of mileage out of having one un-mangled commit "between" the shallow graft point and a branch tip. So do your git clone with --depth 2. You'll get a slightly bigger client Git repository, but that first git push will go much faster.

That is, you'll start with:

slightly-mangled-node--tip1

on the client. The first new commit will result in:

slightly-mangled-node--tip1--new

and this time your Git will be able to offer tip1 to their Git during the first push, which will trigger the optimization.
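A minimal sketch of both routes, assuming a scratch "origin" repository with four made-up commits: cloning with --depth 2 from the start, or deepening an existing --depth 1 clone by one commit with git fetch --deepen=1. The second route is not part of the advice above; it is a suggestion that assumes Git 2.11 or later.

```shell
work=$(mktemp -d)
git init -q "$work/origin"
git -C "$work/origin" config user.name demo
git -C "$work/origin" config user.email demo@example.com
for m in root node1 node2 tip1; do
    git -C "$work/origin" commit -q --allow-empty -m "$m"
done

# Route 1: clone with two commits of history.
git clone -q --depth 2 "file://$work/origin" "$work/depth2"
depth2=$(git -C "$work/depth2" rev-list --count HEAD)
# The graft point is now the parent of the tip, so the tip itself is intact.
graft=$(cat "$work/depth2/.git/shallow")

# Route 2: extend an existing depth-1 clone by one commit.
git clone -q --depth 1 "file://$work/origin" "$work/depth1"
git -C "$work/depth1" fetch -q --deepen=1
deepened=$(git -C "$work/depth1" rev-list --count HEAD)

echo "depth-2 clone sees $depth2 commits; deepened clone sees $deepened"
```

Either way the branch tip ends up with a real, un-mangled parent relationship, and the extra cost is one more commit's worth of objects that differ from the tip.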

torek answered Feb 07 '23 09:02