How does Git determine what objects need to be sent between repositories?

Question

I have looked here but couldn't quite figure out the things I was wondering about: how does git push or git pull figure out what commit objects are missing at the other side?

Let's say we have a repository with the following commits: (letters stand in for SHA-1 IDs, d is refs/heads/master)

a -> b -> c -> d

The remote, in contrast, has these:

a -> e -> f -> g

According to the git document, the remote would tell us that its refs/heads/master is at g, but since we don't know that commit, that doesn't actually tell us anything. How is that enough to figure out the missing data?

In the other direction, the document says:

At this point, the fetch-pack process looks at what objects it has and responds with the objects that it needs by sending “want” and then the SHA-1 it wants. It sends all the objects it already has with “have” and then the SHA-1. At the end of this list, it writes “done” to initiate the upload-pack process to begin sending the packfile of the data it needs:

This explains how the remote would determine what data to send, but wouldn't this impact pull performance on repositories with many objects? If not, what does the text actually mean?

Apparently the data transfer works very differently depending on the direction (push vs. pull). What challenges does this design choice address, and how? And how am I to understand the descriptions in the document?

Schwern · Accepted Answer

The magic is in the IDs. A commit ID is made up of many things, but basically it's a SHA-1 hash of the following:

Content (everything, not just the diff)
Author
Date
Log message
Parent IDs

Change any of these and you need to create a new commit with a new ID. Note that the parent IDs are included.

What does this mean for Git? It means if I tell you I have commit "ABC123" and you have commit "ABC123" we know we have the same commit with the same content, same author, same date, same message and same parents. Those parents have the same ID so they have the same content, same author, same date, same message, and same parents. And so on. If the IDs match, they must have the same history, there's no need to check further down the line. This is one of Git's great strengths, it is woven deeply into its design, and you cannot understand Git without it.
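
To make the "parents are part of the ID" point concrete, here is a minimal sketch of how a commit ID can be computed. It assumes Git's commit object layout (a `commit <size>` header, a NUL byte, then tree, parent, author, committer and message lines). The helper name `commit_id` and the author, date and message values are made up for illustration; the content is represented by a tree ID, here the well-known ID of the empty tree.

```python
import hashlib

def commit_id(tree, parents, author, date, message):
    """SHA-1 over a 'commit <size>' header, a NUL byte, and the commit body."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author} {date}")
    lines.append(f"committer {author} {date}")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

# Illustrative values only.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
WHO = "A U Thor <author@example.com>"
WHEN = "1700000000 +0000"

root = commit_id(EMPTY_TREE, [], WHO, WHEN, "first")
child = commit_id(EMPTY_TREE, [root], WHO, WHEN, "second")
same_child = commit_id(EMPTY_TREE, [root], WHO, WHEN, "second")

# Rewording the root changes its ID, which changes the ID of every descendant.
other_root = commit_id(EMPTY_TREE, [], WHO, WHEN, "first, reworded")
other_child = commit_id(EMPTY_TREE, [other_root], WHO, WHEN, "second")

print(child == same_child)   # True: same inputs, same ID
print(child == other_child)  # False: a different parent gives a different ID
```

Because the parent IDs feed into the hash, two repositories that agree on one commit ID necessarily agree on everything behind it.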

A pull is a fetch plus a merge. With master checked out, git pull origin master is roughly git fetch origin plus git merge origin/master (or a rebase with --rebase). A fetch looks something like this...

remote @ http://example.com/project.git

                  F - G [bugfix]
                 /
A - B - C - D - E - J [master]
                     \
                      H - I [feature]

local
origin = http://example.com/project.git

                  F - G [origin/bugfix]
                 /
A - B - C - D - E [origin/master] [master]

[local] Hey remote, what branches do you have?
[remote] I have bugfix at G.
[local] I also have bugfix at G! Done. What else?
[remote] I have feature at I.
[local] I have neither feature nor I. What are I's parents?
[remote] I's parent is H.
[local] I don't have H. What are H's parents?
[remote] H's parent is J.
[local] I don't have J. What are J's parents?
[remote] J's parent is E.
[local] I have E! Send me J, H and I please.
[remote] Ok, here they come.
[local] (adds J, H and I to the repo and puts origin/feature on I) OK, what else do you have?
[remote] I have master at J.
[local] I have master at E, and you already sent me J. (moves origin/master to J) What else?
[remote] That's it!
[local] Kthxbi
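
The conversation above can be sketched as a walk over parent links. This is only an illustration of the idea, not Git's actual implementation: the graph is the remote from the diagram, and the names `REMOTE_PARENTS`, `REMOTE_BRANCHES` and `missing_commits` are invented for the example.

```python
# Parent links of the remote graph from the diagram above.
REMOTE_PARENTS = {
    "A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"],
    "F": ["E"], "G": ["F"], "J": ["E"], "H": ["J"], "I": ["H"],
}
# Same order as in the conversation: bugfix, feature, master.
REMOTE_BRANCHES = {"bugfix": "G", "feature": "I", "master": "J"}

def missing_commits(tip, have, parents):
    """Walk back from tip and collect commits until reaching ones we have."""
    missing, seen, stack = [], set(), [tip]
    while stack:
        commit = stack.pop()
        if commit in have or commit in seen:
            continue  # a known commit implies its whole history is known
        seen.add(commit)
        missing.append(commit)
        stack.extend(parents[commit])
    return missing

local_have = {"A", "B", "C", "D", "E", "F", "G"}
tracking = {}
for branch, tip in REMOTE_BRANCHES.items():
    new = missing_commits(tip, local_have, REMOTE_PARENTS)
    local_have.update(new)               # "Send me J, H and I please."
    tracking[f"origin/{branch}"] = tip   # move the remote-tracking branch
    print(branch, new)
# bugfix []
# feature ['I', 'H', 'J']
# master []
```

The walk stops at E without ever looking at A through D, which is the point made earlier: a matching ID settles the entire history behind it.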

And now local looks like this...

local
origin = http://example.com/project.git

                  F - G [origin/bugfix]
                 /
A - B - C - D - E [master] - J [origin/master]
                              \
                               H - I [origin/feature]

Then it will run git merge origin/master to finish the pull, which will fast-forward master to J.

A push is similar, except the process goes in reverse (local sends commits to the remote) and, by default, the remote will only accept a fast-forward.
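
As a rough sketch of what "only fast-forward" means: a push is a fast-forward when the remote's current tip is an ancestor of the commit being pushed. The example below uses the two histories from the question (local a, b, c, d and remote a, e, f, g); `is_ancestor` and `accepts_push` are hypothetical helper names, and the real check in Git involves more than this.

```python
def is_ancestor(ancestor, tip, parents):
    """Return True if `ancestor` is reachable from `tip` via parent links."""
    stack, seen = [tip], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return False

def accepts_push(remote_tip, pushed_tip, parents):
    """A fast-forward only moves the branch forward along existing history."""
    return is_ancestor(remote_tip, pushed_tip, parents)

# Both histories from the question, merged into one parent map.
PARENTS = {
    "a": [], "b": ["a"], "c": ["b"], "d": ["c"],
    "e": ["a"], "f": ["e"], "g": ["f"],
}

print(accepts_push("g", "d", PARENTS))  # False: g is not in d's history
print(accepts_push("b", "d", PARENTS))  # True: d builds on b
```

In the question's scenario the histories have diverged after a, so the push of d would be rejected until the local side first integrates g.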

This is what Pro Git refers to as "the dumb protocol", which is used when your remote is a simple HTTP server. The smart protocol is used more often, is far less chatty, and has many optimizations. But you can see how either can be very efficient: there's no need to communicate the whole history, the two sides just need to exchange 20-byte SHA-1 hashes until they find a common ancestor.
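
For comparison, here is a heavily simplified sketch of the "want"/"have" idea behind the smart protocol, using the same graph as the fetch example. It assumes a single round in which the client lists the remote tips it lacks as wants and its own branch tips as haves; the real protocol negotiates incrementally and ships a packfile. The names `reachable`, `negotiate` and `objects_to_send` are invented for illustration.

```python
PARENTS = {
    "A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"],
    "F": ["E"], "G": ["F"], "J": ["E"], "H": ["J"], "I": ["H"],
}

def reachable(tips, parents):
    """All commits reachable from the given tips via parent links."""
    seen, stack = set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

def negotiate(remote_refs, local_tips, local_objects):
    """Client side: want the remote tips we lack, advertise our own tips."""
    wants = sorted({t for t in remote_refs.values() if t not in local_objects})
    haves = sorted(local_tips)
    return wants, haves

def objects_to_send(wants, haves, parents):
    """Server side: everything behind the wants minus everything behind the haves."""
    return reachable(wants, parents) - reachable(haves, parents)

remote_refs = {"bugfix": "G", "feature": "I", "master": "J"}
wants, haves = negotiate(remote_refs, {"G", "E"}, set("ABCDEFG"))
print(wants, haves)                                    # ['I', 'J'] ['E', 'G']
print(sorted(objects_to_send(wants, haves, PARENTS)))  # ['H', 'I', 'J']
```

Here the client only names its branch tips, not every object it holds, and the server works out the difference on its own side.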

Here are some sources and further reading.

Pro Git - Git Transfer Protocols
libgit2 fetch example

Tags: git, git-push, git-pull, git-fetch

Silly Freak
