I have looked here but couldn't quite figure out the things I was wondering about: how does git push or git pull figure out which commit objects are missing on the other side?
Let's say we have a repository with the following commits (letters stand in for SHA-1 IDs, d is refs/heads/master):
a -> b -> c -> d
The remote, in contrast, has these:
a -> e -> f -> g
According to the git document, the remote would tell us that its refs/heads/master is at g, but since we don't know that commit, that doesn't actually tell us anything. How is that enough to figure out the missing data?
In the other direction, the document says:
At this point, the fetch-pack process looks at what objects it has and responds with the objects that it needs by sending “want” and then the SHA-1 it wants. It sends all the objects it already has with “have” and then the SHA-1. At the end of this list, it writes “done” to initiate the upload-pack process to begin sending the packfile of the data it needs:
This explains how the remote would determine what data to send, but wouldn't sending every object we already have hurt pull performance on repositories with many objects? If that is not what happens, what does the text actually mean?
Apparently the data transfer works very differently depending on the direction (push vs pull). Which challenges does this design choice address, and how? And how am I to understand their descriptions in the document?
The magic is in the IDs. A commit ID is made up of many things, but basically it's a SHA-1 hash of the commit's content, author, date, log message, and parent IDs.
Change any of these and you need to create a new commit with a new ID. Note that the parent IDs are included.
What does this mean for Git? It means if I tell you I have commit "ABC123" and you have commit "ABC123" we know we have the same commit with the same content, same author, same date, same message and same parents. Those parents have the same ID so they have the same content, same author, same date, same message, and same parents. And so on. If the IDs match, they must have the same history, there's no need to check further down the line. This is one of Git's great strengths, it is woven deeply into its design, and you cannot understand Git without it.
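To make the chaining concrete, here is a minimal Python sketch. It follows Git's object hashing scheme (SHA-1 over a "commit <size>" header, a NUL byte, then the body), but the body is simplified: real commits also carry an email, a timestamp with timezone, and a committer line, and the tree IDs, author, and dates below are made-up placeholders. The function name commit_id is my own, not part of Git.

```python
import hashlib

def commit_id(tree, parents, author, date, message):
    """Hash a simplified commit body the way Git hashes objects."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # parent IDs are part of the hashed data
    lines += [f"author {author} {date}", "", message]
    body = "\n".join(lines).encode()
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

# Placeholder values, for illustration only
root = commit_id("tree-1", [], "Alice", "2020-01-01", "first")
child = commit_id("tree-2", [root], "Alice", "2020-01-02", "second")

# Rewording the parent's message changes the parent's ID...
other_root = commit_id("tree-1", [], "Alice", "2020-01-01", "first, reworded")
# ...and therefore the ID of an otherwise identical child.
other_child = commit_id("tree-2", [other_root], "Alice", "2020-01-02", "second")

print(child == commit_id("tree-2", [root], "Alice", "2020-01-02", "second"))  # True
print(child == other_child)  # False
```

Because each ID covers the parent IDs, two repositories that agree on one ID agree on everything behind it.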
A pull is a fetch plus a merge. git pull origin master is roughly git fetch origin plus git merge origin/master on your local master (or a rebase with --rebase). A fetch looks something like this...
remote @ http://example.com/project.git
F - G [bugfix]
/
A - B - C - D - E - J [master]
\
H - I [feature]
local
origin = http://example.com/project.git
F - G [origin/bugfix]
/
A - B - C - D - E [origin/master] [master]
And now local looks like this...
local
origin = http://example.com/project.git
F - G [origin/bugfix]
/
A - B - C - D - E [master] - J [origin/master]
\
H - I [origin/feature]
Then it will do git merge origin/master to finish the pull, which will fast-forward master to J.
A push is similar, except the process goes in reverse (local sends commits to the remote) and it will only fast-forward.
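The fast-forward rule can be sketched as an ancestry check: the push is accepted only if the commit the remote branch currently points at is an ancestor of the commit being pushed. This is a rough Python illustration, not Git's actual implementation; is_ancestor and push_allowed are names I made up, and the graph is the local a -> b -> c -> d history from the question.

```python
def is_ancestor(parents, ancestor, commit):
    """Return True if `ancestor` is reachable from `commit` via parent links."""
    stack = [commit]
    seen = set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        # Commits we know nothing about simply have no parents to follow
        stack.extend(parents.get(current, []))
    return False

def push_allowed(parents, remote_tip, local_tip):
    """A push is a fast-forward if the remote's tip is in the local tip's history."""
    return is_ancestor(parents, remote_tip, local_tip)

# Local history from the question: a -> b -> c -> d
local_parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}

print(push_allowed(local_parents, "b", "d"))  # True: remote is simply behind
print(push_allowed(local_parents, "g", "d"))  # False: g is not in d's history
```

In the question's scenario the remote master is at g, which local has never seen, so a plain push would be rejected until local fetches and merges (or rebases).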
This commit-by-commit walk is what Pro Git refers to as "the dumb protocol", and it is used when your remote is a simple HTTP server. The Smart Protocol is used more often, is far less chatty, and has many optimizations. But you can see how either can be very efficient. There's no need to communicate the whole history; the two sides just need to exchange 20-byte hash keys until they find a common ancestor.
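Applied to the example in the question (local has a -> b -> c -> d, remote has a -> e -> f -> g), the walk back to a common ancestor might look like the Python sketch below. It is a simplification of the idea, not the real wire protocol, and missing_commits is a hypothetical helper name.

```python
def missing_commits(remote_parents, remote_tips, local_commits):
    """Walk back from the remote's tips, stopping at commits local already has."""
    missing = []
    seen = set()
    stack = list(remote_tips)
    while stack:
        commit = stack.pop()
        if commit in local_commits or commit in seen:
            continue  # known commit: its whole history is known too
        seen.add(commit)
        missing.append(commit)
        stack.extend(remote_parents[commit])
    return missing

# Remote history from the question: a -> e -> f -> g, with master at g
remote_parents = {"a": [], "e": ["a"], "f": ["e"], "g": ["f"]}
# Local history from the question: a -> b -> c -> d
local_commits = {"a", "b", "c", "d"}

print(missing_commits(remote_parents, ["g"], local_commits))  # ['g', 'f', 'e']
```

The walk stops at a, the first commit both sides share, so only three IDs are exchanged no matter how long the history behind a is.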