I'm wondering what git is doing when it pushes changes, and why it occasionally seems to push far more data than the changes I've made. I made changes to two files that added around 100 lines of code - less than 2k of text, I'd imagine.
When I went to push that data up to origin, git turned it into over 47 MB of data:
git push -u origin foo
Counting objects: 9195, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (6624/6624), done.
Writing objects: 100% (9195/9195), 47.08 MiB | 1.15 MiB/s, done.
Total 9195 (delta 5411), reused 6059 (delta 2357)
remote: Analyzing objects... (9195/9195) (50599 ms)
remote: Storing packfile... done (5560 ms)
remote: Storing index... done (15597 ms)
To <<redacted>>
* [new branch] foo -> foo
Branch foo set up to track remote branch foo from origin.
When I diff my changes (origin/master..HEAD), only the two files and the one commit I made show up. Where did the 47 MB of data come from?
I saw this: When I do "git push", what do the statistics mean? (Total, delta, etc.) and this: Predict how much data will be pushed in a git push, but neither really told me what's going on. Why would the pack / bundle be so huge?
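One way to check what git actually considers new is to count the objects that are not reachable from the remote-tracking ref. Here is a self-contained sketch; the throw-away repositories under mktemp, the demo identity, and the file name are made up for the demo - in your own repository you would run only the final rev-list command:

```shell
set -e
tmp=$(mktemp -d)
git -c init.defaultBranch=master init -q --bare "$tmp/server.git"  # stand-in "origin"
git -c init.defaultBranch=master clone -q "$tmp/server.git" "$tmp/repo"
cd "$tmp/repo"
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m base
git push -q origin HEAD:refs/heads/master
git fetch -q origin                     # make origin/master known locally
echo demo > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.com commit -q -m change
# Every object (commit, tree, blob) git thinks is new relative to origin/master:
git rev-list --objects origin/master..HEAD | wc -l
```

If this count is small but the push still sends thousands of objects, git's view of what the server has must differ from origin/master, which is exactly the situation the accepted answer below describes.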
Try using the --verbose option to see what actually happens. Even if you made only small changes, some internal behavior might cause git to push a lot more data. Also have a look at git gc: it cleans up your local repository and might speed things up, depending on your issue.
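A quick sketch of those inspection commands (the throw-away repo under mktemp and the demo identity are only for illustration - in practice you run these inside your own repository):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m init
git count-objects -v   # loose-object and pack statistics for the local repo
git gc --quiet         # repack loose objects, prune unreachable ones
git count-objects -v   # compare: objects should now live in a pack
```

Comparing the two count-objects reports shows what gc changed; git push --verbose then reports which refs are being negotiated with the server.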
I just realized that there is a very realistic scenario that can result in an unusually big push.
Which objects does push send? Those that do not yet exist on the server - or rather, those it did not detect as existing. How does it check object existence? At the beginning of a push, the server sends the references (branches and tags) it has. So, for example, if the two sides have the following commits:
CLIENT                        SERVER

(foo) -----------> aaaaa1
                     |
(origin/master) -> aaaaa0     (master) -> aaaaa0
                     |                      |
                    ...                    ...
then the client will get something like refs/heads/master aaaaa0, and find that it only has to send what is new in commit aaaaa1.
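You can see exactly this advertisement with git ls-remote. A self-contained sketch (the bare repo under mktemp stands in for the server, and the demo identity is made up):

```shell
set -e
tmp=$(mktemp -d)
git -c init.defaultBranch=master init -q --bare "$tmp/server.git"   # the "server"
git -c init.defaultBranch=master clone -q "$tmp/server.git" "$tmp/client"
cd "$tmp/client"
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m initial
git push -q origin HEAD:refs/heads/master
# Prints the same ref list the server sends at the start of a push:
git ls-remote origin
```

Each line is a commit id plus a ref name; those commit ids are all the client learns about the server's state before deciding what to send.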
But if somebody has pushed something to the remote master in the meantime, it is different:
CLIENT                        SERVER

(foo) -----------> aaaaa1     (master) -> aaaaa2
                     |                      |
(origin/master) -> aaaaa0               aaaaa0
                     |                      |
                    ...                    ...
Here, the client gets refs/heads/master aaaaa2, but it does not know anything about aaaaa2, so it cannot deduce that aaaaa0 exists on the server. So in this simple case of only two branches, the whole history will be sent instead of just the increment.
This is unlikely to happen in a grown-up, actively developed project, which has tags and many branches, some of which become stale and are no longer updated. Users might still send a bit more than necessary, but the difference does not get as big as in your case and goes unnoticed. In very small teams, though, it can happen more often, and the difference can be significant.
To avoid it, you could run git fetch before pushing. Then, in my example, the aaaaa2 commit would already exist on the client, and git push foo would know that it should not send aaaaa0 and the older history.
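The whole scenario, including the fix, can be replayed locally. A sketch with two throw-away clones (the mktemp paths and demo identity are made up; the branch name foo matches the example above):

```shell
set -e
tmp=$(mktemp -d)
git -c init.defaultBranch=master init -q --bare "$tmp/server.git"
git -c init.defaultBranch=master clone -q "$tmp/server.git" "$tmp/a"
cd "$tmp/a"
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m base   # aaaaa0
git push -q origin HEAD:refs/heads/master
# A second client advances master on the server (the aaaaa2 situation):
git clone -q "$tmp/server.git" "$tmp/b"
git -C "$tmp/b" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m advance                                                 # aaaaa2
git -C "$tmp/b" push -q origin HEAD:refs/heads/master
# Back on the first client: branch foo from aaaaa0, fetch, then push.
git checkout -q -b foo
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m feature  # aaaaa1
git fetch -q origin        # now aaaaa2 (and hence aaaaa0's presence) is known
git push -q -u origin foo  # only the genuinely new objects are sent
```

Without the fetch, the client in this setup would have no common commit with the advertised refs and would have to assume the server has nothing.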
See the Git pack protocol documentation for the details of the push implementation.
PS: the recent git commit-graph feature might help with this, but I have not tried it.