Git push new branch with same files, uploads all files again

Tags:

push

Given the following scenario:

Create a new branch
Commit a 10MB file
Git push (uploads the 10MB file)
Create a new branch (orphan)
Commit the same 10MB file (no changes made, same object SHA hash)
Git push uploads the 10MB file AGAIN

My expectation is that files that were already uploaded won't be uploaded again by git push. What actually happens is that whenever a new branch is created, all files (even thousands of smaller source files instead of one 10MB file) are uploaded again and again.

My question: How can I make Git detect that the 10MB file has already been uploaded? Do you know a workaround/fix that makes Git detect objects that already exist on the server when pushing commits? Git identifies files by their SHA, so it should be able to detect that some files in the tree of the commit are already present on the server.

Possible use-case: I have two completely different branches, but some common files are shared between the two. When I push one branch, I don't want to upload the common files again when I push the second branch.

Actual use-case: I run a lot of machine learning experiments using Python scripts and some smaller datasets (1MB - 10MB). Every time I start an experiment, I add all necessary experiment files to a new Git tree and use that tree in a new commit without branching. That commit is not attached to any branch; it is then referenced by a new Git reference (e.g. refs/jobs/my-experiment-name). When I have two experiments with almost the same files (and thus two references), Git pushes all objects again when I push those references. I have low bandwidth, and this really slows down my work.

$ mkdir git-test && cd git-test
$ git init
$ git remote add origin [email protected]:username/projectname.git

# create dummy 10MB file
$ head -c 10000000 /dev/urandom > dummy

$ git add dummy
$ git commit -m 'init'

# first push, uploads everything - makes sense
$ git push origin master
Counting objects: 3, done.
Delta compression using up to 6 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 9.54 MiB | 1.13 MiB/s, done.
Total 3 (delta 0), reused 0 (delta 0)

# create new empty branch, not based on master
$ git checkout --orphan branch2

# add same files again
$ git add dummy
$ git commit -m 'init on branch2'

# this uploads the dummy file (10MB) again, although the server
# already has that object
$ git push origin branch2
Counting objects: 3, done.
Delta compression using up to 6 threads.
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 9.54 MiB | 838.00 KiB/s, done.

On the technical side we have:

Two commits that do not share the same parents (have completely different history)
Those two commits have the exact same tree sha id (and thus reference the same object files)
Pushing both commits transfers all the objects in the same tree twice, although I would expect Git to detect either that the tree of the second commit is already present OR that the file objects within that tree are already on the server.
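The claim that both commits reference the same file object can be checked without a repository. The following is a minimal sketch in Python, assuming Git's default SHA-1 object format, where a blob id is the SHA-1 of a "blob <size>\0" header followed by the file content. The function name git_blob_id is made up for illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The id depends only on the bytes, not on the branch, commit or file name.
on_master = git_blob_id(b"hello\n")
on_branch2 = git_blob_id(b"hello\n")
print(on_master == on_branch2)  # True
```

The result for a real file can be compared with the output of git hash-object dummy.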

Answer (I can't answer anymore, since someone marked this as duplicate).

The solution is unfortunately not that simple.

Every time Git syncs two repositories, it builds a pack file that contains all necessary objects (files, commits, trees). When you execute git push, the remote sends all of its existing references (branches) and their head commit SHAs to the client. This is the problem: the pack protocol is not meant to be used per object, but per commit. The client only learns which commits the server has, not which individual trees or files it has. So, according to the protocol itself, the behaviour described above is correct. To work around that, I built a simple script that anyone can use to do a git push based on objects instead of commits.
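To make the per-commit behaviour concrete, here is a deliberately simplified model in Python. It is a sketch of the behaviour described above, not Git's actual implementation, and the names Commit and objects_to_send as well as the ids c1, c2, tree-T and blob-dummy are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    sha: str
    parents: tuple        # tuple of Commit objects
    objects: frozenset    # ids of the tree and blobs this commit references


def objects_to_send(tip, server_commits):
    """Walk history from `tip`, stopping at commits the server already has.

    Every object referenced by a commit the server lacks is sent;
    trees and blobs are never checked individually (toy model).
    """
    to_send = set()
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit.sha in server_commits or commit.sha in seen:
            continue
        seen.add(commit.sha)
        to_send.add(commit.sha)
        to_send |= commit.objects
        stack.extend(commit.parents)
    return to_send


master = Commit("c1", (), frozenset({"tree-T", "blob-dummy"}))
orphan = Commit("c2", (), frozenset({"tree-T", "blob-dummy"}))

# The server knows commit c1 (master) but has never seen c2.
print(sorted(objects_to_send(orphan, server_commits={"c1"})))
# ['blob-dummy', 'c2', 'tree-T']
```

In this model the orphan commit shares no history with c1, so its tree and blob are sent again even though identical objects already sit on the server.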

You can find it here: https://github.com/marcj/git-objects-sync

What it does:

Takes one commit (only one; you need to run it on every unsynced parent commit as well) and builds a list of the object SHAs (files, trees, commits) that belong to that commit (excluding the parent commit).
Sends this list to the server; the server answers with the SHAs of the objects it does not have yet.
The client builds a pack file from the missing object SHAs and sends it to the server, together with the information about which ref needs to be updated to which commit.
The server receives the pack file, unpacks it and updates the ref to the given commit SHA.
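The four steps above can be sketched as a small in-memory simulation. This does not reproduce the code of git-objects-sync; the functions server_missing and sync_commit and all ids are hypothetical, and a plain list stands in for the pack file.

```python
def server_missing(candidate_shas, server_objects):
    """Server side: answer with the SHAs it does not have yet."""
    return [sha for sha in candidate_shas if sha not in server_objects]


def sync_commit(commit_sha, commit_objects, ref, server_objects, server_refs):
    """Client side: offer all SHAs, send only what the server asks for."""
    candidates = [commit_sha, *commit_objects]
    missing = server_missing(candidates, server_objects)
    pack = list(missing)             # stands in for a real pack file
    server_objects.update(pack)      # server "unpacks" the pack
    server_refs[ref] = commit_sha    # server updates the ref
    return pack


# State after the first push: the server has commit c1, its tree and the blob.
server_objects = {"c1", "tree-T", "blob-dummy"}
server_refs = {"refs/heads/master": "c1"}

sent = sync_commit(
    "c2", ["tree-T", "blob-dummy"], "refs/heads/branch2",
    server_objects, server_refs,
)
print(sent)  # ['c2']
```

Only the new commit object is transferred, because the tree and the blob are already known to the server.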

Of course this has some drawbacks, which I describe in the linked GitHub repository.

With the script above you now get the following:

marc@osx ~/git-test (branch11*) $ # added new branch11 as explained at the very top
marc@osx ~/git-test (branch11*) $ python git-sync.py refs/heads/branch11
Counting objects: 1, done.
Writing objects: 100% (1/1), 158 bytes | 158.00 KiB/s, done.
Total 1 (delta 0), reused 0 (delta 0)
marc@osx ~/git-test (branch11*) $ git push origin branch11
Everything up-to-date

So as you see, it only syncs one object (the commit object), and not the dummy file and its tree object again.


asked Jan 12 '18 14:01

Marc J. Schmidt

1 Answer

I think you just need to stop using --orphan to create new experiment branches.

Workflow

Create your initial project.
Add and commit your core/common files to the master branch.
Create all the non-orphaned branches you want for every experiment. Create them based on the master branch.

That's it.

What's going on?

You have insisted that you aren't using branches and that you are only using references. However, branches are a kind of reference. Moreover, git checkout --orphan <newthing> does actually create a branch. The trouble is that it's a branch that doesn't know about anything previously added to the repository, because it has no parents. It's essentially the same as having created a whole new repository.

If you create new branches with git checkout -b <newthing> master, then git will not bother uploading files that were already in master.

How do you manage new common files now?

Let's say that someday you have a new shared/common file that you want all future experiments to make use of. All you would need to do is add that file to master and create your next experiment branch based on the updated master branch. If you want that file to be available to your existing/previously created experiments, you would just need to check out those branches and run git pull --rebase origin master. This would pull in the commits you added to master, which would contain the newly added file(s).
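For someone scripting experiments in Python, the two operations described here could be wrapped roughly as follows. This is only a sketch: the helper names and the branch names experiment-1 and experiment-2 are invented, and by default the commands are printed rather than executed.

```python
import subprocess


def new_experiment_cmd(name, base="master"):
    """Command that creates an experiment branch on top of `base`."""
    return ["git", "checkout", "-b", name, base]


def update_experiment_cmds(name, remote="origin", base="master"):
    """Commands that bring an existing experiment branch up to date."""
    return [
        ["git", "checkout", name],
        ["git", "pull", "--rebase", remote, base],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run_all([new_experiment_cmd("experiment-2")])
run_all(update_experiment_cmds("experiment-1"))
```

Passing dry_run=False runs the commands in the current repository and stops at the first one that fails.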

Mounting Complexity

When you start doing pulls, things might start getting complicated. There are a couple of different strategies for updating branches, and using --rebase is one of them. It's not required, but it's probably the better way to go. There are additional things to consider, such as how to manage conflicting changes, but those are outside the scope of this question. There are plenty of resources available that explain rebasing, merging, etc.

TL;DR

Don't try to manage commit-trees and parent/child relationships manually. Just let git do its thing.


answered Oct 15 '22 01:10

eddiemoya
