
Does git upload a changed large file entirely to the remote, or can it upload just the differences?

Tags:

git

Suppose I have a big text file that changes in some parts periodically. I want to keep it synchronized with its remote version on a Git server, preferably by uploading only its changed portions.

What's the default behavior of git? Does git upload the entire file each time it has been changed, or is there an option to upload just the differences?

What about non-text (binary) files?

Thanks

asked Mar 03 '23 by DummyBeginner

1 Answer

Does git upload the entire file each time it has been changed, or is there an option to upload just the differences?

The answer to this is actually "it depends".

The system you're describing—where we say "given existing file F, use the first part of F, then insert or delete this bit, then use another part of F" and so on—is called delta compression or delta encoding.

As Tim Biegeleisen answered, Git stores—logically, at least—a complete copy of each file with each commit (but with de-duplication, so if commits A and B both store the same copy of some file, they share a single stored copy). Git calls these stored copies objects. However, Git can do delta-compression of these objects within what Git calls pack files.
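You can observe this in an existing repository. The sketch below uses standard Git commands; the pack file name is a placeholder you'd replace with whatever ls shows:

    # List the repository's pack files.
    ls .git/objects/pack/

    # Inspect one pack (substitute the real pack-<hash>.idx name).
    # verify-pack -v prints one line per object: hash, type, size,
    # packed size, and offset; deltified objects additionally show
    # their chain depth and the hash of their delta base.
    git verify-pack -v .git/objects/pack/pack-<hash>.idx | head -20

Objects whose lines end with a depth and a base hash are stored as deltas; the rest are stored whole (though still zlib-compressed).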

When one Git needs to send internal objects to another Git, to supply commits and their files, it can either:

  • send the individual objects one by one, or
  • send a pack file containing packed versions of the objects.

Git can only use delta-compression here if you use a Git protocol that sends a pack file. You can easily tell if you're using pack files because after git push you will see:

    Counting objects: ... done
    Compressing objects: ... done

This compressing phase occurs while building the pack file. There's no guarantee that, when Git compressed an object, it specifically used delta-compression against some version of that object that the other Git already has. But that's the goal, and it usually will be the case (except for a bug introduced in Git 2.26 and fixed in Git 2.27).
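With a recent Git, the push output is a bit more detailed and even reports how many objects went out as deltas (the numbers below are purely illustrative):

    Enumerating objects: 7, done.
    Counting objects: 100% (7/7), done.
    Delta compression using up to 8 threads
    Compressing objects: 100% (3/3), done.
    Writing objects: 100% (4/4), 412 bytes | 412.00 KiB/s, done.
    Total 4 (delta 2), reused 0 (delta 0)

Here "Total 4 (delta 2)" means four objects were sent, two of them as deltas against some base.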

Technical details, for the curious

There is a general rule about pack files that git fetch and git push explicitly violate. To really understand how this all works, though, we should first describe this general rule.

Pack files

Git has a program (and various internal functions that can be used more directly if/as needed) that builds a new pack file using just a set of raw objects, or some existing pack file(s), or both. In any case, the rule to be used here is that the new pack file should be completely self-contained. That is, any object inside pack file PF can only be delta-compressed against other objects that are also inside PF. So given a set of objects O1, O2, ..., On, the only delta-compression allowed is to compress some Oi against some Oj that appears in this same pack file.
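You can invoke this rule-following packer yourself. The following is a standard invocation (no placeholders here, though what it achieves depends on your repository's current state):

    # Repack all reachable objects into a single self-contained
    # pack: -a packs everything, -d deletes the now-redundant old
    # packs and loose objects afterwards.
    git repack -a -d

    # Verify: "count" (loose objects) should drop to zero or near
    # it, with "in-pack" accounting for everything else.
    git count-objects -v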

At least one object is always a base object, i.e., is not compressed at all. Let's call this object Ob1. Another object can be compressed against Ob1, producing a new compressed object Oc1. Then another object can be compressed against either Ob1 directly, or against Oc1. Or, if the next object doesn't seem to compress well against Ob1 after all, it can be another base object, Ob2. Assuming the next object is compressed, let's call it Oc2. If it's compressed against Oc1, this is a delta chain: to decompress Oc2, Git will have to read Oc2, see that it links to Oc1, read Oc1, see that it links to Ob1, and retrieve Ob1. Then it can apply Oc1's decompression rules to get the decompressed Oc1, and then the decompression rules for Oc2.

Since all these objects are in a single pack file, Git only needs to hold one file open. However, decompressing a very long chain can require a lot of jumping around in the file, to find the various objects and apply their deltas. The delta chain length is therefore limited. Git also tries to place the objects, physically within the pack file, in a way that makes reading the (single) pack file efficient, even with the implied jumping-around.

To obey all these rules, Git sometimes builds an entirely new pack file of every object in your repository, but only now and then. When building this new pack file, Git uses the previous pack file(s) as a guide that indicates which previously-packed objects compress well against which other previously-packed objects. It then only has to spend a lot of CPU time looking at new (since previous-pack-file) objects, to see which ones compress well and therefore which order it should use when building chains and so on. You can turn this off and build a pack file entirely from scratch, if some previous pack file was (by whatever chance) poorly constructed, and git gc --aggressive does this. You can also tune various sizes: see the options for git repack.
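For instance (these are real git options; the specific numbers are only illustrations, not tuning advice):

    # Rebuild packs from scratch, discarding previously computed
    # deltas, at considerable CPU cost:
    git gc --aggressive

    # Or repack directly: --depth caps the delta chain length,
    # --window sets how many candidate objects the compressor
    # compares each object against, and -f forces it to recompute
    # deltas instead of reusing existing ones.
    git repack -a -d -f --depth=50 --window=10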

Thin packs

For git fetch and git push, the pack building code turns off the "all objects must appear in the pack" option. Instead, the delta compressor is informed that it should assume that some set of objects exist. It can therefore use any of these objects as a base-or-chain object. The assumed-to-exist objects must be findable somewhere, somehow, of course. So when your Git talks to the other Git, they talk about commits, by their hash IDs.

If you are pushing, your Git is the one that has to build a pack file; if you're fetching, this works the same with the sides swapped. Let's assume you are pushing here.

Your Git tells theirs: I have commit X. Their Git tells yours: I too have X or I don't have X. If they do have X, your Git immediately knows two things:

  1. They also have all of X's ancestors.[1]
  2. Therefore they have all of X's tree and blob objects, plus all of its ancestors' tree and blob objects.

Obviously, if they do have commit X, your Git need not send it. Your Git will only send descendants of X (commits Y and Z, perhaps). But by item 2 above, your Git can now build a pack file where your Git just assumes that their Git has every file that is in all the history leading up to, and including, commit X.
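If you're curious, you can watch this conversation happen. The negotiation is easiest to see on the fetch side, and Git will dump the raw protocol packets, including the want/have lines, if you set a tracing variable:

    # Print every protocol packet exchanged during a fetch; the
    # "want <hash>" and "have <hash>" lines are the negotiation.
    GIT_TRACE_PACKET=1 git fetch origin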

So this is where the "assume objects exist" code really kicks in: if you modified files F1 and F2 in commits Y and Z, but didn't touch anything else, they don't need any of the other files—and your new F1 and F2 files can be delta-compressed against any object in commit X or any of its ancestors.
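This, incidentally, is the job of git pack-objects, and you can approximate what push does by hand. In this rough sketch, X and Y stand for the commit hashes from the discussion above:

    # Build a thin pack holding what Y adds on top of X. --thin
    # permits deltas against objects reachable from X even though
    # those objects are omitted from the pack itself.
    printf '%s\n' Y ^X | git pack-objects --revs --thin --stdout > thin.pack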

The resulting pack file is called a thin pack. Having built the thin pack, your push (or their responder to your fetch) sends the thin pack across the network. They (for your push, or you for your fetch) must now "fix" this thin pack, using git index-pack --fix-thin. Fixing the thin pack is simply a matter of opening it up, finding all the delta chains and their object IDs, and finding those objects in the repository—remember, we've guaranteed that they are findable somewhere—and putting those objects into the pack, so that it's no longer thin.
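Continuing the sketch from above, the receiving side would fatten the pack something like this (again, an approximation of what the real transports do internally):

    # In the receiving repository, which must already have X's
    # objects: index the thin pack and append the missing delta
    # bases so that the pack becomes self-contained again.
    git index-pack --stdin --fix-thin < thin.pack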

Multiple pack files

The fattened packs are as big as they have to be, to hold all the objects they need to hold. But they're no bigger than that—they don't hold every object, only the ones they need to hold. So the old pack files remain.

After a while, a repository builds up a large number of pack files. At this point, Git decides that it's time to slim things down, re-packing multiple pack files into one single pack file that will hold everything. This allows it to delete redundant pack files entirely.[2] The default for this is 50 pack files, so once you've accumulated 50 individual packs—typically via 50 fetch or push operations—git gc --auto will invoke the repack step and you'll drop back to one pack file.
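That threshold is configurable. These are real settings; the values noted are the documented defaults:

    # How many packs may accumulate before git gc --auto repacks
    # (unset means the default, 50):
    git config gc.autoPackLimit

    # How many loose objects trigger git gc --auto (default 6700):
    git config gc.auto

    # See where you currently stand:
    git count-objects -v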

Note that this repacking has no effect on the thin packs: those depend only on the existence of the objects of interest, and this existence is implicit in the fact that a Git has a commit. Having a commit implies having all of its ancestors (though see footnote 1 again), so once we see that the other Git has commit X we're done with this part of the computation, and can build our thin pack accordingly.


[1] Shallow clones violate this "all ancestors" rule and complicate things, but we don't really need to go into the details here.

[2] In some situations it's desirable to keep an old pack; to do so, you just create a file with the pack's name ending in .keep. This is mostly for those setups where you're sharing a --reference repository.
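For example (the hash here is a placeholder for a real pack name):

    # Mark a pack as kept; git repack and git gc will leave it alone.
    touch .git/objects/pack/pack-<hash>.keep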

answered Mar 04 '23 by torek