
What types of binary files does Git keep deltas for?

Tags: git, git-lfs

We're dealing with a very large project that needs to be migrated to Git. Unfortunately, it contains a large number of binaries as well, some of which are zip-s, dll-s and so on. At the moment, it's not possible to remove these binaries from the version control system.

I would like to find out more about how Git keeps deltas for binary files, and for which ones (if any) it doesn't. I know this is configurable via the .gitattributes file, but do the file types need to be listed explicitly, or is there a pre-defined default set that it recognizes and handles automatically?

asked Jan 17 '18 at 13:01 by carlspring




1 Answer

First, let's get a bit of terminology out of the way. Files are stored as blob objects. These are one of four object types, the other three being commit, tree, and annotated tag.

Git's model is that all objects are logically independent. Everything is stored in a database, keyed by its hash ID. To retrieve any object, you start by knowing its hash ID, which you get from something or someone else.[1] You feed that hash ID to an object-getter, which first looks for the object stored directly, with no chance at delta compression at all; this is what Git calls a loose object. Failing that, Git looks inside pack files, which pack multiple separate objects together and provide the opportunity for delta compression.[2]
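To make "stored by its hash ID" a little more concrete, here is a minimal sketch of how a blob's ID and its loose-object location can be computed. It assumes the classic SHA-1 object format (a repository created with SHA-256 hashes would differ), and the helper names are made up for this example rather than taken from Git.

```python
import hashlib
import zlib


def blob_hash_id(content: bytes) -> str:
    """Hash ID of a blob: SHA-1 over a 'blob <size>' header, a NUL byte, then the raw bytes."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(hash_id: str) -> str:
    """A loose object lives under .git/objects/<first 2 hex digits>/<remaining digits>."""
    return f".git/objects/{hash_id[:2]}/{hash_id[2:]}"


def loose_object_bytes(content: bytes) -> bytes:
    """On disk a loose blob is header + content, zlib-deflated as a whole (no delta involved)."""
    return zlib.compress(b"blob %d\x00" % len(content) + content)


hid = blob_hash_id(b"hello\n")
print(hid)
print(loose_object_path(hid))
```

Note that the file's name plays no part in the hash: two files with identical bytes share one blob, whatever their paths.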

What you're looking for, then, is information about which blob objects Git chooses to delta-compress against which other blob objects inside these pack files. The answer has evolved somewhat over time, so there is no single correct answer—but there are certain control knobs, including the .gitattributes one you mentioned.

The actual delta format is a modification of xdelta. It can, literally, compress (or "deltify") any binary data against any other binary data, but the results will be poor unless the inputs are well-chosen. It's the input choices that are the real key here. Git also has a technical documentation file (pack-heuristics, under Documentation/technical in the Git source tree) describing how objects are chosen for deltification. This takes file path names, and especially final path component names, into account.
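To show why the choice of base matters, here is a rough sketch of applying a delta in the copy/insert style that Git's packs use. The byte layout follows my understanding of Git's pack-format documentation, so treat the details as an assumption to check against that document; the function names and the hand-built delta are purely illustrative.

```python
def read_size(data: bytes, pos: int):
    """Read a little-endian base-128 size; return (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Rebuild the target from a base object plus copy/insert instructions."""
    base_size, pos = read_size(delta, 0)
    target_size, pos = read_size(delta, pos)
    if base_size != len(base):
        raise ValueError("delta was built against a different base")
    out = bytearray()
    while pos < len(delta):
        op = delta[pos]
        pos += 1
        if op & 0x80:  # copy a run of bytes out of the base
            offset = size = 0
            for i in range(4):  # low 4 bits: which offset bytes follow
                if op & (1 << i):
                    offset |= delta[pos] << (8 * i)
                    pos += 1
            for i in range(3):  # next 3 bits: which size bytes follow
                if op & (0x10 << i):
                    size |= delta[pos] << (8 * i)
                    pos += 1
            size = size or 0x10000  # a zero size means 64 KiB
            out += base[offset:offset + size]
        elif op:  # insert the next `op` literal bytes
            out += delta[pos:pos + op]
            pos += op
        else:
            raise ValueError("opcode 0 is reserved")
    if len(out) != target_size:
        raise ValueError("rebuilt object has the wrong size")
    return bytes(out)


base = b"The quick brown fox jumps over the lazy dog\n"
target = b"The quick brown cat jumps over the lazy dog\n"
delta = (
    bytes([len(base), len(target)])  # sizes; each fits in one byte here
    + bytes([0x90, 16])              # copy 16 bytes from base offset 0
    + bytes([3]) + b"cat"            # insert 3 literal bytes
    + bytes([0x91, 19, 25])          # copy 25 bytes from base offset 19
)
assert apply_delta(base, delta) == target
```

With a similar base, almost everything is a cheap copy instruction. Against an unrelated base (or an already-compressed file such as a zip, where a small change tends to alter most of the bytes), the delta degenerates into insert instructions and saves nothing.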

Note that if deltification fails to make the object smaller, the object is simply not delta-compressed. The object's original file size is also an input here: core.bigFileThreshold (introduced in Git 1.7.6, 512 MiB by default) sets a size value, and files above this level are never deltified at all.

Hence, you can prevent Git from considering a file (object, really) for deltification in either of two ways:

  • set core.bigFileThreshold so that the object is too big, or
  • make the object's path name match a .gitattributes line that has -delta specified.
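The two opt-outs above can be modelled roughly as follows. This is a deliberately simplified sketch, not Git's actual logic: it matches patterns against the final path component only, using shell-style globbing, whereas real .gitattributes matching has more rules (patterns containing slashes, later lines overriding earlier ones). The function names are hypothetical; the `-delta` syntax and the 512 MiB default are Git's.

```python
from fnmatch import fnmatchcase
from posixpath import basename

DEFAULT_BIG_FILE_THRESHOLD = 512 * 1024 * 1024  # Git's documented default for core.bigFileThreshold

# Example .gitattributes content: "-delta" unsets the delta attribute for matching paths.
GITATTRIBUTES = """\
# never attempt delta compression for these
*.zip -delta
*.dll -delta
"""


def no_delta_patterns(text: str) -> list:
    """Collect the patterns of lines that carry -delta (simplified parsing)."""
    patterns = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and not fields[0].startswith("#") and "-delta" in fields[1:]:
            patterns.append(fields[0])
    return patterns


def is_delta_candidate(path, size, patterns, threshold=DEFAULT_BIG_FILE_THRESHOLD):
    """Simplified model: too-big objects and -delta paths are never considered."""
    if size > threshold:
        return False
    return not any(fnmatchcase(basename(path), p) for p in patterns)


patterns = no_delta_patterns(GITATTRIBUTES)
print(is_delta_candidate("libs/native/foo.dll", 200_000, patterns))
print(is_delta_candidate("src/main.c", 4_000, patterns))
```

In a real repository the threshold would be set with something like `git config core.bigFileThreshold 50m`, and the `-delta` lines would go in a .gitattributes file. Nothing is excluded by file type unless you list it.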

Note that when using Git-LFS, large files are not stored in Git at all. Instead, a file that Git-LFS is configured to track (as defined by the Git-LFS settings) is replaced (at git add time) by an indirect name, a small text file that Git-LFS calls a pointer. Git then stores this indirect name as the blob object (using the original file's path). When Git extracts the object, Git-LFS inspects it before allowing it to go into your work-tree. Git-LFS detects that the object's data were replaced with an indirect name, and retrieves the "real" data from another (separate, not-Git-at-all) server using the indirect name. So Git never sees the large file's data at all: instead, it sees only these indirect names.
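For a sense of what such an indirect name looks like, the sketch below builds the small text blob (Git-LFS calls it a pointer) that stands in for the real file. The three-line layout follows the Git-LFS pointer specification as I understand it; the helper names are invented for this example, and real pointers are of course produced by Git-LFS itself, not by hand.

```python
import hashlib

LFS_VERSION_LINE = "version https://git-lfs.github.com/spec/v1\n"


def lfs_pointer(content: bytes) -> str:
    """Text that Git stores as the blob in place of the real file content."""
    oid = hashlib.sha256(content).hexdigest()
    return LFS_VERSION_LINE + f"oid sha256:{oid}\n" + f"size {len(content)}\n"


def is_lfs_pointer(blob: bytes) -> bool:
    """Rough check of the kind made at checkout: does this blob look like a pointer?"""
    return blob.startswith(LFS_VERSION_LINE.encode())


big = bytes(1_000_000)  # stand-in for a large binary file
pointer = lfs_pointer(big)
print(pointer)
print(len(pointer.encode()), "bytes stored in Git instead of", len(big))
```

Because every version of the file becomes a different tiny pointer, Git's own delta compression has essentially nothing to work on for LFS-tracked paths.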


[1] For instance, we might start with a branch name like master, which gets us the hash ID of the latest (or tip) commit. That hash ID gives us access to the commit object. The commit lists the hash ID of a tree. The tree, once we obtain it, lists the hash ID of some blob, along with the file's name. So now we know the hash ID for the version of README in the tip commit of master, if that's what we're looking for. Or, we use the commit data to find an older commit, which we use to find another even-older commit, and so on, until we arrive at the commit we want; and then we use the tree to find the blob IDs (and names) of files.

[2] Normally, an object can only be "deltified" against other objects in the same pack. For transport purposes, Git provides what it calls a thin pack in which objects can be delta-compressed against other objects that are omitted, but are assumed to be available on the other side of the transport mechanism. The other Git must "fatten up" the thin pack.

answered Nov 07 '22 at 10:11 by torek