
Git seems to store the whole file instead of diff, how to avoid that?

Tags: git

I have a "large" (5 MB) text file in a git repo. If I add a character to the last line and run git add, my .git folder grows by approximately 1 MB (which I assume is the compressed size of my 5 MB file).

The same happens each time I edit and add.

If I run git add -p file I get a nice diff of just a few bytes. But the large object file still gets stored when I complete the add.

Running git gc --prune=now removes the large object files, and things still seem to work as expected.

But running git gc after every add is not a good option: I use git in an automated way on an SD card, and writing and deleting megabytes like that will wear the card out.

So, my questions are:

1) Am I right that this is how git behaves, or do I misunderstand something?

2) Can I avoid this and make git save only the diff?

I have no problem trading away flexibility in restoring old changes and so on. There is no need for branching or stashing or other things that can complicate life for git.

Edit: Just to be clear, my problem isn't that git saves the whole file once, but that it stores the whole file for each edit. If I add 10 characters, with an add and commit after each one, it saves the whole file (in compressed form) 10 times.

asked Jan 05 '17 10:01 by Nicklas Avén


2 Answers

Git stores all files as "objects" (specifically, as blob objects, with blobs being one of the four possible object types in Git). But this is not the whole story.

Each object is uniquely identified by its contents. The contents of the object are turned into a cryptographic hash (specifically SHA-1, computed over a header made of the object type, in this case blob, a space, the size in bytes written in decimal, and a single ASCII NUL byte, followed by the actual object bytes). Hence if you add the exact same file more than once, you get the same hash, because the raw contents remain the same. But if you change even a single byte, you get a new object, with a new and different hash.
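As a minimal sketch of the hashing described above, the following Python computes the ID a SHA-1 repository would assign to a blob. The function name git_blob_id is made up for illustration, and the sketch assumes the default SHA-1 object format rather than the newer SHA-256 one.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Header: object type, a space, the decimal size in bytes, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = b"hello\n"
edited = b"hello!\n"  # one extra byte

print(git_blob_id(original))  # ce013625030ba8dba906f756967f9e9ca394464a
print(git_blob_id(edited) != git_blob_id(original))  # True
```

The first value should match what git hash-object prints for a file containing only the line "hello". Identical content always gives the same ID, while any one-byte change gives an unrelated one.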

This is why your repository grows by ~1 MB: as you surmised, 1 MB is the size of the compressed 5 MB object. One byte is different, so the new object has a new ID and is stored as a new "loose" object. A loose object consists of the compressed object and header, stored in its own separate file ... but not all objects are loose. Git also provides packed objects.
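To make the loose-object behavior concrete, here is a rough sketch of what gets written for each version of a file. The helper name loose_object and the generated sample text are assumptions for illustration; real Git may use a different zlib compression level, so the exact sizes will differ.

```python
import hashlib
import zlib
from typing import Tuple


def loose_object(content: bytes) -> Tuple[str, bytes]:
    """Return (path relative to .git/objects, zlib-compressed bytes) for a blob."""
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    object_id = hashlib.sha1(store).hexdigest()
    # Loose objects live in a directory named after the first two hex digits.
    path = object_id[:2] + "/" + object_id[2:]
    return path, zlib.compress(store)


# Hypothetical stand-in for the large text file.
big = "".join("line %d\n" % i for i in range(100000)).encode("ascii")

path_a, data_a = loose_object(big)
path_b, data_b = loose_object(big + b"x")  # append a single character

print(path_a != path_b)          # True: a new object file is created
print(len(data_a), len(data_b))  # both are full compressed copies
```

The second version is stored as a complete compressed copy under a new path, which is the per-edit growth described in the question.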

Packed objects are considerably more complicated. Objects stored in a pack are "deltified": stored as deltas computed with Git's special modified variant of libXdiff (see also Is the git binary diff algorithm (delta storage) standardized?). Git chooses a base object and a series of derived objects that are then compressed against the base. With any luck, the successive versions of your file will be deltified against one another, so that once they are packed, they go back to being relatively small, except for the base version itself.
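The idea of storing a derived object against a base can be illustrated with a deliberately simplified delta made of "copy from base" and "insert new bytes" operations. This is not Git's actual matching algorithm or its on-disk encoding; it only finds a common prefix and suffix, and the names make_delta and apply_delta are hypothetical.

```python
def make_delta(base: bytes, target: bytes):
    """Encode target as copy/insert operations against base (toy version)."""
    limit = min(len(base), len(target))
    prefix = 0
    while prefix < limit and base[prefix] == target[prefix]:
        prefix += 1
    suffix = 0
    while (suffix < limit - prefix
           and base[len(base) - 1 - suffix] == target[len(target) - 1 - suffix]):
        suffix += 1
    ops = []
    if prefix:
        ops.append(("copy", 0, prefix))                    # reuse start of base
    middle = target[prefix:len(target) - suffix]
    if middle:
        ops.append(("insert", middle))                     # only the new bytes
    if suffix:
        ops.append(("copy", len(base) - suffix, suffix))   # reuse end of base
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target from the base and the delta operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = "".join("line %d\n" % i for i in range(1000)).encode("ascii")
target = base[:-1] + b"x\n"  # add one character to the last line

delta = make_delta(base, target)
print(delta[1])                            # ('insert', b'x')
print(apply_delta(base, delta) == target)  # True
```

The delta holds two copy operations and one inserted byte, which is why a packed series of near-identical versions costs little more than the base.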

Git normally chooses when to make pack files on its own, and its usual code handles most ordinary source files pretty well. Very large text files will unbalance the automatic packing somewhat, so you might want to experiment with "hand packing" (using an occasional git repack -a -d and/or tweaking the window parameters) to see if you can get better results. However, note that except for "thin packs" used to send deltas across a network connection, pack files require the base object to be present in the same pack as all the deltified objects. If your large file will change often, it will be counterproductive to pack it often, as you will get many large packs (though the -a -d step should consolidate packs as long as you are not using "keep" files on them).
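One way to pack only occasionally is to watch how much space loose objects take and repack once a threshold is crossed. The sketch below is an assumption-laden illustration, not a recommended policy: the helper names and the 50 MiB default are invented, and it relies on git count-objects -v reporting the loose-object size in KiB on its "size:" line.

```python
import subprocess
from typing import Dict


def parse_count_objects(output: str) -> Dict[str, int]:
    """Parse the 'key: value' lines printed by `git count-objects -v`."""
    stats = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            stats[key.strip()] = int(value.strip())
    return stats


def should_repack(stats: Dict[str, int], limit_kib: int) -> bool:
    """True when loose objects use more than limit_kib KiB on disk."""
    return stats.get("size", 0) > limit_kib


def repack_if_needed(repo_dir: str, limit_kib: int = 50 * 1024) -> bool:
    """Run `git repack -a -d` in repo_dir only when the threshold is exceeded."""
    result = subprocess.run(
        ["git", "count-objects", "-v"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    if not should_repack(parse_count_objects(result.stdout), limit_kib):
        return False
    subprocess.run(["git", "repack", "-a", "-d"], cwd=repo_dir, check=True)
    return True
```

Calling repack_if_needed("/path/to/repo") after each automated commit would then rewrite the pack only once in a while. The right threshold depends on how much temporary growth the storage can tolerate.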

(If you modify the work-tree version of the file and git add the result, and it gets a new hash, Git will immediately package it up as a loose object, regardless of any existing packed versions.)

answered Sep 22 '22 10:09 by torek


You can see the documentation here.

It turns out that Git can store just the difference between versions. The initial format in which Git saves objects on disk is called a “loose” object format. However, occasionally Git packs up several of these objects into a single binary file called a “packfile” in order to save space and be more efficient. Git does this if you have too many loose objects around, if you run the git gc command manually, or if you push to a remote server. To see what happens, you can manually ask Git to pack up the objects by calling the git gc command.

So, don't worry about this, git will pack your file and only keep the difference automatically to save disk space when there are too many objects. Also, you can run git gc manually.

answered Sep 22 '22 10:09 by ramwin