Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Git granularity -- resolving diffs within a line

Tags:

git

diff

Is git line-based or diff granularity can be increased to a word/letter resolution? It is worth for multiple statements per line or using git for writing plain texts.

like image 475
Little Alien Avatar asked Jun 19 '16 11:06

Little Alien


1 Answers

On re-reading the question in light of the comments, I think I see what you were getting at originally, so I'll put in a real answer (vs Ismail Badawi's one line comment).

Git as simple storage

Git has quite a collection of pieces, some of which play better together than others. At the lowest level, Git acts as a pure content-based datastore: a database with key/value pairs, where the key itself is simply (and always) the hash of the value. That is, you cannot choose the key, the key is provided for you after you provide the value.

Just above this very low level key/value datastore is a similarly low level interface, in which the stored values have types. The four types of objects are "blobs", which hold your actual files; "trees", which allow Git to map real, useful file names, things like file.py or image.jpg, into unreadable hashes like b02f09aabdd957329476837f406588710142aebd; "commits", which translate from unreadable hashes to metadata about a commit, including a tree representing the files at the time of the commit; and annotated-tag objects, which we can ignore here.

Above these, Git provides references. A reference is a general form that encompasses both branches and tags, plus a lot more (remote-tracking branches, Git's "notes", the "stash", and so on). These turn human readable names like master into unreadable hashes, which Git then uses to obtain the commit(s), which Git uses to obtain the hashes for trees, which Git uses to obtain the hashes for your files.

This is the level at which Git first becomes really useful. At this point, Git can store any arbitrary data at all: it need not be composed of lines, or even words; binary forms, including JPEG images, work here. The only constraint is that every file must hash to its own unique value, unless it's the same as a previous version of that file. (To put it as an example, it's OK for hash("a file\nwith two lines\n") to match hash("a file\nwith two lines\n") even if those are in file1.txt and file2.txt; what's not OK is if hash("a file\nwith two lines!\n") had the same value, because we just added a byte: an exclamation point on the second line. We must get a new, different hash for this file's content, regardless of the file's name.)

When you make new commits, Git only cares about content. Specifically, at git commit time, Git packages up the contents of all tracked files based on the staged versions of the files (which are whatever is in the1 index). The names of the files (including path names with subdirectories, if needed) all go into a bunch of tree objects, and the files' contents themselves go into blob objects, or re-use existing blob objects when they are already in the database.

(The latter is very common for most commits, since usually you have dozens, or hundreds, or even tens of thousands, of files in each commit, but each new commit leaves most of the files unchanged from the previous commit. The index/staging-area has the same blob-hash for unchanged file Shakespeare-Sonnet-107.txt as it did for the last 50 commits, so we just re-use the same blob yet again in the new commit.)

Git as compression scheme

If all you want to do with Git is store a bunch of versions of some file tree—more of a backup system than a source control system, in other words—then the question of how efficient Git is at compressing files, after some change(s), becomes interesting.

The basic form of object storage compresses identical files perfectly: as noted above, no matter how many commits contain a file named Shakespeare-Sonnet-107.txt, as long as the contents are byte-for-byte identical, we just re-use the blob whose ID is the hash of that file's content.2

This same basic storage method—the so-called "loose object"—compresses data using Zlib. This works pretty well for text, where a big text file might compress to 10% of its original size, but it does less well for binaries (compression varies widely, I've seen 1/3 to 1/2 of original size) and very poorly for anything already compressed. For text files, and programming language source files, though, we can usually do much better.

Most version control systems offer what is called delta compression, which in its simplest form is just "diff old and new, and turn the instructions that change one to the other into the stored form of the file." That is, we run a string-to-string correction algorithm on two files—typically previous and next version, in a linear chain of commits—and instead of storing an entire new file, we just say "remove line 3, add new line 47" or whatever.

Git performs delta compression in an unusual way, for a Version Control System at least. It never simply says "oh, well, here's a new slightly different version of blah.py, I'll compress against the most recent blah.py". Instead, it compresses objects late in the game, making "pack files" out of individual objects (and, for repacking, other already-packed objects, although the rules get complex4 here) and choosing the objects to pack against each other by various heuristics.

The algorithm at the base of this particular code is a modified version of xdelta. This works on arbitrary binary data and does not depend on line-breaks.

Git as version control

If we want to use Git to do real version control, however—the thing it's designed for, after all—we must look further still. We will have many version-control-specific tasks, but two really big ones are looking at what has changed and merging changes from different development paths.

To see what has changed, Git gives us git diff, git show, and git log. All of these use the same basic difference engine: given two commits, it matches up files within those commits, and then matches up lines—aha, here's where "lines" come in!—in the matched-up files.

The output from git diff, and therefore also git show and git log -p, is very much line-oriented. If you change a word within a line, it shows the entire line as changed. There are flags you can provide to git diff (and hence git show and git log) that will, after Git has found these line-oriented changes, direct them to show which word(s) within the line(s) are actually different. As of Git version 2.9, the intra-line display has been improved further: there is a new diff-highlight script that will show exactly what changed. (These all work atop the line-oriented diff underneath: this shows through when you ignore whitespace changes, for instance, where you will see empty diff hunks, rather than no diff at all.)

Note that using word or character display has no effect on the internal format, of either commit-blobs (stored intact) or deltified objects in packs (xdelta is not line-oriented). These are purely display options.

Three-way merge

Besides all of the above considerations, if you are going to use Git's merge facilities—you do not have to; some folks with Perforce use p4merge, for instance, instead of Git's built in merge—you need to know that these start by running two regular, and hence line-oriented, git diffs.

In particular, when you run git merge <other>, Git resolves <other> to a commit ID, then finds the merge base between the current (HEAD) commit and the other commit. This merge base commit5 serves as a starting point. Git makes two diffs: one from the base to HEAD, and one from the same base to <other>. Git then combines the two diffs, applying the combined changes to the files that are in the base commit.

Since these diffs are line-oriented, the combination process will generally go more smoothly if you construct your changes so that they correspond with the lines. So, as with git diff itself, this is where you may want your files to be very much line-oriented.

As noted above, you need not use Git's internal merge, although writing merge scripts is a bit tricky, and those who do use external merges are always stumbling across various rough edges.


1This implies that there's only one index / staging area. In fact, there's one primary index, and in general you only see that one index, but there are a bunch of corner cases, such as git commit -a, where additional indexes pop into temporary existence, and mess with this model.

2Technically the ID is hash("blob %d\0%s" % (len(bytes), bytes)), to write in Python-esque form. That is, the hash ID of a file's content is found by starting with the word "blob", one space, the decimalized representation of the length of the content, an ASCII NUL byte, and finally the actual content. This guarantees that you cannot break Git by simply writing a file whose content is the same as that of some existing commit,3 for instance. The problem here is that each object can be tested for its type, rather than assuming the type from context, so if a commit and a regular file have the same hash, this would force the single underlying Git repository object to have both type commit and type blob, which is not allowed.

3You can view the contents of the current commit, whose ID is whatever git rev-parse HEAD prints, with git cat-file -p HEAD, for instance. Write the output from git cat-file into a regular file, and git add this new file, and you will get a new ID that does not match what git rev-parse HEAD shows—all because, internally, each object gets prefixed with its type name, plus that space-size-and-NUL sequence.

4To boil them down a lot, a pack object can only refer to other objects inside the same pack, except for so-called "thin" packs, which are used for transfers across the network.

5Assuming there is a single merge base. If there are several, the default recursive strategy constructs one, by merging the merge-base candidates. Usually there is just one single merge base and we need not worry about this. Occasionally there is no merge base at all; Git 2.9 has changed the default to be to complain and fail in this case, rather than merging from an empty tree.

like image 140
torek Avatar answered Oct 16 '22 13:10

torek