How does git store resolved conflicts after merges and its author history

Tags: git

Today I finished reading the book chapter about the internals of git and I think I got a good overview of how git works.

But what I still don't understand is how and where git stores the author information of each line (resolved conflict) after a conflict got resolved.

As far as I understand, a commit just creates 4 elements, and none of them covers a line <==> author relationship:

helloworld.txt
This line (1) got accepted from branch foo by author foo
This line (2) got accepted from branch bar by author bar
Asked by Daniel Stephens on Nov 29 '25 at 20:11


1 Answer

Git doesn't store that at all. Git recomputes this kind of information every time you ask about it.

Each commit stores a full snapshot of every file. So file F in commit a123456 is "authored" by whoever committed a123456. That's not very interesting, at least, not on its own.

But each commit also stores, in its metadata, the hash ID of some set of parent commits. Most commits have exactly one parent: perhaps the commit before a123456 is 3141592, for instance. File F is probably in this earlier commit as well. If we compare the content of file F in commit 3141592 with that of file F in a123456, maybe some lines are different. If that's the case, we can claim that whoever made a123456 really did write those particular lines, and whoever made 3141592 wrote the earlier lines.
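To make the point concrete, here is a minimal sketch of what a commit carries, written as a toy Python model. The `Commit` class and its field names are assumptions for illustration only, not Git's real object format (a real commit refers to its snapshot via a tree hash and also records a committer and a message). The hash IDs are the ones used in this answer.

```python
from dataclasses import dataclass, fields


@dataclass
class Commit:
    """Toy stand-in for a commit; not Git's real object format."""
    commit_id: str
    author: str
    snapshot: dict        # path -> full file content, for every file
    parents: tuple = ()   # IDs of parent commits (empty for a root commit)


older = Commit("3141592", "author foo", {"F": "line one\n"})
newer = Commit("a123456", "author bar", {"F": "line one\nline two\n"},
               parents=(older.commit_id,))

# The only authorship recorded is one author per commit, never per line.
field_names = [f.name for f in fields(Commit)]
print(field_names)
```

Nothing in this model maps a line to an author. Any such mapping has to be derived by comparing `newer.snapshot` with the snapshot of its parent.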

But wait! 3141592 also has a parent, such as 2147483. That commit probably has file F too. If so, we repeat the comparing process: did the author of 3141592 change some lines, or simply carry the file through from before? Or, if 2147483 does not have F after all, we can deduce that all of these lines were authored by whoever made 3141592.

Note how Git has to start at the end of some linear chain of commits and work backwards. A program like git blame "assigns ownership" of some source-code line of some file when, during this backwards walk, the line changes to read however it does in the final commit. If it doesn't change, we don't yet know who to say wrote the line: we have to keep going back.
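The backwards walk can be sketched roughly as follows for a purely linear history. This is only an illustration of the idea, not git blame's actual algorithm: the function name `blame_linear` and the data layout are hypothetical, and Python's `difflib` stands in for Git's own diff machinery.

```python
import difflib


def blame_linear(history, path):
    """Attribute each line of `path` in the newest commit to an author.

    `history` is a list of (author, snapshot) pairs, oldest first, where a
    snapshot maps path -> list of lines. Hypothetical toy model.
    """
    final_lines = history[-1][1].get(path, [])
    owners = [None] * len(final_lines)
    # index in the commit being examined -> index in the final file
    tracked = {i: i for i in range(len(final_lines))}

    for pos in range(len(history) - 1, -1, -1):
        author, snapshot = history[pos]
        child_lines = snapshot.get(path, [])
        # A root commit, or a parent without the file, compares against nothing.
        parent_lines = history[pos - 1][1].get(path, []) if pos > 0 else []

        matcher = difflib.SequenceMatcher(a=parent_lines, b=child_lines,
                                          autojunk=False)
        child_to_parent = {}
        for block in matcher.get_matching_blocks():
            for k in range(block.size):
                child_to_parent[block.b + k] = block.a + k

        still_tracked = {}
        for child_idx, final_idx in tracked.items():
            if child_idx in child_to_parent:
                # Unchanged here: keep walking back to find who wrote it.
                still_tracked[child_to_parent[child_idx]] = final_idx
            else:
                # The line first reads this way in this commit.
                owners[final_idx] = author
        tracked = still_tracked
        if not tracked:
            break

    return list(zip(owners, final_lines))


history = [
    ("foo", {"F": ["line one"]}),
    ("bar", {"F": ["line one", "line two"]}),
    ("baz", {"F": ["line one", "line two"], "other": ["x"]}),
]
result = blame_linear(history, "F")
print(result)
```

In this example the last commit, by `baz`, does not touch `F`, so no line is attributed to it. The walk has to continue to the older commits before any line gets an owner.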

What about merges?

A merge commit stores the same kind of snapshot as any non-merge commit, but instead of containing the hash ID of one parent, it has the hash IDs of two (or occasionally more) parent commits. So if we have a merge commit in hand, we can compare it to either parent. Suppose file F is in merge commit M, and M has parents J and L. If F exactly matches both parents' copies, F has probably not changed all the way back to the merge base:

          I--J
         /    \
...--G--H      M
         \    /
          K--L

The F in M probably matches file F in H, and was not changed by any commit on either leg between H and M.

But if the F in M matches the F in J, and doesn't match the F in L, why then, we must have picked the copy in J when merging. So the copy in L probably matches the copy in H, while the one in J is probably different. We should walk from M to J to "assign blame" for changes in F.

This is particularly weird since we follow the parent whose copy of the file is identical to the merge's, that is, the leg on which the merge itself changed nothing. But that's how git log's History Simplification works: when doing history simplification, git log picks some parent in which a file didn't change, and goes down just that one leg of history. [1]

If M's F doesn't match either of its inputs, both lines of commits must have contributed. Was there a merge conflict? We have no idea: we only know that the snapshot in M does not match either of the two snapshots in J and L. The git log command will in this case not simplify away one leg of the merge. What git blame will do is a bit of a mystery as it has never really been documented (and the algorithms for blame have evolved over time).
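The three merge cases above can be sketched as a small decision function. This is a toy model of the default history simplification for a single path, as described in the git log documentation. It says nothing about what git blame does internally, and the name `parents_to_follow` and the snapshot layout are hypothetical.

```python
def parents_to_follow(merge_snapshot, parent_snapshots, path):
    """Return the names of the parents worth walking for `path`.

    `parent_snapshots` maps a parent name to its snapshot (path -> content),
    in parent order. Hypothetical toy model.
    """
    merged_copy = merge_snapshot.get(path)
    for name, snapshot in parent_snapshots.items():
        if snapshot.get(path) == merged_copy:
            # Identical copy in this parent: one leg is enough.
            return [name]
    # Matches no parent: every leg may have contributed.
    return list(parent_snapshots)


parents = {"J": {"F": "changed on J"}, "L": {"F": "as in H"}}

took_j = parents_to_follow({"F": "changed on J"}, parents, "F")
mixed = parents_to_follow({"F": "something else"}, parents, "F")
unchanged = parents_to_follow({"F": "as in H"},
                              {"J": {"F": "as in H"}, "L": {"F": "as in H"}},
                              "F")
print(took_j, mixed, unchanged)
```

Note that the function only compares snapshots. As in the text, it cannot tell whether the "matches neither parent" case came from a resolved conflict or from a clean automatic merge.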


[1] There is a detailed description of how this works in the git log documentation, using the word TREESAME. Git looks not just at one file, but rather at every retained file, with history simplification normally being turned on by mentioning particular pathnames. The pathnames control which paths are retained, and which are stripped, in order to compare the saved snapshots in each commit, to determine TREESAME-ness.

Here, git blame's documentation is a little weak on detail.

Answered by torek on Dec 01 '25 at 14:12