
How does git compare two files while merging?

Tags: git, git-merge

How does Git compare two files? Which algorithms does it use? Does it compare line by line while merging?

I can't tell in advance whether comparing two files will produce a conflict during a merge.

erncnerky asked Jul 04 '19 13:07




2 Answers

The key to understanding git merge is that Git doesn't compare two things. Git compares three things.

Git can't compare all three directly. It has to compare them two at a time. Two of the things are the two branch tip versions of the files (or branch tip commits; I'll talk more about this in a moment), but Git doesn't compare those to each other. This is where the third one comes in: the third file is the merge base version of the file.

Remember that the goal of a merge is to combine changes. But Git doesn't store changes. Git stores snapshots. Every commit stores every file whole and intact: given one commit, Git can get the whole README.md, the whole main.py, and whatever other files are in that commit, each exactly as it was saved in that commit.

To get changes from snapshots, we need two snapshots: the old one, and the new one. Then we play a game of Spot the Difference. For Git, that's git diff: you give it the hash ID of the old commit, and the hash ID of the new commit, and it makes a diff for each file that's changed between the two. The output of git diff is a series of instructions: delete these lines, add these other lines. If you take the original snapshot and apply the instructions, you get the new snapshot.
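As a rough illustration of "a diff is a series of instructions", here is a minimal Python sketch. It uses the standard library's difflib rather than Git's own diff engine, so the hunks it finds can differ from what git diff reports; the function names make_diff and apply_diff are made up for this example.

```python
from difflib import SequenceMatcher

def make_diff(old, new):
    """Return edit instructions that turn `old` into `new` (lists of lines)."""
    ops = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # "Replace old[i1:i2] with these lines"; either side may be empty,
            # which covers pure deletions and pure insertions.
            ops.append((i1, i2, new[j1:j2]))
    return ops

def apply_diff(old, ops):
    """Apply the instructions to the old snapshot, producing the new one."""
    result = list(old)
    # Work from the bottom up so earlier indexes stay valid.
    for i1, i2, lines in reversed(ops):
        result[i1:i2] = lines
    return result

old = ["line a", "line b", "line c"]
new = ["line a", "new line e", "line b", "line c", "new line g"]
ops = make_diff(old, new)
print(ops)  # [(1, 1, ['new line e']), (3, 3, ['new line g'])]
print(apply_diff(old, ops) == new)  # True
```

The point is only that old snapshot plus instructions gives the new snapshot; the real output format of git diff is different.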

When we're merging, though, we want to take the work done by (say) Alice, and combine it with the work done by Bob. So what Git does is:

  • Find the best shared commit, that both Alice and Bob started with.
  • Compare the shared commit's files to Alice's files. This is what Alice changed.
  • Compare the shared commit's files to Bob's files. This is what Bob changed.

We call the shared commit—the one that both Alice and Bob started with—the merge base. That's the third input to a merge. Git finds this merge base commit automatically, using the history—the commits—in your repository. This means that you need to have both Alice's and Bob's commits, and all the commits that lead up to those two branch tips, so that you also have the common starting point commit.

Remember that each commit, along with its snapshot, records some information about the snapshot: the name and email address of the person who made it, for instance. There's a date-and-time-stamp for when they made it, and a log message that they can use to explain why they made it. It also stores the raw hash ID of its immediate parent commit: the commit they used, via git checkout, to start from before they made their commit. These parent hash IDs form a backwards-looking chain: if both Alice and Bob started from commit H, and Alice made two commits I and J and Bob made two commits K and L, the backwards chains look like this:

                I <-J   <-- (Alice's latest)
               /
... <-F <-G <-H
               \
                K <-L   <-- (Bob's latest)

Git will automatically find H, which is where Alice and Bob both started from.[1]
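The backwards walk can be sketched in a few lines of Python. This is a toy model, not Git's implementation: the PARENTS table is a hand-written copy of the diagram above, and merge_bases is a hypothetical helper that returns the common ancestors that are not themselves ancestors of another common ancestor.

```python
# Hypothetical history matching the diagram: commit -> list of parent commits.
PARENTS = {
    "F": [], "G": ["F"], "H": ["G"],
    "I": ["H"], "J": ["I"],   # Alice's side
    "K": ["H"], "L": ["K"],   # Bob's side
}

def ancestors(commit, parents):
    """The commit itself plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def merge_bases(a, b, parents):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(a, parents) & ancestors(b, parents)
    return {
        c for c in common
        if not any(c != d and c in ancestors(d, parents) for d in common)
    }

print(merge_bases("J", "L", PARENTS))  # {'H'}
```

Note that nothing here looks at who authored a commit; only the parent links matter.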

Having found H, Git now, in effect, runs these two git diff commands:

  • git diff --find-renames hash-of-H hash-of-J: what Alice changed
  • git diff --find-renames hash-of-H hash-of-L: what Bob changed

The merge process now combines these changes. For each file in H:

  • Did Alice change the file? Did Bob change the file?
  • If neither changed the file, use any copy of the file: all three are the same.
  • If Alice changed the file and Bob didn't, use Alice's version.
  • If Bob changed the file and Alice didn't, use Bob's version.
  • If both changed the file, combine their changes. This is where a merge conflict could occur.
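The per-file decision above can be written out as a small Python sketch. It assumes each snapshot is a plain dict from path to content and that nobody added, deleted, or renamed a file; merge_files is an illustrative name, not a Git command.

```python
def merge_files(base, alice, bob):
    """Snapshots map path -> content. Returns (merged, paths_changed_by_both)."""
    merged, both_changed = {}, []
    for path, original in base.items():
        alice_changed = alice[path] != original
        bob_changed = bob[path] != original
        if alice_changed and bob_changed:
            both_changed.append(path)      # needs a line-level merge
        elif alice_changed:
            merged[path] = alice[path]     # only Alice touched it
        elif bob_changed:
            merged[path] = bob[path]       # only Bob touched it
        else:
            merged[path] = original        # all three copies are identical
    return merged, both_changed

base  = {"README.md": "v1", "main.py": "v1", "util.py": "v1", "LICENSE": "v1"}
alice = {"README.md": "alice", "main.py": "alice", "util.py": "v1", "LICENSE": "v1"}
bob   = {"README.md": "v1", "main.py": "bob", "util.py": "bob", "LICENSE": "v1"}

merged, both_changed = merge_files(base, alice, bob)
print(merged)        # {'README.md': 'alice', 'util.py': 'bob', 'LICENSE': 'v1'}
print(both_changed)  # ['main.py']
```

Only main.py, which both sides changed, goes on to the line-level step where a conflict can occur.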

Does [Git] compare line by line while merging?

The answer to this is both no and yes. As you can now see, there's no comparison of Alice's version to Bob's version. There is a comparison—sort of line-by-line; it's whatever git diff does for comparing—of the base version, to Alice's, and there is an identical comparison of the base version to Bob's. The whole process kicks off by doing a full commit-wide comparison of the two pairs of commits. Within that commit-wide comparison, having found that both Alice and Bob changed some particular file(s), now the line-by-line, or really diff-hunk-by-diff-hunk, comparisons matter. But they're from a third version.

I don't want to check each time manually using "git diff".

You don't have to. You can if you want to, but to do that, you need to find the merge-base commit, using git merge-base perhaps. But if you don't want to, then ... don't. Git will find the merge-base commit; Git will do the two separate git diff operations; Git will combine Alice's changes with Bob's changes, and declare a conflict if the changed lines overlap—or in some cases, abut, or if both span to the end of file.

(For Git, if both Alice and Bob made exactly the same changes to exactly the same lines, Git just takes one copy of the change. Other VCSes may declare a conflict here, either out of laziness—they don't check that the changes were the same, just that they overlapped—or paranoia: if both changed the same lines, maybe the correct result is not just to use one copy of the change. Git just says "the correct result is one copy of the change".)

In any case, Git applies the combined changes to the merge base version of the file. That's the result, possibly with a merge conflict (and merge conflict markers inside the work-tree copy of the file).
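To make the combining step concrete, here is a simplified Python sketch of a line-level three-way merge. It is an approximation under stated assumptions: it diffs with difflib instead of Git's diff engine, it treats any two changed ranges that overlap or touch as a conflict (a simplification of "overlap, or in some cases abut"), and it gives up on a conflict instead of writing conflict markers. The names changes and three_way_merge are invented for this example.

```python
from difflib import SequenceMatcher

def changes(base, new):
    """Changed regions as (start, end, replacement) against the base lines."""
    matcher = SequenceMatcher(a=base, b=new, autojunk=False)
    return [(i1, i2, tuple(new[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]

def three_way_merge(base, alice, bob):
    """Return (merged_lines, conflicts); merged_lines is None on conflict."""
    ours, theirs = changes(base, alice), changes(base, bob)
    conflicts = [
        (a, b) for a in ours for b in theirs
        # Different changes whose base ranges overlap or touch.
        if a != b and a[0] <= b[1] and b[0] <= a[1]
    ]
    if conflicts:
        return None, conflicts
    merged = list(base)
    # Identical changes collapse to one copy (the set); apply bottom-up so
    # that earlier indexes stay valid.
    for start, end, lines in sorted(set(ours + theirs), reverse=True):
        merged[start:end] = lines
    return merged, []

base  = ["a", "b", "c", "d", "e"]
alice = ["A", "b", "c", "d", "e"]   # changes the first line
bob   = ["a", "b", "c", "d", "E"]   # changes the last line

print(three_way_merge(base, alice, bob))    # (['A', 'b', 'c', 'd', 'E'], [])
print(three_way_merge(base, alice, alice))  # (['A', 'b', 'c', 'd', 'e'], [])
print(three_way_merge(base, alice, ["X", "b", "c", "d", "e"])[0])  # None
```

The second call shows the "same change on both sides" case: one copy of the change is kept and no conflict is reported.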

Finally, note the --find-renames in the two git diff commands. Git will try to tell whether Alice and/or Bob renamed any of the files in the merge-base commit. If so, Git will try to keep the renaming in the final result. This is true regardless of whether it was Alice or Bob that did the renaming. If both Alice and Bob renamed the file, Git doesn't know which final name to use, and declares a rename/rename conflict. There are similar issues if Alice or Bob deletes the file while the other one modifies it, and there's one last conflict that occurs if both Alice and Bob add a new file with the same name. These kinds of conflicts are what I call high level conflicts: they affect whole files (and/or their names) rather than individual lines within a file. This difference between a low-level conflict (lines within a file) and a high-level one matters if and when you use the -Xours or -Xtheirs option.
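The three high-level conflicts named above can be modelled with a short Python sketch. Everything here is a simplifying assumption: snapshots are dicts from path to content, the rename tables are supplied by hand (standing in for whatever --find-renames would detect), and the add/add case assumes that identical new files on both sides are not a conflict. high_level_conflicts is a made-up name.

```python
def high_level_conflicts(base, alice, bob, alice_renames, bob_renames):
    """Snapshots map path -> content; *_renames map old path -> new path."""
    found = []
    for path, original in base.items():
        a_name = alice_renames.get(path, path)
        b_name = bob_renames.get(path, path)
        if path in alice_renames and path in bob_renames and a_name != b_name:
            found.append(("rename/rename", path))
            continue
        a, b = alice.get(a_name), bob.get(b_name)
        # Exactly one side deleted the file while the other changed its content.
        if (a is None) != (b is None) and (a if b is None else b) != original:
            found.append(("modify/delete", path))
    renamed_to = set(alice_renames.values()) | set(bob_renames.values())
    for path in sorted(alice.keys() & bob.keys()):
        # Both sides added a new file with the same name but different content.
        if path not in base and path not in renamed_to and alice[path] != bob[path]:
            found.append(("add/add", path))
    return found

base  = {"a.txt": "1", "b.txt": "2"}
alice = {"x.txt": "1", "b.txt": "2 changed", "new.txt": "from alice"}
bob   = {"y.txt": "1", "new.txt": "from bob"}

print(high_level_conflicts(base, alice, bob,
                           {"a.txt": "x.txt"}, {"a.txt": "y.txt"}))
# [('rename/rename', 'a.txt'), ('modify/delete', 'b.txt'), ('add/add', 'new.txt')]
```

None of these checks looks at individual lines, which is what makes them whole-file conflicts.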


[1] This works even if Alice only made one commit, say J, atop (say) Carol's one commit I that Carol made atop H. The common starting point is still H. Git doesn't even look at the authorship of each commit: it just works backwards from the two branch tips.

torek answered Oct 31 '22 01:10


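There are several merge strategies. The 3-way merge strategy named recursive was Git's default when this answer was written; since Git 2.34 the default is ort, which is also a 3-way merge.

A 3-way merge uses the last common commit of the two branches (the merge base) as its reference.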

For example:

master: A -> B -> C

Create a new branch:

master: A -> B -> C
                   \
branch:             D

Add some new commits on both branches:

master: A -> B -> C -> E
                   \
branch:             D -> F

Assume all the changes are made in a.txt (an empty cell corresponds to an empty line):

 commit C         commit E         commit F 
----------       ----------       ----------
  line a                            line a
  line b         new line d
  line c                          new line e
                   line a           line b
                   line b         new line f
                   line c           
                 new line g         line c

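What happens if we merge the two branches (commit E and commit F)? Does it produce a merge conflict? The answer is no, because Git does not compare the files line by line at the same positions. It compares each change together with its context, the surrounding lines from the common ancestor.

Now align the three versions of a.txt: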

 commit C         commit E         commit F 
----------       ----------       ----------

                 new line d

  line a-----------line a-----------line a

                                  new line e
  line b-----------line b-----------line b
                                  new line f

  line c-----------line c-----------line c
                 new line g

In the table above, the changes are aligned. The lines of commit C (the common ancestor) are the reference lines, and Git looks at what each side did in the gaps around them. In this example there are 4 slots:

  • above line a: commit E adds new line d
  • below line a: commit F adds new line e
  • below line b: commit F adds new line f
  • below line c: commit E adds new line g

As you can see, in every slot only one of the branches (commit E or commit F) adds something. A slot merges cleanly when only one side adds lines there, or when both sides add the same lines. Otherwise, a merge conflict occurs.
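The slot idea can be tried out on the a.txt example with a short Python sketch. It is a simplified model, not Git's algorithm: it assumes both versions only add lines (every reference line survives, in order) and that an added line never equals the next reference line. The helper names insertions and merge_slots are invented here.

```python
def insertions(base, version):
    """Map slot index -> lines added there; slot k sits just before base[k]."""
    slots, pending, k = {}, [], 0
    for line in version:
        if k < len(base) and line == base[k]:
            if pending:
                slots[k] = pending
            pending, k = [], k + 1
        else:
            pending.append(line)
    if pending:
        slots[k] = pending          # lines added after the last reference line
    if k != len(base):
        raise ValueError("version dropped or reordered a reference line")
    return slots

def merge_slots(base, ours, theirs):
    """Return (merged_lines, conflicting_slots); merged is only valid if no conflicts."""
    a, b = insertions(base, ours), insertions(base, theirs)
    merged, conflicts = [], []
    for k in range(len(base) + 1):
        added_a, added_b = a.get(k, []), b.get(k, [])
        if added_a and added_b and added_a != added_b:
            conflicts.append(k)     # both sides added different lines here
        merged.extend(added_a or added_b)
        if k < len(base):
            merged.append(base[k])
    return merged, conflicts

commit_c = ["line a", "line b", "line c"]
commit_e = ["", "new line d", "", "line a", "line b", "line c", "new line g"]
commit_f = ["line a", "", "new line e", "line b", "new line f", "", "line c"]

merged, conflicts = merge_slots(commit_c, commit_e, commit_f)
print(conflicts)  # []
print(merged)
# ['', 'new line d', '', 'line a', '', 'new line e', 'line b',
#  'new line f', '', 'line c', 'new line g']
```

If commit F had also added a different line above line a, slot 0 would show up in the conflict list.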

erncnerky answered Oct 31 '22 00:10