How does Git compare two files? Which algorithms does it use? Does it compare line by line while merging? I can't tell in advance whether comparing two files will produce a conflict during a merge.


How does git compare two files while merging?

2 Answers

The key to understanding git merge is that Git doesn't compare two things. Git compares three things.

Git can't compare all three directly. It has to compare them two at a time. Two of the things are the two branch tip versions of the files (or branch tip commits; I'll talk more about this in a moment), but Git doesn't compare those to each other. This is where the third one comes in: the third file is the merge base version of the file.

Remember that the goal of a merge is to combine changes. But Git doesn't store changes. Git stores snapshots. Every commit stores every file whole and intact: given one commit, Git gets the whole README.md, the whole main.py, and whatever other files are in that commit, each in the version saved with that commit.

To get changes from snapshots, we need two snapshots: the old one, and the new one. Then we play a game of Spot the Difference. For Git, that's git diff: you give it the hash ID of the old commit, and the hash ID of the new commit, and it makes a diff for each file that's changed between the two. The output of git diff is a series of instructions: delete these lines, add these other lines. If you take the original snapshot and apply the instructions, you get the new snapshot.
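To make the "series of instructions" idea concrete, here is a minimal sketch in Python. It uses the standard library's difflib rather than Git's own diff engine, so the hunks it finds may differ from what git diff would print; the function names make_diff and apply_diff are made up for this illustration.

```python
from difflib import SequenceMatcher

def make_diff(old, new):
    """Return instructions that turn `old` into `new` (both are lists of lines)."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # "Replace old[i1:i2] with these lines"; either side may be empty,
            # which covers pure deletions and pure insertions.
            ops.append((i1, i2, new[j1:j2]))
    return ops

def apply_diff(old, ops):
    """Apply the instructions to the old snapshot to rebuild the new one."""
    result = list(old)
    # Work from the bottom up so that earlier line indices stay valid.
    for i1, i2, lines in reversed(ops):
        result[i1:i2] = lines
    return result

old = ["line a", "line b", "line c"]
new = ["line a", "line B", "line c", "line d"]
ops = make_diff(old, new)
assert apply_diff(old, ops) == new
```

The last line checks the property described above: the old snapshot plus the instructions gives back the new snapshot.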

When we're merging, though, we want to take the work done by (say) Alice, and combine it with the work done by Bob. So what Git does is:

Find the best shared commit, that both Alice and Bob started with.
Compare the shared commit's files to Alice's files. This is what Alice changed.
Compare the shared commit's files to Bob's files. This is what Bob changed.

We call the shared commit—the one that both Alice and Bob started with—the merge base. That's the third input to a merge. Git finds this merge base commit automatically, using the history—the commits—in your repository. This means that you need to have both Alice's and Bob's commits, and all the commits that lead up to those two branch tips, so that you also have the common starting point commit.

Remember that each commit, along with its snapshot, records some information about the snapshot: the name and email address of the person who made it, for instance. There's a date-and-time-stamp for when they made it, and a log message that they can use to explain why they made it. It also stores the raw hash ID of its immediate parent commit: the commit they used, via git checkout, to start from before they made their commit. These parent hash IDs form a backwards-looking chain: if both Alice and Bob started from commit H, and Alice made two commits I and J and Bob made two commits K and L, the backwards chains look like this:

                I <-J   <-- (Alice's latest)
               /
... <-F <-G <-H
               \
                K <-L   <-- (Bob's latest)

Git will automatically find H, which is where Alice and Bob both started from.¹
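Here is a rough sketch of that search, assuming the history is given as a plain dictionary from each commit to its list of parents. The names ancestors and merge_bases are hypothetical, and real Git uses a far more efficient graph walk; this only illustrates the definition of "best shared commit".

```python
def ancestors(parents, commit):
    """Return the commit itself plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def merge_bases(parents, tip_a, tip_b):
    """Shared ancestors that are not themselves ancestors of another shared ancestor."""
    common = ancestors(parents, tip_a) & ancestors(parents, tip_b)
    return {
        c for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    }

# The history drawn above: ... F <- G <- H, then I <- J (Alice) and K <- L (Bob).
parents = {
    "F": [], "G": ["F"], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
}
assert merge_bases(parents, "J", "L") == {"H"}
```

F and G are also shared by both chains, but H is the best choice because it is the latest of them. Note that the sketch never looks at who authored a commit, only at the parent links.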

Having found H, Git now, in effect, runs these two git diff commands:

git diff --find-renames hash-of-H hash-of-J: what Alice changed
git diff --find-renames hash-of-H hash-of-L: what Bob changed

The merge process now combines these changes. For each file in H:

Did Alice change the file? Did Bob change the file?
If neither changed the file, use any copy of the file: all three are the same.
If Alice changed the file and Bob didn't, use Alice's version.
If Bob changed the file and Alice didn't, use Bob's version.
If both changed the file, combine their changes. This is where a merge conflict could occur.
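The per-file decision above can be sketched like this. The sketch assumes each snapshot is a dictionary from path to file content and that every file in H still exists under the same name in both tips (no adds, deletes, or renames); plan_merge is a made-up name, not a Git function.

```python
def plan_merge(base, alice, bob):
    """Decide, for each file in the base snapshot, which version the merge uses."""
    plan = {}
    for path, base_text in base.items():
        alice_changed = alice[path] != base_text
        bob_changed = bob[path] != base_text
        if not alice_changed and not bob_changed:
            plan[path] = "unchanged"      # all three copies are the same
        elif alice_changed and not bob_changed:
            plan[path] = "take alice"
        elif bob_changed and not alice_changed:
            plan[path] = "take bob"
        else:
            plan[path] = "combine"        # the only case that can conflict
    return plan

base  = {"README.md": "v1", "main.py": "v1", "util.py": "v1", "test.py": "v1"}
alice = {"README.md": "v1", "main.py": "a2", "util.py": "v1", "test.py": "a2"}
bob   = {"README.md": "v1", "main.py": "v1", "util.py": "b2", "test.py": "b2"}
plan = plan_merge(base, alice, bob)
assert plan == {
    "README.md": "unchanged",
    "main.py": "take alice",
    "util.py": "take bob",
    "test.py": "combine",
}
```

Only test.py, which both sides touched, needs the hunk-level combining step.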

Does [Git] compare line by line while merging?

The answer to this is both no and yes. As you can now see, there's no comparison of Alice's version to Bob's version. There is a comparison—sort of line-by-line; it's whatever git diff does for comparing—of the base version, to Alice's, and there is an identical comparison of the base version to Bob's. The whole process kicks off by doing a full commit-wide comparison of the two pairs of commits. Within that commit-wide comparison, having found that both Alice and Bob changed some particular file(s), now the line-by-line, or really diff-hunk-by-diff-hunk, comparisons matter. But they're from a third version.

I don't want to check each time manually using "git diff".

You don't have to. You can if you want to, but to do that, you need to find the merge-base commit, using git merge-base perhaps. But if you don't want to, then ... don't. Git will find the merge-base commit; Git will do the two separate git diff operations; Git will combine Alice's changes with Bob's changes, and declare a conflict if the changed lines overlap—or in some cases, abut, or if both span to the end of file.

(For Git, if both Alice and Bob made exactly the same changes to exactly the same lines, Git just takes one copy of the change. Other VCSes may declare a conflict here, either out of laziness—they don't check that the changes were the same, just that they overlapped—or paranoia: if both changed the same lines, maybe the correct result is not just to use one copy of the change. Git just says "the correct result is one copy of the change".)

In any case, Git applies the combined changes to the merge base version of the file. That's the result, possibly with a merge conflict (and merge conflict markers inside the work-tree copy of the file).
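As an illustration of the combining step, here is a simplified three-way merge of one file in Python. It is a sketch under several assumptions: it uses difflib instead of Git's diff code, it treats touching hunks as conflicts, and on any conflict it returns no merged text instead of writing conflict markers. The names hunks and three_way_merge are invented for this example.

```python
from difflib import SequenceMatcher

def hunks(base, new):
    """Changed regions as (start, end, replacement_lines) in base line indices."""
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]

def three_way_merge(base, alice, bob):
    """Return (merged_lines, conflicts); merged_lines is None if anything conflicts."""
    a_ops, b_ops = hunks(base, alice), hunks(base, bob)
    accepted, conflicts = [], []
    i = j = 0
    while i < len(a_ops) and j < len(b_ops):
        a, b = a_ops[i], b_ops[j]
        if a == b:                    # same change on both sides: keep one copy
            accepted.append(a)
            i, j = i + 1, j + 1
        elif a[1] < b[0]:             # Alice's hunk ends before Bob's begins
            accepted.append(a)
            i += 1
        elif b[1] < a[0]:             # Bob's hunk ends before Alice's begins
            accepted.append(b)
            j += 1
        else:                         # the hunks overlap or touch
            conflicts.append((a, b))
            # Move past whichever hunk ends first (both on a tie).
            if a[1] <= b[1]:
                i += 1
            if b[1] <= a[1]:
                j += 1
    accepted.extend(a_ops[i:])
    accepted.extend(b_ops[j:])
    if conflicts:
        return None, conflicts
    merged = list(base)
    # Apply the combined changes to the base, bottom up so indices stay valid.
    for start, end, lines in reversed(accepted):
        merged[start:end] = lines
    return merged, []

base  = ["line a", "line b", "line c", "line d", "line e"]
alice = ["LINE A", "line b", "line c", "line d", "line e"]
bob   = ["line a", "line b", "line c", "line d", "LINE E"]
merged, conflicts = three_way_merge(base, alice, bob)
assert merged == ["LINE A", "line b", "line c", "line d", "LINE E"]
assert conflicts == []
```

If Alice and Bob had instead both rewritten "line c" differently, the function would report one conflict; if they had rewritten it identically, it would keep a single copy of the change.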

Finally, note the --find-renames in the two git diff commands. Git will try to tell whether Alice and/or Bob renamed any of the files in the merge-base commit. If so, Git will try to keep the renaming in the final result. This is true regardless of whether it was Alice or Bob that did the renaming. If both Alice and Bob renamed the file, Git doesn't know which final name to use, and declares a rename/rename conflict. There are similar issues if Alice or Bob deletes the file while the other one modifies it, and there's one last conflict that occurs if both Alice and Bob add a new file with the same name. These kinds of conflicts are what I call high level conflicts: they affect whole files (and/or their names) rather than individual lines within a file. This difference between a low-level conflict (lines within a file) and a high-level one matters if and when you use the -Xours or -Xtheirs option.
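The whole-file cases can be sketched separately from the line-level ones. The following is only a rough classification, assuming snapshots are dictionaries from path to content and that rename detection has already happened: the alice_renames and bob_renames maps (old path to new path) are supplied by the caller, because real rename detection is similarity-based and not shown here. It also assumes that two identical renames, or two identical added files, are not conflicts.

```python
def high_level_conflicts(base, alice, bob, alice_renames=None, bob_renames=None):
    """List whole-file conflicts as (kind, path) pairs."""
    alice_renames = alice_renames or {}
    bob_renames = bob_renames or {}
    conflicts = []
    for old in sorted(base):
        a_new, b_new = alice_renames.get(old), bob_renames.get(old)
        if a_new and b_new:
            if a_new != b_new:
                conflicts.append(("rename/rename", old))
            continue
        if a_new or b_new:
            continue                  # only one side renamed: keep that rename
        a_has, b_has = old in alice, old in bob
        if a_has != b_has:
            # One side deleted the file; it conflicts only if the other modified it.
            survivor = alice if a_has else bob
            if survivor[old] != base[old]:
                conflicts.append(("modify/delete", old))
    renamed_targets = set(alice_renames.values()) | set(bob_renames.values())
    added_by_both = (set(alice) & set(bob)) - set(base) - renamed_targets
    for path in sorted(added_by_both):
        if alice[path] != bob[path]:
            conflicts.append(("add/add", path))
    return conflicts

base  = {"README.md": "r", "main.py": "m", "util.py": "u"}
alice = {"README.md": "r", "app.py": "m", "notes.txt": "alice"}
bob   = {"README.md": "r", "cli.py": "m", "util.py": "u2", "notes.txt": "bob"}
found = high_level_conflicts(
    base, alice, bob,
    alice_renames={"main.py": "app.py"},
    bob_renames={"main.py": "cli.py"},
)
assert found == [
    ("rename/rename", "main.py"),
    ("modify/delete", "util.py"),
    ("add/add", "notes.txt"),
]
```

None of these three conflicts involves individual lines, which is why a line-level option such as -Xours does not resolve them.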

¹This works even if Alice only made one commit, say J, atop (say) Carol's one commit I that Carol made atop H. The common starting point is still H. Git doesn't even look at the authorship of each commit: it just works backwards from the two branch tips.


answered Oct 31 '22 01:10

torek

Git has several merge strategies. By default it uses a three-way merge: the recursive strategy in older versions, and ort since Git 2.34.

A three-way merge uses the last common commit (the merge base) of the two branches.

For example:

master: A -> B -> C

Create a new branch:

master: A -> B -> C
                   \
branch:             D

Add some new commits:

master: A -> B -> C -> E
                   \
branch:             D -> F

Assume all changes are made in a.txt (an empty cell corresponds to an empty line):

 commit C         commit E         commit F 
----------       ----------       ----------
  line a                            line a
  line b         new line d
  line c                          new line e
                   line a           line b
                   line b         new line f
                   line c           
                 new line g         line c

What happens if we merge the two branches (commit E and commit F)? Does it produce a merge conflict? The answer is no, because Git does not compare the files position by position. It looks at where each change sits relative to the surrounding lines of the common ancestor.

Align the three versions of a.txt:

 commit C         commit E         commit F 
----------       ----------       ----------

                 new line d

  line a-----------line a-----------line a

                                  new line e
  line b-----------line b-----------line b
                                  new line f

  line c-----------line c-----------line c
                 new line g

In the table above, the changes are aligned. The lines of commit C (the ancestor commit) are our reference lines. Git looks at what was added next to each reference line. In this example, there are 4 slots:

above line a : commit E adds new line d
below line a : commit F adds new line e
below line b : commit F adds new line f
below line c : commit E adds new line g

As you can see, the merge succeeds when, in each slot, only one of the branches (commit E or commit F) adds something new, or both add the same thing. Otherwise, a merge conflict occurs.
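The slot idea can be checked with a small Python sketch. It assumes both branches only insert lines, as in the a.txt example, and it uses difflib for the alignment instead of Git's own diff code; inserted_slots and conflicting_slots are hypothetical helper names.

```python
from difflib import SequenceMatcher

def inserted_slots(base, new):
    """Map slot index -> inserted lines; slot k is the gap just before base line k."""
    slots = {}
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            slots[i1] = new[j1:j2]
        elif tag != "equal":
            raise ValueError("this sketch only handles pure insertions")
    return slots

def conflicting_slots(base, ours, theirs):
    """Slots where both sides inserted something, and not the same thing."""
    a, b = inserted_slots(base, ours), inserted_slots(base, theirs)
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])

commit_c = ["line a", "line b", "line c"]
commit_e = ["", "new line d", "", "line a", "line b", "line c", "new line g"]
commit_f = ["line a", "", "new line e", "line b", "new line f", "", "line c"]

assert inserted_slots(commit_c, commit_e) == {0: ["", "new line d", ""], 3: ["new line g"]}
assert inserted_slots(commit_c, commit_f) == {1: ["", "new line e"], 2: ["new line f", ""]}
assert conflicting_slots(commit_c, commit_e, commit_f) == []
```

Commit E fills slots 0 and 3, commit F fills slots 1 and 2, so no slot is claimed by both sides and nothing conflicts.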

answered Oct 31 '22 00:10

erncnerky

