What are the underlying git merge processes within the staging area?

Git does the merge magic, and then lets the user resolve real conflicts, which is as it should be. I'm looking for a low level description of the how and why of the basic git merge and how it uses the staging area.

I've just read the Git Parable, and a comment on it here says:

Even taking into account the fact that it is a "parable" and not a recounting of the history of Git (which you can find in some detail on the Git Wiki, by the way), one point stands: it is IMVHO bad practice to explain the staging area in terms of splitting changes into more than one commit and/or committing with a dirty tree, i.e. with some changes uncommitted. The staging area's main strength (besides being an explicit version of other SCMs' implicit to-be-added area) is dealing with a CONFLICTED MERGE, and that is how it should be explained, I think.

The git merge man page identifies the stage 1/2/3 elements of the merge, but obviously doesn't go into details of whys and wherefores.

Can folk advise on any articles on how and why git manages to achieve results that others don't (over and above the Linus vs. Bram discussion detailed in Wincent's blog), i.e. the allegedly "trivial" part?

Most web articles assume that merges 'just happen', and I haven't found anything that explains the issues (e.g. the need for small commits, the value of a common commit, etc).

asked Aug 17 '11 14:08

Philip Oakley

2 Answers

This should help with at least some of your questions as it's the most common merge that git does:

git merge-file

git merge-file is designed to be a minimal clone of RCS merge; that is, it implements all of RCS merge's functionality which is needed by git(1).

answered Nov 15 '22 20:11

Gerry

Almost every VCS employs the basic concept of a three-way merge. This compares the two branches with their common ancestor, so if a line of code differs between the two branches, you know which branch changed it. If both branches changed it in different ways, you have a merge conflict that must be resolved by a human.
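To make the three-way rule concrete, here is a minimal sketch in Python. It is not git's implementation: it assumes all three versions have the same number of lines so they can be compared position by position, whereas a real merge first aligns the files with a diff. The function name `three_way_merge` and the sample data are made up for illustration.

```python
def three_way_merge(base, ours, theirs):
    """Merge three equally long lists of lines, position by position.

    Returns (merged, conflicts). `conflicts` holds the indexes where both
    sides changed the base line in different ways; those slots are None
    in `merged` and would need a human decision.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles equally long inputs")
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t:        # both sides agree (unchanged, or the same change)
            merged.append(o)
        elif o == b:      # only "theirs" changed this line
            merged.append(t)
        elif t == b:      # only "ours" changed this line
            merged.append(o)
        else:             # both changed it differently: a real conflict
            merged.append(None)
            conflicts.append(i)
    return merged, conflicts


# Hypothetical sample data: each branch edits one line alone, and both
# edit the last line in different ways.
base = ["a = 1", "b = 2", "c = 3"]
ours = ["a = 10", "b = 2", "c = 30"]
theirs = ["a = 1", "b = 20", "c = 33"]

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)
print(conflicts)
```

Without the common ancestor, the first two lines would look just as conflicted as the third. The base version is what lets the merge tell "changed on one side" apart from "changed on both".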

There are a few cases where it is difficult to determine a suitable common ancestor. A lot of research went into different algorithms for this, many involving the tracking of additional metadata with the commits.

Linus' essential innovation was the tracking of trees rather than files. That's sort of a subtle distinction. To illustrate with the example from Wincent's blog, consider a file foo in branch A. You branch off to make branch B. In branch A foo is renamed to bar. In branch B, it is deleted. You then attempt to merge.

If you are tracking files, it goes like this:

Before branching, version 1 of file foo is created.

After the next commit, branch A points to version 2 of foo, which is a deleted file, and version 1 of new file bar.

After the next commit, branch B points to version 2.1 of foo, which is a deleted file.

When you merge, version 2 and 2.1 of foo are compared and found to be identical. No merge conflict there. Branch B doesn't even have a file called bar, so no conflict there either. You end up with the merge algorithm silently accepting branch A's rename, even though there was a real conflict between foo being deleted and it being renamed.

If you are tracking trees, it goes like this:

Before branching, a blob with hash dcb8bd7a97ab39f4c156a1a96d4b10720a39fb81 is created. A tree is created with an entry containing a label foo pointing to the hash.

After the next commit, branch A points to a tree with an entry containing a label bar pointing to the same hash.

After the next commit, branch B points to an empty tree.

When you merge, the trees are compared, with B showing a deletion and A showing a rename of the blob dcb8bd7a97ab39f4c156a1a96d4b10720a39fb81. The user is asked which one to keep.
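The tree-tracking steps above can be sketched roughly as follows. This is only an illustration, not how git stores or merges trees: each tree is modelled as a plain dict from label to blob hash, only exact-hash renames are recognised, and the helper names `changes` and `find_conflicts` are hypothetical.

```python
BLOB = "dcb8bd7a97ab39f4c156a1a96d4b10720a39fb81"

# Trees modelled as {label: blob hash} (an assumption of this sketch).
base = {"foo": BLOB}       # before branching
branch_a = {"bar": BLOB}   # foo renamed to bar
branch_b = {}              # foo deleted


def changes(base_tree, tip_tree):
    """Describe what one branch did to each entry of the base tree."""
    result = {}
    for path, blob in base_tree.items():
        if tip_tree.get(path) == blob:
            result[path] = ("kept", path)
            continue
        # The same hash under a label the base did not have: a rename.
        new_paths = sorted(
            p for p, h in tip_tree.items() if h == blob and p not in base_tree
        )
        if new_paths:
            result[path] = ("renamed", new_paths[0])
        elif path not in tip_tree:
            result[path] = ("deleted", None)
        else:
            result[path] = ("modified", path)
    return result


def find_conflicts(base_tree, tree_a, tree_b):
    """Return base entries that both branches changed in different ways."""
    in_a, in_b = changes(base_tree, tree_a), changes(base_tree, tree_b)
    return {
        path: (in_a[path], in_b[path])
        for path in base_tree
        if in_a[path][0] != "kept"
        and in_b[path][0] != "kept"
        and in_a[path] != in_b[path]
    }


conflicts = find_conflicts(base, branch_a, branch_b)
print(conflicts)
```

Because the comparison is keyed on the blob hash rather than on a per-file version history, the rename on A and the deletion on B both show up as changes to the same object, so the disagreement is reported instead of being silently dropped.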

You can mitigate the effect somewhat in a file-tracking VCS by adding metadata for renames, but git's way needs nothing beyond its ordinary data structures. The metadata approach also has difficulties with complex merges where there are many possible choices for the common ancestor. You could put a billion possible paths between the common ancestor and the two branch heads, and git will still see a blob with the same hash and be able to detect a rename and a delete. It's also difficult to preserve metadata when accepting changes in a patch via email, for instance.

It gets a little trickier with a renamed file that changes at the same time, but by tracking the trees, git has all the information it needs. It sees blob dcb8bd7a97ab39f4c156a1a96d4b10720a39fb81 gone from both branches, but it also sees a new tree entry pointing to a new blob, and can compare the two. If a significant portion of the file matches, it's considered a rename. Obviously this breaks down if you make a ton of changes in a renamed file, but at some point no merge algorithm is going to be able to help you.
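As a rough illustration of "a significant portion of the file matches", the sketch below scores two file contents by line similarity using Python's standard `difflib`. Git computes its own similarity index differently; the 0.5 threshold, the function names, and the sample texts are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(old_text, new_text):
    """Return a 0..1 score for how many lines the two texts share."""
    return SequenceMatcher(
        None, old_text.splitlines(), new_text.splitlines()
    ).ratio()


def looks_like_rename(old_text, new_text, threshold=0.5):
    """Treat the new file as a rename of the old one if enough lines match."""
    return similarity(old_text, new_text) >= threshold


# Hypothetical contents: the blob that disappeared, and two candidate new blobs.
old = "line one\nline two\nline three\nline four\n"
edited = "line one\nline 2\nline three\nline four\n"        # one line changed
rewritten = "something\nentirely\ndifferent\nhere\n"        # nothing in common

print(looks_like_rename(old, edited))
print(looks_like_rename(old, rewritten))
```

The lightly edited file clears the threshold and the rewritten one does not, which mirrors the point above: once a renamed file has changed too much, nothing ties it back to the original.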

See this email from Linus for more insight about his philosophy on this topic.

answered Nov 15 '22 19:11

Karl Bielefeldt
