Are smarter merge conflict algorithms available for git?

Question

When rebasing and Git generates merge conflicts they are sometimes pretty confusingly presented. For example if I have this code:

async loadDirectory({ commit }, path: string) {
  ...

In master I make it sync:

loadDirectory({ commit }, path: string): void {
  ...

And in some_branch (based off the original code) I add another function before it:

async fetchData({ commit }) {
  lots();
  of();
  code();
},

async loadDirectory({ commit }, path: string) {
  ...

Now, if I try to rebase some_branch onto master, git shows this conflict:

<<<<<<< HEAD
loadDirectory({ commit }, path: string): void {
=======
async fetchData({ commit }) {
  lots();
  of();
  code();
},

async loadDirectory({ commit }, path: string) {
>>>>>>> Add data fetching function

This is pretty confusing! It would be much easier to understand if it had presented the conflict like this:

<<<<<<< HEAD
=======
async fetchData({ commit }) {
  lots();
  of();
  code();
},
>>>>>>> Add data fetching function

<<<<<<< HEAD
loadDirectory({ commit }, path: string): void {
=======
async loadDirectory({ commit }, path: string) {
>>>>>>> Add data fetching function

I haven't actually looked into Git's algorithm but I suspect it optimises for algorithmic complexity, and a small number of conflict sections. But whatever the algorithm, this example shows that it could be improved (possibly for a performance cost).

My question is: is it theoretically possible to use a different algorithm with Git, and is anyone working on that?

torek · Accepted Answer

Git does not come with any fancier merge algorithms, but it does include a mechanism by which you can provide your own. In fact, there are two mechanisms, but one is much more usable.

In particular, in your .gitattributes file, you can list a merge driver for particular file name patterns. It's important to realize that this merge driver is only invoked in some particular cases.

At the top level, we invoke git merge using the command line, including these options. There are more options, but these are the ones I want to call out in particular:

git merge [-s strategy] [-X extended-options] commit-specifier

The -s argument takes a strategy, and there are five built in strategies, but only one (or two or three depending on how you count) really matters:

recursive and resolve are essentially the same one, except when there is more than one merge base (a case I won't discuss much here). These are the ones that matter, and for the case we care about, they are pretty much identical.
octopus is meant for use when specifying more than one commit-specifier argument; it builds an octopus merge, which is one with more than two parents. I won't discuss this one either because octopus merges do not allow conflict resolution anyway.
ours completely ignores all other commits and simply retains the current tree, so it's not suitable here.
subtree is a kind of hacked-up variant of recursive for renaming subtrees, and hence really falls into the recursive category as well.

You can provide your own merge strategy! All you have to do is write a command, name it git-merge-whatever, and run git merge -s whatever and Git will invoke your merge strategy. If you do write your own merge strategy, though, you must make it do everything, and that's pretty difficult, as shown by the fact that there don't seem to be any third-party strategies available as Git add-ons. In other words, the phrase all you have to do covers a multitude of sins, some of which are apparently pretty major. :-)

The extended-options—Git calls these strategy options, which seems like a poor name to me since -s is the strategy option—are just passed on to the strategy as options. The standard strategy (or strategies, depending on whether you count resolve and/or subtree as separate strategies) allows -X ours and -X theirs, which aren't what you want in this case, but are worth mentioning, plus a whole host of additional -X options to tweak parameters in the algorithms.

For the most part, though, the standard strategy works by:

finding the merge base (or bases, plural, for -s recursive) so that we have one commit to use as the merge base;
comparing the merge base to HEAD and to the other commit, as if by two git diff --find-rename operations;
combining the change-sets produced by the two git diffs.

If the recursive strategy finds two or more merge bases, it merges them as if by git merge -s recursive, one pair of commits at a time, committing each result. Each such commit is temporary and has no branch name. It does have a raw hash ID—every commit does, so this temporary commit does too—and the final merge result commit hash ID is the merge base commit. Mostly, though, we end up with -s recursive finding some existing (single) merge base, so that commit is the one used as the merge base. If the resolve strategy finds two or more merge bases, it picks one at what might as well be random. (It's not actually random but it is whatever comes out of the base-finding algorithm first, and the order isn't specified and the base-finding algorithm might change in the future.)

So: at this point we have three commits:

the merge base;
the current commit, which is always HEAD, as --ours; and
the commit you specified on the command line, as --theirs.

Some set of files exist in the base. Those files are paired up with some set of files in --ours, according to the rename finding algorithm. Any files that remain unpaired—that cannot be identified as "the same file" on left and right sides—are either added (nothing on left, new file on right) or deleted (nothing on right, left side file deleted). Files whose name changed are renamed. Files that have been paired-up and have their content modified are "modified" (and possibly also renamed). The remaining paired-up files are unchanged at all.

The same thing happens with the base commit and the --theirs commit, producing lists of files that are modified and/or renamed, added, deleted, or unchanged at all.

Note that in principle, at this point, all three commits get yanked into the index. (There's an optimization in which the index doesn't actually expand, if possible, but the outcome of this optimized approach is meant to match that of actually putting all three commits into the index.) The way this works is that each entry in the index has a staging slot number. So a file can occupy staging slots 1 (merge base), 2 (--ours), and 3 (--theirs) simultaneously. Those are three copies of the file—or three blob references, to be precise—all of which use the same file name in the index.

Index staging slot zero is used for resolved files. In this case there is no merge conflict for this file.

The next step is the high level merge operation: files that are renamed, must be renamed from the base to the final commit. If both sides renamed a file, we have a rename/rename conflict. Git declares a conflict for these files and picks one of the two new names as the target name to use in the work-tree. Inside the index, however, the original name in the merge base occupies slot 1 for the base-commit name, the new name in --ours occupies slot 2 for the ours-commit name, and the new name in --theirs occupies slot 3 for the theirs-commit name. It's just the fact that the work-tree file has to have some name that makes Git pick one (and I think Git uses the --ours name here, but I'd have to experiment to be sure).

If one side added a file named F and the other didn't, there's no conflict, but if both sides added a file named F, there is an add/add conflict on file F.

If one side removed a file, and the other side modified and/or renamed the file, there is a modify/delete or rename/delete conflict on that file.

In all of these high level conflict cases, all three files remain in the index, under their various names. The work-tree file may or may not have some merge conflicts laid out in it as well, if there's a low-level conflict too. But by this point all high level conflicts are recorded in the index and declared as conflicts, which will make the merge stop and get help; the file will not be resolved in the index. Note: this is all under the control of the merge strategy.

Now that high-level conflicts are handled, we move on to whether there are low-level conflicts:

If neither side modified a file—if it has the same hash in all three index slots—then there's no problem here. The file can be moved to slot zero, using the hash ID from any of the three staging slots (as long as there are no high level conflicts of course).
If one side of the merge modified a file, but the other side didn't, then there's no problem. The file can be moved from whichever staging slot holds the modified file, to staging slot zero. That is, file F has three hashes in the three slots. Whichever one matches the slot-1 hash, we discard that one and take the other one as the merge result. This goes to slot zero and the other three slots are erased (again only if there are no high level conflicts). The work-tree file gets replaced with the retained index copy.
For the last case, in which all three input files differ, we (finally!) invoke the actual low-level merge driver.

The low level merge driver is responsible for doing the actual merging

This is where your goal comes into view. The default low level merge driver is the line-at-a-time one that produces the merge conflicts you've seen. This program is built in to the strategy, but is available as a callable program as well, using git merge-file. It takes three input file names, merges them as well as it can on its own, writes the result back to one of the three file names, and exits with a zero status if it thinks it merged the three files correctly, or nonzero if it left some conflict markers in the work-tree copy.

When the merge strategy invokes this low level merge driver, if the merge driver exits zero, the merge strategy copies the resulting file into index slot zero and erases slots 1–3 to mark the file resolved. You never see a conflict here. When it exits nonzero, though, the work-tree file isn't the correct result. The three inputs remain in the index, in slots 1–3, and the merge will eventually stop with a conflict. (Of course the strategy goes on to the remaining unmerged files first.)

If you set a merge driver, Git will run your command instead of using its built in equivalent of git merge-file. Your driver should exit zero or nonzero as usual, and regardless of how it exits, should put its best effort at actually merging the three files into the work-tree. You can achieve this however you like: the entire process is up to you.

So, if you want a fancier merge driver that can understand the programming language involved, or uses a word oriented diff instead of line diff, write one! It's just a small matter of programming.

Are smarter merge conflict algorithms available for git?

Tags:

git

algorithm

diff

Timmmm

1 Answers

The low level merge driver is responsible for doing the actual merging

torek

Recent Activity

Donate For Us

Are smarter merge conflict algorithms available for git?

Tags:

git

algorithm

diff

Timmmm

1 Answers

The low level merge driver is responsible for doing the actual merging

torek

Related questions

Recent Activity

Donate For Us