
What is the difference between git merge and git cherry-pick for a specific commit?

Is there a difference between git merge <commit-id> and git cherry-pick <commit-id>, where <commit-id> is the hash of a commit from my new branch that I want to get into the master branch?

Kevin STS asked Nov 27 '22 13:11



2 Answers

There is a huge difference in all but trivial cases (and even in trivial cases, there is still a difference). To understand this properly is a bit of a challenge, but once you do, you're well on your way to really understanding Git itself.

The TL;DR is mostly what ItayB already said: cherry-pick means to copy some existing commit. The essence of this copying is to turn the commit into a change-set, then re-apply that same change-set to some other existing commit to make a new commit. The new commit "does the same change" as the commit you have copied, but applies that change to a different snapshot.

This description is useful and practical, but not 100% accurate: it doesn't help if you get merge conflicts during your cherry-pick. As those conflicts imply, cherry-pick is internally implemented as a special kind of merging. If there are no merge conflicts, you don't need to know this. If there are, well, it's probably best to start with a proper understanding of git merge style merging.

Merging (as done by git merge) is more complicated: it doesn't copy anything. Instead, it makes a new commit of type merge, which ... well, does something complicated. :-) It cannot be explained adequately without first describing the Git commit graph. It also has two parts, which I like to refer to as, first, merge as a verb (the action of combining changes), and second, the commit-of-type-merge, or merge as a noun or adjective: Git calls these a merge or a merge commit.

When cherry-pick does merging, it only does the first half, the merge as a verb action, and it does it a bit weirdly. If the merge fails with a conflict, the results can be very puzzling. They can only be explained by knowing how Git does the merge as a verb process.

There is also something Git calls a fast-forward operation, or sometimes a fast-forward merge, which is not a merge at all. This, unfortunately, is also confusing; let's hold off on that.

Everything below is the long answer: read only if you want to understand (more of) Git

What to know about commits

The first thing to know—you might already—is that Git is mainly about commits, and each Git commit saves a full snapshot of every file. That is, Git's commits are not change-sets. If you modify one file—say, README.md—and make a new commit with that, the new commit has every file, in full, including (the full text of) the modified README.md. When you inspect the commit, using git show or git log -p, Git will show you what you changed, but it does that by extracting the previous commit's saved files first, then the commit's saved files, and then comparing the two snapshots. Since only README.md changed, it only shows you README.md, and even then, only shows you the difference—the set of changes to the one file.
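
Here is a minimal sketch of that claim, run in a throwaway repository. The file names and contents are made up for illustration. The second commit changes only README.md, yet its snapshot still lists both files; the "what changed" view comes from comparing it with its parent.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.com"

# First commit: two files.
echo "readme v1" > README.md
echo "int main(void) { return 0; }" > main.c
git add README.md main.c
git commit -q -m "first commit"

# Second commit: modify README.md only.
echo "readme v2" > README.md
git add README.md
git commit -q -m "update README only"

# The snapshot saved in the second commit holds every file.
files_in_snapshot=$(git ls-tree -r --name-only HEAD)
echo "$files_in_snapshot"

# The change-set is derived by comparing the parent's snapshot to this one.
files_changed=$(git diff --name-only HEAD^ HEAD)
echo "$files_changed"
```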

This, in turn, means that every commit knows its immediate ancestor, or parent commit. Commits, in Git, have a fixed, permanent "true name" that always means that particular commit. This true name, or hash ID or sometimes OID (the "O" stands for Object), is the big ugly string of letters and digits that Git prints in git log output. For example, 5d826e972970a784bd7a7bdf587512510097b8c7 is a commit in the Git repository for Git. These things look random (though they're not), and are not generally useful to humans, but they are how Git finds each commit. That particular commit has one parent—some other big ugly hash ID—and Git saves the parent's hash inside the commit, so that Git can use the commit to look backwards to its parent.
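
To see the stored parent for yourself, something like the following should work in a scratch repository. The commit helper is a hypothetical convenience of mine, not a Git command.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A
commit B

# Raw commit object: tree, parent, author, committer, then the log message.
git cat-file -p HEAD

# The "parent" line is the hash ID Git uses to step backwards.
parent_recorded=$(git cat-file -p HEAD | sed -n 's/^parent //p')
parent_resolved=$(git rev-parse HEAD^)

# The very first commit has no parent line at all.
root_parent_lines=$(git cat-file -p HEAD^ | grep -c '^parent ' || true)
echo "root commit parent lines: $root_parent_lines"
```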

The result is that if we have a series of commits, they form a backwards-looking chain. We—or Git—will start at the end of this chain and work backwards, to find the history in the repository. Let's imagine we have a tiny repository with just three commits. Instead of their actual hash IDs, which are too big and ugly to bother with, let's call them commits A, B, and C, and draw them in their parent/child relationships:

A <-B <-C

Commit C is the latest, so it is the child of B. Git has C remember B's hash ID, so we say that C points to B. When we made B, there was only one previous commit, A, so A is B's parent and B points to A. Commit A is kind of a special case: when we made it, there were no commits. It has no parent, and this is what allows Git to stop chasing backwards.

Commits are also completely, totally, 100% read-only: once made, nothing about any commit can ever be changed. This is because the hash ID is actually a cryptographic checksum of the complete contents of the commit. Change even a single bit, anywhere, and you get a new, different hash ID—a new, different commit. So a commit snapshot saves the state of your files forever—or at least, for as long as the commit itself continues to exist. (You can initially think of this as "forever"; the mechanisms for forgetting or replacing a commit are more advanced, and get quite tricky when it's not the latest commit.)
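
One way to watch this happen is git commit --amend, which looks like it edits a commit but really makes a new one. In this sketch (the file name and messages are invented) only the log message differs, so the saved snapshot is shared while the commit hash changes, and the original commit is still in the repository.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example" && git config user.email "example@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "original message"
old_commit=$(git rev-parse HEAD)
old_tree=$(git rev-parse "HEAD^{tree}")

# "Change" the commit: Git writes a new commit object instead.
git commit -q --amend -m "reworded message"
new_commit=$(git rev-parse HEAD)
new_tree=$(git rev-parse "HEAD^{tree}")

echo "old: $old_commit"
echo "new: $new_commit"

# The old commit was not modified; it still exists under its old hash ID.
old_type=$(git cat-file -t "$old_commit")
echo "old object type: $old_type"
```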

This read-only quality means we can draw the string of commits more simply as:

A--B--C

and just remember that the linkages go only one way, backwards. The parent can't know its children because the children don't exist when the parent is born, and once the parent is born, it's frozen for all time. The child can know its parent, though, because the child is born after the parent exists and is frozen.

What to know about branch names

It's easy, in simplified diagrams like the one above, to tell which commit is the latest. The letter C comes after B, after all, so C is the latest. But Git hash IDs look totally random, and Git needs the actual hash ID. So what Git does here is to store the hash ID of the latest commit in a branch name.

In fact, this is the very definition of a branch name: a name like master simply stores the hash ID of the commit we want to call the latest for that branch. So given the A--B--C string of commits, we just add the name master, pointing to commit C:

A--B--C   <-- master

What's special about branch names is that, unlike commits, they change. They not only change, they do it automatically. The process of making a new commit, in Git, consists of writing out the commit's contents (its parent hash ID, author/committer information, saved snapshot, log message, and so on), which computes the new hash ID for the new commit, and then changing the branch name to record the new commit's hash ID. If we create a new commit D on master, Git does that by writing out D pointing back to C, then updating master to point to D:

A--B--C--D   <-- master
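
A small sketch of the name moving, using a scratch repository and a hypothetical commit helper:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A
commit B
commit C
before=$(git rev-parse master)          # hash ID of C

commit D
after=$(git rev-parse master)           # hash ID of D
parent_of_d=$(git rev-parse master^)    # D points back to C

# Each branch name is just a name plus one stored hash ID.
git for-each-ref --format='%(refname:short) %(objectname)' refs/heads
```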

Suppose we now create a new branch name, develop. The new name will also point to commit D:

A--B--C--D   <-- develop, master

Let's make a new commit E now, whose parent will be D:

A--B--C--D
          \
           E

Which branch name should Git update? Do we want master to point to E, or do we want develop to point to E? The answer to this question lies in the special name HEAD.

Git's HEAD remembers the branch, and thus the current commit

To remember which branch we want Git to update, as well as which commit we have checked-out right now, Git has the special name HEAD, spelled in all capital letters like this. (Lowercase often works on Windows and macOS because their file systems are usually case-insensitive, but it does not work on Linux/Unix systems with case-sensitive file systems, so it's best to use the all-uppercase spelling. If you don't like typing the word, you can use the symbol @, which is a synonym.) Normally, Git attaches the name HEAD to one branch name:

A--B--C--D   <-- develop (HEAD), master

Here, we're on branch develop, because that's the one HEAD is attached to. (Note that all four commits are on both branches.) If we now make new commit E, Git knows which name to update:

A--B--C--D   <-- master
          \
           E   <-- develop (HEAD)

The name HEAD remains attached to the branch; the branch name itself changes which commit hash ID it remembers; and commit E is now the current commit. If we make a new commit now, its parent will be E, and Git will update develop. (New commit E is only on develop, while commits A-B-C-D are still on both branches!)
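
The same sequence as a sketch, with a hypothetical commit helper. git symbolic-ref HEAD prints the branch name HEAD is attached to:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A
commit B
commit C
commit D

# Create develop at D and attach HEAD to it.
git checkout -q -b develop
attached_to=$(git symbolic-ref HEAD)
echo "HEAD is attached to: $attached_to"

master_before=$(git rev-parse master)
commit E

# Only the attached name moved; master still points to D.
master_after=$(git rev-parse master)
parent_of_e=$(git rev-parse develop^)
head_commit=$(git rev-parse HEAD)
at_commit=$(git rev-parse @)      # @ is a synonym for HEAD
```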

A detached HEAD just means that Git has made the name HEAD point directly to some commit instead of attaching it to a branch name. In this case, HEAD still names the current commit. You're just not on any branch. Making a new commit still creates the commit as usual, but then instead of writing the new commit's new hash ID into a branch name, Git just writes it directly into the name HEAD.

(Detached HEAD is normal, but a little bit special-case; you won't use it for everyday development, except when doing some git rebase operations. You mostly use it for examining historic commits—those not at the tip of some branch name. We'll ignore it here.)
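
For the curious, a detached HEAD can be sketched like this (scratch repository, hypothetical commit helper). With HEAD detached, git symbolic-ref -q HEAD fails because there is no branch name to report, and a new commit moves HEAD itself while master stays put.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A
commit B

# Point HEAD directly at commit B instead of at the name master.
git checkout -q --detach HEAD
if git symbolic-ref -q HEAD >/dev/null; then attached=yes; else attached=no; fi
echo "attached to a branch: $attached"

# A new commit is recorded in HEAD only; no branch name is updated.
commit C
master_tip=$(git rev-parse master)
head_parent=$(git rev-parse HEAD^)
```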

The commit graph, and git merge

So now that we know how commits link and how branch names point to the last commit on their branch, let's look at how git merge works.

Suppose we've made some commits on both master and develop so that we have a graph that looks like this now:

       G--H   <-- master
      /
...--D
      \
       E--F   <-- develop

We'll git checkout master so that HEAD gets attached to master pointing to H, and then run git merge develop.

Git will, at this point, follow both chains backwards. That is, it will start at H and work backwards to G and then to D. It will also start at F and work backwards to E and then to D. At this point, Git has found a shared commit—a commit that's on both branches. All earlier commits are also shared, but this is the best one, because it's the closest commit to both branch tips.

This best shared commit is called the merge base. So in this case, D is the merge base of master (H) and develop (F). The merge base commit is determined entirely by the commit graph, starting from the current commit (HEAD = master = commit H) and from the other commit you name on the command line (develop = commit F). The only use of the branch names in this process is locating the commits—everything after that depends on the graph.
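
You can ask Git for the merge base directly with git merge-base. The sketch below rebuilds the graph drawn above in a scratch repository (the commit helper is hypothetical) and checks that the answer is D.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit D
git checkout -q -b develop
commit E
commit F
git checkout -q master
commit G
commit H

# Best shared commit of the two branch tips.
base=$(git merge-base master develop)
base_subject=$(git log -1 --format=%s "$base")
echo "merge base: $base_subject"
```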

Having found the merge base, what git merge does now is to combine changes. Remember, though, that we said commits are snapshots, not change-sets. So to find changes, Git has to start by extracting the merge base commit itself, into a temporary area.

Now that Git has the merge base extracted, a git diff will find what we changed, on master: the difference between the snapshot in D and the snapshot in HEAD (H). That's the first change-set.

Git now has to run a second git diff, to find what they changed, on develop: the difference between the snapshot in D and the snapshot in F. That's the second change-set.

Hence, what git merge does, having located the merge base, is run these two git diff commands:

git diff --find-renames <hash-of-D> <hash-of-H>    # what we changed
git diff --find-renames <hash-of-D> <hash-of-F>    # what they changed

Git then combines these two sets of changes, applies the combined changes to what's in the snapshot in D (the merge base), and makes a new commit from the result. Or rather, it does all of this as long as the combining works—or more accurately, as long as Git thinks the combining worked.
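
The two diffs can be reproduced by hand. This is only a sketch of the inputs to the merge, not of Git's internal code path; it uses a scratch repository, a hypothetical commit helper, and --name-only to keep the output short.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit D
git checkout -q -b develop
commit E
commit F
git checkout -q master
commit G
commit H

base=$(git merge-base master develop)

# What we changed: merge base versus our tip (H).
ours=$(git diff --find-renames --name-only "$base" master)
# What they changed: merge base versus their tip (F).
theirs=$(git diff --find-renames --name-only "$base" develop)

echo "ours:";   echo "$ours"
echo "theirs:"; echo "$theirs"
```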

For now, let's assume that Git thinks it works. We'll come back to merge conflicts in a moment.

The result of committing the combined changes, applied to the merge base, is a new commit. This new commit has one special feature: besides saving a full snapshot as usual, it has not one but two parent commits. The first of these two parents is the commit that you were on when you ran git merge, and the second is the other commit. That is, the new commit I is a merge commit:

       G--H
      /    \
...--D      I   <-- master (HEAD)
      \    /
       E--F   <-- develop

Because the history in a Git repository is the set of commits, this makes one new commit whose history is both branches. From I, Git can work backwards to H and to F, and from those, to G and E respectively, and from there, to D. The name master now points to I. The name develop is unchanged: it continues to point to F.
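
A sketch of the same merge in a scratch repository (hypothetical commit helper; the merge message "I" is just a label to match the drawing). The suffixes ^1 and ^2 select the first and second parent.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit D
git checkout -q -b develop
commit E
commit F
git checkout -q master
commit G
commit H

h=$(git rev-parse master)
f=$(git rev-parse develop)

# Make merge commit I on master.
git merge -q -m "I" develop

first_parent=$(git rev-parse HEAD^1)    # the commit we were on: H
second_parent=$(git rev-parse HEAD^2)   # the commit we merged: F
develop_after=$(git rev-parse develop)  # develop did not move

# The merged snapshot has the files from both sides.
git ls-tree --name-only HEAD
```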

It's now safe to delete the name develop, if we wish, because we (and Git) can find commit F from commit I. Alternatively, we can keep developing on it, making more new commits:

       G--H
      /    \
...--D      I   <-- master
      \    /
       E--F--J--K--L   <-- develop

If we now git checkout master again and run git merge develop again, Git will do the same thing it did before: find a merge base, run two git diffs, and commit the result. The interesting thing now is that because of commit I, the merge base is no longer D.

Can you name the merge base? Try it, as an exercise: start at L and work backwards, listing the commits. (Remember to only go backwards: from F, you can't get to I, because that's the wrong direction. You can get to E, which is the right way, backwards.) Then start at I and work backwards to both F and H. Is one of those in the listing you made for develop? If so, that's the merge base (F, namely) for the new merge, so Git will use that for its two git diff commands.
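
If you would rather check the exercise mechanically, this sketch rebuilds the graph (scratch repository, hypothetical commit helper, merge messages used only as labels) and asks git merge-base before and after the second merge.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit D
git checkout -q -b develop
commit E
commit F
git checkout -q master
commit G
commit H
git merge -q -m "I" develop

# Keep developing after the first merge.
git checkout -q develop
commit J
commit K
commit L

# Merge base for the second merge.
git checkout -q master
base_before=$(git log -1 --format=%s "$(git merge-base master develop)")
echo "merge base before second merge: $base_before"

git merge -q -m "M" develop

# A later merge would start from here.
base_after=$(git log -1 --format=%s "$(git merge-base master develop)")
echo "merge base after second merge: $base_after"
```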

In the end, if the merge works, we'll get a new merge commit M on master:

       G--H
      /    \
...--D      I--------M   <-- master (HEAD)
      \    /        /
       E--F--J--K--L   <-- develop

and a future merge, if we add more commits to develop, will use L as the merge base.

Cherry-picking uses the merge machinery—the two diffs—with a weird base

Let's go back to this state, and attach HEAD to master:

       G--H   <-- master (HEAD)
      /
...--D
      \
       E--F   <-- develop

Now let's see how Git actually implements git cherry-pick develop.

First, Git resolves the name develop to a commit hash ID. Since develop points to F, that's commit F.

Commit F is a snapshot, and has to be turned into a change-set. Git does this with git diff <hash-of-E> <hash-of-F>.

Git could, at this point, just apply these same changes to the snapshot in H. That's what our high level, not-quite-accurate description claimed: we just take this diff and apply it to H. And in most cases, what happens looks like Git did just that—and in very old versions of Git (that no one uses any more), Git really did do that. But there are cases where it doesn't work right, so Git now performs a weird kind of merge.

In a normal merge, Git would find the merge base and run two diffs. In the cherry-pick kind of merge, Git just forces the merge base to be the parent of the commit being cherry-picked. That is, since we're cherry-picking F, Git forces the merge base to be commit E.

Git now does git diff --find-renames <hash-of-E> <hash-of-H> to see what we changed, and git diff --find-renames <hash-of-E> <hash-of-F> to see what they (commit F) changed. Then it combines the two sets of changes and applies the result to the snapshot in E. This keeps your work (because whatever you changed, you still have changed) while adding the change-set from F too.

If all goes well, which it often does, Git makes a new commit, but this new commit is an ordinary, single-parent commit that goes on master. It's a lot like F, and in fact, Git copies the log message from F too, so let's call this new commit F' to remember that:

       G--H--F'   <-- master (HEAD)
      /
...--D
      \
       E--F   <-- develop

Note that, just as before, develop has not moved. However, we also have not made a merge commit: the new F' does not record F itself. The graph is not merged; the merge base of F' and F is still commit D.
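
The result can be checked with a sketch like the one below (scratch repository, hypothetical commit helper). It only inspects the outcome of git cherry-pick; it does not show the internal merge steps.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit D
git checkout -q -b develop
commit E
commit F
git checkout -q master
commit G
commit H

h=$(git rev-parse master)
f=$(git rev-parse develop)

# Copy commit F onto master as F'.
git cherry-pick develop >/dev/null

copy=$(git rev-parse HEAD)
copy_subject=$(git log -1 --format=%s HEAD)        # log message copied from F
copy_parents=$(git rev-list --parents -n 1 HEAD)   # "F' H": one parent only
base_subject=$(git log -1 --format=%s "$(git merge-base master develop)")

echo "F  = $f"
echo "F' = $copy ($copy_subject)"
echo "merge base is still: $base_subject"
```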

This is thus the complete and accurate answer

This is the full difference between a cherry-pick and a true merge: the cherry-pick uses Git's merge machinery to do the change-combining, but leaves the graph unmerged, simply making a copy of some existing commit. The two change-sets used in the combining are based on the cherry-picked commit's parent, not a computed merge base. The new copy has a new hash ID, not obviously related to the original commit in any way. The histories found by starting at either branch name, master or develop here, still join up well into the past. With a true merge, the new commit is a two-parent merge, and the histories are firmly joined—and of course, the two sets of changes that git merge combines are formed from the computed merge base, so they are different change-sets.

When the merge fails with a conflict

Git's merge machinery, the engine that combines two different sets of changes, can and does fail to do the combining, sometimes. This happens when, in the two change-sets, both try to change the same lines of the same file.

Suppose Git is combining changes, and change-set --ours says touch line 17 of file A, line 30 of file B, and lines 3-6 of file D. Meanwhile change-set --theirs says nothing about file A, but does say change line 30 of file B, line 12 of file C, and lines 10-15 of file D.

Since only ours touches file A, and only theirs touches file C, Git can just use our version of A and their version of C. We both touch file D, but ours touches lines 3-6 and theirs touches lines 10-15, so Git can take both changes to file D. File B is the real problem: we both touched line 30.

If we made the same change to line 30, Git can resolve this: it just takes one copy of the change. But if we made different changes to line 30, Git will stop with a merge conflict.

At this point, Git's index (which I have not been talking about here) becomes crucial. I'm going to continue not talking about it, except to say that Git leaves all three versions of the conflicted file in it. Meanwhile, there's a work-tree copy of file B as well, and in the work-tree file, Git writes its best effort at combining the changes, using conflict markers to show where the problem is.

Your job, as the human running Git, is to resolve each conflict, in any way you like. Having fixed up all the conflicts, you then use git add to update Git's index for the new commit. Then you can run either git merge --continue or git cherry-pick --continue, depending on what caused the problem, to have Git commit the result—or, you can run git commit, which is the old way of doing this same thing. In fact, the --continue operations mainly just run git commit for you: the commit code checks to see if there was a conflict that it should finish, and if so, makes either a regular (cherry-pick) commit or a merge commit.
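
A small conflict, start to finish, might look like this. The file name, its contents, and the resolution are invented for the example; it uses the plain git commit way of finishing the merge.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"

printf 'line one\nline two\n' > B.txt
git add B.txt
git commit -q -m "base"

# They change line two on develop.
git checkout -q -b develop
printf 'line one\ntheir line two\n' > B.txt
git commit -q -am "theirs"

# We change the same line differently on master.
git checkout -q master
printf 'line one\nour line two\n' > B.txt
git commit -q -am "ours"

# The merge stops with a conflict (non-zero exit status).
git merge develop >/dev/null 2>&1 || true

# The index holds three versions of B.txt: base, ours, theirs.
git ls-files -u
stages=$(git ls-files -u | grep -c .)
# The work-tree copy has conflict markers.
markers=$(grep -c '^<<<<<<<' B.txt)

# Resolve by hand, tell Git via git add, then commit to finish the merge.
printf 'line one\ncombined line two\n' > B.txt
git add B.txt
git commit -q -m "merge develop, resolving the conflict in B.txt"
second_parent=$(git rev-parse HEAD^2)
```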

A special case: merge as fast-forward

When you run git merge othercommit, Git locates the merge base as usual, but sometimes the merge base is pretty trivial. Consider, for instance, a graph like this:

...--F--G--H   <-- develop (HEAD)
            \
             I--J   <-- feature-X

If you run git merge feature-X now, Git finds the merge base by starting at commits J and H and doing the usual backwards-walking to find the first shared commit. But that first shared commit is commit H itself, right where develop points.

It's possible for Git to do a real merge, running:

git diff --find-renames <hash-of-H> <hash-of-H>   # what we changed
git diff --find-renames <hash-of-H> <hash-of-J>   # what they changed

and you can force Git to do this, using git merge --no-ff. But obviously, diffing a commit against itself will show no changes at all. The --ours part of the two sets of changes will be empty. The result of the merge will just be the same snapshot that is in commit J, so if we force a true merge:

...--F--G--H------J'   <-- develop (HEAD)
            \    /
             I--J   <-- feature-X

then J' and J will also match. They will be different commits—J' will be a merge commit, with our name and the date and whatever log message we like—but their snapshots will be identical.

If we don't force a true merge, Git realizes that J' and J will match like this, and simply doesn't bother making a new commit. Instead, it "slides the name to which HEAD is attached forwards", against the backwards-pointing internal arrows:

...--F--G--H
            \
             I--J   <-- develop (HEAD), feature-X

(after which there's no point in drawing the kink in the graph). That's a fast-forward operation or, in Git's rather peculiar terminology, a fast-forward merge (even though there's no actual merging!).
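
Both behaviours can be sketched in one scratch repository (hypothetical commit helper; the message "J'" is only a label). First the default fast-forward, then, after rewinding develop to H, the forced true merge.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/develop
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit F
commit G
commit H
git checkout -q -b feature-X
commit I
commit J
git checkout -q develop

h=$(git rev-parse develop)
j=$(git rev-parse feature-X)

# Default: fast-forward. No new commit; develop just slides forward to J.
git merge -q feature-X
ff_tip=$(git rev-parse develop)

# Rewind develop to H and force a true merge instead.
git reset -q --hard "$h"
git merge -q --no-ff -m "J'" feature-X
merge_tip=$(git rev-parse develop)
merge_tree=$(git rev-parse "develop^{tree}")
j_tree=$(git rev-parse "feature-X^{tree}")

echo "fast-forward tip: $ff_tip"
echo "forced merge tip: $merge_tip"
```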

torek answered Nov 30 '22 01:11



cherry-pick takes exactly one commit into your current branch. merge takes the entire branch (which might be several commits) and merges it into your branch.

The same applies if you merge with <commit-id>: it doesn't take only that specific commit but also the commits below it (if there are any).
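
A quick way to see this is to try both on throwaway branches. In this sketch the branch names try-merge and try-pick and the commit helper are made up; each commit adds one file named after itself, so the files present show which commits' changes arrived.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example" && git config user.email "example@example.com"
# Hypothetical helper: one new file per commit, named after the commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit D
git checkout -q -b develop
commit E
commit F
git checkout -q master
commit G
commit H

f=$(git rev-parse develop)   # the <commit-id> in question

# Merging the commit ID brings in F and everything below it (E).
git checkout -q -b try-merge master
git merge -q -m "merge F by hash" "$f"

# Cherry-picking the commit ID brings in only F's own change.
git checkout -q -b try-pick master
git cherry-pick "$f" >/dev/null

echo "after merge:";       git ls-tree --name-only try-merge
echo "after cherry-pick:"; git ls-tree --name-only try-pick
```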

ItayB answered Nov 30 '22 03:11
