git finding duplicate commits (by patch-id)

Question

I'd like a recipe for finding duplicated changes. patch-id is likely to be the same but the commit attributes may not be.

This seems to be an intended use of patch-id:

git patch-id --help

IOW, you can use this thing to look for likely duplicate commits.

I imagine that stringing together "git log", "git patch-id" and uniq could do the job badly but if someone has an command that does the job well, I'd appreciate it.

Slipp D. Thompson · Accepted Answer

For looking for duplicates of a specific commit, this may work for you.

First, determine the patch id of the target commit:

$ THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF='7a3e67c'
$ git show $THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF | git patch-id
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3

The first SHA is the patch-id. Next, list the patch ids for every commit and filter out any that match:

$ for c in $(git rev-list --all); do git show $c | git patch-id; done | grep 'f6ea51cd6acd30cd627ce1a56e2733c1777d5b52'
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 5028e2b5500bd5f4637531e337e17b73f5d0c0b1
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 929c66b5783a0127a7689020d70d398f095b9e00

All together, with a few extra flags, and in the form of a utility script:

test ! -z "$1" && TARGET_COMMIT_SHA="$1" || TARGET_COMMIT_SHA="HEAD"

TARGET_COMMIT_PATCHID=$(
git show --patch-with-raw "$TARGET_COMMIT_SHA" |
    git patch-id |
    cut -d' ' -f1
)
MATCHING_COMMIT_SHAS=$(
for c in $(git rev-list --all); do
    git show --patch-with-raw "$c" |
        git patch-id
done |
    fgrep "$TARGET_COMMIT_PATCHID" |
    cut -d' ' -f2
)

echo "$MATCHING_COMMIT_SHAS"

Usage:

$ git list-dupe-commits 7a3e67c
5028e2b5500bd5f4637531e337e17b73f5d0c0b1
7a3e67ce38dbef471889d9f706b9161da7dc5cf3
929c66b5783a0127a7689020d70d398f095b9e00

It isn't terribly speedy, but for most repos should get the job done (just measured 36 seconds for a repo with 826 commits and a 158MB .git dir on a 2.4GHz Core 2 Duo).

unagi · Answer

To search for duplicate commits of commit $hash, excluding merge commits:

git rev-list --no-merges --all | xargs -r git show | git patch-id \
    | grep ^$(git show $hash|git patch-id|cut -c1-40) | cut -c42-80 \
    | xargs -r git show -s --oneline

For searching the duplicate of a merge commit $mergehash, replace $(git show $hash|git patch-id|cut -c1-40) above by one of the two patch IDs (1st column) given by git diff-tree -m -p $mergehash | git patch-id. They correspond to the diffs of the merge commit with each of its two parents.

To find duplicates of all commits, excluding merge commits:

git rev-list --no-merges --all | xargs -r git show | git patch-id \
    | sort | uniq -w40 -D | cut -c42-80 \
    | xargs -r git log --no-walk --pretty=format:"%h %ad %an (%cn) %s" --date-order --date=iso

The search for duplicate commits can be extended or limited by changing the arguments to git rev-list, which accepts numerous options. For example, to limit the search to a specific branch specify its name instead of the option --all; or to search in the last 100 commits pass the arguments HEAD ^HEAD~100.

Note that these commands are fast since they use no shell loop, and batch-process commits.

To include merge commits, remove the option --no-merges, and replace xargs -r git show by xargs -r -L1 git diff-tree -m -p. This is much slower because git diff-tree is executed once per commit.

Explanation:

The first line generates a map of the patch IDs with the commit hashes (2-column data, of 40 characters each).
The second line only keeps commit hashes (2nd column) corresponding to the duplicate patch IDs (1st column).
The last line prints custom information about the duplicate commits.

robinst · Answer

Because the duplicate changes are likely to be not on the same branch (except when there are reverts in between them), you could use git cherry:

git cherry [-v] [<upstream> [<head> [<limit>]]]

Where upstream would be the branch to check for duplicates of changes in head.

bsb · Answer

I have a draft that works on a toy repo, but as it keeps the patch->commit map in memory it might have problems on large repos:

# print commit pairs with the same patch-id
for c in $(git rev-list HEAD); do \
    git show $c | git patch-id; done \
| perl -anle '($p,$c)=@F;print "$c $s{$p}" if $s{$p};$s{$p}=$c'

The output should be pairs of commits with the same patch-id (3 duplicates A B C come out as "A B" then "B C").

Change the git rev-list command to restrict the commits checked:

git log --format=%H HEAD somefile

Append "| xargs git show" to view the commits in detail, or "| xargs git show -s --oneline" for a summary:

0569473 add 6-8
5e56314 add 6-8 again
bece3c3 comment
e037ed6 add comment again

It turns out patch-id didn't work in my original case as there were additional changes in that later commit. "git log -S" was more useful.

git finding duplicate commits (by patch-id)

Tags:

git

duplicates

commit

bsb

4 Answers

Slipp D. Thompson

unagi

robinst

bsb

Recent Activity

Donate For Us

git finding duplicate commits (by patch-id)

Tags:

git

duplicates

commit

bsb

4 Answers

Slipp D. Thompson

unagi

robinst

bsb

Related questions

Recent Activity

Donate For Us