This is sort of a follow-up to this question.
If there are multiple blobs with the same contents, they are stored only once in the git repository because their SHA-1s are identical. How would one go about finding all duplicate files for a given tree?
Would you have to walk the tree and look for duplicate hashes, or does git provide backlinks from each blob to all files in a tree that reference it?
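For example, you can watch the content addressing at work with git hash-object (the file names here are just made up for illustration):

$ echo hello > a.txt
$ cp a.txt b.txt
$ git hash-object a.txt b.txt
ce013625030ba8dba906f756967f9e9ca394464a
ce013625030ba8dba906f756967f9e9ca394464a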
[alias]
    # find duplicate files from root
    alldupes = !"git ls-tree -r HEAD | cut -c 13- | sort | uniq -D -w 40"
    # find duplicate files from the current folder (can also be root)
    dupes = !"cd `pwd`/$GIT_PREFIX && git ls-tree -r HEAD | cut -c 13- | sort | uniq -D -w 40"
Running this on the codebase I work on was an eye-opener, I can tell you!
#!/usr/bin/perl
# usage: git ls-tree -r HEAD | $PROGRAM_NAME
use strict;
use warnings;

# map each blob SHA-1 to the list of paths that reference it
my $sha1_path = {};
while (my $line = <STDIN>) {
    chomp $line;
    # ls-tree lines look like: <mode> <type> <sha1>\t<path>
    # capture (.+) rather than (\S+) so paths containing spaces are kept
    if ($line =~ m{ \A \d+ \s+ \w+ \s+ (\w+) \s+ (.+) \z }xms) {
        my $sha1 = $1;
        my $path = $2;
        push @{ $sha1_path->{$sha1} }, $path;
    }
}

# print every group of two or more paths that share a hash
foreach my $sha1 (keys %$sha1_path) {
    if (scalar @{ $sha1_path->{$sha1} } > 1) {
        foreach my $path (@{ $sha1_path->{$sha1} }) {
            print "$sha1 $path\n";
        }
        print '-' x 40, "\n";
    }
}
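Assuming you save it as find-dupes.pl, a run might look like this (the hash and paths are made up):

$ git ls-tree -r HEAD | perl find-dupes.pl
ce013625030ba8dba906f756967f9e9ca394464a docs/readme.txt
ce013625030ba8dba906f756967f9e9ca394464a docs/readme-copy.txt
----------------------------------------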