This is sort of a follow-up to this question.
If there are multiple blobs with the same contents, they are stored only once in the git repository because their SHA-1s are identical. How would one go about finding all duplicate files for a given tree?
Would you have to walk the tree and look for duplicate hashes, or does git provide backlinks from each blob to all files in a tree that reference it?
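For example, you can watch the content addressing at work with git hash-object (the file names here are just made up for illustration):

$ echo hello > a.txt
$ cp a.txt b.txt
$ git hash-object a.txt b.txt
ce013625030ba8dba906f756967f9e9ca394464a
ce013625030ba8dba906f756967f9e9ca394464a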
[alias]
    # find duplicate files from root
    alldupes = !"git ls-tree -r HEAD | cut -c 13- | sort | uniq -D -w 40"
    # find duplicate files from the current folder (can also be root)
    dupes = !"cd `pwd`/$GIT_PREFIX && git ls-tree -r HEAD | cut -c 13- | sort | uniq -D -w 40"
Running this on the codebase I work on was an eye-opener, I can tell you!
#!/usr/bin/perl
# usage: git ls-tree -r HEAD | $PROGRAM_NAME
use strict;
use warnings;

# map each blob SHA-1 to the list of paths that reference it
my $sha1_path = {};
while (my $line = <STDIN>) {
    chomp $line;
    # ls-tree lines look like: <mode> <type> <sha1>\t<path>
    # capture (.+) rather than (\S+) so paths containing spaces are kept
    if ($line =~ m{ \A \d+ \s+ \w+ \s+ (\w+) \s+ (.+) \z }xms) {
        my $sha1 = $1;
        my $path = $2;
        push @{ $sha1_path->{$sha1} }, $path;
    }
}

# print every group of two or more paths that share a hash
foreach my $sha1 (keys %$sha1_path) {
    if (scalar @{ $sha1_path->{$sha1} } > 1) {
        foreach my $path (@{ $sha1_path->{$sha1} }) {
            print "$sha1 $path\n";
        }
        print '-' x 40, "\n";
    }
}
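Assuming you save it as find-dupes.pl, a run might look like this (the hash and paths are made up):

$ git ls-tree -r HEAD | perl find-dupes.pl
ce013625030ba8dba906f756967f9e9ca394464a docs/readme.txt
ce013625030ba8dba906f756967f9e9ca394464a docs/readme-copy.txt
----------------------------------------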