Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find files in git repo over x megabytes, that don't exist in HEAD

Tags:

git

This is an adaptation of the git-find-blob script I posted previously:

#!/usr/bin/perl
use 5.008;
use strict;
use Memoize;

sub usage { die "usage: git-large-blob <size[b|k|m]> [<git-log arguments ...>]\n" }

@ARGV or usage();
my ( $max_size, $unit ) = ( shift =~ /^(\d+)([bkm]?)\z/ ) ? ( $1, $2 ) : usage();

my $exp = 10 * ( $unit eq 'b' ? 0 : $unit eq 'k' ? 1 : 2 );
my $cutoff = $max_size * 2**$exp; 

sub walk_tree {
    my ( $tree, @path ) = @_;
    my @subtree;
    my @r;

    {
        open my $ls_tree, '-|', git => 'ls-tree' => -l => $tree
            or die "Couldn't open pipe to git-ls-tree: $!\n";

        while ( <$ls_tree> ) {
            my ( $type, $sha1, $size, $name ) = /\A[0-7]{6} (\S+) (\S+) +(\S+)\t(.*)/;
            if ( $type eq 'tree' ) {
                push @subtree, [ $sha1, $name ];
            }
            elsif ( $type eq 'blob' and $size >= $cutoff ) {
                push @r, [ $size, @path, $name ];
            }
        }
    }

    push @r, walk_tree( $_->[0], @path, $_->[1] )
        for @subtree;

    return @r;
}

memoize 'walk_tree';

open my $log, '-|', git => log => @ARGV, '--pretty=format:%T %h %cr'
    or die "Couldn't open pipe to git-log: $!\n";

my %seen;
while ( <$log> ) {
    chomp;
    my ( $tree, $commit, $age ) = split " ", $_, 3;
    my $is_header_printed;
    for ( walk_tree( $tree ) ) {
        my ( $size, @path ) = @$_;
        my $path = join '/', @path;
        next if $seen{ $path }++;
        print "$commit $age\n" if not $is_header_printed++;
        print "\t$size\t$path\n";
    }
}

More compact ruby script:

#!/usr/bin/env ruby -w
head, treshold = ARGV
head ||= 'HEAD'
Megabyte = 1000 ** 2
treshold = (treshold || 0.1).to_f * Megabyte

big_files = {}

IO.popen("git rev-list #{head}", 'r') do |rev_list|
  rev_list.each_line do |commit|
    commit.chomp!
    for object in `git ls-tree -zrl #{commit}`.split("\0")
      bits, type, sha, size, path = object.split(/\s+/, 5)
      size = size.to_i
      big_files[sha] = [path, size, commit] if size >= treshold
    end
  end
end

big_files.each do |sha, (path, size, commit)|
  where = `git show -s #{commit} --format='%h: %cr'`.chomp
  puts "%4.1fM\t%s\t(%s)" % [size.to_f / Megabyte, path, where]
end

Usage:

ruby big_file.rb [rev] [size in MB]
$ ruby big_file.rb master 0.3
3.8M  example/blah.psd  (aad2981: 4 months ago)
1.1M  another/big.file  (6e73ca2: 2 weeks ago)

Python script to do the same thing (based on this post):

#!/usr/bin/env python

import os, sys

def getOutput(cmd):
    return os.popen(cmd).read()

if (len(sys.argv) <> 2):
    print "usage: %s size_in_bytes" % sys.argv[0]
else:
    maxSize = int(sys.argv[1])

    revisions = getOutput("git rev-list HEAD").split()

    bigfiles = set()
    for revision in revisions:
        files = getOutput("git ls-tree -zrl %s" % revision).split('\0')
        for file in files:
            if file == "":
                continue
            splitdata = file.split()
            commit = splitdata[2]
            if splitdata[3] == "-":
                continue
            size = int(splitdata[3])
            path = splitdata[4]
            if (size > maxSize):
                bigfiles.add("%10d %s %s" % (size, commit, path))

    bigfiles = sorted(bigfiles, reverse=True)

    for f in bigfiles:
        print f

Ouch... that first script (by Aristotle), is pretty slow. On the git.git repo, looking for files > 100k, it chews up the CPU for about 6 minutes.

It also appears to have several wrong SHAs printed -- often a SHA will be printed that has nothing to do with the filename mentioned in the next line.

Here's a faster version. The output format is different, but it is very fast, and it is also -- as far as I can tell -- correct.

The program is a bit longer but a lot of it is verbiage.

#!/usr/bin/perl
use 5.10.0;
use strict;
use warnings;

use File::Temp qw(tempdir);
END { chdir( $ENV{HOME} ); }
my $tempdir = tempdir( "git-files_tempdir.XXXXXXXXXX", TMPDIR => 1, CLEANUP => 1 );

my $min = shift;
$min =~ /^\d+$/ or die "need a number";

# ----------------------------------------------------------------------

my @refs =qw(HEAD);
@refs = @ARGV if @ARGV;

# first, find blob SHAs and names (no sizes here)
open( my $objects, "-|", "git", "rev-list", "--objects", @refs) or die "rev-list: $!";
open( my $blobfile, ">", "$tempdir/blobs" ) or die "blobs out: $!";

my ( $blob, $name );
my %name;
my %size;
while (<$objects>) {
    next unless / ./;    # no commits or top level trees
    ( $blob, $name ) = split;
    $name{$blob} = $name;
    say $blobfile $blob;
}
close($blobfile);

# next, use cat-file --batch-check on the blob SHAs to get sizes
open( my $sizes, "-|", "< $tempdir/blobs git cat-file --batch-check | grep blob" ) or die "cat-file: $!";

my ( $dummy, $size );
while (<$sizes>) {
    ( $blob, $dummy, $size ) = split;
    next if $size < $min;
    $size{ $name{$blob} } = $size if ( $size{ $name{$blob} } || 0 ) < $size;
}

my @names_by_size = sort { $size{$b} <=> $size{$a} } keys %size;

say "
The size shown is the largest that file has ever attained.  But note
that it may not be that big at the commit shown, which is merely the
most recent commit affecting that file.
";

# finally, for each name being printed, find when it was last updated on each
# branch that we're concerned about and print stuff out
for my $name (@names_by_size) {
    say "$size{$name}\t$name";

    for my $r (@refs) {
        system("git --no-pager log -1 --format='%x09%h%x09%x09%ar%x09$r' $r -- $name");
    }
    print "\n";
}
print "\n";

You want to use the BFG Repo-Cleaner, a faster, simpler alternative to git-filter-branch specifically designed for removing large files from Git repos.

Download the BFG jar (requires Java 6 or above) and run this command:

$ java -jar bfg.jar  --strip-blobs-bigger-than 1M  my-repo.git

Any files over 1M in size (that aren't in your latest commit) will be removed from your Git repository's history. You can then use git gc to clean away the dead data:

$ git gc --prune=now --aggressive

The BFG is typically 10-50x faster than running git-filter-branch and the options are tailored around these two common use-cases:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

Full disclosure: I'm the author of the BFG Repo-Cleaner.


Aristote's script will show you what you want. You also need to know that deleted files will still take up space in the repo.

By default, Git keeps changes around for 30 days before they can be garbage-collected. If you want to remove them now:

$ git reflog expire --expire=1.minute refs/heads/master
     # all deletions up to 1 minute  ago available to be garbage-collected
$ git fsck --unreachable 
     # lists all the blobs(file contents) that will be garbage-collected 
$ git prune 
$ git gc

A side comment: While I am big fan of Git, Git doesn't bring any advantages to storing your collection of "random scripts, text files, websites" and binary files. Git tracks changes in content, particularly the history of coordinated changes among many text files, and does so very efficiently and effectively. As your question illustrates, Git doesn't have good tools for tracking individual file changes. And it doesn't track changes in binaries, so any revision stores another full copy in the repo.

Of course this use of Git is a perfectly good way to get familiar with how it works.