
Git find all binary files in history

Sorry if this is a duplicate of a previous question, but I couldn't find quite what I'm looking for. I'm in the process of converting a large CVS codebase (20+ repositories with 15 years of history, 10-15 GB in size) to git. Much of the size is due to binaries that were committed along with the code in the past. While some of the binaries can be removed completely, it's desirable to keep many of them, along with their history. However, we don't want the repo to bloat.

We are currently planning on using git-fat to store the binaries, and I'm in the process of writing a script to convert the files automatically. My first step is to identify all the files in the repo (including deleted files) that are binaries. Are there any simple approaches to accomplishing this? Thanks for your help.

Edit

I actually think I found a reasonable approach where I just run

git log --numstat <first commit hash> HEAD

This prints a list of all the files with two columns in front of each path, giving the amount of change to the file (I'm not sure whether it's in bytes or lines). The important part is that for binary files both columns are '-'. By selecting the lines with this marker and removing duplicates, I believe I get the complete list of binary files.

Are there any flaws with this strategy?

NotsoDarkMatters asked Jan 13 '15 21:01


2 Answers

TL;DR:

git log --all --numstat \
    | grep '^-' \
    | cut -f3 \
    | gsed -r 's|(.*)\{(.*) => (.*)\}(.*)|\1\2\4\n\1\3\4|g' \
    | sort -u

Explanation:

The git-log option --numstat

shows number of added and deleted lines in decimal notation and pathname without abbreviation, to make it more machine friendly. For binary files, outputs two - instead of saying 0 0.

Source: https://git-scm.com/docs/git-log, emphasis mine

This produces output entries like the following:

commit 0123456789012345678901234567890123456789
Author: Joe Example <[email protected]>
Date:   Thu Mar 9 15:33:29 2017 +0000

    edit Dockerfile, add assets/foobar.jpg

1   1   Dockerfile
-   -   assets/foobar.jpg

The grep '^-' matches lines with a leading hyphen, the cut -f3 prints the third tab-delimited field, and the

sed -r 's|(.*)\{(.*) => (.*)\}(.*)|\1\2\4\n\1\3\4|g'

detects files that have been moved/renamed and prints both the source and destination; e.g., it would change this:

path/to/{foo => bar}/my-document.pdf

to this:

path/to/foo/my-document.pdf
path/to/bar/my-document.pdf

Finally, the sort -u will accumulate, sort, and uniquify the list of paths.
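
For systems without GNU sed, the same steps can be sketched with awk instead. This is only a sketch under a few assumptions: the function names expand_renames and list_binary_paths are made up for illustration, paths are assumed not to contain a literal " => " or braces of their own, and paths that git quotes because of unusual characters are not unquoted. Unlike the sed expression above, it also splits the brace-less "old => new" form and collapses the "//" that appears when one side of a rename is empty.

```shell
# Expand git's rename notation into separate old and new paths.
# Reads one path per line on stdin. Handles "dir/{old => new}/file"
# as well as the brace-less "old/path => new/path" form.
expand_renames() {
    awk '
    {
        ar = index($0, " => ")
        if (ar == 0) { print; next }          # not a rename: pass through
        lb = index($0, "{")
        rb = index($0, "}")
        if (lb > 0 && lb < ar && rb > ar) {   # dir/{old => new}/file
            pre  = substr($0, 1, lb - 1)
            oldp = substr($0, lb + 1, ar - lb - 1)
            newp = substr($0, ar + 4, rb - ar - 4)
            post = substr($0, rb + 1)
            a = pre oldp post
            b = pre newp post
        } else {                              # old/path => new/path
            a = substr($0, 1, ar - 1)
            b = substr($0, ar + 4)
        }
        # An empty side, as in "lib/{ => vendor}/x", leaves a double slash.
        gsub(/\/\//, "/", a)
        gsub(/\/\//, "/", b)
        print a
        print b
    }'
}

# List every path that git reported as binary ("-" in both numstat
# columns) in any commit reachable from any ref.
list_binary_paths() {
    git log --all --numstat --format= \
        | awk -F'\t' '$1 == "-" && $2 == "-" { print $3 }' \
        | expand_renames \
        | sort -u
}
```

Running list_binary_paths inside the repository prints the de-duplicated list. The --format= option suppresses the commit headers, so only the numstat lines reach the filter.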

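EDIT: The command at the top uses gsed, which is the name Homebrew gives GNU sed on macOS; you need it there because the default sed does not have the -r option. Best to install it via Brew: brew install gnu-sed. On systems where sed is already GNU sed (most Linux distributions), use sed in place of gsed.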

rubicks answered Sep 23 '22 12:09


One of the contributors to git-fat here.

If you're primarily concerned about the size of the file, and not specifically the type, then git-fat has a find command which allows you to find all the files in the git repository over a given size.

I currently contribute to cyaninc's fork, but both versions (Jed's and Cyan's) have the find command.

Also check out the retroactive import section of the READMEs; both versions support that as well.
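
If git-fat is not installed yet, a size-based search can also be approximated with plain git plumbing. The sketch below is not git-fat's find command and makes no claim about how that command works; the function names filter_large_blobs and large_blobs and the default threshold of 1000000 bytes are assumptions chosen for illustration.

```shell
# Keep only blobs of at least $1 bytes, largest first.
# stdin:  "<type> <size> <path>" lines
# stdout: "<size> <path>" lines
filter_large_blobs() {
    awk -v min="$1" '$1 == "blob" && $2 + 0 >= min + 0 {
        sub(/^blob /, "")   # drop the type column
        print
    }' | sort -rn
}

# List every blob in history (all refs) at or above a size threshold.
large_blobs() {
    git rev-list --objects --all \
        | git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' \
        | filter_large_blobs "${1:-1000000}"
}
```

For example, large_blobs 5000000 lists blobs of 5 MB or more together with the path they were stored under. A blob that appeared under several paths is reported once, under the first path git rev-list encountered.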

Caustic answered Sep 22 '22 12:09