Sorry if this is a duplicate of a previous question, but I couldn't find quite what I'm looking for. I'm in the process of converting a large CVS codeset (20+ repositories with 15 years of history, 10-15 GB in size) to git. Much of the size is due to binaries that were committed along with the code in the past. While some of the binaries can be removed completely, it's desirable to keep many of them as well as their history. However, we don't want the repo to bloat.
We are currently planning on using git-fat to store the binaries, and I'm in the process of writing a script to automatically convert the files. My first step is to identify all the files in the repo (including deleted files) which are binaries. Are there any simple approaches to accomplishing this? Thanks for your help.
Edit
I actually think I found a reasonable approach where I just run
git log --numstat <first commit hash> HEAD
This prints out a list of all the files with two columns in front; the first contains the number of changes to the file (I'm not sure if it's in bytes or lines). The important part is that for binary files it is '-'. By selecting the lines with this marker and "uniqueing" them, I believe I get the complete list of binary files.
Are there any flaws with this strategy?
tl;dr
git log --all --numstat \
| grep '^-' \
| cut -f3 \
| gsed -r 's|(.*)\{(.*) => (.*)\}(.*)|\1\2\4\n\1\3\4|g' \
| sort -u
Explanation:
The git-log option --numstat shows number of added and deleted lines in decimal notation and pathname without abbreviation, to make it more machine friendly. For binary files, outputs two - instead of saying 0 0.
Source: https://git-scm.com/docs/git-log, emphasis mine
This produces output entries like the following:
commit 0123456789012345678901234567890123456789
Author: Joe Example <[email protected]>
Date: Thu Mar 9 15:33:29 2017 +0000
edit Dockerfile, add assets/foobar.jpg
1 1 Dockerfile
- - assets/foobar.jpg
The grep '^-' matches lines with a leading hyphen, the cut -f3 prints the third tab-delimited field, and the gsed -r 's|(.*)\{(.*) => (.*)\}(.*)|\1\2\4\n\1\3\4|g' detects files that have been moved/renamed and prints both the source and destination; e.g., it would change this:
path/to/{foo => bar}/my-document.pdf
to this:
path/to/foo/my-document.pdf
path/to/bar/my-document.pdf
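To see the rename rewriting in isolation, here is a minimal sketch that wraps the substitution in a shell function and feeds it a few sample lines. The function name expand_renames is hypothetical, and the block assumes GNU sed (invoked as sed here; use gsed on macOS). Besides the expression above, it also tries to cover two other forms git can print: a rename with no shared prefix or suffix (old => new, without braces), and a rename where one side of the braces is empty, which the original expression would turn into a path with a doubled slash.

```shell
# Hypothetical helper: expands git's rename notation into one path per line.
# Assumes GNU sed (\n in the replacement is a GNU extension).
expand_renames() {
  sed -E \
    -e 's|(.*)\{(.*) => (.*)\}(.*)|\1\2\4\n\1\3\4|' \
    -e 's|^(.*) => (.*)$|\1\n\2|' \
    -e 's|//|/|g'
}

# Sample input: a partial rename, a move into a subdirectory, a full rename.
printf '%s\n' \
  'path/to/{foo => bar}/my-document.pdf' \
  'img/{ => old}/logo.png' \
  'a.bin => b.bin' \
  | expand_renames
```

In the full pipeline this function would take the place of the gsed step, between cut -f3 and sort -u.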
Finally, the sort -u sorts the list of paths and removes duplicates.
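EDIT: On macOS you need GNU sed installed, because the default sed does not have the -r option. The easiest way is via Homebrew: brew install gnu-sed, which provides it as gsed. On Linux, where sed is normally already GNU sed, use sed in place of gsed.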
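If installing GNU sed is not an option, one possible variant sidesteps the rename rewriting entirely. This is a sketch rather than a drop-in replacement: it assumes a git version that accepts --no-renames, which makes git report a rename as a deletion plus an addition so that no "{old => new}" notation appears, and it uses awk in place of grep, cut and sed. The function name list_binary_paths is hypothetical.

```shell
# Hypothetical helper: reads `git log --numstat` output on stdin and prints
# the unique paths whose added and deleted counts are both "-" (binary files).
list_binary_paths() {
  awk -F'\t' '$1 == "-" && $2 == "-" { print $3 }' | sort -u
}

# Usage inside a repository (--format= suppresses the commit header lines):
#   git log --all --numstat --no-renames --format= | list_binary_paths
```

Checking both count columns, rather than only a leading hyphen, also keeps commit-message lines that happen to start with "-" out of the result.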
One of the contributors to git-fat here.
If you're primarily concerned about the size of the files, and not specifically their type, then git-fat has a find command which allows you to find all the files in the git repository over a given size.
I currently contribute to cyaninc's fork, but both versions (Jed's and Cyan's) have the find command.
Also check out the retroactive import section of the READMEs; both versions support that as well.
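For comparison, and independent of git-fat, a size-based listing can also be sketched with plain git plumbing. This is not how git-fat's find command is implemented; it is only an illustration of the same idea. The helper name blobs_over and the 1 MiB threshold are assumptions made for the example.

```shell
# Hypothetical helper: reads lines of "<type> <sha> <size> <path>" on stdin
# and prints "<size> <path>" for blobs of at least $1 bytes, largest first.
blobs_over() {
  awk -v min="$1" '$1 == "blob" && $3 + 0 >= min + 0 {
    path = $0
    sub(/^[^ ]+ [^ ]+ [^ ]+ /, "", path)   # keep paths that contain spaces
    print $3, path
  }' | sort -rn
}

# Usage inside a repository, listing every blob in history over 1 MiB:
#   git rev-list --objects --all \
#     | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
#     | blobs_over 1048576
```

Because it walks every object reachable from any ref, this also reports files that were deleted long ago, which matches the goal of covering the whole history.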