How to find/identify large commits in git history?

Tags:

git

🚀 A blazingly fast shell one-liner 🚀

This shell script displays all blob objects in the repository, sorted from smallest to largest.

For my sample repo, it ran about 100 times faster than the other ones found here.
On my trusty Athlon II X4 system, it handles the Linux Kernel repository with its 5.6 million objects in just over a minute.

The Base Script

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

When you run above code, you will get nice human-readable output like this:

...
0d99bb931299  530KiB path/to/some-image.jpg
2ba44098e28f   12MiB path/to/hires-image.png
bd1741ddce0d   63MiB path/to/some-video-1080p.mp4

macOS users: Since numfmt is not available on macOS, you can either omit the last line and deal with raw byte sizes or brew install coreutils.

Filtering

To achieve further filtering, insert any of the following lines before the sort line.

To exclude files that are present in HEAD, insert the following line:

grep -vF --file=<(git ls-tree -r HEAD | awk '{print $3}') |

To show only files exceeding given size (e.g. 1 MiB = 2²⁰ B), insert the following line:

awk '$2 >= 2^20' |

Output for Computers

To generate output that's more suitable for further processing by computers, omit the last two lines of the base script. They do all the formatting. This will leave you with something like this:

...
0d99bb93129939b72069df14af0d0dbda7eb6dba 542455 path/to/some-image.jpg
2ba44098e28f8f66bac5e21210c2774085d2319b 12446815 path/to/hires-image.png
bd1741ddce0d07b72ccf69ed281e09bf8a2d0b2f 65183843 path/to/some-video-1080p.mp4

Appendix

File Removal

For the actual file removal, check out this SO question on the topic.

Understanding the meaning of the displayed file size

What this script displays is the size each file would have in the working directory. If you want to see how much space a file occupies if not checked out, you can use %(objectsize:disk) instead of %(objectsize). However, mind that this metric also has its caveats, as is mentioned in the documentation.

More sophisticated size statistics

Sometimes a list of big files is just not enough to find out what the problem is. You would not spot directories or branches containing humongous numbers of small files, for example.

So if the script here does not cut it for you (and you have a decently recent version of git), look into git-filter-repo --analyze or git rev-list --disk-usage (examples).

I've found a one-liner solution on ETH Zurich Department of Physics wiki page (close to the end of that page). Just do a git gc to remove stale junk, and then

git rev-list --objects --all \
  | grep "$(git verify-pack -v .git/objects/pack/*.idx \
           | sort -k 3 -n \
           | tail -10 \
           | awk '{print$1}')"

will give you the 10 largest files in the repository.

There's also a lazier solution now available, GitExtensions now has a plugin that does this in UI (and handles history rewrites as well).

GitExtensions 'Find large files' dialog

I've found this script very useful in the past for finding large (and non-obvious) objects in a git repository:

http://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/

#!/bin/bash
#set -x 
 
# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs
 
# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';
 
# list all objects including their size, sort by size, take top 10
objects=`git verify-pack -v .git/objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head`
 
echo "All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file."
 
output="size,pack,SHA,location"
allObjects=`git rev-list --all --objects`
for y in $objects
do
    # extract the size in bytes
    size=$((`echo $y | cut -f 5 -d ' '`/1024))
    # extract the compressed size in bytes
    compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
    # extract the SHA
    sha=`echo $y | cut -f 1 -d ' '`
    # find the objects location in the repository tree
    other=`echo "${allObjects}" | grep $sha`
    #lineBreak=`echo -e "\n"`
    output="${output}\n${size},${compressedSize},${other}"
done
 
echo -e $output | column -t -s ', '

That will give you the object name (SHA1sum) of the blob, and then you can use a script like this one:

Which commit has this blob?

... to find the commit that points to each of those blobs.

Step 1 Write all file SHA1s to a text file:

git rev-list --objects --all | sort -k 2 > allfileshas.txt

Step 2 Sort the blobs from biggest to smallest and write results to text file:

git gc && git verify-pack -v .git/objects/pack/pack-*.idx | egrep "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt

Step 3a Combine both text files to get file name/sha1/size information:

for SHA in `cut -f 1 -d\  < bigobjects.txt`; do
echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | awk '{print $1,$3,$7}' >> bigtosmall.txt
done;

Step 3b If you have file names or path names containing spaces try this variation of Step 3a. It uses cut instead of awk to get the desired columns incl. spaces from column 7 to end of line:

for SHA in `cut -f 1 -d\  < bigobjects.txt`; do
echo $(grep $SHA bigobjects.txt) $(grep $SHA allfileshas.txt) | cut -d ' ' -f'1,3,7-' >> bigtosmall.txt
done;

Now you can look at the file bigtosmall.txt in order to decide which files you want to remove from your Git history.

Step 4 To perform the removal (note this part is slow since it's going to examine every commit in your history for data about the file you identified):

git filter-branch --tree-filter 'rm -f myLargeFile.log' HEAD

Source

Steps 1-3a were copied from Finding and Purging Big Files From Git History

EDIT

The article was deleted sometime in the second half of 2017, but an archived copy of it can still be accessed using the Wayback Machine.

You should use BFG Repo-Cleaner.

According to the website:

The BFG is a simpler, faster alternative to git-filter-branch for cleansing bad data out of your Git repository history:

Removing Crazy Big Files

Removing Passwords, Credentials & other Private data

The classic procedure for reducing the size of a repository would be:

git clone --mirror git://example.com/some-big-repo.git
java -jar bfg.jar --strip-biggest-blobs 500 some-big-repo.git
cd some-big-repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push

Related questions
                            
                                Your branch is ahead of 'origin/master' by 3 commits
                            
                                How to tell which commit a tag points to in Git?
                            
                                Delete all local git branches
                            
                                Git: See my last commit
                            
                                How to unstash only certain files?
                            
                                How to remove files that are listed in the .gitignore but still on the repository?
                            
                                What's the difference between "git reset" and "git checkout"?
                            
                                Git merge without auto commit
                            
                                How do I run git log to see changes only for a specific branch?
                            
                                How to revert to origin's master branch's version of file
                            
                                git: fatal: Could not read from remote repository
                            
                                Display last git commit comment
                            
                                Git - Pushing code to two remotes [duplicate]
                            
                                How to commit a change with both "message" and "description" from the command line? [duplicate]
                            
                                How to diff a commit with its parent
                            
                                How to exclude file only from root folder in Git
                            
                                Git: Receiving "fatal: Not a git repository" when attempting to remote add a Git repo
                            
                                Get the short Git version hash
                            
                                Git add and commit in one command
                            
                                How can I rollback a git repository to a specific commit?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With