 

Git repository maintenance and auditing tools

Tags:

git

I would like to do some auditing and maintenance on our Git repository, because some developers occasionally commit files that shouldn't be in a regular (healthy) repository: compiled binaries, third-party documentation files...

Over the last few months the repository has grown considerably, and I want to know why: which files were added, with what extensions and sizes... Even if those files were removed later, their data is still in the history, slowing down clone, pull and fetch commands, and the git metadata directory is indeed big.

I know that the git log command provides some of this, but I wonder if there's another tool (perhaps a UI tool?) that provides more integrated, easy-to-compare information about repository usage.

So, summarizing, what are my options to obtain Git information about:

  1. Files that were added to the repository, filtered by size, file extension or date, and (if possible) who added them.

  2. Files that were removed. Is it possible to obliterate them from the history?

asked Aug 06 '17 by Luis

1 Answer

The nice thing about git is that it exposes all of its guts, so you can take a peek at them.

In your case, you're looking for big blobs. In case you're unfamiliar with how git works internally: it's based on an object database that maps objects to their SHA-1 hashes. Commits are objects; each commit points to a tree, which is an object that lists the contents of a directory, and a tree's entries can be either other trees (for subdirectories) or blobs (for file data).

This means that if two files have the same content, they share a single blob. It also means that if you alter an object, its ID changes as well (you'll see the consequence of that at the end of this answer).
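As a quick illustration (a throwaway sketch, not part of the original answer), `git hash-object` computes the blob ID for any content, so you can see the content-addressing in action:

```shell
# Identical content always hashes to the same object ID, regardless of
# file name, so git stores it as a single blob.
printf 'hello\n' > a.txt
printf 'hello\n' > b.txt
git hash-object a.txt
git hash-object b.txt   # same ID as a.txt: both files would share one blob
```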

How to get reports

First, make sure you're working on a mirror repository, so clone with the --mirror option:

git clone --mirror https://my-host/my-repo.git

Ok, now here's a command that will show you the 200 biggest blobs:

git rev-list --objects --all \
  | git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(rest)' \
  | sort -nr -k 3 \
  | perl -ne 'm#^(\w+) blob (\d+) (.+)# or next; print "$1\t$2\t$3\n";' \
  | head -n 200 \
  | column -t -s $'\t'

Let's break this down a little:

  • git rev-list --objects --all will output all object IDs (SHA-1 hashes) used in your repository, followed by the file path for blobs (docs).
  • git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(rest)' will reformat the output of git rev-list and insert some data we're interested in (like the object size) (docs).
    Here, %(rest) will be replaced with the part after the object ID on the input - it's the file path that points to the blob.
  • sort -nr -k 3 will do a reverse sort on the blob size
  • perl -ne 'm#^(\w+) blob (\d+) (.+)# or next; print "$1\t$2\t$3\n";' will simply filter out anything that's not a blob and reformat the output
  • head -n 200 will take the first 200 items
  • column -t -s $'\t' will reformat the output nicely

Note the object IDs of the blobs you want to delete.

Remember we're talking about blobs here, not files. If you change a file, you'll have 2 blobs for that file: one blob for each version that was committed. Also, keep in mind that the total disk usage will be less than the sum of the sizes of each blob due to the delta compression that happens when git performs a GC. If two blobs are very similar (as a commit will often change just a small part of a file), the delta compression will be very efficient.
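If you want to see what that delta compression buys you in practice, git can report the actual on-disk size of the object store (a standard command, shown here as a quick aside):

```shell
# Show loose vs. packed object counts and their on-disk sizes.
# "size-pack" is the total size of the packfiles, in KiB.
git count-objects -v
# Running "git gc" and comparing the numbers before and after shows
# how much repacking and delta compression save.
```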

Now you can tweak this command to generate other reports. Here's a less fancy version that does the same thing:

git rev-list --objects --all | git cat-file --batch-check='%(objectname) %(objectsize) %(rest)' | sort -nr -k 2 | head -n 200

And here's how to get the blob size grouped by file extension:

git rev-list --objects --all \
  | git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(rest)' \
  | perl -ne 'm#^(\w+) blob (\d+) .+?(?:\.(\w+))?$# or next; next if $h{$1}++; $ext = $3 ? lc $3 : "<none>"; $s{$ext} += $2; ++$c{$ext}; END { foreach $ext (keys %s) { print "$ext $s{$ext} $c{$ext}\n"; } }' \
  | sort -nr -k 2 \
  | column -t

Same technique, but the Perl script is different. You can iterate by inserting a grep in the first script to get all object IDs and their sizes for a given file extension, for instance.
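Here's a sketch of that grep idea (the `.zip` extension is just an arbitrary example): a filtering stage inserted into the first pipeline to narrow the report to one extension.

```shell
# List the 20 biggest blobs whose path ends in .zip.
git rev-list --objects --all \
  | git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(rest)' \
  | grep ' blob ' \
  | grep -i '\.zip$' \
  | sort -nr -k 3 \
  | head -n 20
```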

How to obliterate unwanted data

By now you should have a good idea of what you want to get rid of. Time to use the BFG repo cleaner. Make sure to read the instructions on the website carefully.

Very important: The BFG will rewrite your entire commit history, which means every commit hash from the first rewritten commit onward will be different. You and everyone else with access to the repo will have to abandon the old repository and switch to the new one. This is a direct consequence of how object IDs work in git, and there's not much you can do about it.

This tool has commands that let you delete all files of a given extension, and it also has a switch that lets you supply a list of object IDs to delete. This one is very useful when combined with the results of the reports above. Suppose you have a list of object IDs to delete in a file named blobs-to-delete.txt:

java -jar bfg.jar --no-blob-protection --private --strip-blobs-with-ids blobs-to-delete.txt my-repo.git

This is much safer than using options like --strip-blobs-bigger-than, for obvious reasons.

A couple notes:

  • --no-blob-protection will mark your latest commit as modifiable (BFG won't touch its contents by default otherwise - just make sure you have a backup).
  • --private will prevent the tool from including the old commit IDs in the commit messages of the new commits (remove it if you need to keep a trace of what happened to each commit, but IMO it just pollutes the commit messages; the tool outputs a map file anyway).

Then, you have to expire the reflog and trigger a full GC for the deleted objects to actually go away:

git reflog expire --expire=now --all && git gc --prune=now --aggressive
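To convince yourself that this step actually removes the data, here's a throwaway-repo sketch (demo setup assumed, not part of the original answer) showing an unreferenced blob disappearing after the expire-and-gc step:

```shell
git init -q demo && cd demo
# Write an unreferenced blob straight into the object database.
blob=$(printf 'big binary payload' | git hash-object -w --stdin)
git cat-file -t "$blob"                     # prints "blob": it exists
git reflog expire --expire=now --all
git gc --prune=now --aggressive --quiet
git cat-file -t "$blob" || echo 'pruned'    # the object is gone
```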

Now if you're 200% sure of what you did, force push your changes, and then force everyone on your team to make a fresh clone. Enjoy your slimmed down repo!

answered Oct 23 '22 by Lucas Trzesniewski