
Fixing up a git repo that is slowed because of big binary files


We have a git repo containing both source code and binaries. The bare repo has now reached ~9GB, and cloning it takes ages. Most of the time is spent in "remote: Compressing objects". After a commit with a new version of one of the bigger binaries, a fetch takes a long time, also spent compressing objects on the server.

After reading "git pull without remotely compressing objects", I suspect delta compression of binary files is what hurts us as well, but I'm not 100% sure how to go about fixing this.

What are the exact steps to fix the bare repo on the server? My guess:

  • Add entries like '*.zip -delta' for all the extensions I want to exclude from delta compression to .git/info/attributes
  • Run 'git repack', but with what options? Would -adF repack everything, and leave me with a repo where no delta compression has ever been done on the specified file types?
  • Run 'git prune'. I thought this was done automatically, but when I experimented with a bare clone of the repo, running it decreased the size by ~2GB
  • Clone the repo, add and commit a .gitattributes with the same entries as I added in .git/info/attributes on the bare repo

Am I on to something?
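
In command form, my guess so far would be something like this (run inside the bare repo on the server; the path and the '*.zip' pattern are just examples, and I'm not sure about the repack options):

    cd /path/to/repo.git                  # bare repo on the server (example path)
    echo '*.zip -delta' >> info/attributes
    git repack -a -d -F                   # -F to recompute deltas from scratch?
    git prune                             # drop unreachable loose objects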

Update:

Some interesting test results on this. Today I started a bare clone of the problematic repo. Our not-so-powerful server with 4 GB of RAM ran out of memory and started swapping. After 3 hours I gave up...

Then I instead cloned a bare repo from my up-to-date working copy. Cloning that one between workstations took ~5 minutes. I then pushed it up to the server as a new repo. Cloning that repo took only 7 minutes.

If I interpret this correctly, a better packed repo performs much better, even without disabling the delta-compression for binary files. I guess this means the steps above are indeed what I want to do in the short term, but in addition I need to find out how to limit the amount of memory git is allowed to use for packing/compression on the server so I can avoid the swapping.

In case it matters: The server runs git 1.7.0.4 and the workstations run 1.7.9.5.

Update 2:

I did the following steps on my test repo, and I think I will risk doing them on the server (after a backup); a quick before/after check is sketched after the list:

  • Limit memory usage when packing objects

    git config pack.windowMemory 100m
    git config pack.packSizeLimit 200m

  • Disable delta compression for some extensions

    echo '*.tar.gz -delta' >> info/attributes
    echo '*.tar.bz2 -delta' >> info/attributes
    echo '*.bin -delta' >> info/attributes
    echo '*.png -delta' >> info/attributes

  • Repack repository and collect garbage

    git repack -a -d -F --window-memory 100m --max-pack-size 200m
    git gc
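
To see whether it actually helped, I compare object and pack statistics before and after (just a sanity check, not part of the steps above):

    git count-objects -v      # 'size-pack' is the total pack size in KiB, 'count' the loose objects
    du -sh .                  # rough on-disk size of the bare repo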

Update 3:

Some unexpected side effects after this operation: see "Issues after trying to repack a git repo for improved performance".

Asked Sep 18 '12 by anr78

People also ask

Can git handle large binary files?

Git does not handle large files well on its own. That's why many Git teams add Git LFS to deal with large files in Git.

What is git compression?

Many Git users are curious about the lack of delta compression at the object (blob) level when commits are first written; that work is deferred until a pack file is written. Loose objects are stored compressed, but in non-delta format, at the time of each commit.
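
A quick way to see the loose-versus-packed distinction in a normal (non-bare) checkout; the commands are only illustrative:

    find .git/objects -type f | grep -v pack | head    # loose objects: zlib-compressed, no deltas
    git gc                                             # packing happens here, including delta compression
    git count-objects -v                               # 'count' = loose objects, 'in-pack' = packed objects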

Does git LFS compress files?

Git LFS (Large File Storage) is a Git extension developed by Atlassian, GitHub, and a few other open source contributors that reduces the impact of large files in your repository by downloading the relevant versions of them lazily.
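
For reference, this is roughly how LFS tracking is set up; the '*.bin' pattern and the file name are just examples (and LFS postdates the git versions mentioned in the question above):

    git lfs install                     # enable the LFS filters
    git lfs track '*.bin'               # writes a filter rule to .gitattributes
    git add .gitattributes big-file.bin
    git commit -m "Track large binaries with Git LFS"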

What is git delta compression?

Delta compression (also called delta encoding, or just delta coding) stores only the differences from a known base file, discarding anything that is identical. To decompress, you apply the stored changes (also called "diffs") to the base file, leaving you with the new file.
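
To see which objects in a pack are stored as deltas and how large they are, something like this can be used (assumes the objects live in a single pack; column 3 is the object size, and the optional trailing columns show the delta depth and base object):

    git verify-pack -v .git/objects/pack/pack-*.idx | sort -k 3 -n | tail -10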


1 Answer

While your question asks how to make your current repo more efficient, I don't think that's feasible.

Follow the advice of the crowd:

  1. Move your big binaries out of your repo
  2. Move your dev environment to a virtual machine image: https://www.virtualbox.org/
  3. Use this Python script to clean your repo of those large binary blobs (I used it on my repo and it worked great): https://gist.github.com/1433794 (a plain-git sketch of the same idea is below)
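
If you would rather use plain git than the script, the same idea looks roughly like this; the path is only an example, and since history is rewritten everyone has to re-clone afterwards:

    # remove a large binary from every commit (example path)
    git filter-branch --index-filter \
      'git rm --cached --ignore-unmatch path/to/big-binary.zip' \
      --prune-empty --tag-name-filter cat -- --all
    # drop the backup refs and old objects so the space is actually reclaimed
    rm -rf .git/refs/original
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
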
Answered Sep 22 '22 by Amir Rubin