
Fixing up a git repo that is slowed because of big binary files


We have a git repo containing both source code and binaries. The bare repo has now reached ~9GB, and cloning it takes ages. Most of the time is spent in "remote: Compressing objects". After a commit with a new version of one of the bigger binaries, a fetch takes a long time, also spent compressing objects on the server.

After reading "git pull without remotely compressing objects", I suspect delta compression of binary files is what hurts us as well, but I'm not 100% sure how to go about fixing this.

What are the exact steps to fix the bare repo on the server? My guess:

  • Add entries like '*.zip -delta' for all the extensions I want to exclude from delta compression to .git/info/attributes
  • Run 'git repack', but with what options? Would -adF repack everything, and leave me with a repo where no delta compression has ever been done on the specified file types?
  • Run 'git prune'. I thought this was done automatically, but when I experimented with a bare clone of the repo, running it decreased the size by ~2GB
  • Clone the repo, add and commit a .gitattributes with the same entries as I added in .git/info/attributes on the bare repo

Am I on to something?
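
In command form, my guess so far would be something like this (run inside the bare repo on the server; the path and the '*.zip' pattern are just examples, and I'm not sure about the repack options):

    cd /path/to/repo.git                  # bare repo on the server (example path)
    echo '*.zip -delta' >> info/attributes
    git repack -a -d -F                   # -F to recompute deltas from scratch?
    git prune                             # drop unreachable loose objects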

Update:

Some interesting test results on this. Today I started a bare clone of the problematic repo. Our not-so-powerful server with 4 GB of RAM ran out of memory and started swapping. After 3 hours I gave up...

Then I instead cloned a bare repo from my up-to-date working copy. Cloning that one between workstations took ~5 minutes. I then pushed it up to the server as a new repo. Cloning that repo took only 7 minutes.

If I interpret this correctly, a better packed repo performs much better, even without disabling the delta-compression for binary files. I guess this means the steps above are indeed what I want to do in the short term, but in addition I need to find out how to limit the amount of memory git is allowed to use for packing/compression on the server so I can avoid the swapping.

In case it matters: The server runs git 1.7.0.4 and the workstations run 1.7.9.5.

Update 2:

I did the following steps on my test repo, and I think I will risk doing them on the server (after a backup); a quick before/after check is sketched after the list:

  • Limit memory usage when packing objects

    git config pack.windowMemory 100m
    git config pack.packSizeLimit 200m

  • Disable delta compression for some extensions

    echo '*.tar.gz -delta' >> info/attributes
    echo '*.tar.bz2 -delta' >> info/attributes
    echo '*.bin -delta' >> info/attributes
    echo '*.png -delta' >> info/attributes

  • Repack repository and collect garbage

    git repack -a -d -F --window-memory 100m --max-pack-size 200m
    git gc
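
To see whether it actually helped, I compare object and pack statistics before and after (just a sanity check, not part of the steps above):

    git count-objects -v      # 'size-pack' is the total pack size in KiB, 'count' the loose objects
    du -sh .                  # rough on-disk size of the bare repo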

Update 3:

Some unexpected side effects after this operation: see "Issues after trying to repack a git repo for improved performance".

Asked Sep 18 '12 by anr78

People also ask

Can git handle large binary files?

Git does not handle large files well on its own. That's why many Git teams add Git LFS to deal with large files in Git.

What is git compression?

Many Git users are curious about the lack of delta compression at the object (blob) level when commits are first written; that work is deferred until a pack file is written. Loose objects are stored compressed, but in non-delta format, at the time of each commit.
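
A quick way to see the loose-versus-packed distinction in a normal (non-bare) checkout; the commands are only illustrative:

    find .git/objects -type f | grep -v pack | head    # loose objects: zlib-compressed, no deltas
    git gc                                             # packing happens here, including delta compression
    git count-objects -v                               # 'count' = loose objects, 'in-pack' = packed objects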

Does git LFS compress files?

Git LFS (Large File Storage) is a Git extension developed by Atlassian, GitHub, and a few other open source contributors that reduces the impact of large files in your repository by downloading the relevant versions of them lazily.
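
For reference, this is roughly how LFS tracking is set up; the '*.bin' pattern and the file name are just examples (and LFS postdates the git versions mentioned in the question above):

    git lfs install                     # enable the LFS filters
    git lfs track '*.bin'               # writes a filter rule to .gitattributes
    git add .gitattributes big-file.bin
    git commit -m "Track large binaries with Git LFS"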

What is git delta compression?

Delta compression (also called delta encoding, or just delta coding) stores only the differences from a known base file, discarding anything that is identical. To decompress, you apply the stored changes (also called "diffs") to the base file, leaving you with the new file.
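
To see which objects in a pack are stored as deltas and how large they are, something like this can be used (assumes the objects live in a single pack; column 3 is the object size, and the optional trailing columns show the delta depth and base object):

    git verify-pack -v .git/objects/pack/pack-*.idx | sort -k 3 -n | tail -10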


1 Answer

While your question asks how to make your current repo more efficient, I don't think that's feasible.

Follow the advice of the crowd:

  1. Move your big binaries out of your repo
  2. Move your dev environment to a virtual machine image: https://www.virtualbox.org/
  3. Use this Python script to clean your repo of those large binary blobs (I used it on my repo and it worked great): https://gist.github.com/1433794 (a plain-git sketch of the same idea is below)
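
If you would rather use plain git than the script, the same idea looks roughly like this; the path is only an example, and since history is rewritten everyone has to re-clone afterwards:

    # remove a large binary from every commit (example path)
    git filter-branch --index-filter \
      'git rm --cached --ignore-unmatch path/to/big-binary.zip' \
      --prune-empty --tag-name-filter cat -- --all
    # drop the backup refs and old objects so the space is actually reclaimed
    rm -rf .git/refs/original
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
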
Answered Sep 22 '22 by Amir Rubin