Git and binary data, best storage method

I'm in the unfortunate situation of having to store some binary files in Git.

However, I can choose how the data is stored on disk and in Git, since it's our own format which only the build system needs to read.

I'd like to avoid talking specifics too much, since I don't think it's so important - but to give some context, these are many icon files, though the same question would apply to many small sound files or 3D models too.

Converting these files into one large image will be a build step, so the images can be stored however we like in Git. The storage options I can see are:

  • Binary, compressed (e.g. PNG for images, FLAC for sound)
  • Binary, uncompressed (e.g. PPM for images, uncompressed WAV for sound)
  • An ASCII representation of the binary data (e.g. MIME/base64 encoding, XPM for images)
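
For illustration, the build step might look something like this with ImageMagick (just a sketch; the real pipeline, paths, and formats would differ):

    # build step sketch: convert uncompressed PPM sources kept in git
    # into the PNG files the application actually ships (needs ImageMagick)
    mkdir -p build
    for f in icons/*.ppm; do
        convert "$f" "build/$(basename "${f%.ppm}").png"
    done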

Let's assume there will be occasional changes to some files - so avoiding the storage of a whole new binary blob for every small change to a pixel would be nice.

I'm interested to know:

  • Which options will store a totally new binary blob each time the binary file changes (even by a few bytes)?
  • Does Git diff uncompressed binary data better than compressed data (which may change a lot even with minor edits to the uncompressed data)?
  • I would assume storing many small binary files has less overhead long term than one large binary file, given that only some of the files are periodically modified. Can Git handle small changes to large binary files efficiently?

All things considered, what are the best options for avoiding a large Git repository (as edits are made to the binary files), assuming binary files can't be avoided completely?

asked Dec 19 '13 by ideasman42

People also ask

Should I store binary files in git?

You should use Git LFS if you have large or binary files to store in Git repositories. That's because Git is decentralized: every developer has the full change history on their computer, so every version of every large file would otherwise be copied to every clone.
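
For reference, tracking a file type with Git LFS takes only a couple of commands (assuming the git-lfs extension is installed):

    git lfs install          # one-time setup of the LFS hooks
    git lfs track "*.png"    # store matching files as LFS pointers
    git add .gitattributes   # the tracking rule lives here
    git commit -m "Track PNG files with Git LFS"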

How do you store binary data?

In a database such as PostgreSQL, binary data can be stored in a table using the bytea data type, or by using the Large Object feature, which stores the binary data in a separate table in a special format and refers to it by storing a value of type oid in your table.

Does git compress binary files?

It can, literally, compress (or "deltify") any binary data against any other binary data—but the results will be poor unless the inputs are well-chosen. It's the input choices that are the real key here. Git also has a technical documentation file describing how objects are chosen for deltification.

How does Git deal with binary files?

Git can usually detect binary files automatically. It will attempt to store delta-based changesets when that is less expensive (not always the case). Submodules are used if you want to reference other Git repositories within your project.
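
The automatic detection can also be overridden per path in .gitattributes; a sketch (the *.icon pattern is just an example):

    # .gitattributes: force paths to be treated as binary, and
    # optionally tell the packer not to attempt delta compression
    *.icon  binary
    *.flac  binary -delta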


1 Answer

Which options will store a totally new binary blob each time the binary file changes (even by a few bytes)?

All of them. All blobs (indeed, all objects in the repo) are stored "intact" (more or less) whenever they are "loose objects". The only thing done with them is to give them a header and compress them with deflate compression.
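
You can see this for yourself: a loose object is just a "<type> <size>" header plus the content, run through deflate, and stored under .git/objects. For example (inside any repository):

    # create a blob and locate its loose-object file on disk
    sha=$(echo 'hello' | git hash-object -w --stdin)
    ls -l .git/objects/${sha:0:2}/${sha:2}   # the deflated loose object
    git cat-file -t "$sha"                   # -> blob
    git cat-file -s "$sha"                   # -> 6 (content size in bytes)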

At the same time, though, loose objects are eventually combined into "packs". Git does delta-compression on files in packs: see Is the git binary diff algorithm (delta storage) standardized?. Based on the answers there, you'd be much better off not "pre-compressing" the binaries, so that the pack-file delta algorithm can find long strings of matching binary data.
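
One way to check what the packer did with your particular data is to repack and inspect the result; deltified objects show a chain depth and the object they were encoded against:

    git gc                    # move loose objects into a pack
    git verify-pack -v .git/objects/pack/pack-*.idx | head -20
    # columns: sha1 type size size-in-pack offset [depth base-sha1]
    # rows with a depth and base are stored as deltas against that base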

Does Git diff uncompressed binary data better than compressed data (which may change a lot even with minor edits to the uncompressed data)?

I have not tried it, but the overall implication is that the answer to this should be "yes".
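
A quick shell experiment illustrates why: a one-byte edit to compressible data perturbs most of the deflate stream, which leaves the delta matcher little to work with (a sketch using gzip as a stand-in for PNG/FLAC-style compression):

    seq 1 200000 > a.txt                       # highly compressible data
    cp a.txt b.txt
    printf 'X' | dd of=b.txt bs=1 seek=1000 conv=notrunc 2>/dev/null
    cmp -l a.txt b.txt | wc -l                 # 1 byte differs in the raw data
    gzip -nk a.txt b.txt                       # -n: reproducible output
    cmp -l a.txt.gz b.txt.gz | wc -l           # far more bytes differ after compression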

I would assume storing many small binary files has less overhead long term than one large binary file, given that only some of the files are periodically modified. Can Git handle small changes to large binary files efficiently?

Certainly all files that are completely unchanged will be stored with a lot of "de-duplication" instantly, as their SHA-1 checksums will be identical across all commits, so that each tree names the very same blob in the repository. If foo.icon is the same across thousands of commits, there's just the one blob (whatever the SHA-1 for foo.icon turns out to be) stored.
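
This is easy to demonstrate, since a blob's ID is a pure function of its content (reusing the hypothetical foo.icon):

    # identical content always produces the identical blob ID,
    # so an unchanged file costs nothing extra per commit
    git hash-object foo.icon     # prints some SHA-1
    cp foo.icon bar.icon
    git hash-object bar.icon     # same SHA-1: one blob, two tree entries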


I'd recommend experimenting a bit: create some dummy test repos with proposed binaries, make proposed changes, and see how big the repos are before and after running git gc to re-pack the loose objects. Note that there are a lot of tuneables; in particular, you might want to fuss with window, depth and window-memory settings (which can be set on command lines or in git config entries).
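
As a starting point for such an experiment (the numbers here are arbitrary, not recommendations):

    git count-objects -v         # note size-pack (in KiB) before
    git repack -a -d -f --window=250 --depth=50 --window-memory=1g
    git count-objects -v         # compare size-pack after
    # or persist the settings so future `git gc` runs use them:
    git config pack.window 250
    git config pack.depth 50
    git config pack.windowMemory 1g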

answered Oct 03 '22 by torek