Git and binary data, best storage method

I'm in the unfortunate situation of having to store some binary files in Git.

However, I can choose how the data is stored on disk and in Git, since it's our own format which only the build system needs to read.

I'd like to avoid talking specifics too much, since I don't think it's so important - but to give some context, these are many icon files, though the same question would apply to many small sound files or 3D models too.

Converting these files into one large image will be a build step, so the images can be stored however we like in Git. The storage options I can see are:

  • Binary, compressed (e.g. PNG for images, FLAC for sound)
  • Binary, uncompressed (e.g. PPM for images, uncompressed WAV for sound)
  • An ASCII representation of the binary data (e.g. MIME/base64 encoding, XPM for images)
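
For illustration, the build step might look something like this with ImageMagick (just a sketch; the real pipeline, paths, and formats would differ):

    # build step sketch: convert uncompressed PPM sources kept in git
    # into the PNG files the application actually ships (needs ImageMagick)
    mkdir -p build
    for f in icons/*.ppm; do
        convert "$f" "build/$(basename "${f%.ppm}").png"
    done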

Let's assume there will be occasional changes to some files - so avoiding the storage of a whole new binary blob for every small change to a pixel would be nice.

I'm interested to know:

  • Which options will store a totally new binary blob each time the binary file changes (even by a few bytes)?
  • Does Git diff uncompressed binary data better than compressed data (which may change a lot even with minor edits to the uncompressed data)?
  • I would assume storing many small binary files has less overhead long term than one large binary file, given that only some of the files are periodically modified. Can Git handle small changes to large binary files efficiently?

All things considered, what are the best options for avoiding a large Git repository (as edits are made to the binary files), assuming binary files can't be avoided completely?

asked Dec 19 '13 by ideasman42

People also ask

Should I store binary files in git?

You should use Git LFS if you have large or binary files to store in Git repositories. That's because Git is decentralized: every developer has the full change history on their computer, so every version of every large file would otherwise be copied to every clone.
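
For reference, tracking a file type with Git LFS takes only a couple of commands (assuming the git-lfs extension is installed):

    git lfs install          # one-time setup of the LFS hooks
    git lfs track "*.png"    # store matching files as LFS pointers
    git add .gitattributes   # the tracking rule lives here
    git commit -m "Track PNG files with Git LFS"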

How do you store binary data?

In a database such as PostgreSQL, binary data can be stored in a table using the bytea data type, or by using the Large Object feature, which stores the binary data in a separate table in a special format and refers to it by storing a value of type oid in your table.

Does git compress binary files?

It can, literally, compress (or "deltify") any binary data against any other binary data—but the results will be poor unless the inputs are well-chosen. It's the input choices that are the real key here. Git also has a technical documentation file describing how objects are chosen for deltification.

How does Git deal with binary files?

Git can usually detect binary files automatically. It will attempt to store delta-based changesets when that is less expensive (not always the case). Submodules are used if you want to reference other Git repositories within your project.
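
The automatic detection can also be overridden per path in .gitattributes; a sketch (the *.icon pattern is just an example):

    # .gitattributes: force paths to be treated as binary, and
    # optionally tell the packer not to attempt delta compression
    *.icon  binary
    *.flac  binary -delta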


1 Answer

Which options will store a totally new binary blob each time the binary file changes (even by a few bytes)?

All of them. All blobs (indeed, all objects in the repo) are stored "intact" (more or less) whenever they are "loose objects". The only thing done with them is to give them a header and compress them with deflate compression.
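
You can see this for yourself: a loose object is just a "<type> <size>" header plus the content, run through deflate, and stored under .git/objects. For example (inside any repository):

    # create a blob and locate its loose-object file on disk
    sha=$(echo 'hello' | git hash-object -w --stdin)
    ls -l .git/objects/${sha:0:2}/${sha:2}   # the deflated loose object
    git cat-file -t "$sha"                   # -> blob
    git cat-file -s "$sha"                   # -> 6 (content size in bytes)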

At the same time, though, loose objects are eventually combined into "packs". Git does delta-compression on files in packs: see Is the git binary diff algorithm (delta storage) standardized?. Based on the answers there, you'd be much better off not "pre-compressing" the binaries, so that the pack-file delta algorithm can find long strings of matching binary data.
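
One way to check what the packer did with your particular data is to repack and inspect the result; deltified objects show a chain depth and the object they were encoded against:

    git gc                    # move loose objects into a pack
    git verify-pack -v .git/objects/pack/pack-*.idx | head -20
    # columns: sha1 type size size-in-pack offset [depth base-sha1]
    # rows with a depth and base are stored as deltas against that base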

Does Git diff uncompressed binary data better than compressed data (which may change a lot even with minor edits to the uncompressed data)?

I have not tried it, but the overall implication is that the answer to this should be "yes".
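
A quick shell experiment illustrates why: a one-byte edit to compressible data perturbs most of the deflate stream, which leaves the delta matcher little to work with (a sketch using gzip as a stand-in for PNG/FLAC-style compression):

    seq 1 200000 > a.txt                       # highly compressible data
    cp a.txt b.txt
    printf 'X' | dd of=b.txt bs=1 seek=1000 conv=notrunc 2>/dev/null
    cmp -l a.txt b.txt | wc -l                 # 1 byte differs in the raw data
    gzip -nk a.txt b.txt                       # -n: reproducible output
    cmp -l a.txt.gz b.txt.gz | wc -l           # far more bytes differ after compression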

I would assume storing many small binary files has less overhead long term than one large binary file, given that only some of the files are periodically modified. Can Git handle small changes to large binary files efficiently?

Certainly all files that are completely unchanged will be stored with a lot of "de-duplication" instantly, as their SHA-1 checksums will be identical across all commits, so that each tree names the very same blob in the repository. If foo.icon is the same across thousands of commits, there's just the one blob (whatever the SHA-1 for foo.icon turns out to be) stored.
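
This is easy to demonstrate, since a blob's ID is a pure function of its content (reusing the hypothetical foo.icon):

    # identical content always produces the identical blob ID,
    # so an unchanged file costs nothing extra per commit
    git hash-object foo.icon     # prints some SHA-1
    cp foo.icon bar.icon
    git hash-object bar.icon     # same SHA-1: one blob, two tree entries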


I'd recommend experimenting a bit: create some dummy test repos with proposed binaries, make proposed changes, and see how big the repos are before and after running git gc to re-pack the loose objects. Note that there are a lot of tuneables; in particular, you might want to fuss with window, depth and window-memory settings (which can be set on command lines or in git config entries).
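
As a starting point for such an experiment (the numbers here are arbitrary, not recommendations):

    git count-objects -v         # note size-pack (in KiB) before
    git repack -a -d -f --window=250 --depth=50 --window-memory=1g
    git count-objects -v         # compare size-pack after
    # or persist the settings so future `git gc` runs use them:
    git config pack.window 250
    git config pack.depth 50
    git config pack.windowMemory 1g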

answered Oct 03 '22 by torek