Versioning large text files in git

Tags: git

I've used git for a while for source control and I really like it. So I started investigating using git to store lots of large binary files, which I'm finding just isn't git's cup of tea. So how about large text files? It seems like git should handle those just fine, but I'm having problems with that too.

I'm testing this out using a 550 MB mbox-style text file. I ran git init to create a new repo for this. Here are my results:

  • git add and git commit - total repo size is 306 MB - repo contains one object that is 306 MB in size
  • add one email to the mailbox file and git commit - total repo size is 611 MB - repo contains two objects that are each 306 MB in size
  • add one more email to the mailbox file and git commit - total repo size is 917 MB - repo contains three objects that are each 306 MB in size

So every commit adds a new copy of the mailbox file to the repo. Now I want to try to get the size of the repo down to something manageable. Here are my results:

  • git repack -adf - total repo size is 877 MB - repo contains one pack file that is 876 MB in size
  • git gc --aggressive - total repo size is 877 MB - repo contains one pack file that is 876 MB in size
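
One way to check whether a pack really stores later versions as deltas, rather than as full copies, is to list the pack's contents. The following is only a minimal sketch on a small synthetic file: the file name big.txt, its placeholder content, the temporary directory, and the commit identity are all assumptions made for illustration, not part of my actual test.

```shell
#!/bin/sh
# Sketch: repack a tiny throwaway repo and inspect how objects are stored.
set -eu

repo=$(mktemp -d)   # throwaway directory (assumption, not the real repo)
cd "$repo"
git init -q .

# Placeholder text file standing in for the mailbox.
awk 'BEGIN { for (i = 0; i < 20000; i++) print "placeholder line", i }' > big.txt
git add big.txt
git -c user.name=Test -c user.email=test@example.com commit -q -m "first"

# Append a small amount of text and commit again.
echo "one appended line" >> big.txt
git -c user.name=Test -c user.email=test@example.com commit -q -a -m "second"

# Rewrite everything into a single pack, recomputing deltas.
git repack -a -d -f -q

# Number of objects stored in packs.
in_pack=$(git count-objects -v | awk '/^in-pack:/ { print $2 }')

# verify-pack -v prints extra columns (depth, base object) for deltified
# objects, so a blob line with 7 fields is a blob stored as a delta.
blob_deltas=$(git verify-pack -v .git/objects/pack/pack-*.idx |
    awk '$2 == "blob" && NF == 7 { n++ } END { print n + 0 }')

echo "objects in pack: $in_pack, blobs stored as deltas: $blob_deltas"
```

If the second version of the file shows up as a delta here, the pack only pays for the appended text, which is the behaviour I was hoping to see on the real mailbox.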

I would expect to be able to get the repo down to something around 306 MB, but I can't figure out how. Anything larger suggests that a lot of duplicate data is being stored.

My hope is that the repo would only increase by the size of the new email received, not by the size of the entire mailbox. I'm not trying to version control email here, but this seems to be the main thing holding me back from using a nightly script to incrementally back up users' home directories.

Any advice on how to keep the repo size from blowing up when appending a small amount of text to the end of a very large text file?

I've looked at bup and git annex, but I'd really like to stick with just plain old git if possible.

Thank you for your help!

user1020774 asked Oct 30 '11 15:10



1 Answer

Git isn't the greatest backup tool, but it should be able to handle appending to a text file very efficiently, so I was suspicious of your results. I repeated your experiment with a 354 MB file and git 1.7.7 on OS X. Here are my actions and the resulting size of .git after each one.

  1. git init (52K)
  2. git add mbox && git commit (110M)
  3. cat mail1 >> mbox && git commit -a -m (219M)
  4. git gc (95M)
  5. cat mail2 >> mbox && git commit -a -m (204M)
  6. git gc (95M)

As you can see, git is being very efficient: 94 megs is the size of the compressed mbox, so the repository can't get much smaller.
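
The steps above can be sketched as a script. This is a scaled-down illustration, not my original run: it assumes a small generated file in place of a real mbox, and the file contents, temporary directory, commit messages, and commit identity are all placeholders.

```shell
#!/bin/sh
# Sketch: append to a text file across commits, then compare .git size
# before and after git gc.
set -eu

repo=$(mktemp -d)   # throwaway directory (placeholder location)
cd "$repo"
git init -q .

# Pass an identity per command so the sketch does not rely on global config.
commit() {
    git -c user.name=Test -c user.email=test@example.com commit -q -a -m "$1"
}

# Stand-in for the mbox: generated text, not real mail.
awk 'BEGIN { for (i = 0; i < 50000; i++)
    printf "From user%d@example.com placeholder mail body line %d\n", i, i }' > mbox

git add mbox
commit "initial mbox"

# Append one small message and commit, twice.
echo "From new1@example.com appended message 1" >> mbox
commit "append mail1"
echo "From new2@example.com appended message 2" >> mbox
commit "append mail2"

loose_kb=$(du -sk .git | cut -f1)    # three full compressed copies as loose objects
git gc -q
packed_kb=$(du -sk .git | cut -f1)   # one pack, later versions stored as deltas

echo "before gc: ${loose_kb}K  after gc: ${packed_kb}K"
```

Before git gc, each commit leaves a full compressed copy of the file as a loose object, which is why the size jumps on every commit. Packing is what replaces those copies with deltas.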

I'm guessing you're either using an old version of git or your mbox file is being compressed or encrypted by your mailer.

  • Check that the contents of your mbox, as git sees them, are plain text.
  • If you're not using the latest git, upgrade and try again.
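
Both checks can be done from the shell. This is a rough sketch, assuming a grep that supports the -I option (GNU and BSD grep do; it is not in POSIX). When no path is given it creates a small placeholder file so that it runs on its own; pass the path to your real mbox as the first argument instead.

```shell
#!/bin/sh
# Sketch: check whether a file looks like plain text and print the git version.
set -eu

if [ $# -ge 1 ]; then
    mbox=$1
else
    # No path given: create a small placeholder file (not a real mailbox).
    mbox=$(mktemp)
    printf 'From a@example.com\nSubject: hi\n\nbody\n' > "$mbox"
fi

# With -I, grep treats binary files as containing no matches, so matching
# the empty pattern succeeds only if grep considers the file to be text.
if grep -Iq '' "$mbox"; then
    kind=text
else
    kind=binary
fi
echo "$mbox looks like: $kind"

# Report which git is installed.
git --version
```

A result of "binary" would suggest the mailer is compressing or encrypting the file, in which case appended mail can change the whole file on disk.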
Schwern answered Oct 11 '22 13:10