I have a 190 MB plain text file that I want to track on GitHub.
The text file is a pronunciation lexicon for our text-to-speech engine. We regularly add and modify lines in the file, and the diffs are fairly small, so it's perfect for git in that sense.
However, GitHub has a strict 100 MB file size limit in place. I have tried the GitHub Large File Storage service, but that uploads a new version of the entire 190 MB file every time it changes - so that would quickly grow to many gigabytes if I go down that path.
I would like to keep it as a single file rather than splitting it, because that is how our workflow currently works, and supporting multiple text files as input/output in our tools would require some coding (and we don't have many development resources).
One idea I've had is that maybe it's possible to set up some pre- and post-commit hooks to split and concatenate the big file automatically? Would that be possible?
Other ideas?
Edit: I am aware of the 100 MB file size limitation described in the similar questions here on StackOverflow, but I don't consider my question a duplicate because I'm asking for the specific case where the diffs are small and frequent (I'm not trying to upload a big ZIP file or anything). However, my understanding is that git-lfs is only appropriate for files that rarely change, and that normal git would be the perfect fit for the kind of file I'm describing; except that GitHub has a file size restriction.
Update: I spent yesterday experimenting with a small cross-platform program that splits and joins files into smaller files using git hooks. It kind of works, but not really satisfactorily. You need to have your big text file excluded by .gitignore, which makes git unaware of whether or not it has changed. The split files are not initially detected by git status or git commit, which leads to the same issue as described in this SO question, which is quite annoying: Pre-commit script creates mysqldump file, but "nothing to commit (working directory clean)"? Setting up a cron job (Linux) or a scheduled task (Windows) to regenerate the split files regularly might fix that, but it's not easy to set up automatically, might cause performance issues on the user's computer, and is just not a very elegant solution. Some hacky solutions like dynamically modifying .gitignore might also be needed, and in no way would you get a diff of the actual text file, only of the split files (although that might be acceptable as they would be very similar).
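For illustration, the split and join steps described above could look roughly like the following Python sketch. It is not the program from the update; the names split_file and join_files, the part-NNNNN.txt naming, and the 200,000-line chunk size are all assumptions made for the example.

```python
from pathlib import Path

# Assumed chunk size; choose one that keeps every part well under 100 MB.
CHUNK_LINES = 200_000


def _write_part(parts_dir, index, lines):
    """Write one chunk to a zero-padded part file so the names sort correctly."""
    path = parts_dir / f"part-{index:05d}.txt"
    path.write_bytes(b"".join(lines))
    return path


def split_file(big_file, parts_dir, chunk_lines=CHUNK_LINES):
    """Split big_file into part files of at most chunk_lines lines each."""
    parts_dir.mkdir(parents=True, exist_ok=True)
    # Remove stale parts so a shrinking file does not leave old chunks behind.
    for old in parts_dir.glob("part-*.txt"):
        old.unlink()
    parts = []
    buffer = []
    # Binary mode keeps line endings byte-for-byte identical.
    with big_file.open("rb") as src:
        for line in src:
            buffer.append(line)
            if len(buffer) == chunk_lines:
                parts.append(_write_part(parts_dir, len(parts), buffer))
                buffer = []
    if buffer:
        parts.append(_write_part(parts_dir, len(parts), buffer))
    return parts


def join_files(parts_dir, big_file):
    """Concatenate the part files, in name order, back into big_file."""
    with big_file.open("wb") as dst:
        for part in sorted(parts_dir.glob("part-*.txt")):
            dst.write(part.read_bytes())
```

In a hook-based setup, split_file would run before a commit and join_files after a checkout or merge. This only covers the file handling; it does not address the git status and .gitignore quirks mentioned above.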
So, having slept on it, today I think the git hook approach is not a good option after all, as it has too many quirks. As has been suggested by @PyRulez, I think I'll have to look at other services than GitHub (unfortunately, since I love GitHub). A hosted solution would be preferable to avoid having to manage our own server. I'd also like it to be publicly available...
Update 2: I've looked at some alternatives to GitHub and currently I'm leaning towards using GitLab. I've contacted GitHub support about the possibility of raising the 100MB limit, but if they won't do that I'll just switch to GitLab for this particular project.
GitHub blocks pushes that exceed 100 MB. To track files beyond this limit, you must use Git Large File Storage (Git LFS). For more information, see "About Git Large File Storage." If you need to distribute large files within your repository, you can create releases on GitHub.com instead of tracking the files.
GitHub has a strict file size limit of 100 MB. If you are just uploading lines of code, this is not something that you need to worry about. However, if you want to upload some data, or something in binary, this is a limit that you might run into. Here are three different ways to overcome the 100 MB limit.
Files that you add to a repository via a browser are limited to 25 MB per file. You can add larger files, up to 100 MB each, via the command line. For more information, see "Adding a file to a repository using the command line." To add files larger than 100 MB, you must use Git Large File Storage.
You can use clean and smudge filters to compress your file. Normally this isn't necessary, since Git compresses objects internally, but since GitHub limits the size of the file itself, it may help. The main commands would be something like:
git config filter.compress.clean gzip
git config filter.compress.smudge "gzip -d"
The filter only applies to paths that carry the filter=compress attribute in .gitattributes. GitHub will see this as a compressed file, but on each computer, it will appear to be a text file.
See https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes for more details.
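Whether the gzip filter helps depends on the compressed output staying under the limit. A rough way to check that before setting anything up is sketched below in Python. The names gzipped_size and fits_under_limit are made up for this example, and the default limit of 100 * 1024 * 1024 bytes is an assumption about how GitHub counts "100 MB".

```python
import gzip

# Assumption: the 100 MB limit is counted as 100 * 1024 * 1024 bytes.
ASSUMED_LIMIT_BYTES = 100 * 1024 * 1024


class _ByteCounter:
    """Minimal write-only sink that counts bytes instead of storing them."""

    def __init__(self):
        self.total = 0

    def write(self, data):
        self.total += len(data)
        return len(data)

    def flush(self):
        pass


def gzipped_size(path, chunk_size=1024 * 1024):
    """Return the size in bytes of the file at path after gzip compression."""
    counter = _ByteCounter()
    # compresslevel=6 mirrors the default level of the gzip command-line tool;
    # mtime=0 keeps the output independent of the current time.
    with open(path, "rb") as src, gzip.GzipFile(
        fileobj=counter, mode="wb", compresslevel=6, mtime=0
    ) as gz:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            gz.write(chunk)
    return counter.total


def fits_under_limit(path, limit_bytes=ASSUMED_LIMIT_BYTES):
    """Check whether the gzip-compressed file would be smaller than limit_bytes."""
    return gzipped_size(path) < limit_bytes
```

Calling fits_under_limit on the lexicon file gives a quick yes/no answer; gzipped_size shows how much headroom is left as the file grows.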
Alternatively, you could have clean post the file to an online pastebin such as http://pastebin.com/, and have smudge fetch it back from there. Many other combinations are possible with clean and smudge.