I have one (for now) large text data file of 120 MB.
Is it a poor practice to put it in the repo? Does it affect search functionality on GitHub?
It seems like it is a bad idea because the entire source code is only 900 lines.
Not planning on updating the file.
Could put it on Dropbox or Google Docs, but then it is separate from the repo.
If not GitHub, is there a better way of managing/backing up large data files?
GitHub has a strict per-file limit of 100 MB. If you are just uploading source code, this is not something you need to worry about. However, if you want to upload data files or anything binary, this is a limit you are likely to run into.
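If you are not sure whether anything in your working tree is over the limit, here is one quick way to check on a Unix-like shell:

    # list files larger than 100 MB, ignoring Git's own object store
    find . -type f -size +100M -not -path "./.git/*"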
Use the Git Large File Storage (LFS) extension to speed up the handling of large files; it works with an existing GitHub or Bitbucket Cloud repository. If large files have already been committed, use the bfg-repo-cleaner utility to rewrite the Git history of the repository and remove them.
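A rough sketch of how bfg-repo-cleaner is typically run against a mirror clone (the repository URL and name below are placeholders):

    git clone --mirror https://github.com/user/my-repo.git
    java -jar bfg.jar --strip-blobs-bigger-than 100M my-repo.git
    cd my-repo.git
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
    git push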
If you want to store a large file on GitHub, you can, but you'll need something called Git Large File Storage (LFS). Install Git LFS on your computer and track the file with it; the repository then holds only a small pointer file, while the actual content is stored separately by the LFS server.
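A minimal sketch of that workflow, assuming the data file is data/corpus.txt (the path is just a placeholder):

    git lfs install                      # one-time setup on your machine
    git lfs track "data/corpus.txt"      # records the rule in .gitattributes
    git add .gitattributes data/corpus.txt
    git commit -m "Track large data file with Git LFS"
    git push origin main                 # the file content goes to LFS storage, not the normal repo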
GitHub does not allow uploading files larger than 25 MB through the web browser; if you try, you will get an error. Nevertheless, you can push larger files (up to the 100 MB per-file limit) into your GitHub repository from the command line, for example with Git Bash. Download and install Git on your PC first.
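The usual command-line sequence looks roughly like this (the URL and file name are placeholders):

    git clone https://github.com/user/my-repo.git
    cd my-repo
    cp /path/to/large-data-file.txt .    # copy the file into the working tree
    git add large-data-file.txt
    git commit -m "Add large data file"
    git push origin main                 # note: single files over 100 MB are still rejected unless you use Git LFS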
Put it in the repo if:
1- you want to keep track of the changes
2- it is actually a part of the project and you want people to receive it when they clone the repo
Don't put it in the repo (use .gitignore to exclude it; a minimal example follows this list) if:
1- it changes often but the changes are not meaningful and you don't want to keep the history
2- it is available online (or you can make it available online) and you put a link or a note in the repo so people know where to find it
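A minimal .gitignore sketch for that second case, assuming the file lives at data/corpus.txt (a placeholder name) and is hosted elsewhere:

    # data/corpus.txt is not versioned here; see the download link in the README
    data/corpus.txt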
Dropbox is good if you don't have lots of people downloading it; Amazon S3 is your best bet for hosting it.
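If you go the S3 route, one possible upload-and-share flow with the AWS CLI (the bucket name is a placeholder, and a public-read ACL is only one of several ways to make the object downloadable):

    aws s3 cp data/corpus.txt s3://my-data-bucket/corpus.txt --acl public-read
    # people can then fetch it with, for example:
    curl -O https://my-data-bucket.s3.amazonaws.com/corpus.txt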
There are good ways to handle this situation. For example, when I am working on a project that analyses data, especially after cleaning and preprocessing steps, it's a shame to share the code but not the data set (within reason, of course, given the size of the data set). Here is what I have found:
git lfs (Git Large File Storage) lets you track, commit, and push binaries, data files, images, etc. to the same remote, and you don't have to pull everything when you clone the repo.
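For example, to clone without downloading the LFS content up front and then fetch only the file you need (the path is a placeholder):

    GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/user/my-repo.git   # skip LFS downloads during checkout
    cd my-repo
    git lfs pull --include="data/corpus.txt"                              # fetch just this file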
git-annex uses its own commands, so you will be committing the repo and the annexed files separately. It looks great for managing these files on any remote, such as a hard drive, S3, Google Drive, and many more.
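A rough sketch of the git-annex flow with an S3 special remote (the remote name, bucket, and file path are placeholders, and the S3 remote expects AWS credentials in your environment):

    git annex init "my laptop"
    git annex add data/corpus.txt
    git commit -m "Add data file to the annex"
    git annex initremote mys3 type=S3 encryption=none bucket=my-data-bucket
    git annex copy data/corpus.txt --to mys3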
Someone has made a nice comparison of git-annex vs. git lfs here, and this post compares several methods in short form.
They both seem great. git-annex is currently more mature, but git lfs is developed by GitHub, which is what I use, so that is why I am using git lfs.