How to manage large data files with GitHub?

I have one (for now) large text data file of 120 MB.

Is it a poor practice to put it in the repo? Does it affect search functionality on GitHub?

It seems like it is a bad idea because the entire source code is only 900 lines.

Not planning on updating the file.

Could put it on Dropbox or Google Docs, but then it is separate from the repo.

If not GitHub, is there a better way of managing/backing up large data files?

Asked Oct 29 '12 by B Seven

People also ask

Can GitHub handle large files?

GitHub enforces a hard per-file limit of 100 MB. If you are only uploading source code, this is rarely something you need to worry about. However, if you want to upload data files or binaries, it is a limit you can easily run up against.

How do I manage large files in Git?

Use the Git Large File Storage (LFS) extension to speed up the handling of large files; it works with existing repositories (GitHub, Bitbucket Cloud, and others). To remove large files that are already committed, the bfg-repo-cleaner utility can rewrite a repository's history.

How do I store large files on GitHub?

If you want to store a large file on GitHub, you can, but you'll need to use Git Large File Storage (LFS). Install Git LFS on your computer, tell it which file patterns to track, and push as usual; the repository then stores small pointer files instead of the full file contents.
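A minimal sketch of that workflow, assuming `git-lfs` is installed and using `data.txt` as a placeholder for the real large file:

```shell
# Sketch: track a large data file with Git LFS in a fresh repo.
# "data.txt" stands in for the real 120 MB file.
set -e
command -v git-lfs >/dev/null 2>&1 || { echo "git-lfs not installed; skipping"; exit 0; }

tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.name demo
git config user.email demo@example.com
git lfs install --local            # enable LFS hooks for this repo only
git lfs track "data.txt"           # writes a tracking rule into .gitattributes
head -c 1024 /dev/zero > data.txt  # small stand-in for the large file
git add .gitattributes data.txt
git commit -qm "Track data file with Git LFS"
cat .gitattributes                 # data.txt filter=lfs diff=lfs merge=lfs -text
```

After this, `git push` sends the file content to the LFS store and only a small pointer goes into regular Git history.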

Can I upload large dataset to GitHub?

GitHub's web interface rejects uploads larger than 25 MB; if you try, you get an error. However, you can still push larger files (up to the 100 MB per-file limit) from the command line with a normal Git client: install Git, clone or initialize the repository, add and commit the file, and push.


2 Answers

Put it in the repo if:

  1. You want to keep track of changes to it.
  2. It is actually part of the project, and people should receive it when they clone the repo.

Don't put it in the repo (use .gitignore to exclude it) if:

  1. It changes often, the changes are not meaningful, and you don't want to keep the history.
  2. It is available online (or you can make it available online), and you can put a link in the repo so people know where to find it.
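If you go the exclusion route, a minimal sketch (the file name is a placeholder, and the README note stands in for whatever link you'd actually publish):

```shell
# Sketch: keep the large file out of the repo and verify git ignores it.
# "data.txt" stands in for the real 120 MB file.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
head -c 1024 /dev/zero > data.txt           # stand-in for the large file
echo "data.txt" >> .gitignore               # exclude it from version control
echo "Data file hosted externally; see the project page." > README.md
git add .gitignore README.md
git check-ignore data.txt                   # prints the path: git is ignoring it
```

`git check-ignore` exits successfully only when the path is matched by an ignore rule, so it's a quick sanity check that the file really won't be committed by accident.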

Dropbox is fine if you don't have lots of people downloading it; for hosting at scale, Amazon S3 is your best bet.

Answered Sep 22 '22 by Ali

There are good ways to handle this situation. For example, when I am working on a project that analyzes data, especially after cleaning and preprocessing steps, it is unhelpful to share the code but not the data set (within reason, of course, given the data set's size). Here is what I have found:

  • git lfs (Large File Storage) lets you track, commit, and push binaries, data files, images, etc. to the same remote, and you don't have to pull everything when you clone the repo.

  • git-annex uses its own commands, so you commit the repo and the annexed files separately. It looks great for managing these files on any remote, such as a hard drive, S3, Google Drive, and many more.

Someone has made a nice comparison of git-annex vs. git lfs here, and this post compares several methods in short form.

They both seem great. git-annex is currently more mature, but git lfs is developed by GitHub, which is the host I use, so I went with git lfs.

Answered Sep 22 '22 by Merlin