I have one (for now) large text data file of 120 MB.
Is it a poor practice to put it in the repo? Does it affect search functionality on GitHub?
It seems like it is a bad idea because the entire source code is only 900 lines.
Not planning on updating the file.
Could put it on Dropbox or Google Docs, but then it is separate from the repo.
If not GitHub, is there a better way of managing/backing up large data files?
GitHub has a strict per-file limit of 100 MB. If you are just uploading source code, this is not something you need to worry about. However, if you want to upload data files or anything binary, this is a limit you are likely to run into.
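If you are not sure whether anything in your working tree is over the limit, here is one quick way to check on a Unix-like shell:

    # list files larger than 100 MB, ignoring Git's own object store
    find . -type f -size +100M -not -path "./.git/*"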
Use the Git Large File Storage (LFS) extension to speed up the handling of large files; it works with an existing GitHub or Bitbucket Cloud repository. If large files have already been committed, use the bfg-repo-cleaner utility to rewrite the Git history of the repository and remove them.
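A rough sketch of how bfg-repo-cleaner is typically run against a mirror clone (the repository URL and name below are placeholders):

    git clone --mirror https://github.com/user/my-repo.git
    java -jar bfg.jar --strip-blobs-bigger-than 100M my-repo.git
    cd my-repo.git
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive
    git push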
If you want to store a large file on GitHub, you can, but you'll need something called Git Large File Storage (LFS). Install Git LFS on your computer and track the file with it; the repository then holds only a small pointer file, while the actual content is stored separately by the LFS server.
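A minimal sketch of that workflow, assuming the data file is data/corpus.txt (the path is just a placeholder):

    git lfs install                      # one-time setup on your machine
    git lfs track "data/corpus.txt"      # records the rule in .gitattributes
    git add .gitattributes data/corpus.txt
    git commit -m "Track large data file with Git LFS"
    git push origin main                 # the file content goes to LFS storage, not the normal repo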
GitHub does not allow uploading files larger than 25 MB through the web browser; if you try, you will get an error. Nevertheless, you can push larger files (up to the 100 MB per-file limit) into your GitHub repository from the command line, for example with Git Bash. Download and install Git on your PC first.
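The usual command-line sequence looks roughly like this (the URL and file name are placeholders):

    git clone https://github.com/user/my-repo.git
    cd my-repo
    cp /path/to/large-data-file.txt .    # copy the file into the working tree
    git add large-data-file.txt
    git commit -m "Add large data file"
    git push origin main                 # note: single files over 100 MB are still rejected unless you use Git LFS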
Put it in the repo if:
1- you want to keep track of the changes
2- it is actually a part of the project and you want people to receive it when they clone the repo
Don't put it in the repo (use .gitignore to exclude it; a minimal example follows this list) if:
1- it changes often but the changes are not meaningful and you don't want to keep the history
2- it is available online (or you can make it available online) and you put a link or a note in the repo so people know where to find it
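A minimal .gitignore sketch for that second case, assuming the file lives at data/corpus.txt (a placeholder name) and is hosted elsewhere:

    # data/corpus.txt is not versioned here; see the download link in the README
    data/corpus.txt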
Dropbox is good if you don't have lots of people downloading it; Amazon S3 is your best bet for hosting it.
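If you go the S3 route, one possible upload-and-share flow with the AWS CLI (the bucket name is a placeholder, and a public-read ACL is only one of several ways to make the object downloadable):

    aws s3 cp data/corpus.txt s3://my-data-bucket/corpus.txt --acl public-read
    # people can then fetch it with, for example:
    curl -O https://my-data-bucket.s3.amazonaws.com/corpus.txt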
There are good ways to handle this situation. For example, when I am working on a project that analyses data, especially after cleaning and preprocessing steps, it's a shame to share the code but not the data set (within reason, of course, given the size of the data set). Here is what I have found:
git lfs (Git Large File Storage) lets you track, commit, and push binaries, data files, images, etc. to the same remote, and you don't have to pull everything when you clone the repo.
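For example, to clone without downloading the LFS content up front and then fetch only the file you need (the path is a placeholder):

    GIT_LFS_SKIP_SMUDGE=1 git clone https://github.com/user/my-repo.git   # skip LFS downloads during checkout
    cd my-repo
    git lfs pull --include="data/corpus.txt"                              # fetch just this file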
git-annex uses its own commands, so you will be committing the repo and the annexed files separately. It looks great for managing these files on any remote, such as a hard drive, S3, Google Drive, and many more.
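A rough sketch of the git-annex flow with an S3 special remote (the remote name, bucket, and file path are placeholders, and the S3 remote expects AWS credentials in your environment):

    git annex init "my laptop"
    git annex add data/corpus.txt
    git commit -m "Add data file to the annex"
    git annex initremote mys3 type=S3 encryption=none bucket=my-data-bucket
    git annex copy data/corpus.txt --to mys3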
Someone has made a nice comparison of git-annex vs. git lfs here, and this post compares several methods in short form.
They both seem great. git-annex is currently more mature, but git lfs is developed by GitHub, which is what I use, so that is why I am using git lfs.