 

git clone and pull omitting large files

Tags:

git

Here is the situation: an ad-hoc analytics repository with a directory per individual analysis. Each directory contains one or more scripts together with associated data files, which come in different formats and are sometimes of considerable size. The scripts are generally useless without the data, so we would like to store the data files. On the other hand, it is sometimes useful to look at a script without being forced to download the associated data file(s) (e.g. to see how an analysis was conducted).

We definitely don't want to store the data in a separate repository (runtime issues, keeping scripts associated with their data files, etc.).

What was considered:

  • git submodules - the data would live in a separate repo, kept away from the scripts (not in the same directories, so it would get messy over time)
  • git hooks - these are intended for enforcing constraints or running additional actions on push and, as stated above, everyone should be able to upload any file (besides, we have no access to install server-side hooks)

The idea that comes to mind is that it would be convenient to exclude certain locations, or certain files (e.g. those larger than 50 MB), from being pulled or cloned, simply to avoid transferring unwanted data. Is this possible?

If some files are not touched over subsequent commits, they are not necessary from the perspective of future pushes. Probably (or even certainly) I'm lacking some knowledge about git's underlying mechanisms, so I would be thankful for clarification.

asked Oct 13 '15 by iku

2 Answers

git clone --no-checkout --filter=blob:limit=100m

This fetches only blobs smaller than the given size; the server must support partial clone (the `uploadpack.allowFilter` option), which GitHub and GitLab now do.

Then you have to check out all files except the big ones. A simple strategy that will likely work is to list the small blobs with git rev-list --objects --filter=blob:limit=100m HEAD and pass their paths to git checkout via xargs, but I haven't tested it.
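Putting the two steps together, here is an untested-on-a-real-server sketch of the workflow (the URL and the 100 MB threshold are placeholders; a variant of rev-list with --missing=print is used so that traversal does not fault in the omitted blobs, and xargs -r -d is GNU-specific):

```shell
# Clone without checking out, asking the server to omit blobs over 100 MB.
git clone --no-checkout --filter=blob:limit=100m https://example.com/repo.git repo
cd repo

# List all reachable objects; missing (omitted) ones are printed as "?<oid>".
# Drop the missing ones, keep only blob paths, and check those paths out.
git rev-list --objects --missing=print HEAD \
  | grep -v '^?' \
  | git cat-file --batch-check='%(objecttype) %(rest)' \
  | sed -n 's/^blob //p' \
  | xargs -r -d '\n' git checkout HEAD --
```

Note that a plain `git checkout` in a partial clone would lazily fetch the missing big blobs on demand, which is exactly what this pipeline avoids.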

See this answer for more details: How do I clone a subdirectory only of a Git repository?

git LFS

This is a solution that can already be used on GitHub and GitLab.

You just track your large files in LFS, and then clone without downloading the LFS content (see: How to clone/pull a git repository, ignoring LFS?):

GIT_LFS_SKIP_SMUDGE=1 git clone SERVER-REPOSITORY

and finally manually pull any missing LFS files that you may want: https://github.com/git-lfs/git-lfs/issues/1351

git lfs pull --include "*.dat"
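End to end, the LFS route looks roughly like this (a sketch; `*.dat` and the `analysis-42` directory are example names, and SERVER-REPOSITORY is the placeholder from above):

```shell
# One-time setup in the repository: route data files through LFS.
git lfs install
git lfs track "*.dat"           # records the pattern in .gitattributes
git add .gitattributes
git commit -m "Track data files with LFS"

# Collaborators who only want the scripts clone with the smudge filter
# disabled, so LFS files stay as tiny pointer files:
GIT_LFS_SKIP_SMUDGE=1 git clone SERVER-REPOSITORY

# ...and fetch the real data only for the analysis they need:
git lfs pull --include "analysis-42/*.dat"
```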

Git sparse checkout lets you choose which subdirectories to check out, but as far as I know it cannot filter on anything else (e.g. file size).
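For completeness, a sparse-checkout sketch (the URL and directory names are placeholders). Note that without --filter the clone still downloads every object; sparse checkout only limits what is written to the working tree, so combine it with --filter=blob:limit to also avoid the transfer:

```shell
# Clone without populating the working tree.
git clone --no-checkout https://example.com/repo.git repo
cd repo

# Materialize only the analyses you care about; files in other
# directories are never written to the working tree.
git sparse-checkout set analysis-01 analysis-07
git checkout main
```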

answered Nov 04 '22 by Michael