What is a good workflow for git-annex?

Our development team has been using git for version control and git-annex to store large binary files (data binaries, images, test binaries, etc.). Although we have been able to set it up and use it, we have run into our share of trouble.

A common action that we frequently perform that has given us trouble is:

  1. Developer 1 adds some tests for a new feature and adds corresponding data for the tests using git-annex.

    git add <test-file>
    git annex add <data-file>
    git annex copy <data-file> --to=<remote location(we use s3 if that is relevant)>
    git commit -m 'Tests with data'
    git push
    git annex sync
    
  2. The work is reviewed and merged (we use Github for hosting and follow a forking model where all work is done by a developer on their own fork and merged into the main repository through Pull requests)

  3. Developer 2 fetches/merges with upstream and tries to run the tests on his machine.

    git fetch upstream
    git merge upstream/<branch>
    git annex sync
    git annex get
    

We often end up with the test data either not being tracked in git at all, or not downloadable from the remote location.

What is a good way to use git-annex in our workflow?

As an aside, what are other options that might make such a workflow better/easier to manage?

asked Dec 20 '14 by sanchitarora




2 Answers

OK, here we go:

Manual git-annex v6 use:

Server1 & Server2:

mkdir testdata
cd testdata
git init
git annex init "LocationNameIdentifier"    # a human-readable description for this location
git annex upgrade                          # upgrade the annex repository format (v6 here)
git remote add OtherServerLocationNameIdentifier ssh://otherserver.com/thedir    # the same repo on the other server
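
To make the cross-remote setup concrete, here is a sketch with hypothetical host names and paths; each server simply adds the other one as its remote:

    # On Server1 (host names and paths are placeholders):
    git remote add server2 ssh://server2.example.com/home/user/testdata

    # On Server2:
    git remote add server1 ssh://server1.example.com/home/user/testdata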

Once this setup is ready and there are no extra files in the directory, you can run

git annex sync --content

on both locations. If there are already files in both locations, you first need to run

git add --all 

in both locations, so the existing files are tracked as so-called unlocked files.

After

git annex sync --content 

has been run on both locations a few times (say, three), everything is merged. You can now run git annex sync --content from cron on both locations and both will have the same files in their worktrees. If you want to track new files you put in a location, use git add, not git annex add: git annex add stores the files as so-called locked files, which leads to a completely different workflow.
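
As a sketch of that cron idea (the schedule and repository path are assumptions, adjust to your setup), a crontab entry on each location could look like:

    # Sync annexed content every 15 minutes (path and schedule are placeholders).
    */15 * * * *    cd /home/user/testdata && git annex sync --content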

answered Dec 14 '22 by frank-dspeed


This will let you have a git repo "myrepo" with an associated S3 bucket that holds all of the big files you don't really want stored directly in your git repository.

Set up the repo:

# Clone your repo "myrepo"
git clone git@github.com:me/myrepo.git
cd myrepo

# Initialize it to work with git-annex.  
# This creates .git/annex directory in the repo, 
# and a `git-annex` metadata branch the tools use behind the scenes.
git annex init                  

# The first time you use the repo with git-annex someone must link it to S3.
# Be sure to have AWS_* env vars set.
# Select a name that is fitting to be a top-level bucket name.
# This creates the bucket s3://myrepo-annexfiles-SOME_UUID.
git annex initremote myrepo-annexfiles type=S3 encryption=none   # encryption= must be specified; none keeps the demo simple

# Save the repo updates related to attaching your git annex remote.
# Warning: this does a commit and push to origin of this branch plus git-annex.
# It will ALSO grab other things so make sure you have committed
# or stashed those to keep them out of the commit.
git annex sync    
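
As a sketch of the "AWS_* env vars" step mentioned above (the values shown are placeholders), git-annex's S3 support reads the standard AWS credential variables:

    # Placeholder credentials for the S3 special remote.
    export AWS_ACCESS_KEY_ID="AKIA..."
    export AWS_SECRET_ACCESS_KEY="..."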

Add some files to the annex:

# These examples are small for demo.
mkdir mybigfiles
cd mybigfiles
echo 123 > file1
echo 456 > file2

# This is the alternative to `git add`
# It replaces the files with symlinks into .git/annex/.../SOME_SHA256.
# It also does `git add` on the symlinks, but not the targets.
git annex add file*             

# Look at the symlinks with wonder.
ls -l mybigfiles/file*    

# This puts the content into S3, keyed by SHA256, in the bucket attached to your "special remote":
git annex move file* --to myrepo-annexfiles 

# Again, this will do a lot of committing and pushing so be prepared.
git annex sync                  

With git-annex the git repo itself just contains symlinks (dead until the content is fetched) whose targets encode a SHA256 key for the real file content, and the tooling brings the big files down on demand.
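
To see what is actually stored in git (a quick sketch; the exact key format depends on the backend), inspect one of the symlinks:

    # The target points into .git/annex/objects/... and embeds the content key.
    readlink mybigfiles/file1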

Later, when someone else clones the repo and wants the files:

git clone myrepo
cd myrepo

# Enable access to the S3 annex files.
# NOTE: This will put out a warning about ssh because the origin above is ssh.
# This is ONLY telling you that it can't push the big annex files there.
# In this example we are using git-annex specifically to ensure that.
# It is good that it has configured your origin to NOT participate here.
git annex enableremote myrepo-annexfiles

# Get all of the file content from S3:
git annex get mybigfiles/*

When done with the files, get your disk space back:

git annex drop mybigfiles/*

Check to see where everything really lives, and what is really downloaded where:

git annex whereis mybigfiles/file*

Note that git-annex is a super flexible tool. I found that distilling down a simpler recipe for the common case required a bit of study of the docs. Hope this helps others.
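
Applied to the test-data workflow from the question, a day-to-day sequence built from the commands above might look like this (file, branch, and remote names are placeholders):

    # Developer 1: add the test and its data, push the data to S3, then sync.
    git add tests/test_feature.py
    git annex add tests/data/feature_fixture.bin
    git annex copy tests/data/feature_fixture.bin --to myrepo-annexfiles
    git commit -m 'Tests with data'
    git push
    git annex sync

    # Developer 2: after merging upstream, sync the annex metadata, then fetch the content.
    git fetch upstream && git merge upstream/feature-branch
    git annex sync
    git annex get tests/data/feature_fixture.bin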

answered Dec 14 '22 by Scott Smith