Using git branches as snapshots of past experiments in academic research

Question

My basic question is how to duplicate a branch, but I've searched Stack Overflow and can't find a method tailored to my purposes, so please don't mark my question as a duplicate; I'm asking for personalized advice here.

I'm working on a research project with a group of 5+ people. We have a common codebase (the master branch). For my purposes, I needed to modify the common codebase slightly for an experiment, so I created a branch and did my work there.

Now, after the experiment is done, I would like to

Keep the branch as it is without merging back (reason: others won't need my newly added code, and I would like to refer back to it in the future)
Create a copy (which is a new branch) of this branch and continue my work there, so that any changes on the new branch do not get reflected on the old branch

What is the cleanest and safest way to do this? Are there any implications that I need to know? Much appreciated!

Stanislav Bashkyrtsev · Accepted Answer

Create a branch and after you finish your work create an annotated tag:

git tag -a experiment123 -m "Result of the experiment: satisfactory"
git push --tags

What's convenient about such tags:

They have an author and a message where you can store an experiment summary
They can't be changed (though you can delete and re-create them)
You get to keep your current branch and can start a new experiment there

So keep working in that branch with your next experiment (no need to create a new one since you saved the previous state into the tag). After you're done with it - create a new annotated tag.

If you need to start working off a different branch, you simply check that branch out, commit there and create a tag from there. If you ever decide to return to the old experiment and try new things with it, you check out the old tag (or create a branch from it), introduce changes and, again, create yet another tag.
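
As a rough sketch of that workflow, the commands below run in a throwaway repository so nothing real is touched. Only the tag name experiment123 comes from the answer above; the branch names experiment and experiment-followup, the file notes.txt, and the identity settings are made up for illustration.

```shell
set -eu
# Throwaway repository; every name except experiment123 is hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/experiment   # name the initial (unborn) branch
git config user.name "Example Researcher"
git config user.email "researcher@example.com"

echo "run 1" > notes.txt
git add notes.txt
git commit -q -m "Finish first experiment"

# Freeze the finished state in an annotated tag.
git tag -a experiment123 -m "Result of the experiment: satisfactory"

# Keep working on the same branch; the tag stays where it is.
echo "run 2" >> notes.txt
git commit -q -am "Start second experiment"

# Later: revisit the old experiment on a new branch created from the tag.
git checkout -q -b experiment-followup experiment123

tagged=$(git rev-parse 'experiment123^{commit}')
followup=$(git rev-parse experiment-followup)
current=$(git rev-parse experiment)
echo "tag points to:       $tagged"
echo "follow-up branch at: $followup"
echo "experiment branch:   $current"
```

The follow-up branch starts at the tagged commit, while the experiment branch has already moved one commit further, so the two lines of work stay independent.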

torek · Answer

As Stanislav Bashkyrtsev suggested, tags are useful here; annotated tags in particular allow you to add those annotations. But it's worth noting something even more basic and fundamental to Git, which is: Git is about commits. Those new to Git often read or use some sort of introductory thing that just has you jump into using branches and files, which leads new users to think that Git is about branches and files—and it's not, or at least, not exactly. Git is all about commits. The commit is your basic unit in Git, so you need to know what a commit is and does for you.

Let's go back to the title of your question:

Using git branches as snapshots of past experiments in academic research

A commit is a snapshot. But more precisely, it's a two-part snapshot. No part of any commit (or of any stored Git object, for that matter) can ever be changed, so getting the two parts right is important, but if you get them wrong, that's not such a big deal, because you can always add more commits to a repository. It can be quite hard to get rid of old bad commits—they keep coming back, spread like viruses by other Git clones—but usually there's no need to destroy them as they tend to be harmless.¹

The two parts of a commit are:

the main data, or source snapshot: this holds, frozen for all time, copies of all of the files that Git knew about at the time you (or whoever) made the commit; and
the metadata, or information about the commit itself: this holds, frozen for all time, information such as who made the commit and when.
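
To see both parts for yourself, something like the following should work in a scratch repository. The file name params.txt, its contents, and the identity settings are assumptions made up for this sketch.

```shell
set -eu
# Scratch repository; file name and identity are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"
git config user.email "researcher@example.com"

echo "alpha = 0.5" > params.txt
git add params.txt
git commit -q -m "Record parameters"

# The raw commit object: a "tree" line naming the snapshot, then the metadata.
git cat-file -p HEAD

# Snapshot part: every file Git knew about at commit time.
files=$(git ls-tree -r --name-only HEAD)

# Metadata part: two of the stored fields, author name and subject line.
author=$(git show -s --format=%an HEAD)
subject=$(git show -s --format=%s HEAD)

echo "files:   $files"
echo "author:  $author"
echo "subject: $subject"
```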

The fundamental way that you, or Git, can find a commit is by its hash ID. Every commit has a unique hash ID: a big long string of letters and numbers, which is really just a hexadecimal encoding of a large cryptographic hash.²

¹The main exception here is a bad commit that stores some huge data file that would otherwise not get in the way of using the repository. These are easy to get rid of as long as you have not let them escape the lab, as it were, so that there are no other copies that can re-infect your database. If they have gotten out, you must start being careful about which Git repositories your Git repository has Git-sex with.

²This is currently an SHA-1 checksum of the contents of the commit preceded by the word commit, a space, an ASCII representation of the size of the object, and an ASCII NUL byte. There is a project underway to move to SHA-256 and make the hash algorithm easier to change again in the future if needed.
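
The recipe in footnote 2 can be checked by hand. This is only a sketch: it assumes the repository uses the SHA-1 object format (the default unless configured otherwise) and that the sha1sum utility from GNU coreutils is installed; the identity settings are placeholders.

```shell
set -eu
# Scratch repository; assumes SHA-1 object format and an available sha1sum.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"
git config user.email "researcher@example.com"
git commit -q --allow-empty -m "Initial commit"

actual=$(git rev-parse HEAD)
size=$(git cat-file -s HEAD)   # size of the commit object in bytes

# Header "commit <size>" plus a NUL byte, followed by the raw commit contents.
computed=$( { printf 'commit %s\0' "$size"; git cat-file commit HEAD; } \
            | sha1sum | cut -d' ' -f1 )

echo "git says:   $actual"
echo "recomputed: $computed"
```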

How this turns into branches, tags, and so on

The key to understanding branch and tag names comes in several parts.

First, as we already know, each commit has a unique hash ID. We can generalize this a bit more: all Git objects have unique hash IDs, including annotated tag objects.
Second, any reference name in Git, such as a branch name or tag name, holds one (1) of these hash IDs. A branch name is constrained to hold only a commit hash ID, while a tag name can hold a commit ID or an annotated-tag-object ID.
Third, various Git objects hold hash IDs. Most importantly, each commit holds a set of earlier commit hash IDs. Most commits have just one of these. An annotated tag object has one hash ID in it as well, usually a commit hash ID.
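
The difference between the kinds of names can be sketched as follows. The tag names light and annotated and the identity settings are hypothetical, chosen only for this example.

```shell
set -eu
# Scratch repository; tag names and identity are hypothetical.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"
git config user.email "researcher@example.com"
git commit -q --allow-empty -m "Initial commit"

git tag light                       # lightweight tag: the name holds the commit ID
git tag -a annotated -m "a note"    # annotated tag: the name holds a tag-object ID

commit_id=$(git rev-parse HEAD)
branch_name=$(git symbolic-ref --short HEAD)
branch_id=$(git rev-parse "refs/heads/$branch_name")

light_type=$(git cat-file -t light)            # commit
annotated_type=$(git cat-file -t annotated)    # tag
annotated_id=$(git rev-parse annotated)        # ID of the tag object itself
annotated_target=$(git rev-parse 'annotated^{commit}')   # commit it leads to

echo "branch $branch_name -> $branch_id"
echo "light     is a $light_type"
echo "annotated is a $annotated_type pointing at $annotated_target"
```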

It's this last part that gives us useful branching. If you have a long chain of commits, each of which stores the hash ID of its immediate predecessor, you end up with a situation that is easily illustrated like this:

... <-F <-G <-H

Here, H stands in for the hash ID of the last commit in the chain. If we know H's hash ID, we can have Git retrieve commit H, which consists of those two parts: snapshot of all files, and metadata that include the hash ID of earlier commit G.

Using the hash ID from the metadata—by which we say that commit H points to commit G—we can have Git find earlier commit G. Of course, that commit also has both data and metadata, including the hash ID of still-earlier commit F. Since G points back to F, we can have Git find commit F, which can find another earlier commit, and so on, all the way back in time to the very first commit. (The chain necessarily stops there, of course: the first commit simply has no previous commit, which is how Git knows that it is the first commit.)
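
Here is a small sketch of that backwards walk, using three empty commits whose messages F, G and H mirror the letters in the drawing. The identity settings are placeholders.

```shell
set -eu
# Scratch repository with a three-commit chain F <- G <- H.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Researcher"
git config user.email "researcher@example.com"
git commit -q --allow-empty -m "F"
git commit -q --allow-empty -m "G"
git commit -q --allow-empty -m "H"

g=$(git rev-parse 'HEAD^')                      # hash ID of G, found via H
parent_of_h=$(git show -s --format=%P HEAD)     # parent ID stored in H's metadata
parent_of_f=$(git show -s --format=%P 'HEAD~2') # empty: F is the first commit
count=$(git rev-list --count HEAD)              # commits reachable from H

# git log performs the same walk, newest commit first.
git log --format='%h %s'
echo "H points to $parent_of_h"
echo "chain length: $count"
```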

A branch name, then, just needs to hold the hash ID of the last commit in the chain—in this case, commit H. We say that the name points to this commit, and can draw that like this:

...--F--G--H   <-- branch

If we make another branch name that also points to commit H:

...--F--G--H   <-- branch, develop

and then pick the name develop as the active branch:

...--F--G--H   <-- branch, develop (HEAD)

and then make a new commit I, new commit I will point back to existing commit H:

...--F--G--H   <-- branch
            \
             I   <-- develop (HEAD)

Git will automatically shove the hash ID of the new commit into the active branch name: the one with the special name HEAD attached to it. If we now go back to the old branch:

...--F--G--H   <-- branch (HEAD)
            \
             I   <-- develop

we get the old files back, and just as important, if we make another new commit J, it causes the name branch to point to the new commit:

             J   <-- branch (HEAD)
            /
...--F--G--H
            \
             I   <-- develop

So this is how branches grow as you work, one commit at a time, to add new commits. Whatever the current branch name is, that name selects the current commit, by its raw hash ID. That's the source of the data—the snapshot—which Git will extract from the commit and turn into ordinary, usable, editable files. You do some work with these files, then arrange for Git to make a new commit from the resulting files; the new commit points back to whatever commit you had out, and now the current branch name points to the new commit.
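
The drawings above can be reproduced in a scratch repository, roughly like this. The names branch and develop and the commit letters H, I and J are taken from the drawings; the identity settings are placeholders, and empty commits stand in for real work.

```shell
set -eu
# Scratch repository reproducing the H / I / J drawings.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/branch   # call the initial branch "branch"
git config user.name "Example Researcher"
git config user.email "researcher@example.com"

git commit -q --allow-empty -m "H"
h=$(git rev-parse branch)

git branch develop            # second name pointing to the same commit H
git checkout -q develop       # make develop the active branch
git commit -q --allow-empty -m "I"
i=$(git rev-parse develop)
branch_after_i=$(git rev-parse branch)   # unchanged: still H

git checkout -q branch        # back to the old name
git commit -q --allow-empty -m "J"
j=$(git rev-parse branch)
develop_after_j=$(git rev-parse develop) # unchanged: still I

echo "H=$h"
echo "I=$i (parent $(git show -s --format=%P develop))"
echo "J=$j (parent $(git show -s --format=%P branch))"
```

Only the active name moves when you commit, which is why making a second name and working there leaves the first one as a stable snapshot.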

A tag is just like a branch name, in that it selects a commit. An annotated tag works by having an extra Git object in between, which lets you store some extra data:

          tag:foo
             |
             v
            ... [extra data]
             |
             v
             J   <-- branch (HEAD)
            /
...--F--G--H
            \
             I   <-- develop

A tag won't move, because if you use git checkout (or in Git 2.23 and later, git switch) to select a commit via a tag, you must allow Git to check it out as a detached HEAD:³

          tag:foo
             |
             v
            ... [extra data]
             |
             v
             J   <-- branch, HEAD
            /
...--F--G--H
            \
             I   <-- develop

In this mode, if you make a new commit K, new commit K is not on any branch at all:

          tag:foo
             |
             v
            ... [extra data]
             |
             v
             J   <-- branch
            / \
...--F--G--H   K   <-- HEAD
            \
             I   <-- develop

Since commits made in this mode have no name (other than HEAD) by which to find them, if you don't jot down their hash IDs somewhere—on paper, or a whiteboard, or whatever—you may never be able to find these commits again once you go back to the normal attached-HEAD mode of operation:

          tag:foo
             |
             v
            ... [extra data]
             |
             v
             J--L   <-- branch (HEAD)
            / \
...--F--G--H   K   ???
            \
             I   <-- develop

If you leave them this way long enough, with no way to find commit K, Git will eventually declare commit K "dead" and sweep it away entirely.⁴ Of course, you can create a new branch name, or a tag name, to let you find K, and that will retain commit K for as long as you (and Git) can find it.

³The old git checkout command just does this automatically, while git switch requires that you add the --detach flag to indicate that you understand you'll be in detached-HEAD mode.

⁴The full details here get complicated, but this is the normal procedure for deleting unwanted commits. If some commits are bad, you make new-and-improved commits that aren't so bad, and arrange the branch names to find the new commits instead of finding the old bad ones. Since commits always point backwards, this sometimes involves re-copying a whole chain of commits to fix one bad one in the chain.
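
A sketch of the detached-HEAD situation and the rescue, again in a scratch repository. The names branch and foo and the commit letters J and K follow the drawings; the branch name rescue-k and the identity settings are made up for illustration.

```shell
set -eu
# Scratch repository; rescue-k is a hypothetical name for the rescued commit.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/branch
git config user.name "Example Researcher"
git config user.email "researcher@example.com"

git commit -q --allow-empty -m "J"
git tag -a foo -m "extra data"
j=$(git rev-parse HEAD)

git checkout -q --detach foo        # detached HEAD at the tagged commit
git commit -q --allow-empty -m "K"  # K is on no branch
k=$(git rev-parse HEAD)             # jot down the hash ID while HEAD still finds it

git checkout -q branch              # back to attached-HEAD mode
before=$(git branch --contains "$k")   # empty: no branch name finds K

git branch rescue-k "$k"            # give K a name so it is kept
rescued=$(git rev-parse rescue-k)

echo "tag foo still points to $(git rev-parse 'foo^{commit}')"
echo "K is now reachable as rescue-k: $rescued"
```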

Summary

Git stores commits.
Git finds commits by their hash IDs.
Commits store the hash IDs of earlier commits.
Branch names store commit hash IDs; the branch is the set of commits found by starting here and working backwards.
Tag names store either raw commit hash IDs, or tag object hash IDs where the tag object then stores the commit hash ID.
Git can turn names into hash IDs for you, so that you can just git checkout any branch or tag name. With git switch, available in Git 2.23 and later, you can git switch to any branch name, or git switch --detach to any tag name.
You can also use git log to find hash IDs, and then check out any historic commit. The log command works by walking the backwards-pointing chains.
