
How are Git branches and tags stored on disk?

I recently checked one of my Git repositories at work, which had more than 10,000 branches and more than 30,000 tags. The total size of the repo after a fresh clone is 12 GB. I am sure there is no reason to have 10,000 branches, so I suspect they occupy a considerable amount of disk space. My questions are as follows:

  1. How are branches and tags stored on disk: what data structure is used, and what information is stored for every branch?
  2. How do I get metadata about the branches, such as when a branch was created and what its size is?
thefourtheye asked Dec 18 '13 19:12



3 Answers

So, I’m going to expand on the topic a bit and explain how Git stores what. Doing so will explain what information is stored, and what exactly matters for the size of the repository. As a fair warning: this answer is rather long :)

Git objects

Git is essentially a database of objects. Those objects come in four different types and are all identified by a SHA1 hash of their contents. The four types are blobs, trees, commits and tags.

Blob

A blob is the simplest type of object. It stores the content of a file. So for each distinct file content you store within your Git repository, a single blob object exists in the object database. As it stores only the file content, and not metadata like file names, this is also the mechanism that prevents files with identical content from being stored multiple times.
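As a rough illustration of why identical content ends up as one blob, the sketch below computes a blob ID the way SHA-1 repositories do: a short header plus the raw bytes, with no file name involved. The function name `blob_id` is made up for this example.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes "blob <size>\0" followed by the raw bytes.
    # The file name is not part of the hashed data.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Two files with the same content map to the same blob, whatever they are called.
readme = blob_id(b"hello world\n")
copy_of_readme = blob_id(b"hello world\n")
print(readme)                    # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(readme == copy_of_readme)  # True
```

The first printed value should match what `git hash-object` reports for a file containing just that line.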

Tree

Going one level up, the tree is the object that puts the blobs into a directory structure. A single tree corresponds to a single directory. It is essentially a list of files and subdirectories, with each entry containing a file mode, a file or directory name, and a reference to the Git object that belongs to the entry. For subdirectories, this reference points to the tree object that describes the subdirectory; for files, this reference points to the blob object storing the file contents.
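To make the "list of entries" idea concrete, here is a minimal sketch of a tree body: each entry is a mode, a name, a NUL byte and the 20 raw bytes of the referenced object's ID. The helper names are hypothetical, and the sorting is simplified (real Git sorts directory names as if they ended in a slash).

```python
import hashlib

def serialize_tree(entries):
    """Serialize (mode, name, object_id_hex) tuples into a tree object's body."""
    body = b""
    # Simplification: plain sort by name; Git's real ordering treats
    # directory names as if they had a trailing "/".
    for mode, name, oid_hex in sorted(entries, key=lambda e: e[1]):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(oid_hex)  # 20 raw bytes for SHA-1
    return body

def parse_tree(body):
    """Inverse of serialize_tree: return a list of (mode, name, object_id_hex)."""
    entries, pos = [], 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        oid_hex = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, oid_hex))
        pos = nul + 21
    return entries

def tree_id(body):
    """Object ID of a tree: SHA-1 over "tree <size>\\0" plus the body."""
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

EMPTY_TREE = tree_id(b"")
entries = [
    ("100644", "README", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"),  # file -> blob
    ("40000", "src", EMPTY_TREE),                                      # subdirectory -> tree
]
body = serialize_tree(entries)
print(EMPTY_TREE)        # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
print(parse_tree(body))
```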

Commit

Blobs and trees are already enough to represent a complete file system. To add the versioning on top of that, we have commit objects. Commit objects are created whenever you commit something in Git. Each commit represents a snapshot in the history of revisions.

It contains a reference to the tree object describing the root directory of the repository. This also means that every commit that actually introduces some changes at least requires a new tree object (likely more).

A commit also contains references to its parent commits. While there is usually just a single parent (for a linear history), a commit can have any number of parents; with more than one it is usually called a merge commit. Most workflows will only ever make you do merges with two parents, but you can really have any other number too.

And finally, a commit also contains the meta data you expect a commit to have: Author and committer (name and time) and of course the commit message.
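The pieces of a commit described above can be sketched as a small text format: header lines for tree, parents, author and committer, then a blank line and the message. This is a simplified sketch with made-up helper names and identities; it ignores extras such as signatures and encoding headers.

```python
def build_commit(tree, parents, author, committer, message):
    """Assemble the text of a commit object from its parts."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]  # zero, one, or several parents
    lines.append("author " + author)
    lines.append("committer " + committer)
    return "\n".join(lines) + "\n\n" + message

def parse_commit(text):
    """Split commit text back into tree, parents, author, committer and message."""
    head, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message}
    for line in head.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    return commit

# Hypothetical identity: name, email, Unix timestamp, timezone offset.
who = "Jane Doe <jane@example.com> 1387394000 +0000"
merge = build_commit(
    tree="4b825dc642cb6eb9a060e54bf8d69288fbee4904",
    parents=["a" * 40, "b" * 40],  # two parents make this a merge commit
    author=who,
    committer=who,
    message="Merge branch 'feature'\n",
)
parsed = parse_commit(merge)
print(len(parsed["parents"]))  # 2
```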

That is all that is necessary to have a full version control system; but of course there is one more object type:

Tag

Tag objects are one way to store tags. To be precise, tag objects store annotated tags, that is, tags that carry some meta information, similar to commits. They are created by git tag -a (or when creating a signed tag) and require a tag message. They also contain a reference to the commit object they are pointing at, and a tagger (name and time).

References

Up until now, we have a full versioning system, with annotated tags, but all our objects are identified by their SHA1 hash. That’s of course a bit annoying to use, so we have some other thing to make it easier: References.

References come in different flavors, but the most important thing about them is this: They are simple text files containing 40 characters—the SHA1 hash of the object they are pointing to. Because they are this simple, they are very cheap, so working with many references is no problem at all. It creates no overhead and there is no reason not to use them.

There are usually three “types” of references: branches, tags and remote branches. They all work the same way and point to commit objects, except for annotated tags, which point to tag objects (lightweight tags are plain commit references too). The difference between them is how you create them, and in which subpath of refs/ they are stored. I won’t cover this now though, as this is explained in nearly every Git tutorial; just remember: references, and therefore branches, are extremely cheap, so don’t hesitate to create them for just about everything.
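To put "cheap" into numbers for the repository in the question, here is a back-of-the-envelope calculation. It assumes each loose reference file holds 40 hex characters plus a newline, reads "12 Gigs" as 12 GiB, and counts only file contents (not filesystem block overhead or the tag objects behind annotated tags).

```python
HASH_HEX_LEN = 40  # a SHA-1 written as hex
NEWLINE_LEN = 1    # assumed trailing newline in each ref file

def loose_ref_bytes(ref_count):
    """Bytes of content needed to store ref_count loose references."""
    return ref_count * (HASH_HEX_LEN + NEWLINE_LEN)

branches, tags = 10_000, 30_000
ref_bytes = loose_ref_bytes(branches + tags)
repo_bytes = 12 * 1024**3  # assumption: 12 GiB

print(ref_bytes)                                  # 1640000
print(round(ref_bytes / repo_bytes * 100, 4))     # share of the repository, in percent
```

Under these assumptions the references amount to about 1.6 MB of content, a tiny fraction of a percent of the clone.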

Compression

Now because torek mentioned something about Git’s compression in his answer, I want to clarify this a bit. Unfortunately he mixed a few things up.

So, usually for new repositories, all Git objects are stored in .git/objects as files identified by their SHA1 hash. The first two characters are stripped from the filename and are used to partition the files into multiple folders, just so it gets a bit easier to navigate.
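The directory layout described above can be sketched as follows. The functions are hypothetical helpers that mimic how a loose object is stored: the header plus content is zlib-compressed and written to a path built from the first two hash characters (directory) and the remaining 38 (file name). The demo writes into a temporary directory rather than a real repository.

```python
import hashlib
import os
import tempfile
import zlib

def loose_object_path(git_dir, oid):
    """Path of a loose object: objects/<first 2 chars>/<remaining 38 chars>."""
    return os.path.join(git_dir, "objects", oid[:2], oid[2:])

def write_loose_object(git_dir, obj_type, content):
    """Store content as a loose object and return its ID."""
    store = obj_type.encode("ascii") + b" " + str(len(content)).encode("ascii") + b"\0" + content
    oid = hashlib.sha1(store).hexdigest()
    path = loose_object_path(git_dir, oid)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))  # loose objects are zlib-deflated
    return oid

def read_loose_object(git_dir, oid):
    """Return (type, content) for a loose object."""
    with open(loose_object_path(git_dir, oid), "rb") as f:
        store = zlib.decompress(f.read())
    header, _, content = store.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), content

with tempfile.TemporaryDirectory() as git_dir:  # stand-in for a .git directory
    oid = write_loose_object(git_dir, "blob", b"hello world\n")
    print(oid[:2], oid[2:])                 # directory name, file name
    print(read_loose_object(git_dir, oid))  # ('blob', b'hello world\n')
```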

At some point, when the history gets bigger or when it is triggered by something else, Git will start to compress objects. It does this by packing multiple objects into a single pack file. How this exactly works is not really that important; it reduces the number of individual object files and stores the objects efficiently in single, indexed archives (at this point, Git will use delta compression btw.). The pack files are then stored in .git/objects/pack and can easily get a few hundred MiB in size.

For references, the situation is somewhat similar, although a lot simpler. All current references are stored in .git/refs, e.g. branches in .git/refs/heads, tags in .git/refs/tags and remote branches in .git/refs/remotes/<remote>. As mentioned above, they are simple text files containing only the 40 character identifier of the object they are pointing at.

At some point, Git will move older references—of any type—into a single lookup file: .git/packed-refs. That file is just a long list of hashes and reference names, one entry per line. References that are kept in there are removed from the refs directory.
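A minimal parser sketch for such a file is shown below. It assumes the usual layout: an optional header line starting with "#", then one "hash refname" pair per line, and for annotated tags an extra line starting with "^" that gives the commit the tag ultimately points at. The function name and the placeholder hashes are made up for the example.

```python
def parse_packed_refs(text):
    """Map each ref name to its object ID and, if present, its peeled ID."""
    refs = {}
    last_name = None
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue                      # header or blank line
        if line.startswith("^"):
            if last_name is not None:     # peeled value of the previous (tag) ref
                refs[last_name]["peeled"] = line[1:]
            continue
        oid, _, name = line.partition(" ")
        refs[name] = {"oid": oid, "peeled": None}
        last_name = name
    return refs

sample = "\n".join([
    "# pack-refs with: peeled fully-peeled sorted",
    "1" * 40 + " refs/heads/master",
    "2" * 40 + " refs/remotes/origin/master",
    "3" * 40 + " refs/tags/v1.0",
    "^" + "4" * 40,
]) + "\n"

refs = parse_packed_refs(sample)
print(len(refs))                         # 3
print(refs["refs/tags/v1.0"]["peeled"])  # forty 4s
```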

Reflogs

Torek mentioned those as well. Reflogs are essentially just logs for references: they keep track of what happens to references. If you do anything that affects a reference (commit, checkout, reset, etc.) then a new log entry is added simply to log what happened. It also provides a way to go back after you did something wrong. A common use case for example is to access the reflog after accidentally resetting a branch to somewhere it wasn’t supposed to go. You can then use git reflog to look at the log and see where the reference was pointing before. As loose Git objects are not immediately deleted (and objects that are part of the history are never deleted), you can usually restore the previous situation easily.

Reflogs are however local: They only keep track of what happens to your local repository. They are not shared with remotes, and are never transferred. A freshly cloned repository will have a reflog with a single entry, namely the clone action. Entries also expire after a certain age (90 days by default for reachable entries), after which they are pruned, so they won’t become a storage problem.
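For a feel of what one log entry holds, here is a small parsing sketch. It assumes the usual line layout: old hash, new hash, identity, timestamp and timezone, then a tab and the message. The function name and the sample identity and URL are invented.

```python
def parse_reflog_line(line):
    """Split one reflog line into its fields."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, rest = meta.split(" ", 2)
    identity, timestamp, tz = rest.rsplit(" ", 2)  # identity may contain spaces
    return {
        "old": old,        # where the reference pointed before
        "new": new,        # where it points afterwards
        "who": identity,
        "time": int(timestamp),
        "tz": tz,
        "message": message,
    }

# What the single entry of a fresh clone might look like (hypothetical values).
line = ("0" * 40 + " " + "a" * 40 +
        " Jane Doe <jane@example.com> 1387394000 +0000"
        "\tclone: from https://example.com/repo.git\n")

entry = parse_reflog_line(line)
print(entry["old"] == "0" * 40)  # True: the reference did not exist before
print(entry["message"])          # clone: from https://example.com/repo.git
```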

Some final words

So, getting back to your actual question. When you clone a repository, Git will usually already receive the repository in a packed format. This is done to save transfer time. References are very cheap, so they are never the cause of big repositories. However, because of Git’s nature, a single current commit object is the tip of a whole directed acyclic graph that eventually reaches the very first commit, the very first tree, and the very first blob. So a repository will always contain all the information for all revisions. That is what makes repositories with a long history big. Unfortunately, there is not really much you can do about it. Well, you could cut off older history at some point, but that will leave you with an incomplete (shallow) repository (you do this by cloning with the --depth parameter).

And as for your second question: as I explained above, branches are just references to commits, and references are only pointers to Git objects. So no, there is not really any metadata about branches you can get from them. The only thing that might give you an idea is the first commit you made when branching off in your history. But having branches does not automatically mean that there is actually a branch kept in the history (fast-forward merging and rebasing work against it), and just because there is some branching-off in the history does not mean that the branch (the reference, the pointer) still exists.

poke answered Oct 19 '22 09:10


All Git references (branches, tags, notes, stashes, etc.) use the same system, which has two parts:

  • the references themselves, and
  • "reflogs"

Reflogs are stored in .git/logs/refs/ based on the reference-name, with one exception: reflogs for HEAD are stored in .git/logs/HEAD rather than .git/logs/refs/HEAD.

References come either "loose" or "packed". Packed refs are in .git/packed-refs, which is a flat file of (SHA-1, refname) pairs for simple refs, plus extra information for annotated tags. "Loose" refs are in .git/refs/name. These files contain either a raw SHA-1 (probably the most common), or the literal string ref: followed by the name of another reference for symbolic refs (usually only for HEAD but you can make others). Symbolic refs are not packed (or at least, I can't seem to make that happen :-) ).

Packing tags and "idle" branch heads (those that are not being updated actively) saves space and time. You can use git pack-refs to do this. However, git gc invokes git pack-refs for you, so generally you don't need to do this yourself.
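The loose/packed/symbolic distinction can be sketched as a lookup routine: try the loose file first, follow a "ref:" line if there is one, and fall back to packed-refs otherwise. This is a simplified sketch with made-up function names, exercised on a throwaway directory instead of a real .git.

```python
import os
import tempfile

def lookup_packed(git_dir, name):
    """Find name in packed-refs, or return None."""
    path = os.path.join(git_dir, "packed-refs")
    if not os.path.isfile(path):
        return None
    with open(path) as f:
        for line in f:
            if line.startswith(("#", "^")):
                continue  # header line or peeled tag value
            oid, _, refname = line.strip().partition(" ")
            if refname == name:
                return oid
    return None

def resolve_ref(git_dir, name, max_depth=5):
    """Resolve a ref name to an object ID; loose refs win over packed ones."""
    for _ in range(max_depth):
        path = os.path.join(git_dir, *name.split("/"))
        if not os.path.isfile(path):
            return lookup_packed(git_dir, name)
        with open(path) as f:
            value = f.read().strip()
        if not value.startswith("ref:"):
            return value                # raw hash
        name = value[4:].strip()        # symbolic ref: follow it
    raise ValueError("too many levels of symbolic refs")

def write_file(git_dir, rel, text):
    path = os.path.join(git_dir, *rel.split("/"))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(text)

with tempfile.TemporaryDirectory() as git_dir:
    write_file(git_dir, "HEAD", "ref: refs/heads/main\n")
    write_file(git_dir, "refs/heads/main", "a" * 40 + "\n")
    write_file(git_dir, "packed-refs", "b" * 40 + " refs/tags/v1.0\n")
    head = resolve_ref(git_dir, "HEAD")           # symbolic -> loose
    tag = resolve_ref(git_dir, "refs/tags/v1.0")  # packed only
    print(head == "a" * 40, tag == "b" * 40)      # True True
```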

torek answered Oct 19 '22 09:10


You have:

  • packed-refs,
  • reftable. (see the last section of this answer)

Regarding packed-refs, creating the file should be much faster with Git 2.2+ (November 2014).

See commit 9540ce5 by Jeff King (peff):

refs: write packed_refs file using stdio

We write each line of a new packed-refs file individually using a write() syscall (and sometimes 2, if the ref is peeled). Since each line is only about 50-100 bytes long, this creates a lot of system call overhead.

We can instead open a stdio handle around our descriptor and use fprintf to write to it. The extra buffering is not a problem for us, because nobody will read our new packed-refs file until we call commit_lock_file (by which point we have flushed everything).

On a pathological repository with 8.5 million refs, this dropped the time to run git pack-refs from 20s to 6s.
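The effect described in that commit message can be imitated in Python. This is only an analogy, not Git's C code: a hypothetical `CountingSink` stands in for the file descriptor and counts how often its write() is called, once with one call per line and once through a buffer.

```python
import io

class CountingSink(io.RawIOBase):
    """Stands in for a file descriptor and counts calls to write()."""
    def __init__(self):
        super().__init__()
        self.calls = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1
        self.data += bytes(b)
        return len(b)

def write_refs_unbuffered(sink, lines):
    for line in lines:
        sink.write(line)  # one low-level write per ref line

def write_refs_buffered(sink, lines):
    buffered = io.BufferedWriter(sink, buffer_size=64 * 1024)
    for line in lines:
        buffered.write(line)  # collected in memory first
    buffered.flush()
    buffered.detach()         # leave the sink open for inspection

# Made-up ref lines of roughly the size mentioned in the commit message.
lines = [b"%040x refs/heads/branch-%d\n" % (i, i) for i in range(1000)]

plain, wrapped = CountingSink(), CountingSink()
write_refs_unbuffered(plain, lines)
write_refs_buffered(wrapped, lines)
print(plain.calls, wrapped.calls)   # many calls versus far fewer
print(plain.data == wrapped.data)   # True: same bytes either way
```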


Update Sept 2016: Git 2.11+ will include chained tags in pack-refs ("chained tags and git clone --single-branch --branch tag").

The same Git 2.11 also makes fuller use of pack bitmaps.

See commit 645c432, commit 702d1b9 (10 Sep 2016) by Kirill Smelkov (navytux).
Helped-by: Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 7f109ef, 21 Sep 2016)

pack-objects: use reachability bitmap index when generating non-stdout pack

Pack bitmaps were introduced in Git 2.0 (commit 6b8fda2, Dec. 2013), from Google's work for JGit.

We use the bitmap API to perform the Counting Objects phase in pack-objects, rather than a traditional walk through the object graph.
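The idea behind replacing a graph walk with bitmaps can be shown on a toy history. This sketch is an illustration only: it covers commits in a tiny hand-written graph and uses plain Python integers as bitsets, whereas Git's real bitmaps cover all object types and are stored in a compressed on-disk format. All names are invented.

```python
def build_bitmaps(parents):
    """For each commit, compute a bitset of every commit reachable from it."""
    index = {name: i for i, name in enumerate(parents)}  # one bit per commit
    bitmaps = {}

    def bitmap_for(commit):
        if commit not in bitmaps:
            bits = 1 << index[commit]
            for parent in parents[commit]:
                bits |= bitmap_for(parent)
            bitmaps[commit] = bits
        return bitmaps[commit]

    for commit in parents:
        bitmap_for(commit)
    return bitmaps

def count_objects(wants, haves, bitmaps):
    """Count commits reachable from wants but not from haves, without a walk."""
    want_bits = 0
    for commit in wants:
        want_bits |= bitmaps[commit]
    have_bits = 0
    for commit in haves:
        have_bits |= bitmaps[commit]
    return bin(want_bits & ~have_bits).count("1")

# Toy history: A <- B, then two branches C and D from B, merged in M.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "M": ["C", "D"]}
bitmaps = build_bitmaps(parents)

print(count_objects(["M"], ["B"], bitmaps))  # 3  (C, D and M)
print(count_objects(["M"], [], bitmaps))     # 5  (everything)
```

Once the bitmaps exist, answering "what does the other side still need" is a couple of bitwise operations instead of a traversal.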

Now (2016):

Starting from 6b8fda2 (pack-objects: use bitmaps when packing objects), if a repository has bitmap index, pack-objects can nicely speedup "Counting objects" graph traversal phase.
That however was done only for case when resultant pack is sent to stdout, not written into a file.

One might want to generate on-disk packfiles for a specialized object transfer.
It would be useful to have some way of overriding this heuristic:
to tell pack-objects that even though it should generate on-disk files, it is still OK to use the reachability bitmaps to do the traversal.


Note: Git 2.12 illustrates that using bitmaps has a side effect on git gc --auto.

See commit 1c409a7, commit bdf56de (28 Dec 2016) by David Turner (csusbdt).
(Merged by Junio C Hamano -- gitster -- in commit cf417e2, 18 Jan 2017)

The bitmap index only works for single packs, so requesting an incremental repack with bitmap indexes makes no sense.

Incremental repacks are incompatible with bitmap indexes


Git 2.14 refines pack-objects

See commit da5a1f8, commit 9df4a60 (09 May 2017) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 137a261, 29 May 2017)

pack-objects: disable pack reuse for object-selection options

If certain options like --honor-pack-keep, --local, or --incremental are used with pack-objects, then we need to feed each potential object to want_object_in_pack() to see if it should be filtered out.
But when the bitmap reuse_packfile optimization is in effect, we do not call that function at all, and in fact skip adding the objects to the to_pack list entirely.

This means we have a bug: for certain requests we will silently ignore those options and include objects in that pack that should not be there.

The problem has been present since the inception of the pack-reuse code in 6b8fda2 (pack-objects: use bitmaps when packing objects, 2013-12-21), but it was unlikely to come up in practice.
These options are generally used for on-disk packing, not transfer packs (which go to stdout), but we've never allowed pack reuse for non-stdout packs (until 645c432, we did not even use bitmaps, which the reuse optimization relies on; after that, we explicitly turned it off when not packing to stdout).


With Git 2.27 (Q2 2020), the tests around non-bitmap packs are refined.

See commit 14d2778 (26 Mar 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 2205461, 22 Apr 2020)

p5310: stop timing non-bitmap pack-to-disk

Signed-off-by: Jeff King

Commit 645c432d61 ("pack-objects: use reachability bitmap index when generating non-stdout pack", 2016-09-10, Git v2.11.0-rc0 -- merge listed in batch #4) added two timing tests for packing to an on-disk file, both with and without bitmaps.

However, the non-bitmap one isn't interesting to have as part of p5310's regression suite. It could be used as a baseline to show off the improvement in the bitmap case, but:

  • the point of the t/perf suite is to find performance regressions, and it won't help with that.
    We don't compare the numbers between two tests (which the perf suite has no idea are even related), and any change in its numbers would have nothing to do with bitmaps.

  • it did show off the improvement in the commit message of 645c432d61, but it wasn't even necessary there.
    The bitmap case already shows an improvement (because before the patch, it behaved the same as the non-bitmap case), and the perf suite is even able to show the difference between the before and after measurements.

On top of that, it's one of the most expensive tests in the suite, clocking in around 60s for linux.git on my machine (as compared to 16s for the bitmapped version). And by default when using "./run", we'd run it three times!

So let's just drop it. It's not useful and is adding minutes to perf runs.


Reftables

With Git 2.28 (Q3 2020) come preliminary clean-ups around the refs API, plus file format specification documentation for the reftable backend.

See commit ee9681d, commit 10f007c, commit 84ee4ca, commit cdb73ca, commit d1eb22d (20 May 2020) by Han-Wen Nienhuys (hanwen).
See commit 35e6c47 (20 May 2020) by Jonathan Nieder (artagnon).
(Merged by Junio C Hamano -- gitster -- in commit eebb51b, 12 Jun 2020)

reftable: file format documentation

Signed-off-by: Jonathan Nieder

Shawn Pearce explains:

Some repositories contain a lot of references (e.g. android at 866k, rails at 31k). The reftable format provides:

  • Near constant time lookup for any single reference, even when the repository is cold and not in process or kernel cache.
  • Near constant time verification if a SHA-1 is referred to by at least one reference (for allow-tip-sha1-in-want).
  • Efficient lookup of an entire namespace, such as refs/tags/.
  • Support atomic push O(size_of_update) operations.
  • Combine reflog storage with ref storage.

This file format spec was originally written in July, 2017 by Shawn Pearce.

Some refinements since then were made by Shawn and by Han-Wen Nienhuys based on experiences implementing and experimenting with the format.

(All of this was in the context of our work at Google and Google is happy to contribute the result to the Git project.)

Imported from JGit's current version (c217d33ff, "Documentation/technical/reftable: improve repo layout", 2020-02-04, JGit v5.7.0.202002241735-m3) of Documentation/technical/reftable.md.

And it is adapted to SHA2:

reftable: define version 2 of the spec to accomodate SHA256

Signed-off-by: Han-Wen Nienhuys

Version appends a hash ID to the file header, making it slightly larger.

This commit also changes "SHA-1" into "object ID" in many places.


With Git 2.35 (Q1 2022), the "reftable" backend for the refs API, without integrating into the refs subsystem, has been added.

See commit d860c86, commit e793168, commit e48d427, commit acb5334, commit 1ae2b8c, commit 3b34f63, commit ffc97f1, commit 46bc0e7, commit 17df8db, commit f14bd71, commit 35425d1, commit e581fd7, commit a322920, commit e303bf2, commit 1214aa8, commit ef8a6c6, commit 8900447, commit 27f7ed2 (07 Oct 2021), and commit 27f3796 (30 Aug 2021) by Han-Wen Nienhuys (hanwen).
(Merged by Junio C Hamano -- gitster -- in commit a4bbd13, 15 Dec 2021)

reftable: a generic binary tree implementation

Signed-off-by: Han-Wen Nienhuys

The reftable format includes support for an (OID => ref) map.
This map can speed up visibility and reachability checks.
In particular, various operations along the fetch/push path within Gerrit have been sped up by using this structure.

VonC answered Oct 19 '22 08:10