Is the storage of git tags inefficient?

Question

I was wondering if the storage of git tags is inefficient.

I thought that tags are just "pointers" pointing to a changeset, which should be very efficient and small in terms of disk usage.

However, with my fresh git repository:

Pushing all branches (8) and all changesets (20489) takes ~110 MB in total (as shown in GitLab).
Pushing all tags (1444), which does not add any changesets (still 20489), suddenly brings the total to ~150 MB.

This is weird. I didn't expect such a huge increase just because of "pointers".

Does anyone have a clue or a possible explanation?

Thanks

torek · Accepted Answer

TL;DR

Tags usually are pretty efficient. As ElpieKay concluded in a comment, you must have some objects—probably commits, but any objects will do—that are reachable from tags, but not from branches.

Long

Tags (whether lightweight or annotated; we'll distinguish these in a moment) point to arbitrary Git objects, not changesets. When we say "point to" here, what we really mean is "contain the hash ID of": all Git objects have, as their "true name", a hash ID that serves as the key in the key-value store of all Git objects.

There are four types of objects in this main Git database: commits, trees, blobs, and annotated tag objects. Commits act as snapshots, but themselves contain only a small amount of metadata, including the name and email address of the committer along with a time-stamp, the hash ID(s) of the commit's parent commit(s), the log message, and the hash ID of a stored tree object. The tree object is what ultimately provides the snapshot, via sub-trees and blob objects.
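
To see the four object types for yourself, here is a minimal sketch using a throwaway repository. The temporary directory, the file name file.txt, the tag name v1.0 and the identity are all made up for illustration; only the git commands themselves are real.

```shell
set -e
# Throwaway repository; path, file name and identity are illustrative only.
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "first commit"
git tag -a -m "release 1.0" v1.0

# Ask Git for the type of each kind of object.
commit_type=$(git cat-file -t HEAD)            # the commit itself
tree_type=$(git cat-file -t 'HEAD^{tree}')     # the commit's top-level tree
blob_type=$(git cat-file -t HEAD:file.txt)     # the file's content
tag_type=$(git cat-file -t v1.0)               # the annotated tag object
echo "$commit_type $tree_type $blob_type $tag_type"

# The commit holds only metadata plus the hash ID of its tree.
git cat-file -p HEAD
```

The last command prints the raw commit: a tree line, author and committer lines, and the log message (plus parent lines once there is more than one commit).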

Names, which Git calls references or refs, fall into various name-spaces. The two big ones are branch names like master, which are actually in the refs/heads/* name-space (refs/heads/master), and tag names like v1.2, which are actually in the refs/tags/* name-space (refs/tags/v1.2). The name-spaces keep the names from colliding even if they are spelled identically. Each name contains one hash ID, and the name-to-hash-ID key-value store is the other principal database that makes up a Git repository.
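
As a rough illustration of the name-spaces, the sketch below creates a branch and a tag that share the short name demo (a name picked purely for this example) and shows that they are separate refs holding different hash IDs.

```shell
set -e
# Throwaway repository; names and identity are illustrative only.
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"
git commit -q --allow-empty -m "first"
git commit -q --allow-empty -m "second"

# Same short name, different name-spaces, different targets.
git branch demo HEAD
git tag demo HEAD~1

# List the full names together with the hash ID each one contains.
git for-each-ref --format='%(refname) %(objectname)' refs/heads/demo refs/tags/demo

branch_id=$(git rev-parse refs/heads/demo)
tag_id=$(git rev-parse refs/tags/demo)
```

Using the short name demo on its own would be ambiguous here, which is why the full refs/heads/... and refs/tags/... spellings are used.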

Branch names are constrained to point only to commit objects. Tag names can point directly to a commit object. Such a tag is called a lightweight tag. Or, the tag name might point to an annotated tag object. That object itself points to some other (arbitrary) object, though it's pretty typical for it to point to a commit. A tag name that points to an annotated tag object is an annotated tag.
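
The difference is easy to observe. In this sketch (the tag names light and heavy are invented for the example), the lightweight tag resolves straight to a commit, while the annotated tag resolves to a tag object that must be "peeled" to reach the commit.

```shell
set -e
# Throwaway repository; names and identity are illustrative only.
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"
git commit -q --allow-empty -m "only commit"

git tag light                        # lightweight: the name holds a commit ID
git tag -a -m "annotated" heavy      # annotated: the name holds a tag object ID

light_type=$(git cat-file -t light)
heavy_type=$(git cat-file -t heavy)
echo "light -> $light_type, heavy -> $heavy_type"

# Peel the annotated tag object to the commit it points to.
peeled=$(git rev-parse 'heavy^{commit}')
echo "heavy peels to $peeled"
```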

The object database forms a Directed Acyclic Graph

Commits contain, i.e., point to, other commit hash IDs. No object can be changed once it is made, and the hash ID of any object cannot be predicted.¹ So a new commit only ever points back to existing commits. Each commit is also given a unique hash ID (i.e., no commit ever occurs more than once). This means that the commit graph itself, generally grown one commit at a time, never has any cycles: all the commit arrows "point backwards" to previous commits.

Commits also contain tree hash IDs, and trees contain further tree hash IDs along with blob hash IDs. These, too, are directed and acyclic, though tree hash IDs need not be unique (two different commits can share the same snapshot, for instance).
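
A small sketch of both points, under the assumption that an empty commit (one that changes no files) is an acceptable stand-in for "two commits with the same snapshot": the two commits get different hash IDs, the second points back to the first, and both share one tree.

```shell
set -e
# Throwaway repository; names and identity are illustrative only.
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "first"
git commit -q --allow-empty -m "second, same snapshot"

first_commit=$(git rev-parse HEAD~1)
second_commit=$(git rev-parse HEAD)
first_tree=$(git rev-parse 'HEAD~1^{tree}')
second_tree=$(git rev-parse 'HEAD^{tree}')

# The newer commit records the older one as its parent (the backwards arrow).
recorded_parent=$(git cat-file -p HEAD | awk '/^parent / {print $2}')

echo "commits: $first_commit $second_commit"
echo "trees:   $first_tree $second_tree"
```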

Annotated tag objects can contain the ID of any other object, but like commits, annotated tag objects have unique hash IDs and are only allowed to point to existing objects. So these likewise do not add cycles to the graph.

¹The hash ID is a cryptographic checksum of the contents of the object, including the object's type. Technically, it could be predicted, or deliberate hash collisions could be produced, if you spent enough compute power on the problem. However, Git forbids cycles in other ways as well.
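
To make the "checksum of the contents, including the type" part concrete, here is a sketch that computes a blob's hash ID by hand and compares it with what Git reports. It assumes the default SHA-1 object format and that either sha1sum or shasum is installed; the sha1 helper function is just a convenience defined for this example.

```shell
set -e
# Helper for this example only: print the SHA-1 of stdin as hex.
sha1() {
    if command -v sha1sum >/dev/null 2>&1; then sha1sum; else shasum -a 1; fi |
        cut -d' ' -f1
}

# What Git says the ID of a blob containing "hello\n" would be.
git_id=$(printf 'hello\n' | git hash-object --stdin)

# The same by hand: hash the header "blob <size>", a NUL byte, then the content.
# The content "hello\n" is 6 bytes long.
manual_id=$(printf 'blob 6\0hello\n' | sha1)

echo "git:    $git_id"
echo "manual: $manual_id"
```

Because the type name is part of the hashed header, the same bytes stored as a different object type would get a different hash ID.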

The name database acts as entry points into the DAG, which allows for garbage collection

The result is that if we pick any object within the repository, we can trace, from that object, to all reachable other objects and get a sub-graph. If we use the name database (branch and tag names and all other Git references—there are some that are particularly sneaky, such as blob hash IDs stored in the index) as our entry points into the object database, and color all reachable objects green temporarily, we can then have Git walk the entire object database and discard any objects that are not reachable (and then remove the coloring, which in Git is actually held in memory, not on disk).

The reachable set of objects, however, depends on the names we use! If we omit all the tag names, we may have some objects—typically some chains of commits—that are not reachable otherwise.
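
Here is a minimal sketch of that effect. The branch names main and side and the tag v1.0 are invented for the example; after the side branch is deleted, one commit stays alive only because the tag still names it.

```shell
set -e
# Throwaway repository; names and identity are illustrative only.
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"
git symbolic-ref HEAD refs/heads/main

git commit -q --allow-empty -m "base"
git commit -q --allow-empty -m "on main"

# A side line of history that ends up reachable only from a tag.
git checkout -q -b side HEAD~1
git commit -q --allow-empty -m "release-only fix"
git tag v1.0
git checkout -q main
git branch -q -D side

# Count commits using different sets of names as entry points.
from_branches=$(git rev-list --count --branches)
from_branches_and_tags=$(git rev-list --count --branches --tags)
echo "branches only: $from_branches, branches + tags: $from_branches_and_tags"
```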

Fetching and pushing copies only reachable objects

As a general rule, git fetch and git push (and the initial fetch run by git clone) copy only those objects that are reachable from the names that are being used. The two Git instances involved in the transfer have an initial conversation in which each Git tells the other which hash IDs it has and/or wants after perusing some set of name/ID pairs.² The sending and receiving Git instances walk through the object DAG as needed to figure out which objects are required to make these name/ID pairs complete. The sender then sends the objects;³ the receiving Git adds those objects to its object database, and the transfer is done.
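
The following sketch mimics the situation from the question on a tiny scale. It assumes a local bare repository (remote.git) standing in for the server; all paths, file names and the tag v1.0 are hypothetical. Pushing the branch transfers only what the branch reaches, and pushing the tags afterwards transfers the rest.

```shell
set -e
# Throwaway repositories; paths, names and identity are illustrative only.
cd "$(mktemp -d)"
git init -q --bare remote.git
git init -q local
cd local
git config user.name "Example User"
git config user.email "user@example.com"
git symbolic-ref HEAD refs/heads/main
git remote add origin ../remote.git

echo "base" > file.txt
git add file.txt
git commit -q -m "base"

# A commit made off-branch and kept alive only by a tag.
git checkout -q --detach
echo "release artifact" > release.txt
git add release.txt
git commit -q -m "release-only commit"
git tag v1.0
git checkout -q main

# Push the branch first, then the tags, counting objects on the remote each time.
git push -q origin main
after_branches=$(git -C ../remote.git rev-list --all --objects | wc -l)
git push -q origin --tags
after_tags=$(git -C ../remote.git rev-list --all --objects | wc -l)
echo "objects on remote: $after_branches after branches, $after_tags after tags"
```

The tag push adds a commit, a tree and a blob to the remote even though "only a pointer" was pushed, because the pointer's target was not there yet.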

What this means in your case is that some objects are reachable only from tags, and this makes the push significantly larger. Finding those objects can be a bit tricky: Git has low-level tools for this (git rev-list and git branch --contains, for instance) but nothing cleanly packaged as a user-oriented solution.
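
One possible way to hunt for such tags, sketched with git rev-list and git branch --contains on a toy repository (the names main, side, v1.0 and v2.0 are made up). The loop is just one approach, not an official recipe; it reports every tag that reaches commits no branch reaches, then sums the on-disk size of all tag-only objects.

```shell
set -e
# Throwaway repository; names and identity are illustrative only.
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"
git symbolic-ref HEAD refs/heads/main

git commit -q --allow-empty -m "base"
git commit -q --allow-empty -m "on main"
git tag v2.0                          # reachable from main
git checkout -q -b side HEAD~1
git commit -q --allow-empty -m "release-only fix"
git tag v1.0                          # will be reachable only from the tag
git checkout -q main
git branch -q -D side

# For each tag, count commits reachable from it but from no branch.
tag_only=""
for tag in $(git for-each-ref --format='%(refname:short)' refs/tags); do
    extra=$(git rev-list --count "refs/tags/$tag" --not --branches)
    if [ "$extra" -gt 0 ]; then
        echo "$tag: $extra commit(s) not on any branch"
        tag_only="${tag_only}${tag} "
    fi
done

# Cross-check one tag: no branch contains its commit, so this prints nothing.
git branch --contains v1.0

# Total on-disk bytes of all objects reachable only from tags.
tag_only_bytes=$(git rev-list --objects --tags --not --branches |
    cut -d' ' -f1 |
    git cat-file --batch-check='%(objectsize:disk)' |
    awk '{sum += $1} END {print sum}')
echo "tag-only objects use $tag_only_bytes bytes on disk"
```

On a real repository, running only the loop and the final pipeline should give a rough idea of which tags account for the extra size.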

²The new wire protocol (v2; the old one is v0, which is the same as v1) alters the way the name/ID pairs are listed, because in some repositories the name database has grown to the point where simply listing everything every time, as v0 does, takes too long.

³The sending Git typically uses its knowledge of what's in the receiving Git's object database, as determined by the hash IDs the receiver must have, to build a thin pack in which the sender's objects are delta-compressed against the objects the receiver already has. See the compression aside below.

An aside on compression

Both of these key-value databases are stored in Git in multiple different ways. Objects within the object database can be stored either loose, where they are zlib-deflated but stand-alone, or packed, where they are delta-compressed against other objects. Delta chains act like changesets, but there is a critical difference—well, critical for implementors; users don't have to care about it at all!—here: any object can, at least in theory, be compressed against any other object, even an object of a different type. (In practice Git only compresses objects against same-type objects anyway.) Even with blobs compressed against blobs, there's no requirement that some file be a delta against a previous version of the same file: it could be a delta from a future version of the same file, or the current version of a different file, or whatever.

Pack files are normally self-contained: objects that are delta-compressed inside a pack file must provide the next object in the chain inside that same pack file, all the way down to a base object that is not itself delta-compressed. The thin packs that git fetch and git push build deliberately violate this assumption; the receiver of a thin pack is obligated to "fix" it (git index-pack --fix-thin) or otherwise correct the issue. But all of this is also an internal-only detail.
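
If you want to watch objects move from loose to packed storage, a sketch like the following works on a throwaway repository (file name and identity are invented). It uses git count-objects -v, whose count line reports loose objects and whose in-pack line reports packed ones, before and after git gc.

```shell
set -e
# Throwaway repository; names and identity are illustrative only.
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

# Three commits, each with a new version of the same file.
for n in 1 2 3; do
    echo "version $n" > file.txt
    git add file.txt
    git commit -q -m "commit $n"
done

# Freshly created objects start out loose.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

# gc packs the reachable objects and removes the now-redundant loose copies.
git gc -q
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

echo "loose before gc: $loose_before"
echo "loose after gc:  $loose_after, packed: $packed_after"
```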

Ben · Answer

You have commits that are only reachable by tags. If you look at your history using git log --decorate --oneline --graph --all you can see them.

Look for lines of the history that "end" in a tag:

* 4d60a50b0 (HEAD -> master, origin/master, origin/HEAD) Latest commit
* 123d19df2 More Stuff
* 158f2091b Removed bogus quote.
| * 413d140f4 (tag: 6.4.1_76119) line endings
| * c3fa7ee03 getting the branch to autobuild and make installer
| |  * bda836a25 (tag: 7.0.0) more credits changes 
| |  * 3cab6e792 for autobuilds, launch seed7.0.0 so it gets the branch
| |_/  
|/|   
* | 11b2165f5 formatting
* | 4af66cc59 changed version numbers to 7.0.0


Tags:

git

Crazyjavahacking
