
What does Git do when we run git gc and git prune?

What's going on in the background when running:

  • git gc
  • git prune

Output of git gc :

Counting objects: 945490, done. 
Delta compression using up to 4 threads.   
Compressing objects: 100% (334718/334718), done. 
Writing objects: 100%   (945490/945490), done. 
Total 945490 (delta 483105), reused 944529 (delta 482309) 
Checking connectivity: 948048, done.

Output of git prune :

Checking connectivity: 945490, done.

What is the difference between these two commands?

Thank you

asked May 02 '18 13:05 by L Y E S - C H I O U K H


2 Answers

TL;DR

git prune only removes loose, unreachable, stale objects (objects must have all three properties to get pruned). Unreachable packed objects remain in their pack files. Reachable loose objects remain reachable and loose. Objects that are unreachable, but are not yet stale, also remain untouched. The definition of stale is a little tricky (see details below).

git gc does more: it packs references, packs useful objects, expires reflog entries, prunes loose objects, prunes removed worktrees, and prunes / gc's old git rerere data.

Long

I'm not sure what you mean by "in the background" above. Background has a technical meaning in shells, and all of the activity here takes place in the shell's foreground, but I suspect you did not mean the term in that sense.

What git gc does is to orchestrate a whole series of collection activities, including but not limited to git prune. The list below is the set of commands run by a foreground gc without --auto (omitting their arguments, which depend to some extent on git gc arguments):

  • git pack-refs: compact references (turn .git/refs/heads/... and .git/refs/tags/... entries into entries in .git/packed-refs, eliminating the individual files)
  • git reflog expire: expire old reflog entries
  • git repack: pack loose objects into packed object format
  • git prune: remove unwanted loose objects
  • git worktree prune: remove worktree data for added worktrees that the user has deleted
  • git rerere gc: remove old rerere records

There are a few more individual file activities git gc does on its own, but the above is the main sequence. Note that git prune happens after (1) expiring reflogs and (2) running git repack. The order matters: removing an expired reflog entry may leave an object unreferenced, in which case git repack does not pack it and git prune can then remove it, so that it is completely gone.
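To make the first step in that list concrete, here is a minimal sketch of what packing references amounts to. It assumes a throwaway directory laid out like a .git directory; the function names pack_refs and demo_pack_refs are made up for this illustration, and real Git does more (it records peeled tag targets and preserves existing packed-refs entries).

```python
import os
import tempfile


def pack_refs(git_dir):
    """Fold every loose ref file under refs/ into one packed-refs file.

    Simplified: real Git also records peeled tag targets and keeps
    entries from any existing packed-refs file; this sketch does neither.
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(os.path.join(git_dir, "refs")):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            with open(path) as f:
                hash_id = f.read().strip()
            # Ref names use forward slashes, e.g. refs/heads/master.
            refname = os.path.relpath(path, git_dir).replace(os.sep, "/")
            entries.append((refname, hash_id))
            os.remove(path)  # the individual file is no longer needed
    entries.sort()
    with open(os.path.join(git_dir, "packed-refs"), "w") as f:
        for refname, hash_id in entries:
            f.write(f"{hash_id} {refname}\n")
    return entries


def demo_pack_refs():
    """Build a throwaway directory with two loose refs and pack them."""
    loose = {"refs/heads/master": "a" * 40, "refs/tags/v1.0": "b" * 40}
    with tempfile.TemporaryDirectory() as git_dir:
        for refname, hash_id in loose.items():
            path = os.path.join(git_dir, *refname.split("/"))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as f:
                f.write(hash_id + "\n")
        pack_refs(git_dir)
        with open(os.path.join(git_dir, "packed-refs")) as f:
            return f.read()


print(demo_pack_refs())
```

The point is only that many small files become lines in one file; do not run anything like this against a real repository.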

Stuff to know before we look at repack and prune

Before going into any more detail, it's a good idea to define what an object is, in Git, and what it means for an object to be loose or packed. We also need to understand what it means for an object to be reachable.

Every object has a hash ID (one of those big ugly IDs you have seen in git log, for instance) that is that object's name, for retrieval purposes. Git stores all the objects in a key-value database where the name is the key, and the object itself is the value. Git's objects are therefore how Git stores files and commits, and in fact, there are four object types:

  • A commit object holds an actual commit.
  • A tree object holds sets of pairs [1]: a human-readable name like README or subdir, along with another object's hash ID. That other object is a blob object if the name in the tree is a file name, or it is another tree object if the name is that of a subdirectory.
  • A blob object holds the actual file contents (but note that the name of the file is in the tree linking to the blob!).
  • An annotated tag object is used for annotated tags, which are not especially interesting here.

Once made, no object can ever be changed. This is because the object's name (its hash ID) is computed by looking at every single bit of the object's content. Change any one bit from a zero to a one or vice versa and the hash ID changes: you now have a different object, with a different name. This is how Git checks that no file has ever been messed with: if the file contents were changed, the hash ID of the object would change. The object ID is stored in the tree entry, and if the tree object were changed, the tree's ID would change. The tree's ID is stored in the commit, and if the tree ID were changed, the commit's hash would change. So if you know that the commit's hash is a234b67... and the commit's content still hashes to a234b67..., nothing changed in the commit, and the tree ID is still valid. If the tree still hashes to its own name, its content is still valid, so the blob ID is correct; so as long as the blob content hashes to its own name, the blob is correct as well.
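As a small illustration of "the name is computed from the content", the sketch below reproduces a blob's hash ID by hand. It assumes a repository using SHA-1 (the traditional default; SHA-256 repositories exist as well), and the function name git_object_id is just a label chosen for this example.

```python
import hashlib


def git_object_id(obj_type, content):
    """Return the SHA-1 name Git gives an object of this type and content."""
    # Git hashes a short header ("<type> <size>" plus a NUL byte)
    # followed by the raw content.
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = b"hello world\n"
flipped = bytes([original[0] ^ 0x01]) + original[1:]  # flip a single bit

print(git_object_id("blob", original))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(git_object_id("blob", flipped))   # a completely different ID
```

The first value should match what `echo 'hello world' | git hash-object --stdin` prints in a SHA-1 repository.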

Objects can be loose, which means they are stored as files. The name of the file is just the hash ID. [2] The contents of the loose object are zlib-deflated.

Or, objects can be packed, which means many objects are stored in a single pack-file. In this case the contents are not just deflated, they're first delta-compressed. Git picks out a base object (often the latest version of some blob, i.e., file) and then finds additional objects that can be represented as a series of commands: take the base file, remove some text at this offset, add other text at another offset, and so on. The actual format of pack files is documented in Git's own technical documentation, if a bit lightly.

Note that unlike most version control systems, the delta-compression occurs at a level below the stored-object abstraction: Git stores whole snapshots, then does delta-compression later, on the underlying objects. Git still accesses an object by its hash-ID name; it's just that reading that object involves reading the pack file, finding the object and its underlying delta bases, and reconstructing the complete object on the fly.
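Here is a rough sketch of the loose half of that description: one zlib-deflated file per object, named by its hash ID. It assumes SHA-1 and uses the two-character directory split described in the footnote about .git/objects; the function names are invented for the example, and it writes into a temporary directory rather than a real repository.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir, obj_type, content):
    """Store one object as a zlib-deflated file named after its hash ID."""
    data = f"{obj_type} {len(content)}".encode() + b"\0" + content
    hash_id = hashlib.sha1(data).hexdigest()
    # First two hex digits name the subdirectory, the rest name the file.
    path = os.path.join(objects_dir, hash_id[:2], hash_id[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(data))
    return hash_id


def read_loose_object(objects_dir, hash_id):
    """Inflate a loose object file and split it into type and content."""
    path = os.path.join(objects_dir, hash_id[:2], hash_id[2:])
    with open(path, "rb") as f:
        data = zlib.decompress(f.read())
    header, _, content = data.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


def roundtrip(content):
    """Write a blob into a temporary objects directory and read it back."""
    with tempfile.TemporaryDirectory() as objects_dir:
        hash_id = write_loose_object(objects_dir, "blob", content)
        obj_type, restored = read_loose_object(objects_dir, hash_id)
        return hash_id, obj_type, restored


print(roundtrip(b"hello world\n"))
```

Packed objects are deliberately left out: the pack format, with its deltas and index files, is considerably more involved than this.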

There's a general rule about pack files that states that any delta-compressed object within a pack file must have all its bases in the same pack file. This means that a pack file is self-contained: there's never a need to open multiple additional pack files to get an object out of a pack that has the object. (This particular rule can be deliberately violated, producing what Git calls a thin pack, but those are intended to be used only to send objects over a network connection to another Git that already has the base objects. The other Git must "fix" or "fatten" the thin pack to make a normal pack file, before leaving it behind for the rest of Git.)

Object reachability is a little bit tricky. Let's look first at commit reachability.

Note that when we have a commit object, that commit object itself contains several hash IDs. It has one hash ID for the tree that holds the snapshot that goes with that commit. It also has one or more hash IDs for parent commits, unless this particular commit is a root commit. A root commit is defined as a commit with no parents, so this is a bit circular: a commit has parents, unless it has no parents. It's clear enough though: given some commit, we can draw that commit as a node in a graph, with arrows coming out of the node, one per parent:

<--o
   |
   v

These parent arrows point to the commit's parent or parents. Given a series of single-parent commits we get a simple linear chain:

... <--o  <--o  <--o ...

One of these commits must be the start of the chain: that's the root commit. One of these must be the end, and that's the tip commit. All of the internal arrows point backwards (leftwards) so we can draw this without the arrow-heads, knowing that the root is at the left and the tip is at the right:

o--o--o--o--o

Now we can add a branch name like master. The name simply points to the tip commit:

o--o--o--o--o   <--master

None of the arrows embedded within a commit can ever change, because nothing in any object can ever change. The arrow in the branch name master, however, is actually just the hash ID of some commit, and this can change. Let's use letters to represent the commit hashes:

A--B--C--D--E   <-- master

The name master now just stores the commit hash of commit E. If we add a new commit to master, we do this by writing out a commit whose parent is E and whose tree is our snapshot, giving us an all-new hash, which we can call F. Commit F points back to E. We have Git write F's hash ID into master and now we have:

A--B--C--D--E--F   <-- master

We added one commit and changed one name, master. All the previous commits are reachable by starting at the name master. We read out the hash ID of F and read commit F. This has the hash ID of E, so we have reached commit E. We read E to get the hash ID of D, and thus reach D. We repeat until we read A, find that it has no parent, and are done.

If there are branches, that just means that we have commits found by another name whose parents are one of the commits also found by the name master:

A--B--C--D--E--F   <-- master
             \
              G--H   <-- develop

The name develop locates commit H; H finds G; and G refers back to E. So all of these commits are reachable.

Commits with more than one parent—i.e., merge commits—make all their parents reachable if the commit itself is reachable. So once you make a merge commit, you can (but do not have to) delete the branch name that identifies the commit that was merged-in: it's now reachable from the tip of the branch that you were on when you did the merge operation. That is:

...--o--o---o   <-- name
      \    /
       o--o   <-- delete-able

The commits on the bottom row here are reachable from name, through the merge, just as the commits on the top row were always reachable from name. Deleting the name delete-able leaves them still reachable. If the merge commit is not there, as in this case:

...--o--o   <-- name2
      \
       o--o   <-- not-delete-able

then deleting not-delete-able effectively abandons the two commits along the bottom row: they become unreachable, and hence eligible for garbage-collection.
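The two pictures above can be checked with a few lines of code. This is a toy model, not how Git stores anything: commits are single letters, parents is a hand-written dictionary, and the commit labels (A, B, S1, and so on) are invented to mirror the drawings.

```python
def reachable_commits(parents, refs):
    """Walk parent links from every ref and return the commits visited."""
    seen = set()
    stack = list(refs.values())
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return seen


def unreachable_after_deleting(parents, refs, name):
    """Return the commits that no remaining ref can reach once name is gone."""
    remaining = {ref: tip for ref, tip in refs.items() if ref != name}
    return set(parents) - reachable_commits(parents, remaining)


# Hypothetical graph: M merges the side branch S1-S2; X-Y was never merged.
parents = {
    "A": [], "B": ["A"], "C": ["B"],
    "S1": ["B"], "S2": ["S1"], "M": ["C", "S2"],
    "X": ["B"], "Y": ["X"],
}
refs = {"name": "M", "delete-able": "S2", "not-delete-able": "Y"}

print(unreachable_after_deleting(parents, refs, "delete-able"))      # set()
print(sorted(unreachable_after_deleting(parents, refs, "not-delete-able")))  # ['X', 'Y']
```

Deleting the merged name loses nothing, while deleting the unmerged one abandons X and Y.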

This same reachability property applies to tree and blob objects. Commit G has a tree in it, for instance, and this tree has <name, ID> pairs:

A--B--C--D--E--F   <-- master
             \
              G--H   <-- develop
              |
         tree=d097...
            /   \
 README=9fa3... Makefile=0b41...

So from commit G, tree object d097... is reachable; from that tree, blob object 9fa3... is reachable, and so is blob object 0b41.... Commit H might have the very same README object, under the same name (though a different tree): that's fine, that just makes 9fa3 doubly reachable, which is not interesting to Git: Git only cares that it is reachable at all.
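The same walk extends to trees and blobs if each object simply lists the hash IDs it refers to. In this sketch the abbreviated IDs d097, 9fa3 and 0b41 come from the drawing; 77aa, c5c5 and dead are made up, and commit E's own tree and parents are left out to keep the table short.

```python
def reachable_objects(objects, roots):
    """Return every object reachable from the given starting hash IDs."""
    seen = set()
    stack = list(roots)
    while stack:
        hash_id = stack.pop()
        if hash_id in seen:
            continue  # reached before by another path; once is enough
        seen.add(hash_id)
        _obj_type, references = objects[hash_id]
        stack.extend(references)
    return seen


# hash ID -> (object type, hash IDs this object refers to)
objects = {
    "H": ("commit", ["77aa", "G"]),       # tree first, then parent
    "G": ("commit", ["d097", "E"]),
    "E": ("commit", []),                  # earlier history omitted
    "d097": ("tree", ["9fa3", "0b41"]),   # README and Makefile
    "77aa": ("tree", ["9fa3", "c5c5"]),   # same README, different Makefile
    "9fa3": ("blob", []),
    "0b41": ("blob", []),
    "c5c5": ("blob", []),
    "dead": ("blob", []),                 # nothing refers to this one
}

print(sorted(set(objects) - reachable_objects(objects, ["H"])))  # ['dead']
```

Blob 9fa3 is found twice during the walk, but the result only records that it was found.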

External references (branch and tag names, and other references found in Git repositories, including entries in Git's index and any references via linked added work-trees) provide the entry points into the object graph. From these entry points, any object is either reachable, meaning one or more names can lead to it, or unreachable, meaning there are no names by which the object itself can be found. I've omitted annotated tags from this description, but they are generally found via tag names, and an annotated tag object has one object reference (of arbitrary object type) that it finds, making that one object reachable if the tag object itself is reachable.

Because references only refer to one object, but sometimes we do something with a branch name that we want to undo afterward, Git keeps a log of each value a reference had, and when. These reference logs or reflogs let us know what master had in it yesterday, or what was in develop last week. Eventually these reflog entries are old and stale and unlikely to be useful any more, and git reflog expire will discard them.
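In the simplest terms, expiring a reflog means dropping entries older than a cutoff. The sketch below assumes each entry is a (timestamp, message) pair and uses a 90-day limit, which is my understanding of Git's default for ordinary entries; the real git reflog expire distinguishes further cases (unreachable entries use a shorter limit) that are ignored here.

```python
from datetime import datetime, timedelta


def expire_reflog(entries, now, max_age):
    """Keep only the reflog entries that are no older than max_age."""
    cutoff = now - max_age
    return [entry for entry in entries if entry[0] >= cutoff]


now = datetime(2018, 5, 2)
reflog = [
    (datetime(2018, 5, 1), "commit: add feature"),
    (datetime(2018, 4, 20), "commit: fix typo"),
    (datetime(2017, 12, 1), "reset: moving to HEAD~1"),
]

kept = expire_reflog(reflog, now, timedelta(days=90))
print([message for _, message in kept])
```

Once the old reset entry is gone, any commit that only that entry was keeping alive becomes unreachable.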

Repack and prune

What git repack does, at a high level, should now be reasonably clear: it turns a collection of many loose objects into a pack file full of all those objects. It can do more, though: it can include all objects from a previous pack. The previous pack becomes superfluous and can be removed afterward. It can also omit any unreachable objects from the pack, turning them instead into loose objects. When git gc runs git repack it does so with options that depend on the git gc options, so the exact semantics vary here, but the default for a foreground git gc is to use git repack -d -l, which has git repack delete redundant packs and run git prune-packed. The prune-packed program removes loose object files that also appear in pack files, so this removes the loose objects that went into the pack. The repack program passes the -l option on to git pack-objects (which is the actual workhorse that builds the pack file) where it means to omit objects that are borrowed from other repositories. (This last option is not important for most normal Git usage.)

In any case, it's git repack—or technically, git pack-objects—that prints the counting, compressing, and writing messages. When it is done you have a new pack file and the old pack file(s) are gone. The new pack file holds all the reachable objects, including the old reachable packed objects and the old reachable loose objects. If loose objects were ejected from one of the old (now torn-down and removed) pack files, they join the other loose (and unreachable) objects cluttering your repository. If they were destroyed during the tear-down, only the existing loose-and-unreachable objects remain.

It's now time for git prune: this finds loose, unreachable objects and removes them. However, it has a safety switch, --expire 2.weeks.ago: by default, as run by git gc, it does not remove such objects if they are not at least two weeks old. This means that any Git program that is in the process of creating new objects, that has not yet hooked them up, has a grace period. The new objects can be loose and unreachable for (by default) fourteen days before git prune will delete them. So a Git program that is busy creating objects has fourteen days during which it can complete the hooking-up of those objects into the graph. If it decides those objects are not worth hooking-up, it can just leave them; 14 days from that point, a future git prune will remove them.

If you run git prune manually, you must choose your own --expire argument if you want a grace period: the default without --expire is not 2.weeks.ago but now, so every loose, unreachable object is removed regardless of age.
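The pruning rule boils down to three tests on each object: loose, unreachable, and older than the expiry cutoff. Here is a minimal sketch, assuming loose maps made-up hash IDs to file modification times and reachable is a set computed elsewhere; the function name and data are illustrative only.

```python
from datetime import datetime, timedelta


def prune_candidates(loose, reachable, now, grace=timedelta(weeks=2)):
    """Return the loose objects that are unreachable and older than grace.

    Packed objects never appear in loose, so they are never candidates.
    """
    cutoff = now - grace
    return {
        hash_id
        for hash_id, mtime in loose.items()
        if hash_id not in reachable and mtime < cutoff
    }


now = datetime(2018, 5, 2)
loose = {
    "aaa1": datetime(2018, 4, 1),   # unreachable and old
    "bbb2": datetime(2018, 4, 30),  # unreachable but recent
    "ccc3": datetime(2018, 1, 1),   # old but reachable
}
reachable = {"ccc3"}

print(sorted(prune_candidates(loose, reachable, now)))                  # ['aaa1']
print(sorted(prune_candidates(loose, reachable, now, timedelta(0))))    # ['aaa1', 'bbb2']
```

The second call mimics an expiry of now: the recent object loses its grace period.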


[1] Tree objects actually hold triples: name, mode, hash. The mode is 100644 or 100755 for a blob object, 040000 for a sub-tree, 120000 for a symbolic link, and so on.

[2] For lookup speed on Linux, the hash is split after the first two characters: the hash name ab34ef56... becomes ab/34ef56... in the .git/objects directory. This keeps the size of each subdirectory within .git/objects small-ish, which tames the O(n^2) behavior of some directory operations. This ties in with git gc --auto, which repacks automatically when one object directory becomes sufficiently large. Git assumes that each subdirectory is about the same size, as the hashes should be roughly uniformly distributed, so it only needs to count one subdirectory.

answered Sep 28 '22 12:09 by torek


Since the recent addition of the git maintenance command (Git 2.29 (Q4 2020)), the replacement for git gc -prune would be:

git maintenance run --task=pack-refs
# which runs
git pack-refs --all --prune

With Git 2.31 (Q1 2021), the "git maintenance" tool learned a new pack-refs maintenance task.

See commit acc1c4d, commit 41abfe1 (09 Feb 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit d494433, 17 Feb 2021)

maintenance: add pack-refs task

Signed-off-by: Derrick Stolee
Reviewed-by: Taylor Blau

It is valuable to collect loose refs into a more compressed form.
This is typically the packed-refs file, although this could be the reftable in the future.
Having packed refs can be extremely valuable in repos with many tags or remote branches that are not modified by the local user, but still are necessary for other queries.

For instance, with many exploded refs, commands such as

git describe --tags --exact-match HEAD

can be very slow (multiple seconds).
This command in particular is used by terminal prompts to show when a detached HEAD is pointing to an existing tag, so having it be slow causes significant delays for users.

Add a new 'pack-refs' maintenance task.
It runs 'git pack-refs --all --prune' to move loose refs into a packed form.
For now, that is the packed-refs file, but could adjust to other file formats in the future.

This is the first of several sub-tasks of the 'gc' task that could be extracted to their own tasks.
In this process, we should not change the behavior of the 'gc' task since that remains the default way to keep repositories maintained.
Creating a new task for one of these sub-tasks only provides more customization options for those choosing to not use the 'gc' task.
It is certainly possible to have both the 'gc' and 'pack-refs' tasks enabled and run regularly.
While they may repeat effort, they do not conflict in a destructive way.

The 'auto_condition' function pointer is left NULL for now.
We could extend this in the future to have a condition check if pack-refs should be run during 'git maintenance run --auto'.

git maintenance now includes in its man page:

pack-refs

The pack-refs task collects the loose reference files and collects them into a single file. This speeds up operations that need to iterate across many references.

And it can run on a schedule, as part of its new pack-refs task:

maintenance: incremental strategy runs pack-refs weekly

Signed-off-by: Derrick Stolee
Reviewed-by: Taylor Blau

When the 'maintenance.strategy' config option is set to 'incremental', a default maintenance schedule is enabled.
Add the 'pack-refs' task to that strategy at the weekly cadence.

git config now includes in its man page:

task, but runs the prefetch and commit-graph tasks hourly, the loose-objects and incremental-repack tasks daily, and the pack-refs task weekly.


The "git maintenance register" command had trouble registering bare repositories, which has been corrected with Git 2.31 (Q1 2021).

See commit 26c7974 (23 Feb 2021) by Eric Sunshine (sunshineco).
(Merged by Junio C Hamano -- gitster -- in commit d166e8c, 25 Feb 2021)

maintenance: fix incorrect maintenance.repo path with bare repository

Reported-by: Clement Moyroud
Signed-off-by: Eric Sunshine

The periodic maintenance tasks configured by git maintenance start invoke git for-each-repo to run git maintenance run on each path specified by the multi-value global configuration variable maintenance.repo.
Because git for-each-repo will likely be run outside of the repositories which require periodic maintenance, it is mandatory that the repository paths specified by maintenance.repo are absolute.

Unfortunately, however, git maintenance register does nothing to ensure that the paths it assigns to maintenance.repo are indeed absolute, and may in fact -- especially in the case of a bare repository -- assign a relative path to maintenance.repo instead.
Fix this problem by converting all paths to absolute before assigning them to maintenance.repo.

While at it, also fix git maintenance unregister to convert paths to absolute, as well, in order to ensure that it can correctly remove from maintenance.repo a path assigned via git maintenance register.


With Git 2.30 (Q4 2020), "git maintenance", an extended big brother of "git gc", continues to evolve with a new task in place of git gc and git prune:

See commit e841a79, commit a13e3d0, commit 52fe41f, commit efdd2f0, commit 18e449f, commit 3e220e6, commit 252cfb7, commit 28cb5e6 (25 Sep 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 52b8c8c, 27 Oct 2020)

maintenance: add loose-objects task

Signed-off-by: Derrick Stolee

One goal of background maintenance jobs is to allow a user to disable auto-gc (gc.auto=0) but keep their repository in a clean state.
Without any cleanup, loose objects will clutter the object database and slow operations.
In addition, the loose objects will take up extra space because they are not stored with deltas against similar objects.

Create a 'loose-objects' task for the 'git maintenance run' command.
This helps clean up loose objects without disrupting concurrent Git commands using the following sequence of events:

  1. Run 'git prune-packed' to delete any loose objects that exist in a pack-file. Concurrent commands will prefer the packed version of the object to the loose version. (Of course, there are exceptions for commands that specifically care about the location of an object. These are rare for a user to run on purpose, and we hope a user that has selected background maintenance will not be trying to do foreground maintenance.)

  2. Run 'git pack-objects' on a batch of loose objects.
    These objects are grouped by scanning the loose object directories in lexicographic order until listing all loose objects -or- reaching 50,000 objects. This is more than enough if the loose objects are created only by a user doing normal development. We noticed users with millions of loose objects because VFS for Git downloads blobs on-demand when a file read operation requires populating a virtual file.

This step is based on a similar step in Scalar and VFS for Git.

git maintenance now includes in its man page:

loose-objects

The loose-objects job cleans up loose objects and places them into pack-files.

In order to prevent race conditions with concurrent Git commands, it follows a two-step process.

  • First, it deletes any loose objects that already exist in a pack-file; concurrent Git processes will examine the pack-file for the object data instead of the loose object.
  • Second, it creates a new pack-file (starting with "loose-") containing a batch of loose objects.

The batch size is limited to 50 thousand objects to prevent the job from taking too long on a repository with many loose objects.
The gc task writes unreachable objects as loose objects to be cleaned up by a later step only if they are not re-added to a pack-file; for this reason it is not advisable to enable both the loose-objects and gc tasks at the same time.
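The two-step process described above can be sketched as plain set logic. This is only a model of the selection rules (which loose objects get deleted, which get batched); the function name and the short hash IDs are invented, and nothing here touches pack files.

```python
def loose_objects_task(loose, packed, batch_size=50_000):
    """Return (deleted, batch) for one run of the two-step process."""
    # Step 1: loose copies of objects already in a pack are redundant.
    deleted = sorted(h for h in loose if h in packed)
    # Step 2: take the remaining loose objects in lexicographic order,
    # stopping at the batch limit; these would go into a new pack-file.
    remaining = sorted(h for h in loose if h not in packed)
    batch = remaining[:batch_size]
    return deleted, batch


loose = {"ab12", "cd34", "ef56", "0a00"}
packed = {"cd34", "9999"}

print(loose_objects_task(loose, packed, batch_size=2))
# (['cd34'], ['0a00', 'ab12'])
```

With the limit set to 2, object ef56 stays loose and would be picked up by a later run.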

answered Sep 28 '22 11:09 by VonC