
git partial clone garbage collection

Question: How can I remove / prune / gc unreferenced blobs in a partially cloned git repository?

Details: I am evaluating whether Git partial clone can replace Git LFS, now that both GitLab and GitHub seem to have implemented full support for it (I couldn't find a beta label). Because large binaries / blobs are only fetched on checkout with --filter=blob:none, both the on-disk size of the repository and the fetch performance seem reasonable. That said, I am struggling to find a way to clean / prune the blobs, similar to git lfs prune. When working with a reasonably sized repository over time, you still end up with all the accumulated blobs that are no longer referenced by HEAD.

What I've tried: I was hoping that git gc --prune=today --aggressive would be able to understand the active filter (or that I can pass a filter) to convert all blobs / trees / commits, that are not referenced by the current checkout, to promisor objects. Unfortunately, I could not find a way to reduce the size of the partially cloned repository after using it for quite a while.

edit: I am using git version 2.36.0

Sc4v asked May 21 '26 22:05


1 Answer

I was hoping that git gc --prune=today --aggressive would be able to understand the active filter (or that I can pass a filter) to convert all blobs / trees / commits, that are not referenced by the current checkout, to promisor objects.

That should now work, with Git 2.48 (Q1 2025), batch 10:

"git gc"(man) discards any objects that are outside promisor packs that are referred to by an object in a promisor pack, and we do not refetch them from the promisor at runtime, resulting in an unusable repository.
Work it around by including these objects in the referring promisor pack at the receiving end of the fetch.

See commit c08589e, commit d9e24ce, commit 78995ff, commit da80429 (01 Nov 2024) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit 0c11ef1, 20 Nov 2024)
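Since the question mentions Git 2.36.0 while the change above is in 2.48, a quick version check may be useful before relying on it. The following is only a minimal sketch: the helper name version_at_least is hypothetical, and it assumes a dotted version string such as the third word printed by git version.

```shell
# Hypothetical helper: succeed if version string $1 (e.g. "2.48.1")
# is at least major $2, minor $3. Only major and minor are compared.
version_at_least() {
    major=${1%%.*}        # text before the first dot
    rest=${1#*.}          # text after the first dot
    minor=${rest%%.*}     # text before the next dot
    minor=${minor%%-*}    # drop a suffix such as "-rc0" if there is no patch level
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}
```

It could be called as version_at_least "$(git version | awk '{print $3}')" 2 48, which would fail for the 2.36.0 mentioned in the question.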

index-pack: repack local links into promisor packs

Signed-off-by: Jonathan Tan

Teach index-pack to, when processing the objects in a pack with --promisor specified on the CLI, repack local objects (and the local objects that they refer to, recursively) referenced by these objects into promisor packs.

This prevents the situation in which, when fetching from a promisor remote, we end up with promisor objects (newly fetched) referring to non-promisor objects (locally created prior to the fetch).
This situation may arise if the client had previously pushed objects to the remote, for example.
One issue that arises in this situation is that, if the non-promisor objects become inaccessible except through promisor objects (for example, if the branch pointing to them has moved to point to the promisor object that refers to them), then GC will garbage collect them.
There are other ways to solve this, but the simplest seems to be to enforce the invariant that we don't have promisor objects referring to non-promisor objects.

This repacking is done from index-pack to minimize the performance impact.
During a fetch, the only time most objects are fully inflated in memory is when their object ID is computed, so we also scan the objects (to see which objects they refer to) during this time.

Also to minimize the performance impact, an object is calculated to be local if it's a loose object or present in a non-promisor pack.
(If it's also in a promisor pack or referred to by an object in a promisor pack, it is technically already a promisor object.
But a misidentification of a promisor object as a non-promisor object is relatively benign here - we will thus repack that promisor object into a promisor pack, duplicating it in the object store, but there is no correctness issue, just an issue of inefficiency.)
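To see which packs in a repository count as promisor packs, one can look for the .promisor marker file that sits next to a promisor pack's .pack file. The sketch below is illustrative only: the function name classify_packs is hypothetical, it looks at packs only (loose objects, which the commit message also treats as local, are not listed), and it is not how index-pack itself performs the check.

```shell
# Hypothetical helper: label every pack in a pack directory as
# "promisor" (a sibling .promisor file exists) or "local" (it does not).
classify_packs() {
    pack_dir=$1
    for pack in "$pack_dir"/pack-*.pack; do
        [ -e "$pack" ] || continue      # the glob matched nothing
        base=${pack%.pack}
        if [ -e "$base.promisor" ]; then
            echo "promisor $(basename "$pack")"
        else
            echo "local $(basename "$pack")"
        fi
    done
}
```

For a non-bare clone named partial, a call might look like classify_packs partial/.git/objects/pack.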

git index-pack now includes in its man page:

Also, if there are objects in the given pack that references non-promisor objects (in the repo), repacks those non-promisor objects into a promisor pack. This avoids a situation in which a repo has non-promisor objects that are accessible through promisor objects.


The test: "After fetching descendants of non-promisor commits, gc works"

# Setup
git init full &&
git -C full config uploadpack.allowfilter 1 &&
git -C full config uploadpack.allowanysha1inwant 1 &&
touch full/foo &&
git -C full add foo &&
git -C full commit -m "commit 1" &&
git -C full checkout --detach &&

# Partial clone and push commit to remote
git clone "file://$(pwd)/full" --filter=blob:none partial &&
echo "hello" > partial/foo &&
git -C partial commit -a -m "commit 2" &&
git -C partial push &&

# gc in partial repo
git -C partial gc --prune=now &&

# Create another commit in normal repo
git -C full checkout main &&
echo " world" >> full/foo &&
git -C full commit -a -m "commit 3" &&

# Pull from remote in partial repo, and run gc again
git -C partial pull &&
git -C partial gc --prune=now
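To observe what a gc run like the ones in this test does to the object store, the packed size reported by git count-objects -v can be compared before and after. This is a rough sketch, not part of the Git test suite; count_field and report_gc_effect are hypothetical helper names, and size-pack is reported in KiB.

```shell
# Print the value of one "key: value" field from
# `git count-objects -v` output read on stdin.
count_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }'
}

# Hypothetical helper: report the packed size of a repository
# before and after `git gc --prune=now`.
report_gc_effect() {
    repo=$1
    before=$(git -C "$repo" count-objects -v | count_field size-pack) &&
    git -C "$repo" gc --quiet --prune=now &&
    after=$(git -C "$repo" count-objects -v | count_field size-pack) &&
    echo "size-pack: ${before} KiB -> ${after} KiB"
}
```

For the repositories created in the test, report_gc_effect partial would print one line with both sizes.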

Git 2.48 (Q1 2025), rc0 fixes a performance regression in the recent "fatten promisor pack with local objects" protection against an unwanted gc.

See commit 1a14c85, commit 3619802, commit 911d142 (03 Dec 2024) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit ededd0d, 15 Dec 2024)

index-pack --promisor: also check commits' trees

Signed-off-by: Jonathan Tan

Commit c08589e ("index-pack: repack local links into promisor packs", 2024-11-01, Git v2.48.0-rc0 -- merge listed in batch #10) seems to contain an oversight in that the tree of a commit is not checked.
Teach git to check these trees.

The fix slows down a fetch from a certain repo at $DAYJOB from 2m2.127s to 2m45.052s, but in order to make the fetch correct, it seems worth it.

In order to test this, we could create server and client repos as follows...

 C   S
  \ /
   O

(O and C are commits both on the client and server.
S is a commit only on the server.
C and S have the same tree but different commit messages.
The diff between O and C is non-zero.)

...and then, from the client, fetch S from the server.

In theory, the client declares "have C" and the server can use this information to exclude S's tree (since it knows that the client has C's tree, which is the same as S's tree).
However, it is also possible for the server to compute that it needs to send S and not O, and proceed from there; therefore the objects of C are not considered at all when determining what to send in the packfile.
In order to prevent a test of client functionality from having such a dependence on server behavior, I have not included such a test.
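For illustration only, the commit graph described above (O as the common parent, with C and S sharing one tree but carrying different messages) can be built in a single scratch repository with git commit-tree. This is a sketch under assumptions: the helper name build_topology, the file name, the identity values and the branch name server-only are all made up here, and the sketch does not split the commits between a client and a server, nor does it reproduce the fetch that the commit message says was deliberately left untested.

```shell
# Hypothetical helper: create commits O, C and S in a fresh repository.
# C is on the current branch; S is stored on the branch "server-only".
build_topology() {
    repo=$1
    git init -q "$repo" &&
    git -C "$repo" config user.name "Example" &&          # placeholder identity
    git -C "$repo" config user.email "example@example.com" &&
    echo base > "$repo/file" &&
    git -C "$repo" add file &&
    git -C "$repo" commit -q -m "O" &&
    echo change >> "$repo/file" &&                        # non-empty diff between O and C
    git -C "$repo" commit -q -a -m "C" &&
    # S reuses C's tree, has O as its parent, and differs only in its message.
    s=$(git -C "$repo" commit-tree "HEAD^{tree}" -p HEAD~1 -m "S") &&
    git -C "$repo" update-ref refs/heads/server-only "$s"
}
```

After build_topology scratch, git -C scratch rev-parse 'HEAD^{tree}' 'server-only^{tree}' should print the same tree ID twice, while the two commit IDs differ.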

VonC answered May 23 '26 11:05