Q: When git pushes refs that have no common history over the Smart Protocol, can it consider root or sub-trees already in-common between local and origin when building the thin-pack to send?
tl;dr
Consider this (uncommon) situation when working with and pushing to a remote Git repository.
master points to a tree with 1110 descendant sub-trees a[0-9]/b[0-9]/c[0-9].
origin/master is current with the local master commit, i.e. identical histories. It uses the ssh protocol.
squashed: I set that branch to a new, single root commit, but with the same content/tree as master. This can be done with git commit-tree. So this branch has a single commit with no commits in common with master, but the root tree hash is identical: it points to the same tree object as master and origin/master. It is not important that this is a single/squashed commit in order to discuss this; any history rewritten back to the root commit, with no common history, will do.
git push origin HEAD # push squashed
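The setup described above can be reproduced in a throwaway repository. This is only a minimal sketch: the directory layout is a small stand-in for a[0-9]/b[0-9]/c[0-9], and the temporary paths, file contents and demo identity are assumptions made for illustration.

```shell
set -e
# Illustrative identity so the commits work on a machine with no git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master

# A small stand-in for the a[0-9]/b[0-9]/c[0-9] layout, with two commits.
mkdir -p a0/b0/c0 a1/b0/c0
echo one > a0/b0/c0/file
echo two > a1/b0/c0/file
git add .
git commit -q -m "first"
echo three >> a0/b0/c0/file
git commit -q -am "second"

# Create a parentless commit that reuses master's root tree.
squashed_commit=$(git commit-tree 'master^{tree}' -m "squashed")
git branch squashed "$squashed_commit"

master_tree=$(git rev-parse 'master^{tree}')
squashed_tree=$(git rev-parse 'squashed^{tree}')
# With --parents the line holds the commit id followed by its parents,
# so a word count of 1 means the commit has no parents.
parent_count=$(git rev-list --parents -n 1 squashed | wc -w)
# merge-base exits non-zero when the two branches share no commit.
if git merge-base master squashed >/dev/null; then
  shared_history=yes
else
  shared_history=no
fi
echo "master tree:   $master_tree"
echo "squashed tree: $squashed_tree"
echo "shared history: $shared_history"
```

The two tree ids printed at the end should match, while merge-base finds nothing in common: the same content reachable through unrelated commits.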
From observations of the performance of this with a large repository, and the number of objects sent, I suspect that push, send-pack and receive-pack and the associated thin-pack negotiation over the Smart Protocol do something like:
squashed has no common history with any commit origin currently has.
squashed points to a tree that is not only in origin, but is the tree for a current HEAD ref.
In this case the trees are identical. If a subsequent change is made in squashed (either an additional commit, or a new squash that changes a file in a0), 2 trees (/ and a0) would have changed, and the other 1109 would be unchanged. The root tree has changed, which means a next-level search would be required to see whether it is worth searching for further common sub-trees. This might require a heuristic, as without comparing all sub-trees down to the leaves, it is not possible to infer the number of descendant trees in common from the trees at any particular depth.
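The tree arithmetic above can be checked locally by comparing the tree ids reachable from two commits. The sketch below uses a reduced a[0-1]/b[0-1]/c[0-1] layout (14 sub-trees plus the root) so it runs quickly; the helper name tree_ids, the file contents and the temporary paths are hypothetical. With the full layout from the question, the same comparison would give 1109 unchanged trees out of 1111.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master

# Reduced layout: 2 + 4 + 8 = 14 sub-trees, plus the root tree.
for a in 0 1; do for b in 0 1; do for c in 0 1; do
  mkdir -p "a$a/b$b/c$c"
  # Unique file content keeps every tree id distinct.
  echo "a$a/b$b/c$c" > "a$a/b$b/c$c/file"
done; done; done
git add .
git commit -q -m "base"

# Change a file directly inside a0, as in the question.
echo change > a0/file
git add .
git commit -q -m "touch a0"

# Hypothetical helper: print the root tree id and every descendant tree id.
tree_ids() {
  { git rev-parse "$1^{tree}"; git ls-tree -r -d "$1" | awk '{print $3}'; } | sort -u
}

tree_ids HEAD~1 > "$work/before.txt"
tree_ids HEAD   > "$work/after.txt"

total=$(wc -l < "$work/after.txt")
unchanged=$(comm -12 "$work/before.txt" "$work/after.txt" | wc -l)
changed=$((total - unchanged))
echo "trees: $total, unchanged: $unchanged, changed: $changed"
```

Only the root and a0 get new ids; every other tree object is byte-for-byte the same and keeps its id, which is the sharing the question hopes the negotiation could exploit.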
Of course if there are multiple commits in the nothing-in-common history being pushed, this negotiation would need to be repeated for each commit.
Does it sound reasonable that the Smart API could consider already-held common sub-trees, or at the very least, the root-tree, as it considers each commit? Or should Git already be doing this and there is something wrong with my client or server?
git version 2.8.2
A: Checking git's source and trying it with git daemon and GIT_TRACE_PACKET says you're correct about what it's doing: git negotiates at the commit level only. If the history isn't shared, git won't detect the shared content.
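A packet trace like the one mentioned can be captured roughly as follows. This sketch pushes to a local bare repository rather than through git daemon or ssh, which is an assumption made to keep it self-contained; the paths and the demo identity are illustrative. It only inspects the ref advertisement, where the remote describes what it has by ref name and commit id.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q --bare origin.git
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/master
echo content > file
git add file
git commit -q -m "first"
git remote add origin "$work/origin.git"
git push -q origin master

# Unrelated root commit that reuses master's tree.
squashed_commit=$(git commit-tree 'master^{tree}' -m "squashed")
git branch squashed "$squashed_commit"

# An absolute path sends the packet trace to a file instead of stderr.
GIT_TRACE_PACKET="$work/trace.log" git push -q origin squashed

master_commit=$(git rev-parse master)
# Count trace lines where the remote advertises master by its commit id.
advertised=$(grep -c "$master_commit refs/heads/master" "$work/trace.log" || true)
remote_squashed=$(git -C "$work/origin.git" rev-parse refs/heads/squashed)
echo "advertisement lines for master: $advertised"
```

Reading the rest of trace.log by hand shows the exchange for this push; the sketch makes no claim about how many objects end up in the pack.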
"Does it sound reasonable that the Smart API could consider already-held common sub-trees, or at the very least, the root-tree, as it considers each commit?"
If the already-held common subtrees can't be identified by already-held common commits, then to identify those subtrees it'd have to send their ids.
The thing is, for anything short of a complete readout, I can construct a plausible-sounding corner case that sends an arbitrarily-large amount of redundant data -- but sending every existing subtree id every time to avoid that possibility is clearly a huge loss. Don't forget that round-trip latency is horrendously expensive. So, at what point do you become likely to be spending more time negotiating when considering added overhead across all fetches, in the aggregate? If you're going to argue that some particular alternate method would save time overall, you're going to have to show up with hard data on actual production traffic.
Also remember that you can construct packs yourself. It's not hard: you feed object ids to git pack-objects pack and drop the output into .git/objects/pack. Congratulations, you've just fetched exactly those objects into that repo.
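As a rough illustration of building a pack by hand, the sketch below packs a single object (the new root commit) and copies the pack into a second repository that already has the tree and blob. The repository names src and dst, the temporary paths and the demo identity are assumptions, and it presumes direct file access to the target repository.

```shell
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q src
cd src
git symbolic-ref HEAD refs/heads/master
echo content > file
git add file
git commit -q -m "first"

# dst starts out with master, including its tree and blob.
git clone -q "$work/src" "$work/dst"

# Unrelated root commit that reuses master's tree.
squashed_commit=$(git commit-tree 'master^{tree}' -m "squashed")

# Feed exactly one object id to pack-objects; it prints the pack's hash and
# writes pack-<hash>.pack and pack-<hash>.idx using the given base name.
pack_hash=$(echo "$squashed_commit" | git pack-objects -q "$work/pack")

# Drop the pack and its index into the other repository.
mkdir -p "$work/dst/.git/objects/pack"
cp "$work/pack-$pack_hash.pack" "$work/pack-$pack_hash.idx" "$work/dst/.git/objects/pack/"

# dst can now read the commit, and a ref can be pointed at it.
object_type=$(git -C "$work/dst" cat-file -t "$squashed_commit")
git -C "$work/dst" update-ref refs/heads/squashed "$squashed_commit"
dst_squashed_tree=$(git -C "$work/dst" rev-parse 'squashed^{tree}')
dst_master_tree=$(git -C "$work/dst" rev-parse 'master^{tree}')
echo "object type in dst: $object_type"
```

Here the pack carries only the commit, because the caller knows the tree is already on the other side; that knowledge is exactly what the commit-level negotiation does not have.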