I have a set of commit SHA1s, in no particular order. I would like to pipe this set to a command, and have those commits returned in topological order.
Here's one way of doing this:
git rev-list --all --topo-order | grep --file SET_OF_SHA1S
As you can imagine, this is a very slow way of doing it, as git rev-list
is having to print out all of the commit SHA1s, not just the ones in my set.
Is there a better and faster way to do this?
My test framework tests certain Git commits and stores the result in a database. I'm writing a web page that summarises these results, and it would be nice to display the results in order. Sorting by commit date is not ideal as some rebased commits will have exactly the same commit date.
You can use As Mort pointed out in a comment, this does not work. The git documentation acquired new text noting (indirectly) this problem, as of git version 2.4. (I consider this a bug in --no-walk
to prevent git from dumping any SHA-1s other than the ones you supply, and use --topo-order
to force the correct order.git rev-list
, which should load enough of the commit graph to do the topological sort, then output just the user-specified revision IDs, in the correct order.)
My original script (left in here) therefore also does not work. It can be made to work by removing the --no-walk
from the step that generates temporary file $TF2
, then using the contents of $TF1
to extract and print the "interesting" revisions from their $TF2
(sorted) order.
This is more or less what Flimm's own answer does.
[original answer, with flawed script, below]
I'm not sure exactly what I was doing with this code, but long ago, I wrote a script to check whether arguments supplied were in topo order:
#! /bin/sh
#
# check a list of IDs to see if they're in "topo order"
usage()
{
echo "usage: $0 id [...]"
}
case $# in
0) usage 1>&2; exit 1;;
esac
TF1=$(mktemp)
TF2=$(mktemp)
trap "rm -f $TF1 $TF2; exit" 0 1 2 3 15
# parse the arguments into one file
git rev-parse $@ > $TF1 || exit 1
# and topo-sort the arguments into another
git rev-list --topo-order --no-walk --reverse $@ > $TF2 || exit 1
# If the list is in the correct order the files will be the same
cmp -s $TF1 $TF2 || {
# If the files differ, it's possible that some argument(s) name
# the same rev...
[ $(wc -l < $TF1) -eq $(wc -l < $TF2) ] || {
echo "ERROR: there are repeats in $@"
# finding them is a pain, we don't bother trying
exit 1
}
echo "ERROR: $@ NOT in topo order"
echo "use instead:"
# read the topo-ordered raw IDs
while read sha1; do
# and find the (single) arg in $@ that names this one
for i; do
if [ $(git rev-parse $i) = $sha1 ]; then
printf ' %s' $i
break
fi
done
done < $TF2
echo
exit 1
}
echo "$@ in topo order"
exit 0
I think what I wanted here was to emit the same argument names, e.g., if you said git-check-topo v1.7 1234567 branchX
it would tell you to use (literally) branchX v1.7 1234567
, if that is what got you the right order, rather than just showing the raw SHA-1s.
For your purposes, a simple:
git rev-list --topo-order --no-walk $@
(with or without --reverse
as desired) should work, I think.
Here's one way of speeding it up:
git rev-list --topo-order $(cat SET_OF_SHA1S) \
| grep --file SET_OF_SHA1S --max-count $(wc -l SET_OF_SHA1S)
Optimisations:
rev-list
to list all commits reachable from your set of SHA1s.rev-list
prints enough commits that include the set of SHA1s you're interested in, tell grep
to stop grepping using the --max-count
parameter. grep
will in turn close its input, and rev-list
will stop needlessly printing out further SHA1s.Another way to speed up git rev-list --topo-order
is to use Git 2.20 (Q4 2018)
See commit 561b583, commit b454241, commit 5284fc5, commit f0d9cc4, commit d6b4071, commit 4b47a9a, commit aca4240 (01 Nov 2018) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit 62ca33e, 18 Nov 2018)
revision.c
: generation-based topo-order algorithm
The current
--topo-order
algorithm requires walking all reachable commits up front, topo-sorting them, all before outputting the first value.This patch introduces a new algorithm which uses stored generation numbers to incrementally walk in topo-order, outputting commits as we go.
This can dramatically reduce the computation time to write a fixed number of commits, such as when limiting with "-n <N>
" or filling the first page of a pager.In my local testing, I used the following Git commands on the Linux repository in three modes:
HEAD~1
with no commit-graph,HEAD~1
with a commit-graph, and- HEAD with a commit-graph.
This allows comparing the benefits we get from parsing commits from the commit-graph and then again the benefits we get by restricting the set of commits we walk.
Test: git rev-list --topo-order -100 HEAD HEAD~1, no commit-graph: 6.80 s HEAD~1, w/ commit-graph: 0.77 s HEAD, w/ commit-graph: 0.02 s
See all the details in commit b454241.
Note: as mentioned in commit d6b4071:
The rev-list command is critical to Git's functionality.
Here are a few important types of rev-list operations:
- Basic:
git rev-list --topo-order HEAD
- Range:
git rev-list --topo-order compare..HEAD
- Ancestry:
git rev-list --topo-order --ancestry-path compare..HEAD
- Symmetric Difference:
git rev-list --topo-order compare...HEAD
Do use Git 2.23 (Q3 2019) though, as a topo-orider from Git 2.20 introduced a regression when used on range.
See commit 1d8e31a, commit 1b4d882 (21 May 2019) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit bdc81d1, 17 Jun 2019)
revision
: keeptopo-walk
free of unintersting commits
When updating the topo-order walk in b454241 (
revision.c
: generation-based topo-order algorithm, 2018-11-01, v2.20.0-rc0), the logic was a huge rewrite of the walk logic.
In that massive change, we accidentally included the UNINTERESTING commits inexpand_topo_walk()
.
This means that a simple query likegit rev-list --topo-order HEAD~1..HEAD
will expand the topo walk for all commits reachable from HEAD, and not just one commit.
revision
: use generation forA..B --topo-order
queries
If a commit-graph exists with computed generation numbers, then a '
git rev-list --topo-order -n <N> <rev>
' query will use those generation numbers to reduce the number of commits walked before writingN
commits.One caveat put in b454241 (revision.c: generation-based topo-order algorithm, 2018-11-01) was to not enable the new algorithm for queries with a revision range "
A..B
".
The logic was placed to walk from "A" and mark those commits as uninteresting, but the performance was actually worse than the existing logic in some cases.The root cause of this performance degradation is that generation numbers increase the number of commits we walk relative to the existing heuristic of walking by commit date.
While generation numbers actually guarantee that the algorithm is correct, the existing logic is very rarely wrong and that added requirement is not worth the cost.
With Git 2.28 (Q3 2020), "git log -L...
" now takes advantage of the "which paths are touched by this commit?" info stored in the commit-graph system.
See commit 002933f, commit 3cb9d2b, commit 48da94b, commit d554672 (11 May 2020) by SZEDER Gábor (szeder
).
See commit f32dde8 (11 May 2020) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit c3a0282, 09 Jun 2020)
line-log
: try to use generation number-based topo-orderingSigned-off-by: SZEDER Gábor
Signed-off-by: Derrick Stolee
The previous patch made it possible to perform line-level filtering during history traversal instead of in an expensive preprocessing step, but it still requires some simpler preprocessing steps, notably topo-ordering.
However, nowadays we have commit-graphs storing generation numbers, which make it possible to incrementally traverse the history in topological order, without the preparatory
limit_list()
andsort_in_topological_order()
steps; see b45424181e ("[
revision.c](https
://github.com/git/git/blob/002933f3fe2b016022ebbbbb359f6aeba58309a4/revision.c): generation-based topo-order algorithm", 2018-11-01, Git v2.20.0-rc0 -- merge).This patch combines the two, so we can do both the topo-ordering and the line-level filtering during history traversal, eliminating even those simpler preprocessing steps, and thus further reducing the delay before showing the first commit modifying the given line range.
The '
revs->limited
' flag plays the central role in this, because, due to limitations of the current implementation, the generation number-based topo-ordering is only enabled when this flag remains unset.Line-level log, however, always sets this flag in
setup_revisions()
ever since the feature was introduced in 12da1d1f6f ("Implement line-history search (git log -L
)", 2013-03-28, Git v1.8.4-rc0 -- merge).The reason for setting '
limited
' is unclear, though, because the line-level log itself doesn't directly depend on it, and it doesn't affect how thelimit_list()
function limits the revision range.However, there is an indirect dependency: the line-level log requires topo-ordering, and the "traditional"
sort_in_topological_order()
requires an already limited commit list since e6c3505b44 ("Make sure we generate the whole commit list before trying to sort it topologically", 2005-07-06, Git v0.99 -- merge listed in batch #0).The new, generation numbers-based topo-ordering doesn't require a limited commit list anymore.
So don't set '
revs->limited
' for line-level log, unless it is really necessary, namely:
- The user explicitly requested parent rewriting, because that is still done in the
line_log_filter()
preprocessing step (see previous patch), which requiressort_in_topological_order()
and inturn limit_list()
as well.- A commit-graph file is not available or it doesn't yet contain generation numbers.
In these cases we had to fall back onsort_in_topological_order()
and in turnlimit_list()
.
The existing condition withgeneration_numbers_enabled()
has already ensured that the 'limited
' flag is set in these cases; this patch just makes sure that the line-level log sets 'revs->topo_order
' before that condition.While the reduced delay before showing the first commit is measurable in git.git, it takes a bigger repository to make it clearly noticable.
In both cases below the line ranges were chosen so that they were modified rather close to the starting revisions, so the effect of this change is most noticable.
# git.git $ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0 Before: real 0m0.107s user 0m0.091s sys 0m0.013s After: real 0m0.058s user 0m0.050s sys 0m0.005s # linux.git $ time git --no-pager log \ -L:build_restore_work_registers:arch/mips/mm/tlbex.c -1 v5.2 Before: real 0m1.129s user 0m1.061s sys 0m0.069s After: real 0m0.096s user 0m0.087s sys 0m0.009s
Additional testing by Derrick Stolee: Since this patch improves the performance for the first result, I repeated the experiment from the previous patch on the Linux kernel repository, reporting real time here:
Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null Before: 0.71 s After: 0.05 s
Now, we have dropped the full topo-order of all ~910,000 commits before reporting the first result. The remaining performance improvements then are:
Update the parent-rewriting logic to be incremental similar to how "
git log --graph
" behaves.Use
changed-path
Bloom filters to reduce the time spend in the tree-diff to see if the path(s) changed.
With Git 2.31 (Q1 2021), the topological walk codepath is covered by new trace2 stats.
See commit 90b666d (30 Dec 2020) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit 02feca7, 15 Jan 2021)
revision
: trace topo-walk statisticsSigned-off-by: Derrick Stolee
We trace statistics about the effectiveness of changed-path Bloom filters since 42e50e7 ("
revision.c
: add trace2 stats around Bloom filter usage", 2020-04-06, Git v2.27.0-rc0 -- merge listed in batch #6).
Add similar tracing for the topo-walk algorithm that uses generation numbers to limit the walk size.This information can help investigate and describe benefits to heuristics and other changes.
The information that is printed is in JSON format and can be formatted nicely to present as follows:
{ unt_explort_walked":2603, unt_indegree_walked":2603, unt_topo_walked":473 }
Each of these values count the number of commits are visited by each of the three "stages" of the topo-walk as detailed in b454241 ("
revision.c
: generation-based topo-order algorithm", 2018-11-01, Git v2.20.0-rc0 -- merge).
With Git 2.32 (Q2 2021), a new configuration variable has been introduced to allow choosing which version of the generation number gets used in the commit-graph file.
See commit 702110a, commit c7ef8fe (25 Feb 2021) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit d20fa3c, 22 Mar 2021)
commit-graph
: use config to specify generation typeSigned-off-by: Derrick Stolee
We have two established generation number versions: 1: topological levels 2: corrected commit dates
The corrected commit dates are enabled by default, but they also write extra data in the
GDAT
andGDOV
chunks.
Services that host Git data might want to have more control over when this feature rolls out than just updating the Git binaries.Add a new "
commitGraph.generationVersion
" config option that specifies the intended generation number version.
If this value is less than 2, then theGDAT
chunk is never written or read from an existing file.
git config
now includes in its man page:
commitGraph.generationVersion
Specifies the type of generation number version to use when writing or reading the commit-graph file.
If version 1 is specified, then the corrected commit dates will not be written or read.Defaults to 2.
Make sure to use Git 2.36+ (Q2 2022) though, as it fixes the way generation number v2 in the commit-graph files are (not) handled.
See commit 6dbf4b8 (02 Mar 2022), and commit c8d67b9, commit 3b0199d, commit 75979d9, commit 17925e0, commit c78c7a9 (01 Mar 2022) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit a54cc52, 16 Mar 2022)
commit-graph
: declare bankruptcy on GDAT chunksReported-by: Patrick Steinhardt
Signed-off-by: Derrick Stolee
The Generation Data (GDAT) and Generation Data Overflow (GDOV) chunks store corrected commit date offsets, used for generation number v2.
Recent changes have demonstrated that previous versions of Git were incorrectly parsing data from these chunks, but might have also been writing them incorrectly.Patrick demonstrated a case where in split commit-graphs across an alternate boundary (and possibly some other special conditions) it was possible to have a commit-graph that was generated by a previous version of Git have incorrect generation number v2 data which results in errors like the following:
commit-graph generation for commit <oid> is 1623273624 < 1623273710
If we cannot trust the existing data in the GDAT and GDOV chunks, then we can alter the format to change the chunk IDs for these chunks.
This causes the new version of Git to silently ignore the older chunks (and disabling generation number v2 in the process) while writing new commit-graph files with correct data in the GDA2 and GDO2 chunks.
technical/commit-graph-format
now includes in its man page:
Historical Notes:
The Generation Data (GDA2) and Generation Data Overflow (GDO2) chunks have the number '2' in their chunk IDs because a previous version of Git wrote possibly erroneous data in these chunks with the IDs "GDAT" and "GDOV". By changing the IDs, newer versions of Git will silently ignore those older chunks and write the new information without trusting the incorrect data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With