Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I sort a set of git commit IDs in topological order?

Tags:

git

I have a set of commit SHA1s, in no particular order. I would like to pipe this set to a command, and have those commits returned in topological order.

Here's one way of doing this:

git rev-list --all --topo-order | grep --file SET_OF_SHA1S

As you can imagine, this is a very slow way of doing it, as git rev-list is having to print out all of the commit SHA1s, not just the ones in my set.

Is there a better and faster way to do this?

Use case:

My test framework tests certain Git commits and stores the result in a database. I'm writing a web page that summarises these results, and it would be nice to display the results in order. Sorting by commit date is not ideal as some rebased commits will have exactly the same commit date.

like image 677
Flimm Avatar asked Mar 28 '14 13:03

Flimm


3 Answers

You can use --no-walk to prevent git from dumping any SHA-1s other than the ones you supply, and use --topo-order to force the correct order. As Mort pointed out in a comment, this does not work. The git documentation acquired new text noting (indirectly) this problem, as of git version 2.4. (I consider this a bug in git rev-list, which should load enough of the commit graph to do the topological sort, then output just the user-specified revision IDs, in the correct order.)

My original script (left in here) therefore also does not work. It can be made to work by removing the --no-walk from the step that generates temporary file $TF2, then using the contents of $TF1 to extract and print the "interesting" revisions from their $TF2 (sorted) order.

This is more or less what Flimm's own answer does.

[original answer, with flawed script, below]


I'm not sure exactly what I was doing with this code, but long ago, I wrote a script to check whether arguments supplied were in topo order:

#! /bin/sh
#
# check a list of IDs to see if they're in "topo order"
usage()
{
    echo "usage: $0 id [...]"
}

case $# in
0) usage 1>&2; exit 1;;
esac

TF1=$(mktemp)
TF2=$(mktemp)
trap "rm -f $TF1 $TF2; exit" 0 1 2 3 15

# parse the arguments into one file
git rev-parse $@ > $TF1 || exit 1
# and topo-sort the arguments into another
git rev-list --topo-order --no-walk --reverse $@ > $TF2 || exit 1
# If the list is in the correct order the files will be the same
cmp -s $TF1 $TF2 || {
    # If the files differ, it's possible that some argument(s) name
    # the same rev...
    [ $(wc -l < $TF1) -eq $(wc -l < $TF2) ] || {
        echo "ERROR: there are repeats in $@"
        # finding them is a pain, we don't bother trying
        exit 1
    }
    echo "ERROR: $@ NOT in topo order"
    echo "use instead:"
    # read the topo-ordered raw IDs
    while read sha1; do
        # and find the (single) arg in $@ that names this one
        for i; do
            if [ $(git rev-parse $i) = $sha1 ]; then
                printf ' %s' $i
                break
            fi
        done
    done < $TF2
    echo
    exit 1
}
echo "$@ in topo order"
exit 0

I think what I wanted here was to emit the same argument names, e.g., if you said git-check-topo v1.7 1234567 branchX it would tell you to use (literally) branchX v1.7 1234567, if that is what got you the right order, rather than just showing the raw SHA-1s.

For your purposes, a simple:

git rev-list --topo-order --no-walk $@

(with or without --reverse as desired) should work, I think.

like image 22
torek Avatar answered Oct 29 '22 12:10

torek


Here's one way of speeding it up:

git rev-list --topo-order $(cat SET_OF_SHA1S) \
   | grep --file SET_OF_SHA1S --max-count $(wc -l SET_OF_SHA1S)

Optimisations:

  • Only ask rev-list to list all commits reachable from your set of SHA1s.
  • As soon as rev-list prints enough commits that include the set of SHA1s you're interested in, tell grep to stop grepping using the --max-count parameter. grep will in turn close its input, and rev-list will stop needlessly printing out further SHA1s.
like image 155
Flimm Avatar answered Oct 29 '22 12:10

Flimm


Another way to speed up git rev-list --topo-order is to use Git 2.20 (Q4 2018)

See commit 561b583, commit b454241, commit 5284fc5, commit f0d9cc4, commit d6b4071, commit 4b47a9a, commit aca4240 (01 Nov 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 62ca33e, 18 Nov 2018)

revision.c: generation-based topo-order algorithm

The current --topo-order algorithm requires walking all reachable commits up front, topo-sorting them, all before outputting the first value.

This patch introduces a new algorithm which uses stored generation numbers to incrementally walk in topo-order, outputting commits as we go.
This can dramatically reduce the computation time to write a fixed number of commits, such as when limiting with "-n <N>" or filling the first page of a pager.

In my local testing, I used the following Git commands on the Linux repository in three modes:

  • HEAD~1 with no commit-graph,
  • HEAD~1 with a commit-graph, and
  • HEAD with a commit-graph.

This allows comparing the benefits we get from parsing commits from the commit-graph and then again the benefits we get by restricting the set of commits we walk.

Test: git rev-list --topo-order -100 HEAD
HEAD~1, no commit-graph: 6.80 s
HEAD~1, w/ commit-graph: 0.77 s
HEAD, w/ commit-graph: 0.02 s

See all the details in commit b454241.

Note: as mentioned in commit d6b4071:

The rev-list command is critical to Git's functionality.

Here are a few important types of rev-list operations:

  • Basic: git rev-list --topo-order HEAD
  • Range: git rev-list --topo-order compare..HEAD
  • Ancestry: git rev-list --topo-order --ancestry-path compare..HEAD
  • Symmetric Difference: git rev-list --topo-order compare...HEAD

Do use Git 2.23 (Q3 2019) though, as a topo-orider from Git 2.20 introduced a regression when used on range.

See commit 1d8e31a, commit 1b4d882 (21 May 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit bdc81d1, 17 Jun 2019)

revision: keep topo-walk free of unintersting commits

When updating the topo-order walk in b454241 (revision.c: generation-based topo-order algorithm, 2018-11-01, v2.20.0-rc0), the logic was a huge rewrite of the walk logic.
In that massive change, we accidentally included the UNINTERESTING commits in expand_topo_walk().
This means that a simple query like

git rev-list --topo-order HEAD~1..HEAD

will expand the topo walk for all commits reachable from HEAD, and not just one commit.

revision: use generation for A..B --topo-order queries

If a commit-graph exists with computed generation numbers, then a 'git rev-list --topo-order -n <N> <rev>' query will use those generation numbers to reduce the number of commits walked before writing N commits.

One caveat put in b454241 (revision.c: generation-based topo-order algorithm, 2018-11-01) was to not enable the new algorithm for queries with a revision range "A..B".
The logic was placed to walk from "A" and mark those commits as uninteresting, but the performance was actually worse than the existing logic in some cases.

The root cause of this performance degradation is that generation numbers increase the number of commits we walk relative to the existing heuristic of walking by commit date.
While generation numbers actually guarantee that the algorithm is correct, the existing logic is very rarely wrong and that added requirement is not worth the cost.


With Git 2.28 (Q3 2020), "git log -L..." now takes advantage of the "which paths are touched by this commit?" info stored in the commit-graph system.

See commit 002933f, commit 3cb9d2b, commit 48da94b, commit d554672 (11 May 2020) by SZEDER Gábor (szeder).
See commit f32dde8 (11 May 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit c3a0282, 09 Jun 2020)

line-log: try to use generation number-based topo-ordering

Signed-off-by: SZEDER Gábor
Signed-off-by: Derrick Stolee

The previous patch made it possible to perform line-level filtering during history traversal instead of in an expensive preprocessing step, but it still requires some simpler preprocessing steps, notably topo-ordering.

However, nowadays we have commit-graphs storing generation numbers, which make it possible to incrementally traverse the history in topological order, without the preparatory limit_list() and sort_in_topological_order() steps; see b45424181e ("[revision.c](https://github.com/git/git/blob/002933f3fe2b016022ebbbbb359f6aeba58309a4/revision.c): generation-based topo-order algorithm", 2018-11-01, Git v2.20.0-rc0 -- merge).

This patch combines the two, so we can do both the topo-ordering and the line-level filtering during history traversal, eliminating even those simpler preprocessing steps, and thus further reducing the delay before showing the first commit modifying the given line range.

The 'revs->limited' flag plays the central role in this, because, due to limitations of the current implementation, the generation number-based topo-ordering is only enabled when this flag remains unset.

Line-level log, however, always sets this flag in setup_revisions() ever since the feature was introduced in 12da1d1f6f ("Implement line-history search (git log -L)", 2013-03-28, Git v1.8.4-rc0 -- merge).

The reason for setting 'limited' is unclear, though, because the line-level log itself doesn't directly depend on it, and it doesn't affect how the limit_list() function limits the revision range.

However, there is an indirect dependency: the line-level log requires topo-ordering, and the "traditional" sort_in_topological_order() requires an already limited commit list since e6c3505b44 ("Make sure we generate the whole commit list before trying to sort it topologically", 2005-07-06, Git v0.99 -- merge listed in batch #0).

The new, generation numbers-based topo-ordering doesn't require a limited commit list anymore.

So don't set 'revs->limited' for line-level log, unless it is really necessary, namely:

  • The user explicitly requested parent rewriting, because that is still done in the line_log_filter() preprocessing step (see previous patch), which requires sort_in_topological_order() and in turn limit_list() as well.
  • A commit-graph file is not available or it doesn't yet contain generation numbers.
    In these cases we had to fall back on sort_in_topological_order() and in turn limit_list().
    The existing condition with generation_numbers_enabled() has already ensured that the 'limited' flag is set in these cases; this patch just makes sure that the line-level log sets 'revs->topo_order' before that condition.

While the reduced delay before showing the first commit is measurable in git.git, it takes a bigger repository to make it clearly noticable.

In both cases below the line ranges were chosen so that they were modified rather close to the starting revisions, so the effect of this change is most noticable.

# git.git
$ time git --no-pager log -L:read_alternate_refs:sha1-file.c -1 v2.23.0

Before:

real    0m0.107s
user    0m0.091s
sys     0m0.013s

After:

real    0m0.058s
user    0m0.050s
sys     0m0.005s

# linux.git
$ time git --no-pager log \
 -L:build_restore_work_registers:arch/mips/mm/tlbex.c -1 v5.2

Before:

real   0m1.129s
user   0m1.061s
sys    0m0.069s

After:

real   0m0.096s
user   0m0.087s
sys    0m0.009s

Additional testing by Derrick Stolee: Since this patch improves the performance for the first result, I repeated the experiment from the previous patch on the Linux kernel repository, reporting real time here:

Command: git log -L 100,200:MAINTAINERS -n 1 >/dev/null
 Before: 0.71 s
 After: 0.05 s

Now, we have dropped the full topo-order of all ~910,000 commits before reporting the first result. The remaining performance improvements then are:

  1. Update the parent-rewriting logic to be incremental similar to how "git log --graph" behaves.

  2. Use changed-path Bloom filters to reduce the time spend in the tree-diff to see if the path(s) changed.


With Git 2.31 (Q1 2021), the topological walk codepath is covered by new trace2 stats.

See commit 90b666d (30 Dec 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 02feca7, 15 Jan 2021)

revision: trace topo-walk statistics

Signed-off-by: Derrick Stolee

We trace statistics about the effectiveness of changed-path Bloom filters since 42e50e7 ("revision.c: add trace2 stats around Bloom filter usage", 2020-04-06, Git v2.27.0-rc0 -- merge listed in batch #6).
Add similar tracing for the topo-walk algorithm that uses generation numbers to limit the walk size.

This information can help investigate and describe benefits to heuristics and other changes.

The information that is printed is in JSON format and can be formatted nicely to present as follows:

{
unt_explort_walked":2603,
unt_indegree_walked":2603,
unt_topo_walked":473
}

Each of these values count the number of commits are visited by each of the three "stages" of the topo-walk as detailed in b454241 ("revision.c: generation-based topo-order algorithm", 2018-11-01, Git v2.20.0-rc0 -- merge).


With Git 2.32 (Q2 2021), a new configuration variable has been introduced to allow choosing which version of the generation number gets used in the commit-graph file.

See commit 702110a, commit c7ef8fe (25 Feb 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit d20fa3c, 22 Mar 2021)

commit-graph: use config to specify generation type

Signed-off-by: Derrick Stolee

We have two established generation number versions: 1: topological levels 2: corrected commit dates

The corrected commit dates are enabled by default, but they also write extra data in the GDAT and GDOV chunks.
Services that host Git data might want to have more control over when this feature rolls out than just updating the Git binaries.

Add a new "commitGraph.generationVersion" config option that specifies the intended generation number version.
If this value is less than 2, then the GDAT chunk is never written or read from an existing file.

git config now includes in its man page:

commitGraph.generationVersion

Specifies the type of generation number version to use when writing or reading the commit-graph file.
If version 1 is specified, then the corrected commit dates will not be written or read.

Defaults to 2.


Make sure to use Git 2.36+ (Q2 2022) though, as it fixes the way generation number v2 in the commit-graph files are (not) handled.

See commit 6dbf4b8 (02 Mar 2022), and commit c8d67b9, commit 3b0199d, commit 75979d9, commit 17925e0, commit c78c7a9 (01 Mar 2022) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit a54cc52, 16 Mar 2022)

commit-graph: declare bankruptcy on GDAT chunks

Reported-by: Patrick Steinhardt
Signed-off-by: Derrick Stolee

The Generation Data (GDAT) and Generation Data Overflow (GDOV) chunks store corrected commit date offsets, used for generation number v2.
Recent changes have demonstrated that previous versions of Git were incorrectly parsing data from these chunks, but might have also been writing them incorrectly.

Patrick demonstrated a case where in split commit-graphs across an alternate boundary (and possibly some other special conditions) it was possible to have a commit-graph that was generated by a previous version of Git have incorrect generation number v2 data which results in errors like the following:

commit-graph generation for commit <oid> is 1623273624 < 1623273710

If we cannot trust the existing data in the GDAT and GDOV chunks, then we can alter the format to change the chunk IDs for these chunks.
This causes the new version of Git to silently ignore the older chunks (and disabling generation number v2 in the process) while writing new commit-graph files with correct data in the GDA2 and GDO2 chunks.

technical/commit-graph-format now includes in its man page:

Historical Notes:

The Generation Data (GDA2) and Generation Data Overflow (GDO2) chunks have the number '2' in their chunk IDs because a previous version of Git wrote possibly erroneous data in these chunks with the IDs "GDAT" and "GDOV". By changing the IDs, newer versions of Git will silently ignore those older chunks and write the new information without trusting the incorrect data.

like image 36
VonC Avatar answered Oct 29 '22 10:10

VonC