
How to improve git log performance?

Tags: git, git-log

I am trying to extract git logs from a few repositories like this:

git log --pretty=format:%H\t%ae\t%an\t%at\t%s --numstat

For larger repositories (like rails/rails) it takes a solid 35+ seconds to generate the log.

Is there a way to improve this performance?

Asked by George L on Feb 03 '16 at 20:02


1 Answer

TL;DR: as mentioned at Git Merge 2019:

git config --global core.commitGraph true
git config --global gc.writeCommitGraph true
cd /path/to/repo
git commit-graph write

Actually (see at the end), the first two config settings are not needed with Git 2.24+ (Q4 2019): they are true by default.

As T4cC0re mentions in the comments:

If you are on git version 2.29 or above you should rather run:

git commit-graph write --reachable --changed-paths

This will pre-compute file paths, so that git log commands that are scoped to files also benefit from this cache.


Git 2.18 (Q2 2018) improves git log performance:

See commit 902f5a2 (24 Mar 2018) by René Scharfe (rscharfe).
See commit 0aaf05b, commit 3d475f4 (22 Mar 2018) by Derrick Stolee (derrickstolee).
See commit 626fd98 (22 Mar 2018) by brian m. carlson (bk2204).
(Merged by Junio C Hamano -- gitster -- in commit 51f813c, 10 Apr 2018)

sha1_name: use bsearch_pack() for abbreviations

When computing abbreviation lengths for an object ID against a single packfile, the method find_abbrev_len_for_pack() currently implements binary search.
This is one of several implementations.
One issue with this implementation is that it ignores the fanout table in the pack-index.

Translate this binary search to use the existing bsearch_pack() method that correctly uses a fanout table.

Due to the use of the fanout table, the abbreviation computation is slightly faster than before.

For a fully-repacked copy of the Linux repo, the following 'git log' commands improved:

* git log --oneline --parents --raw
  Before: 59.2s
  After:  56.9s
  Rel %:  -3.8%

* git log --oneline --parents
  Before: 6.48s
  After:  5.91s
  Rel %: -8.9%
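
To illustrate why honoring the fanout table helps, here is a minimal Python sketch of the idea (not Git's actual C code). It assumes a sorted list of object IDs and a 256-entry table where entry b counts the IDs whose first byte is at most b. The names build_fanout and find_oid are made up for this example.

```python
import hashlib
from bisect import bisect_left

def build_fanout(sorted_oids):
    """fanout[b] = number of object IDs whose first byte is <= b."""
    fanout = [0] * 256
    for oid in sorted_oids:
        fanout[oid[0]] += 1
    for b in range(1, 256):
        fanout[b] += fanout[b - 1]
    return fanout

def find_oid(sorted_oids, fanout, oid):
    """Return the index of oid, or -1 if absent, searching only its first-byte range."""
    first = oid[0]
    lo = fanout[first - 1] if first > 0 else 0
    hi = fanout[first]
    pos = bisect_left(sorted_oids, oid, lo, hi)
    if pos < hi and sorted_oids[pos] == oid:
        return pos
    return -1

# Made-up 20-byte IDs, just to exercise the lookup.
oids = sorted(hashlib.sha1(str(i).encode()).digest() for i in range(1000))
fanout = build_fanout(oids)
print(find_oid(oids, fanout, oids[42]))  # prints 42
```

The binary search starts from a range about 256 times smaller than the whole index, which saves roughly eight comparison steps per lookup.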

The same Git 2.18 adds a commit graph: it precomputes and stores the information necessary for ancestry traversal in a separate file to optimize graph walking.

See commit 7547b95, commit 3d5df01, commit 049d51a, commit 177722b, commit 4f2542b, commit 1b70dfd, commit 2a2e32b (10 Apr 2018), and commit f237c8b, commit 08fd81c, commit 4ce58ee, commit ae30d7b, commit b84f767, commit cfe8321, commit f2af9f5 (02 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit b10edb2, 08 May 2018)

commit: integrate commit graph with commit parsing

Teach Git to inspect a commit graph file to supply the contents of a struct commit when calling parse_commit_gently().
This implementation satisfies all post-conditions on the struct commit, including loading parents, the root tree, and the commit date.

If core.commitGraph is false, then do not check graph files.

In test script t5318-commit-graph.sh, add output-matching conditions on read-only graph operations.

By loading commits from the graph instead of parsing commit buffers, we save a lot of time on long commit walks.

Here are some performance results for a copy of the Linux repository where 'master' has 678,653 reachable commits and is behind 'origin/master' by 59,929 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  8.31s |  0.94s | -88%  |
| branch -vv                       |  1.02s |  0.14s | -86%  |
| rev-list --all                   |  5.89s |  1.07s | -81%  |
| rev-list --all --objects         | 66.15s | 58.45s | -11%  |
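
As a rough illustration of where the savings come from, the Python sketch below contrasts scanning a textual commit header with unpacking a fixed-width record. The record layout (RECORD, NO_PARENT) is a simplified assumption for this example and is not the real commit-graph file format. The function names are hypothetical too.

```python
import struct

# Hypothetical record: 20-byte tree ID, two parent positions, commit time.
RECORD = struct.Struct(">20sIIQ")
NO_PARENT = 0xFFFFFFFF  # made-up sentinel meaning "no parent in this slot"

def parse_commit_buffer(buf):
    """Slow path: scan the textual commit header line by line."""
    tree, parents, date = None, [], None
    for line in buf.split("\n"):
        if line == "":
            break  # a blank line ends the header
        key, _, value = line.partition(" ")
        if key == "tree":
            tree = value
        elif key == "parent":
            parents.append(value)
        elif key == "committer":
            # value looks like "Name <email> <timestamp> <timezone>"
            date = int(value.rsplit(" ", 2)[1])
    return tree, parents, date

def load_from_graph(graph, pos):
    """Fast path: unpack one fixed-width record at a computed offset."""
    tree, p1, p2, date = RECORD.unpack_from(graph, pos * RECORD.size)
    parents = [p for p in (p1, p2) if p != NO_PARENT]
    return tree.hex(), parents, date
```

In this sketch the graph stores parents as positions of other records rather than as hex IDs, so following a parent is another constant-time unpack instead of another object lookup and parse.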

To learn more about the commit graph, see "How does 'git log --graph' work?".


The same Git 2.18 (Q2 2018) adds lazy loading of trees.

The code has been taught to use the duplicated information stored in the commit-graph file to learn the tree object name for a commit to avoid opening and parsing the commit object when it makes sense to do so.

See commit 279ffad (30 Apr 2018) by SZEDER Gábor (szeder).
See commit 7b8a21d, commit 2e27bd7, commit 5bb03de, commit 891435d (06 Apr 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit c89b6e1, 23 May 2018)

commit-graph: lazy-load trees for commits

The commit-graph file provides quick access to commit data, including the OID of the root tree for each commit in the graph. When performing a deep commit-graph walk, we may not need to load most of the trees for these commits.

Delay loading the tree object for a commit loaded from the graph until requested via get_commit_tree().
Do not lazy-load trees for commits not in the graph, since that requires duplicate parsing and the relative performance improvement when trees are not needed is small.

On the Linux repository, performance tests were run for the following command:

git log --graph --oneline -1000

Before: 0.92s
After:  0.66s
Rel %: -28.3%
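
The lazy-loading pattern itself is simple. Here is a minimal Python sketch, assuming a hypothetical load_tree callback that stands in for the expensive object lookup (Git does this in C behind get_commit_tree()):

```python
class Commit:
    def __init__(self, oid, tree_oid, load_tree):
        self.oid = oid
        self._tree_oid = tree_oid    # cheap: already known from the graph
        self._load_tree = load_tree  # hypothetical callback doing the expensive work
        self._tree = None

    def get_tree(self):
        """Load the tree only on first request, then reuse it."""
        if self._tree is None:
            self._tree = self._load_tree(self._tree_oid)
        return self._tree
```

A walk that only prints commit IDs never calls get_tree(), so it never pays for tree loading.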

Git 2.21 (Q1 2019) adds a loose object cache.

See commit 8be88db (07 Jan 2019), and commit 4cea1ce, commit d4e19e5, commit 0000d65 (06 Jan 2019) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit eb8638a, 18 Jan 2019)

object-store: use one oid_array per subdirectory for loose cache

The loose objects cache is filled one subdirectory at a time as needed.
It is stored in an oid_array, which has to be resorted after each add operation.
So when querying a wide range of objects, the partially filled array needs to be resorted up to 255 times, which takes over 100 times longer than sorting once.

Use one oid_array for each subdirectory.
This ensures that entries have to only be sorted a single time. It also avoids eight binary search steps for each cache lookup as a small bonus.

The cache is used for collision checks for the log placeholders %h, %t and %p, and we can see the change speeding them up in a repository with ca. 100 objects per subdirectory:

$ git count-objects
  26733 objects, 68808 kilobytes

Test                        HEAD^             HEAD
--------------------------------------------------------------------
4205.1: log with %H         0.51(0.47+0.04)   0.51(0.49+0.02) +0.0%
4205.2: log with %h         0.84(0.82+0.02)   0.60(0.57+0.03) -28.6%
4205.3: log with %T         0.53(0.49+0.04)   0.52(0.48+0.03) -1.9%
4205.4: log with %t         0.84(0.80+0.04)   0.60(0.59+0.01) -28.6%
4205.5: log with %P         0.52(0.48+0.03)   0.51(0.50+0.01) -1.9%
4205.6: log with %p         0.85(0.78+0.06)   0.61(0.56+0.05) -28.2%
4205.7: log with %h-%h-%h   0.96(0.92+0.03)   0.69(0.64+0.04) -28.1%
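
A small Python sketch of the "one array per subdirectory" idea described in that commit message. The list_subdir callback is a hypothetical stand-in for reading one objects/xx directory, and the class is an illustration, not Git's oid_array code:

```python
from bisect import bisect_left

class LooseCache:
    def __init__(self, list_subdir):
        self._list_subdir = list_subdir  # hypothetical: IDs stored under one first byte
        self._buckets = [None] * 256     # one sorted list per subdirectory, filled on demand

    def has(self, oid):
        first = oid[0]
        bucket = self._buckets[first]
        if bucket is None:
            # Each subdirectory is read and sorted exactly once.
            bucket = sorted(self._list_subdir(first))
            self._buckets[first] = bucket
        pos = bisect_left(bucket, oid)
        return pos < len(bucket) and bucket[pos] == oid
```

Because each bucket is independent, filling a new subdirectory never forces a re-sort of the entries already cached, and each lookup searches a list that is about 1/256th of the total.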

Git 2.22 (Q2 2019) checks errors before using data read from the commit-graph file.

See commit 93b4405, commit 43d3561, commit 7b8ce9c, commit 67a530f, commit 61df89c, commit 2ac138d (25 Mar 2019), and commit 945944c, commit f6761fa (21 Feb 2019) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit a5e4be2, 25 Apr 2019)

commit-graph write: don't die if the existing graph is corrupt

When the commit-graph is written we end up calling parse_commit(). This will in turn invoke code that'll consult the existing commit-graph about the commit, if the graph is corrupted we die.

We thus get into a state where a failing "commit-graph verify" can't be followed-up with a "commit-graph write" if core.commitGraph=true is set, the graph either needs to be manually removed to proceed, or core.commitGraph needs to be set to "false".

Change the "commit-graph write" codepath to use a new parse_commit_no_graph() helper instead of parse_commit() to avoid this.
The latter will call repo_parse_commit_internal() with use_commit_graph=1 as seen in 177722b ("commit: integrate commit graph with commit parsing", 2018-04-10, Git v2.18.0-rc0).

Not using the old graph at all slows down the writing of the new graph by some small amount, but is a sensible way to prevent an error in the existing commit-graph from spreading.


With Git 2.24+ (Q4 2019), the commit-graph is active by default:

See commit aaf633c, commit c6cc4c5, commit ad0fb65, commit 31b1de6, commit b068d9a, commit 7211b9e (13 Aug 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit f4f8dfe, 09 Sep 2019)

commit-graph: turn on commit-graph by default

The commit-graph feature has seen a lot of activity in the past year or so since it was introduced.
The feature is a critical performance enhancement for medium- to large-sized repos, and does not significantly hurt small repos.

Change the defaults for core.commitGraph and gc.writeCommitGraph to true so users benefit from this feature by default.


Still with Git 2.24 (Q4 2019), a configuration variable tells "git fetch" to write the commit graph after finishing.

See commit 50f26bd (03 Sep 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 5a53509, 30 Sep 2019)

fetch: add fetch.writeCommitGraph config setting

The commit-graph feature is now on by default, and is being written during 'git gc' by default.
Typically, Git only writes a commit-graph when a 'git gc --auto' command passes the gc.auto setting to actually do work. This means that a commit-graph will typically fall behind the commits that are being used every day.

To stay updated with the latest commits, add a step to 'git fetch' to write a commit-graph after fetching new objects.
The fetch.writeCommitGraph config setting enables writing a split commit-graph, so on average the cost of writing this file is very small. Occasionally, the commit-graph chain will collapse to a single level, and this could be slow for very large repos.

For additional use, adjust the default to be true when feature.experimental is enabled.


And still with Git 2.24 (Q4 2019), the commit-graph is more robust.

See commit 6abada1, commit fbab552 (12 Sep 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 098e8c6, 07 Oct 2019)

commit-graph: bump DIE_ON_LOAD check to actual load-time

Commit 43d3561 (commit-graph write: don't die if the existing graph is corrupt, 2019-03-25, Git v2.22.0-rc0) added an environment variable we use only in the test suite, $GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD.
But it put the check for this variable at the very top of prepare_commit_graph(), which is called every time we want to use the commit graph.
Most importantly, it comes before we check the fast-path "did we already try to load?", meaning we end up calling getenv() for every single use of the commit graph, rather than just when we load.

getenv() is allowed to have unexpected side effects, but that shouldn't be a problem here; we're lazy-loading the graph so it's clear that at least one invocation of this function is going to call it.

But it is inefficient. getenv() typically has to do a linear search through the environment space.

We could memoize the call, but it's simpler still to just bump the check down to the actual loading step. That's fine for our sole user in t5318, and produces this minor real-world speedup:

[before]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ):      1.460 s ±  0.017 s    [User: 1.174 s, System: 0.285 s]
Range (min … max):    1.440 s …  1.491 s    10 runs

[after]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ):      1.391 s ±  0.005 s    [User: 1.118 s, System: 0.273 s]
Range (min … max):    1.385 s …  1.399 s    10 runs
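
The shape of that fix can be sketched in Python as follows. The class and the env_lookups counter are made up for illustration. Only the environment variable name comes from the commit message.

```python
import os

class CommitGraphLoader:
    def __init__(self):
        self._attempted = False
        self._graph = None
        self.env_lookups = 0  # instrumentation for this sketch only

    def _die_on_load(self):
        self.env_lookups += 1
        return os.environ.get("GIT_TEST_COMMIT_GRAPH_DIE_ON_LOAD") is not None

    def prepare(self):
        # Fast path first: after one load attempt, never touch the environment again.
        if self._attempted:
            return self._graph
        self._attempted = True
        if self._die_on_load():
            raise RuntimeError("dying as requested on commit-graph load")
        self._graph = {"loaded": True}  # stand-in for reading the real file
        return self._graph
```

With the check placed after the fast path, a thousand calls to prepare() perform a single environment lookup instead of a thousand.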

Git 2.24 (Q4 2019) also includes a regression fix.

See commit cb99a34, commit e88aab9 (24 Oct 2019) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit dac1d83, 04 Nov 2019)

commit-graph: fix writing first commit-graph during fetch

Reported-by: Johannes Schindelin
Helped-by: Jeff King
Helped-by: Szeder Gábor
Signed-off-by: Derrick Stolee

The previous commit includes a failing test for an issue around fetch.writeCommitGraph and fetching in a repo with a submodule. Here, we fix that bug and set the test to "test_expect_success".

The problem arises with this set of commands when the remote repo at <url> has a submodule.
Note that --recurse-submodules is not needed to demonstrate the bug.

$ git clone <url> test
$ cd test
$ git -c fetch.writeCommitGraph=true fetch origin
  Computing commit graph generation numbers: 100% (12/12), done.
  BUG: commit-graph.c:886: missing parent <hash1> for commit <hash2>
  Aborted (core dumped)

As an initial fix, I converted the code in builtin/fetch.c that calls write_commit_graph_reachable() to instead launch a "git commit-graph write --reachable --split" process. That code worked, but is not how we want the feature to work long-term.

That test did demonstrate that the issue must be something to do with internal state of the 'git fetch' process.

The write_commit_graph() method in commit-graph.c ensures the commits we plan to write are "closed under reachability" using close_reachable().
This method walks from the input commits, and uses the UNINTERESTING flag to mark which commits have already been visited. This allows the walk to take O(N) time, where N is the number of commits, instead of O(P) time, where P is the number of paths. (The number of paths can be exponential in the number of commits.)
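
To make the O(N) versus O(P) remark concrete, here is a small Python example on a made-up history of stacked "diamonds" (a merge whose two parents share one parent). All names are hypothetical.

```python
def diamond_history(k):
    """Build k stacked diamonds; returns (parents map, tip commit)."""
    parents = {0: []}  # commit 0 is the root
    tip = 0
    for _ in range(k):
        left, right, merge = tip + 1, tip + 2, tip + 3
        parents[left] = [tip]
        parents[right] = [tip]
        parents[merge] = [left, right]
        tip = merge
    return parents, tip

def count_paths(parents, commit):
    """Walk without marks: follows every distinct path down to the root."""
    if not parents[commit]:
        return 1
    return sum(count_paths(parents, p) for p in parents[commit])

def walk_with_marks(parents, tip):
    """Walk that marks visited commits, so each commit is expanded once."""
    seen = {tip}
    stack = [tip]
    while stack:
        commit = stack.pop()
        for p in parents[commit]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

parents, tip = diamond_history(10)
print(len(parents), count_paths(parents, tip), len(walk_with_marks(parents, tip)))
# prints: 31 1024 31
```

Ten diamonds give only 31 commits but 1024 paths, and every extra diamond doubles the path count while adding just three commits.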

However, the UNINTERESTING flag is used in lots of places in the codebase. This flag usually means some barrier to stop a commit walk, such as in revision-walking to compare histories.
It is not often cleared after the walk completes because the starting points of those walks do not have the UNINTERESTING flag, and clear_commit_marks() would stop immediately.

This is happening during a 'git fetch' call with a remote. The fetch negotiation is comparing the remote refs with the local refs and marking some commits as UNINTERESTING.

I tested running clear_commit_marks_many() to clear the UNINTERESTING flag inside close_reachable(), but the tips did not have the flag, so that did nothing.

It turns out that the calculate_changed_submodule_paths() method is at fault. Thanks, Peff, for pointing out this detail! More specifically, for each submodule, the collect_changed_submodules() runs a revision walk to essentially do file-history on the list of submodules. That revision walk marks commits UNINTERESTING if they are simplified away by not changing the submodule.

Instead, I finally arrived at the conclusion that I should use a flag that is not used in any other part of the code. In commit-reach.c, a number of flags were defined for commit walk algorithms. The REACHABLE flag seemed like it made the most sense, and it seems it was not actually used in the file.
The REACHABLE flag was used in early versions of commit-reach.c, but was removed by 4fbcca4 ("commit-reach: make can_all_from_reach... linear", 2018-07-20, v2.20.0-rc0).

Add the REACHABLE flag to commit-graph.c and use it instead of UNINTERESTING in close_reachable().
This fixes the bug in manual testing.
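
The failure mode is easy to reproduce in miniature. In this Python sketch, the bit values and the close_reachable function are illustrative assumptions, not Git's definitions. A stale UNINTERESTING bit left by an earlier walk makes the closure stop too early, while a dedicated bit does not.

```python
UNINTERESTING = 1 << 0
REACHABLE = 1 << 1  # illustrative bit values only

def close_reachable(parents, flags, tips, mark):
    """Return all commits reachable from tips, using `mark` as the visited bit."""
    result = []
    stack = []
    for t in tips:
        if not flags.get(t, 0) & mark:
            flags[t] = flags.get(t, 0) | mark
            stack.append(t)
    while stack:
        commit = stack.pop()
        result.append(commit)
        for p in parents[commit]:
            if not flags.get(p, 0) & mark:
                flags[p] = flags.get(p, 0) | mark
                stack.append(p)
    return sorted(result)

parents = {"A": [], "B": ["A"], "C": ["B"]}
stale = {"B": UNINTERESTING}  # left behind by some earlier, unrelated walk

print(close_reachable(parents, dict(stale), ["C"], UNINTERESTING))  # prints ['C']
print(close_reachable(parents, dict(stale), ["C"], REACHABLE))      # prints ['A', 'B', 'C']
```

In the first call, C ends up in the output without its parent B, which corresponds to the "missing parent" situation reported in the BUG message above.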


Fetching from multiple remotes into the same repository in parallel had a bad interaction with the recent change to (optionally) update the commit-graph after a fetch job finishes, as these parallel fetches compete with each other.

That has been corrected with Git 2.25 (Q1 2020).

See commit 7d8e72b, commit c14e6e7 (03 Nov 2019) by Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit bcb06e2, 01 Dec 2019)

fetch: add the command-line option --write-commit-graph

Signed-off-by: Johannes Schindelin

This option overrides the config setting fetch.writeCommitGraph, if both are set.

And:

fetch: avoid locking issues between fetch.jobs/fetch.writeCommitGraph

Signed-off-by: Johannes Schindelin

When both fetch.jobs and fetch.writeCommitGraph are set, we currently try to write the commit graph in each of the concurrent fetch jobs, which frequently leads to error messages like this one:

fatal: Unable to create '.../.git/objects/info/commit-graphs/commit-graph-chain.lock': File exists.

Let's avoid this by holding off from writing the commit graph until all fetch jobs are done.
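
In outline, the approach looks like the following Python sketch. The fetch_one and write_commit_graph callbacks are hypothetical placeholders, and Git itself runs the fetch jobs as child processes rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(remotes, fetch_one, write_commit_graph, jobs=4):
    """Fetch remotes in parallel, then write the commit-graph exactly once."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # list() waits until every fetch job has finished.
        results = list(pool.map(fetch_one, remotes))
    # Single writer after all jobs are done, so nothing competes for the lock file.
    write_commit_graph()
    return results
```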


The code to write split commit-graph file(s) upon fetching computed a bogus value for the parameter used in splitting the resulting files, which has been corrected with Git 2.25 (Q1 2020).

See commit 63020f1 (02 Jan 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 037f067, 06 Jan 2020)

commit-graph: prefer default size_mult when given zero

Signed-off-by: Derrick Stolee

In 50f26bd ("fetch: add fetch.writeCommitGraph config setting", 2019-09-02, Git v2.24.0-rc0 -- merge listed in batch #4), the fetch builtin added the capability to write a commit-graph using the "--split" feature.
This feature creates multiple commit-graph files, and those can merge based on a set of "split options" including a size multiple.
The default size multiple is 2, which intends to provide a log_2 N depth of the commit-graph chain where N is the number of commits.

However, I noticed during dogfooding that my commit-graph chains were becoming quite large when left only to builds by 'git fetch'.
It turns out that in split_graph_merge_strategy(), we default the size_mult variable to 2, except we override it with the context's split_opts if they exist.
In builtin/fetch.c, we create such a split_opts, but do not populate it with values.

This problem is due to two failures:

  1. It is unclear that we can add the flag COMMIT_GRAPH_WRITE_SPLIT with a NULL split_opts.
  2. If we have a non-NULL split_opts, then we override the default values even if a zero value is given.

Correct both of these issues.

  • First, do not override size_mult when the options provide a zero value.
  • Second, stop creating a split_opts in the fetch builtin.
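
To see why a zero size multiple makes chains grow, here is a toy Python model of the merge rule. It is a simplification under my own assumptions: it ignores every other split option, and add_layer, simulate and effective_size_mult are made-up names.

```python
DEFAULT_SIZE_MULT = 2

def add_layer(chain, new_commits, size_mult=DEFAULT_SIZE_MULT):
    """Append a layer to `chain` (base first, tip last), merging the layer
    below while it holds at most size_mult times the commits being written."""
    n = new_commits
    while chain and chain[-1] <= size_mult * n:
        n += chain.pop()
    chain.append(n)
    return chain

def simulate(fetches, size_mult):
    """Model `fetches` fetches that each bring in one new commit."""
    chain = []
    for _ in range(fetches):
        add_layer(chain, 1, size_mult)
    return chain

def effective_size_mult(requested):
    """The fix in miniature: a zero value falls back to the default."""
    return requested if requested else DEFAULT_SIZE_MULT

print(len(simulate(100, 0)))  # prints 100
```

With a multiple of 0 no layer ever qualifies for merging, so 100 fetches leave 100 layers. With the default of 2, each layer stays more than twice the size of the one above it, which keeps the chain depth logarithmic in the number of commits.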

Note that git log was broken between Git 2.22 (Q2 2019) and Git 2.27 (Q2 2020) when using a magic pathspec.

The command line parsing of "git log :/a/b/" was broken for about a full year without anybody noticing, which has been corrected.

See commit 0220461 (10 Apr 2020) by Jeff King (peff).
See commit 5ff4b92 (10 Apr 2020) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit 95ca489, 22 Apr 2020)

sha1-name: do not assume that the ref store is initialized

Reported-by: Érico Rolim

c931ba4e ("sha1-name.c: remove the_repo from handle_one_ref()", 2019-04-16, Git v2.22.0-rc0 -- merge listed in batch #8) replaced the use of for_each_ref() helper, which works with the main ref store of the default repository instance, with refs_for_each_ref(), which can work on any ref store instance, by assuming that the repository instance the function is given has its ref store already initialized.

But it is possible that nobody has initialized it, in which case, the code ends up dereferencing a NULL pointer.

And:

repository: mark the "refs" pointer as private

Signed-off-by: Jeff King

The "refs" pointer in a struct repository starts life as NULL, but then is lazily initialized when it is accessed via get_main_ref_store().
However, it's easy for calling code to forget this and access it directly, leading to code which works some of the time, but fails if it is called before anybody else accesses the refs.

This was the cause of the bug fixed by 5ff4b920eb ("sha1-name: do not assume that the ref store is initialized", 2020-04-09, Git v2.27.0 -- merge listed in batch #3). In order to prevent similar bugs, let's more clearly mark the "refs" field as private.
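
The same pitfall and its remedy can be shown with a short Python analogy. The class and attribute names are made up here, and Git's version is a C struct field paired with get_main_ref_store().

```python
class RefStore:
    def __init__(self):
        self.refs = {"HEAD": "refs/heads/main"}  # made-up contents

class Repository:
    def __init__(self):
        # Starts as None; callers are expected to go through the accessor.
        self._refs_private = None

    def get_main_ref_store(self):
        """Initialize the ref store on first use, then return the same instance."""
        if self._refs_private is None:
            self._refs_private = RefStore()
        return self._refs_private
```

Code that reads the field directly before anyone has called the accessor gets None, which is the Python counterpart of the NULL dereference described above. Naming the field so that it is obviously private nudges callers toward the accessor.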

Answered by VonC on Sep 21 '22 at 21:09