In my repo, <code>git diff</code> and <code>git stash</code> both run quickly, in less than a second. However, <code>git stash -p</code> takes a good 20 seconds before showing the first hunk. Why could this be?



Why does git stash -p take long to start?


2 Answers

This should improve with Git 2.25.2 (March 2020), which adds code simplification.
See discussion.

See commit 26f924d (07 Jan 2020) by Elijah Newren (newren).
(Merged by Junio C Hamano -- gitster -- in commit a3648c0, 22 Jan 2020)

unpack-trees: exit check_updates() early if updates are not wanted

Signed-off-by: Elijah Newren

check_updates() has a lot of code that repeatedly checks whether o->update or o->dry_run are set.

(Note that o->dry_run is a near-synonym for !o->update, but not quite as per commit 2c9078d05bf2 ("unpack-trees: add the dry_run flag to unpack_trees_options", 2011-05-25, Git v1.7.6-rc0).)
In fact, this function almost turns into a no-op whenever the condition
!o->update || o->dry_run
is met.

Simplify the code by checking this condition at the beginning of the function, and when it is true, do the few things that are relevant and return early.

There are a few things that make the conversion not quite obvious:

The fact that check_updates() does not actually turn into a no-op when updates are not wanted may be slightly surprising.
However, commit 33ecf7eb61 (Discard "deleted" cache entries after using them to update the working tree, 2008-02-07, Git v1.5.5-rc0) put the discarding of unused cache entries in check_updates() so we still need to keep the call to remove_marked_cache_entries().
It's possible this call belongs in another function, but it is certainly needed as tests will fail if it is removed.

The original called remove_scheduled_dirs() unconditionally.
Technically, commit 7847892716 (unlink_entry(): introduce schedule_dir_for_removal(), 2009-02-09, Git v1.6.3-rc0) should have made that call conditional, but it didn't matter in practice because remove_scheduled_dirs() becomes a no-op when all the calls to unlink_entry() are skipped.
As such, we do not need to call it.

When (o->dry_run && o->update), the original would have two calls to git_attr_set_direction() surrounding a bunch of skipped updates.
These two calls to git_attr_set_direction() cancel each other out and thus can be omitted when o->dry_run is true just as they already are when !o->update.

The code would previously call setup_collided_checkout_detection() and report_collided_checkout() even when o->dry_run.
However, this was just an expensive no-op because setup_collided_checkout_detection() merely cleared the CE_MATCHED flag for each cache entry, and report_collided_checkout() reported which ones had it set.
Since a dry-run would skip all the checkout_entry() calls, CE_MATCHED would never get set and thus no collisions would be reported.
Since we can't detect the collisions anyway without doing updates, skipping the collisions detection setup and reporting is an optimization.

The code previously would call get_progress() and display_progress() even when (!o->update || o->dry_run).
This served to show how long it took to skip all the updates, which is somewhat useless.
Since we are skipping the updates, we can skip showing how long it takes to skip them.

answered Oct 23 '22 16:10

VonC

I see the same problem. It started over a year ago and has not improved since then. I also use git on a very big repo. Unfortunately, in my case it also contains a lot of binary data, since it is just a mirror of an SVN repo made with git-svn, and my colleagues think it's a good idea to put binary test data into the repo.

This is not a full answer, just hints and guesses about where to look:

- The big difference seems to be that stash -p calls the function stash_patch, whereas a plain stash calls stash_working_tree.
- stash_patch spawns child processes that run other git commands. One of these is read-tree (see: man git-read-tree). The final command looks like this: GIT_INDEX_FILE=index.stash.<PID> git read-tree HEAD. This takes almost no time.
- The next step is another child process running GIT_INDEX_FILE=index.stash.<PID> git add--interactive --patch=stash -- <PATH>. This is where all the reads come from and what takes up all the time. Interestingly, just running GIT_INDEX_FILE=index.stash.<PID> git status after GIT_INDEX_FILE=index.stash.<PID> git read-tree HEAD is as expensive as git add--interactive. add--interactive is a Perl script implementing add -p. I don't know Perl and had a hard time reading it, but it probably checks the working directory state using the same code as git status.
- The basic idea seems to be:
  - Create a temporary index from HEAD
  - Interactively add changes to that index
  - Save the changed temporary index to a tree-ish
- The expensive part seems to be determining the state of the working directory with respect to the temporary index. Why it is so expensive, I don't know. Probably some cached data is invalidated, so git has to read every file in the working copy, at least partially, to compare it with the temporary index. Understanding this would require diving deeper into the internals of git status.
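
The three steps in the list above can be sketched with plumbing commands. This is only an illustration under assumptions, not git's actual stash code: the function name tree_from_temp_index and the index file name index.sketch.<PID> are made up, and whole paths are added with a plain git add where stash -p would run the interactive hunk selection.

```shell
# Hypothetical sketch of the three steps; this is not git's own stash code.
# Usage: tree_from_temp_index <path>...   (prints the ID of the resulting tree)
tree_from_temp_index() {
    # Made-up name for the temporary index, kept inside the git directory.
    tmp_index=$(git rev-parse --absolute-git-dir)/index.sketch.$$ || return 1

    # 1. Create a temporary index from HEAD; the real index is not touched.
    GIT_INDEX_FILE=$tmp_index git read-tree HEAD || return 1

    # 2. Add changes to that index. stash -p selects hunks interactively here;
    #    the sketch adds whole paths so that it can run unattended.
    GIT_INDEX_FILE=$tmp_index git add -- "$@" || { rm -f "$tmp_index"; return 1; }

    # 3. Save the changed temporary index as a tree object.
    tree=$(GIT_INDEX_FILE=$tmp_index git write-tree)
    status=$?

    rm -f "$tmp_index"
    [ "$status" -eq 0 ] && printf '%s\n' "$tree"
}
```

Step 2 is the one that has to compare the working directory against an index that was only just created from HEAD, which is where the time goes in the measurements below.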

I measured this as follows (st is an alias for status, as the trace shows):

GIT_INDEX_FILE=.git/index.stash.test git read-tree HEAD
GIT_TRACE_PERFORMANCE=/tmp/trace_status GIT_INDEX_FILE=.git/index.stash.test git st .

The result looks like this:

20:31:20.439868 read-cache.c:2290       performance: 0.000269090 s:  read cache .git/index.stash.test
20:31:20.441368 preload-index.c:147     performance: 0.001419629 s:   preload index
20:32:15.568433 read-cache.c:1605       performance: 55.128484420 s:  refresh index
20:32:15.568611 diff-lib.c:251          performance: 0.000054503 s:  diff-files
20:32:15.568847 unpack-trees.c:1546     performance: 0.000004362 s:    traverse_trees
20:32:15.568868 unpack-trees.c:447      performance: 0.000008189 s:    check_updates
20:32:15.568874 unpack-trees.c:1643     performance: 0.000040807 s:   unpack_trees
20:32:15.568879 diff-lib.c:537          performance: 0.000079322 s:  diff-index
20:32:15.569115 name-hash.c:600         performance: 0.000197074 s:   initialize name hash
20:32:15.573785 dir.c:2326              performance: 0.004883714 s:  read directory 
20:32:15.574904 read-cache.c:3017       performance: 0.001083674 s:  write index, changed mask = 82
20:32:15.575125 trace.c:475             performance: 55.135763475 s: git command: /usr/lib/git-core/git status .
20:32:15.575421 trace.c:475             performance: 55.136831211 s: git command: git st .
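
To pick out the dominant entry from such a trace without reading it line by line, a small filter can help. This is a sketch that assumes the line format shown above (performance: <seconds> s: <label>); the function name slowest_trace_entries is made up. Lines labelled "git command:" are skipped because they are whole-command totals rather than individual steps.

```shell
# Hypothetical helper: print the N slowest steps from a GIT_TRACE_PERFORMANCE
# log, slowest first. Usage: slowest_trace_entries <trace-file> [N]
slowest_trace_entries() {
    trace_file=$1
    count=${2:-3}

    # Drop whole-command totals, keep only "<seconds> s: <label>" from each
    # remaining line, then sort numerically in descending order. LC_ALL=C makes
    # sort read "55.128" with a decimal point regardless of the user's locale.
    grep -v 'git command:' "$trace_file" |
        sed -n 's/^.*performance: //p' |
        LC_ALL=C sort -rn |
        head -n "$count"
}
```

For the trace above, slowest_trace_entries /tmp/trace_status 1 prints the refresh index line, which accounts for practically the whole run time.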

My repo looks like this:

>$ du -hd 1
1,1M    ./.idea
74M     ./code
3,0G    ./.git
2,4G    ./test-data
5,5G    .

The picture is similar when the trace is applied directly to git stash -p:

20:43:55.968088 read-cache.c:1605       performance: 59.716998605 s:  refresh index
20:43:55.969584 trace.c:475             performance: 59.719061140 s: git command: git update-index --refresh

The man page for git update-index --refresh states:

USING --REFRESH
       --refresh does not calculate a new sha1 file or bring the index up to date for mode/content changes. But what it does do is to "re-match" the stat information of a file with the index, so that you can refresh the index for a
       file that hasn’t been changed but where the stat entry is out of date.

       For example, you’d want to do this after doing a git read-tree, to link up the stat index details with the proper files.
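
One way to check what the man page describes is to look at the stat data cached in each index. The sketch below counts index entries whose cached mtime is zero, using git ls-files --debug. The function name count_unstatted_entries is made up, and it is an assumption here that an index freshly created by read-tree carries zeroed stat fields, so treat the result as a hint rather than proof.

```shell
# Hypothetical helper: count entries in an index file that carry no cached
# modification time. Usage: count_unstatted_entries <index-file>
count_unstatted_entries() {
    # ls-files --debug prints the cached stat fields after each path; an entry
    # that was never matched against the working tree shows "mtime: 0:0".
    GIT_INDEX_FILE=$1 git ls-files --debug | grep -c 'mtime: 0:0$'
}
```

Comparing count_unstatted_entries .git/index with count_unstatted_entries .git/index.stash.test right after the read-tree above would show whether the temporary index has any stat data to match against. If it has none, the refresh cannot rely on stat information and has to look at file contents instead, which would fit the long refresh index times measured here.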

answered Oct 23 '22 15:10

Peter


Tags:

git

git-stash

Tor Klingberg
