
Git is really slow for 100,000 objects. Any fixes?

I have a "fresh" git-svn repo (11.13 GB) that has over 100,000 objects in it.

I have run

git fsck
git gc

on the repo after the initial checkout.

I then tried to do a

git status 

A single git status takes anywhere from 2m25.578s to 2m53.901s.

I tested git status by issuing the command

time git status 

5 times, and every run fell between the two times listed above.
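For anyone who wants to repeat that measurement, here is a minimal sketch of a loop that does it. The helper name time_status is made up for illustration, and it uses date +%s, so it only has whole-second resolution; use time git status as above if you need fractions of a second.

```shell
# time_status: hypothetical helper, not a git command.
# Runs `git status` N times (default 5) in the current repository
# and prints the elapsed wall-clock time of each run in whole seconds.
time_status() {
  runs="${1:-5}"
  i=1
  while [ "$i" -le "$runs" ]; do
    start=$(date +%s)
    git status >/dev/null 2>&1
    end=$(date +%s)
    echo "run $i: $((end - start))s"
    i=$((i + 1))
  done
}
```

Run it from inside the working tree, e.g. time_status 5. The first run is often slower than the rest because the filesystem cache is cold, so it is worth looking at all of the runs rather than just one.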

I am doing this on Mac OS X, locally, not through a VM.

There is no way it should be taking this long.

Any ideas? Help?

Thanks.

Edit

I have a co-worker sitting right next to me with a comparable box. It has less RAM and runs Debian with a JFS filesystem. His git status runs in 0.3 seconds on the same repo (it is also a git-svn checkout).

Also, I recently changed the file permissions on this folder to 777, and that brought the time down considerably (why, I have no clue). git status now finishes in 3 to 6 seconds. This is manageable, but still a pain.

manumoomoo asked Jul 22 '10 22:07





2 Answers

It came down to a couple of items that I can see right now.

  1. git gc --aggressive
  2. Opening up file permissions to 777

There has to be something else going on, but these were the things that clearly made the biggest impact.
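To see what the repack actually changes, here is a rough sketch that prints the object counts before and after git gc --aggressive. The function name repack_and_report is invented for this example, and on a repo this size the aggressive pass can take a long time.

```shell
# repack_and_report: hypothetical helper, not a git command.
# Prints object statistics, repacks aggressively, then prints them again.
repack_and_report() {
  echo "before:"
  git count-objects -v
  git gc --aggressive --quiet
  echo "after:"
  git count-objects -v
}
```

In the output, count is the number of loose objects and size-pack is the size of the packfiles; after the repack you would normally expect few or no loose objects left.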

manumoomoo answered Oct 21 '22 22:10



git status has to look at every file in the repository every time. You can tell it to stop looking at trees that you aren't working on with

git update-index --assume-unchanged <trees to skip> 
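As far as I know, update-index wants file paths rather than a directory, so to skip a whole tree you have to feed it the tracked files under that tree. A minimal sketch, where skip_tree is a made-up helper name and vendor/ in the usage line is just a placeholder directory:

```shell
# skip_tree: hypothetical helper, not a git command.
# Marks every tracked file under the given directory as "assume unchanged".
# -z / -0 keep paths with spaces or odd characters intact.
skip_tree() {
  git ls-files -z -- "$1" | xargs -0 git update-index --assume-unchanged --
}
```

Usage would be something like skip_tree vendor/ from the top of the working tree. Only files that are already in the index are affected; untracked files under that directory are still scanned.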


From the manpage:

When these flags are specified, the object names recorded for the paths are not updated. Instead, these options set and unset the "assume unchanged" bit for the paths. When the "assume unchanged" bit is on, git stops checking the working tree files for possible modifications, so you need to manually unset the bit to tell git when you change the working tree file. This is sometimes helpful when working with a big project on a filesystem that has very slow lstat(2) system call (e.g. cifs).

This option can be also used as a coarse file-level mechanism to ignore uncommitted changes in tracked files (akin to what .gitignore does for untracked files). Git will fail (gracefully) in case it needs to modify this file in the index e.g. when merging in a commit; thus, in case the assumed-untracked file is changed upstream, you will need to handle the situation manually.

Many operations in git depend on your filesystem to have an efficient lstat(2) implementation, so that st_mtime information for working tree files can be cheaply checked to see if the file contents have changed from the version recorded in the index file. Unfortunately, some filesystems have inefficient lstat(2). If your filesystem is one of them, you can set "assume unchanged" bit to paths you have not changed to cause git not to do this check. Note that setting this bit on a path does not mean git will check the contents of the file to see if it has changed — it makes git to omit any checking and assume it has not changed. When you make changes to working tree files, you have to explicitly tell git about it by dropping "assume unchanged" bit, either before or after you modify them.

...

In order to set "assume unchanged" bit, use --assume-unchanged option. To unset, use --no-assume-unchanged.

The command looks at core.ignorestat configuration variable. When this is true, paths updated with git update-index paths… and paths updated with other git commands that update both index and working tree (e.g. git apply --index, git checkout-index -u, and git read-tree -u) are automatically marked as "assume unchanged". Note that "assume unchanged" bit is not set if git update-index --refresh finds the working tree file matches the index (use git update-index --really-refresh if you want to mark them as "assume unchanged").
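Since the bit has to be cleared by hand, it helps to be able to see which paths currently have it set. A sketch under the assumption that your paths contain no characters that git would quote in its output; list_assumed and unskip_all are hypothetical helper names:

```shell
# list_assumed: hypothetical helper, not a git command.
# `git ls-files -v` prints a lowercase status letter in front of paths
# that have the "assume unchanged" bit set; keep only those paths.
list_assumed() {
  git ls-files -v | grep '^[a-z] ' | cut -c3-
}

# unskip_all: hypothetical helper, not a git command.
# Clears the "assume unchanged" bit on every path that has it set.
unskip_all() {
  list_assumed | while IFS= read -r path; do
    git update-index --no-assume-unchanged -- "$path"
  done
}
```

Running list_assumed before a merge or an svn rebase is a cheap way to remember which trees you told git to stop checking.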


Now, clearly, this solution is only going to work if there are parts of the repo that you can conveniently ignore. I work on a project of similar size, and there are definitely large trees that I don't need to check on a regular basis. The semantics of git status make it a generally O(n) problem, where n is the number of files. You need domain-specific optimizations to do better than that.

Note that if you work in a stitching pattern, that is, if you integrate changes from upstream by merge instead of rebase, then this solution becomes less convenient, because a change to an --assume-unchanged object merging in from upstream becomes a merge conflict. You can avoid this problem with a rebasing workflow.

masonk answered Oct 21 '22 21:10
