
git very slow with many ignored files

I have set up a repository whose working directory contains tens of thousands of files and thousands of directories, totalling many GB of data. This directory is located on a Samba share. I only want to have a few dozen source files within this directory under version control.

I have set up the gitignore file thusly and it works:

# Ignore everything
*

# Except a couple of files in any directory
!*.pin
!*.bsh
!*/

Operations on the repository (such as commit) take several minutes to carry out. This is too long to reasonably get any work done. I suspect that the slowdown is because git is trawling through every directory looking for files that may have been updated.

There are only a few locations in the working directory where I have files that I want to track, so I tried to narrow down the set of files to examine using these patterns:

*
!/version_2/analysis/abcd.pin
!/version_2/analysis/*.bsh
!*/

This also works, but it is still just as slow as the less qualified gitignore. I'm guessing it is that final line that is the killer, but no matter how I tried to make the unignore patterns be very specific, I always had to include that final wildcard clause in order for the process to find any files to commit.

So my two-part question is:

1) Is there a better way to set up the gitignore file that will help speed up the commit process by only including the very narrow set of directories and file types that contain relevant results?

2) Are there other tweaks to git or Samba that are required to make this work more efficiently?

Thanks,

Tom

asked Sep 22 '16 at 15:09 by opeongo



1 Answer

After fiddling around for a bit, I have found a way to significantly improve performance by just modifying the .gitignore file.

The performance problem was caused by my approach of ignoring everything and then specifying what to unignore. This made for a nice, concise specification (4 lines), but it was really slow: because the final !*/ line re-includes every directory, git had to walk the entire directory tree in order to detect what had changed.
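As a rough illustration of why the !*/ line could not simply be dropped from the whitelist, here is a minimal sketch in a scratch repository. The directory and file names are hypothetical, chosen to resemble the ones in the question. It relies on documented gitignore behaviour: git does not descend into an excluded directory, so files inside it cannot be re-included.

```shell
#!/bin/sh
# Sketch only: builds a throwaway repository with hypothetical paths.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p data/deep version_2/analysis
touch data/deep/big.csv version_2/analysis/abcd.pin

# Whitelist WITHOUT '!*/': the '*' pattern also excludes directories,
# so git never looks inside them and the .pin file is never found.
printf '%s\n' '*' '!*.pin' '!*.bsh' > .gitignore
without_dirs=$(git status --porcelain --untracked-files=all)

# Whitelist WITH '!*/': every directory is re-included, so the .pin
# file is found, but git now has to visit every directory.
printf '%s\n' '*' '!*.pin' '!*.bsh' '!*/' > .gitignore
with_dirs=$(git status --porcelain --untracked-files=all)

echo "without '!*/': [$without_dirs]"
echo "with '!*/':    [$with_dirs]"
```

The first listing should come back empty and the second should show only the .pin file, which is the trade-off described above: the whitelist only works when every directory is opened up for traversal.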

My new and improved approach is to use only exclude patterns. With these I can indicate large branches of the tree to prune. I had to add a lengthier set of document and file types to exclude, and this took a few iterations to get right because there were so many. Due to the nature of the data sets, the .gitignore file may need more maintenance in future if new file types show up, but this is a small price to pay.

My final .gitignore file looks something like this:

# prune large input data and results folders wherever they occur
**/data/
**/results/

# Exclude document types that don't need versioning,
# leaving only the types of interest
*~
*#
*.csv
*.doc
*.docx
*.gif
*.htm
*.html
*.ini
*.jpg
*.odt
*.pdf
*.png
*.ppt
*.pptx
*.xls
*.xlsx
*.xlsm
*.xml
*.rar
*.zip

Commit times are now down to a few seconds.
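To check that an exclude-only file behaves as intended, something like the following sketch can be run in a scratch repository. The paths and the shortened pattern list are hypothetical stand-ins for the real ones. It uses git check-ignore -v to report which rule excludes a path, and git status --porcelain to list what git still considers a candidate for tracking.

```shell
#!/bin/sh
# Sketch only: throwaway repository, hypothetical paths, shortened pattern list.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p data/deep results version_2/analysis
touch data/deep/big.csv results/run1.xml \
      version_2/analysis/abcd.pin version_2/analysis/notes.pdf

# Exclude-only rules: prune whole directories, then unwanted file types.
printf '%s\n' '**/data/' '**/results/' '*.pdf' '*.xml' > .gitignore

# Which rule is responsible for ignoring a file deep inside data/?
why_ignored=$(git check-ignore -v data/deep/big.csv)

# What is left for git to consider?
candidates=$(git status --porcelain --untracked-files=all)

echo "$why_ignored"
echo "$candidates"
```

Note that big.csv is not matched by any file-type rule in this sketch; it is reported as ignored because its parent directory is excluded, which is what lets git skip the whole subtree.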

Overall this is still pretty simple, although not as clean as my initial 4-liner.

After review, I think my problem was that I became a victim of my own premature optimization.

answered Oct 07 '22 at 00:10 by opeongo