
Why does 'git status' run filters?

Tags: git

I need to clone a git repo into an existing directory ($HOME, for managing dotfiles). I'm doing a bare clone and reconfiguring it because I need to clone into an existing unclean working directory. It works; however, I find that git status tries to run filters on first use. Why does it do that, and how can I prevent it?

Try this:

# create a test repo
mkdir test && cd test
git init
echo hello > hello.txt
git add .
git commit -m 1
echo 'hello.txt filter=foo diff=bar' > .gitattributes
git add .
git commit -m 2

# clone it bare and configure it
mkdir ../test2 && cd ../test2
git clone --bare ../test .git
git config core.bare false
git config core.logallrefupdates true
git reset
git checkout .
git config filter.foo.clean foo
git config filter.foo.smudge foo
git config diff.bar.textconv bar

The first git status fails:

$ git status
error: cannot run foo: No such file or directory
error: cannot fork to run external filter 'foo'
error: external filter 'foo' failed
On branch master
nothing to commit, working tree clean

Running git status a second time does not fail:

$ git status
On branch master
nothing to commit, working tree clean

Also, initially doing git status multiple times in quick succession (i.e. git status; git status; git status) can yield multiple failures. Sometimes.

As far as I can tell from a lot of reading, the filters should only run when checking files in and out.

So why does git status run them?

asked Jan 30 '17 11:01 by starfry


1 Answer

The idea that filters only run during checkin/checkout is something of a white lie. It's meant to make filters more explicable.

In fact, though, filters run when moving files between the index and work-tree (and also, in sufficiently modern versions of Git, when requested with --path= options in git show and git cat-file and git hash-object as well: some of these are transitions directly from repository to stdout, or stdin to repository). This is mostly equivalent to checkin/checkout time. But git status has special dispensation as well, due to the cache aspect of the index.
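To make the hash-object case concrete, here is a minimal sketch, assuming a Git recent enough to support `git hash-object --path`. The filter name `upper`, the file name, and the throwaway repository are made up for the demo. It hashes the same input three ways to show that the clean filter runs only when a path is supplied.

```shell
#!/bin/sh
# Sketch: git hash-object runs the clean filter when given --path.
# The filter name "upper" and the file names are hypothetical.
work=$(mktemp -d) || exit 1
cd "$work" || exit 1
git init -q .
echo 'hello.txt filter=upper' > .gitattributes
git config filter.upper.clean 'tr a-z A-Z'

# No path given, so no attributes apply: the bytes are hashed as-is.
raw=$(echo hello | git hash-object --stdin)
# With --path, Git looks up the attributes for hello.txt and cleans first.
filtered=$(echo hello | git hash-object --stdin --path=hello.txt)
# Hash of the already-uppercased text, with no filter involved.
expected=$(echo HELLO | git hash-object --stdin)

echo "raw:      $raw"
echo "filtered: $filtered"
echo "expected: $expected"
```

If the filter ran, the `filtered` and `expected` lines should show the same hash, and `raw` should differ from both.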

For performance reasons, Git wants to know whether any file in the work-tree might be "dirty" with respect to the version in the index. Git assumes that the stat value st_mtime, which typically has one-second resolution,[1] can be used for this purpose: if the st_mtime of the work-tree file is older than the st_mtime saved in its index entry, then the index entry is up to date and is "clean": what's in the index matches what's in the work-tree, after applying clean filters etc.

If the time stamp of the work-tree entry is newer than the saved index entry, then the file has definitely been modified: the index entry may be out of date. It's not guaranteed out of date, as the work-tree file may have been modified in a way that ultimately made no change. But it's clearly necessary to run the clean filter (and any CR/LF line ending hackery).

If the two time stamps are the same, the work-tree entry is indeterminate. (Git calls this "racily clean" although "racily dirty" would be equally accurate.)
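The three cases can be sketched as a small shell function. This only illustrates the simplified model described here, not Git's actual code. The function name `classify` and the labels it prints are made up, and both arguments are assumed to be whole-second Unix time stamps.

```shell
#!/bin/sh
# classify WORKTREE_MTIME INDEX_MTIME
# Hypothetical helper mirroring the simplified three-way comparison:
#   older -> clean   (the index entry can be trusted)
#   newer -> rehash  (the file must be re-cleaned and re-hashed)
#   equal -> racy    (indeterminate, so it must be re-checked as well)
classify() {
    if [ "$1" -lt "$2" ]; then
        echo clean
    elif [ "$1" -gt "$2" ]; then
        echo rehash
    else
        echo racy
    fi
}

classify 1485774000 1485774060   # prints: clean
classify 1485774060 1485774000   # prints: rehash
classify 1485774060 1485774060   # prints: racy
```

With one-second resolution, a file written and indexed within the same second always lands in the third branch, which is why a freshly checked-out tree tends to start out racy.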

In both of these last two cases (newer or equal time stamp), git status will run the clean filter (and any input-to-Git direction CR/LF modifications) over the work-tree file to compute a new hash. If the new hash matches the index hash, Git can and will update the index entry to mark the file as "actually clean". Now, the next time you do something, Git won't have to run the clean filter.

Unless, that is, you do it all within the resolution of the st_mtime stat field. In that case, the index entry winds up "racily clean" and Git has to try again. This is what you are observing here.
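One way to watch this happen is to count clean-filter runs. The following is a rough sketch under stated assumptions: the filter name `count`, the helper script, and the two-second sleep are all made up for the demo, and the first count depends on timing and on how your Git was built. What it is meant to show is that once a git status has run after the time stamp window has passed, a further git status should not need the filter again.

```shell
#!/bin/sh
# Sketch: count clean-filter runs triggered by "git status".
# The filter name "count" and the helper script are hypothetical.
work=$(mktemp -d) || exit 1
log="$work/clean.log"

# Helper clean filter: log one line per invocation, pass content through.
cat > "$work/count-clean.sh" <<EOF
#!/bin/sh
echo run >> "$log"
exec cat
EOF

cd "$work" || exit 1
git init -q repo && cd repo || exit 1
echo hello > hello.txt
echo 'hello.txt filter=count' > .gitattributes
git config filter.count.clean "sh '$work/count-clean.sh'"
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m 1

# Number of filter invocations recorded so far.
runs() { wc -l < "$log" | tr -d ' '; }

: > "$log"     # forget the runs caused by add and commit
sleep 2        # step past the one-second time stamp window
git status > /dev/null
first=$(runs)
git status > /dev/null
second=$(runs)

echo "filter runs after first status:  $first"
echo "filter runs after second status: $second"
```

If the two numbers match, the second git status trusted the refreshed index entry. Removing the sleep and running git status several times in quick succession is how you would try to reproduce the repeated failures from the question.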

(Note, by the way, that git status runs two diffs: one from HEAD to index, and one from index to work-tree. It's that second diff that benefits hugely from the cache aspect of the index. Newer versions of Git can also store information about untracked files and directories in the index.)


[1] Some stat calls give sub-second precision, but for various reasons, the index / cache entry only stores the 1-second resolution time stamp anyway, normally.

For (much) more on this, see the racy-git.txt file in the technical documentation.

answered Sep 27 '22 18:09 by torek