List last commit dates for a large number of files, quickly

Tags: git

I would like to list the last commit date for a large number of files in a git repository.

For the sake of concreteness, let us assume that I want to get the last commit dates of all *.txt files inside a particular subdirectory. There are tens of thousands of files in the repository in total, and the number of relevant *.txt files is in the ballpark of several hundreds. There are already thousands of commits in the repository.

I tried three different approaches.

Solution 1. This question gives one answer, based on git log. However, if I try to do something like this, it is very slow:

find . -name '*.txt' -print0 |
    xargs -0 -n1 git log --format=format:%ai -n1 --all --

In my test case, it took several minutes – far too slow for my purposes.

Solution 2. Something like this would be much faster, less than one second:

git log --format=format:%ai --name-only .

However, then I would have to write a script that post-processes the output. Moreover, the above command prints out lots of information that is never needed: irrelevant files and old commits.

Solution 3. I also tried something like this, in order to get rid of the irrelevant files:

git log --format=format:%ai --name-only `find . -name '*.txt'`

However, this turned out to be slower than solution 2. (There was a factor 3 difference in the running time.) Moreover, it still prints old commits that are no longer needed.

Question. Am I missing something? Is there a fast and convenient approach? Preferably something that works not only right now but also in future, when we have a much larger number of commits?


asked Feb 23 '12 17:02

Jukka Suomela

1 Answer

Try this.

In git, each commit references a tree object which has pointers to the state of each file (the files being blob objects).

So, what you want to do is write a program that starts with a set of all the files you are interested in and begins at the HEAD commit (its SHA1 is obtained via git rev-parse HEAD). For each commit, it checks whether any of the "files of interest" are modified in that commit's tree (the tree is named by the "tree" line of git cat-file commit [SHA1]); note that you'll have to descend into the subtree for each directory. A file counts as modified if its blob has a different SHA1 hash from the one it has in the parent commit's tree. The program prints the appropriate information for each modified file and removes it from the set of interest. Then it continues to each parent of the current commit, until the set of interest is empty.
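The walk described above can be sketched in POSIX shell. This is only an illustration under a few assumptions, not a tuned implementation: the function name last_commit_dates is made up for this example, paths are given relative to the repository root and contain no newlines or characters that git would quote, and git diff-tree stands in for descending the trees by hand. Merge commits print nothing with these diff-tree options, so a file changed only in a merge would be attributed to an older commit.

```shell
#!/bin/sh
# Sketch: walk history from HEAD and print "<author date><TAB><path>" for
# the newest commit that touched each path of interest.
# last_commit_dates is a hypothetical helper name, not a git command.
last_commit_dates() {
    # Set of interest, one path per line.
    pending=$(printf '%s\n' "$@")
    git rev-list HEAD | while read -r commit; do
        # Stop as soon as every path has been resolved.
        [ -n "$pending" ] || break
        # Paths this commit changed relative to its parent; --root makes
        # the initial commit list its files too.
        changed=$(git diff-tree -r --root --no-commit-id --name-only "$commit")
        [ -n "$changed" ] || continue
        # Intersect the changed paths with the set of interest.
        hits=$(printf '%s\n' "$pending" | grep -Fx -- "$changed")
        [ -n "$hits" ] || continue
        when=$(git show -s --format=%ai "$commit")
        printf '%s\n' "$hits" | while IFS= read -r path; do
            printf '%s\t%s\n' "$when" "$path"
        done
        # Remove the resolved paths from the set of interest.
        pending=$(printf '%s\n' "$pending" | grep -Fxv -- "$hits")
    done
}
```

It could be called from the repository root as last_commit_dates docs/a.txt docs/b.txt (the file names are placeholders). The loop spawns two git processes per visited commit, so it mainly pays off when the files of interest were all touched recently and the walk stops early.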

If you want the maximal speed, you'll use the git C API. If you don't want that much speed, you can use git cat-file tree [SHA1 hash] (or, easier, git ls-tree [SHA1 hash] [files]), which is going to perform the absolute minimal amount of work to read a particular tree object (it's part of the plumbing layer).
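As a rough illustration of the git ls-tree route, the check "did this file's blob change in this commit?" might look like the sketch below. The helper names blob_hash and changed_in are invented for this example, and the comparison is made against the first parent only, which is a simplifying assumption for merge commits.

```shell
#!/bin/sh
# Sketch: compare a file's blob hash in a commit with its hash in the
# commit's first parent. blob_hash and changed_in are hypothetical names.

# Print the object hash recorded for PATH in the tree of REV.
# git ls-tree prints "<mode> <type> <hash><TAB><path>".
blob_hash() {
    git ls-tree "$1" -- "$2" | awk '{ print $3 }'
}

# Succeed if PATH differs between COMMIT and its first parent. A file that
# is missing on one side yields an empty hash and so counts as changed;
# the same applies to a root commit, where "COMMIT^" does not resolve.
changed_in() {
    new=$(blob_hash "$1" "$2")
    old=$(blob_hash "$1^" "$2" 2>/dev/null)
    [ "$new" != "$old" ]
}
```

For example, changed_in HEAD docs/a.txt returns success when the hypothetical file docs/a.txt was added, removed, or modified by the HEAD commit.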

It's questionable how well this will continue to work in the future, but if forward compatibility is a bigger concern you can move up a level from git cat-file. As you already discovered, though, git log is comparatively slow, as it's part of the porcelain, not the plumbing.

See here for a pretty good resource on how git's object model works.

answered Oct 21 '22 11:10

Borealid
