I have over 200mb of source code files that I have to constantly look up (I am part of a very big team). I notice that grep does not create an index so lookup requires going through the entire source code database each time.
Is there a command line utility similar to grep which has indexing ability?
The solutions below are rather simple. There are a lot of corner cases that they do not cover:
The good part about the solutions is that they are very easy to implement.
Solution 1: one big file
Fact: Seeking is dead slow, reading one big file is often faster.
Given those facts the idea is to simply make an index containing all the files with all their content - each line prepended with the filename and the line number:
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . > .index
Use the index:
grep foo .index
Solution 2: one big compressed file
Fact: Harddrives are slow. Seeking is dead slow. Multi core CPUs are normal.
So it may be faster to read a compressed file and decompress it on the fly than reading the uncompressed file - especially if you have RAM enough to cache the compressed file but not enough for the uncompressed file.
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . | pbzip2 > .index
Use the index:
pbzcat .index | grep foo
Solution 3: use index for finding potential candidates
Generating the index can be time consuming and you might not want to do that for every single change in the dir.
To speed that up only use the index for identifying filenames that might match and do an actual grep through those (hopefully limited number of) files. This will discover files that no longer match, but it will not discover new files that do match.
The sort -u
is needed to avoid grepping the same file multiple times.
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . | pbzip2 > .index
Use the index:
pbzcat .index | grep foo | sed s/:.*// | sort -u | xargs grep foo
Solution 4: append to the index
Re-creating the full index can be very slow. If most of the dir stays the same, you can simply append to the index with newly changed files. The index will again only be used for locating potential candidates, so if a file no longer matches it will be discovered when grepping through the actual file.
Index a dir:
find . -type f -print0 | xargs -0 grep -Han . | pbzip2 > .index
Append to the index:
find . -type f -newer .index -print0 | xargs -0 grep -Han . | pbzip2 >> .index
Use the index:
pbzcat .index | grep foo | sed s/:.*// | sort -u | xargs grep foo
It can be even faster if you use pzstd
instead of pbzip2
/pbzcat
.
Solution 5: use git
git grep
can grep through a git repository. But it seems to do a lot of seeks and is 4 times slower on my system than solution 4.
The good part is that the .git index is smaller than the .index.bz2.
Index a dir:
git init
git add .
Append to the index:
git add .
Use the index:
git grep foo
Solution 6: optimize git
Git puts its data into many small files. This results in seeking. But you can ask git to compress the small files into few, bigger files:
git gc --aggressive
This takes a while, but it packs the index very efficiently in few files.
Now you can do:
find .git -type f | xargs cat >/dev/null
git grep foo
git
will do a lot of seeking into the index, but by running cat
first, you put the whole index into RAM.
Adding to the index is the same as in solution 5, but run git gc
now and then to avoid many small files, and git gc --aggressive
to save more disk space, when the system is idle.
git
will not free disk space if you remove files. So if you remove large amounts of data, remove .git
and do git init; git add .
again.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With