
grep but indexable?

Tags: linux, grep

I have over 200 MB of source code files that I constantly have to search (I am part of a very big team). I notice that grep does not create an index, so every lookup has to go through the entire source tree.

Is there a command line utility similar to grep which has indexing ability?

disappearedng asked Oct 12 '11 02:10


1 Answer

The solutions below are rather simple, and there are several corner cases they do not cover:

  • searching for the start of a line with ^ (every index line starts with the filename, not with the file content)
  • filenames containing \n or : will fail
  • filenames containing whitespace will fail in the lookup steps that pipe names to xargs (though that can be fixed by using GNU Parallel instead of xargs)
  • searching for a string that also matches the path of other files will be suboptimal

The good part about the solutions is that they are very easy to implement.

Solution 1: one big file

Fact: Seeking is dead slow; reading one big file sequentially is often faster.

Given that, the idea is simply to build an index containing all the files with all their content, with each line prefixed by the filename and the line number:

Index a dir:

find . -type f ! -name .index -print0 | xargs -0 grep -Han . > .index

Use the index:

grep foo .index
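
As a rough illustration, the sketch below builds the index for a throwaway tree and queries it. The directory layout, file names and contents are invented for the demo, and the sketch excludes the index file itself from the scan so it cannot index its own output. It also shows one possible workaround for the ^ corner case: since every index line has the form path:line:content, an anchored search has to skip the two prefix fields first. This assumes no file name contains a colon.

```shell
#!/bin/sh
# Throwaway tree; names and contents are made up for this demo.
demo=$(mktemp -d)
mkdir -p "$demo/src"
printf 'int foo(void);\nstatic int foo_count;\n' > "$demo/src/a.h"
printf 'foo_count = 0;\n    foo();\n' > "$demo/src/a.c"
cd "$demo" || exit 1

# Build the index, skipping the index file itself.
find . -type f ! -name .index -print0 | xargs -0 grep -Han . > .index

# Plain search: every index line is path:line:content.
all_hits=$(grep foo .index | sort)

# Anchored search: "^foo" in the original file becomes
# "foo directly after the path: and line: prefixes" in the index.
anchored_hits=$(grep '^[^:]*:[0-9]*:foo' .index)

printf 'all:\n%s\nanchored:\n%s\n' "$all_hits" "$anchored_hits"
```

The plain search returns all four indexed lines here, while the anchored search returns only the line whose content really begins with foo.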

Solution 2: one big compressed file

Fact: Hard drives are slow. Seeking is dead slow. Multi-core CPUs are normal.

So it may be faster to read a compressed file and decompress it on the fly than to read the uncompressed file, especially if you have enough RAM to cache the compressed file but not enough for the uncompressed one.

Index a dir:

find . -type f ! -name .index -print0 | xargs -0 grep -Han . | pbzip2 > .index

Use the index:

pbzcat .index | grep foo
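
If pbzip2 is not installed everywhere, the same idea can be tried with whatever compressor is at hand. The sketch below is only an assumption about how one might do that: it picks pbzip2, then bzip2, then gzip, relying on all three compressing stdin to stdout and decompressing with -dc. The variable name comp and the demo files are made up.

```shell
#!/bin/sh
# Choose a compressor: pbzip2 as in the answer, otherwise bzip2 or gzip.
if command -v pbzip2 >/dev/null 2>&1; then
    comp=pbzip2
elif command -v bzip2 >/dev/null 2>&1; then
    comp=bzip2
else
    comp=gzip
fi

# Throwaway tree with invented contents.
demo=$(mktemp -d)
printf 'alpha foo\nbeta\n' > "$demo/one.txt"
printf 'gamma\nfoo delta\n' > "$demo/two.txt"
cd "$demo" || exit 1

# Index a dir: same pipeline as above, compressed on the way out.
find . -type f ! -name .index -print0 | xargs -0 grep -Han . | "$comp" > .index

# Use the index: decompress to stdout and grep the stream.
matches=$("$comp" -dc .index | grep foo | sort)
printf '%s\n' "$matches"
```

Whether the compressed variant is actually faster depends on the disk, the CPU and how much of the index fits in the page cache, so it is worth timing both on the real tree.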

Solution 3: use index for finding potential candidates

Generating the index can be time-consuming, and you might not want to do that for every single change in the dir.

To speed that up, use the index only to identify filenames that might match, then do an actual grep through those (hopefully few) files. This weeds out files that no longer match, but it will not find new files that do match.

The sort -u is needed to avoid grepping the same file multiple times.

Index a dir:

find . -type f ! -name .index -print0 | xargs -0 grep -Han . | pbzip2 > .index

Use the index:

pbzcat .index | grep foo | sed 's/:.*//' | sort -u | xargs grep foo
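
A sketch of the candidate lookup on a throwaway tree follows. Instead of GNU Parallel it swaps in a different fix for the whitespace problem: the candidate names are converted to NUL-separated form with tr and read with xargs -0. Names containing \n or : still break. The index is left uncompressed to keep the demo short, and the file names are invented.

```shell
#!/bin/sh
demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'foo here\n' > 'my file.txt'
printf 'foo there\n' > stale.txt

# Index a dir (uncompressed for brevity).
find . -type f ! -name .index -print0 | xargs -0 grep -Han . > .index

# After indexing, stale.txt changes and no longer contains foo.
printf 'nothing\n' > stale.txt

# Candidates come from the index; the real grep runs on the live files.
# tr turns newline-separated names into NUL-separated ones for xargs -0.
live=$(grep foo .index | sed 's/:.*//' | sort -u |
       tr '\n' '\0' | xargs -0 grep -Hn foo)
printf '%s\n' "$live"
```

The stale entry for stale.txt is still in the index, but the second grep drops it, and the name with a space survives because xargs no longer splits on whitespace.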

Solution 4: append to the index

Re-creating the full index can be very slow. If most of the dir stays the same, you can simply append the newly changed files to the index. The index is again only used for locating potential candidates, so a file that no longer matches will be discovered when grepping through the actual file.

Index a dir:

find . -type f ! -name .index -print0 | xargs -0 grep -Han . | pbzip2 > .index

Append to the index:

find . -type f -newer .index ! -name .index -print0 | xargs -0 grep -Han . | pbzip2 >> .index

Use the index:

pbzcat .index | grep foo | sed 's/:.*//' | sort -u | xargs grep foo

It can be even faster if you use pzstd instead of pbzip2/pbzcat.
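
The append workflow can be tried end to end with a sketch like the one below. It relies on concatenated bzip2 (or gzip) streams decompressing as one stream, which is why >> works on a compressed index. The compressor fallback, the comp variable, the sleep and the file names are assumptions made for the demo; the sleep is only there so the second file's modification time is strictly newer than the index.

```shell
#!/bin/sh
# Use pbzip2 if present, else bzip2 or gzip (assumed fallbacks).
comp=$(command -v pbzip2 || command -v bzip2 || command -v gzip)

demo=$(mktemp -d)
cd "$demo" || exit 1
printf 'old foo\n' > old.txt

# Index a dir.
find . -type f ! -name .index -print0 | xargs -0 grep -Han . | "$comp" > .index

sleep 1                       # make the next file strictly newer than .index
printf 'new foo\n' > new.txt

# Append to the index: only files modified after .index was written.
find . -type f -newer .index ! -name .index -print0 |
    xargs -0 grep -Han . | "$comp" >> .index

# Use the index: candidates from the index, real grep on the live files.
found=$("$comp" -dc .index | grep foo | sed 's/:.*//' | sort -u |
        tr '\n' '\0' | xargs -0 grep -Hn foo | sort)
printf '%s\n' "$found"
```

A file that changes several times ends up in the index several times, which costs space but not correctness, since sort -u collapses the duplicate names before the final grep.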

Solution 5: use git

git grep can search a git repository. But it seems to do a lot of seeks, and on my system it is 4 times slower than solution 4.

The good part is that the .git index is smaller than the compressed .index.

Index a dir:

git init
git add .

Append to the index:

git add .

Use the index:

git grep foo
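
A small sketch of this on a scratch repository, assuming git is installed. It also shows why the git add step matters: by default git grep only looks at tracked files, so a file that was never added is not searched. The file names are invented.

```shell
#!/bin/sh
demo=$(mktemp -d)
cd "$demo" || exit 1

# Index a dir.
git init -q
printf 'tracked foo\n' > tracked.txt
git add .

# This file is created after "git add ." and is therefore untracked.
printf 'untracked foo\n' > untracked.txt

# Use the index: only tracked.txt is searched.
tracked_hits=$(git grep -n foo)
printf '%s\n' "$tracked_hits"

# Append to the index, then search again.
git add .
hits_after_add=$(git grep -n foo | sort)
printf '%s\n' "$hits_after_add"
```

No commit is needed for this to work; staging the files is enough for git grep to see them.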

Solution 6: optimize git

Git puts its data into many small files, which results in seeking. But you can ask git to pack the small files into a few bigger files:

git gc --aggressive

This takes a while, but it packs the index very efficiently into a few files.

Now you can do:

find .git -type f -print0 | xargs -0 cat >/dev/null
git grep foo

git will do a lot of seeking into the index, but by running cat first, you put the whole index into RAM.

Adding to the index works the same as in solution 5, but run git gc now and then to avoid accumulating many small files, and run git gc --aggressive when the system is idle to save more disk space.
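
To check whether a gc run actually reduced the number of small files, git count-objects -v can be compared before and after. The sketch below does that on a scratch repository with invented files. In its output, count is the number of loose objects and in-pack the number of packed ones; the exact numbers depend on the git version and the state of the repository, so none are asserted here.

```shell
#!/bin/sh
demo=$(mktemp -d)
cd "$demo" || exit 1
git init -q
for i in 1 2 3 4 5; do
    printf 'file %s mentions foo\n' "$i" > "f$i.txt"
done
git add .

# Object statistics before and after packing.
before=$(git count-objects -v)
git gc --aggressive -q
after=$(git count-objects -v)
printf 'before:\n%s\nafter:\n%s\n' "$before" "$after"

# Read everything under .git once, then search.
find .git -type f -print0 | xargs -0 cat > /dev/null
gc_hits=$(git grep -c foo | sort)
printf '%s\n' "$gc_hits"
```

On a tree of real size, timing git grep before and after the gc and the cat is the only reliable way to tell whether the extra steps pay off on a given machine.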

git will not free disk space if you remove files. So if you remove large amounts of data, remove .git and do git init; git add . again.

Ole Tange answered Oct 08 '22 13:10