Fast, line-wise "grep -n" equivalent for Unix directory structure

I am trying to create a web interface for searching through a large number of huge configuration files (approximately 60,000 files, each between 20 KB and 50 MB in size). Those files are also updated frequently (~3 times/day).

Requirements:

  • Concurrency
  • Must identify the line numbers for each matching line
  • Good update performance

What I have looked into:

  • Lucene: To identify a line number, each line must be stored in a separate Lucene document, each containing two fields (the line number and the line). This makes updates hard/slow.
  • Solr and Sphinx: Solr is built on Lucene and shares the same problem; neither offers a way to identify the line number.
  • SQL table with a fulltext index: Again, no way to show the line number.
  • SQL table with each line in a separate row: Tested this with both SQLite and MySQL; update performance was the worst of all options. Updating a 50 MB document took more than an hour.
  • eXist-db: We converted each text file to XML like this: <xml><line number="1">test</line>...</xml>. Updates take ~5 minutes, which somewhat works but we are still not happy with it.
  • Whoosh for Python: Pretty much like Lucene. I have implemented a prototype that sort of works by dropping and re-importing all lines of a given file; updating a 50 MB document takes about 2-3 minutes this way (a stripped-down sketch follows this list).
  • GNU id utils: Suggested by sarnold, this is blazingly fast (a 50 MB document is updated in less than 10 seconds on my test machine) and would be perfect if it had pagination and an API.
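
For reference, the per-line Whoosh index from the prototype mentioned above boils down to roughly the following. This is a rough sketch rather than the actual code; the field names and index directory are illustrative.

import os
from whoosh import index
from whoosh.fields import Schema, ID, NUMERIC, TEXT
from whoosh.qparser import QueryParser

# One document per line: the file path, the line number and the line text.
schema = Schema(path=ID(stored=True),
                lineno=NUMERIC(stored=True),
                line=TEXT(stored=True))

def open_or_create(dirname="lineidx"):
    if not os.path.exists(dirname):
        os.mkdir(dirname)
        return index.create_in(dirname, schema)
    return index.open_dir(dirname)

def reindex_file(ix, path):
    # "Dropping/re-importing": delete every line of this file, then re-add them all.
    writer = ix.writer()
    writer.delete_by_term("path", path)
    with open(path, errors="replace") as f:
        for n, text in enumerate(f, start=1):
            writer.add_document(path=path, lineno=n, line=text.rstrip("\n"))
    writer.commit()

def search(ix, querystring, limit=20):
    # Prints grep -n style "file:line: text" output.
    with ix.searcher() as searcher:
        query = QueryParser("line", ix.schema).parse(querystring)
        for hit in searcher.search(query, limit=limit):
            print("%s:%d: %s" % (hit["path"], hit["lineno"], hit["line"]))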

How would you implement an alternative?

asked Nov 14 '11 by knipknap

2 Answers

You might wish to investigate the GNU idutils toolkit. On a local copy of the Linux kernel sources, it can give output like this:

$ gid ugly
include/linux/hil_mlc.h:66:  * a positive return value causes the "ugly" branch to be taken.
include/linux/hil_mlc.h:101:    int         ugly;   /* Node to jump to on timeout       */

Rebuilding the index from a cold cache is reasonably quick:

$ time mkid

real    1m33.022s
user    0m17.360s
sys     0m2.730s

Rebuilding the index from a warm cache is much faster:

$ time mkid

real    0m15.692s
user    0m15.070s
sys     0m0.520s

The index only takes 46 megabytes for my 2.1 gigs of data -- a dataset that is tiny in comparison to yours, but the ratio feels good.

Finding 399 occurrences of foo took only 0.039 seconds:

$ time gid foo > /dev/null

real    0m0.038s
user    0m0.030s
sys     0m0.000s

Update

Larsmans was curious about the performance of git grep on the kernel sources -- which is an excellent way to show how much performance gain gid(1) provides.

On a cold cache, git grep foo (which returned 1656 entries, far more than idutils) took:

$ time git grep foo > /dev/null

real    0m19.231s
user    0m1.480s
sys     0m0.680s

Once the cache was warm, git grep foo ran much faster:

$ time git grep foo > /dev/null

real    0m0.264s
user    0m1.320s
sys     0m0.330s

Because my dataset fits entirely in RAM once the cache is warm, git grep is pretty amazing: it's only seven times slower than the gid(1) utility and would certainly be more than fast enough for interactive use. If the dataset in question cannot be entirely cached (which is probably where things actually get interesting), then the performance benefit of the index is unmistakable.

The two complaints about idutils:

  1. No pagination. This is definitely a downside, though in my experience it runs quickly enough to simply store the results of the search elsewhere. If the search is going to return an appreciable percentage of the original dataset, then storage of partial results is definitely going to be annoying.

  2. No API: true enough, there's no API. But the source is available; the report_grep() function in src/lid.c takes a linked list of files that match the output. A little fiddling with this function could even provide pagination. (It would take some doing.) At the end of the day you'd have a C API, which might still not be ideal, but customizing it doesn't look awful. A simpler stopgap is to wrap the command-line tool from a script and paginate its output yourself, as sketched below.
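
A rough Python sketch of such a wrapper, assuming gid's default "file:lineno: text" output shown earlier; the function and parameter names are made up:

import subprocess

def gid_search(pattern, page=1, per_page=50, cwd=None):
    # Run gid, keep the full result set, and slice it for pagination.
    out = subprocess.run(["gid", pattern], cwd=cwd,
                         capture_output=True, text=True).stdout
    hits = []
    for line in out.splitlines():
        path, _, rest = line.partition(":")
        lineno, _, text = rest.partition(":")
        if lineno.isdigit():
            hits.append((path, int(lineno), text.strip()))
    start = (page - 1) * per_page
    return hits[start:start + per_page], len(hits)

# e.g. second page of matches for "foo":
# page_two, total = gid_search("foo", page=2)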

However, the worst weakness is probably the lack of incremental database updates. If all files are updated three times per day, this is not a big deal. If only some files are updated three times a day, the full rebuild does needless work. And if only a handful of files are updated three times a day, there must be a better solution.
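
Whatever indexer ends up behind it, that better solution needs to know which files actually changed between runs. A rough, indexer-agnostic sketch of the bookkeeping (file and variable names are illustrative):

import json, os

def changed_files(root, statefile="mtimes.json"):
    # Compare current mtimes with the ones recorded last run and
    # return only the files that need re-indexing.
    try:
        with open(statefile) as f:
            old = json.load(f)
    except FileNotFoundError:
        old = {}
    current, todo = {}, []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            current[path] = mtime
            if old.get(path) != mtime:
                todo.append(path)
    with open(statefile, "w") as f:
        json.dump(current, f)
    return todo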

answered by sarnold

In case anyone needs it, I created Whooshstore, which is essentially a Whoosh-based, pure Python clone of GNU id utils that provides incremental updates, pagination and a Python API.

The command line client works like this:

ws-update -b --index my.idx datadir  # build the index
ws-update -b --append --index my.idx datadir  # incremental update
ws --index my.idx hello world     # query the index

(-b is for batch updating, which is faster but requires more memory. For the full CLI syntax use --help.)

It does not come close to the speed of GNU id utils, but with several incremental batch (in-memory) updates to the index it is fast enough for us.
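
For the roughly three-times-a-day update schedule mentioned in the question, the incremental update can simply be scheduled. For example, a hypothetical crontab entry (the paths are illustrative):

# rebuild the index incrementally three times a day
0 6,14,22 * * *  ws-update -b --append --index /srv/search/my.idx /srv/configs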

answered by knipknap