I am trying to create a web interface for searching through a large number of huge configuration files (approximately 60,000 files, each between 20 KB and 50 MB in size). Those files are also updated frequently (~3 times/day).
Requirements:
What I have looked into:
<xml><line number="1">test</line>...</xml>
Updates take ~5 minutes, which somewhat works, but we are still not happy with it. How would you implement an alternative?
To search all files in the current directory, use an asterisk instead of a filename at the end of the grep command. The output shows the name of each file containing a match and returns the entire matching line.
Matching a list of expressions: if you have a separate file of text patterns, the -f option lets you specify that file. grep will treat each line in that file as a pattern to match against the target files.
Use the Unix find command to narrow down which files to search; a sketch of all three approaches follows.
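Here is a minimal illustration of those three approaches (assuming the configuration files live under /data/configs and the search patterns are listed one per line in patterns.txt; both names are just placeholders):
$ grep -n "search_term" *                                  # every file in the current directory
$ grep -f patterns.txt /data/configs/*.conf                # match any pattern listed in patterns.txt
$ find /data/configs -name '*.conf' -exec grep -l "search_term" {} +    # only list the files that match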
You might wish to investigate the GNU idutils toolkit. On a local copy of the Linux kernel sources, it can give output like this:
$ gid ugly
include/linux/hil_mlc.h:66: * a positive return value causes the "ugly" branch to be taken.
include/linux/hil_mlc.h:101: int ugly; /* Node to jump to on timeout */
Rebuilding the index from a cold cache is reasonably quick:
$ time mkid
real 1m33.022s
user 0m17.360s
sys 0m2.730s
Rebuilding the index from a warm cache is much faster:
$ time mkid
real 0m15.692s
user 0m15.070s
sys 0m0.520s
The index only takes 46 megabytes for my 2.1 gigs of data -- which is tiny in comparison to yours, but the ratio feels good.
Finding 399 occurrences of foo took only 0.039 seconds:
$ time gid foo > /dev/null
real 0m0.038s
user 0m0.030s
sys 0m0.000s
Update
Larsmans was curious about the performance of git grep on the kernel sources -- which is an excellent way to show how much performance gain gid(1) provides.
On a cold cache, git grep foo (which returned 1656 entries, far more than idutils):
$ time git grep foo > /dev/null
real 0m19.231s
user 0m1.480s
sys 0m0.680s
Once the cache was warm, git grep foo ran much faster:
$ time git grep foo > /dev/null
real 0m0.264s
user 0m1.320s
sys 0m0.330s
Because my dataset fits entirely in RAM once the cache is warm, git grep is pretty amazing: it's only seven times slower than the gid(1) utility, and it would certainly be more than fast enough for interactive use. If the dataset in question cannot be entirely cached (which is probably where things actually get interesting), then the performance benefit of the index is unmistakable.
The two complaints about idutils:
No pagination. This is definitely a downside, though in my experience it runs quickly enough to simply store the results of the search elsewhere (a simple workaround is sketched after these two points). If the search is going to return an appreciable percentage of the original dataset, then storing partial results is definitely going to be annoying.
No API: true enough, there's no API. But the source is available; the function report_grep() in src/lid.c takes a linked list of files that match the output. A little fiddling with this function should even offer pagination. (It would take some doing.) At the end of the day, you'd have a C API, which might still not be ideal, but customizing it doesn't look awful.
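For the pagination complaint, the simplest workaround along those lines is to capture the full result set once and page through it separately (the file name here is just an example):
$ gid foo > /tmp/foo-results.txt
$ less /tmp/foo-results.txt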
However, the weakness that is probably worst is the lack of an incremental database update. If all files are updated three times per day, this is not a big deal. If some files are updated three times a day, it is doing needless work. If a handful of files are updated three times a day, there must be a better solution.
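As a rough way to see how much needless work a full rebuild does, one can count the files that changed since the last mkid run; this assumes the index file ID sits at the top of the tree, which is where mkid writes it by default:
$ find . -type f -newer ID | wc -l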
In case anyone needs it, I created Whooshstore, which is essentially a Whoosh-based, pure Python clone of GNU id utils that provides incremental updates, pagination and a Python API.
The command line client works like this:
ws-update -b --index my.idx datadir # build the index
ws-update -b --append --index my.idx datadir # incremental update
ws --index my.idx hello world # query the index
(-b is for batch updating, which is faster but requires more memory. For the full CLI syntax use --help.)
It does not come close to the speed of GNU id utils, but by updating the index using several incremental batch (in-memory) updates it's fast enough for us.
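For the thrice-daily updates, one straightforward arrangement is a cron job that runs the incremental update on a schedule (the index path and data directory below are placeholders):
# crontab entry: incremental batch update every eight hours
0 */8 * * * ws-update -b --append --index /var/lib/search/my.idx /data/configs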