Fast, line-wise "grep -n" equivalent for Unix directory structure

I am trying to create a web interface for searching through a large number of huge configuration files (approximately 60,000 files, each between 20 KB and 50 MB in size). Those files are also updated frequently (~3 times/day).

Requirements:

  • Concurrency
  • Must identify the line numbers for each matching line
  • Good update performance

What I have looked into:

  • Lucene: To identify a line number, each line must be stored in a separate Lucene document, each containing two fields (the line number and the line). This makes updates hard/slow.
  • Solr and Sphinx: Solr is built on Lucene and shares the same problem; neither offers a way to identify the line number.
  • SQL table with a fulltext index: Again, no way to show the line number.
  • SQL table with each line in a separate row: Tested this with both SQLite and MySQL; update performance was the worst of all options. Updating a 50 MB document took more than an hour.
  • eXist-db: We converted each text file to XML like this: <xml><line number="1">test</line>...</xml>. Updates take ~5 minutes, which somewhat works but we are still not happy with it.
  • Whoosh for Python: Pretty much like Lucene. I have implemented a prototype that sort of works by dropping and re-importing all lines of a given file; updating a 50 MB document takes about 2-3 minutes this way (a stripped-down sketch follows this list).
  • GNU id utils: Suggested by sarnold, this is blazingly fast (a 50 MB document is updated in less than 10 seconds on my test machine) and would be perfect if it had pagination and an API.
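
For reference, the per-line Whoosh index from the prototype mentioned above boils down to roughly the following. This is a rough sketch rather than the actual code; the field names and index directory are illustrative.

import os
from whoosh import index
from whoosh.fields import Schema, ID, NUMERIC, TEXT
from whoosh.qparser import QueryParser

# One document per line: the file path, the line number and the line text.
schema = Schema(path=ID(stored=True),
                lineno=NUMERIC(stored=True),
                line=TEXT(stored=True))

def open_or_create(dirname="lineidx"):
    if not os.path.exists(dirname):
        os.mkdir(dirname)
        return index.create_in(dirname, schema)
    return index.open_dir(dirname)

def reindex_file(ix, path):
    # "Dropping/re-importing": delete every line of this file, then re-add them all.
    writer = ix.writer()
    writer.delete_by_term("path", path)
    with open(path, errors="replace") as f:
        for n, text in enumerate(f, start=1):
            writer.add_document(path=path, lineno=n, line=text.rstrip("\n"))
    writer.commit()

def search(ix, querystring, limit=20):
    # Prints grep -n style "file:line: text" output.
    with ix.searcher() as searcher:
        query = QueryParser("line", ix.schema).parse(querystring)
        for hit in searcher.search(query, limit=limit):
            print("%s:%d: %s" % (hit["path"], hit["lineno"], hit["line"]))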

How would you implement an alternative?

asked Nov 14 '11 by knipknap

2 Answers

You might wish to investigate the GNU idutils toolkit. On a local copy of the Linux kernel sources, it can give output like this:

$ gid ugly
include/linux/hil_mlc.h:66:  * a positive return value causes the "ugly" branch to be taken.
include/linux/hil_mlc.h:101:    int         ugly;   /* Node to jump to on timeout       */

Rebuilding the index from a cold cache is reasonably quick:

$ time mkid

real    1m33.022s
user    0m17.360s
sys     0m2.730s

Rebuilding the index from a warm cache is much faster:

$ time mkid

real    0m15.692s
user    0m15.070s
sys     0m0.520s

The index only takes 46 megabytes for my 2.1 gigs of data -- a dataset that is tiny in comparison to yours, but the ratio feels good.

Finding 399 occurrences of foo took only 0.039 seconds:

$ time gid foo > /dev/null

real    0m0.038s
user    0m0.030s
sys     0m0.000s

Update

Larsmans was curious about the performance of git grep on the kernel sources -- which is an excellent way to show how much performance gain gid(1) provides.

On a cold cache, git grep foo (which returned 1656 entries, far more than idutils) took:

$ time git grep foo > /dev/null

real    0m19.231s
user    0m1.480s
sys     0m0.680s

Once the cache was warm, git grep foo ran much faster:

$ time git grep foo > /dev/null

real    0m0.264s
user    0m1.320s
sys     0m0.330s

Because my dataset fits entirely in RAM once the cache is warm, git grep is pretty amazing: it's only seven times slower than the gid(1) utility and would certainly be more than fast enough for interactive use. If the dataset in question cannot be entirely cached (which is probably where things actually get interesting), then the performance benefit of the index is unmistakable.

The two complaints about idutils:

  1. No pagination. This is definitely a downside, though in my experience it runs quickly enough to simply store the results of the search elsewhere. If the search is going to return an appreciable percentage of the original dataset, then storage of partial results is definitely going to be annoying.

  2. No API: true enough, there's no API. But the source is available; the report_grep() function in src/lid.c takes a linked list of files that match the output. A little fiddling with this function could even provide pagination. (It would take some doing.) At the end of the day you'd have a C API, which might still not be ideal, but customizing it doesn't look awful. A simpler stopgap is to wrap the command-line tool from a script and paginate its output yourself, as sketched below.
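
A rough Python sketch of such a wrapper, assuming gid's default "file:lineno: text" output shown earlier; the function and parameter names are made up:

import subprocess

def gid_search(pattern, page=1, per_page=50, cwd=None):
    # Run gid, keep the full result set, and slice it for pagination.
    out = subprocess.run(["gid", pattern], cwd=cwd,
                         capture_output=True, text=True).stdout
    hits = []
    for line in out.splitlines():
        path, _, rest = line.partition(":")
        lineno, _, text = rest.partition(":")
        if lineno.isdigit():
            hits.append((path, int(lineno), text.strip()))
    start = (page - 1) * per_page
    return hits[start:start + per_page], len(hits)

# e.g. second page of matches for "foo":
# page_two, total = gid_search("foo", page=2)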

However, the worst weakness is probably the lack of incremental database updates. If all files are updated three times per day, this is not a big deal. If only some files are updated three times a day, the full rebuild does needless work. And if only a handful of files are updated three times a day, there must be a better solution.
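
Whatever indexer ends up behind it, that better solution needs to know which files actually changed between runs. A rough, indexer-agnostic sketch of the bookkeeping (file and variable names are illustrative):

import json, os

def changed_files(root, statefile="mtimes.json"):
    # Compare current mtimes with the ones recorded last run and
    # return only the files that need re-indexing.
    try:
        with open(statefile) as f:
            old = json.load(f)
    except FileNotFoundError:
        old = {}
    current, todo = {}, []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            current[path] = mtime
            if old.get(path) != mtime:
                todo.append(path)
    with open(statefile, "w") as f:
        json.dump(current, f)
    return todo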

answered by sarnold

In case anyone needs it, I created Whooshstore, which is essentially a Whoosh-based, pure Python clone of GNU id utils that provides incremental updates, pagination and a Python API.

The command line client works like this:

ws-update -b --index my.idx datadir  # build the index
ws-update -b --append --index my.idx datadir  # incremental update
ws --index my.idx hello world     # query the index

(-b is for batch updating, which is faster but requires more memory. For the full CLI syntax use --help.)

It does not come close to the speed of GNU id utils, but with several incremental batch (in-memory) updates to the index it is fast enough for us.
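
For the roughly three-times-a-day update schedule mentioned in the question, the incremental update can simply be scheduled. For example, a hypothetical crontab entry (the paths are illustrative):

# rebuild the index incrementally three times a day
0 6,14,22 * * *  ws-update -b --append --index /srv/search/my.idx /srv/configs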

answered by knipknap