I've used Lucene.net to implement search functionality (for both database content and uploaded documents) on several small websites with no problem. Now I've got a site where I'm indexing 5000+ documents (mainly PDFs) and the querying is becoming a bit slow.
I'm assuming the best way to speed it up would be to implement caching of some kind. Can anyone give me any pointers/examples on where to start? If you've got any other suggestions aside from caching (e.g. should I be using multiple indexes?), I'd like to hear those too.
Edit:
Dumb user error was responsible for the slow querying. I was creating highlights for the entire result set at once, instead of just the 'page' I was displaying. Oops.
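In case anyone else does the same, the fix was roughly this. It's only a sketch against the 2.x-era Hits and contrib Highlighter API (the namespace moved in later Lucene.Net versions), and the "content" field name and paging arguments are placeholders:

using System;
using Lucene.Net.Analysis;
using Lucene.Net.Highlight;
using Lucene.Net.Search;

// Highlight only the page of hits being displayed, not every result.
// The per-fragment analysis pass is the expensive part, so limiting it
// to one page's worth of hits is what made the difference.
public static string[] HighlightPage(Hits hits, Query query, Analyzer analyzer,
                                     int pageIndex, int pageSize)
{
    var highlighter = new Highlighter(new QueryScorer(query));
    int start = pageIndex * pageSize;
    int end = Math.Min(start + pageSize, hits.Length());

    var fragments = new string[Math.Max(0, end - start)];
    for (int i = start; i < end; i++)
    {
        string text = hits.Doc(i).Get("content");
        fragments[i - start] = highlighter.GetBestFragment(analyzer, "content", text);
    }
    return fragments;
}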
I'm going to make a big assumption here and assume you're not hanging onto your index searchers in between calls to query the index.
If that's true, then you should definitely share index searchers for all queries to your index. As the index becomes larger (and it doesn't really have to get very large for this to become a factor), rebuilding the index searcher will become more and more of an overhead. To make this work correctly, you'll need to synchronise access to the query parser class (it isn't thread safe).
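Something along these lines is what I mean. This is a rough sketch against the Lucene.Net 2.9-era API (constructor signatures vary between versions, and the parameterless StandardAnalyzer is deprecated in later releases); the index path and "content" field name are placeholders:

using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class SearchService
{
    private static readonly FSDirectory Dir =
        FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\indexes\site"));

    // Built once and shared by every query, instead of per request
    private static readonly IndexSearcher Searcher = new IndexSearcher(Dir, true);

    private static readonly QueryParser Parser =
        new QueryParser("content", new StandardAnalyzer());

    private static readonly object ParserLock = new object();

    public static TopDocs Search(string queryText, int maxHits)
    {
        Query query;
        lock (ParserLock) // QueryParser isn't thread safe; IndexSearcher is
        {
            query = Parser.Parse(queryText);
        }
        return Searcher.Search(query, maxHits);
    }
}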
BTW, the Java docs are (I've found) just as applicable to the .net version.
For more info on your problem, see here: http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
Lucene uses its own internal "caching" mechanism to make index retrieval a fast operation. I don't think caching is your issue here, though.
A 5000-document index sounds trivial in size, but this largely depends on how you're constructing your index, what you're indexing/storing, how you're querying it (operationally), document size, and so on.
Please fill in the blanks with as much information as you can about your index.
First, Lucene itself supports an in-memory version of directories:
Lucene.Net.Store.RAMDirectory
You can use it like:
RAMDirectory idx = new RAMDirectory();

// Make a writer to create the index
IndexWriter writer = new IndexWriter(idx, new StandardAnalyzer(), true);
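If your index already lives on disk, you can also load it straight into memory at startup; the RAMDirectory(Directory) copy constructor in the 2.x API does this, and the path below is just a placeholder:

// Copy an existing on-disk index into RAM, then search it from there
FSDirectory fsDir = FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\indexes\site"));
RAMDirectory ramDir = new RAMDirectory(fsDir); // one-off copy at startup
IndexSearcher searcher = new IndexSearcher(ramDir, true);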
If this works for you but it is using too much RAM, write a wrapper and expose it as an interface or web service. Or, if you simply want to cache what you are querying so you control when entities drop out of the cache, you can write a wrapper around Lucene that caches the most common results for you, keyed on the query keywords (a rough sketch follows).
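For the second option, the simplest version is just a dictionary keyed on the raw query string. This is a naive sketch of my own for illustration (a real one would need an eviction policy such as LRU, plus invalidation whenever the index is rebuilt):

using System.Collections.Generic;
using Lucene.Net.Search;

// Naive query-result cache around a shared IndexSearcher
public class CachingSearcher
{
    private readonly IndexSearcher _searcher;
    private readonly Dictionary<string, TopDocs> _cache =
        new Dictionary<string, TopDocs>();
    private readonly object _lock = new object();

    public CachingSearcher(IndexSearcher searcher)
    {
        _searcher = searcher;
    }

    public TopDocs Search(Query query, string queryText, int maxHits)
    {
        lock (_lock)
        {
            TopDocs cached;
            if (_cache.TryGetValue(queryText, out cached))
                return cached; // repeat query: skip the index entirely

            TopDocs result = _searcher.Search(query, maxHits);
            _cache[queryText] = result;
            return result;
        }
    }
}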
Of the two, I prefer the former. Create a web service or service project that wraps the Lucene store, using RAMDirectory. That way you can offload the web service onto another server with lots of RAM if the index is huge, and get near-instant results.