
Best practices for combining Lucene.NET and a relational database?

Tags: lucene.net

I'm working on a project that will hold a LOT of data. Most searches take forms that are expressed very efficiently as SQL queries, but the data also needs to be searchable via natural language processing.

My plan is to build an index using Lucene for this form of search.

My concern is that when I run a search, Lucene returns the IDs of the matching documents in the index, and I then have to look up those entities in the relational database.

This could be done in two ways (that I can think of so far):

  • N separate queries, one per ID (horrible)
  • Pass all the IDs to a stored procedure at once (perhaps as a comma-delimited parameter). The downsides are the maximum parameter size and the slow performance of a UDF that splits the string into a temporary table.
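
A possible middle ground between the two options above is a parameterized `IN (...)` query sent in chunks, with the rows re-sorted into the index's ranking order afterwards. The following is only a minimal sketch in Python, with SQLite standing in for the relational database. The `documents` table, its columns, and the `fetch_by_ids` helper are hypothetical names, and the chunk size of 500 is an assumption meant to stay under typical bind-parameter limits:

```python
import sqlite3

def fetch_by_ids(conn, ids, chunk_size=500):
    """Fetch rows for the given IDs in a few batched queries, keeping hit order."""
    rows_by_id = {}
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # One "?" per ID, so values are bound as parameters, not spliced into SQL.
        placeholders = ",".join("?" * len(chunk))
        sql = f"SELECT id, title FROM documents WHERE id IN ({placeholders})"
        for row_id, title in conn.execute(sql, chunk):
            rows_by_id[row_id] = title
    # Re-apply the search ranking; skip IDs that are no longer in the database.
    return [(i, rows_by_id[i]) for i in ids if i in rows_by_id]

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(1, "alpha"), (2, "beta"), (3, "gamma")])

hits = [3, 1, 99]  # IDs as a search index might return them, best match first
results = fetch_by_ids(conn, hits)
print(results)  # [(3, 'gamma'), (1, 'alpha')]
```

The number of round trips grows with the number of chunks rather than the number of hits, and no string splitting is needed on the database side. Whether this beats a stored procedure depends on the database and driver in use.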

I'm almost tempted to mirror everything into Lucene's index, so that I can periodically regenerate the index from the backing store but only need to access the index for the frontend.

Advice?

FlySwat asked Jun 13 '09 14:06



3 Answers

I would store the 'frontend' data inside the index itself, avoiding any DB interaction when rendering results. The DB would be queried only when you want more information on a specific record.
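
To illustrate the idea, here is a toy sketch in Python, not Lucene code: an in-memory inverted index that keeps the display fields next to each document ID, so a result list can be rendered without touching the database. `TinyIndex` and its field names are hypothetical. In Lucene, the corresponding mechanism would be stored fields.

```python
from collections import defaultdict

class TinyIndex:
    """Toy inverted index that keeps display fields next to each document ID."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document IDs
        self.stored = {}                  # document ID -> fields shown in results

    def add(self, doc_id, title, body):
        # Index every whitespace-separated term of the title and body.
        for term in (title + " " + body).lower().split():
            self.postings[term].add(doc_id)
        # Store only what the result list displays, not the full body.
        self.stored[doc_id] = {"title": title, "blurb": body[:200]}

    def search(self, term):
        # Return stored fields for each match; no database lookup is involved.
        matches = sorted(self.postings.get(term.lower(), ()))
        return [dict(id=doc_id, **self.stored[doc_id]) for doc_id in matches]

index = TinyIndex()
index.add(1, "Lucene basics", "Indexing text with stored fields")
index.add(2, "SQL joins", "Relational queries and joins")

hits = index.search("stored")
print(hits)
```

The full record would still live in the database and be fetched by ID only when a user opens a specific result.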

Luca Matteis answered Nov 15 '22 13:11



When I encountered this problem, I went with a relational database that has full-text search capabilities (I used PostgreSQL 8.3, which has built-in full-text support, including stemming and thesaurus support). This way the database can be queried with both SQL and full-text commands. The downside is that you need a DB with full-text search capabilities, and these capabilities might be inferior to what Lucene can do.

SztupY answered Nov 15 '22 11:11



I guess the answer depends on what you are going to do with the results. If you are going to display the results in a grid and let the user choose the exact document they want to access, then you may want to add enough text to the index to help the user identify the document, such as a blurb of, say, 200 characters. Once the user selects a document, hit the DB to retrieve the whole thing.

This will certainly increase the size of your index, so keep that in mind as another consideration. I would also put a cache between the DB and the front end so that the most-used items do not incur the full cost of a DB access every time.
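
As a rough sketch of the caching idea, assuming a Python front end and an in-process LRU cache (a shared cache server would be another option), with SQLite standing in for the real database. The `documents` table and the `load_document` function are hypothetical names:

```python
import sqlite3
from functools import lru_cache

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO documents VALUES (1, 'full text of document one')")

@lru_cache(maxsize=1024)  # keep up to 1024 recently used documents in memory
def load_document(doc_id):
    """Load a full document by ID; repeated calls are served from the cache."""
    row = conn.execute(
        "SELECT body FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None

first = load_document(1)   # cache miss: reads from the database
second = load_document(1)  # cache hit: no database access
info = load_document.cache_info()
print(info.hits, info.misses)  # 1 1
```

Cached entries go stale when the underlying row changes, so something like `load_document.cache_clear()` or a time-based expiry would be needed after updates.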

Gusa98 answered Nov 15 '22 12:11
