Lucene and SQL Server - best practice

Tags: lucene

I am pretty new to Lucene, so I would like to get some help from you guys :)

BACKGROUND: Currently I have documents stored in SQL Server and want to use Lucene for full-text/tag searches on those documents in SQL Server.

Q1) In this case, in order to do keyword searches on the documents, should I insert all of those documents into the Lucene index? Does this mean there will be data duplication (one copy in SQL Server and another in the Lucene index)? That could be a problem, since we have a massive amount of documents (about 100GB). Is it inevitable?

Q2) Also, each document has a set of tags (up to 3). Is Lucene also a good choice for the tag search? If so, how do I do it?

Thanks,


asked Feb 27 '13 19:02

soleiljy

2 Answers

Yes, providing full-text search through Lucene and data storage through a traditional database is a well-supported architecture. A typical implementation is to index everything you want to be searchable, store only a unique identifier in the Lucene index, and pull any records found by a search from the database, based on that ID. If you want to reduce DB load, you can also store some information in Lucene to display a list of search results, and only query the database to fetch the full document.

As for saving on space, there will be some measure of duplication. This is true even if you only use Lucene, though: Lucene keeps the inverted index used for searching entirely separate from stored data. To save space, I'd recommend being very deliberate about what data you choose to index, and what you need to store and be able to retrieve later. What you store is particularly important for saving space in Lucene, since indexed-only values tend to be very space-efficient in most cases.

Lucene can certainly implement a tag search. The simple way to implement it would be to add each tag to a field of your choosing (I'll call it "tags", which seems to make sense) while building the document, such as:
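To make the "index the text, keep only the ID, fetch from the database" pattern concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API: the class IdOnlyIndexSketch, its index and search methods, and the Map standing in for the SQL Server table are all hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the pattern: the search index maps terms to
// document IDs only; the full text lives in the database (a Map here).
class IdOnlyIndexSketch {
    // term -> IDs of the documents containing it (a toy inverted index)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index the text of one document, keeping only its ID in the index.
    void index(int id, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    // Return the IDs of documents containing the term, in ascending order.
    List<Integer> search(String term) {
        Set<Integer> ids = postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
        return new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        // Stand-in for the SQL Server table holding the full documents.
        Map<Integer, String> database = new HashMap<>();
        database.put(1, "Lucene builds an inverted index");
        database.put(2, "SQL Server stores the documents");
        database.put(3, "Search the index, then load documents by ID");

        IdOnlyIndexSketch index = new IdOnlyIndexSketch();
        for (Map.Entry<Integer, String> row : database.entrySet()) {
            index.index(row.getKey(), row.getValue());
        }

        // The search yields IDs only; the database supplies the content.
        for (int id : index.search("documents")) {
            System.out.println(id + ": " + database.get(id));
        }
    }
}
```

The index never holds the original text, only terms and IDs, which is why the duplication is smaller than a second full copy of the documents.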

document.add(new Field("tags", "widget", Field.Store.NO, Field.Index.ANALYZED));
document.add(new Field("tags", "forkids", Field.Store.NO, Field.Index.ANALYZED));

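You can then add a required term to any query to restrict the search to a particular tag. For instance, to search for "some stuff" but only among documents tagged "forkids", you could write a query like:

some stuff +tags:forkids

Note that with the query parser's default OR operator, only the +tags:forkids clause is mandatory here; the unprefixed terms then just influence ranking. To require a text match as well, use +(some stuff) +tags:forkids.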
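The effect of a required tag clause combined with optional search terms can be sketched in plain Java. This is not Lucene code: TagFilterSketch, Doc, and search are hypothetical names, and the "score" is just a count of matched optional terms rather than Lucene's real relevance scoring. It assumes the default OR behaviour, where unprefixed terms are optional once a required clause is present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TagFilterSketch {
    // Hypothetical document record: an ID, its body terms, and its tags.
    static final class Doc {
        final int id;
        final Set<String> terms;
        final Set<String> tags;

        Doc(int id, String body, String... tags) {
            this.id = id;
            this.terms = new HashSet<>(Arrays.asList(body.toLowerCase().split("\\W+")));
            this.tags = new HashSet<>(Arrays.asList(tags));
        }
    }

    // Keep only documents carrying requiredTag, then order them by how many
    // of the optional terms they contain (most matches first).
    static List<Integer> search(List<Doc> docs, String requiredTag, String... optionalTerms) {
        List<int[]> hits = new ArrayList<>(); // each entry is {id, score}
        for (Doc d : docs) {
            if (!d.tags.contains(requiredTag)) {
                continue; // the required clause excludes this document
            }
            int score = 0;
            for (String term : optionalTerms) {
                if (d.terms.contains(term)) {
                    score++;
                }
            }
            hits.add(new int[] {d.id, score});
        }
        hits.sort((a, b) -> Integer.compare(b[1], a[1])); // stable, highest score first
        List<Integer> ids = new ArrayList<>();
        for (int[] hit : hits) {
            ids.add(hit[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc(1, "some stuff for sale", "widget"),
            new Doc(2, "other things", "forkids"),
            new Doc(3, "some stuff to play with", "widget", "forkids"));

        // Rough analogue of the query: some stuff +tags:forkids
        System.out.println(search(docs, "forkids", "some", "stuff"));
    }
}
```

Document 1 matches the text but lacks the tag, so it is dropped; document 2 has the tag but none of the terms, so it is still returned, just ranked below document 3.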


answered Oct 20 '22 01:10

femtoRgon

Documents can also be stored in Lucene itself; you can then retrieve and reference them using the document ID.

I would suggest using Solr (http://lucene.apache.org/solr/) on top of Lucene. It is more user-friendly and has multiValued fields (for the tags) available by default.

http://wiki.apache.org/solr/SchemaXml

answered Oct 20 '22 00:10

Elmer
