
Lucene and SQL Server - best practice

I am pretty new to Lucene, so would like to get some help from you guys :)

BACKGROUND: Currently I have documents stored in SQL Server and want to use Lucene for full-text/tag searches on those documents in SQL Server.

Q1) In this case, in order to do keyword searches on the documents, should I insert all of those documents into the Lucene index? Does this mean there will be data duplication (one copy in SQL Server and another in the Lucene index)? That could be a problem, since we have a massive amount of documents (about 100GB). Is it inevitable?

Q2) Also, each document has a set of tags (up to 3). Is Lucene also a good choice for tag search? If so, how would I do it?

Thanks,

soleiljy asked Feb 27 '13 19:02



2 Answers

Yes, providing full-text search through Lucene and data storage through a traditional database is a well-supported architecture. Take a look here, for a brief introduction. A typical implementation indexes everything you want to be searchable, stores only a unique identifier in the Lucene index, and pulls any records found by a search from the database by that ID. If you want to reduce DB load, you can also store enough information in Lucene to display a list of search results, and only query the database to fetch the full document.
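To make the "search in the index, fetch from the database" flow concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: the `index` map is only a stand-in for Lucene's inverted index, the `database` map is a stand-in for the SQL Server table, and the class and method names are hypothetical, chosen for illustration.

```java
import java.util.*;

public class SearchThenFetch {
    // Stand-in for the SQL Server table: document ID -> full document text.
    private final Map<Integer, String> database = new HashMap<>();
    // Stand-in for the Lucene index: term -> IDs of the documents containing it.
    // Only IDs are kept here; the document text itself is not duplicated.
    private final Map<String, Set<Integer>> index = new HashMap<>();

    /** Saves the document in the "database" and indexes its terms by ID. */
    public void addDocument(int id, String text) {
        database.put(id, text);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    /** Step 1: ask the index which document IDs contain the term. */
    public List<Integer> search(String term) {
        return new ArrayList<>(
                index.getOrDefault(term.toLowerCase(), Collections.emptySet()));
    }

    /** Step 2: pull the full records from the database by ID. */
    public List<String> fetch(List<Integer> ids) {
        List<String> results = new ArrayList<>();
        for (int id : ids) {
            results.add(database.get(id));
        }
        return results;
    }

    public static void main(String[] args) {
        SearchThenFetch s = new SearchThenFetch();
        s.addDocument(1, "Lucene builds an inverted index");
        s.addDocument(2, "SQL Server stores the documents");
        for (String doc : s.fetch(s.search("documents"))) {
            System.out.println(doc);
        }
    }
}
```

In a real setup, `search` would be a Lucene query returning a stored ID field and `fetch` would be a `SELECT ... WHERE id IN (...)` against SQL Server; the shape of the two steps is the same.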

As for saving on space, there will be some measure of duplication. This is true even if you only use Lucene, though, because Lucene keeps the inverted index used for searching entirely separate from stored data. To save space, I'd recommend being very deliberate about what data you choose to index, and what you need to store and be able to retrieve later. What you store matters most for the size of the Lucene index, since indexed-only values tend to be very space-efficient in most cases.

Lucene can certainly implement a tag search. The simple way to implement it would be to add each tag to a field of your choosing (I'll call it "tags", which seems to make sense) while building the document, such as:

document.add(new Field("tags", "widget", Field.Store.NO, Field.Index.ANALYZED));
document.add(new Field("tags", "forkids", Field.Store.NO, Field.Index.ANALYZED));

and I could then simply add a required term to any query to search only within a particular tag. For instance, if I were to search for "some stuff", but only with the tag "forkids", I could write a query like:

some stuff +tags:forkids
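As a rough model of what that query does, here is a plain-Java sketch (no Lucene API involved). It assumes the query parser's default OR operator, under which `some` and `stuff` are optional terms that only influence ranking, while `+tags:forkids` is a required clause that every hit must satisfy. The `Doc` class, the `search` method, and the "count of matched terms" score are hypothetical simplifications; Lucene's real relevance scoring is more involved.

```java
import java.util.*;

public class RequiredTagFilter {
    /** Hypothetical in-memory document: body terms plus a set of tags. */
    static final class Doc {
        final int id;
        final Set<String> terms;
        final Set<String> tags;

        Doc(int id, String body, String... tags) {
            this.id = id;
            this.terms = new HashSet<>(Arrays.asList(body.toLowerCase().split("\\W+")));
            this.tags = new HashSet<>(Arrays.asList(tags));
        }
    }

    /**
     * Returns the IDs of documents carrying requiredTag, best match first.
     * Optional terms never exclude a document; they only raise its score.
     */
    static List<Integer> search(List<Doc> docs, List<String> optionalTerms, String requiredTag) {
        List<int[]> hits = new ArrayList<>(); // each entry is {id, score}
        for (Doc d : docs) {
            if (!d.tags.contains(requiredTag)) {
                continue; // the required clause acts as a hard filter
            }
            int score = 0;
            for (String term : optionalTerms) {
                if (d.terms.contains(term)) {
                    score++;
                }
            }
            hits.add(new int[] {d.id, score});
        }
        // Highest score first; ties broken by ascending ID.
        hits.sort((a, b) -> a[1] != b[1]
                ? Integer.compare(b[1], a[1])
                : Integer.compare(a[0], b[0]));
        List<Integer> ids = new ArrayList<>();
        for (int[] hit : hits) {
            ids.add(hit[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc(1, "some stuff for sale", "forkids"),
                new Doc(2, "some stuff", "widget"));
        System.out.println(search(docs, Arrays.asList("some", "stuff"), "forkids"));
    }
}
```

One consequence of this reading: a document tagged "forkids" that contains neither "some" nor "stuff" can still be returned, just ranked last. If both words must match too, they would need to be marked as required as well.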
femtoRgon answered Oct 20 '22 01:10


Documents can also be stored in Lucene itself; you can then retrieve and reference them using the document ID.

I would suggest using Solr http://lucene.apache.org/solr/ on top of Lucene. It is more user-friendly and has multiValued fields (for the tags) available by default.

http://wiki.apache.org/solr/SchemaXml

Elmer answered Oct 20 '22 00:10