 

Lucene - is it the right answer for a huge index?

Tags:

lucene

Is Lucene capable of indexing 500M text documents of 50 KB each?

What performance can be expected from such an index, for a single-term search and for a 10-term search?

Should I be worried and move directly to a distributed index environment?

Saar

asked Aug 03 '11 by Saar


1 Answer

Yes, Lucene should be able to handle this, according to the following article: http://www.lucidimagination.com/content/scaling-lucene-and-solr

Here's a quote:

Depending on a multitude of factors, a single machine can easily host a Lucene/Solr index of 5 – 80+ million documents, while a distributed solution can provide subsecond search response times across billions of documents.

The article goes into great depth about scaling to multiple servers. So you can start small and scale if needed.
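To make the single-node case concrete, indexing in Lucene is essentially a loop feeding documents to an IndexWriter. Below is a minimal sketch of that setup; the index path, field names, and document content are placeholders I've chosen for illustration, and exact constructor signatures vary between Lucene releases (this follows the Lucene 5+ API):

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class IndexSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical on-disk index location.
            try (FSDirectory dir = FSDirectory.open(Paths.get("/data/lucene-index"));
                 IndexWriter writer = new IndexWriter(dir,
                         new IndexWriterConfig(new StandardAnalyzer()))) {
                // One Lucene Document per ~50 KB source document.
                Document doc = new Document();
                doc.add(new StringField("id", "doc-000001", Field.Store.YES));
                doc.add(new TextField("body", "... 50 KB of text ...", Field.Store.NO));
                writer.addDocument(doc);
                // In a real bulk load you would add documents in a loop
                // and commit in batches, not after every document.
                writer.commit();
            }
        }
    }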

A great resource about Lucene's performance is the blog of Mike McCandless, who is actively involved in the development of Lucene: http://blog.mikemccandless.com/ He often uses Wikipedia's content (25 GB) as test input for Lucene.
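On the single-term versus 10-term part of your question: in Lucene terms, a single-term search is a TermQuery traversing one posting list, while a 10-term search is typically a BooleanQuery combining ten TermQuerys. A rough sketch of the difference (field name, terms, and index path are placeholders; BooleanQuery.Builder is the API in recent releases):

    import java.nio.file.Paths;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    public class SearchSketch {
        public static void main(String[] args) throws Exception {
            try (DirectoryReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/data/lucene-index")))) {
                IndexSearcher searcher = new IndexSearcher(reader);

                // Single-term search: one posting list is traversed.
                TopDocs single = searcher.search(
                        new TermQuery(new Term("body", "lucene")), 10);

                // 10-term search: ten posting lists merged by the scorer.
                BooleanQuery.Builder b = new BooleanQuery.Builder();
                for (String t : new String[] {
                        "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10"}) {
                    b.add(new TermQuery(new Term("body", t)), BooleanClause.Occur.SHOULD);
                }
                TopDocs multi = searcher.search(b.build(), 10);

                System.out.println(single.totalHits + " vs " + multi.totalHits);
            }
        }
    }

In both cases latency depends mostly on posting-list lengths and OS page caching rather than raw document count, which is why real benchmarks like McCandless's are a better guide than back-of-envelope numbers.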

Also, it is worth noting that Twitter's real-time search is now implemented with Lucene (see http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html).

However, I am wondering whether the numbers you provided are correct: 500 million documents × 50 KB = 25 TB (roughly 23 TiB) of raw text -- do you really have that much data?

answered Oct 20 '22 by Stefan Mücke