Is Lucene capable of indexing 500M text documents of 50 KB each?
What performance can be expected from such an index, for a single-term search and for a 10-term search?
Should I be worried and move directly to a distributed index environment?
Saar
Yes, Lucene should be able to handle this, according to the following article: http://www.lucidimagination.com/content/scaling-lucene-and-solr
Here's a quote:
Depending on a multitude of factors, a single machine can easily host a Lucene/Solr index of 5 – 80+ million documents, while a distributed solution can provide subsecond search response times across billions of documents.
The article goes into great depth about scaling to multiple servers. So you can start small and scale if needed.
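To get a rough feel for what "scale if needed" could mean here, the per-machine range from the quote can be turned into a machine count. This is only a back-of-the-envelope sketch: the helper name `machines_needed` is made up for illustration, and it assumes documents are spread evenly and that the quoted 5 to 80 million range applies to your data. With 50 KB documents you may well land nearer the low end of that range.

```python
def machines_needed(total_docs: int, docs_per_machine: int) -> int:
    """Smallest number of machines so that none holds more than docs_per_machine documents."""
    # Integer ceiling division, avoids floating-point rounding.
    return (total_docs + docs_per_machine - 1) // docs_per_machine


TOTAL_DOCS = 500_000_000          # figure from the question
LOW_PER_MACHINE = 5_000_000       # lower bound from the quoted article
HIGH_PER_MACHINE = 80_000_000     # upper bound from the quoted article

best_case = machines_needed(TOTAL_DOCS, HIGH_PER_MACHINE)
worst_case = machines_needed(TOTAL_DOCS, LOW_PER_MACHINE)

print(f"machines needed: between {best_case} and {worst_case}")
```

So by the article's own numbers, 500M documents is beyond what it describes for a single machine, and the estimate ranges from a handful of machines to around a hundred depending on which end of the range applies.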
A great resource about Lucene's performance is the blog of Mike McCandless, who is actively involved in the development of Lucene: http://blog.mikemccandless.com/ He often uses Wikipedia's content (25 GB) as test input for Lucene.
It may also be of interest that Twitter's real-time search is now implemented with Lucene (see http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html).
However, I wonder whether the numbers you provided are correct: 500 million documents x 50 KB = ~23 TB. Do you really have that much data?
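For reference, here is the arithmetic behind that figure as a small sketch. It assumes "50K" means 50 KB with 1 KB = 1024 bytes; the function name `raw_size_bytes` is just for illustration. This is the raw text size only, and the size of the resulting Lucene index will differ depending on which fields are stored and how they are analyzed.

```python
def raw_size_bytes(num_docs: int, kb_per_doc: int, bytes_per_kb: int = 1024) -> int:
    """Total raw size of the corpus in bytes."""
    return num_docs * kb_per_doc * bytes_per_kb


size = raw_size_bytes(500_000_000, 50)

tib = size / 1024**4   # binary terabytes (TiB)
tb = size / 1000**4    # decimal terabytes (TB)

print(f"{tib:.1f} TiB")  # about 23.3
print(f"{tb:.1f} TB")    # 25.6
```

Either way you count it, that is well over 20 TB of raw text.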