
Why doesn't Lucene support any type of update to an existing document?

My use case involves indexing a Lucene document and then, on multiple future occasions, adding terms that point to this existing doc, without deleting and re-adding the entire document for each new term (both for performance reasons and because I don't keep the original terms).

I do know that a document can not be truly updated. My question is why?

Or more precisely, why are all forms of updates (terms, stored fields) not supported?
Why isn't it possible to add another term that points to an existing document? Technically, isn't all that's needed to place the existing doc id in the term's posting list? Why is that hard? Are there some immutable statistics that are in the way?

Are there any workarounds that support my use case of adding a term (indexed field) to an existing doc?

Gili Nachum asked Aug 29 '12 19:08



1 Answer

I do know that a document can not be truly updated. My question is why?

Gili, editing a document changes the postings of the related terms, and this is problematic because of the structure of a term's posting list. The posting list is sorted by doc id and stored sequentially. So to add a document to a term's posting list, the document has to get a higher doc id, which is done by deleting and re-indexing the entire document.
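To make the point above concrete, here is a minimal toy sketch in Python. It is not Lucene's actual on-disk format or API: the gap (delta) encoding and the helper names `delta_encode`, `delta_decode`, `append_doc` and `insert_doc` are assumptions chosen only for illustration. It shows that appending a doc id higher than all existing ones touches only the end of a sorted, sequentially stored list, while slotting in an already existing (lower) doc id forces the list to be rewritten.

```python
from bisect import insort


def delta_encode(doc_ids):
    """Encode a sorted list of doc ids as gaps between consecutive ids."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def delta_decode(gaps):
    """Rebuild the sorted doc ids from their gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def append_doc(gaps, last_doc_id, new_doc_id):
    """Cheap path: the new id is above the current maximum, so one gap is appended."""
    if new_doc_id <= last_doc_id:
        raise ValueError("append only works for a doc id above the current maximum")
    gaps.append(new_doc_id - last_doc_id)
    return new_doc_id


def insert_doc(gaps, existing_doc_id):
    """Costly path: decode everything, insert in order, re-encode the whole list."""
    doc_ids = delta_decode(gaps)
    insort(doc_ids, existing_doc_id)
    return delta_encode(doc_ids)


# Hypothetical posting list for one term: docs 3, 7 and 12 contain it.
postings = delta_encode([3, 7, 12])
# A brand-new document (id 20) only adds one entry at the end.
last = append_doc(postings, 12, 20)
# Pointing the term at an older document (id 5) rewrites the list from that point on.
rewritten = insert_doc(postings, 5)
```

In this toy model the cheap path corresponds to re-indexing the document under a fresh, higher doc id, which is what delete-and-re-add does.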

dolbi answered Sep 19 '22 16:09