Does schema change require reindex of all Solr documents or just documents containing the changed schema fields?

I have millions of documents in my Solr index. Only a thousand of those documents have field A, whose schema I want to change. The schema changes include changing multiValued from true to false, stored from false to true, and type from text to string, all of which require a re-index. Re-indexing the thousand documents will take me a few minutes, whereas re-indexing everything will take days.

The re-indexing page on the Solr wiki (http://wiki.apache.org/solr/HowToReindex) says "you may need to delete all documents before you begin your indexing process", but doesn't say when you don't need to.

Can I delete just the thousand documents containing field A and re-index those thousand, or do I need to delete the entire index (all documents) before re-indexing them all?

I've tested the "deleting the few" scenario in a small sample index, and updates and queries work as expected on the changed field. However, I don't know whether I just got lucky and problems are lurking because I didn't delete everything.

user2704791 asked Mar 19 '23 21:03


1 Answer

  • If you index documents with the same id (the unique key defined in your schema.xml), you don't have to delete them before indexing. Indexing a document with an existing id overwrites the existing document.

Just keep in mind that when you index a document with an existing id, the old document is only marked as deleted; it is not physically removed from the index right away. Term statistics are still computed over all documents, including the deleted ones.

If you need to physically clean up deleted documents, run an index optimize; you can do this from the Solr admin interface.

  • If you make a change to the schema, you don't have to re-index everything. Re-indexing only the affected documents is sufficient.

So if I were in your place, I would not delete anything. I would just re-index the few thousand affected documents, then optimize later to clean up the index.
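As a rough sketch of that workflow, the Python below builds (but does not send) the two HTTP requests involved: one that re-adds the affected documents through Solr's JSON update handler, and one that issues an optimize command. The host, the core name `mycore`, the unique key name `id`, and the helper names are assumptions for illustration only; adjust them to your setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed location of the update handler; "mycore" is a hypothetical core name.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"


def build_update_request(payload, **params):
    """Build (without sending) a JSON POST to Solr's update handler."""
    url = SOLR_UPDATE_URL + ("?" + urlencode(params) if params else "")
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reindex_requests(docs):
    """Return the requests for re-adding docs and then optimizing."""
    # Re-adding documents whose unique key already exists overwrites the
    # old copies, so no explicit delete is issued first.
    add = build_update_request(docs, commit="true")
    # Optional and potentially slow on a large index: merge segments so
    # that documents marked as deleted are physically removed.
    optimize = build_update_request({"optimize": {}})
    return [add, optimize]
```

To actually run it, pass each returned request to `urllib.request.urlopen`, for example `for req in reindex_requests(docs): urlopen(req)`, where `docs` is a list of dicts such as `{"id": "doc-1", "A": "new value"}` holding the re-built versions of the affected documents.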

Emad answered Apr 27 '23 07:04