 

How to reindex all docs in Solr


I am going to change some field types in the schema, so it seems I must re-index all the docs currently in the Solr index for a change like this to take effect.

The question is how to "re-index" all the docs. One solution I can think of is to query all docs through the search interface, dump them to a large XML or JSON file, convert that to Solr's input XML format, and load it back into Solr so the schema change takes effect.
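As a rough sketch of that dump step using Solr's HTTP API (the host and core name below are placeholders, and only stored fields will come back):

# Dump all documents as JSON; indexed-but-not-stored fields cannot be recovered this way
curl 'http://localhost:8983/solr/mycore/select' \
    --data-urlencode 'q=*:*' \
    --data-urlencode 'rows=100000' \
    --data-urlencode 'wt=json' > dump.json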

Is there a better way to do this more efficiently? Thanks for your suggestions.

asked May 29 '11 by Yinan


People also ask

How do I reindex files in Solr?

There is no process in Solr for programmatically reindexing data. When we say "reindex", we mean, literally, "index it again". However you got the data into the index the first time, you will run that process again.

Can Solr index Word documents?

A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV) files, data extracted from tables in a database, and files in common file formats such as Microsoft Word or PDF.
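For example, such rich documents can be posted through Solr's extracting request handler (Solr Cell), which uses Apache Tika for text extraction; a minimal sketch, with the core name, document id, and file name all placeholders:

# Extract and index the contents of a local PDF under id "doc1"
curl 'http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&commit=true' -F 'myfile=@document.pdf'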

How does Solr indexing work?

Solr works by gathering, storing and indexing documents from different sources and making them searchable in near real-time. It follows a 3-step process that involves indexing, querying, and finally, ranking the results – all in near real-time, even though it can work with huge volumes of data.

How can I make Solr index faster?

After you post all your documents, call commit once, manually or from SolrJ; it will take a while to commit, but this will be much faster overall. Also, after you are done with your bulk import, reduce the autocommit maxTime and maxDocs settings so that any incremental posts to Solr get committed much sooner.
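A minimal sketch of that pattern over HTTP (host, core name, and document are placeholders):

# Post documents without committing...
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: application/json' \
    --data-binary '[{"id":"1","title":"first doc"}]'
# ...then issue one explicit commit at the end of the bulk load
curl 'http://localhost:8983/solr/mycore/update?commit=true'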


4 Answers

First of all, dumping the results of a query may not give you the original data if you have fields that are indexed but not stored. In general, it is best to keep a copy of the input to Solr in a form that you can easily use to rebuild the index from scratch if you need to. In that case, run a delete query by posting <delete><query>*:*</query></delete>, then <commit/>, and then <optimize/>. After that your index is empty and you can add new documents that use the new schema.
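Posted over HTTP, that sequence looks roughly like this (host and core name are placeholders):

# 1. Delete every document in the index
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>*:*</query></delete>'
# 2. Commit the deletion
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: text/xml' \
    --data-binary '<commit/>'
# 3. Optimize to compact the now-empty index
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: text/xml' \
    --data-binary '<optimize/>'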

But you may be able to get away with just running <optimize/> after you restart Solr with the new schema file. It would be wise to try this on a backup first so you can verify that it works for your configuration.

There is a tool called Luke that can be used to browse and export Lucene indexes. I have never tried it myself, but it might be able to help you export your data so that you can reimport it.

answered Sep 28 '22 by Michael Dillon


Dumping all the results of a query could give you incomplete or invalid data, since you might not surface all of the data within your index.

While keeping a copy of your index input in a re-insertable form works well when the data doesn't change, it becomes more complicated once you've added a new field to the schema. In that situation, you'll need to collect all the data from the source, format it to match the new schema, and then insert it.
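As a rough sketch, if the source copies are kept as JSON files already shaped for the new schema (the paths and core name are hypothetical):

# Re-feed each saved JSON file into the rebuilt index, then commit once
for f in backup/*.json; do
  curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: application/json' --data-binary @"$f"
done
curl 'http://localhost:8983/solr/mycore/update?commit=true'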

answered Sep 28 '22 by Jim Clouse


If the number of documents in Solr is large and you need to keep the Solr server available for querying, the indexing job can be started in the background to re-add/re-index the documents.

It is helpful to introduce a new field that keeps a last-indexed timestamp for each document, so that if any indexing/re-indexing issues arise, you can identify which documents are still waiting to be re-indexed.
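For instance, with a hypothetical date field named last_indexed, the documents still waiting to be re-indexed are the ones whose timestamp predates the start of the job (host, core, field name, and date are all assumptions):

# Find docs whose last_indexed is older than the moment the re-index job began
curl 'http://localhost:8983/solr/mycore/select' \
    --data-urlencode 'q=last_indexed:[* TO 2023-01-01T00:00:00Z]' \
    --data-urlencode 'rows=100'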

To improve query latency during the process, you can tune configuration parameters (for example, cache autowarming in solrconfig.xml) so that the caches stay warm after every commit.

answered Sep 28 '22 by Igor Babalich


There is a PHP script that does exactly this: it fetches all your Solr documents and reinserts them, re-indexing them in the process.

For optimizing, call from the command line:

curl http://<solr_host>:<port>/solr/<core_name>/update -F stream.body='<optimize/>'
answered Sep 28 '22 by Daniel Cukier