 

How to reindex all docs in Solr


I am going to change some field types in the schema, so it seems I must re-index all the docs currently in the Solr index for a change like this to take effect.

The question is how to "re-index" all the docs. One solution I can think of is to query all docs through the search interface, dump them to a large XML or JSON file, convert that to Solr's input XML format, and load it back into Solr so the schema change takes effect.
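As a rough sketch of that dump step using Solr's HTTP API (the host and core name below are placeholders, and only stored fields will come back):

# Dump all documents as JSON; indexed-but-not-stored fields cannot be recovered this way
curl 'http://localhost:8983/solr/mycore/select' \
    --data-urlencode 'q=*:*' \
    --data-urlencode 'rows=100000' \
    --data-urlencode 'wt=json' > dump.json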

Is there a better way to do this more efficiently? Thanks for your suggestions.

asked May 29 '11 by Yinan


People also ask

How do I reindex files in Solr?

There is no process in Solr for programmatically reindexing data. When we say "reindex", we mean, literally, "index it again". However you got the data into the index the first time, you will run that process again.

Can Solr index Word documents?

A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV) files, data extracted from tables in a database, and files in common file formats such as Microsoft Word or PDF.
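For example, such rich documents can be posted through Solr's extracting request handler (Solr Cell), which uses Apache Tika for text extraction; a minimal sketch, with the core name, document id, and file name all placeholders:

# Extract and index the contents of a local PDF under id "doc1"
curl 'http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&commit=true' -F 'myfile=@document.pdf'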

How does Solr indexing work?

Solr works by gathering, storing and indexing documents from different sources and making them searchable in near real-time. It follows a 3-step process that involves indexing, querying, and finally, ranking the results – all in near real-time, even though it can work with huge volumes of data.

How can I make Solr index faster?

After you post all your documents, call commit once, manually or from SolrJ; it will take a while to commit, but this will be much faster overall. Also, after you are done with your bulk import, reduce the autocommit maxTime and maxDocs settings so that any incremental posts to Solr get committed much sooner.
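A minimal sketch of that pattern over HTTP (host, core name, and document are placeholders):

# Post documents without committing...
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: application/json' \
    --data-binary '[{"id":"1","title":"first doc"}]'
# ...then issue one explicit commit at the end of the bulk load
curl 'http://localhost:8983/solr/mycore/update?commit=true'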


4 Answers

First of all, dumping the results of a query may not give you the original data if you have fields that are indexed but not stored. In general, it is best to keep a copy of the input to Solr in a form that you can easily use to rebuild the index from scratch if you need to. In that case, run a delete query by posting <delete><query>*:*</query></delete>, then <commit/>, and then <optimize/>. After that your index is empty and you can add new documents that use the new schema.
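Posted over HTTP, that sequence looks roughly like this (host and core name are placeholders):

# 1. Delete every document in the index
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>*:*</query></delete>'
# 2. Commit the deletion
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: text/xml' \
    --data-binary '<commit/>'
# 3. Optimize to compact the now-empty index
curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: text/xml' \
    --data-binary '<optimize/>'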

But you may be able to get away with just running <optimize/> after you restart Solr with the new schema file. It would be wise to try this on a backup first so you can verify that it works for your configuration.

There is a tool called Luke that can be used to browse and export Lucene indexes. I have never tried it myself, but it might be able to help you export your data so that you can reimport it.

answered Sep 28 '22 by Michael Dillon


Dumping all the results of a query could give you incomplete or invalid data, since you might not surface all of the data within your index.

While keeping a copy of your index input in a re-insertable form works well when the data doesn't change, it becomes more complicated once you've added a new field to the schema. In that situation, you'll need to collect all the data from the source, format it to match the new schema, and then insert it.
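As a rough sketch, if the source copies are kept as JSON files already shaped for the new schema (the paths and core name are hypothetical):

# Re-feed each saved JSON file into the rebuilt index, then commit once
for f in backup/*.json; do
  curl 'http://localhost:8983/solr/mycore/update' -H 'Content-Type: application/json' --data-binary @"$f"
done
curl 'http://localhost:8983/solr/mycore/update?commit=true'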

answered Sep 28 '22 by Jim Clouse


If the number of documents in Solr is large and you need to keep the Solr server available for querying, the indexing job can be started in the background to re-add/re-index the documents.

It is helpful to introduce a new field that keeps a last-indexed timestamp for each document, so that if any indexing/re-indexing issues arise, you can identify which documents are still waiting to be re-indexed.
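For instance, with a hypothetical date field named last_indexed, the documents still waiting to be re-indexed are the ones whose timestamp predates the start of the job (host, core, field name, and date are all assumptions):

# Find docs whose last_indexed is older than the moment the re-index job began
curl 'http://localhost:8983/solr/mycore/select' \
    --data-urlencode 'q=last_indexed:[* TO 2023-01-01T00:00:00Z]' \
    --data-urlencode 'rows=100'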

To improve query latency during the process, you can tune configuration parameters (for example, cache autowarming in solrconfig.xml) so that the caches stay warm after every commit.

answered Sep 28 '22 by Igor Babalich


There is a PHP script that does exactly this: it fetches all your Solr documents and reinserts them, re-indexing them in the process.

For optimizing, call from the command line:

curl http://<solr_host>:<port>/solr/<core_name>/update -F stream.body='<optimize/>'
answered Sep 28 '22 by Daniel Cukier