Frequent Updates to Solr Documents - Efficiency/Scalability concerns

I have a Solr index with document fields something like:

id, body_text, date, num_upvotes, num_downvotes

In my application, a document is created with some integer id and some body_text (500 chars max). The date is set to the time of input, and num_upvotes and num_downvotes begin at 0.
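
For concreteness, creating and indexing such a document looks something like the sketch below, using Python and the requests library against Solr's JSON update handler (the localhost URL and the core name "content" are assumptions for illustration, not part of my actual setup):

```python
import time
import requests

SOLR_UPDATE = "http://localhost:8983/solr/content/update"  # hypothetical core name

def index_document(doc_id, body_text):
    """Index a new content document with vote counters starting at 0."""
    doc = {
        "id": doc_id,
        "body_text": body_text[:500],  # 500 chars max
        "date": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),  # time of input
        "num_upvotes": 0,
        "num_downvotes": 0,
    }
    # Posting a JSON list of documents adds them to (or replaces them in) the index.
    resp = requests.post(SOLR_UPDATE, json=[doc], params={"commit": "true"})
    resp.raise_for_status()
```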

My application lets users upvote and downvote the content described above, and the reason I want to track the votes in Solr rather than only in the DB is that I want to factor the upvote and downvote counts into my search scoring.

This is a problem because you can't update a single field of a Solr document in place (e.g. increment num_upvotes); you must replace the entire document, which is probably fairly inefficient considering it would require hitting my DB to grab all the relevant data again.
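
To illustrate the round trip, this is roughly what applying a single upvote looks like with the full-replace approach; fetch_content_row is a hypothetical helper that re-reads every field from my DB:

```python
import requests

SOLR_UPDATE = "http://localhost:8983/solr/content/update"  # hypothetical core name

def upvote(doc_id):
    """Full-replace approach: one DB round trip plus a whole-document reindex."""
    row = fetch_content_row(doc_id)  # hypothetical DB helper returning all fields
    doc = {
        "id": row["id"],
        "body_text": row["body_text"],
        "date": row["date"],
        "num_upvotes": row["num_upvotes"] + 1,
        "num_downvotes": row["num_downvotes"],
    }
    # All the unchanged fields have to be re-sent just to bump one counter.
    requests.post(SOLR_UPDATE, json=[doc], params={"commit": "true"}).raise_for_status()
```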

I realize the solution may require a different data layout, or possibly multiple indexes (although I don't know whether you can query/score across Solr cores).

Is anyone able to offer any recommendations on how to tackle this?

asked Nov 16 '11 by DJSunny




2 Answers

A solution I use for a similar problem is to update that information in the database and push Solr updates/inserts every ten minutes, using only the documents that were modified since the last update.
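
A sketch of that delta loop, under the assumption of a documents table with a modified_at column (both names are made up for illustration):

```python
import sqlite3  # stand-in for whatever database you use
import requests

SOLR_UPDATE = "http://localhost:8983/solr/content/update"  # hypothetical core name

def push_delta(conn, last_run):
    """Re-index only the rows modified since the previous run, in a single batch."""
    rows = conn.execute(
        "SELECT id, body_text, date, num_upvotes, num_downvotes "
        "FROM documents WHERE modified_at > ?",
        (last_run,),
    ).fetchall()
    docs = [
        {"id": r[0], "body_text": r[1], "date": r[2],
         "num_upvotes": r[3], "num_downvotes": r[4]}
        for r in rows
    ]
    if docs:
        # One batched POST and one commit per cycle, instead of one per vote.
        requests.post(SOLR_UPDATE, json=docs,
                      params={"commit": "true"}).raise_for_status()
```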

Also, every night when I don't have much traffic, I run an index optimize. I have also set up some warm-up queries in the Solr config that run after each import.
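
I configure the warm-up queries in the Solr config, but fired by hand they would look roughly like this (the queries themselves are placeholders, not my real ones):

```python
import requests

SOLR = "http://localhost:8983/solr/content"  # hypothetical core name

def nightly_optimize():
    # Optimize merges index segments; it is I/O heavy, so it runs off-peak.
    requests.get(f"{SOLR}/update", params={"optimize": "true"}).raise_for_status()

# Placeholders: substitute your most common facets and filter queries.
WARMUP_QUERIES = [
    {"q": "*:*", "facet": "true", "facet.field": "date"},
    {"q": "*:*", "fq": "num_upvotes:[10 TO *]"},
]

def warm_caches():
    """Run the common queries once so the new searcher's caches are populated."""
    for params in WARMUP_QUERIES:
        requests.get(f"{SOLR}/select",
                     params={**params, "wt": "json"}).raise_for_status()
```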

My Solr index holds around 1.5 million documents; each document has 24 fields and around 2,000 characters in total. Every 10 minutes I update around 500 documents (without optimizing the index) and run around 50 warm-up queries made up of the most common facets, the most used filter queries, and free-text searches.

I see no negative impact on performance (at least none that is visible): my queries average 0.1 seconds, versus 0.09 seconds before I started updating every 10 minutes.

LATER EDIT:

I didn't encounter any problems with these updates. I always take the documents from the database and insert them into Solr with their unique key; if a document already exists in Solr, it is replaced (this is what I mean by update).

It never takes more than 3 minutes to update Solr. I actually take a 10-minute break after each update: I start the index update, wait for it to finish, and then wait another 10 minutes before starting again.

I have not looked at performance overnight, but for me it is not relevant, as I want fresh data during the peaks in user visits.

answered Sep 23 '22 by Dorin


The Join feature would help you here: you could store the up/down votes in a separate document.

The bad news is that you need to wait until Solr 4 unless you're comfortable running with a trunk build.
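
As a rough sketch of the idea (field names like content_id are made up for illustration): keep one vote document per content item, then use the join query parser to relate it back to the content documents at query time:

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/content/select"  # hypothetical core name

def content_with_min_upvotes(min_votes):
    """Find content docs whose companion vote doc has at least min_votes upvotes."""
    params = {
        # {!join} matches vote docs, then maps their content_id onto content ids.
        "q": "{!join from=content_id to=id}num_upvotes:[%d TO *]" % min_votes,
        "wt": "json",
    }
    resp = requests.get(SOLR_SELECT, params=params)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]
```

One caveat: the plain join parser filters but does not carry the joined document's score across, so using the vote counts to rank results (rather than just filter them) may still take extra work, e.g. an ExternalFileField holding the counts.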

answered Sep 23 '22 by brian519