
Optimizing Solr for Sorting

Tags: solr, lucene

I'm using Solr for a real-time search index. My dataset is about 60M large documents. I need results sorted by time rather than by relevance, so I currently pass the sort parameter in the query. This works fine for narrow searches, but when a search matches a large number of documents, Solr has to sort all of them by time before returning anything. This is slow, and there has to be a better way.

What is the better way?

devinfoley asked Feb 22 '11 07:02


3 Answers

I found the answer.

If you want to sort by time rather than relevance, use fq= instead of q= for all of your filters. That way, Solr doesn't waste time computing a relevance score for every document matching q=. It turned out that Solr was spending too much time scoring, not sorting.
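As a rough illustration, the request parameters could be assembled as below. This is only a sketch: it assumes every restriction can move into fq, leaving a match-all main query, and the field names `timestamp`, `category` and `lang` are placeholders that do not come from my actual schema.

```python
from urllib.parse import urlencode


def build_time_sorted_params(filters, sort_field="timestamp", rows=10):
    """Build Solr select parameters that filter without relevance scoring.

    `filters` is a list of filter strings such as "category:news".
    The field names are placeholders for this example.
    """
    params = [("q", "*:*")]                      # match-all main query
    params += [("fq", f) for f in filters]       # one fq per restriction
    params.append(("sort", f"{sort_field} desc"))
    params.append(("rows", str(rows)))
    return params


params = build_time_sorted_params(["category:news", "lang:en"])
query_string = urlencode(params)  # ready to append to .../select?
print(query_string)
```

Keeping each restriction in its own fq means each one is evaluated as a separate filter, which may also help if the same restriction recurs across queries.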

Additionally, you can speed up sorting by pre-warming your sort fields in the newSearcher and firstSearcher event listeners in solrconfig.xml. This ensures that the cache used for sorting is already populated when the first real query arrives.

devinfoley answered Nov 01 '22 04:11


Obvious first question: what is the type of your time field? If it's a string, sorting will be very slow. tdate is even faster than date.

Another point: do you have enough memory for Solr? If it starts swapping, performance immediately becomes awful.

And a third one: if you have an older version of Lucene, a date is just a string, which is very slow.

Olli answered Nov 01 '22 02:11


Warning: Wild suggestion, not based on prior experience or known facts. :)

  1. Perform a query without sorting and with rows=0 to get the number of matches. Disable faceting etc. to improve performance, since we only need the total number of matches.
  2. Based on the number of matches from Step #1, the distribution of your data, and the count/offset of the results that you need, fire another query that sorts by date and also adds a filter on the date, like fq=date:[NOW-xDAYS TO *], where x is the estimated number of days within which we expect to find the required number of matching documents.
  3. If the number of results from Step #2 is less than what you need, relax the filter a bit and fire another query.
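The three steps could be sketched roughly as follows. This is untested against a real Solr instance: `search` is a hypothetical callable that sends the parameters to Solr and returns `(num_found, docs)`, and the doubling policy for relaxing the window is just one possible choice. `fake_search` is an in-memory stand-in so the sketch can run on its own.

```python
import re


def fetch_recent(search, query, needed, days, growth=2.0, max_days=3650):
    """Widen a date filter until `needed` results are found.

    `search(params)` is a hypothetical callable returning (num_found, docs);
    it is injected so the sketch does not depend on any client library.
    """
    # Step 1: count matches only (rows=0, no sort).
    total, _ = search({"q": query, "rows": 0})
    if total == 0:
        return []
    target = min(needed, total)  # cannot return more than exist

    while True:
        # Step 2: sorted query restricted to the last `days` days.
        found, docs = search({
            "q": query,
            "fq": f"date:[NOW-{days}DAYS TO *]",
            "sort": "date desc",
            "rows": needed,
        })
        if found >= target or days >= max_days:
            return docs
        # Step 3: too few results, so relax the filter and try again.
        days = min(max_days, max(days + 1, int(days * growth)))


def fake_search(params):
    """In-memory stand-in for Solr: each document is its age in days."""
    ages = [0, 3, 9, 20, 45]
    fq = params.get("fq")
    if fq:
        window = int(re.search(r"NOW-(\d+)DAYS", fq).group(1))
        ages = [a for a in ages if a <= window]
    return len(ages), sorted(ages)[: params["rows"]]


print(fetch_recent(fake_search, "*:*", needed=3, days=2))
```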

For starters, you can use the following to estimate x:

If you are adding n documents a day at a uniform rate to an index of N documents, and a specific query matched d documents in Step #1, then roughly n*d/N matching documents arrive per day. So to get the top r results you can use x = (N*r*1.2)/(d*n), where the factor 1.2 is a safety margin. If you have to relax your filter too often in Step #3, slowly increase the value 1.2 in the formula as required.
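A small helper for the estimate might look like this. Rounding up to whole days and the minimum of one day are my own additions, and the daily rate and match count in the usage line are made-up numbers.

```python
import math


def estimate_window_days(index_size, docs_per_day, matches, wanted, safety=1.2):
    """Estimate x = (N * r * safety) / (d * n), rounded up to whole days.

    index_size   -- N, total documents in the index
    docs_per_day -- n, documents added per day (assumed uniform)
    matches      -- d, matches reported by the rows=0 query in Step #1
    wanted       -- r, number of top results required
    """
    if matches <= 0 or docs_per_day <= 0:
        raise ValueError("matches and docs_per_day must be positive")
    x = (index_size * wanted * safety) / (matches * docs_per_day)
    return max(1, math.ceil(x))  # at least one day (an assumption)


# Hypothetical numbers: 60M documents, 100k added per day,
# 30k matches for the query, top 50 results wanted.
print(estimate_window_days(60_000_000, 100_000, 30_000, 50))
```

Raising `safety` above 1.2 corresponds to increasing the constant in the formula when Step #3 fires too often.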

nikhil500 answered Nov 01 '22 04:11