I'm using Solr for a realtime search index. My dataset is about 60M large documents. Instead of sorting by relevance, I need to sort by time. Currently I'm using the sort flag in the query to sort by time. This works fine for specific searches, but when searches return large numbers of results, Solr has to take all of the resulting documents and sort them by time before returning. This is slow, and there has to be a better way.
What is the better way?
I found the answer.
If you want to sort by time, and not relevance, use fq= instead of q= for all of your filters. This way, Solr doesn't waste time figuring out the weighted value of the documents matching q=. It turns out that Solr was spending too much time weighting, not sorting.
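For illustration, here is a minimal sketch of what such a request could look like when built from Python. The field names (`body`, `lang`, `timestamp`) and the helper name are hypothetical, not taken from the question. The idea is just to show a match-all `q` with every real restriction moved into `fq` and an explicit `sort`.

```python
from urllib.parse import urlencode


def build_time_sorted_params(filters, time_field="timestamp", rows=10):
    """Build Solr request parameters that filter with fq and sort by time.

    `filters` is a list of Solr filter expressions; `time_field` is a
    hypothetical name for the date field being sorted on.
    """
    # q=*:* matches every document, so no relevance weighting is needed
    params = [("q", "*:*")]
    # each restriction becomes its own fq parameter
    params += [("fq", f) for f in filters]
    # newest documents first
    params += [("sort", f"{time_field} desc"), ("rows", str(rows))]
    return params


params = build_time_sorted_params(["body:solr", "lang:en"])
query_string = urlencode(params)  # append this to the /select URL
print(query_string)
```

Keeping each restriction in a separate `fq` also lets Solr cache the filters independently, which tends to help when the same filters recur across queries.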
Additionally, you can speed sorting up by pre-warming your sort fields in the newSearcher and firstSearcher event listeners in solrconfig.xml. This will ensure that sorts are done via cache.
Obvious first question: what's the type of your time field? If it's a string, then sorting is obviously very slow. tdate is even faster than date.
Another point: do you have enough memory for Solr? If it starts swapping, then performance is immediately awful.
And third one: if you have an older Lucene, then date is just a string, which is very slow.
Warning: Wild suggestion, not based on prior experience or known facts. :)
fq=date:[NOW-xDAY TO *]
where x is the estimated time period in days during which we will find the required number of matching documents. For starters, you can use the following to estimate x:
If you are uniformly adding n documents a day to an index of size N documents and a specific query matched d documents in Step #1, then to get the top r results you can use x = (N*r*1.2)/(d*n). If you have to relax your filter too often in Step #3, then slowly increase the value 1.2 in the formula as required.
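To make the estimate concrete, here is a small sketch that plugs numbers into the formula and builds the corresponding filter. Only N (about 60M) comes from the question; the values for n, d and r are made up for the example, and the function name is hypothetical. The window is rounded up to whole days and written in Solr date-math form (`NOW-<k>DAY`).

```python
import math


def estimate_window_days(N, n, d, r, safety=1.2):
    """Estimate x = (N * r * safety) / (d * n), in days.

    N: total documents in the index
    n: documents added per day (assumed uniform)
    d: documents matched by the query
    r: number of top results wanted
    safety: the 1.2 fudge factor from the formula; raise it if the
            window turns out too narrow too often
    """
    if d <= 0 or n <= 0:
        raise ValueError("d and n must be positive")
    return (N * r * safety) / (d * n)


# Hypothetical numbers: 60M docs, 100k added per day,
# query matches 3M docs, we want the newest 100.
x = estimate_window_days(N=60_000_000, n=100_000, d=3_000_000, r=100)
days = max(1, math.ceil(x))  # round up to at least one whole day
fq = f"date:[NOW-{days}DAY TO *]"
print(x, days, fq)
```

With these made-up numbers the query matches 5% of the index, so roughly 5,000 new matches arrive per day and a one-day window is already more than enough for 100 results. A rarer query (smaller d) pushes x up proportionally.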