
Stronger boosting by date in Solr

Tags: search, solr, solr4

Boosting by a date field in Solr is defined as:

{!boost b=recip(ms(NOW,datefield),3.16e-11,1,1)}

I looked everywhere (for example, Solr Dismax Config for Boost Scoring and Solr boost for multivalued date field, both of which reference the SolrRelevancyFAQ), and the same definition is used every time. But I found that it does not boost my results sufficiently. How can I make this date boosting stronger?

User is searching for two keywords. Both items contain both keywords (in same order) in both title and description. Neither of the keywords is repeated.

And the Solr debug output is way too confusing for me to understand the problem.

Now, this is not a huge problem. 99% of queries work fine and produce the expected results, so it's not like Solr is not working at all. I just found this one situation very confusing and don't know how to proceed.

asked Feb 25 '14 14:02 by Shinhan


2 Answers

recip(x, m, a, b) implements f(x) = a/(xm+b), where:

  • x: the document age in ms, defined as ms(NOW,<datefield>).

  • m: a constant that sets the time scale over which the boost is applied. It should be the inverse of what you consider an old document age (a reference_time) in milliseconds. For example, a reference_time of 1 year (3.16e10 ms) means using its inverse: 3.16e-11 (1/3.16e10, rounded).

  • a and b: constants (chosen arbitrarily).

  • xm = 1 when the document is 1 reference_time old (multiplier = a/(1+b)).
    xm ≈ 0 when the document is new, resulting in a value close to a/b.

  • Using the same value for a and b ensures the multiplier doesn't exceed 1 with recent documents.

  • With a = b = 1, a 1 reference_time old document has a multiplier of about 1/2, a 2 reference_time old document has a multiplier of about 1/3, and so on.
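The curve described above can be checked numerically. The following is a minimal Python sketch, not Solr code: `recip` mirrors the formula a/(xm+b) given in this answer, while `date_multiplier` and `MS_PER_YEAR` are hypothetical helper names introduced only for illustration.

```python
# Approximate number of milliseconds in one year, as used in the answer.
MS_PER_YEAR = 3.16e10


def recip(x, m, a, b):
    """Plain-Python mirror of the formula f(x) = a / (x*m + b)."""
    return a / (m * x + b)


def date_multiplier(age_ms, m=3.16e-11, a=1.0, b=1.0):
    """Multiplier for a document that is age_ms milliseconds old."""
    return recip(age_ms, m, a, b)


if __name__ == "__main__":
    # Prints a multiplier near 1 for new documents, near 1/2 after one
    # reference_time, near 1/3 after two, and so on.
    for years in (0, 1, 2, 5):
        print(years, "year(s):", round(date_multiplier(years * MS_PER_YEAR), 3))
```

With a = b = 1 the multiplier only halves over a full year, which is why two documents published a few weeks apart end up with almost the same boost.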

How to make date boosting stronger?

  • Increase m: choose a lower reference_time, for example 6 months, which gives m = 6.33e-11. Compared to a 1-year reference, the multiplier decreases 2x faster as the document age increases.

  • Decrease a and b: this expands the response curve of the function. This can be very aggressive; see this example (page 8).

  • Apply a boost to the boost function itself with the bf (Boost Functions) parameter (this is a DisMax parameter, so it requires the DisMax or eDisMax query parser), e.g.:

    bf=recip(ms(NOW,datefield),3.16e-11,1,1)^2.0
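To get a feel for how much the first two options change the curve, here is a rough Python comparison. It only evaluates the a/(xm+b) formula from this answer. The `settings` table and the value a = b = 0.1 are illustrative assumptions, not recommended settings.

```python
# Approximate number of milliseconds in one year, as used in the answer.
MS_PER_YEAR = 3.16e10


def recip(x, m, a, b):
    """Plain-Python mirror of the formula f(x) = a / (x*m + b)."""
    return a / (m * x + b)


# Hypothetical parameter sets (m, a, b) to compare; labels are for display only.
settings = {
    "baseline: 1-year reference, a=b=1": (3.16e-11, 1.0, 1.0),
    "6-month reference, a=b=1": (6.33e-11, 1.0, 1.0),
    "1-year reference, a=b=0.1": (3.16e-11, 0.1, 0.1),
}


def multipliers(age_years):
    """Return the multiplier each parameter set gives at the given age."""
    age_ms = age_years * MS_PER_YEAR
    return {label: recip(age_ms, m, a, b) for label, (m, a, b) in settings.items()}


if __name__ == "__main__":
    for age in (0.1, 0.5, 1, 2):
        print(f"age = {age} year(s)")
        for label, value in multipliers(age).items():
            print(f"  {label}: {value:.3f}")
```

All three settings give a multiplier of 1 for a brand-new document. They differ in how quickly the multiplier drops afterwards, and the smaller a and b drop much faster than the shorter reference_time does.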
    

It is important to note a few things:

  • bf is an additive boost and acts as a bonus added to the score of newer documents.

  • {!boost b} is a multiplicative boost and acts more as a penalty applied to the score of older documents.

  • A bf score (the "bonus" added to the global score) is calculated independently of the relevancy score (the global score), so a result set with higher scores may not be affected as much as a result set with lower scores. In contrast, a multiplicative boost affects scores the same way regardless of the result set's relevancy, which is why it is usually preferred.

  • Do not use recip() for dates more than one reference_time in the future (with a = b = 1), or it will yield negative values.
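The difference between the two boost styles, and the future-date caveat, can be illustrated with a simplified Python model. This is only a sketch: real Solr scoring involves more than a plain sum or product, and the relevancy scores used here (10.0 and 0.5) are made-up numbers.

```python
# Approximate number of milliseconds in one year, as used in the answer.
MS_PER_YEAR = 3.16e10


def recip(x, m=3.16e-11, a=1.0, b=1.0):
    """Plain-Python mirror of the formula f(x) = a / (x*m + b)."""
    return a / (m * x + b)


def additive(score, age_ms, weight=1.0):
    """Simplified model of bf: the function value is added as a bonus."""
    return score + weight * recip(age_ms)


def multiplicative(score, age_ms):
    """Simplified model of {!boost b=...}: the score is multiplied."""
    return score * recip(age_ms)


def new_to_old_ratio(boost, score):
    """How much a new document outranks a one-year-old one with equal relevancy."""
    return boost(score, 0) / boost(score, MS_PER_YEAR)


if __name__ == "__main__":
    for score in (10.0, 0.5):  # hypothetical relevancy scores
        print("score", score,
              "additive ratio:", round(new_to_old_ratio(additive, score), 2),
              "multiplicative ratio:", round(new_to_old_ratio(multiplicative, score), 2))
    # A date two reference_times in the future makes x*m + b negative.
    print("future date:", recip(-2 * MS_PER_YEAR))
```

In this model the additive bonus barely matters when relevancy scores are high, while the multiplicative boost keeps the same ratio of about 2 between a new and a one-year-old document whatever the base score is.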

See also this very insightful post by Nolan Lawson on Comparing boost methods in Solr.

answered Nov 02 '22 18:11 by EricLavault


User is searching for two keywords. Both items contain both keywords (in same order) in both title and description. Neither of the keywords is repeated.

Well, from your example, it is clear that your results have landed in a tie. To make sense of the confusing debug output and devise a tie-breaker policy, it is important to understand DisMax.

With DisMax queries, each term of the user input is executed against several fields. If a term hits in more than one field of the same document, the highest-scoring hit is used. But what happens to the other subqueries that hit in that document for the term? That is what the tie parameter defines. DisMax calculates the score for a term query as:

score = [score of the top-scoring subquery] + tie * (sum of the scores of the other hitting subqueries)

Consequently, the tie parameter is a value between 0 and 1 that defines whether DisMax considers only the max hit score for a term (tie=0), all the hits for a term (tie=1), or something between those two extremes.
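The formula is easy to try out by hand. Below is a small Python sketch of it. The function name `dismax_score` and the per-field scores are hypothetical, chosen only to show how tie moves the result between the two extremes.

```python
def dismax_score(subquery_scores, tie=0.0):
    """Combine per-field scores for one term using the formula above:
    top score + tie * (sum of the other hitting subqueries)."""
    if not subquery_scores:
        return 0.0
    top = max(subquery_scores)
    others = sum(subquery_scores) - top
    return top + tie * others


if __name__ == "__main__":
    # Hypothetical scores for one term hitting title, description and tags.
    field_scores = [2.0, 1.0, 0.5]
    for tie in (0.0, 0.1, 1.0):
        print("tie =", tie, "->", dismax_score(field_scores, tie))
```

With tie=0, two documents whose best field scores are equal end up tied no matter how well their other fields match. A small non-zero tie lets those other fields break the tie.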

The boost parameter is very similar to the bf parameter, but instead of adding its result to the final score, it multiplies the final score by it. It is only available in the Extended DisMax query parser or the Lucid query parser.

There is an interesting article, Comparing Boost Methods of SOLR, which may be useful to you.

References for this answer:

  • Advanced Apache Solr boosting: a case study
  • Using Solr’s Dismax Tie Parameter

Shishir

answered Nov 02 '22 17:11 by Shishir Kumar