
SOLR exact match boost over text containing the exact match

I could not find a better title; I hope to change it later if you have suggestions.

My problem:

I have a database of music artists. The names look like this: "dr. dre feat. akon", "eminem & dr. dre", "dr. dre feat. ll cool j", "dr. dre", "dr. dre feat. eminem & skylar grey". There are only two fields: id and name.

On a Solr core with the default schema I run the query "q=dr. dre". The results are OK but not perfect, and look like this:

  • dr. dre feat. akon
  • eminem & dr. dre
  • dr. dre feat. ll cool j
  • dr. dre
  • ...

Note that they all have exactly the same score.

What I want is to have "dr. dre" as a first result, and then all the others, like this:

  • dr. dre <<-- dr. dre is first
  • eminem & dr. dre
  • dr. dre feat. ll cool j
  • dr. dre feat. akon
  • ...

How do I achieve this? (Filters, tokenizers, copy fields, etc.: it does not matter which. I cannot change code inside Solr, as I've seen suggested on some other forums.)

Thanks.

asked Mar 17 '15 at 15:03 by BogdanM


1 Answer

There are a couple of different ways to get the "dr. dre" result to come up first. I apologize for the lengthy answer, but as often occurs in Solr, the answer depends on your priorities and needs.

This is probably redundant, but I'd like to start by making sure that you are seeing the scores for each result. Your question didn't make this entirely clear. Solr sorts results by descending score unless a different sort is passed in or configured in solrconfig.xml, but it does not return the score itself unless you ask for it in the fl parameter. I imagine that you are already doing this, but just to make sure, you can try a query like this: q="dr. dre"&fl=*,score&sort=score desc. That will show you the calculated score for each result, and explicitly sort the results with the highest scores first.
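
If you would rather not repeat those parameters on every request, they can be set as handler defaults in solrconfig.xml. This is only a sketch: it assumes the stock /select handler, and in a real core you would merge these two entries into the defaults list that is already there rather than add a second handler.

```xml
<!-- solrconfig.xml: assumed /select handler; merge into your existing defaults -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- return all stored fields plus the computed relevance score -->
    <str name="fl">*,score</str>
    <!-- highest-scoring documents first -->
    <str name="sort">score desc</str>
  </lst>
</requestHandler>
```

Parameters sent with an individual request still override these defaults.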

Norms

Norms are a flexible option that works with Solr fairly naturally. Your name field should have a type value that maps to a fieldType entry. That fieldType should probably have class="solr.TextField", and neither it nor the field should set omitNorms="true". With norms in place, Solr factors the length of the field into the score, so a match in a short name counts for more than the same match in a long one. (How many times your search terms occur in the name is also considered, but that is term frequency rather than norms.) "dr. dre" would have the highest score because 100% of the words in the name match your search. Norms are written at index time, so if you change this setting you will need to reindex.

You can read about norms and see a good general text fieldType configuration on the Solr documentation wiki, or in your downloaded Solr documentation for your particular Solr version. The advantage of relying on norms is that in addition to being fairly easy to implement, they are progressive. So while "dr. dre" would be the most relevant record with 100% of its name matching your search, "eminem & dr. dre" would also be more relevant than "a whole list of guys & also dr. dre" because your search term is a larger proportion of the name.
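
As a rough illustration, a tokenized field type that keeps norms could look like the following. The type name text_general and the choice of tokenizer and filter are assumptions modeled on the example schemas shipped with Solr; your own schema may already define something equivalent.

```xml
<!-- schema.xml: a sketch of a tokenized text type that keeps norms -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split the name into word tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- make matching case-insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- omitNorms="false" is already the default for TextField; shown only to be explicit -->
<field name="name" type="text_general" indexed="true" stored="true" omitNorms="false"/>
```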

Exact Match

Exact match is a complicated issue in Solr, largely because there are varying degrees of "exactitude", and a truly exact match is rarely desirable in real life. For example, if your record has the name "dr. dre", is "dr dre" (without the period) close enough to be exact? Is "Dr. Dre"? Is " dr. dre"?

If you decide to implement an exact match search, then you will probably want to set up a copyField in your schema.xml:

<copyField source="name" dest="exactName"/>

Then, you will want to search both fields together. How you do this depends on which query parser you're using. If you are using the standard/lucene query parser, then you will need to write the OR yourself (e.g. q=name:"dr. dre" OR exactName:"dr. dre"^4). A "^4" after a search term makes a match on that clause count 4 times as much toward the score as a match elsewhere in the query. If you are using the DisMax or Extended DisMax (eDisMax) query parser, you have access to the qf (query fields) parameter, which allows you to provide a list of fields to use for your search, and to weight some more heavily than others. For example, qf=exactName^4 name&q="dr. dre" tells Solr to check for "dr. dre" in both fields, but to consider a match in the exactName field 4 times as relevant as one in the name field. (If this works for you, the default qf can be set in solrconfig.xml so it doesn't need to be restated with every query.)
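
For the solrconfig.xml route, the defaults might look something like this. It is a sketch that assumes the stock /select handler and the exactName and name fields from this answer; the boost of 4 is just the illustrative value used above, not a recommendation.

```xml
<!-- solrconfig.xml: assumed /select handler; merge into your existing defaults -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- use the Extended DisMax parser unless the request says otherwise -->
    <str name="defType">edismax</str>
    <!-- search both fields; an exactName match weighs 4x a name match -->
    <str name="qf">exactName^4 name</str>
  </lst>
</requestHandler>
```

With that in place, a request as short as q="dr. dre" would search both fields with the boost applied.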

This leaves the fieldType of the exactName field undecided. If you feel that only a completely precise match will work and variations in capitalization or punctuation make a match non-exact, then you could set up the exactName field as a string:

<field name="exactName" type="string" indexed="true" stored="false" multiValued="false"/>

But more likely, you will want to allow some variation in what counts as "exact", in which case you will need to make a new fieldType, probably using the Keyword Tokenizer, which will not break the exact name into multiple indexed tokens, but keep it as a single token. For example:

<fieldType name="exactish" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="exactName" type="exactish" indexed="true" stored="false" multiValued="false"/>

This very basic example only includes the Keyword Tokenizer to keep the whole name as a single token, and the Lower Case Filter to make sure that the difference between upper and lower case is not relevant. If you want your exact match to be forgiving of any other conditions, you would need to modify the analysis for the fieldType.
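
For instance, if you decided that "dr dre" (no period) and " dr. dre" (stray leading space) should both count as exact, one possible variation is sketched below. The type name exactish_loose and the specific patterns are my own assumptions for illustration; adjust them to whatever "close enough" means for your data.

```xml
<!-- schema.xml: hypothetical, more forgiving variant of the exactish type -->
<fieldType name="exactish_loose" class="solr.TextField">
  <analyzer>
    <!-- keep the whole name as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- drop periods so "dr. dre" and "dr dre" index identically -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="\." replacement="" replace="all"/>
    <!-- collapse runs of whitespace into a single space -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
    <!-- remove leading and trailing whitespace -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the analyzer has no type attribute, the same normalization is applied at both index time and query time, so the two sides stay comparable.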

Important: when searching against a string field, or a text field that uses the Keyword Tokenizer, it's a good idea to make sure that the searches you send to Solr always have quotes around them (i.e. a phrase search). Otherwise, the query parser will break your search into individual terms before it is ever compared to the field, and no single term is likely to match the entire indexed value. The result can be that you never find any matches in the field at all, except for values that contain no spaces. This is not an issue if you just use norms to control relevance in a TextField with more standard tokenization.

answered Nov 18 '22 at 09:11 by frances