
Solr: Scoring exact matches higher than partial matches

In a very simple case, I have three documents with the filenames "Lark", "Larker", and "Larking" (no file extension). In Solr, I index these three documents, mapping the filename to a "title" field. When I search for "Lark", all three documents are returned (which is what I want), but they all get the same score. I would prefer that "Lark" score highest, since it is an exact match for my query, with the others ranked below it.

<field name="title" type="text_general" indexed="true" stored="true" multiValued="false"/>

 

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I believe they get the same score because of the EdgeNGramFilterFactory applied at index time. Each title is indexed as the lowercased prefixes "la", "lar", and "lark", and two of the documents ("Larker" and "Larking") are indexed with some longer prefixes as well. In effect, every document is an exact match for the query "Lark". I would like a way to run a query for "Lark" that returns all three documents but ranks the document titled "Lark" above the others.

Results of query debug:

<lst name="debug">
  <str name="rawquerystring">Lark</str>
  <str name="querystring">Lark</str>
  <str name="parsedquery">text:lark</str>
  <str name="parsedquery_toString">text:lark</str>
  <lst name="explain">
    <str name="543d6ee4cbb33c26bbcf288b/xxnullxx/543d6ef9cbb33c26bbcf2892">
2.7104912 = (MATCH) weight(text:lark in 0) [DefaultSimilarity], result of:
  2.7104912 = fieldWeight in 0, product of:
    1.4142135 = tf(freq=2.0), with freq of:
      2.0 = termFreq=2.0
    3.8332133 = idf(docFreq=3, maxDocs=68)
    0.5 = fieldNorm(doc=0)
</str>
    <str name="543d6ee4cbb33c26bbcf288b/xxnullxx/543d6ef9cbb33c26bbcf2893">
2.7104912 = (MATCH) weight(text:lark in 1) [DefaultSimilarity], result of:
  2.7104912 = fieldWeight in 1, product of:
    1.4142135 = tf(freq=2.0), with freq of:
      2.0 = termFreq=2.0
    3.8332133 = idf(docFreq=3, maxDocs=68)
    0.5 = fieldNorm(doc=1)
</str>
    <str name="543d6ee4cbb33c26bbcf288b/xxnullxx/543d6ef9cbb33c26bbcf2894">
2.7104912 = (MATCH) weight(text:lark in 2) [DefaultSimilarity], result of:
  2.7104912 = fieldWeight in 2, product of:
    1.4142135 = tf(freq=2.0), with freq of:
      2.0 = termFreq=2.0
    3.8332133 = idf(docFreq=3, maxDocs=68)
    0.5 = fieldNorm(doc=2)
</str>
  </lst>
</lst>
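
The explain output already shows why the three scores tie. Assuming the classic DefaultSimilarity formulas (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))), every factor is identical for each document:

```latex
\begin{aligned}
\mathrm{tf} &= \sqrt{2.0} \approx 1.4142135 \\
\mathrm{idf} &= 1 + \ln\frac{68}{3 + 1} \approx 3.8332133 \\
\mathrm{score} &= \mathrm{tf} \times \mathrm{idf} \times \mathrm{fieldNorm}
  = 1.4142135 \times 3.8332133 \times 0.5 \approx 2.7104912
\end{aligned}
```

Note also that the parsed query is text:lark, so the match is scored against a field named "text" rather than "title". How "text" is populated is not shown here.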
Mike Nitchie asked Oct 14 '14 15:10



2 Answers

To boost the exact matches, you could create a new field, called "exact_title", with a new type "text_exact" that doesn't have the EdgeNGramFilterFactory.

In your schema you can use the line:

<copyField source="title" dest="exact_title"/> 

to copy title to exact_title.

Then run your query against both fields, title and exact_title. A query for "Lark" matches all three documents in title, but only the document titled "Lark" in exact_title, so that document picks up an additional score contribution and rises to the top.
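
A minimal sketch of what this could look like, under some assumptions: the names "text_exact" and "exact_title" come from the suggestion above, but the analyzer chain (standard tokenizer plus lowercasing), the use of the edismax parser, and the boost of 10 are illustrative choices, not something the question specifies.

```xml
<!-- schema.xml: a field type without EdgeNGramFilterFactory (analyzer chain is an assumption) -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Only whole words are indexed here: "lark", "larker", "larking" -->
<field name="exact_title" type="text_exact" indexed="true" stored="false" multiValued="false"/>
<copyField source="title" dest="exact_title"/>

<!-- solrconfig.xml: search both fields, weighting exact_title higher (boost value is arbitrary) -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title exact_title^10</str>
  </lst>
</requestHandler>
```

The same effect can be had per request by passing defType=edismax and qf=title exact_title^10 as query parameters instead of handler defaults. The documents need to be reindexed after the schema change so that exact_title is populated.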

Yann answered Oct 25 '22 05:10



Maybe late, but you can also use KeywordRepeatFilterFactory without creating a new field. This is how the Solr documentation describes it:

A repeated question is "how can I have the original term contribute more to the score than the stemmed version"? In Solr 4.3, the KeywordRepeatFilterFactory has been added to assist this functionality. This filter emits two tokens for each input token, one of them is marked with the Keyword attribute. Stemmers that respect keyword attributes will pass through the token so marked without change. So the effect of this filter would be to index both the original word and the stemmed version.
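
A minimal sketch of the analyzer chain the quoted passage describes, assuming a Porter stemmer as the keyword-aware filter; the field type name "text_stem_keep_original" is hypothetical. The passage is about stemmers, and as far as I know EdgeNGramFilterFactory does not check the keyword attribute, so this may not by itself change the scores in the n-gram setup from the question.

```xml
<fieldType name="text_stem_keep_original" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits each token twice; one copy carries the Keyword attribute -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <!-- Stems the unmarked copy and passes the keyword copy through unchanged -->
    <filter class="solr.PorterStemFilterFactory"/>
    <!-- Drops the second copy when stemming left the token unchanged -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```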

alexf answered Oct 25 '22 05:10
