Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Scoring of solr multivalued field

Tags:

solr

lucene

If I have a document with a multivalued field in Solr are the multiple values scored independently or just concatenated and scored as one big field? I'm hoping they're scored independently. Here's an example of what I mean:

I have a document with a field for a person's name, where there may be multiple names for the same person. The names are all different (very different in some cases) but they all are the same person/document.

Person 1: David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke

Person 2: David Letterman

Person 3: David Hasselhoff, David Michael Hasselhoff

If I were to search for "David" I'd like for all of these to have about the same chance of a match. If each name is scored independently that would seem to be the case. If they are just stored and searched as a single field, David Bowie would be punished for having many more tokens than the others. How does Solr handle this scenario?

like image 502
user605331 Avatar asked Feb 13 '12 13:02

user605331


People also ask

What is multivalued field in SOLR?

A multivalued field is useful when there are more than one value present for the field. An easy example would be tags, there can be multiple tags that need to be indexed. so if we have tags field as multivalued then solr response will return a list instead of a string value.

What is score in SOLR?

Solr in Action In an SQL query to a relational database, a row either matches a query or it doesn't, and results are sorted based on one or more of the columns. A search engine returns documents sorted in descending order by a score that indicates the strength of the match of the document to the query.


1 Answers

You can just run your query q=field_name:David with debugQuery=on and see what happens.

These are the results (included the score through fl=*,score) sorted by score desc:

<doc>
    <float name="score">0.4451987</float>
    <str name="id">2</str>
    <arr name="text_ws">
        <str>David Letterman</str>
    </arr>
</doc>
<doc>
    <float name="score">0.44072422</float>
    <str name="id">3</str>
    <arr name="text_ws">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.314803</float>
    <str name="id">1</str>
    <arr name="text_ws">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>

And this is the explanation:

<lst name="explain">
    <str name="2">
        0.4451987 = (MATCH) fieldWeight(text_ws:David in 1), product of: 1.0 = tf(termFreq(text_ws:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.625 = fieldNorm(field=text_ws, doc=1)
    </str>
    <str name="3">
        0.44072422 = (MATCH) fieldWeight(text_ws:David in 2), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.4375 = fieldNorm(field=text_ws, doc=2)
    </str>
    <str name="1">
        0.314803 = (MATCH) fieldWeight(text_ws:David in 0), product of: 1.4142135 = tf(termFreq(text_ws:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 0.3125 = fieldNorm(field=text_ws, doc=0)
    </str>
</lst>

The scoring factors here are:

  • termFreq: how often a term appears in the document
  • idf: how often the term appears across the index
  • fieldNorm: importance of the term, depending on index-time boosting and field length

In your example the fieldNorm makes the difference. You have one document with lower termFreq (1 instead of 1.4142135) since the term appears just one time, but that match is more important because of the field length.

The fact that your field is multiValued doesn't change the scoring. I guess it would be the same with a single value field with the same content. Solr works in terms of field length and terms, so, yes, David Bowie is punished for having many more tokens than the others. :)

UPDATE
I actually think David Bowie deserves his opportunity. Like explained above, the fieldNorm makes the difference. Add the attribute omitNorms=true to your text_ws field in the schema.xml and reindex. The same query will give you the following result:

<doc>
    <float name="score">1.0073696</float>
    <str name="id">1</str>
    <arr name="text">
        <str>David Bowie</str>
        <str>David Robert Jones</str>
        <str>Ziggy Stardust</str>
        <str>Thin White Duke</str>
    </arr>
</doc>
<doc>
    <float name="score">1.0073696</float>
    <str name="id">3</str>
    <arr name="text">
        <str>David Hasselhoff</str>
        <str>David Michael Hasselhoff</str>
    </arr>
</doc>
<doc>
    <float name="score">0.71231794</float>
    <str name="id">2</str>
    <arr name="text">
        <str>David Letterman</str>
    </arr>
</doc>

As you can see now the termFreq wins and the fieldNorm is not taken into account at all. That's why the two documents with two David occurences are on top and with the same score, despite of their different lengths, and the shorter document with just one match is the last one with the lowest score. Here's the explanation with debugQuery=on:

<lst name="explain">
   <str name="1">
      1.0073696 = (MATCH) fieldWeight(text:David in 0), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=0)
   </str>
   <str name="3">
      1.0073696 = (MATCH) fieldWeight(text:David in 2), product of: 1.4142135 = tf(termFreq(text:David)=2) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=2)
   </str>
   <str name="2">
      0.71231794 = (MATCH) fieldWeight(text:David in 1), product of: 1.0 = tf(termFreq(text:David)=1) 0.71231794 = idf(docFreq=3, maxDocs=3) 1.0 = fieldNorm(field=text, doc=1)
   </str>
</lst>
like image 93
javanna Avatar answered Nov 11 '22 06:11

javanna