
Elastic Search - Boosting/Scoring - Two words with different length

When querying a field with the query 'text' and finding two documents with 'text abcd' and 'text ab', they both get the same score.

Is there a way to increase the score for 'text ab', since it is shorter?

Deepak asked Oct 08 '14 11:10


1 Answer

This seems to be predicated on a misconception of what length refers to in Lucene scoring. It's useful to think of tokens, rather than characters, as the atomic unit of indexed text. The length Lucene considers in scoring is the number of tokens in the field. Both of the fields you've indicated have exactly two tokens. They have the same length, so their length norms are equal and have no effect on their relative scores.

If you had a field with more terms, you would see a score impact from the length:

  • Field: "text ab" -- lengthnorm = 1/√2 ≈ 0.7
  • Field: "text abcd" -- lengthnorm = 1/√2 ≈ 0.7
  • Field: "text abc def ghi" -- lengthnorm = 1/√4 = 0.5

That norm is multiplied into the score, so the last document listed there will have a bit lower score.
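
To make the token-based arithmetic concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes a whitespace tokenizer and ignores boosts and the one-byte norm encoding; the class and method names are made up for illustration.

```java
// Illustrative only: mimics the 1/sqrt(numTokens) length norm described above.
// TokenLengthNorm and its methods are hypothetical names, not Lucene classes.
class TokenLengthNorm {

    // Count tokens by splitting on whitespace (an assumption about the analyzer).
    static int tokenCount(String field) {
        String trimmed = field.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Length norm before any compression: 1 / sqrt(number of tokens).
    static float lengthNorm(String field) {
        return (float) (1.0 / Math.sqrt(tokenCount(field)));
    }

    public static void main(String[] args) {
        for (String field : new String[] {"text ab", "text abcd", "text abc def ghi"}) {
            System.out.println(field + " -> " + lengthNorm(field));
        }
    }
}
```

The first two fields produce the same norm because both split into two tokens; only the four-token field gets a smaller value.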


If you're not sold on the idea of thinking of content in units of terms rather than characters:

Since the length you are considering works on characters, implementing this definitely goes against the grain somewhat. You are on the right track thinking about norms, though. This should definitely be preprocessed at index time and stored as a norm.

You will need to implement this in a custom similarity class. I'll assume we like the rest of DefaultSimilarity, so you can extend it and override lengthNorm to keep this simple. You can pretty easily leverage the field's offset, which counts characters rather than tokens, to get:

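import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class MySimilarity extends DefaultSimilarity {
    @Override
    public float lengthNorm(FieldInvertState state) {
        // getOffset() is a character offset, so longer text yields a smaller norm
        return state.getBoost() * ((float) (1.0 / Math.sqrt(state.getOffset())));
    }
}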

And there you have it. A test run for the documents and query given shows:

  • Field: "text ab" -- total score = 0.18579213
  • Field: "text abcd" -- total score = 0.18579213
  • Field: "text abcdefghi" -- total score = 0.1486337

So, you can see from the longer document I added that it is working. Why, then, do "text ab" and "text abcd" still have the same score?

Norms are stored in a hyper-compressed form, in a single byte. They only have a 3-bit mantissa, which gives them slightly less than 1 decimal digit of precision. As such, the difference from only those two added characters is just not enough to matter given the compression scheme. When it comes to this sort of boosting, the common wisdom is: "Only big differences matter" (see the DefaultSimilarity documentation).
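
To see how the single-byte encoding can swallow a small difference, here is a rough sketch in plain Java. It is modeled on my reading of Lucene's SmallFloat byte encoding, so treat the bit layout as an assumption rather than the library's exact code. The offsets 8, 10 and 15 are also an assumption (character length plus one); they are used because they reproduce the 0.8 ratio between the scores listed above.

```java
// Sketch of a one-byte float encoding in the style of Lucene's SmallFloat.
// ByteNormSketch is a hypothetical name; the bit layout is an assumption.
class ByteNormSketch {

    private static final int SHIFT = 24 - 3;         // keep only the top bits of the float
    private static final int ZERO = (63 - 15) << 3;  // offset that re-centres the exponent

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> SHIFT;
        if (small <= ZERO) {
            return (bits <= 0) ? (byte) 0 : (byte) 1; // zero, negative, or underflow
        }
        if (small >= ZERO + 0x100) {
            return (byte) -1;                         // overflow: largest encodable value
        }
        return (byte) (small - ZERO);
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << SHIFT;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Character-based norm as in lengthNorm above, with the boost assumed to be 1.
    static float charNorm(int offset) {
        return (float) (1.0 / Math.sqrt(offset));
    }

    public static void main(String[] args) {
        // Assumed offsets for "text ab", "text abcd" and "text abcdefghi".
        for (int offset : new int[] {8, 10, 15}) {
            float norm = charNorm(offset);
            System.out.println(offset + ": " + norm + " -> " + decode(encode(norm)));
        }
    }
}
```

Under these assumptions the first two norms land in the same bucket, while the third falls into the next one down.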


So, "Who cares about saving some memory at search time? Small differences matter to me!", I hear you say.

All right, you'll need to override encodeNormValue and decodeNormValue. Since these are final in DefaultSimilarity, you'll instead need to extend TFIDFSimilarity. I'd just start by copying the source of DefaultSimilarity. In the end you could use something like this:

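import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.TFIDFSimilarity;
import org.apache.lucene.util.BytesRef;

public class MySimilarity extends TFIDFSimilarity {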

    public MySimilarity() {}

    @Override
    public float coord(int overlap, int maxOverlap) {
        return overlap / (float)maxOverlap;
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    //Since length norms are generally going to leave us with results less than one, multiply
    //by a sufficiently large number to not lose all our precision when casting to long
    private static final float NORM_ADJUSTMENT = Integer.MAX_VALUE;

    @Override
    public final long encodeNormValue(float f) {
        return (long) (f * NORM_ADJUSTMENT);
    }

    @Override
    public final float decodeNormValue(long norm) {
        return ((float) norm) / NORM_ADJUSTMENT;
    }

    @Override
    public float lengthNorm(FieldInvertState state) {
        return state.getBoost() * ((float) (1.0 / Math.sqrt(state.getOffset())));
    }

    @Override
    public float tf(float freq) {
        return (float)Math.sqrt(freq);
    }

    @Override
    public float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        return 1;
    }

    @Override
    public float idf(long docFreq, long numDocs) {
        return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
    }

    @Override
    public String toString() {
        return "MySimilarity";
    }
}

And now I get:

  • Field: "text ab" -- total score = 0.2518424
  • Field: "text abcd" -- total score = 0.22525471
  • Field: "text abcdefghi" -- total score = 0.1839197
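
As a sanity check on the long-based encoding used in MySimilarity, here is a small stand-alone sketch in plain Java that round-trips a few norms through the same scaling. The class name is hypothetical, and the offsets 8, 10 and 15 are assumed values (character length plus one) rather than something read back from an index.

```java
// Sketch of the long-based norm encoding from MySimilarity, outside Lucene.
// LongNormSketch is a hypothetical name used only for this illustration.
class LongNormSketch {

    // Same scale factor as in MySimilarity; converted to float this is 2^31.
    private static final float NORM_ADJUSTMENT = Integer.MAX_VALUE;

    static long encode(float f) {
        return (long) (f * NORM_ADJUSTMENT);
    }

    static float decode(long norm) {
        return ((float) norm) / NORM_ADJUSTMENT;
    }

    // Character-based norm with the boost assumed to be 1.
    static float charNorm(int offset) {
        return (float) (1.0 / Math.sqrt(offset));
    }

    public static void main(String[] args) {
        // Assumed offsets for "text ab", "text abcd" and "text abcdefghi".
        for (int offset : new int[] {8, 10, 15}) {
            float norm = charNorm(offset);
            System.out.println(offset + ": " + norm + " -> " + decode(encode(norm)));
        }
    }
}
```

With this encoding each of the three norms keeps a distinct value, which is consistent with the three distinct scores listed above.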
femtoRgon answered Sep 30 '22 19:09