When querying a field with the query 'text', and finding two documents with 'text abcd' and 'text ab', they both get the same score.
Is there a way to increase the score for 'text ab', since it is shorter?
This seems to be predicated on a misconception of what length means in Lucene scoring. It's useful to think of tokens, rather than characters, as the atomic unit of indexed text. The length Lucene considers in scoring is the number of tokens in the field. Both of the field values you've given have exactly two tokens. They have the same length, so their length norms are also equal, and they don't affect the relative scoring.
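If you want to convince yourself of that, here is a small sketch (assuming Lucene 4.x and StandardAnalyzer; the class and field names are just placeholders) that counts the tokens the analyzer actually emits for each value:

public class TokenCountDemo {

    // Counts how many tokens the analyzer emits for the given text.
    static int countTokens(StandardAnalyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("content", new StringReader(text))) {
            ts.reset();
            int count = 0;
            while (ts.incrementToken()) {
                count++;
            }
            ts.end();
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
        System.out.println(countTokens(analyzer, "text abcd")); // 2 tokens
        System.out.println(countTokens(analyzer, "text ab"));   // 2 tokens
    }
}

(Imports assumed: java.io.IOException, java.io.StringReader, org.apache.lucene.analysis.TokenStream, org.apache.lucene.analysis.standard.StandardAnalyzer, org.apache.lucene.util.Version.)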
If you had a field with three terms, you would actually see a score impact from the length.
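To make the arithmetic concrete, here is a sketch of DefaultSimilarity's stock length norm, boost * (1 / sqrt(numTerms)) — these are the raw values before the one-byte norm encoding discussed further down, not output from the actual test run:

public class LengthNormArithmetic {
    public static void main(String[] args) {
        // DefaultSimilarity computes lengthNorm as boost * (1 / sqrt(numTerms)).
        // With the default boost of 1.0f:
        System.out.println(1.0f / (float) Math.sqrt(2)); // two-term field,   ~0.707
        System.out.println(1.0f / (float) Math.sqrt(3)); // three-term field, ~0.577
    }
}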
That norm is multiplied into the score, so the longer, three-term document ends up with a slightly lower score.
If you're not sold on the idea of thinking of content in units of terms rather than characters:
Since the length you have in mind is measured in characters, implementing this goes somewhat against the grain. You are on the right track thinking about norms, though: this should definitely be computed at index time and stored as a norm.
You will need to implement this in a custom Similarity class. I'll assume we like the rest of DefaultSimilarity, so you can extend it and override lengthNorm to keep this simple. You can fairly easily leverage the field's final token offset:
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

public class MySimilarity extends DefaultSimilarity {
    @Override
    public float lengthNorm(FieldInvertState state) {
        // Base the norm on the end offset of the last token (roughly the character
        // length of the field) instead of the token count.
        return state.getBoost() * ((float) (1.0 / Math.sqrt(state.getOffset())));
    }
}
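For completeness, a sketch of how a custom similarity like this would typically be wired in (assuming Lucene 4.x; the field name and sample document are just placeholders). It needs to be set on the IndexWriterConfig, so norms are computed with it at index time, and on the IndexSearcher, so it is used when scoring:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SimilarityWiring {
    public static void main(String[] args) throws Exception {
        Directory directory = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);

        // Index time: norms are computed and stored using the custom similarity.
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
        config.setSimilarity(new MySimilarity());
        try (IndexWriter writer = new IndexWriter(directory, config)) {
            Document doc = new Document();
            doc.add(new TextField("content", "text abcd", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search time: the same similarity must be set on the searcher.
        try (DirectoryReader reader = DirectoryReader.open(directory)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            searcher.setSimilarity(new MySimilarity());
            // ... run queries against searcher ...
        }
    }
}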
And there you have it. A test run for the documents and query given shows:
So, you can see from the longer document I added that it is working. Then why do "text ab" and "text abcd" still have the same score?
Norms are stored in a hyper-compressed form, in a single byte. They only have a 3-bit mantissa, which gives them slightly less than 1 decimal digit of precision. As such, the difference with only those two added characters is just not enough to matter given the compression scheme. When it comes to this sort of boosting, common wisdom is: "Only big differences matter" (see the DefaultSimilarity documentation).
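If you want to see that compression in action, here is a sketch that round-trips a few values through SmallFloat.floatToByte315, the single-byte encoding DefaultSimilarity uses for norms; values that sit close together frequently come back as the same number:

import org.apache.lucene.util.SmallFloat;

public class NormPrecisionDemo {
    public static void main(String[] args) {
        // DefaultSimilarity stores norms via SmallFloat.floatToByte315, a one-byte
        // float with a 3-bit mantissa. Round-tripping a few nearby values shows
        // how little precision survives the encoding.
        for (float f : new float[] {0.70f, 0.69f, 0.68f, 0.67f, 0.66f}) {
            byte encoded = SmallFloat.floatToByte315(f);
            System.out.println(f + " -> " + SmallFloat.byte315ToFloat(encoded));
        }
    }
}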
So, "Who cares about saving some memory at search time? Small differences matter to me!", I hear you say.
All right, then you'll need to override encodeNormValue and decodeNormValue. Since these are final in DefaultSimilarity, you'll instead need to extend TFIDFSimilarity. I'd start by copying the source of DefaultSimilarity. In the end you could use something like this:
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.TFIDFSimilarity;
import org.apache.lucene.util.BytesRef;

public class MySimilarity extends TFIDFSimilarity {

    public MySimilarity() {}

    @Override
    public float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Since length norms are generally going to leave us with results less than one,
    // multiply by a sufficiently large number to not lose all our precision when
    // casting to long.
    private static final float NORM_ADJUSTMENT = Integer.MAX_VALUE;

    @Override
    public final long encodeNormValue(float f) {
        return (long) (f * NORM_ADJUSTMENT);
    }

    @Override
    public final float decodeNormValue(long norm) {
        System.out.println(norm); // debug output, handy for checking the stored norms
        return ((float) norm) / NORM_ADJUSTMENT;
    }

    @Override
    public float lengthNorm(FieldInvertState state) {
        // Character-based length norm, using the end offset of the last token.
        return state.getBoost() * ((float) (1.0 / Math.sqrt(state.getOffset())));
    }

    @Override
    public float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    @Override
    public float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    @Override
    public float scorePayload(int doc, int start, int end, BytesRef payload) {
        return 1;
    }

    @Override
    public float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    @Override
    public String toString() {
        return "MySimilarity";
    }
}
And now I get: