
Solr MoreLikeThis boosting query fields

I am experimenting with Solr's MoreLikeThis feature.

My schema deals with articles, and I'm looking for similarities between articles within three fields: articletitle, articletext and topic.

The following query works well:

q=id:(2e2ec74c-7c26-49c9-b359-31a11ea50453)
&rows=100000000&mlt=true
&mlt.fl=articletext,articletitle,topic&mlt.boost=true&mlt.mindf=1&mlt.mintf=1

But I would like to experiment with boosting different query fields - i.e. putting more weight on similarities in the articletitle, for instance.

The documentation (http://wiki.apache.org/solr/MoreLikeThis) suggests that this can be achieved by including the mlt.qf property, with some boosting.

My attempt at such a query is as follows:

q=id:(2e2ec74c-7c26-49c9-b359-31a11ea50453)&rows=100000000&mlt=true
&mlt.fl=articletext,articletitle,topic&mlt.boost=true
&mlt.mindf=1&mlt.mintf=1
&mlt.qf=articletext^0.1 articletitle^100 topic^0.1

However, the boosts seem to have no effect: no matter what boosts I supply, the recommendations remain the same. I would expect the above query to heavily favour similarities in the titles, but this doesn't seem to be happening.

I can't find any examples in the documentation that use MoreLikeThis in this way, which leads me to believe I've got something wrong.

Has anyone managed to achieve something like this?

JBradshaw asked Dec 17 '13 22:12


1 Answer

The MLT component is useful if you have simple recommendation requirements where you have only one field to match on, or several of equal importance. But any time you want to vary the relative importance of the different fields, or need to do something more specific like include an inverse distance boost, you will probably want to write your own pseudo MLT handler.

All the MLT handler does is generate the top terms from the specified fields, based on their tf.idf scores in the source document. You can easily emulate that functionality in some code that generates a custom Solr OR query. You will lose the advantage of the term vectors, but so long as your queries are reasonably sized (say fewer than 20 terms) it will probably perform pretty well.

We have a small index, so we generate our own MLT queries with several hundred terms, and they execute in an acceptable amount of time (a few ms). However, I have seen performance deteriorate somewhat on large indexes with a few hundred million documents and larger fields, and in those cases you need to restrict your query to a small number of top terms. Using your own code in place of MLT is more work, but you gain a lot in flexibility.
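As a rough illustration of the "pseudo MLT" idea, here is a minimal Python sketch that picks the top terms per field by a simplified tf.idf score and assembles a boosted OR query. The function names (`top_terms`, `build_mlt_query`), the tokenizer, the idf formula and the document-frequency tables are all assumptions made for the example; they are not Solr's own scoring or API, and in practice the document frequencies would have to come from your index.

```python
import math
import re
from collections import Counter


def top_terms(text, doc_freq, num_docs, limit=5):
    """Rank the terms of one field by a simplified tf * idf; keep the best `limit`."""
    # Assumption: a plain lowercase alphanumeric tokenizer, not the field's real analyzer.
    term_counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    scored = []
    for term, count in term_counts.items():
        df = doc_freq.get(term, 1)  # unknown terms are treated as very rare
        idf = math.log(num_docs / df) + 1.0  # simplified idf, not Lucene's formula
        scored.append((count * idf, term))
    # Highest score first; ties broken alphabetically so the output is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:limit]]


def build_mlt_query(source_doc, field_boosts, doc_freqs, num_docs, limit=5):
    """Build one boosted OR clause per field and join the clauses with OR."""
    clauses = []
    for field, boost in field_boosts.items():
        terms = top_terms(
            source_doc.get(field, ""), doc_freqs.get(field, {}), num_docs, limit
        )
        if terms:
            clauses.append("{}:({})^{}".format(field, " OR ".join(terms), boost))
    return " OR ".join(clauses)


# Hypothetical source document and per-field document frequencies.
source_doc = {
    "articletitle": "Solr boosting tips",
    "articletext": "solr boosting solr fields",
}
doc_freqs = {
    "articletitle": {"solr": 100, "boosting": 10, "tips": 50},
    "articletext": {"solr": 500, "boosting": 20, "fields": 200},
}
field_boosts = {"articletitle": 100, "articletext": 0.1}

query = build_mlt_query(source_doc, field_boosts, doc_freqs, num_docs=1000, limit=2)
print(query)
```

The resulting string would be sent as the `q` parameter of an ordinary select request, probably together with a filter such as `fq=-id:<source id>` so the source article does not recommend itself. Because the boosts are attached per field clause, changing the weight of `articletitle` relative to `articletext` is just a matter of changing the numbers in `field_boosts`.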

Simon answered Sep 19 '22 18:09