
Combining TF-IDF (cosine similarity) with pagerank?

Given a query, I have a cosine score for a document. I also have the document's PageRank. Is there a standard, well-established way of combining the two?

I was thinking of multiplying them:

 Total_Score = cosine-score * pagerank

Because if either the PageRank or the cosine score gets too low, the document is not interesting.

Or is it preferable to have a weighted sum?

Total_Score = weight1 * cosine-score + weight2 * pagerank

Is this better? With a weighted sum, a document might have a zero cosine score but a high PageRank, and it would still show up among the results.
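To make the difference concrete, here is a small sketch comparing the two formulas. The weights (0.7 and 0.3), the document names, and the scores are made up for illustration, and it assumes both scores have already been normalized to the range 0 to 1.

```python
def product_score(cosine, pagerank):
    # Multiplicative combination: a zero in either score gives a total of zero.
    return cosine * pagerank


def weighted_sum_score(cosine, pagerank, w_cosine=0.7, w_pagerank=0.3):
    # Additive combination; the default weights are hypothetical.
    return w_cosine * cosine + w_pagerank * pagerank


# Hypothetical documents as (cosine score, pagerank), both assumed to be in [0, 1].
docs = {
    "relevant_but_obscure": (0.9, 0.1),
    "popular_but_off_topic": (0.0, 0.95),
    "balanced": (0.5, 0.5),
}

for name, (cosine, pagerank) in docs.items():
    print(name, product_score(cosine, pagerank), weighted_sum_score(cosine, pagerank))
```

The off-topic page gets a product of exactly zero but a nonzero weighted sum, which is the case I am worried about.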

user1506145 asked Feb 18 '13 16:02


2 Answers

The weighted sum is probably better as a ranking rule.

It helps to break the problem up into a retrieval/filtering step and a ranking step. The problem you describe with the weighted sum (an off-topic page with a high PageRank showing up in the results) then no longer arises.

The process outlined in this paper by Sergey Brin and Lawrence Page uses a variant of the vector/cosine model for retrieval and, it seems, some kind of weighted sum for the ranking, where the weights are determined by user activity (see section 4.5.1). Using this approach, a document with zero cosine would not get past the retrieval/filtering step and thus would not be considered for ranking.
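A minimal sketch of that two-step idea, not the paper's actual implementation: the function name, the fixed weights, and the `min_cosine` threshold are all assumptions made for illustration (in the paper the weights would come from user activity rather than being hard-coded), and both scores are assumed to be normalized to the range 0 to 1.

```python
def retrieve_and_rank(docs, w_cosine=0.7, w_pagerank=0.3, min_cosine=0.0):
    """docs is a list of (doc_id, cosine, pagerank) tuples.

    Step 1 (retrieval/filtering): keep only documents whose cosine score
    is above min_cosine, so zero-cosine documents never reach ranking.
    Step 2 (ranking): sort the survivors by a weighted sum, best first.
    The weights and the threshold are hypothetical values.
    """
    candidates = [d for d in docs if d[1] > min_cosine]
    return sorted(
        candidates,
        key=lambda d: w_cosine * d[1] + w_pagerank * d[2],
        reverse=True,
    )


# Hypothetical data: "b" has a high pagerank but zero cosine.
docs = [("a", 0.9, 0.1), ("b", 0.0, 0.95), ("c", 0.5, 0.5)]
ranked_ids = [doc_id for doc_id, _, _ in retrieve_and_rank(docs)]
print(ranked_ids)
```

Document "b" is dropped in the filtering step despite its high PageRank, so the weighted sum only ever compares documents that matched the query to some degree.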

Ryan Harmuth answered Oct 14 '22 14:10


You could consider using a harmonic mean. With a harmonic mean, the two scores are essentially averaged, but low scores drag the result down more than they would in a regular (arithmetic) average.

You could use:

Total_Score = 2*(cosine-score * pagerank) / (cosine-score + pagerank)

Let's say pagerank scored 0.1 and cosine 0.9. The normal average of these two numbers would be (0.1 + 0.9)/2 = 0.5, while the harmonic mean would be 2*(0.9*0.1)/(0.9 + 0.1) = 0.18.
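As a rough sketch, the formula could be written like this. The function name is made up, both scores are assumed to be non-negative, and the guard for two zero scores is my own addition to avoid dividing by zero.

```python
def harmonic_mean_score(cosine, pagerank):
    # Harmonic mean of two non-negative scores.
    total = cosine + pagerank
    if total == 0:
        # Both scores are zero; return 0 instead of dividing by zero.
        return 0.0
    return 2 * cosine * pagerank / total


arithmetic = (0.9 + 0.1) / 2
harmonic = harmonic_mean_score(0.9, 0.1)
print(arithmetic, harmonic)
```

Note that, like the plain product, the harmonic mean is zero whenever either score is zero, so a page with zero cosine would not be ranked above any page with a positive combined score.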

jksnw answered Oct 14 '22 14:10