
Most effective similarity measure for list-ranked items

We're trying to find similarity between items (and later users) where the items are ranked in various lists by users (think Rob, Barry and Dick in High Fidelity). A lower index in a given list implies a higher rating.

I suppose a standard approach would be to use the Pearson correlation and then invert the indexes in some way.

However, as I understand it, the aim of the Pearson correlation is to compensate for differences between users who typically rate things higher or lower but have similar relative ratings.
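To make that property concrete, here is a minimal sketch with two made-up users whose ratings follow the same pattern but are offset by a constant. The rating values and the helper names `pearson` and `euclidean` are illustrative assumptions, not part of any particular library.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def euclidean(xs, ys):
    """Euclidean distance between two equal-length sequences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)))

# Hypothetical ratings of the same four items by two users.
harsh = [1, 2, 3, 2]      # tends to rate low
generous = [3, 4, 5, 4]   # same relative pattern, shifted up by 2

print(pearson(harsh, generous))    # 1.0: the constant offset is ignored
print(euclidean(harsh, generous))  # 4.0: the constant offset counts as distance
```

Pearson treats the two users as identical because it only looks at deviations from each user's own mean, whereas the Euclidean distance grows with the offset.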

It seems to me that if the lists are continuous (although of arbitrary length) it's not an issue that the ratings implied from the position will be skewed in this way.

I suppose in this case a Euclidean-based similarity would suffice. Is this the case? Would using the Pearson correlation have a negative effect and find correlation that isn't appropriate? What similarity measure might best suit this data?

Additionally, while we want position in the list to have an effect, we don't want to penalise rankings that are too far apart. Two users who both feature an item in a list, even with very different rankings, should still be considered similar.

Tom Martin asked Oct 17 '12 12:10



1 Answer

Jaccard Similarity looks better in your case. To include the rank you mentioned, you can take a bag-of-items approach.

Using your example of (Rob, Barry, Dick) with ratings of (3, 2, 1) respectively, you insert Rob 3 times into user a's bag:

Rob, Rob, Rob.

Then you insert Barry twice. The bag now looks like this:

Rob, Rob, Rob, Barry, Barry.

Finally, you put Dick into the bag once:

Rob, Rob, Rob, Barry, Barry, Dick
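The bag construction above can be sketched in Python with `collections.Counter` acting as the multiset. The helper name `bag_from_ranking` and the weighting rule (an item at index i in a list of n unique items is inserted n - i times) are assumptions chosen to reproduce the (3, 2, 1) example; other weightings are possible.

```python
from collections import Counter

def bag_from_ranking(ranked_items):
    """Build a bag (multiset) from a ranked list, best item first.

    Assumption: the item at index i in a list of n unique items gets
    weight n - i, so the top item is inserted n times and the last once.
    """
    n = len(ranked_items)
    return Counter({item: n - index for index, item in enumerate(ranked_items)})

bag_a = bag_from_ranking(["Rob", "Barry", "Dick"])
print(list(bag_a.elements()))
# ['Rob', 'Rob', 'Rob', 'Barry', 'Barry', 'Dick']
```

Note that with this rule the weight of the top item depends on the list length, so for lists of very different lengths you may want to cap or normalise the weights.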

Suppose another user b has the bag [Dick, Dick, Barry]. You calculate the Jaccard similarity as follows:

  • The intersection of a and b = [Dick, Barry] (each item taken with the smaller of its two counts)
  • The union of a and b = [Rob, Rob, Rob, Barry, Barry, Dick, Dick] (each item taken with the larger of its two counts)
  • The Jaccard similarity = 2/7,

that is, the number of items in the intersection divided by the number of items in the union.
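As a rough sketch, the same calculation can be done with `collections.Counter`, whose `&` and `|` operators take the minimum and maximum count per item. The function name `bag_jaccard` is made up for this illustration, and returning 0.0 for two empty bags is an assumption.

```python
from collections import Counter

def bag_jaccard(bag_a, bag_b):
    """Jaccard similarity of two bags (multisets) stored as Counters."""
    intersection = bag_a & bag_b   # per-item minimum of the counts
    union = bag_a | bag_b          # per-item maximum of the counts
    union_size = sum(union.values())
    if union_size == 0:
        return 0.0                 # assumption: two empty bags score 0
    return sum(intersection.values()) / union_size

bag_a = Counter({"Rob": 3, "Barry": 2, "Dick": 1})
bag_b = Counter({"Dick": 2, "Barry": 1})

print(bag_jaccard(bag_a, bag_b))  # 2/7, roughly 0.2857
```

Dick is ranked last by user a and first by user b, yet he still contributes one element to the intersection.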

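This similarity measure does not heavily penalize rankings that are far apart: an item that appears in both lists always adds at least one element to the intersection, whatever its positions. That matches your requirement: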

Two users both featuring an item in a list with very differing ranking should still be considered similar.

greeness answered Oct 12 '22 10:10
