This function is from the book "Programming Collective Intelligence", and is supposed to calculate the Pearson correlation coefficient for p1 and p2, which should be a number between -1 and 1.
If two critics rate items very similarly the function should return 1, or close to 1.
With real user data I sometimes get weird results. In the following example the dataset critics2 should return 1 - instead it returns 0.
Does anyone spot a mistake?
(This is not a duplicate of What is wrong with this python function from “Programming Collective Intelligence”)
from __future__ import division
from math import sqrt

def sim_pearson(prefs,p1,p2):
    si={}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item]=1
    if len(si)==0: return 0
    n=len(si)
    sum1=sum([prefs[p1][it] for it in si])
    sum2=sum([prefs[p2][it] for it in si])
    sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
    sum2Sq=sum([pow(prefs[p2][it],2) for it in si])
    pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])
    num=pSum-(sum1*sum2/n)
    den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
    if den==0: return 0
    r=num/den
    return r

critics = {
    'user1': {
        'item1': 3,
        'item2': 5,
        'item3': 5,
    },
    'user2': {
        'item1': 4,
        'item2': 5,
        'item3': 5,
    }
}

critics2 = {
    'user1': {
        'item1': 5,
        'item2': 5,
        'item3': 5,
    },
    'user2': {
        'item1': 5,
        'item2': 5,
        'item3': 5,
    }
}

critics3 = {
    'user1': {
        'item1': 1,
        'item2': 3,
        'item3': 5,
    },
    'user2': {
        'item1': 5,
        'item2': 3,
        'item3': 1,
    }
}

print(sim_pearson(critics, 'user1', 'user2'))
# result: 1.0 (expected)
print(sim_pearson(critics2, 'user1', 'user2'))
# result: 0 (unexpected)
print(sim_pearson(critics3, 'user1', 'user2'))
# result: -1 (expected)

There is nothing wrong with your result. You are trying to fit a line through three points. In the second case all three points have the same coordinates, i.e. they are effectively a single point. You can't say whether these points correlate or anti-correlate, because you can draw an infinite number of lines through one point (den in your code equals zero).
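
To see why den vanishes here, a minimal sketch that evaluates the two factors under the square root on their own. The helper name spread_terms is made up for this illustration and is not part of the book's code:

```python
from math import sqrt

def spread_terms(xs, ys):
    """Return the two factors multiplied under the square root in den.

    Hypothetical helper for illustration only.
    """
    n = len(xs)
    sx = sum(x * x for x in xs) - sum(xs) ** 2 / float(n)
    sy = sum(y * y for y in ys) - sum(ys) ** 2 / float(n)
    return sx, sy

# critics2: both users give every item a 5
sx, sy = spread_terms([5, 5, 5], [5, 5, 5])
den = sqrt(sx * sy)  # both factors are 0.0, so den is 0.0

# critics: the ratings vary, so both factors are positive
sx_ok, sy_ok = spread_terms([3, 5, 5], [4, 5, 5])
```

Each factor is n times the variance of one user's ratings, so a user who gives the same score to every shared item forces den to zero no matter what the other user did.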
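If you look up Pearson correlation on Wikipedia, you'll see that the formula uses the difference between each item in a series and the mean of the series. When all the items in a series are the same, every one of those differences is zero, so the denominator is zero and the coefficient is undefined. Your function does not actually divide by zero: the `if den==0: return 0` guard catches that case, which is where the 0 comes from.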
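
For reference, the standard definition written out. The symbols are chosen here for illustration: x_i and y_i are the two users' ratings of the i-th shared item, and the bars denote the means over the shared items.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

For critics2 every x_i equals \bar{x} = 5 (and likewise for y), so the numerator and both sums in the denominator are zero and r has the form 0/0.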
If it is any clearer, you can use this code:
import math

def simplified_sim_pearson(p1, p2):
    # p1 and p2 are equal-length lists of ratings for the same items
    n = len(p1)
    assert (n != 0)
    sum1 = sum(p1)
    sum2 = sum(p2)
    m1 = float(sum1) / n
    m2 = float(sum2) / n
    # deviations of each rating from its series mean
    p1mean = [(x - m1) for x in p1]
    p2mean = [(y - m2) for y in p2]
    numerator = sum(x * y for x, y in zip(p1mean, p2mean))
    denominator = math.sqrt(sum(x * x for x in p1mean) * sum(y * y for y in p2mean))
    # a constant series makes the denominator zero; return 0 in that case
    return numerator / denominator if denominator else 0

def sim_pearson(prefs,p1,p2):
    p1 = prefs[p1]
    p2 = prefs[p2]
    # items rated by both users, in a fixed order
    si = set(p1.keys()).intersection(set(p2.keys()))
    p1_x = [p1[k] for k in sorted(si)]
    p2_x = [p2[k] for k in sorted(si)]
    return simplified_sim_pearson(p1_x, p2_x)

critics = {
    'user1': {
        'item1': 3,
        'item2': 5,
        'item3': 5,
    },
    'user2': {
        'item1': 4,
        'item2': 5,
        'item3': 5,
    }
}

critics2 = {
    'user1': {
        'item1': 5,
        'item2': 5,
        'item3': 5,
    },
    'user2': {
        'item1': 5,
        'item2': 5,
        'item3': 5,
    }
}

critics3 = {
    'user1': {
        'item1': 1,
        'item2': 3,
        'item3': 5,
    },
    'user2': {
        'item1': 5,
        'item2': 3,
        'item3': 1,
    }
}

print(sim_pearson(critics, 'user1', 'user2'))
print(sim_pearson(critics2, 'user1', 'user2'))
print(sim_pearson(critics3, 'user1', 'user2'))

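
If returning 0 for the undefined case is confusing (0 normally means "no linear relationship"), one option is to return NaN instead so the caller can tell the two apart. This is only a sketch of that alternative, not what the book does, and the name pearson_or_nan is made up here:

```python
import math

def pearson_or_nan(xs, ys):
    """Pearson r for two equal-length sequences; NaN when it is undefined.

    Hypothetical variant for illustration only.
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        return float('nan')
    mx = sum(xs) / float(n)
    my = sum(ys) / float(n)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    den = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if den == 0:
        return float('nan')  # at least one series is constant
    return sum(a * b for a, b in zip(dx, dy)) / den

r_same = pearson_or_nan([5, 5, 5], [5, 5, 5])  # critics2 ratings
r_opp = pearson_or_nan([1, 3, 5], [5, 3, 1])   # critics3 ratings
print(math.isnan(r_same))
print(r_opp)
```

Whether NaN, 0, or some other value is right depends on how the similarity score is used downstream. For a recommender, treating two users who gave identical constant ratings as "unknown similarity" may or may not be what you want.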
By the way, using Excel to determine the correct answer is a good way to validate most calculations. In this case, you would have used the CORREL function.