Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is wrong with the pearson algorithm from “Programming Collective Intelligence”?

This function is from the book "Programming Collective Intelligence”, and is supposed to calculate the Pearson correlation coefficient for p1 and p2, which is supposed to be a number between -1 and 1.

If two critics rate items very similarly the function should return 1, or close to 1.

With real user data I sometimes get weird results. In the following example the dataset critics2 should return 1 - instead it returns 0.

Does anyone spot a mistake?

(This is not a duplicate of What is wrong with this python function from “Programming Collective Intelligence”)

from __future__ import division
from math import sqrt

def sim_pearson(prefs,p1,p2):
    si={}
    for item in prefs[p1]: 
        if item in prefs[p2]: si[item]=1
    if len(si)==0: return 0
    n=len(si)
    sum1=sum([prefs[p1][it] for it in si])
    sum2=sum([prefs[p2][it] for it in si])
    sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
    sum2Sq=sum([pow(prefs[p2][it],2) for it in si]) 
    pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])
    num=pSum-(sum1*sum2/n)
    den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))
    if den==0: return 0
    r=num/den
    return r

critics = {
    'user1':{
        'item1': 3,
        'item2': 5,
        'item3': 5,
        },
    'user2':{
        'item1': 4,
        'item2': 5,
        'item3': 5,
        }
}
critics2 = {
    'user1':{
        'item1': 5,
        'item2': 5,
        'item3': 5,
        },
    'user2':{
        'item1': 5,
        'item2': 5,
        'item3': 5,
        }
}
critics3 = {
    'user1':{
        'item1': 1,
        'item2': 3,
        'item3': 5,
        },
    'user2':{
        'item1': 5,
        'item2': 3,
        'item3': 1,
        }
}

print sim_pearson(critics, 'user1', 'user2', )
result: 1.0 (expected)
print sim_pearson(critics2, 'user1', 'user2', )
result: 0 (unexpected)
print sim_pearson(critics3, 'user1', 'user2', )
result: -1 (expected)
like image 572
Hobhouse Avatar asked Nov 22 '09 10:11

Hobhouse


2 Answers

There is nothing wrong in your result. You are trying to plot a line through 3 points. In second case you have all three points with the same coordinates, i.e. effectively one point. You can't say do these points correlate or anti-correlate, because you can draw infinite number of lines through one point (den in your code equals to zero).

like image 162
Denis Otkidach Avatar answered Nov 14 '22 22:11

Denis Otkidach


If you look up Pearson correlation on wikipedia, you'll see that the formula uses the difference between each item in a series and the mean of the series. When all the items in the series are the same, you get division by zero, so your calculation fails.

If it is any clearer, you can use this code:

def simplified_sim_pearson(p1, p2):
    n = len(p1)
    assert (n != 0)
    sum1 = sum(p1)
    sum2 = sum(p2)
    m1 = float(sum1) / n
    m2 = float(sum2) / n
    p1mean = [(x - m1) for x in p1]
    p2mean = [(y - m2) for y in p2]
    numerator = sum(x * y for x, y in zip(p1mean, p2mean))
    denominator = math.sqrt(sum(x * x for x in p1mean) * sum(y * y for y in p2mean))
    return numerator / denominator if denominator else 0

def sim_pearson(prefs,p1,p2):
    p1 = prefs[p1]
    p2 = prefs[p2]
    si = set(p1.keys()).intersection(set(p2.keys()))
    p1_x = [p1[k] for k in sorted(si)]
    p2_x = [p2[k] for k in sorted(si)]
    return simplified_sim_pearson(p1_x, p2_x)



critics = {
    'user1':{
        'item1': 3,
        'item2': 5,
        'item3': 5,
        },
    'user2':{
        'item1': 4,
        'item2': 5,
        'item3': 5,
        }
}
critics2 = {
    'user1':{
        'item1': 5,
        'item2': 5,
        'item3': 5,
        },
    'user2':{
        'item1': 5,
        'item2': 5,
        'item3': 5,
        }
}
critics3 = {
    'user1':{
        'item1': 1,
        'item2': 3,
        'item3': 5,
        },
    'user2':{
        'item1': 5,
        'item2': 3,
        'item3': 1,
        }
}

print sim_pearson(critics, 'user1', 'user2', )
print sim_pearson(critics2, 'user1', 'user2', )
print sim_pearson(critics3, 'user1', 'user2', )

By the way, using Excel to determine the correct answer is a good way to validate most calculations. In this case, you would have used correl.

like image 28
hughdbrown Avatar answered Nov 14 '22 22:11

hughdbrown