Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

User based filtering:Recommendation system

I know this is not a coding specific problem but this is the most suitable place for asking such questions.So please bear with me.

Suppose I have a dictionary like given below,listing ten liked items of each person

likes={
    "rajat":{"music","x-men","programming","hindi","english","himesh","lil wayne","rap","travelling","coding"},
    "steve":{"travelling","pop","hanging out","friends","facebook","tv","skating","religion","english","chocolate"},
    "toby":{"programming","pop","rap","gardens","flowers","birthday","tv","summer","youtube","eminem"},
    "ravi":{"skating","opera","sony","apple","iphone","music","winter","mango shake","heart","microsoft"},
    "katy":{"music","pics","guitar","glamour","paris","fun","lip sticks","cute guys","rap","winter"},
    "paul":{"office","women","dress","casuals","action movies","fun","public speaking","microsoft","developer"},
    "sheila":{"heart","beach","summer","laptops","youtube","movies","hindi","english","cute guys","love"},
    "saif":{"women","beach","laptops","movies","himesh","world","earth","rap","fun","eminem"}
    "mark":{"pilgrimage","programming","house","world","books","country music","bob","tom hanks","beauty","tigers"},
    "stuart":{"rap","smart girls","music","wrestling","brock lesnar","country music","public speaking","women","coding","iphone"},
    "grover":{"skating","mountaineering","racing","athletics","sports","adidas","nike","women","apple","pop"},
    "anita":{"heart","sunidhi","hindi","love","love songs","cooking","adidas","beach","travelling","flowers"},
    "kelly":{"travelling","comedy","tv","facebook","youtube","cooking","horror","movies","dublin","animals"},
    "dino":{"women","games","xbox","x-men","assassin's creed","pop","rap","opera","need for speed","jeans"},
    "priya":{"heart","mountaineering","sky diving","sony","apple","pop","perfumes","luxury","eminem","lil wayne"},
    "brenda":{"cute guys","xbox","shower","beach","summer","english","french","country music","office","birds"}
}

How can I determine persons who have similar likes.Or maybe who two persons resemble the most.Also it will be helpful if you can point me to the appropriate example or tutorial for user based or item based filtering.

like image 824
Rajat Saxena Avatar asked Jul 16 '12 10:07

Rajat Saxena


2 Answers

(Disclaimer, I am not adept in this field and only have a passing knowledge of collective filtering. The following is simply a collection of resources that I have found useful)

The basics of this is covered quite comprehensively in Chapter 2 of the "Programming Collective Intelligence" book. Example code are in Python, which is another plus.

You might also find this site useful - A Programmer's Guide to Data Mining, in particular Chapter 2 and Chapter 3 which discusses recommendation systems and item-based filtering.

In short, one can use techniques such as calculating the Pearson Correlation Coefficient, Cosine Similarity, k-nearest neighbours, etc. to determine similarities between users based on items they've like/bought/voted on.

Note that there are various python libraries out there that were written for this purpose, e.g. pysuggest, Crab, python-recsys, and SciPy.stats.stats.pearsonr.

For a large data set where number of users exceed the number of items, you can scale the solution better by inversing your data and calculate the correlation between items instead (i.e. item-based filtering) and use that to infer similar users. Naturally, you wouldn't do this in real time but schedule periodic recalculations as a backend task. Some of the approaches can be parallelised/distributed to drastically shorten the computation time (assuming you have the resources to throw at it).

like image 181
Shawn Chin Avatar answered Nov 02 '22 10:11

Shawn Chin


A solution using python recsys library [ http://ocelma.net/software/python-recsys/build/html/quickstart.html ]

from recsys.algorithm.factorize import SVD
from recsys.datamodel.data import Data

likes={
    "rajat":{"music","x-men","programming","hindi","english","himesh","lil wayne","rap","travelling","coding"},
    "steve":{"travelling","pop","hanging out","friends","facebook","tv","skating","religion","english","chocolate"},
    "toby":{"programming","pop","rap","gardens","flowers","birthday","tv","summer","youtube","eminem"},
    "ravi":{"skating","opera","sony","apple","iphone","music","winter","mango shake","heart","microsoft"},
    "katy":{"music","pics","guitar","glamour","paris","fun","lip sticks","cute guys","rap","winter"},
    "paul":{"office","women","dress","casuals","action movies","fun","public speaking","microsoft","developer"},
    "sheila":{"heart","beach","summer","laptops","youtube","movies","hindi","english","cute guys","love"},
    "saif":{"women","beach","laptops","movies","himesh","world","earth","rap","fun","eminem"},
    "mark":{"pilgrimage","programming","house","world","books","country music","bob","tom hanks","beauty","tigers"},
    "stuart":{"rap","smart girls","music","wrestling","brock lesnar","country music","public speaking","women","coding","iphone"},
    "grover":{"skating","mountaineering","racing","athletics","sports","adidas","nike","women","apple","pop"},
    "anita":{"heart","sunidhi","hindi","love","love songs","cooking","adidas","beach","travelling","flowers"},
    "kelly":{"travelling","comedy","tv","facebook","youtube","cooking","horror","movies","dublin","animals"},
    "dino":{"women","games","xbox","x-men","assassin's creed","pop","rap","opera","need for speed","jeans"},
    "priya":{"heart","mountaineering","sky diving","sony","apple","pop","perfumes","luxury","eminem","lil wayne"},
    "brenda":{"cute guys","xbox","shower","beach","summer","english","french","country music","office","birds"}
}

data = Data()
VALUE = 1.0
for username in likes:
    for user_likes in likes[username]:
        data.add_tuple((VALUE, username, user_likes)) # Tuple format is: <value, row, column>

svd = SVD()
svd.set_data(data)
k = 5 # Usually, in a real dataset, you should set a higher number, e.g. 100
svd.compute(k=k, min_values=3, pre_normalize=None, mean_center=False, post_normalize=True)

svd.similar('sheila')
svd.similar('rajat')

Results:

In [11]: svd.similar('sheila')
Out[11]: 
[('sheila', 0.99999999999999978),
 ('brenda', 0.94929845546505753),
 ('anita', 0.85943494201162518),
 ('kelly', 0.53385495931440263),
 ('saif', 0.39985366653259058),
 ('rajat', 0.30757664244952165),
 ('toby', 0.28541364367155014),
 ('priya', 0.26184289111194581),
 ('steve', 0.25043700194182622),
 ('katy', 0.21812807229358305)]

In [12]: svd.similar('rajat')
Out[12]: 
[('rajat', 1.0000000000000004),
 ('mark', 0.89164019482177692),
 ('katy', 0.65207273451425907),
 ('stuart', 0.61675507205285718),
 ('steve', 0.55730648750670264),
 ('anita', 0.49836982296014803),
 ('brenda', 0.42759524471725929),
 ('kelly', 0.40436047539358799),
 ('toby', 0.35972227835054826),
 ('ravi', 0.31113813325818901)]
like image 39
ocelma Avatar answered Nov 02 '22 12:11

ocelma