I have found what must be dozens of articles on Towards Data Science, Medium, etc. by people building recommendation engines with IMDb data (given the ratings that users gave to movies, which movies should we recommend to those users?). These articles begin with 'memory-based approaches': user-based collaborative filtering and item-based collaborative filtering. I have been tasked with building a recommendation engine, and since none of the suits really care or know anything about this, I want to do the bare minimum (which seems to be user-based collaborative filtering).
The problem is that all of my data is binary (no ratings; based only on the items that other users bought, should we recommend items to similar users?). This is actually similar to the cartoons that all of the Medium articles have stolen from each other, but none of those articles give an example of how to do it.
All of the articles use Pearson correlation or cosine similarity to determine user similarity. Can I use these approaches with binary dimensions (bought or not)? If so, how, and if not, is there a different way to measure user similarity?
I am working with Python, by the way. I was also thinking of maybe using Hamming distance (is there a reason that wouldn't be good?).
>>> from scipy.spatial.distance import cosine
>>> 1 - cosine([5,3,2,0,1],[4,3,3,0,0])
0.961161313666907
>>> 1 - cosine([1,1,1,0,1],[1,1,1,0,0])
0.8660254037844386
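On the Hamming question specifically, here is a minimal sketch under some assumptions: the purchase data is made up, and the helper names (`cosine_sim`, `jaccard_sim`, `hamming_sim`, `recommend`) are illustrative, not from any library. It suggests that cosine works unchanged on 0/1 vectors, that Jaccard ignores items neither user bought, and that a Hamming-based similarity counts shared non-purchases as agreement, which can make two users with nothing in common look alike when most of the catalogue is unbought.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity; works the same on 0/1 vectors as on ratings."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def jaccard_sim(a, b):
    """Shared purchases divided by items bought by either user."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

def hamming_sim(a, b):
    """Fraction of positions that agree, including shared zeros."""
    a, b = np.asarray(a), np.asarray(b)
    return float((a == b).mean())

# Two hypothetical users with no purchase in common, 10-item catalogue.
u = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
v = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(cosine_sim(u, v))   # 0.0
print(jaccard_sim(u, v))  # 0.0
print(hamming_sim(u, v))  # 0.8 (eight shared non-purchases count as agreement)

def recommend(purchases, user, k=2, n=3):
    """User-based sketch: score unbought items by the summed cosine
    similarity of the k most similar users who bought them."""
    X = np.asarray(purchases, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0              # avoid dividing by zero for empty users
    Xn = X / norms
    sims = Xn @ Xn[user]                 # cosine similarity to every user
    sims[user] = -np.inf                 # never pick the user as their own neighbour
    neighbours = np.argsort(sims)[::-1][:k]
    scores = sims[neighbours] @ X[neighbours]
    scores[X[user] > 0] = -np.inf        # drop items the user already bought
    ranked = np.argsort(scores)[::-1]
    return [int(i) for i in ranked[:n] if scores[i] > 0]

# Hypothetical user-item purchase matrix (rows: users, columns: items).
purchases = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
]
print(recommend(purchases, user=0, k=2))  # [2, 3]
```

With real data the matrix would be sparse and much larger, so `k`, the choice of measure, and the storage format would all need tuning; treat this as a starting point rather than a recipe.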
The data you have is implicit feedback, which means an interaction does not necessarily indicate the user's interest; it is just an interaction. An interaction value of 1 and an interaction value of 1000 make no difference in this case: both show only that an interaction happened, so memory-based algorithms are not much use here. If you are not familiar with neural networks, you should at least use matrix factorization techniques to make meaningful recommendations from this data. You can start with the Surprise library, which has a number of matrix factorization models.
It would be better to use ALS as the optimization technique, but SGD will also do the job. If you are OK with deep learning, I can point you to sources for the best work so far.
I once used the non-negative matrix factorization (NNMF for short) algorithm in Surprise on data like yours, and the results were good enough.
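To give a feel for what non-negative matrix factorization does with a 0/1 purchase matrix, here is a rough NumPy sketch. It is an assumption-laden illustration, not Surprise's implementation: it uses the classic multiplicative-update rule on a dense matrix and treats every zero as an observed "not bought", and the function name `nmf` and the toy matrix `V` are made up.

```python
import numpy as np

def nmf(V, rank=2, iters=200, eps=1e-9, seed=0):
    """Factor V (users x items) into non-negative W (users x rank)
    and H (rank x items) using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_users, n_items = V.shape
    W = rng.random((n_users, rank)) + 0.1   # strictly positive start
    H = rng.random((rank, n_items)) + 0.1
    for _ in range(iters):
        # Each update keeps the factors non-negative and does not
        # increase the squared reconstruction error.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical binary purchase matrix.
V = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

W, H = nmf(V)
scores = W @ H   # higher score on an unbought item = stronger candidate
print(np.round(scores, 2))
```

Unbought items with the highest reconstructed score for a user would be the candidates to recommend.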
It seems that in your situation the best approach would be collaborative filtering. You don't need scores; all you need is a user-item interaction matrix. The simplest algorithm in this case is Alternating Least Squares (ALS).
There are already a few implementations in Python, for instance this one. There is also an implementation in the PySpark recommendation module.
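For intuition about what ALS does with a binary interaction matrix, here is a small dense NumPy sketch. It assumes the confidence-weighted formulation often used for implicit feedback (bought entries get weight `1 + alpha`, everything else weight 1); that choice, the function name `als_implicit`, the parameter values, and the toy matrix `R` are my assumptions, not necessarily what any particular library does internally.

```python
import numpy as np

def als_implicit(R, rank=2, alpha=10.0, reg=0.1, iters=15, seed=0):
    """Alternate between solving for user factors X and item factors Y,
    each step being a small regularized weighted least-squares problem."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = (R > 0).astype(float)      # preference: 1 if bought, else 0
    C = 1.0 + alpha * R            # confidence in each preference
    X = rng.normal(scale=0.1, size=(n_users, rank))
    Y = rng.normal(scale=0.1, size=(n_items, rank))
    lam = reg * np.eye(rank)
    for _ in range(iters):
        # Fix item factors, solve for each user.
        for u in range(n_users):
            YtC = Y.T * C[u]                       # (rank, n_items)
            X[u] = np.linalg.solve(YtC @ Y + lam, YtC @ P[u])
        # Fix user factors, solve for each item.
        for i in range(n_items):
            XtC = X.T * C[:, i]                    # (rank, n_users)
            Y[i] = np.linalg.solve(XtC @ X + lam, XtC @ P[:, i])
    return X, Y

# Hypothetical binary user-item interaction matrix.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

X, Y = als_implicit(R)
scores = X @ Y.T

# Rank the items user 0 has not bought yet by predicted score.
user = 0
unseen = np.flatnonzero(R[user] == 0)
print(unseen[np.argsort(scores[user, unseen])[::-1]])
```

The dense loops are only for readability; a production implementation would work on sparse matrices and avoid recomputing the full weighted products for every user and item.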