I'm trying to compute item-to-item similarity along the lines of Amazon's "Customers who viewed/purchased X have also viewed/purchased Y and Z". All of the examples and references I've seen are for either computing item similarity for ranked items, for finding user-user similarity, or for finding recommended items based on the current users' history. I'd like to start off with a non-targeted approach before factoring in the current users' preferences.
Looking at the Amazon.com recommendations white paper, they use the following logic for offline item-item similarity:
For each item in product catalog, I1
For each customer C who purchased I1
For each item I2 purchased by customer C
Record that a customer purchased I1 and I2
For each item I2
Compute the similarity between I1 and I2
If I understand correctly, by the time we're at "Compute similiarty between I1 and I2", I have a list of items(I2) purchased in conjunction with a single value I1(the outer loop).
How is this calculation performed?
Another idea is that I'm overthinking this and making it more difficult than I need to - Would it be enough to do a top-n query on the count of I2 bought in conjunction with I1?
I also appreciate suggestions on whether or not this approach is a correct one. My product database has about 150k items at any time. Since the bulk of the reading material I've seen shows user-item similarity or even user-user similarity, should I be looking to go that route instead.
I've worked with similarity algorithms in the past but they've always involved a rank or a score. I think the only way this would work would be to build a customer-product matrix scoring 0/1 for not purchased/purchased. Given the purchase history and the item size, this could get really large.
edit: although i listed python as a tag, i'd prefer to keep the logic inside of a db, preferably using Oracle PL/SQL.
Let's understand Item-to-Item Collaborative Filtering. suppose we have purchase matrix
Item1 Item2 ... ItemN
User1 0 1 ... 0
User2 1 1 ... 0
.
.
.
UserM 1 0 ... 0
Then we can calculate Item similarity using column vector, e.g use cosine. We have a item similarity symmetry matrix as below
Item1 Item2 ... ItemN
Item1 1 1/M ... 0
Item2 1/M 1 ... 0
.
.
.
ItemN 0 0 ... 1
It's can be explained as "Customers who viewed/purchased X have also viewed/purchased Y, Z, ..." (Collaborative Filtering). Because Item's vectorization is based on user's purchased.
Amazon's logic is exactly same with above while it's target is to improve efficient. As they said
We could build a product-to-product matrix by iterating through all item pairs and com- puting a similarity metric for each pair. However, many product pairs have no common customers, and thus the approach is inefficient in terms of processing time and memory usage. The iterative algorithm provides a better approach by calculating the similarity between a single prod-uct and all related products
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With