I am trying to build a poor man's recommendation system for an online store. I want to implement something like Amazon's "Customers Who Bought This Item Also Bought" feature, and I have read a lot about it. I know there is Apache Mahout, but I am unable to tweak the server that way. Then there is the Google Prediction API, but it costs money, so I started experimenting myself.
I have an order history with 250,000+ items, and I wrote a nested MySQL query that finds the orders containing the current article, ranks the other items in those orders, and sorts the result by rank. That gives me the set of products other people ordered along with the current article.
The problem is that the query can take up to 10 seconds, so it can't be used directly. I thought about a caching table, but the query that would fill it stops after 20 minutes (there are 60,000 products and 250,000 ordered items), so I am unable to fill that table.
My current workaround is the following: the recommendation HTML is loaded via AJAX on document ready, so the page loads while the recommendation loads in the background. The recommendation data is computed once and stored in a file cache (PEAR simple cache) so it loads faster the next time. So the cache is built on demand when someone visits the page, and kept for a day or maybe a week.
I ask myself and you: is that an acceptable approach, or is it stupid and slow? Would it be better to store the cached data in a DB or in files (I am thinking about performance and parallel hits)? In the worst case I would end up with 60,000 cache files.
I would prefer a pre-computed table with all the data, but as I said it takes too long and I don't know how to optimize it. (Waiting till the SQL dude comes back from holidays ^^)
Thanks for any hints or opinions.
By the way, this is the query:
SELECT c.ArtNr AS artnr, COUNT(c.ArtNr) AS rank, s.ArtNr AS parent_artnr
FROM (
    -- all order positions that contain the current article
    SELECT a.ID_order, a.ArtNr
    FROM net_orderposition a
    WHERE a.ArtNr = 'TT-PV0005'
) s
-- every other position in those same orders
JOIN net_orderposition c
    ON s.ID_order = c.ID_order AND s.ArtNr != c.ArtNr
GROUP BY c.ArtNr
ORDER BY rank DESC, c.Stamp DESC
LIMIT 10;
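For the pre-computed table, one option might be to compute the counts for all articles in a single self-join instead of running the query above once per article. The following is only a sketch under assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the sample orders are made up, and the `related_items` table, its `rank_count` column and the index name are hypothetical. Whether this finishes in reasonable time on the real data probably depends on `net_orderposition` having an index on `ID_order`.

```python
import sqlite3

# SQLite in memory stands in for MySQL. net_orderposition, ID_order and ArtNr
# come from the question; related_items and rank_count are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE net_orderposition (ID_order INTEGER, ArtNr TEXT);
-- The self-join looks rows up by order, so index that column.
CREATE INDEX idx_orderposition_order ON net_orderposition (ID_order, ArtNr);
CREATE TABLE related_items (
    parent_artnr TEXT NOT NULL,
    child_artnr  TEXT NOT NULL,
    rank_count   INTEGER NOT NULL,
    PRIMARY KEY (parent_artnr, child_artnr)
);
""")

# Made-up sample data: four orders over three articles.
conn.executemany(
    "INSERT INTO net_orderposition (ID_order, ArtNr) VALUES (?, ?)",
    [(1, "A"), (1, "B"), (1, "C"), (2, "A"), (2, "B"), (3, "A"), (3, "C"), (4, "B")],
)


def rebuild_related_items(conn):
    """Recompute every (parent, child) count with one self-join."""
    conn.execute("DELETE FROM related_items")
    conn.execute("""
        INSERT INTO related_items (parent_artnr, child_artnr, rank_count)
        SELECT s.ArtNr, c.ArtNr, COUNT(*)
        FROM net_orderposition s
        JOIN net_orderposition c
          ON c.ID_order = s.ID_order AND c.ArtNr != s.ArtNr
        GROUP BY s.ArtNr, c.ArtNr
    """)
    conn.commit()


def top_related(conn, artnr, limit=10):
    """Cheap lookup against the pre-computed table."""
    return conn.execute(
        "SELECT child_artnr, rank_count FROM related_items "
        "WHERE parent_artnr = ? ORDER BY rank_count DESC, child_artnr LIMIT ?",
        (artnr, limit),
    ).fetchall()


rebuild_related_items(conn)
pair_rows = conn.execute("SELECT COUNT(*) FROM related_items").fetchone()[0]
top_for_a = top_related(conn, "A")
```

The page request then only touches `related_items` via its primary key, so the expensive join runs once per rebuild (for example from a nightly job) rather than once per visitor.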
EDIT:
I thought about the given answers and I think they are similar to my initial idea. The above code results in the following table:
ID, ParentID,  ChildID,   Rank
1,  TT-PV0005, TT-PV0040, 220
2,  TT-PV0005, TT-PV0355, 135
3,  TT-PV0005, TT-PV0450, 134
4,  TT-PV0005, TT-PV0451, 89
5,  TT-PV0005, RH-01V2,   83
6,  TT-PV0005, TT-PV0041, 83
7,  TT-PV0005, TT-PV0353, 82
8,  TT-PV0005, TT-PV0037, 80
The ParentID is the current item, ChildID is an item that was ordered in the past along with ParentID, and Rank is the pre-computed count of how often the child was ordered with the current item. Now I can UPDATE or INSERT related items on every new order and increment Rank if the pair is already present in the DB. The only thing I fear is that I will end up with a really, really big table. Maybe that isn't a problem if I pre-calculate it offline once a week? But then I have to optimize the query so it doesn't take 10 seconds per item.
What do you think?
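The "UPDATE or INSERT on every new order" idea from the edit could look roughly like the sketch below. It is an illustration under assumptions, not a tested production design: SQLite (via Python's sqlite3) stands in for MySQL, the `related_items` table, the `rank_count` column and the `record_order` function are hypothetical names, and the upsert uses SQLite's `ON CONFLICT ... DO UPDATE` (SQLite 3.24 or newer). In MySQL the corresponding statement would be `INSERT ... ON DUPLICATE KEY UPDATE rank_count = rank_count + 1`.

```python
import sqlite3
from itertools import permutations

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per (parent, child) pair, enforced by the primary key.
conn.execute("""
    CREATE TABLE related_items (
        parent_artnr TEXT NOT NULL,
        child_artnr  TEXT NOT NULL,
        rank_count   INTEGER NOT NULL,
        PRIMARY KEY (parent_artnr, child_artnr)
    )
""")


def record_order(conn, artnrs):
    """Insert each ordered pair of distinct articles, or bump its counter."""
    # set() means an article listed twice in one order is only counted once;
    # that is a choice made for this sketch, not something the question specifies.
    pairs = list(permutations(sorted(set(artnrs)), 2))
    conn.executemany(
        """
        INSERT INTO related_items (parent_artnr, child_artnr, rank_count)
        VALUES (?, ?, 1)
        ON CONFLICT (parent_artnr, child_artnr)
        DO UPDATE SET rank_count = rank_count + 1
        """,
        pairs,
    )
    conn.commit()
    return len(pairs)


def rank_of(conn, parent, child):
    """Return the stored counter for one pair, or 0 if the pair was never seen."""
    row = conn.execute(
        "SELECT rank_count FROM related_items WHERE parent_artnr = ? AND child_artnr = ?",
        (parent, child),
    ).fetchone()
    return row[0] if row else 0


# Two made-up orders using article numbers from the table above.
record_order(conn, ["TT-PV0005", "TT-PV0040"])
record_order(conn, ["TT-PV0005", "TT-PV0040", "RH-01V2"])
total_rows = conn.execute("SELECT COUNT(*) FROM related_items").fetchone()[0]
```

Because of the primary key, the table holds at most one row per pair of articles that were actually ordered together, so its size is bounded by the number of distinct co-occurring pairs rather than by the number of orders.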
Check out easyrec: it has the features you need and is free. No tweaking is needed, and you can use the demo instance much like Google Analytics. I think it will be much easier to use this free web service than to code the whole logic on your own.
In a tweet today they mentioned that easyrec now has full Mahout support, so you get the whole thing with easyrec. You can either use easyrec's free web service or deploy the free WAR file on your own web server.
To add to @GalacticCowboy's answer and fill in where your comment was, @Marcus...
One schema to accomplish this would be to create a table like:
RelatedItems
    RelatedItemsId
    purchasedItemId
    relatedItemId
Then, when an order is completed (or viewed, depending on your requirements), you'd write records to the RelatedItems table: each purchased item gets one record per other item in the order, with its own id as the purchasedItemId and the other item's id as the relatedItemId.
For example, if I made a purchase of Items 5, 9, 12, and 19, I would have 12 records that were written to my table that look like:
RelatedItemsId, PurchasedItemId, RelatedItemId
1,              5,               9
2,              5,               12
3,              5,               19
4,              9,               5
5,              9,               12
6,              9,               19
7,              12,              5
8,              12,              9
9,              12,              19
10,             19,              5
11,             19,              9
12,             19,              12
Then you could use a query similar to GalacticCowboy's to get the top 10 items that were normally purchased alongside any of those items.
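To make the example above concrete, here is a small sketch that writes the 12 records for items 5, 9, 12 and 19 and then reads back the most frequent companions of one item. It uses Python's sqlite3 module as a stand-in for your database. The function names, the index and the lookup query are my own assumptions for illustration; in particular, the query is not GalacticCowboy's actual one, just one plausible shape for it.

```python
import sqlite3
from itertools import permutations

conn = sqlite3.connect(":memory:")
# Schema from above; the index name is hypothetical.
conn.executescript("""
CREATE TABLE RelatedItems (
    RelatedItemsId  INTEGER PRIMARY KEY AUTOINCREMENT,
    purchasedItemId INTEGER NOT NULL,
    relatedItemId   INTEGER NOT NULL
);
CREATE INDEX idx_related_purchased ON RelatedItems (purchasedItemId);
""")


def write_related_items(conn, item_ids):
    """Write one record per ordered pair of distinct items in the order."""
    rows = list(permutations(item_ids, 2))
    conn.executemany(
        "INSERT INTO RelatedItems (purchasedItemId, relatedItemId) VALUES (?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


def top_related(conn, item_id, limit=10):
    """Count how often each other item appears next to item_id."""
    return conn.execute(
        "SELECT relatedItemId, COUNT(*) AS times FROM RelatedItems "
        "WHERE purchasedItemId = ? "
        "GROUP BY relatedItemId ORDER BY times DESC, relatedItemId LIMIT ?",
        (item_id, limit),
    ).fetchall()


# The order from the example: 4 items, 3 partners each.
written = write_related_items(conn, [5, 9, 12, 19])
# A second, made-up order so that one pair occurs twice.
write_related_items(conn, [5, 9])
stored = conn.execute(
    "SELECT RelatedItemsId, purchasedItemId, relatedItemId "
    "FROM RelatedItems ORDER BY RelatedItemsId"
).fetchall()
best_for_5 = top_related(conn, 5)
```

This variant appends a row per purchase and counts at read time, which keeps the write simple at the cost of a growing table and a GROUP BY on every lookup.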
Please note, this is not the most efficient schema for a task like this, it could be tweaked quite a bit to reduce redundant data, but given that we don't know an awful lot about your system and overall schema design (and what seems a shaky understanding of some SQL concepts) I'm not going to go deep into that.