I have implemented a recommender system based on matrix factorization techniques, and I want to evaluate it.
I want to use 10-fold cross-validation with the all-but-one protocol (https://ai2-s2-pdfs.s3.amazonaws.com/0fcc/45600283abca12ea2f422e3fb2575f4c7fc0.pdf).
My data set has the following structure:
user_id,item_id,rating
1,1,2
1,2,5
1,3,0
2,1,5
...
I'm confused about how the data should be split, because some triples (user,item,rating) can't go into the testing set. For example, if I put the triple (2,1,5) in the testing set and this is the only rating user 2 has made, the training set will contain no other information about this user, and the trained model won't be able to predict any values for them.
Considering this scenario, how should I do the splitting?
You didn't specify a language or toolset, so I can't give you a concise answer that is 100% applicable to you, but here's the approach I took to solve this exact problem.
I'm working on a recommender system using Treasure Data (i.e. Presto) and implicit observations, and I ran into a problem where some users and items were not present in my matrix. I had to rewrite the algorithm that splits the observations into train and test so that every user and every item is represented in the training data. For the description of my algorithm I assume there are more users than items. If this is not true for you, just swap the two. Here's my algorithm.
As I mentioned, I'm doing this using Treasure Data and Presto, so the only tools I have at my disposal are SQL, common table expressions, temporary tables, and Treasure Data workflow.
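For readers who are not on Presto, the constraint described above (every user and every item must stay represented in the training data) can be illustrated with a rough Python sketch. This is not the SQL implementation mentioned in this answer; the function name `all_but_one_split`, the `seed` parameter, and the greedy selection rule are assumptions made for illustration. It holds out at most one rating per user, and only when both that user and that item still have at least one other rating left in the training set.

```python
import random
from collections import Counter


def all_but_one_split(ratings, seed=0):
    """Hold out at most one rating per user (hypothetical helper).

    A rating is held out only if its user and its item each keep
    at least one other rating in the training set.
    """
    rng = random.Random(seed)
    # Number of ratings still counted as training data, per user and per item.
    user_count = Counter(user for user, _, _ in ratings)
    item_count = Counter(item for _, item, _ in ratings)

    # Group the (user, item, rating) triples by user.
    by_user = {}
    for row in ratings:
        by_user.setdefault(row[0], []).append(row)

    train, test = [], []
    for user, rows in by_user.items():
        rng.shuffle(rows)  # randomize which eligible rating is held out
        held_out = None
        for row in rows:
            item = row[1]
            if held_out is None and user_count[user] > 1 and item_count[item] > 1:
                held_out = row
                user_count[user] -= 1
                item_count[item] -= 1
            else:
                train.append(row)
        if held_out is not None:
            test.append(held_out)
    return train, test


# The sample rows from the question.
ratings = [(1, 1, 2), (1, 2, 5), (1, 3, 0), (2, 1, 5)]
train, test = all_but_one_split(ratings)
print(test)  # [(1, 1, 2)]: user 2 has a single rating, so it stays in train
```

Calling the function with different `seed` values gives different held-out sets, which is one simple way to repeat the evaluation several times. Note that this is repeated random hold-out rather than a strict 10-fold partition, so users with a single rating are never tested.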