Splitting data set into training and testing sets on recommender systems

I have implemented a recommender system based upon matrix factorization techniques. I want to evaluate it.

I want to use 10-fold cross-validation with the all-but-one protocol (https://ai2-s2-pdfs.s3.amazonaws.com/0fcc/45600283abca12ea2f422e3fb2575f4c7fc0.pdf).

My data set has the following structure:

user_id,item_id,rating
1,1,2
1,2,5
1,3,0
2,1,5
...

I'm confused about how the data should be split, because some triples (user,item,rating) can't go into the testing set. For example, if I move the triple (2,1,5) to the testing set and it is the only rating user 2 has made, the training set will hold no information about this user and the trained model won't be able to predict any values for them.

Considering this scenario, how should I do the splitting?

Vitor Tonon asked Mar 30 '17 23:03



1 Answer

You didn't specify a language or toolset, so I can't give you an answer that applies exactly to your setup, but here's the approach I took to solve the same problem.

I'm working on a recommender system using Treasure Data (i.e. Presto) and implicit observations, and I ran into a problem where some users and items were missing from my matrix. I had to rewrite the algorithm that splits the observations into train and test so that every user and every item is represented in the training data. The description below assumes there are more users than items; if that is not true for you, just swap the two. Here's my algorithm.

  1. Select one observation for each user.
  2. For each item that has only one observation and was not already selected in the previous step, select one observation.
  3. Merge the results of the previous two steps together. This should produce a set of observations that covers all of the users and all of the items.
  4. Calculate how many observations you need to fill your training set (generally 80% of the total number of observations)
  5. Calculate how many observations are in the merged set from step 3. The difference between steps 4 and 5 is the number of remaining observations necessary to fill the training set.
  6. Randomly select enough of the remaining observations to fill the training set.
  7. Merge the sets from step 3 and 6: this is your training set.
  8. The remaining observations are your testing set.
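
The steps above can be sketched in plain Python, for readers who are not limited to SQL. This is a minimal illustration, not the Presto implementation described in this answer: the function name `coverage_split` and its parameters `train_fraction` and `seed` are hypothetical. One assumption differs from the wording of step 2: the sketch selects an observation for every item not yet covered after step 1, not only for items with a single observation, so that the coverage claim in step 3 holds.

```python
import random


def coverage_split(observations, train_fraction=0.8, seed=0):
    """Split (user, item, rating) triples so that every user and every
    item appears at least once in the training set (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(observations)
    rng.shuffle(shuffled)  # random order, so the "one per user" pick is random

    train_idx = set()
    seen_users, seen_items = set(), set()

    # Step 1: select one observation for each user.
    for i, (user, item, _rating) in enumerate(shuffled):
        if user not in seen_users:
            seen_users.add(user)
            seen_items.add(item)
            train_idx.add(i)

    # Step 2 (assumption: covers every item still missing after step 1).
    for i, (_user, item, _rating) in enumerate(shuffled):
        if item not in seen_items:
            seen_items.add(item)
            train_idx.add(i)

    # Step 3 is implicit: train_idx now covers all users and all items.
    # Steps 4-5: how many more observations the training set needs.
    target = round(train_fraction * len(shuffled))
    needed = max(0, target - len(train_idx))

    # Step 6: randomly draw the rest from the unselected observations.
    remaining = [i for i in range(len(shuffled)) if i not in train_idx]
    train_idx.update(rng.sample(remaining, min(needed, len(remaining))))

    # Steps 7-8: the merged selection is train, everything else is test.
    train = [shuffled[i] for i in sorted(train_idx)]
    test = [shuffled[i] for i in range(len(shuffled)) if i not in train_idx]
    return train, test


# Usage with the four rows shown in the question.
ratings = [(1, 1, 2), (1, 2, 5), (1, 3, 0), (2, 1, 5)]
train, test = coverage_split(ratings)
```

On very sparse data like the four rows above, the coverage set alone can be larger than the requested training size. In that case the sketch keeps full coverage and the testing set ends up smaller than `1 - train_fraction` of the data.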

As I mentioned, I'm doing this using Treasure Data and Presto, so the only tools I have at my disposal are SQL, common table expressions, temporary tables, and Treasure Data workflow.

JZimmerman answered Nov 15 '22 07:11
