Is there a way to use sklearn.model_selection.train_test_split to retain all unique values from a specific column (or columns) in the training set?
Let me set up an example. The most common matrix factorization problem I am aware of is predicting movie ratings for users, say in the Netflix Challenge or MovieLens data sets. Now, this question isn't really centered around any single matrix factorization approach, but among the range of possibilities there is a group of methods that will make predictions only for known combinations of users and items.
So in MovieLens 100k, for example, we have 943 unique users and 1682 unique movies. If we were to use train_test_split, even with a high train_size ratio (say 0.9), the training set would not end up containing every unique user and movie. This presents a problem, as the group of methods I mentioned can predict nothing but 0 for movies or users they have not been trained on. Here is an example of what I mean.
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

ml = pd.read_csv('ml-100k/u.data', sep='\t',
                 names=['User_id', 'Item_id', 'Rating', 'ts'])

ml.head()
   User_id  Item_id  Rating         ts
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116
3      244       51       2  880606923
4      166      346       1  886397596

ml.User_id.unique().size
943
ml.Item_id.unique().size
1682

utrain, utest, itrain, itest, rtrain, rtest = train_test_split(
    ml.User_id, ml.Item_id, ml.Rating, train_size=0.9)
np.unique(utrain).size
943
np.unique(itrain).size
1644
```
Try this as many times as you like and you just won't end up with 1682 unique movies in the training set. This is because a number of movies have only a single rating in the dataset. Luckily the same isn't true for users (the lowest number of ratings by a user is 20), so it isn't a problem there. But in order to have a functioning training set we need all of the unique movies to appear in the training set at least once. Furthermore, I cannot use the stratify= kwarg of train_test_split, because not every user and movie has more than one entry.
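To make the problem concrete, here is a quick check (a sketch; the exact count depends on the dataset) showing why stratifying on Item_id is not an option: some movies have only a single rating, and stratification requires at least two samples per class.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

ml = pd.read_csv('ml-100k/u.data', sep='\t',
                 names=['User_id', 'Item_id', 'Rating', 'ts'])

# Movies with exactly one rating can never be guaranteed a slot in the
# training set by a purely random split.
ratings_per_item = ml.groupby('Item_id').size()
print('movies with a single rating:', (ratings_per_item == 1).sum())

# Stratifying on Item_id is rejected outright, because stratification
# needs at least two samples per class.
try:
    train_test_split(ml, train_size=0.9, stratify=ml.Item_id)
except ValueError as err:
    print(err)  # "The least populated class in y has only 1 member ..."
```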
My question is this: is there a way with train_test_split, or another splitting method from sklearn, to guarantee that every unique value of a given column (here Item_id) ends up in the training set?

My rudimentary solution to the problem is as follows:

1. Separate out the items (and/or users) that have too few ratings to split safely.
2. Run train_test_split on the data excluding these rarely rated items/users (ensuring that the split size + the exclude size will equal your desired split size).
3. Concatenate the excluded ratings back onto the training arrays.

Example:
```python
item_counts = ml.groupby(['Item_id']).size()
user_counts = ml.groupby(['User_id']).size()

rare_items = item_counts.loc[item_counts <= 5].index.values
rare_users = user_counts.loc[user_counts <= 5].index.values

rare_items.size
384
rare_users.size
0  # We can ignore users in this example

rare_ratings = ml.loc[ml.Item_id.isin(rare_items)]
rare_ratings.shape[0]
968

ml_less_rare = ml.loc[~ml.Item_id.isin(rare_items)]

items = ml_less_rare.Item_id.values
users = ml_less_rare.User_id.values
ratings = ml_less_rare.Rating.values

# Establish number of items desired from train_test_split
desired_ratio = 0.9
train_size = desired_ratio * ml.shape[0] - rare_ratings.shape[0]
train_ratio = train_size / ml_less_rare.shape[0]

itrain, itest, utrain, utest, rtrain, rtest = train_test_split(
    items, users, ratings, train_size=train_ratio)

itrain = np.concatenate((itrain, rare_ratings.Item_id.values))
np.unique(itrain).size
1682

utrain = np.concatenate((utrain, rare_ratings.User_id.values))
np.unique(utrain).size
943

rtrain = np.concatenate((rtrain, rare_ratings.Rating.values))
```
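As a quick sanity check (assuming the variables from the snippet above are still in scope), the stitched-together training set comes out at roughly the desired 90% of the ratings while covering every movie and every user:

```python
# itrain, utrain, rtrain come from the snippet above.
print(rtrain.size / ml.shape[0])   # ~0.9, since train_ratio was adjusted for the rare ratings
print(np.unique(itrain).size)      # 1682 -- every movie is present
print(np.unique(utrain).size)      # 943  -- every user is present
```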
This approach works, but I can't help feeling there is a way to accomplish the same thing with train_test_split or another splitting method from sklearn.
The approach that @serv-inc proposes would work for data where every class is represented more than once. That is not the case with this data, nor with most recommendation/ranking data sets.
What you are looking for is called stratification. Luckily, sklearn has just that. Just change the line to

```python
itrain, itest, utrain, utest, rtrain, rtest = train_test_split(
    items, users, ratings, train_size=train_ratio, stratify=users)
```
If stratify is not set, data is shuffled randomly. See http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html:

"If [stratify is] not None, data is split in a stratified fashion, using this as the class labels."
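For illustration only (a toy sketch, not the asker's data), this is what stratification does: the class proportions of the array passed to stratify are preserved, roughly, in both the train and the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 90 samples of class 'a' and 10 samples of class 'b'.
y = np.array(['a'] * 90 + ['b'] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0)

# Both splits keep the 90/10 class ratio (roughly).
print(np.unique(y_train, return_counts=True))
print(np.unique(y_test, return_counts=True))
```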
Update to the updated question: it seems that putting unique instances into the training set is not built into scikit-learn. You could abuse PredefinedSplit, or extend StratifiedShuffleSplit, but this might be more complicated than simply rolling your own.
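For what it is worth, "rolling your own" might look something like the minimal sketch below. The helper name split_keep_all_items and its exact signature are assumptions of mine, not an sklearn API: it reserves one randomly chosen rating per movie for the training set, then splits the remaining rows with train_test_split so that the overall train fraction still matches.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_keep_all_items(df, item_col='Item_id', train_size=0.9, seed=None):
    """Split df so every unique value of item_col lands in the training set.

    Sketch only: split_keep_all_items is a made-up helper, not part of sklearn.
    """
    # 1. Reserve one randomly chosen row per item for the training set.
    reserved = df.sample(frac=1, random_state=seed).drop_duplicates(subset=item_col)
    remainder = df.drop(reserved.index)

    # 2. Split the remainder so that reserved + train_rest together make up
    #    train_size of all rows (train_test_split accepts an absolute count).
    n_train_remainder = int(round(train_size * len(df))) - len(reserved)
    train_rest, test = train_test_split(
        remainder, train_size=n_train_remainder, random_state=seed)

    train = pd.concat([reserved, train_rest])
    return train, test

# Usage on the MovieLens frame from the question:
# train, test = split_keep_all_items(ml, 'Item_id', train_size=0.9, seed=0)
# assert train.Item_id.nunique() == ml.Item_id.nunique()
```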