
Sklearn train_test_split; retaining unique values from column(s) in training set


Is there a way to use sklearn.model_selection.train_test_split so that all unique values from specific column(s) are retained in the training set?

Let me set up an example. The most common matrix factorization problem I am aware of is predicting movie ratings for users, as in the Netflix Challenge or Movielens data sets. This question isn't centered around any single matrix factorization approach, but within the range of possibilities there is a group of methods that will make predictions only for known combinations of users and items.

So in Movielens 100k, for example, we have 943 unique users and 1682 unique movies. If we were to use train_test_split, even with a high train_size ratio (say 0.9), the training set would not contain the same number of unique users and movies as the full data set. This presents a problem, as the group of methods I mentioned would not be able to predict anything but 0 for movies or users they had not been trained on. Here is an example of what I mean.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split

    ml = pd.read_csv('ml-100k/u.data', sep='\t', names=['User_id', 'Item_id', 'Rating', 'ts'])
    ml.head()
    #    User_id  Item_id  Rating         ts
    # 0      196      242       3  881250949
    # 1      186      302       3  891717742
    # 2       22      377       1  878887116
    # 3      244       51       2  880606923
    # 4      166      346       1  886397596

    ml.User_id.unique().size  # 943
    ml.Item_id.unique().size  # 1682

    # Pass the three columns separately so that six arrays come back
    utrain, utest, itrain, itest, rtrain, rtest = train_test_split(
        ml.User_id, ml.Item_id, ml.Rating, train_size=0.9)

    np.unique(utrain).size  # 943
    np.unique(itrain).size  # 1644

Try this as many times as you like and you just won't end up with 1682 unique movies in the train set. This is because a number of movies have only a single rating in the dataset. Luckily the same isn't true for users (the lowest number of ratings by a user is 20), so it isn't a problem there. But in order to have a functioning training set, we need every unique movie to be in the training set at least once. Furthermore, I cannot use the stratify= kwarg of train_test_split, because not every user or movie has more than one entry.
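To illustrate the stratify= limitation just described, here is a minimal sketch on hypothetical toy data (the arrays and the helper name try_stratified_split are made up for this example, not part of the Movielens code). It assumes a reasonably recent scikit-learn, where a stratified split is rejected with a ValueError if any class has only one member.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: item 3 is rated exactly once.
items = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3])
ratings = np.arange(items.size)


def try_stratified_split(labels, values, test_size=0.25, seed=0):
    """Attempt a stratified split; return (succeeded, error_message)."""
    try:
        train_test_split(values, test_size=test_size,
                         stratify=labels, random_state=seed)
    except ValueError as err:
        return False, str(err)
    return True, ""


# With the singleton item included, stratification is refused.
ok_with_singleton, message = try_stratified_split(items, ratings)

# Without the singleton (drop the last row), the same call goes through.
ok_without_singleton, _ = try_stratified_split(items[:-1], ratings[:-1])
```

So stratifying on Item_id is not an option as long as any movie has a single rating.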

My question is this.

Is there a way in sklearn to split a dataset to ensure that the set of unique values from a specific column(s) are retained in the training set?

My rudimentary solution to the problem is as follows.

  1. Separate the items/users that have a low number of total ratings.
  2. Run train_test_split on the data excluding these rarely rated items/users (ensuring that the split size plus the excluded size equals your desired split size).
  3. Combine the two to get a final representative training set.

Example:

    item_counts = ml.groupby(['Item_id']).size()
    user_counts = ml.groupby(['User_id']).size()
    rare_items = item_counts.loc[item_counts <= 5].index.values
    rare_users = user_counts.loc[user_counts <= 5].index.values

    rare_items.size  # 384
    rare_users.size  # 0

    # We can ignore users in this example
    rare_ratings = ml.loc[ml.Item_id.isin(rare_items)]
    rare_ratings.shape[0]  # 968

    ml_less_rare = ml.loc[~ml.Item_id.isin(rare_items)]
    items = ml_less_rare.Item_id.values
    users = ml_less_rare.User_id.values
    ratings = ml_less_rare.Rating.values

    # Establish number of items desired from train_test_split
    desired_ratio = 0.9
    train_size = desired_ratio * ml.shape[0] - rare_ratings.shape[0]
    train_ratio = train_size / ml_less_rare.shape[0]

    itrain, itest, utrain, utest, rtrain, rtest = train_test_split(
        items, users, ratings, train_size=train_ratio)

    itrain = np.concatenate((itrain, rare_ratings.Item_id.values))
    np.unique(itrain).size  # 1682

    utrain = np.concatenate((utrain, rare_ratings.User_id.values))
    np.unique(utrain).size  # 943

    rtrain = np.concatenate((rtrain, rare_ratings.Rating.values))

This approach works, but I can't help feeling there is a way to accomplish the same with train_test_split or another splitting method from sklearn.
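For comparison, here is a sketch of a related but different hand-rolled technique: instead of moving every rating of a rare item into the training set, pin one randomly chosen row per unique value and split the remaining rows at random. The function split_keeping_unique and the toy frame are hypothetical, not scikit-learn or pandas API, and the sketch assumes plain pandas and NumPy.

```python
import numpy as np
import pandas as pd


def split_keeping_unique(df, columns, train_size=0.9, seed=0):
    """Split df so every unique value in `columns` occurs in the train part.

    Hypothetical helper, not part of scikit-learn.
    """
    rng = np.random.default_rng(seed)
    shuffled = df.iloc[rng.permutation(len(df))]

    # After shuffling, the first occurrence of each value is a random row
    # for that value; pin those rows to the training set.
    pinned = np.zeros(len(shuffled), dtype=bool)
    for col in columns:
        pinned |= ~shuffled[col].duplicated().to_numpy()

    # Fill the rest of the training quota from the unpinned rows.
    n_train = int(round(train_size * len(df)))
    n_extra = max(n_train - int(pinned.sum()), 0)
    in_train = pinned.copy()
    in_train[np.flatnonzero(~pinned)[:n_extra]] = True

    return shuffled[in_train], shuffled[~in_train]


# Toy data: items 12 and 13 are each rated only once.
toy = pd.DataFrame({
    "User_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "Item_id": [10, 11, 10, 11, 10, 12, 11, 10, 11, 13],
    "Rating": [3, 4, 5, 2, 1, 3, 4, 5, 2, 3],
})
train, test = split_keeping_unique(toy, ["User_id", "Item_id"], train_size=0.7)
```

If the pinned rows alone exceed the requested train_size, the training part comes out larger than asked for; the sketch accepts that rather than dropping a unique value.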

Caveat - Data Contains Single Entries for Users and Movies

The approach that @serv-inc proposes would work for data where every class is represented more than once. That is not the case with this data, nor with most recommendation/ranking data sets.

Grr asked Dec 07 '17 17:12




1 Answer

What you are looking for is called stratification. Luckily, sklearn has just that. Just change the line to

    itrain, itest, utrain, utest, rtrain, rtest = train_test_split(
        items, users, ratings, train_size=train_ratio, stratify=users)

If stratify is not set, data is shuffled randomly. See http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

If [stratify is] not None, data is split in a stratified fashion, using this as the class labels.


Update to the updated question: it seems that putting unique instances into the training set is not built into scikit-learn. You could abuse PredefinedSplit, or extend StratifiedShuffleSplit, but this might be more complicated than simply rolling your own.
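As a rough sketch of the PredefinedSplit route mentioned above: PredefinedSplit takes a test_fold array in which -1 means "always in the training set", so one row per item can be pinned with -1 and the test rows drawn from the rest. The helper name predefined_split_keeping_items and the toy array are hypothetical, and the sketch only covers a single column.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit


def predefined_split_keeping_items(item_ids, test_size=0.1, seed=0):
    """Hypothetical helper: PredefinedSplit with one pinned train row per item."""
    item_ids = np.asarray(item_ids)
    rng = np.random.default_rng(seed)

    # Shuffle, then take the first occurrence of each item as its pinned row.
    order = rng.permutation(item_ids.size)
    _, first_pos = np.unique(item_ids[order], return_index=True)
    pinned = order[first_pos]

    # -1 keeps a row in the training set; 0 assigns it to the single test fold.
    test_fold = np.full(item_ids.size, -1)
    candidates = np.setdiff1d(np.arange(item_ids.size), pinned)
    n_test = min(int(round(test_size * item_ids.size)), candidates.size)
    test_rows = rng.choice(candidates, size=n_test, replace=False)
    test_fold[test_rows] = 0

    return PredefinedSplit(test_fold)


# Toy data: items 12 and 13 appear only once.
items = np.array([10, 10, 10, 11, 11, 12, 10, 11, 10, 13])
splitter = predefined_split_keeping_items(items, test_size=0.3)
train_idx, test_idx = next(splitter.split())
```

Note that if no row ends up in the test fold (for example, when every row is pinned), the splitter yields no split at all, so real code would need to handle that case.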

serv-inc answered Nov 15 '22 15:11