Is there a way to use sklearn.model_selection.train_test_split to retain all unique values from a specific column (or columns) in the training set?
Let me set up an example. The most common matrix factorization problem I am aware of is predicting movie ratings for users, say in the Netflix Challenge or MovieLens data sets. Now, this question isn't really centered around any single matrix factorization approach, but among the range of possibilities there is a group of methods that will make predictions only for known combinations of users and items.
So in MovieLens 100k, for example, we have 943 unique users and 1682 unique movies. If we were to use train_test_split, even with a high train_size ratio (say 0.9), the training set would not end up containing every unique user and movie. This presents a problem, as the group of methods I mentioned can predict nothing but 0 for movies or users they have not been trained on. Here is an example of what I mean.
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

ml = pd.read_csv('ml-100k/u.data', sep='\t',
                 names=['User_id', 'Item_id', 'Rating', 'ts'])

ml.head()
   User_id  Item_id  Rating         ts
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116
3      244       51       2  880606923
4      166      346       1  886397596

ml.User_id.unique().size
943
ml.Item_id.unique().size
1682

utrain, utest, itrain, itest, rtrain, rtest = train_test_split(
    ml.User_id, ml.Item_id, ml.Rating, train_size=0.9)
np.unique(utrain).size
943
np.unique(itrain).size
1644
```
Try this as many times as you like and you just won't end up with 1682 unique movies in the training set. This is because a number of movies have only a single rating in the dataset. Luckily the same isn't true for users (the lowest number of ratings by a user is 20), so it isn't a problem there. But in order to have a functioning training set we need all of the unique movies to appear in the training set at least once. Furthermore, I cannot use the stratify= kwarg of train_test_split, because not every user and movie has more than one entry.
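To make the problem concrete, here is a quick check (a sketch; the exact count depends on the dataset) showing why stratifying on Item_id is not an option: some movies have only a single rating, and stratification requires at least two samples per class.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

ml = pd.read_csv('ml-100k/u.data', sep='\t',
                 names=['User_id', 'Item_id', 'Rating', 'ts'])

# Movies with exactly one rating can never be guaranteed a slot in the
# training set by a purely random split.
ratings_per_item = ml.groupby('Item_id').size()
print('movies with a single rating:', (ratings_per_item == 1).sum())

# Stratifying on Item_id is rejected outright, because stratification
# needs at least two samples per class.
try:
    train_test_split(ml, train_size=0.9, stratify=ml.Item_id)
except ValueError as err:
    print(err)  # "The least populated class in y has only 1 member ..."
```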
My question is this: is there a way with train_test_split, or another splitting method from sklearn, to guarantee that every unique value of a given column (here Item_id) ends up in the training set?

My rudimentary solution to the problem is as follows:

1. Separate out the items (and/or users) that have too few ratings to split safely.
2. Run train_test_split on the data excluding these rarely rated items/users (ensuring that the split size + the exclude size will equal your desired split size).
3. Concatenate the excluded ratings back onto the training arrays.

Example:
```python
item_counts = ml.groupby(['Item_id']).size()
user_counts = ml.groupby(['User_id']).size()

rare_items = item_counts.loc[item_counts <= 5].index.values
rare_users = user_counts.loc[user_counts <= 5].index.values

rare_items.size
384
rare_users.size
0  # We can ignore users in this example

rare_ratings = ml.loc[ml.Item_id.isin(rare_items)]
rare_ratings.shape[0]
968

ml_less_rare = ml.loc[~ml.Item_id.isin(rare_items)]

items = ml_less_rare.Item_id.values
users = ml_less_rare.User_id.values
ratings = ml_less_rare.Rating.values

# Establish number of items desired from train_test_split
desired_ratio = 0.9
train_size = desired_ratio * ml.shape[0] - rare_ratings.shape[0]
train_ratio = train_size / ml_less_rare.shape[0]

itrain, itest, utrain, utest, rtrain, rtest = train_test_split(
    items, users, ratings, train_size=train_ratio)

itrain = np.concatenate((itrain, rare_ratings.Item_id.values))
np.unique(itrain).size
1682

utrain = np.concatenate((utrain, rare_ratings.User_id.values))
np.unique(utrain).size
943

rtrain = np.concatenate((rtrain, rare_ratings.Rating.values))
```
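As a quick sanity check (assuming the variables from the snippet above are still in scope), the stitched-together training set comes out at roughly the desired 90% of the ratings while covering every movie and every user:

```python
# itrain, utrain, rtrain come from the snippet above.
print(rtrain.size / ml.shape[0])   # ~0.9, since train_ratio was adjusted for the rare ratings
print(np.unique(itrain).size)      # 1682 -- every movie is present
print(np.unique(utrain).size)      # 943  -- every user is present
```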
This approach works, but I can't help feeling there is a way to accomplish the same thing with train_test_split or another splitting method from sklearn.
The approach that @serv-inc proposes would work for data where every class is represented more than once. That is not the case with this data, nor with most recommendation/ranking data sets.
What you are looking for is called stratification. Luckily, sklearn has just that. Just change the line to

```python
itrain, itest, utrain, utest, rtrain, rtest = train_test_split(
    items, users, ratings, train_size=train_ratio, stratify=users)
```
If stratify is not set, data is shuffled randomly. See http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html:

"If [stratify is] not None, data is split in a stratified fashion, using this as the class labels."
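For illustration only (a toy sketch, not the asker's data), this is what stratification does: the class proportions of the array passed to stratify are preserved, roughly, in both the train and the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 90 samples of class 'a' and 10 samples of class 'b'.
y = np.array(['a'] * 90 + ['b'] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0)

# Both splits keep the 90/10 class ratio (roughly).
print(np.unique(y_train, return_counts=True))
print(np.unique(y_test, return_counts=True))
```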
Update to the updated question: it seems that putting unique instances into the training set is not built into scikit-learn. You could abuse PredefinedSplit, or extend StratifiedShuffleSplit, but this might be more complicated than simply rolling your own.
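For what it is worth, "rolling your own" might look something like the minimal sketch below. The helper name split_keep_all_items and its exact signature are assumptions of mine, not an sklearn API: it reserves one randomly chosen rating per movie for the training set, then splits the remaining rows with train_test_split so that the overall train fraction still matches.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_keep_all_items(df, item_col='Item_id', train_size=0.9, seed=None):
    """Split df so every unique value of item_col lands in the training set.

    Sketch only: split_keep_all_items is a made-up helper, not part of sklearn.
    """
    # 1. Reserve one randomly chosen row per item for the training set.
    reserved = df.sample(frac=1, random_state=seed).drop_duplicates(subset=item_col)
    remainder = df.drop(reserved.index)

    # 2. Split the remainder so that reserved + train_rest together make up
    #    train_size of all rows (train_test_split accepts an absolute count).
    n_train_remainder = int(round(train_size * len(df))) - len(reserved)
    train_rest, test = train_test_split(
        remainder, train_size=n_train_remainder, random_state=seed)

    train = pd.concat([reserved, train_rest])
    return train, test

# Usage on the MovieLens frame from the question:
# train, test = split_keep_all_items(ml, 'Item_id', train_size=0.9, seed=0)
# assert train.Item_id.nunique() == ml.Item_id.nunique()
```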