
Split dataset per user according to timestamp into training and test sets in Python

Tags: python, pandas

I am using the MovieLens dataset (ratings.dat) and a pandas DataFrame to read and process the data. I have to split this data into training and test sets. Using the pandas DataFrame.sample function, I can divide the data into random splits. For example:

train = df.sample(frac=0.8,random_state=200)

test = df.drop(train.index)

Now I am trying to sort the data by user_id and then by timestamp, and I need to split each user's data 80%-20% between the training set and the test set respectively.

So, for example, if user1 rated 10 movies, the entries for this user should be sorted from oldest to latest according to timestamp

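import pandas as pd

ratings = pd.read_csv('filename', sep='\t', engine='python', header=0)

# DataFrame.sort was removed from pandas; sort_values is the current method.
# The column is named 'Timestamp' in the data shown below.
sorted_df = ratings.sort_values(['user_id', 'Timestamp'], ascending=[True, True])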

and the split should be done so that the first 8 entries with the oldest timestamps go into the training set and the latest 2 entries go into the test set.

I have no idea how to do that. Any suggestions?

Thanks

Data:

           user_id   item_id   rating   Timestamp 
15              1      539        5  838984068
16              1      586        5  838984068
5               1      355        5  838984474
9               1      370        5  838984596
12              1      466        5  838984679
14              1      520        5  838984679
19              1      594        5  838984679
7               1      362        5  838984885
20              1      616        5  838984941
23              2      260        5  868244562
29              2      733        3  868244562
32              2      786        3  868244562
36              2     1073        3  868244562
33              2      802        2  868244603
38              2     1356        3  868244603
30              2      736        3  868244698
31              2      780        3  868244698
27              2      648        2  868244699
ssh26 asked Feb 22 '17 15:02

1 Answer

It requires multiple steps, but can be achieved as follows.

The intuition is to generate a rank per user according to the timestamp and constrain it between 0 and 1. Everything at or below 0.8 goes into the train set; the rest goes into the test set.
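As a quick check of that idea, here is a minimal sketch for a single hypothetical user with 10 ratings. The timestamps are made up for illustration and are not taken from the question's data. Dividing each rank by the number of ratings scales it into (0, 1], and only the two newest ratings end up above 0.8.

```python
import pandas as pd

# Hypothetical timestamps for one user with 10 ratings (illustrative values only).
timestamps = pd.Series([100, 105, 110, 120, 130, 140, 150, 160, 170, 180])

# Rank from oldest (1) to newest (10), then scale into (0, 1].
normalized_rank = timestamps.rank(method='first') / len(timestamps)

# Ratings with a normalized rank above 0.8 go to the test set, the rest to train.
is_test = normalized_rank > 0.8

print(normalized_rank.tolist())
print(int(is_test.sum()))  # number of ratings assigned to the test set
```

The 8th rating has a normalized rank of exactly 0.8, so the strict comparison keeps it in the train set, which matches the 8/2 split asked for in the question.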

How do we do this? Creating the rank is as easy as this:

df.groupby('user_id')['Timestamp'].rank(method='first')
Out[51]: 
0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     6.0
6     7.0
7     8.0
8     9.0
9     1.0
10    2.0
11    3.0
12    4.0
13    5.0
14    6.0
15    7.0
16    8.0
17    9.0
Name: Timestamp, dtype: float64

Then you need to create a mapping from each row to the number of values in its group. You can find additional information here: Inplace transformation pandas with groupby.

df['user_id'].map(df.groupby('user_id')['Timestamp'].apply(len))
Out[52]: 
0     9
1     9
2     9
3     9
4     9
5     9
6     9
7     9
8     9
9     9
10    9
11    9
12    9
13    9
14    9
15    9
16    9
17    9
Name: user_id, dtype: int64
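As a side note, the same per-row counts can probably be obtained without the explicit map by using groupby with transform. The sketch below compares both approaches on a small made-up frame (the values are illustrative, not from the question) and assumes a pandas version where transform('size') is available.

```python
import pandas as pd

# Small illustrative frame (made-up values): two users with 3 and 2 ratings.
df = pd.DataFrame({
    'user_id':   [1, 1, 1, 2, 2],
    'Timestamp': [10, 20, 30, 5, 15],
})

# Approach from the answer: per-user counts mapped back onto each row.
counts_map = df['user_id'].map(df.groupby('user_id')['Timestamp'].apply(len))

# Alternative: transform broadcasts each group's size to every row directly.
counts_transform = df.groupby('user_id')['Timestamp'].transform('size')

print(counts_map.tolist())
print(counts_transform.tolist())
```

Both series are aligned with the original index, so either one can be used as the denominator when normalizing the ranks.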

Now you can put everything together

ranks = df.groupby('user_id')['Timestamp'].rank(method='first')
counts = df['user_id'].map(df.groupby('user_id')['Timestamp'].apply(len))
(ranks / counts) > 0.8
Out[55]: 
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8      True
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16     True
17     True
dtype: bool
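To finish the split, the boolean mask can be used to select the rows of each set. The following is a minimal sketch that rebuilds the sample rows from the question and applies the steps above; the names is_test, train and test are chosen here for illustration.

```python
import pandas as pd

# Sample rows copied from the question: (user_id, item_id, rating, Timestamp).
rows = [
    (1, 539, 5, 838984068), (1, 586, 5, 838984068), (1, 355, 5, 838984474),
    (1, 370, 5, 838984596), (1, 466, 5, 838984679), (1, 520, 5, 838984679),
    (1, 594, 5, 838984679), (1, 362, 5, 838984885), (1, 616, 5, 838984941),
    (2, 260, 5, 868244562), (2, 733, 3, 868244562), (2, 786, 3, 868244562),
    (2, 1073, 3, 868244562), (2, 802, 2, 868244603), (2, 1356, 3, 868244603),
    (2, 736, 3, 868244698), (2, 780, 3, 868244698), (2, 648, 2, 868244699),
]
df = pd.DataFrame(rows, columns=['user_id', 'item_id', 'rating', 'Timestamp'])

# Per-user rank by timestamp (ties broken by order of appearance).
ranks = df.groupby('user_id')['Timestamp'].rank(method='first')

# Number of ratings per user, aligned with each row.
counts = df['user_id'].map(df.groupby('user_id')['Timestamp'].apply(len))

# Rows whose normalized rank exceeds 0.8 are the newest ones for each user.
is_test = (ranks / counts) > 0.8

train = df[~is_test]
test = df[is_test]

print(len(train), len(test))
print(test)
```

With 9 ratings per user, the two newest ratings of each user land in the test set and the remaining seven in the train set.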
Alessandro Mariani answered Oct 06 '22 04:10
