I am using the MovieLens dataset (ratings.dat) and a pandas DataFrame to read and process the data. I have to split this data into a training set and a test set. Using the pandas DataFrame.sample function, I can divide the data into random splits. For example:
train = df.sample(frac=0.8,random_state=200)
test = df.drop(train.index)
Now I am trying to sort the data by user_id and then by timestamp, and I need to split each user's data 80%-20% into the training set and test set respectively.
So, for example, if user1 rated 10 movies, the entries for this user should be sorted from oldest to latest by timestamp
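import pandas as pd
ratings = pd.read_csv('filename', sep='\t', engine='python', header=0)
# DataFrame.sort was removed from pandas; sort_values is its replacement
sorted_df = ratings.sort_values(['user_id', 'Timestamp'], ascending=[True, True])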
and the split should put the first 8 entries (those with the oldest timestamps) in the training set and the latest 2 entries in the test set.
I have no idea how I could do that. Any suggestions?
Thanks
Data:
user_id item_id rating Timestamp
15 1 539 5 838984068
16 1 586 5 838984068
5 1 355 5 838984474
9 1 370 5 838984596
12 1 466 5 838984679
14 1 520 5 838984679
19 1 594 5 838984679
7 1 362 5 838984885
20 1 616 5 838984941
23 2 260 5 868244562
29 2 733 3 868244562
32 2 786 3 868244562
36 2 1073 3 868244562
33 2 802 2 868244603
38 2 1356 3 868244603
30 2 736 3 868244698
31 2 780 3 868244698
27 2 648 2 868244699
It requires multiple steps, but can be achieved as follows.
The intuition is to rank each user's rows by timestamp and scale that rank to lie between 0 and 1. Rows with a scaled rank of at most 0.8 go to your train set; the rest go to your test set.
How do we do this? Creating the rank is as easy as:
df.groupby('user_id')['Timestamp'].rank(method='first')
Out[51]:
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 1.0
10 2.0
11 3.0
12 4.0
13 5.0
14 6.0
15 7.0
16 8.0
17 9.0
Name: Timestamp, dtype: float64
Then you need to map each row to the number of values in its group. You can find additional information here: Inplace transformation pandas with groupby.
df['user_id'].map(df.groupby('user_id')['Timestamp'].apply(len))
Out[52]:
0 9
1 9
2 9
3 9
4 9
5 9
6 9
7 9
8 9
9 9
10 9
11 9
12 9
13 9
14 9
15 9
16 9
17 9
Name: user_id, dtype: int64
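As an aside, the same per-row group size can probably be obtained more directly with a groupby transform. The following is only a sketch on made-up toy data (the DataFrame below is hypothetical, not the MovieLens file), comparing the map-based approach above with `transform("size")`:

```python
import pandas as pd

# Hypothetical toy data: user 1 has 3 ratings, user 2 has 2.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "Timestamp": [10, 20, 30, 40, 50],
})

# Approach from the answer: compute one size per user, then map it onto each row.
counts_map = df["user_id"].map(df.groupby("user_id")["Timestamp"].apply(len))

# Alternative: transform broadcasts the group size back onto the original rows.
counts_transform = df.groupby("user_id")["Timestamp"].transform("size")

print(counts_map.tolist())
print(counts_transform.tolist())
```

Both series are aligned to the original index, so either one can be divided into the ranks.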
Now you can put everything together
ranks = df.groupby('user_id')['Timestamp'].rank(method='first')
counts = df['user_id'].map(df.groupby('user_id')['Timestamp'].apply(len))
(ranks / counts) > 0.8
Out[55]:
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 True
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 True
17 True
dtype: bool
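In the boolean mask above, True marks the newest rows of each user, i.e. the test set. To actually produce the two DataFrames, the mask can be used for indexing. Here is a minimal sketch of that last step; the helper name `split_per_user`, its parameters, and the toy data are assumptions made for illustration and are not part of the original answer:

```python
import pandas as pd

def split_per_user(df, user_col="user_id", time_col="Timestamp", train_frac=0.8):
    """Per-user chronological split: the oldest train_frac of rows go to train."""
    # Rank each user's rows from oldest (1) to newest (n); ties keep row order.
    ranks = df.groupby(user_col)[time_col].rank(method="first")
    # Number of rows per user, mapped back onto every row of that user.
    counts = df[user_col].map(df.groupby(user_col)[time_col].apply(len))
    # Rows whose scaled rank exceeds train_frac are the newest ones.
    is_test = (ranks / counts) > train_frac
    return df[~is_test], df[is_test]

# Hypothetical toy data: user 1 rated 10 items, user 2 rated 5.
df = pd.DataFrame({
    "user_id": [1] * 10 + [2] * 5,
    "item_id": range(15),
    "rating": [5] * 15,
    "Timestamp": list(range(100, 110)) + list(range(200, 205)),
})

train, test = split_per_user(df)
print(len(train), len(test))
```

With this toy data, user 1 keeps the 8 oldest rows in train and the 2 newest in test, which matches the split described in the question. For users whose row count is not a multiple of 5, the share that lands in test will only approximate 20%.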