Split dataset per user according to timestamp in training and test set in python

Tags: python, pandas

I am using the MovieLens dataset (ratings.dat) and a pandas DataFrame to read and process the data. I have to split this data into a training set and a test set. Using the pandas DataFrame.sample function, I can divide the data into random splits. For example:

train = df.sample(frac=0.8,random_state=200)

test = df.drop(train.index)

Now I am trying to sort the data on user_id and then on timestamp, and I need to divide each user's data 80%-20% into the training set and test set respectively.

So, for example, if user1 rated 10 movies, then the entries for this user should be sorted from oldest to latest according to timestamp:

ratings = pd.read_csv('filename', sep='\t', engine='python', header=0)

sorted_df = ratings.sort_values(['user_id', 'timestamp'], ascending=[True, True])

and the splitting should be in such a way that the first 8 entries with oldest timestamp will be in training set and the latest 2 entries will be in the test set.

I have no idea how to do that. Any suggestions?

Thanks

Data:

           user_id   item_id   rating   Timestamp 
15              1      539        5  838984068
16              1      586        5  838984068
5               1      355        5  838984474
9               1      370        5  838984596
12              1      466        5  838984679
14              1      520        5  838984679
19              1      594        5  838984679
7               1      362        5  838984885
20              1      616        5  838984941
23              2      260        5  868244562
29              2      733        3  868244562
32              2      786        3  868244562
36              2     1073        3  868244562
33              2      802        2  868244603
38              2     1356        3  868244603
30              2      736        3  868244698
31              2      780        3  868244698
27              2      648        2  868244699

asked Feb 22 '17 15:02

ssh26

1 Answer

It requires multiple steps, but can be achieved as follows.

The intuition is to generate a rank according to the timestamp and constrain it between 0 and 1. Then everything at or below 0.8 will be your train set, and everything above it your test set.

How do we do this? Creating the rank is as easy as this:

df.groupby('user_id')['Timestamp'].rank(method='first')
Out[51]: 
0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     6.0
6     7.0
7     8.0
8     9.0
9     1.0
10    2.0
11    3.0
12    4.0
13    5.0
14    6.0
15    7.0
16    8.0
17    9.0
Name: Timestamp, dtype: float64
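
As a side note, the groupby rank method also accepts pct=True, which returns the rank already divided by the group size. The sketch below is only an illustration on a small made-up frame (the values are not from the question's data) and shows the normalised rank per user:

```python
import pandas as pd

# Made-up sample: two users with unsorted timestamps (illustrative values only)
df = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 2, 2],
    "Timestamp": [30, 10, 20, 40, 7, 5],
})

# method="first" breaks ties by order of appearance;
# pct=True scales each rank into (0, 1] within its user group
pct_rank = df.groupby("user_id")["Timestamp"].rank(method="first", pct=True)
print(pct_rank)
```

With this variant, `pct_rank > 0.8` would give the test-set mask directly, without a separate count per user.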

Then you need to create a mapping of how many values are in each group. You can find additional information here: Inplace transformation pandas with groupby.

df['user_id'].map(df.groupby('user_id')['Timestamp'].apply(len))
Out[52]: 
0     9
1     9
2     9
3     9
4     9
5     9
6     9
7     9
8     9
9     9
10    9
11    9
12    9
13    9
14    9
15    9
16    9
17    9
Name: user_id, dtype: int64
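
The same per-row group size can probably be obtained more directly with transform('size'), which returns a result aligned with the original index. This is a sketch on a made-up frame (not the question's data) comparing the two approaches:

```python
import pandas as pd

# Made-up sample: user 1 has three ratings, user 2 has two
df = pd.DataFrame({
    "user_id":   [1, 1, 1, 2, 2],
    "Timestamp": [3, 1, 2, 9, 8],
})

# Approach from the answer: count per group, then map back onto each row
counts_map = df["user_id"].map(df.groupby("user_id")["Timestamp"].apply(len))

# Alternative: transform broadcasts the group size to every row in one step
counts_transform = df.groupby("user_id")["Timestamp"].transform("size")

print(counts_transform.tolist())
```

Both give one count per row, so either can be used as the denominator in the next step.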

Now you can put everything together:

ranks = df.groupby('user_id')['Timestamp'].rank(method='first')
counts = df['user_id'].map(df.groupby('user_id')['Timestamp'].apply(len))
(ranks / counts) > 0.8
Out[55]: 
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8      True
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16     True
17     True
dtype: bool
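
To actually obtain the two sets, the boolean mask can be used to index the DataFrame. Below is a minimal sketch of that last step wrapped in a helper; the function name split_per_user, its parameters, and the sample data are assumptions made for illustration, not part of the original answer:

```python
import pandas as pd

def split_per_user(df, user_col="user_id", time_col="Timestamp", train_frac=0.8):
    """Hypothetical helper: per-user chronological train/test split."""
    # Rank each user's rows from oldest (1) to newest (n)
    ranks = df.groupby(user_col)[time_col].rank(method="first")
    # Number of rows per user, aligned with the original index
    counts = df.groupby(user_col)[time_col].transform("size")
    # Rows whose normalised rank exceeds train_frac go to the test set
    is_test = (ranks / counts) > train_frac
    return df[~is_test], df[is_test]

# Made-up sample: 5 ratings per user, deliberately not sorted by time
ratings = pd.DataFrame({
    "user_id":   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "item_id":   [11, 12, 13, 14, 15, 21, 22, 23, 24, 25],
    "rating":    [5, 4, 3, 5, 2, 3, 3, 4, 1, 5],
    "Timestamp": [50, 10, 40, 20, 30, 5, 4, 3, 2, 1],
})

train, test = split_per_user(ratings)
print(len(train), len(test))
print(test)
```

Because the ranking is computed inside groupby, the frame does not need to be sorted beforehand. Note that a user whose row count is not a multiple of 5 will not split exactly 80/20; with 9 ratings, as in the question's sample, 7 rows go to train and 2 to test.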

answered Oct 06 '22 04:10

Alessandro Mariani
