I know that train_test_split splits the data randomly, but I need to know how to split it based on time.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# this splits the data randomly as 67% train and 33% test
How can I split the same data set based on time as 67% train and 33% test? The dataset has a column TimeStamp.
I searched through similar questions but was not sure which approach to use.
Can someone explain briefly?
Train/test splits in time series
For example, if you had 144 records at monthly intervals (12 years), a good approach would be to keep the first 120 records (10 years) for training and the last 24 records (2 years) for testing. And that's all there is to train/test splits.
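The 144-record example can be sketched in a few lines. This is only an illustration: `records` is a hypothetical stand-in for the monthly observations and is assumed to be in chronological order already.

```python
# Hypothetical data: 144 monthly observations, oldest first.
records = list(range(144))  # stand-in values; index 0 is the oldest month

n_train = 120  # the first 10 years go to training
train = records[:n_train]
test = records[n_train:]  # the last 2 years go to testing

print(len(train), len(test))  # 120 24
```

Every test record is later than every training record, so the model is evaluated only on data from after the period it was trained on.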
In general, putting 80% of the data in the training set, 10% in the validation set, and 10% in the test set is a good split to start with. The optimal split between the training, validation, and test sets depends on factors such as the use case, the structure of the model, and the dimensionality of the data.
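For time-ordered data, an 80/10/10 split could be done by slicing rather than sampling. The helper below is a minimal sketch: `chronological_split` and its parameter names are hypothetical, and it assumes the input is already sorted by time.

```python
def chronological_split(rows, train_frac=0.8, val_frac=0.1):
    """Split a time-ordered sequence into train, validation and test parts."""
    n = len(rows)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    # Whatever is left after train and validation becomes the test set.
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


rows = list(range(1000))  # hypothetical records, oldest first
train, val, test = chronological_split(rows)
print(len(train), len(val), len(test))  # 800 100 100
```

The fractions are plain arguments, so other ratios such as 67/0/33 can be tried without changing the function.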
The simplest way to split the modelling dataset into training and testing sets is to assign two-thirds of the data points to the former and the remaining one-third to the latter. We then train the model on the training set and apply it to the test set. In this way, we can evaluate the performance of our model.
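If the rows are already in time order, one option is to keep using train_test_split and turn shuffling off with its `shuffle=False` parameter, which takes the test set from the end of the data. The arrays below are hypothetical placeholders, not the asker's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data, assumed to be sorted by time already (oldest first).
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# shuffle=False keeps the original order: the first part becomes the
# training set and the last part becomes the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, shuffle=False
)

print(len(X_train), len(X_test))
```

`random_state` is left out because it has no effect when nothing is shuffled.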
Separating data into training and testing sets is an important part of evaluating data mining models. Typically, when you separate a data set into a training set and testing set, most of the data is used for training, and a smaller portion of the data is used for testing.
One easy way to do it:
First: sort the data by time.
Second: split it at the 67% mark:
import numpy as np
# everything before the cut index is train, everything from it onward is test
train_set, test_set = np.split(data, [int(0.67 * len(data))])
That gives a train_set with the first 67% of the data and a test_set with the remaining 33%.
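Both steps can be put together with pandas. This is a sketch under assumptions: the frame contents are made up, only the TimeStamp column name comes from the question, and `iloc` slicing is used here as an alternative way to cut at the same index.

```python
import pandas as pd

# Hypothetical frame; rows are deliberately out of order.
data = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "2021-03-01", "2021-01-01", "2021-02-01",
        "2021-06-01", "2021-04-01", "2021-05-01",
    ]),
    "value": [3, 1, 2, 6, 4, 5],
})

# First: sort by time so that position in the frame reflects chronology.
data = data.sort_values("TimeStamp").reset_index(drop=True)

# Second: cut at the 67% mark.
split_at = int(0.67 * len(data))
train_set = data.iloc[:split_at]
test_set = data.iloc[split_at:]

print(len(train_set), len(test_set))  # 4 2
```

Because `int()` truncates, the training share is rounded down when the row count does not divide evenly.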