
Splitting data using time-based splitting in test and train datasets

I know that train_test_split splits it randomly, but I need to know how to split it based on time.

  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
  # this splits the data randomly into 67% train and 33% test

How can I split the same dataset by time instead, with the earliest 67% as train and the latest 33% as test? The dataset has a column TimeStamp.

I searched through similar questions but was not sure which approach to use.

Can someone explain briefly?

asked Jun 15 '18 17:06 by dhruv bhardwaj




1 Answer

One easy way to do it:

First, sort the data by time.

Second, split the sorted data at the 67% mark:

import numpy as np

# data must already be sorted by time; the split index is the 67% mark
train_set, test_set = np.split(data, [int(0.67 * len(data))])

This puts the first 67% of the rows in train_set and the remaining 33% in test_set.
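
For completeness, here is a minimal sketch of both steps together, assuming the data is a pandas DataFrame. Only the TimeStamp column name comes from the question; the toy rows, the value column, and the split_at variable are made up for illustration. It slices with iloc rather than np.split, which is just an alternative way to cut at the same index.

```python
import pandas as pd

# Hypothetical toy data, deliberately out of order.
# Only the "TimeStamp" column name is taken from the question.
data = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "2018-01-05", "2018-01-01", "2018-01-03",
        "2018-01-02", "2018-01-06", "2018-01-04",
    ]),
    "value": [50, 10, 30, 20, 60, 40],
})

# Step 1: sort chronologically so earlier rows come first.
data = data.sort_values("TimeStamp").reset_index(drop=True)

# Step 2: cut at the 67% mark; everything before it is train,
# everything from it onward is test.
split_at = int(0.67 * len(data))
train_set = data.iloc[:split_at]
test_set = data.iloc[split_at:]

print(len(train_set), len(test_set))
```

Because the rows are sorted before the cut, every timestamp in train_set is earlier than every timestamp in test_set, so the model is never evaluated on data older than what it was trained on. If you need X and y separately, you can select the feature and target columns from train_set and test_set afterwards.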

answered Sep 23 '22 07:09 by zetadaro