
How to get a non-shuffled train_test_split in sklearn


If I want a random train/test split, I use the sklearn helper function:

In [1]: from sklearn.model_selection import train_test_split
   ...: train_test_split([1,2,3,4,5,6])
   ...:
Out[1]: [[1, 6, 4, 2], [5, 3]]

What is the most concise way to get a non-shuffled train/test split, i.e.

[[1,2,3,4], [5,6]] 

EDIT Currently I am using

train, test = data[:int(len(data) * 0.75)], data[int(len(data) * 0.75):]  

but hoping for something a little nicer. I have opened an issue on sklearn https://github.com/scikit-learn/scikit-learn/issues/8844
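The slicing expression above computes the cut index twice. A minimal pure-Python sketch that computes it once is shown here; the name `ordered_split` and the `train_fraction` parameter are hypothetical, chosen only for illustration, and are not part of scikit-learn.

```python
def ordered_split(data, train_fraction=0.75):
    """Split a sequence into (train, test) without shuffling.

    Hypothetical helper: the first `train_fraction` of the items
    (rounded down) go to train, the remainder to test.
    """
    cut = int(len(data) * train_fraction)  # index of the first test item
    return data[:cut], data[cut:]


train, test = ordered_split([1, 2, 3, 4, 5, 6])
print(train, test)
```

This works for anything that supports `len` and slicing, such as lists, tuples and NumPy arrays.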

EDIT 2: My PR has been merged. As of scikit-learn version 0.19, you can pass the parameter shuffle=False to train_test_split to obtain a non-shuffled split.

Asked by maxymoo, May 08 '17 00:05



1 Answer

I'm not adding much to Psidom's answer except an easy-to-copy-paste function:

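import numpy as np

def non_shuffling_train_test_split(X, y, test_size=0.2):
    # First test row; the "+ 1" keeps one more row in train than truncation would.
    i = int((1 - test_size) * X.shape[0]) + 1
    # np.split at a single index returns [rows before i, rows from i onward].
    X_train, X_test = np.split(X, [i])
    y_train, y_test = np.split(y, [i])
    return X_train, X_test, y_train, y_test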

Update: At some point this feature became built in, so now you can do:

from sklearn.model_selection import train_test_split
train_test_split(X, y, test_size=0.2, shuffle=False)
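
As a quick check that row order is preserved, here is a small sketch using made-up data (the arrays `X` and `y` below are illustrative assumptions, not from the question). It assumes scikit-learn 0.19 or later, where `shuffle=False` is available.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 10 samples with 2 features each, labels 0..9.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# With shuffle=False the first rows go to train and the last rows to test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)

print(y_train)  # the first 8 labels, in their original order
print(y_test)   # the last 2 labels
```

Note that `stratify` cannot be combined with `shuffle=False`; scikit-learn raises an error if both are given.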
Answered by Anake, Oct 03 '22 22:10