How to split/partition a dataset into training and test datasets for, e.g., cross validation?


If you want to split the data set once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices (remember to fix the random seed to make everything reproducible):

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80,:], x[80:,:]

import numpy
# x is your dataset
x = numpy.random.rand(100, 5)
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx,:], x[test_idx,:]
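The advice above about fixing the random seed can be sketched as follows. This is a minimal illustration, assuming NumPy's Generator API (numpy.random.default_rng) rather than the global functions used above; the seed value 0 and the 80/20 cut are arbitrary choices for the example.

```python
import numpy as np

# A seeded generator: the same seed always produces the same permutation,
# so the split can be reproduced exactly. The seed 0 is arbitrary.
rng = np.random.default_rng(0)

# x stands in for your dataset (100 rows, 5 columns)
x = rng.random((100, 5))

# Shuffle the row indices, then cut them 80/20
indices = rng.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx], x[test_idx]
```

Re-running the script with the same seed yields the same training_idx and test_idx, which makes results comparable between runs.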

There are many other ways to repeatedly partition the same data set for cross validation. Many of those are available in the sklearn library (k-fold, leave-n-out, ...). sklearn also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that there is the same proportion of positive and negative examples in the training and test set.
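As a rough sketch of the repeated partitioning mentioned above, the snippet below uses KFold and StratifiedKFold from sklearn.model_selection. The toy arrays X and y, the number of folds, and the seed are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data: 20 samples, 2 features, two balanced classes (illustrative only)
X = np.arange(40).reshape((20, 2))
y = np.array([0] * 10 + [1] * 10)

# Plain k-fold: every sample lands in the test fold exactly once
kf = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_test_sizes = [len(test_idx) for _, test_idx in kf.split(X)]

# Stratified k-fold: each test fold keeps the class proportions of y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]

print(kfold_test_sizes)
print(positives_per_fold)
```

Each iteration of split() yields a pair of index arrays (train, test), so a model can be fitted and scored inside the loop, once per fold.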

Another option is to use scikit-learn directly. As its documentation describes, you can do the following:

import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.20, random_state=42)

This keeps the labels in sync with the data rows as both are split into training and test sets.

Just a note. In case you want train, test, AND validation sets, you can do this:

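# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

X = get_my_X()  # placeholder: load your feature matrix here
y = get_my_y()  # placeholder: load your labels here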
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

With these parameters, 70% of the data goes to training and 15% each to the test and validation sets.
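To check the 70/15/15 arithmetic, here is a small self-contained sketch of the same two-step split. The synthetic X and y (100 samples) and the random_state values are assumptions for the example, not part of the answer above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and labels: 100 samples
X = np.arange(200).reshape((100, 2))
y = np.arange(100)

# Step 1: hold out 30% of the data (30 samples), keep 70% for training
x_train, x_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Step 2: split the held-out 30% in half: 15% test, 15% validation
x_test, x_val, y_test, y_val = train_test_split(
    x_rest, y_rest, test_size=0.5, random_state=0)

print(len(x_train), len(x_test), len(x_val))
```

The second call works on the held-out part only, so test_size=0.5 there corresponds to 15% of the full data set.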

The sklearn.cross_validation module was deprecated in scikit-learn 0.18 and removed in 0.20; import train_test_split from sklearn.model_selection instead:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(10).reshape((5, 2)), range(5)

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

You may also consider a stratified split into training and testing sets. A stratified split also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.

import numpy as np  

def get_train_test_inds(y,train_proportion=0.7):
    '''Generates indices, making random stratified split into training set and testing sets
    with proportions train_proportion and (1-train_proportion) of initial sample.
    y is any iterable indicating classes of each observation in the sample.
    Initial proportions of classes inside training and 
    testing sets are preserved (stratified sampling).
    '''

    y=np.array(y)
    train_inds = np.zeros(len(y),dtype=bool)
    test_inds = np.zeros(len(y),dtype=bool)
    values = np.unique(y)
    for value in values:
        value_inds = np.nonzero(y==value)[0]
        np.random.shuffle(value_inds)
        n = int(train_proportion*len(value_inds))

        train_inds[value_inds[:n]]=True
        test_inds[value_inds[n:]]=True

    return train_inds,test_inds

y = np.array([1,1,2,2,3,3])
train_inds,test_inds = get_train_test_inds(y,train_proportion=0.5)
print(y[train_inds])
print(y[test_inds])

This code outputs:

[1 2 3]
[1 2 3]
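For comparison, scikit-learn's train_test_split accepts a stratify argument that performs a stratified split without a hand-written helper. The sketch below reuses the labels from the example above; the feature matrix X and the random_state are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same labels as above, with a made-up feature matrix (one row per label)
y = np.array([1, 1, 2, 2, 3, 3])
X = np.arange(12).reshape((6, 2))

# stratify=y asks for class proportions to be preserved in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

print(sorted(y_train))
print(sorted(y_test))
```

With two samples per class and an even split, each part should end up with one sample of every class, in random order.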
