
Train test split without using scikit learn

I have a house price prediction dataset. I have to split the dataset into train and test.
I would like to know if it is possible to do this by using numpy or scipy?
I cannot use scikit learn at this moment.

asked Nov 09 '17 by CODE_DIY

People also ask

How do you split a dataset randomly in python?

The simplest way to split the modelling dataset into training and testing sets is to assign two-thirds of the data points to the training set and the remaining one-third to the test set. We then train the model on the training set and apply it to the test set; this way, we can evaluate the model's performance on unseen data.
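For instance, a minimal sketch of such a two-thirds/one-third split using only numpy (the array X and the seed value are illustrative assumptions, not from the question):

import numpy as np

X = np.random.rand(100, 5)            # placeholder dataset: 100 samples, 5 features
rng = np.random.default_rng(seed=42)  # fixed seed makes the split reproducible
indices = rng.permutation(len(X))     # shuffled row indices
n_train = int(len(X) * 2 / 3)
train, test = X[indices[:n_train]], X[indices[n_train:]]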

What is the best train test split?

In general, putting 80% of the data in the training set, 10% in the validation set, and 10% in the test set is a good split to start with. The optimal split between train, validation, and test sets depends on factors such as the use case, the structure of the model, the dimensionality of the data, etc.
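The same index-slicing idea extends to a three-way split by cutting the shuffled indices at two points (again, X and the 80/10/10 ratios are just the illustrative defaults mentioned above):

import numpy as np

X = np.random.rand(100, 5)            # placeholder dataset
idx = np.random.default_rng(seed=0).permutation(len(X))
n_train = int(0.8 * len(X))
n_val = int(0.1 * len(X))
train = X[idx[:n_train]]
val = X[idx[n_train:n_train + n_val]]
test = X[idx[n_train + n_val:]]       # remaining ~10%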


3 Answers

I know that your question was only about doing a train_test_split with numpy or scipy, but there is actually a very simple way to do it with pandas:

import pandas as pd

# df is your full dataset (a pandas DataFrame)
# Shuffle the rows; frac=1 returns every row in random order
shuffle_df = df.sample(frac=1)

# Define a size for your train set (70% of the rows here)
train_size = int(0.7 * len(df))

# Split your dataset; integer slices select rows by position
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]

For those who would like a fast and easy solution.
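If you need the shuffle to be reproducible, or you also want separate feature/target frames, a small extension of the same idea (the random_state value and the "prices" column name are assumptions based on the question's house-price data):

shuffle_df = df.sample(frac=1, random_state=42)  # fixed seed => same shuffle every run
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]

# split off the target column, assuming the label is named "prices"
X_train, y_train = train_set.drop(columns=["prices"]), train_set["prices"]
X_test, y_test = test_set.drop(columns=["prices"]), test_set["prices"]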

answered Sep 27 '22 by Antoine Krajnc


Although this is an old question, this answer might help.

This is similar to how sklearn implements train_test_split; the method given below takes arguments much like sklearn's.

import numpy as np
from itertools import chain

def _indexing(x, indices):
    """
    :param x: array from which the indices have to be fetched
    :param indices: indices to be fetched
    :return: sub-array from given array and indices
    """
    # np array indexing
    if hasattr(x, 'shape'):
        return x[indices]

    # list indexing
    return [x[idx] for idx in indices]

def train_test_split(*arrays, test_size=0.25, shuffle=True, random_seed=1):
    """
    splits array into train and test data.
    :param arrays: arrays to split in train and test
    :param test_size: size of test set in range (0,1)
    :param shuffle: whether to shuffle arrays or not
    :param random_seed: random seed value
    :return: 2*len(arrays) arrays, split into train and test
    """
    # checks
    assert 0 < test_size < 1
    assert len(arrays) > 0
    length = len(arrays[0])
    for i in arrays:
        assert len(i) == length

    n_test = int(np.ceil(length*test_size))
    n_train = length - n_test

    if shuffle:
        perm = np.random.RandomState(random_seed).permutation(length)
        test_indices = perm[:n_test]
        train_indices = perm[n_test:]
    else:
        train_indices = np.arange(n_train)
        test_indices = np.arange(n_train, length)

    return list(chain.from_iterable((_indexing(x, train_indices), _indexing(x, test_indices)) for x in arrays))

Of course sklearn's implementation supports stratified k-fold splitting, splitting of pandas Series, etc. This one only works for splitting lists and numpy arrays, which I think should work in your case.
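A quick usage sketch of the function above (the toy X and y below are illustrative):

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = list(range(10))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_seed=7)
print(X_train.shape, X_test.shape, len(y_train), len(y_test))
# -> (8, 2) (2, 2) 8 2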

answered Sep 27 '22 by Vivek Mehta


import numpy as np
import pandas as pd

X_data = pd.read_csv('house.csv')
Y_data = X_data["prices"]
X_data.drop(["offers", "brick", "bathrooms", "prices"], 
            axis=1, inplace=True) # important to drop prices as well

# create a random train/test split
indices = list(range(X_data.shape[0]))  # range() must be a list before shuffling
num_training_instances = int(0.8 * X_data.shape[0])
np.random.shuffle(indices)              # shuffles the list in place
train_indices = indices[:num_training_instances]
test_indices = indices[num_training_instances:]

# split the actual data
X_data_train, X_data_test = X_data.iloc[train_indices], X_data.iloc[test_indices]
Y_data_train, Y_data_test = Y_data.iloc[train_indices], Y_data.iloc[test_indices]

This assumes you want a random split. What happens is that we create a list of indices as long as the number of data points, i.e. the length of the first axis of X_data (or Y_data). We then put them in random order and take the first 80% of those shuffled indices as training data and the rest for testing. [:num_training_instances] just selects the first num_training_instances entries from the list. After that you extract the rows from your data using the two lists of indices, and your data is split. Remember to drop the prices from your X_data, and set a seed if you want the split to be reproducible (np.random.seed(some_integer) at the beginning).
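As a quick sanity check on the resulting split (a sketch; the seed value and asserts are only illustrative):

np.random.seed(42)  # set once at the top for a reproducible shuffle
# ... after building train_indices and test_indices as above ...
assert set(train_indices).isdisjoint(test_indices)                # no overlap
assert len(train_indices) + len(test_indices) == X_data.shape[0]  # full coverage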

answered Sep 27 '22 by Jens Petersen