I have a house price prediction dataset that I have to split into train and test sets. Is it possible to do this using numpy or scipy? I cannot use scikit-learn at the moment.
The simplest way to split the modelling dataset into training and testing sets is to assign two-thirds of the data points to the former and the remaining one-third to the latter. We then train the model on the training set and apply it to the test set; in this way we can evaluate the performance of our model.
In general, putting 80% of the data in the training set, 10% in the validation set, and 10% in the test set is a good split to start with. The optimal split between train, validation, and test sets depends on factors such as the use case, the structure of the model, the dimensionality of the data, etc.
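Since the question asks for numpy or scipy, here is a minimal sketch of such an 80/10/10 split using nothing but numpy. It assumes X and y are numpy arrays holding your features and house prices; the names, seed, and ratios are placeholders, so adjust them to your data:
import numpy as np
# X: feature matrix, y: target vector (house prices); both assumed to be numpy arrays
rng = np.random.default_rng(42)      # seed only for reproducibility
perm = rng.permutation(len(X))       # random ordering of the row indices
n_train = int(0.8 * len(X))          # 80% train
n_val = int(0.1 * len(X))            # 10% validation, the remainder is test
train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]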
I know that your question was only about doing a train_test_split with numpy or scipy, but there is actually a very simple way to do it with pandas:
import pandas as pd
# df is your DataFrame of house data, e.g. df = pd.read_csv('house.csv')
# Shuffle your dataset
shuffle_df = df.sample(frac=1)
# Define a size for your train set
train_size = int(0.7 * len(df))
# Split your dataset
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]
For those who would like a fast and easy solution.
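One small note on this approach: df.sample(frac=1) draws a different shuffle on every run, so if you want the same split each time, pass a random_state (the seed value here is arbitrary):
shuffle_df = df.sample(frac=1, random_state=42)  # fixed seed makes the shuffle reproducible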
Although this is an old question, this answer might help.
This is how sklearn implements train_test_split; the method given below takes similar arguments to sklearn's.
import numpy as np
from itertools import chain


def _indexing(x, indices):
    """
    :param x: array from which indices have to be fetched
    :param indices: indices to be fetched
    :return: sub-array from given array and indices
    """
    # np array indexing
    if hasattr(x, 'shape'):
        return x[indices]
    # list indexing
    return [x[idx] for idx in indices]


def train_test_split(*arrays, test_size=0.25, shuffle=True, random_seed=1):
    """
    Splits arrays into train and test data.
    :param arrays: arrays to split into train and test
    :param test_size: size of the test set in range (0, 1)
    :param shuffle: whether to shuffle the arrays or not
    :param random_seed: random seed value
    :return: list of length 2 * len(arrays), with each input array split into train and test
    """
    # checks
    assert 0 < test_size < 1
    assert len(arrays) > 0
    length = len(arrays[0])
    for i in arrays:
        assert len(i) == length

    n_test = int(np.ceil(length * test_size))
    n_train = length - n_test

    if shuffle:
        perm = np.random.RandomState(random_seed).permutation(length)
        test_indices = perm[:n_test]
        train_indices = perm[n_test:]
    else:
        train_indices = np.arange(n_train)
        test_indices = np.arange(n_train, length)

    return list(chain.from_iterable((_indexing(x, train_indices), _indexing(x, test_indices)) for x in arrays))
Of course sklearn's implementation supports stratified k-fold splitting, splitting of pandas Series, etc. This one only works for splitting lists and numpy arrays, which I think will work for your case.
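As a quick usage sketch (the data here is made up just to show the call), with two arrays it unpacks in the same order as sklearn's version:
# hypothetical data: 100 houses with 5 features each and one price per house
X = np.random.rand(100, 5)
y = np.random.rand(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_seed=42)
print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)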
import numpy as np
import pandas as pd

X_data = pd.read_csv('house.csv')
Y_data = X_data["prices"]
X_data.drop(["offers", "brick", "bathrooms", "prices"],
            axis=1, inplace=True)  # important to drop prices as well

# create random train/test split
indices = list(range(X_data.shape[0]))  # a plain range object cannot be shuffled in place
num_training_indices = int(0.8 * X_data.shape[0])
np.random.shuffle(indices)
train_indices = indices[:num_training_indices]
test_indices = indices[num_training_indices:]

# split the actual data
X_data_train, X_data_test = X_data.iloc[train_indices], X_data.iloc[test_indices]
Y_data_train, Y_data_test = Y_data.iloc[train_indices], Y_data.iloc[test_indices]
This assumes you want a random split. What happens is that we're creating a list of indices as long as the number of data points you have, i.e. the first axis of X_data (or Y_data). We then put them in random order and just take the first 80% of those random indices as training data and the rest for testing. [:num_training_indices] simply selects the first num_training_indices entries from the list. After that you just extract the rows from your data using the lists of random indices, and your data is split. Remember to drop the prices from your X_data, and set a seed if you want the split to be reproducible (np.random.seed(some_integer) at the beginning).
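Concretely, for a reproducible split the seed only has to be set once before the shuffle; something like this at the top of the script (the seed value is arbitrary):
np.random.seed(0)  # fix the seed so np.random.shuffle gives the same order every run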