How to split data into trainset and testset randomly?

Tags:

python

file-io

I have a large dataset and want to split it into a training set (50%) and a testing set (50%).

Say I have 100 examples stored in the input file, one example per line. I need to choose 50 lines as the training set and the other 50 lines as the testing set.

My idea is to first generate a random permutation of the numbers 1 to 100, then use the first 50 elements as the line numbers of the training examples and the remaining 50 as the line numbers of the testing examples.

This can be done easily in Matlab:

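% train_file and test_file are assumed to be file IDs already opened with fopen
fid = fopen(datafile);
C = textscan(fid, '%s', 'delimiter', '\n');
fclose(fid);
lines = C{1};            % textscan returns a 1x1 cell holding the cell array of lines
plist = randperm(100);   % random permutation of 1..100
for i = 1:50
    fprintf(train_file, '%s\n', lines{plist(i)});
end
for i = 51:100
    fprintf(test_file, '%s\n', lines{plist(i)});
end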

But how can I accomplish this in Python? I'm new to Python and don't know whether I can read the whole file into an array and select certain lines.

Freya Ren asked Oct 15 '22 12:10


1 Answer

This can be done similarly in Python using lists (note that the whole list is shuffled in place).

import random

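# Open in text mode (not "rb") so each line is a str; splitlines() also
# avoids the empty last element that split('\n') leaves after a trailing newline.
with open("datafile.txt") as f:
    data = f.read().splitlines()

random.shuffle(data)  # shuffles the list in place

half = len(data) // 2  # 50 for a 100-line file
train_data = data[:half]
test_data = data[half:]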
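
The question also wants the two halves written to separate files, as the Matlab loop does. Here is a minimal sketch of that step, not a definitive implementation. The helper names `split_lines` and `split_file`, the `train_fraction` and `seed` parameters, and the output file names are hypothetical and chosen only for illustration. It assumes the whole file fits in memory.

```python
import random


def split_lines(lines, train_fraction=0.5, seed=None):
    """Return (train, test) lists drawn at random from `lines`."""
    rng = random.Random(seed)   # private generator; a fixed seed repeats the same split
    shuffled = list(lines)      # copy, so the caller's sequence is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def split_file(in_path, train_path, test_path, train_fraction=0.5, seed=None):
    """Read one example per line from in_path and write the two subsets."""
    with open(in_path) as f:
        examples = f.read().splitlines()
    train, test = split_lines(examples, train_fraction, seed)
    for path, subset in ((train_path, train), (test_path, test)):
        with open(path, "w") as out:
            out.writelines(line + "\n" for line in subset)
    return len(train), len(test)
```

A call such as `split_file("datafile.txt", "train.txt", "test.txt", seed=0)` would then write half of the lines to each output file. Passing a seed is optional; it is included here only so that the same split can be reproduced later.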
ijmarshall answered Oct 21 '22 08:10