I have a large dataset and want to split it into a training set (50%) and a testing set (50%).
Say I have 100 examples stored in the input file, one example per line. I need to choose 50 lines as the training set and the other 50 lines as the testing set.
My idea is to first generate a random permutation of length 100 (values ranging from 1 to 100), then use the first 50 elements as the line numbers of the 50 training examples and the remaining 50 as the line numbers of the testing examples.
This can be achieved easily in Matlab:
fid = fopen(datafile);
C = textscan(fid, '%s', 'delimiter', '\n');   % read every line into a cell array
fclose(fid);
plist = randperm(100);                        % random permutation of the indices 1..100
for i = 1:50
    trainstring = C{1}{plist(i)};             % the lines live in C{1}
    fprintf(train_file, '%s\n', trainstring); % first 50 indices -> training file
end
for i = 51:100
    teststring = C{1}{plist(i)};
    fprintf(test_file, '%s\n', teststring);   % last 50 indices -> testing file
end
But how can I accomplish this in Python? I'm new to Python and don't know whether I can read the whole file into a list and then pick out certain lines.
scikit-learn offers model_selection.train_test_split, which splits arrays or matrices into random train and test subsets.
In machine learning, data splitting is typically done to avoid overfitting, that is, the situation where a model fits its training data too well and fails to generalize reliably to new data. The original data is therefore usually split into three or four sets.
Don't use the same dataset for model training and model evaluation. If you want to build a reliable machine learning model, you need to split your dataset into training, validation, and test sets. If you don't, your results will be biased, and you'll end up with an overly optimistic impression of your model's accuracy.
The shuffle parameter is needed to prevent non-random assignment to the train and test sets; with shuffle=True the data is split randomly.
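As an illustration, here is a minimal sketch using train_test_split, assuming the examples sit in datafile.txt as in the question; the random_state value is arbitrary and only makes the shuffle reproducible.
from sklearn.model_selection import train_test_split

# Read one example per line from the input file
with open("datafile.txt", "r") as f:
    lines = f.read().splitlines()

# 50/50 split; shuffle=True (the default) randomizes which lines go where
train_lines, test_lines = train_test_split(lines, test_size=0.5, shuffle=True, random_state=42)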
This can be done similarly in Python using a list (note that the whole list is shuffled in place):
import random

# Read the whole file into a list, one example per line (text mode, not binary)
with open("datafile.txt", "r") as f:
    data = f.read().splitlines()

random.shuffle(data)      # shuffle the list of lines in place
train_data = data[:50]    # first 50 shuffled lines -> training set
test_data = data[50:]     # remaining 50 lines -> testing set
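If you also want to write the two halves out to separate files, as the Matlab version does, you can continue from train_data and test_data above; the output file names here are just placeholders.
# Write each half to its own file, one example per line
# (train.txt and test.txt are hypothetical output names)
with open("train.txt", "w") as train_file:
    train_file.write("\n".join(train_data) + "\n")
with open("test.txt", "w") as test_file:
    test_file.write("\n".join(test_data) + "\n")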