I have a large dataset and want to split it into a training set (50%) and a testing set (50%).
Say I have 100 examples stored in the input file, one example per line. I need to choose 50 lines as the training set and the other 50 lines as the testing set.
My idea is to first generate a random permutation of length 100 (values ranging from 1 to 100), then use the first 50 elements as the line numbers of the 50 training examples and the remaining 50 as the line numbers of the testing examples.
This can be achieved easily in Matlab:
fid = fopen(datafile);
C = textscan(fid, '%s', 'delimiter', '\n');   % read every line into a cell array
fclose(fid);
plist = randperm(100);                        % random permutation of the indices 1..100
for i = 1:50
    trainstring = C{1}{plist(i)};             % the lines live in C{1}
    fprintf(train_file, '%s\n', trainstring); % first 50 indices -> training file
end
for i = 51:100
    teststring = C{1}{plist(i)};
    fprintf(test_file, '%s\n', teststring);   % last 50 indices -> testing file
end
But how can I accomplish this in Python? I'm new to Python and don't know whether I can read the whole file into a list and then pick out certain lines.
scikit-learn offers model_selection.train_test_split, which splits arrays or matrices into random train and test subsets.
In machine learning, data splitting is typically done to avoid overfitting, that is, the situation where a model fits its training data too well and fails to generalize reliably to new data. The original data is therefore usually split into three or four sets.
Don't use the same dataset for model training and model evaluation. If you want to build a reliable machine learning model, you need to split your dataset into training, validation, and test sets. If you don't, your results will be biased, and you'll end up with an overly optimistic impression of your model's accuracy.
The shuffle parameter is needed to prevent non-random assignment to the train and test sets; with shuffle=True the data is split randomly.
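As an illustration, here is a minimal sketch using train_test_split, assuming the examples sit in datafile.txt as in the question; the random_state value is arbitrary and only makes the shuffle reproducible.
from sklearn.model_selection import train_test_split

# Read one example per line from the input file
with open("datafile.txt", "r") as f:
    lines = f.read().splitlines()

# 50/50 split; shuffle=True (the default) randomizes which lines go where
train_lines, test_lines = train_test_split(lines, test_size=0.5, shuffle=True, random_state=42)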
This can be done similarly in Python using a list (note that the whole list is shuffled in place):
import random

# Read the whole file into a list, one example per line (text mode, not binary)
with open("datafile.txt", "r") as f:
    data = f.read().splitlines()

random.shuffle(data)      # shuffle the list of lines in place
train_data = data[:50]    # first 50 shuffled lines -> training set
test_data = data[50:]     # remaining 50 lines -> testing set
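If you also want to write the two halves out to separate files, as the Matlab version does, you can continue from train_data and test_data above; the output file names here are just placeholders.
# Write each half to its own file, one example per line
# (train.txt and test.txt are hypothetical output names)
with open("train.txt", "w") as train_file:
    train_file.write("\n".join(train_data) + "\n")
with open("test.txt", "w") as test_file:
    test_file.write("\n".join(test_data) + "\n")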