I have a fairly large dataset in the form of a dataframe and I was wondering how I would be able to split the dataframe into two random samples (80% and 20%) for training and testing.
Thanks!
If you insist on concatenating the two dataframes, then first add a new column to each DataFrame called source. Set its value to 'test' for the rows from test.csv and likewise to 'train' for the training set. When you have finished cleaning the combined df, use the source column to split the data back apart, as in the sketch below.
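A minimal sketch of that approach, assuming the raw files are named train.csv and test.csv (the file names and cleaning step are placeholders):

import pandas as pd

# Tag each DataFrame with its origin before combining them.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train['source'] = 'train'
test['source'] = 'test'

combined = pd.concat([train, test], ignore_index=True)

# ... clean `combined` here ...

# Recover the two sets from the source column, then drop the helper column.
train_clean = combined[combined['source'] == 'train'].drop(columns='source')
test_clean = combined[combined['source'] == 'test'].drop(columns='source')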
Scikit Learn's train_test_split is a good one. It will split both numpy arrays and dataframes.

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2)
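If you need the split to be reproducible, you can also pass random_state (the value 42 here is just an arbitrary example seed):

# Reproducible 80/20 split of the same DataFrame.
train, test = train_test_split(df, test_size=0.2, random_state=42)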