Someone presented a solution for splitting a dataset into three sets. Where are the labels in this case, and how would I set them?
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
I will answer the question based on comments:
Using this method for splitting:
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
You get three DataFrames: train holds the first 60% of the shuffled rows from df, validate holds the rows between 60% and 80%, and test holds the last 20% (80% to 100%). The labels are still inside each of these DataFrames.
With train_test_split you pass two objects, X and Y, which have most likely already been separated from the original dataset, and you get four objects back: two for train and two for test. Keep this in mind: you first split the dataset into the independent variables (X) and the explained/target variable (Y), and then split these two objects into train and test.
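As a minimal sketch of that order of operations (features and target first, then train and test), assuming a small made-up DataFrame whose target column is called label (both the data and the column name are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; "label" is an assumed name for the target column.
df = pd.DataFrame({
    "feature_a": np.arange(10),
    "feature_b": np.arange(10) * 2.0,
    "label": [0, 1] * 5,
})

# Step 1: separate the independent variables (X) from the target (y).
X = df.drop(columns="label")
y = df["label"]

# Step 2: split both objects into train and test using the same shuffled rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```

X_train and y_train keep the same row index, so each feature row still lines up with its label.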
With np.split you go the other way around: you first split the dataset into three objects (train, validate and test), each of which later needs to be split into the independent variables, commonly known as X, and the target variable, known as Y. You are doing the same splits, just in reverse order.
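The reverse order could look roughly like the sketch below. It assumes the same kind of made-up DataFrame with a target column named label, and split_xy is a hypothetical helper, not a library function. The rows are cut with iloc slices at the same 60% and 80% positions the np.split call uses, which gives the same three pieces and avoids handing a DataFrame to NumPy (some recent pandas versions appear to warn about that).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "label" is an assumed name for the target column.
df = pd.DataFrame({
    "feature_a": np.arange(10),
    "feature_b": np.arange(10) * 2.0,
    "label": [0, 1] * 5,
})

# Step 1: shuffle the rows, then cut them at the 60% and 80% positions.
shuffled = df.sample(frac=1, random_state=0)
n = len(shuffled)
cut_train, cut_validate = int(.6 * n), int(.8 * n)
train = shuffled.iloc[:cut_train]
validate = shuffled.iloc[cut_train:cut_validate]
test = shuffled.iloc[cut_validate:]


def split_xy(frame, target="label"):
    """Hypothetical helper: return (features, target) for one subset."""
    return frame.drop(columns=target), frame[target]


# Step 2: separate features and labels inside each subset.
X_train, y_train = split_xy(train)
X_validate, y_validate = split_xy(validate)
X_test, y_test = split_xy(test)
```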
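However, keep in mind that np.split itself does not shuffle anything: it only cuts at the positions you pass. The split in your snippet is random only because df.sample(frac=1) shuffles the rows first; without it, the three subsets would simply follow the original row order. train_test_split shuffles by default, so you get random train and test subsets without an extra step. np.split, on the other hand, allows more flexibility, for instance, as your example shows, creating more than two subsets.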
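To see what np.split does on its own, here is a small sketch on a plain array of row positions (a stand-in for the DataFrame rows, chosen only for illustration). The cut points 6 and 8 correspond to 60% and 80% of 10 rows.

```python
import numpy as np

rows = np.arange(10)  # stand-in for row positions 0..9

# No shuffle: np.split only cuts, so the pieces keep the original order.
ordered = np.split(rows, [6, 8])

# Shuffle first (the role df.sample(frac=1) plays), then cut at the same positions.
rng = np.random.default_rng(0)
shuffled = np.split(rng.permutation(rows), [6, 8])

print([piece.tolist() for piece in ordered])
print([piece.tolist() for piece in shuffled])
```

The first list always comes out as 0 to 5, then 6 and 7, then 8 and 9; the second has the same sizes but a shuffled assignment of rows.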
Maybe this will help!