 

Train, Test, Validate split Python. Three sets

Someone presented a solution for splitting a dataset into three sets. Where is the label in this case, and how would I set the labels afterwards?

train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

Curious asked Sep 20 '25 03:09


1 Answer

I will answer the question based on comments:

Using this method for splitting:

train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

You get three separate DataFrames. Because df.sample(frac=1) shuffles the rows first, train holds the first 60% of the shuffled rows, validate holds the rows between the 60% and 80% marks, and test holds the last 20%. The labels are still inside these DataFrames, assuming df contained the label column to begin with.

With train_test_split you pass two objects, X and Y, which have most likely already been separated from the original dataset, and you get four objects back: two for train and two for test. Keep this in mind: you first split your dataset into the independent variables and the explained/target variable, and then split those two objects into train and test.
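A minimal sketch of that order of operations, assuming scikit-learn is installed and using a made-up DataFrame whose target column is called `label` (the data, column names, `test_size` and `random_state` values are all assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: two feature columns and a target column named "label".
df = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": [v * 0.5 for v in range(10)],
    "label": [0, 1] * 5,
})

# Step 1: separate the independent variables from the target.
X = df.drop(columns="label")
y = df["label"]

# Step 2: split both objects into train and test in one call.
# Two inputs go in, four outputs come back.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

The rows of `X_train` and `y_train` stay aligned because both were split with the same shuffled indices.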

With np.split you go the other way around: you first split your dataset into three objects (train, validate and test), each of which later needs to be split into the independent variables, commonly known as X, and the target variable, known as Y. You are doing the same splits, just in reverse order.
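A minimal sketch of that reverse order, assuming a made-up DataFrame whose target column is called `label` (the data and column names are hypothetical). It cuts the shuffled frame with `iloc` at the same 60% and 80% positions, which yields the same three pieces as the np.split one-liner:

```python
import pandas as pd

# Hypothetical data: two feature columns and a target column named "label".
df = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": [v * 0.5 for v in range(10)],
    "label": [0, 1] * 5,
})

# Shuffle the rows; random_state is fixed only to make the sketch repeatable.
shuffled = df.sample(frac=1, random_state=0)

# Same cut points as in the question: 60% and 80% of the rows.
first_cut = int(0.6 * len(df))
second_cut = int(0.8 * len(df))
train = shuffled.iloc[:first_cut]
validate = shuffled.iloc[first_cut:second_cut]
test = shuffled.iloc[second_cut:]

# Each subset still contains the label column, so separate X and y afterwards.
X_train, y_train = train.drop(columns="label"), train["label"]
X_validate, y_validate = validate.drop(columns="label"), validate["label"]
X_test, y_test = test.drop(columns="label"), test["label"]

print(len(X_train), len(X_validate), len(X_test))
```

So the labels do not need to be "set" anywhere: they travel with each subset and are pulled out by column name at the end.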

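However, keep in mind that np.split itself does not randomize anything: it only cuts at the positions you pass. In your one-liner the randomness comes from df.sample(frac=1), which shuffles the rows before the cut; without that call the three sets would simply follow the original row order. train_test_split, by contrast, shuffles by default and returns random train and test subsets. On the other hand, np.split allows more flexibility, for instance, as your example shows, creating more than two subsets.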
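To see what np.split and df.sample(frac=1) each contribute, here is a small sketch on a made-up frame (the column names and the `random_state` value are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a target column named "label".
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# np.split only cuts at the given positions; it does not reorder anything.
positions = np.arange(len(df))
first, second, third = np.split(positions, [6, 8])
print(first, second, third)

# df.sample(frac=1) returns every row exactly once, in shuffled order.
# Passing the same random_state twice reproduces the same order.
shuffled_a = df.sample(frac=1, random_state=42)
shuffled_b = df.sample(frac=1, random_state=42)
same_order = shuffled_a.index.equals(shuffled_b.index)
all_rows_kept = sorted(shuffled_a.index) == list(df.index)
print(same_order, all_rows_kept)
```

Fixing `random_state` is optional, but it makes the three-way split repeatable between runs.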


Celius Stingher answered Sep 22 '25 18:09