
How to split training and test sets?

Tags: python, split

Where should we use

X_train,X_test,y_train,y_test= train_test_split(data, test_size=0.3, random_state=42)

and where should we use

train, test = train_test_split(data, test_size=0.3, random_state=0)

The former raises this error:

ValueError: not enough values to unpack (expected 4, got 2)

MSG asked May 30 '18 09:05




2 Answers

Use the first form when you want to split both the features (X) and the labels (y). Use the second form when you only have a single array to split, for example the features alone or a whole dataset.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.3, random_state=42)

It didn't work for you because you didn't provide the label data to train_test_split(). With only one array passed in, the function returns two values, so unpacking them into four names fails. The call above should work; just replace y with your label/target data.
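A minimal sketch of both forms side by side, assuming scikit-learn is installed. The toy X and y below are hypothetical stand-ins for your own features and labels; they are not from the question.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, one label per sample.
X = [[i, 2 * i] for i in range(10)]
y = [i % 2 for i in range(10)]

# One input -> two outputs (train part, test part).
train, test = train_test_split(X, test_size=0.3, random_state=0)

# Two inputs -> four outputs; rows of X and y are shuffled together,
# so each label still belongs to its feature row.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Reproducing the error from the question: one input, four names.
try:
    a, b, c, d = train_test_split(X, test_size=0.3, random_state=42)
except ValueError as err:
    print(err)  # not enough values to unpack (expected 4, got 2)

print(len(train), len(test))
print(len(X_train), len(X_test), len(y_train), len(y_test))
```

With 10 samples and test_size=0.3, each split holds 7 training rows and 3 test rows.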

MrLeeh answered Oct 20 '22 19:10


If you pass one data list, it is split into two:

                             |---data_train
data ----train_test_split()--|
                             |---data_test

If you pass two data lists, EACH of them is split into two, which gives four in total. For each input, the train part is returned first, then the test part:

                                       |---data_train_x
                                       |---data_test_x
data_x, data_y ----train_test_split()--|
                                       |---data_train_y
                                       |---data_test_y

The same holds for n data lists: you get 2n outputs, a train part and a test part for each input.
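The n-list case can be sketched as follows, assuming scikit-learn is installed. The three lists (ids, X, y) are hypothetical examples chosen only to show the output count and order.

```python
from sklearn.model_selection import train_test_split

# Three hypothetical parallel lists with 10 entries each.
ids = list(range(10))
X = [[i, i * i] for i in ids]
y = [i % 2 for i in ids]

# Three inputs -> six outputs, returned as (train, test) pairs in input order.
parts = train_test_split(ids, X, y, test_size=0.3, random_state=0)
ids_train, ids_test, X_train, X_test, y_train, y_test = parts

print(len(parts))  # 6

# All inputs are shuffled with the same indices, so rows stay aligned.
for sample_id, row, label in zip(ids_train, X_train, y_train):
    assert row[0] == sample_id and label == sample_id % 2
```

All inputs must have the same length; otherwise train_test_split raises a ValueError.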

Leoli answered Oct 20 '22 19:10