I am new to ML with Python and was working through my first tutorial. In that tutorial there are some lines of code, and I am having difficulty understanding how they interact with each other.
First, the data is split as follows:
train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3)
My first question: Why do we use validation data over test data? Why not all, train, val, and test? What is the use case for which combination to use?
The next section specifies the ML model and predicts:
model = DecisionTreeRegressor()
model.fit(train_x, train_y)
val_predictions = model.predict(val_x)
My second question: For the model.predict() statement, why do we put val_x in there? Don't we want to predict val_y?
Bonus question: In many tutorials I saw how a StandardScaler is applied. However, in this tutorial it doesn't appear; did some other function already scale the data without having to state it explicitly? Please help.
Q 1.1 : Why do we use validation data over test data? (in above scenario)
train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3)
First of all, the terms validation set and test set are used very loosely in many tutorials, and sometimes interchangeably. It is quite possible to call the above val_x, val_y as test_x, test_y instead.
Q 1.2 : Why not all, train, val, and test? (why the split?)
All our machine learning algorithms will eventually be used on some real-world data (the actual test data). However, after devising an algorithm we want to "test" how well it performs: what its accuracy is, and so on.
But we currently don't have that real-world data, right?
What we do have is the train data! So we cleverly set aside a portion of it (the split) for testing the algorithm later. This test data is used to evaluate performance once the model is ready.
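To make the split concrete, here is a minimal sketch of the same call with toy data (the arrays below are placeholders standing in for the tutorial's X and y, whose real contents we don't see):

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Toy data standing in for the tutorial's X and y.
X = np.arange(20).reshape(10, 2)  # 10 rows, 2 features
y = np.arange(10)                 # one target per row

# Hold out 30% of the rows for evaluation, as in the tutorial.
train_x, val_x, train_y, val_y = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(len(train_x), len(val_x))  # 7 3
```

With 10 rows and test_size=0.3, scikit-learn puts 3 rows aside and keeps 7 for training; the corresponding y values travel with their x rows.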
model = DecisionTreeRegressor()
model.fit(train_x, train_y)
val_predictions = model.predict(val_x) # contains y values predicted by the model
score = model.score(val_x, val_y) # evaluates predicted y against the actual y of the validation data
print(score)
Q 2. : For the model.predict() statement, why do we put val_x in there? Don't we want to predict val_y?
Absolutely right, we want to predict val_y, but the model needs val_x in order to predict y. That's exactly what we pass as the argument to the predict function.
I understand it might be confusing to read model.predict(val_x). So a better way to interpret it is: model, could you please predict from val_x, and return predicted_y.
I say predicted_y and not val_y because the two won't be exactly the same. How much do they differ? That is what the score tells you.
Some Terminology
Q 1.3 : What is the use case for which combination to use?
You will always have train & test, or all three. In your case, however, the test set is just named val.
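When you do want all three sets, a common pattern is to call train_test_split twice. A minimal sketch (the toy arrays and split proportions here are illustrative assumptions, not from the tutorial):

```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# First carve off the final test set (20% of all rows)...
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# ...then split what remains into train and validation
# (25% of the remaining 80% = 20% of the total).
train_x, val_x, train_y, val_y = train_test_split(
    train_x, train_y, test_size=0.25, random_state=0
)

print(len(train_x), len(val_x), len(test_x))  # 12 4 4
```

The test set is split off first so it never influences any decision made while tuning on the validation set.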
BONUS Question : In many tutorials I saw how a StandardScaler is applied. However, in this tutorial it doesn't appear; did some other function already scale the data without having to state it explicitly?
It all depends on your data. If the data is pre-processed and already scaled properly, then a StandardScaler need not be applied. This particular tutorial simply assumes the data is already normalised accordingly.
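If you do want explicit scaling, one idiomatic way is to wrap the scaler and the model in a pipeline, so the same scaling learned on the training data is reapplied at predict time. A minimal sketch with invented toy data (using LinearRegression here purely to illustrate a scale-sensitive model; it is not what the tutorial uses):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
import numpy as np

# Toy data: second feature is on a much larger scale than the first.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# The pipeline standardises features before fitting, and applies the
# same (fitted) scaling automatically inside predict().
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)
print(model.predict([[2.5, 250.0]]))  # approximately [2.5] on this linear toy data
```

Note that decision trees split on one feature at a time, so they are largely insensitive to feature scaling anyway, which is another reason a tree tutorial can skip the scaler.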
1) Validation sets are often used to help you tune hyper-parameters. Because you may fine-tune the model according to its performance on the validation set, the model may become slightly biased to the validation data, even though it isn't directly trained on that data, which is why we keep this separate from the test set. Once you tune the model to your liking based on the validation set, you can evaluate it on your test set to see how well it generalizes.
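The tuning loop described above can be sketched like this, here choosing max_depth for the tutorial's DecisionTreeRegressor on synthetic data (the data, candidate depths, and use of R^2 via model.score are illustrative assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
import numpy as np

# Synthetic data: y depends mostly on the first feature, plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=200)

train_x, val_x, train_y, val_y = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Try a few candidate depths; keep the one that scores best on validation data.
best_depth, best_score = None, -np.inf
for depth in [2, 4, 8, None]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(train_x, train_y)
    s = model.score(val_x, val_y)  # R^2 on the held-out validation set
    if s > best_score:
        best_depth, best_score = depth, s

print(best_depth, best_score)
```

Because best_depth was chosen by peeking at validation performance, a separate test set is what gives the unbiased final estimate.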
2) Calling model.predict(val_x) will return the predicted y values based on the given x values. You can then use some loss function to compare those predicted values with val_y to evaluate the model's performance on your validation set.
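For example, mean absolute error is one such loss function; the arrays below are made-up stand-ins for the actual validation targets and predictions:

```python
from sklearn.metrics import mean_absolute_error
import numpy as np

val_y = np.array([3.0, 5.0, 2.0])            # actual targets
val_predictions = np.array([2.5, 5.0, 3.0])  # what model.predict(val_x) might return

# Average absolute gap between predicted and actual y: (0.5 + 0 + 1) / 3
print(mean_absolute_error(val_y, val_predictions))  # 0.5
```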