
model.fit vs model.predict - differences & usage in sklearn

I am new to ML with Python and was making my first attempt by following a tutorial. In that tutorial there are some lines of code, and I am having difficulty understanding how they interact with each other.

The first piece of code splits the data as follows:

from sklearn.model_selection import train_test_split

train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3)

My first question: Why do we use validation data over test data? Why not all three: train, val, and test? What is the use case for each combination?

The next section specifies the ML model and predicts.

from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model.fit(train_x, train_y)
val_predictions = model.predict(val_x)

My second question: For the model.predict() statement, why do we put val_x in there? Don't we want to predict val_y?

Bonus Question: Also, in many tutorials I saw how StandardScalers are applied. However, in this tutorial it doesn't appear; did some other function already scale the data without my having to state it explicitly? Please help.


2 Answers

Q 1.1 : Why do we use validation data over test data? (in the above scenario)

train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.3)

First of all, the terms validation set and test set are used very loosely, and sometimes interchangeably, in many tutorials. It is quite possible to call the above val_x, val_y as test_x, test_y instead.

Q 1.2 : Why not all three: train, val, and test? (why the split?)

All our machine learning algorithms are eventually going to be used on some real-world data (think of this as the actual test data). However, after devising an algorithm, we want to "test" how well it performs, what its accuracy is, etc.

But right now we don't actually have the real-world data! Right?

But what do we have? The training data! So we cleverly set aside a portion of it (the split) to test the algorithm later. This held-out data is used to evaluate the performance once the model is ready.

model = DecisionTreeRegressor()
model.fit(train_x, train_y)
val_predictions = model.predict(val_x) # contains y values predicted by the model
score = model.score(val_x, val_y) # evaluates predictions against the actual val_y (R^2 for a regressor)
print(score)

Q 2. : For the model.predict() statement, why do we put val_x in there? Don't we want to predict val_y?

Absolutely right, we want to predict val_y, but the model needs val_x in order to predict y. That is exactly what we pass as the argument to the predict function.

I understand it might be confusing to read model.predict(val_x).

So the better way to interpret it is: model, could you please predict from val_x, and return predicted_y.

I say predicted_y and not val_y because the two won't be exactly the same. How much do they differ? That is what the score tells you.
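To make that concrete, here is a minimal sketch, assuming the model and split from above, that measures the gap with mean_absolute_error alongside score:

from sklearn.metrics import mean_absolute_error

predicted_y = model.predict(val_x) # the model's guesses for val_y
mae = mean_absolute_error(val_y, predicted_y) # average absolute gap between actual and predicted
print(mae) # 0.0 would mean a perfect match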

Some Terminologies

  • Data Set : The data in hand. It is this data that gets divided later.
  • Train Set : The part of the Data Set from which our model learns. Usually large, about 70-80%. Commonly denoted by train_x and train_y.
  • Test Set : The part of the Data Set that we set aside to evaluate the performance of the model. It "tests" the model, hence the name. Denoted by test_x and test_y.
  • Validation Set : If we want unbiased estimates of accuracy during the learning process, we use yet another split of the Data Set, usually to find hyperparameters. Typically to:
    • Pick the best-performing algorithm (NB vs DT vs ..)
    • Fine-tune parameters (tree depth, k in kNN, C in SVM)

Q 1.3 : What is the use case for each combination?

You will always have train & test, or all three. In your case, however, the test set is simply named val. If you want all three, see the sketch below.
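A common pattern for getting all three is to call train_test_split twice. A minimal sketch, assuming the X and y from the question; the 0.3/0.5 fractions are illustrative and give roughly a 70/15/15 split:

from sklearn.model_selection import train_test_split

# First set aside 30% of the data, then split that portion into validation and test halves.
train_x, rest_x, train_y, rest_y = train_test_split(X, y, test_size=0.3)
val_x, test_x, val_y, test_y = train_test_split(rest_x, rest_y, test_size=0.5)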

BONUS Question : In many tutorials I saw how StandardScalers are applied. However, in this tutorial it doesn't appear; did some other function already scale the data without explicitly stating it?

It all depends on your data. If the data is pre-processed and properly scaled already, then a StandardScaler need not be applied; this particular tutorial simply implies that the data is already normalised. Also, tree-based models such as DecisionTreeRegressor split on one feature at a time, so they are largely insensitive to feature scaling anyway.
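For reference, if you were using a scale-sensitive model (kNN, SVM, regularised linear models), a minimal sketch of applying StandardScaler would look like this, fitting on the training data only and reusing those statistics on the validation data:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_x_scaled = scaler.fit_transform(train_x) # learn mean/std from training data only
val_x_scaled = scaler.transform(val_x) # apply the same statistics to the validation data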



1) Validation sets are often used to help you tune hyper-parameters. Because you may fine-tune the model according to its performance on the validation set, the model may become slightly biased toward the validation data even though it isn't directly trained on it, which is why we keep this set separate from the test set. Once you have tuned the model to your liking based on the validation set, you can evaluate it on your test set to see how well it generalizes.
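To illustrate, here is a minimal tuning sketch; the max_depth candidates are arbitrary assumptions, and it reuses the train/val variables from the question:

from sklearn.tree import DecisionTreeRegressor

best_depth, best_score = None, float("-inf")
for depth in [2, 5, 10, 20]: # candidate hyper-parameter values to try
    model = DecisionTreeRegressor(max_depth=depth)
    model.fit(train_x, train_y)
    score = model.score(val_x, val_y) # R^2 on the validation set
    if score > best_score:
        best_depth, best_score = depth, score
print(best_depth) # evaluate the final model on the held-out test set afterwards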

2) Calling model.predict(val_x) will return the predicted y values based on the given x values. You can then use some loss function to compare those predicted values with val_y to evaluate the model's performance on your validation set.
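For example, a minimal sketch using mean squared error as the loss (any regression metric would do):

from sklearn.metrics import mean_squared_error

val_predictions = model.predict(val_x) # predicted y values for the validation inputs
mse = mean_squared_error(val_y, val_predictions) # loss between actual and predicted values
print(mse)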
