
Do I need to scale the test data and the dependent variable in the training data?

I am new to the concept of feature scaling in machine learning. I read that scaling is useful when one feature's range is much larger than the ranges of the other features. But if I choose to scale the training data:

  1. Can I scale just the one feature that has a high range?
  2. If I scale the entire X of the training data, do I also need to scale the y of the training data and the entire test data?
learncode asked Sep 16 '17 19:09



1 Answer

  1. Yes, you can scale only the feature that has a high range, but make sure that no other feature also has a high range. If one exists and is left unscaled, it will dominate: the algorithm will overlook the contributions of the scaled features, and even a slight change in the unscaled feature will affect the result (output value). Scaling all the features in the training set is recommended, but not compulsory.
  2. You do not need to scale the Y of the training data, because the algorithm or model will set its parameter values to get the least cost (error), that is, k{Y(output) - Y(original)}, anyway. But if Xtrain was scaled, then the test set feature values (Xtest) need to be scaled too, using the training mean and variance, before being fed to the model. (Scale Ytest only if Ytrain was scaled.) The model has not seen this data before and was trained on data in the scaled range, so if a test feature value diverges considerably from the corresponding feature range in the training data, the model will output a wrong prediction for that test example.
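
As a minimal sketch of the point about reusing the training mean and variance, the snippet below standardizes a small made-up dataset in plain Python. The helper names `fit_scaler` and `transform` and the data values are assumptions chosen for illustration, not part of the original answer; scikit-learn's `StandardScaler` follows the same fit-on-train, transform-on-test pattern.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Learn the per-feature mean and standard deviation from the given rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant feature has zero standard deviation; fall back to 1.0
    # so the division in transform() stays defined.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Standardize rows with statistics that were learned beforehand."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical data: column 0 has a small range, column 1 a large one.
X_train = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]]
X_test = [[2.0, 4000.0]]

# Fit on the training set only.
means, stds = fit_scaler(X_train)

X_train_scaled = transform(X_train, means, stds)
# Reuse the training statistics for the test set; do not refit on test data.
X_test_scaled = transform(X_test, means, stds)

print(X_train_scaled)
print(X_test_scaled)
```

Both features end up on a comparable scale, and the test row is expressed relative to the training distribution, which is the range the model was trained on. The target values are left untouched here, in line with the answer above.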
im_w0lf answered Nov 14 '22 21:11
