
Do I need to normalize (or scale) data for randomForest (R package)? [closed]

I am doing a regression task. Do I need to normalize (or scale) the data for randomForest (R package)? Is it also necessary to scale the target values? If so, I wanted to use the scale function from the caret package, but I could not find how to transform the data back (descale/denormalize). Do you know of another function (in any package) that helps with normalization/denormalization? Thanks, Milan

asked Jan 22 '12 by gutompf



6 Answers

No, scaling is not necessary for random forests.

  • The nature of RF is such that convergence and numerical precision issues, which can sometimes trip up the algorithms used in logistic and linear regression as well as neural networks, aren't so important. Because of this, you don't need to transform variables to a common scale as you might with a NN.

  • You don't get any analogue of a regression coefficient, which measures the relationship between each predictor variable and the response. Because of this, you also don't need to consider how to interpret such coefficients, which is something that is affected by variable measurement scales.
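A quick way to convince yourself of this is a sketch like the following (it assumes the randomForest package is installed; the seed is fixed before each fit so both calls grow the same trees, since tree splits depend only on the ordering of each predictor, which scaling preserves):

```r
library(randomForest)

set.seed(42)
d <- data.frame(x1 = runif(200, 0, 10),
                x2 = runif(200, 0, 1e6))
d$y <- 2 * d$x1 + d$x2 / 1e5 + rnorm(200)

# Fit on the raw predictors
set.seed(1)
fit_raw <- randomForest(y ~ x1 + x2, data = d)

# Fit on standardized predictors (same seed, so the same bootstrap
# samples and candidate variables are used at every split)
d_scaled <- d
d_scaled[, c("x1", "x2")] <- scale(d_scaled[, c("x1", "x2")])
set.seed(1)
fit_scaled <- randomForest(y ~ x1 + x2, data = d_scaled)

# The fitted values should agree up to floating-point noise
all.equal(predict(fit_raw), predict(fit_scaled))
```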

answered Oct 05 '22 by Hong Ooi


Scaling is done to normalize data so that no single feature is given undue priority. Scaling mostly matters for algorithms that are distance-based and rely on Euclidean distance.

Random Forest is a tree-based model and hence does not require feature scaling.

This algorithm partitions the data; even if you apply normalization, the result will be the same.

answered Oct 05 '22 by Shaurya Uppal


I do not see anything in either the help page or the vignette that suggests scaling is necessary for a regression in randomForest. This example at Stats Exchange does not use scaling either.

Copy of my comment: The scale function does not belong to pkg:caret. It is part of the "base" R package. There is an unscale function in packages grt and DMwR that will reverse the transformation, or you could simply multiply by the scale attribute and then add the center attribute values.
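To illustrate reversing scale() by hand (base R only, no extra packages): scale() records the centering and scaling values it used as attributes on its result, so undoing it is just a multiply and an add.

```r
x <- c(10, 20, 30, 40, 50)
xs <- scale(x)          # subtracts the mean, divides by the standard deviation

# scale() stores what it did as attributes on the result:
ctr <- attr(xs, "scaled:center")   # mean of x
scl <- attr(xs, "scaled:scale")    # sd of x

# Undo the transformation: multiply by the scale, then add the center
x_back <- xs * scl + ctr
all.equal(as.numeric(x_back), x)   # TRUE
```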

Your conception of why "normalization" needs to be done may require critical examination. The test of non-normality is only needed after the regressions are done and may not be needed at all if there are no assumptions of normality in the goodness of fit methodology. So: Why are you asking? Searching in SO and Stats.Exchange might prove useful: citation #1 ; citation #2 ; citation #3

The boxcox function is a commonly used transformation when one does not have prior knowledge of what a distribution "should" be and when you really need to do a transformation. There are many pitfalls in applying transformations, so the fact that you need to ask the question raises concerns that you may be in need of further consultation or self-study.

answered Oct 05 '22 by IRTFM


Guess what will happen in the following example? Imagine you have 20 predictive features, 18 of them in the [0; 10] range and the other 2 in the [0; 1,000,000] range (taken from a real-life example). Question 1: what feature importances will Random Forest assign? Question 2: what will happen to the feature importances after scaling the 2 large-range features?

Scaling is important. It is just that Random Forest is less sensitive to scaling than other algorithms and can work with "roughly"-scaled features.
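You can run this experiment yourself. A sketch (randomForest assumed installed; the data here is synthetic, not the real-life example the answer refers to):

```r
library(randomForest)

set.seed(7)
n <- 500
X <- data.frame(replicate(18, runif(n, 0, 10)))  # 18 small-range features
X$big1 <- runif(n, 0, 1e6)                        # 2 large-range features
X$big2 <- runif(n, 0, 1e6)
y <- rowSums(X[, 1:18]) + X$big1 / 1e5 + rnorm(n)

set.seed(1)
imp_raw <- importance(randomForest(X, y))

Xs <- as.data.frame(scale(X))   # put all 20 features on one scale
set.seed(1)
imp_scaled <- importance(randomForest(Xs, y))

# Compare the two importance columns side by side
cbind(raw = imp_raw, scaled = imp_scaled)
```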

answered Oct 05 '22 by Danylo Zherebetskyy


If you are going to add interactions to the dataset, that is, new variables that are some function of other variables (usually a simple product), and you don't have a feel for what that new variable stands for (can't interpret it), then you should calculate this variable using the scaled variables.
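For example (a sketch in base R; the variable names are made up for illustration):

```r
d <- data.frame(income = c(20000, 55000, 87000),
                age    = c(25, 40, 62))

# Interaction built from raw variables: numerically dominated by the
# larger-scale variable (income)
d$inter_raw <- d$income * d$age

# Interaction built from scaled variables: both contribute comparably
d$inter_scaled <- as.numeric(scale(d$income)) * as.numeric(scale(d$age))
```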

answered Oct 05 '22 by Qbik


Random Forest inherently uses information gain / the Gini coefficient, which is not affected by scaling, unlike many other machine learning models (such as k-means clustering, PCA, etc.). However, scaling might "arguably" speed things up, as hinted in other answers.

answered Oct 05 '22 by Vaibhav