I have a dataset where I find that the dependent (target) variable has a skewed distribution - i.e. there are a few very large values and a long tail.
When I run the regression tree, one end-node is created for the large-valued observations and one end-node is created for the majority of the other observations.
Would it be ok to log transform the dependent (target) variable and use it for regression tree analysis? When I tried this, I got a different set of nodes and splits that seem to have a more even distribution of observations in each bucket. With log transformation, the R-squared value for Predicted vs. Observed is also quite good. In other words, I seem to get better testing and validation performance with log transformation. I just want to make sure log transformation is an accepted way to run a regression tree when the dependent variable has a skewed distribution.
Thanks !
When both the dependent (response) variable and the independent (predictor) variables are log-transformed, interpret each coefficient as the approximate percent change in the dependent variable for every 1% increase in the independent variable. Example: if the coefficient is 0.198, a 1% increase in the predictor corresponds to roughly a 0.198% increase in the response.
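The log-log interpretation above can be checked numerically. The following is a minimal sketch using only the standard library; the data are made up (generated from y = 3 * x^0.198 with no noise), and `ols_slope` is a helper written for this illustration, not part of any library.

```python
import math

def ols_slope(xs, ys):
    """Slope b of the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical, noise-free data following y = 3 * x**0.198
x = [1.0, 2.0, 5.0, 10.0, 50.0, 100.0]
y = [3.0 * xi ** 0.198 for xi in x]

# Regress log(y) on log(x); the slope is the log-log coefficient
b = ols_slope([math.log(xi) for xi in x], [math.log(yi) for yi in y])

# Exact percent change in y when x increases by 1%
pct_change = (1.01 ** b - 1) * 100

print(f"coefficient: {b:.3f}")
print(f"percent change in y for a 1% increase in x: {pct_change:.4f}")
```

The exact percent change comes out slightly below the coefficient itself, which is why the "1% in x gives b% in y" reading is an approximation that works best for small changes.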
Log transformation of the target variable compresses the distance between the very large values and the rest of the data, which can result in a better model. A further benefit is that logarithms apply only to positive numbers, so any model that estimates the logarithm of the target necessarily produces positive values once its predictions are transformed back.
Using the logarithm of one or more variables can improve the fit of a model by making a right-skewed distribution closer to a symmetric, bell-shaped curve.
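Such transformations of the predictor variables are not required with decision trees, because a monotonic transformation preserves the ordering of a predictor's values and therefore leaves the tree structure unchanged. This does not carry over to the target variable: transforming the target changes the quantity the split criterion minimizes, so the splits can change, as observed in the question. Another feature that saves data preparation time: in tree implementations that handle missing values (for example, through surrogate splits), missing values in training data will not impede partitioning the data for building trees.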
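To see how a transformation affects tree splits, here is a small sketch of a single regression split (a "stump") that picks the threshold minimizing the sum of squared errors. It is a toy written for this illustration, not the algorithm of any particular library, and the data are invented. It compares the chosen partition when the predictor is log-transformed and when the target is log-transformed.

```python
import math

def sse(values):
    """Sum of squared deviations from the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Row indices sent to the left child by the SSE-minimizing threshold on x."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_cost, best_left = float("inf"), None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold can separate equal x values
        left, right = order[:k], order[k:]
        cost = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
        if cost < best_cost:
            best_cost, best_left = cost, left
    return best_left

# Hypothetical data: a skewed target with one very large value
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.0, 1.0, 1.0, 100.0, 100.0, 100.0, 100.0, 10000.0]

raw_left = best_split(x, y)
log_x_left = best_split([math.log(v) for v in x], y)  # transform the predictor
log_y_left = best_split(x, [math.log(v) for v in y])  # transform the target

print("raw:          ", len(raw_left), "rows go left")
print("log predictor:", len(log_x_left), "rows go left")
print("log target:   ", len(log_y_left), "rows go left")
```

With these numbers, logging the predictor leaves the partition untouched, while logging the target moves the split from "everything except the outlier" to a more balanced cut, which matches the behavior described in the question.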
Yes. It is completely fine to apply a log transformation to the target variable when it has a skewed distribution. That being said, you need to apply the inverse function (the exponential) to the predicted values to get predictions on the original scale of the target.
Moreover, you have already checked that the transformation gives you a better R-squared. I am assuming you computed R-squared after inverting the log with the exponential function.
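The point about inverting the log before scoring can be illustrated with a short sketch. The observed values and the log-scale predictions below are made up, standing in for the output of a tree trained on log(y); `r_squared` is a helper defined here for the example.

```python
import math

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Hypothetical observed targets on the original scale
observed = [10.0, 20.0, 50.0, 1000.0]

# Hypothetical predictions from a model trained on log(y)
log_predictions = [math.log(12.0), math.log(18.0), math.log(55.0), math.log(600.0)]

# R-squared computed on the log scale (what the model was trained on)
r2_log_scale = r_squared([math.log(o) for o in observed], log_predictions)

# Back-transform with exp, then compute R-squared on the original scale
predictions = [math.exp(p) for p in log_predictions]
r2_original_scale = r_squared(observed, predictions)

print(f"R-squared on log scale:      {r2_log_scale:.3f}")
print(f"R-squared on original scale: {r2_original_scale:.3f}")
```

In this made-up case the log-scale figure looks noticeably better than the original-scale one, because a large miss on the biggest observation is shrunk by the log. When comparing a log-target model against an untransformed one, both should be scored on the same scale.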
For more details, please refer to the Wikipedia article on data transformation.
Note that if your training data contains any zero or negative target values, log transformation cannot be applied directly. You might have to apply some other function that can accept such values.
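As one possible option for targets that include zeros or negative values, a sign-preserving variant of log(1 + |y|) can be used. This is a sketch of one choice among several, not a recommendation from the answer above; the function names `signed_log1p` and `inverse_signed_log1p` are made up for this example.

```python
import math

def signed_log1p(y):
    """Sign-preserving log transform: sign(y) * log(1 + |y|). Defined for all real y."""
    return math.copysign(math.log1p(abs(y)), y)

def inverse_signed_log1p(z):
    """Inverse of signed_log1p: sign(z) * (exp(|z|) - 1)."""
    return math.copysign(math.expm1(abs(z)), z)

# Hypothetical targets including negative values and zero
targets = [-1000.0, -1.0, 0.0, 0.5, 2500.0]

transformed = [signed_log1p(t) for t in targets]
recovered = [inverse_signed_log1p(z) for z in transformed]

for t, z, r in zip(targets, transformed, recovered):
    print(f"{t:>8} -> {z:>8.4f} -> {r:>8.4f}")
```

If the targets are non-negative but include zeros, plain `math.log1p` with `math.expm1` as the inverse is enough. Either way, predictions have to be passed through the matching inverse before they are reported or scored.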