 

Log transform dependent variable for regression tree

I have a dataset where I find that the dependent (target) variable has a skewed distribution - i.e. there are a few very large values and a long tail.

When I run the regression tree, one end node is created for the large-valued observations and one end node is created for the majority of the other observations.

Would it be OK to log-transform the dependent (target) variable and use it for regression tree analysis? When I tried this, I got a different set of nodes and splits, with a more even distribution of observations in each bucket. With the log transformation, the R-squared value for predicted vs. observed is also quite good. In other words, I seem to get better testing and validation performance with the log transformation. I just want to make sure log transformation is an accepted way to run a regression tree when the dependent variable has a skewed distribution.

Thanks!

airjordan707 asked Jan 30 '15 16:01





1 Answer

Yes. It is completely fine to apply a log transformation to the target variable when it has a skewed distribution. That said, you need to apply the inverse function (the exponential) to the predicted values to get predictions on the original scale of the target.

Moreover, you have already checked that the transformation gives you a better R-squared. I am assuming you computed R-squared after inverting the log with the exponential function, so that it is measured on the original scale of the target.
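To make the point about where R-squared is measured concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the data are made up, the helper names (`fit_stump`, `predict_stump`, `r_squared`) are hypothetical, and a depth-1 tree (a single split) stands in for a full regression tree.

```python
import math

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def fit_stump(x, y):
    """Depth-1 regression tree: choose the threshold on x that minimizes squared error."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal x values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left leaf value, right leaf value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return [left_value if xi <= threshold else right_value for xi in x]

# Made-up data with a skewed, strictly positive target (illustration only).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 2.0, 3.0, 5.0, 50.0, 80.0, 120.0, 1000.0]
log_y = [math.log(v) for v in y]

raw_stump = fit_stump(x, y)      # fitted on the raw target
log_stump = fit_stump(x, log_y)  # fitted on the log target

# Predictions of the log model, first on the log scale, then back-transformed.
log_scale_pred = predict_stump(log_stump, x)
original_scale_pred = [math.exp(p) for p in log_scale_pred]

r2_log_scale = r_squared(log_y, log_scale_pred)
r2_original_scale = r_squared(y, original_scale_pred)

print("split on raw target:", raw_stump[0])
print("split on log target:", log_stump[0])
print("R^2 on log scale:", r2_log_scale)
print("R^2 on original scale:", r2_original_scale)
```

On this toy data the raw-target split isolates the single largest observation, while the log-target split divides the data evenly, which mirrors what the question describes. The two R-squared values also differ a lot, so it is worth stating which scale a reported figure refers to. One reason for the gap: exponentiating a leaf's mean log value gives the geometric mean of that leaf, which is never larger than its arithmetic mean, so back-transformed predictions tend to run low in leaves that contain very large values.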

For more details, please refer to the wiki article on data transformation.

Note that if your training data contains any zero or negative target values, the log transformation cannot be applied directly. You might have to use some other function that accepts such values.
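As a hedged illustration of what such alternatives could look like, the sketch below uses two functions from Python's `math` module. These specific choices are my assumption, not something the answer names: `log1p` (with inverse `expm1`) handles targets that are non-negative but include zeros, and the inverse hyperbolic sine `asinh` (with inverse `sinh`) is defined for every real number, including negatives. The sample values are made up.

```python
import math

# Targets that are non-negative but include zero: log(1 + y) is defined at y = 0.
non_negative_targets = [0.0, 0.5, 12.0, 4000.0]
log1p_transformed = [math.log1p(v) for v in non_negative_targets]
log1p_recovered = [math.expm1(z) for z in log1p_transformed]

# Targets that include negative values: asinh accepts any real number and
# grows roughly like a logarithm for large magnitudes, keeping the sign.
signed_targets = [-250.0, -3.0, 0.0, 0.5, 12.0, 4000.0]
asinh_transformed = [math.asinh(v) for v in signed_targets]
asinh_recovered = [math.sinh(z) for z in asinh_transformed]

for original, z, back in zip(signed_targets, asinh_transformed, asinh_recovered):
    print(original, z, back)
```

Whichever transform is used for training, the matching inverse has to be applied to the tree's predictions before comparing them with the observed values.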

Sandeep answered Oct 22 '22 06:10
