 

Selecting the best combination of variables for a regression model based on regression score


Hello old faithful community,

This might be a tough one, as I can barely find any material on this.

The problem: I have a data set of crimes committed in NSW, Australia, broken down by council, and have merged this with average house prices by council. I'm now looking to fit a linear regression to predict house price from the crime in the neighbourhood. The issue is that I have 49 crime variables, and I only want the best ones (statistically speaking) in my model.

I've run the regression score over all variables and over subsets (chosen using correlation), and got results from 0.23 to 0.38, but I want to push this as high as possible - if there is a way to do so, of course.

I've thought about looping over every possible combination, but that would end up being a couple of million combinations, according to Google.

So, my friends - how can I python this dataframe to get the best columns?

asked Jan 03 '18 by Jake Bourne

2 Answers

If I might add, you may want to take a look at the Python package mlxtend (http://rasbt.github.io/mlxtend).

It features several forward/backward stepwise regression algorithms, while still using the regressors/selectors of sklearn.
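For instance, a minimal sketch of mlxtend's SequentialFeatureSelector doing forward stepwise selection (the synthetic data here is just a hypothetical stand-in for the crime/house-price dataframe):

```python
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data: 500 councils, 49 candidate crime features.
X, y = make_regression(n_samples=500, n_features=49, n_informative=8,
                       noise=10.0, random_state=0)

# Forward stepwise selection: greedily add the feature that most improves
# cross-validated R^2, stopping at 10 features (k_features='best' also
# works and lets mlxtend pick the subset size).
sfs = SFS(LinearRegression(),
          k_features=10,
          forward=True,
          floating=False,
          scoring='r2',
          cv=5)
sfs = sfs.fit(X, y)

print("Selected column indices:", sfs.k_feature_idx_)
print("Cross-validated R^2:    ", round(sfs.k_score_, 3))
```

If you pass a pandas DataFrame instead of an array, sfs.k_feature_names_ gives you the selected column names directly.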

answered by 1313e


There is no gold standard for solving this problem, and you are right: evaluating every combination is computationally infeasible most of the time, especially with 49 variables. One method is to implement a forward or backward selection, adding/removing variables based on a user-specified p-value criterion (this is the statistically relevant criterion you mention). For Python implementations using statsmodels, check out these links (a minimal sketch follows them):

  • https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn/24447#24447
  • http://planspace.org/20150423-forward_selection_with_statsmodels/
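
To make the idea concrete, here is a minimal sketch of p-value-based forward selection with statsmodels, in the spirit of the links above (forward_select and the alpha threshold are my own naming; it assumes X is a DataFrame of the 49 crime columns and y is the house prices):

```python
import statsmodels.api as sm

def forward_select(X, y, alpha=0.05):
    """Greedy forward selection: at each step, add the candidate column whose
    coefficient has the smallest p-value; stop when no remaining candidate
    is significant at the `alpha` level."""
    selected, remaining = [], list(X.columns)
    while remaining:
        # p-value of each candidate when added to the current model
        pvals = {}
        for col in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvals[col] = model.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break  # nothing left that clears the significance threshold
        selected.append(best)
        remaining.remove(best)
    return selected

# chosen = forward_select(crimes_df, prices)  # hypothetical dataframes
```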

Other approaches that are less 'statistically valid' are to define a model evaluation metric (e.g., R², mean squared error) and use a variable selection approach such as LASSO, random forest, or a genetic algorithm to identify the set of variables that optimizes the metric of choice. I find that in practice, ensembling these techniques in a voting-type scheme works best, as different techniques work better for certain types of data. Check out the links below from sklearn for some options that you can code up pretty quickly with your data (a sketch combining two of them follows the list):

  • Overview of techniques: http://scikit-learn.org/stable/modules/feature_selection.html
  • A stepwise procedure: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
  • Select best features based on model: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html
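
As a quick sketch of the voting idea using two of the sklearn tools above, RFE and SelectFromModel with LASSO (the synthetic data is a hypothetical stand-in for the real dataframe):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression

# Hypothetical stand-in data: 500 councils, 49 candidate crime features.
X, y = make_regression(n_samples=500, n_features=49, n_informative=8,
                       noise=10.0, random_state=0)

# RFE: recursively drop the weakest coefficient until 10 features remain.
rfe = RFE(LinearRegression(), n_features_to_select=10).fit(X, y)

# SelectFromModel + LASSO: keep features whose coefficients survive L1 shrinkage.
lasso = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X, y)

# Voting: keep only the features both selectors agree on.
agree = rfe.get_support() & lasso.get_support()
print("RFE picks:  ", np.where(rfe.get_support())[0])
print("LASSO picks:", np.where(lasso.get_support())[0])
print("Both agree: ", np.where(agree)[0])
```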

If you are up for it, I would try a few techniques and see if the answers converge to the same set of features -- this will give you some insight into the relationships between your variables.

answered by rmilletich