
Scikit-learn (Python): what does f_regression() compute?

I'm trying to understand what f_regression() in the feature selection package does. (http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression)

According to the documentation, the first step in f_regression is as follows:

"1. the regressor of interest and the data are orthogonalized wrt constant regressors."

What does this line mean, exactly? What are these constant regressors?

Thanks!

asked Jul 18 '14 17:07 by monkeybiz7




1 Answer

It means that the mean is subtracted from both variables: the feature (the regressor of interest) and the target.

A constant regressor is a vector full of ones. What this vector can explain in your data is then subtracted out. This leads to a vector with zero sum, i.e. a centered variable.
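To make this concrete, here is a minimal NumPy sketch (the data and variable names are made up for illustration, not taken from scikit-learn's source). It projects a variable onto the vector of ones, subtracts that projection, and compares the result with plain mean subtraction:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=20)  # arbitrary illustrative data

ones = np.ones_like(x)  # the constant regressor

# Least-squares projection of x onto the ones vector: (x.1 / 1.1) * 1
projection = (x @ ones) / (ones @ ones) * ones

# Remove what the constant regressor explains...
x_orth = x - projection

# ...which is the same thing as subtracting the mean.
x_centered = x - x.mean()

print(np.allclose(x_orth, x_centered))
print(np.isclose(x_orth @ ones, 0.0))  # orthogonal to the constant regressor
```

Both checks should come out true: the coefficient of the projection is just the mean of `x`, so "orthogonalizing with respect to constant regressors" and "centering" are the same operation here.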

What f_regression essentially calculates is a correlation: a scalar product between centered and appropriately rescaled variables.
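As a rough illustration of that scalar product (again with made-up data and names), center both variables, rescale each to unit length, and take the dot product. The result should match the Pearson correlation reported by `np.corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)  # y is loosely related to x

# Center both variables.
xc = x - x.mean()
yc = y - y.mean()

# Rescale to unit norm, then take the scalar product.
r = (xc / np.linalg.norm(xc)) @ (yc / np.linalg.norm(yc))

# Reference value from NumPy's Pearson correlation.
r_numpy = np.corrcoef(x, y)[0, 1]

print(np.isclose(r, r_numpy))
```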

The resulting score is a function of this value and the degrees of freedom, which depend on the dimensionality of the vectors (the number of samples). The higher the score, the more likely it is that the variables are associated.
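The following sketch spells out that function under the assumption that the score is the usual F statistic for a single regressor, F = r² / (1 − r²) × (n − 2), with n − 2 degrees of freedom after centering. The data is synthetic, and the manual computation is my reconstruction, compared against `f_regression` rather than copied from its source:

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + rng.normal(size=n)  # only feature 0 drives y

# Correlation of each (centered) feature with the (centered) target.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Assumed form of the score: F statistic with (1, n - 2) degrees of freedom.
dof = n - 2
F_manual = r**2 / (1.0 - r**2) * dof
p_manual = stats.f.sf(F_manual, 1, dof)

# Compare with scikit-learn's result.
F_sklearn, p_sklearn = f_regression(X, y)

print(np.allclose(F_manual, F_sklearn))
print(np.allclose(p_manual, p_sklearn))
```

If the assumption holds, both comparisons print `True`, and feature 0 gets by far the largest score because it is the only one that actually influences `y`.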

answered Oct 16 '22 05:10 by eickenberg
