
Target transformation and feature selection in scikit-learn

I am using RFECV for feature selection in scikit-learn. I would like to compare the result of a simple linear model (X, y) with that of a log-transformed model (X, log(y)).

Simple model: RFECV and cross_val_score give the same result. The mean cross-validation score across all folds should match the RFECV score with all features, and it does (0.66 = 0.66), so the results are reliable.

Log model: here is the problem. RFECV does not seem to provide a way to transform y, and the scores in this case are 0.55 (cross_val_score) vs 0.53 (RFECV with all features). This is expected, because I had to apply np.log manually to fit the data: log_selector = log_selector.fit(X, np.log(y)). That r2 score is computed on log(y), with no inverse_func, whereas what we need is to fit the model on log(y_train) and compute the score on the original scale, i.e. compare exp(prediction) with y_test. Alternatively, if I try to use TransformedTargetRegressor, I get the error shown in the code: The classifier does not expose "coef_" or "feature_importances_" attributes.

How do I resolve the problem and make sure that the feature selection process is reliable?

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.compose import TransformedTargetRegressor
import numpy as np

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = linear_model.LinearRegression()
log_estimator = TransformedTargetRegressor(regressor=linear_model.LinearRegression(),
                                                func=np.log,
                                                inverse_func=np.exp)
selector = RFECV(estimator, step=1, cv=5, scoring='r2')
selector = selector.fit(X, y)
###
# log_selector = RFECV(log_estimator, step=1, cv=5, scoring='r2')
# log_selector = log_selector.fit(X, y)
# #RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
###
log_selector = RFECV(estimator, step=1, cv=5, scoring='r2')
log_selector = log_selector.fit(X, np.log(y))

print("**Simple Model**")
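# Note: grid_scores_ was removed in scikit-learn 1.2; on newer versions use cv_results_['mean_test_score'] instead
print("RFECV, r2 scores: ", np.round(selector.grid_scores_,2))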
scores = cross_val_score(estimator, X, y, cv=5)
print("cross_val, mean r2 score: ", round(np.mean(scores),2), ", same as RFECV score with all features") 
print("no of feat: ", selector.n_features_ )

print("**Log Model**")
log_scores = cross_val_score(log_estimator, X, y, cv=5)
print("RFECV, r2 scores: ", np.round(log_selector.grid_scores_,2))
print("cross_val, mean r2 score: ", round(np.mean(log_scores),2)) 
print("no of feat: ", log_selector.n_features_ )

Output:

**Simple Model**
RFECV, r2 scores:  [0.45 0.6  0.63 0.68 0.68 0.69 0.68 0.67 0.66 0.66]
cross_val, mean r2 score:  0.66 , same as RFECV score with all features
no of feat:  6

**Log Model**
RFECV, r2 scores:  [0.39 0.5  0.59 0.56 0.55 0.54 0.53 0.53 0.53 0.53]
cross_val, mean r2 score:  0.55
no of feat:  3
asked Sep 29 '19 13:09 by towi_parallelism



1 Answer

All you need to do is subclass TransformedTargetRegressor and add these properties, so that RFECV can read coef_ or feature_importances_ from the fitted inner regressor:

class MyTransformedTargetRegressor(TransformedTargetRegressor):
    @property
    def feature_importances_(self):
        return self.regressor_.feature_importances_

    @property
    def coef_(self):
        return self.regressor_.coef_

Then use it in your code:

log_estimator = MyTransformedTargetRegressor(regressor=linear_model.LinearRegression(),
                                             func=np.log,
                                             inverse_func=np.exp)
answered Sep 30 '22 03:09 by Computer_guy