I am using `RFECV` for feature selection in scikit-learn. I would like to compare the result of a simple linear model (`X`, `y`) with that of a log-transformed model (`X`, `log(y)`).
**Simple Model:** `RFECV` and `cross_val_score` give the same result (we compare the mean cross-validation score across all folds with the `RFECV` score for all features: 0.66 = 0.66, so no problem, the results are reliable).
**Log Model:** The problem: `RFECV` does not seem to provide a way to transform `y`. The scores in this case are 0.55 vs. 0.53. This is expected, though, because I had to apply `np.log` manually to fit the data: `log_selector = log_selector.fit(X, np.log(y))`. This r2 score is for `y = log(y)`, with no `inverse_func`, while what we need is a way to fit the model on `log(y_train)` and compute the score on `exp(y_test)`. Alternatively, if I try to use `TransformedTargetRegressor`, I get the error shown in the code: The classifier does not expose "coef_" or "feature_importances_" attributes.

How do I resolve the problem and make sure that the feature selection process is reliable?
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn import linear_model
from sklearn.model_selection import cross_val_score
from sklearn.compose import TransformedTargetRegressor
import numpy as np

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

estimator = linear_model.LinearRegression()
log_estimator = TransformedTargetRegressor(regressor=linear_model.LinearRegression(),
                                           func=np.log,
                                           inverse_func=np.exp)

selector = RFECV(estimator, step=1, cv=5, scoring='r2')
selector = selector.fit(X, y)

###
# log_selector = RFECV(log_estimator, step=1, cv=5, scoring='r2')
# log_selector = log_selector.fit(X, y)
# RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
###

log_selector = RFECV(estimator, step=1, cv=5, scoring='r2')
log_selector = log_selector.fit(X, np.log(y))
print("**Simple Model**")
print("RFECV, r2 scores: ", np.round(selector.grid_scores_,2))
scores = cross_val_score(estimator, X, y, cv=5)
print("cross_val, mean r2 score: ", round(np.mean(scores),2), ", same as RFECV score with all features")
print("no of feat: ", selector.n_features_ )
print("**Log Model**")
log_scores = cross_val_score(log_estimator, X, y, cv=5)
print("RFECV, r2 scores: ", np.round(log_selector.grid_scores_,2))
print("cross_val, mean r2 score: ", round(np.mean(log_scores),2))
print("no of feat: ", log_selector.n_features_ )
Output:
**Simple Model**
RFECV, r2 scores: [0.45 0.6 0.63 0.68 0.68 0.69 0.68 0.67 0.66 0.66]
cross_val, mean r2 score: 0.66 , same as RFECV score with all features
no of feat: 6
**Log Model**
RFECV, r2 scores: [0.39 0.5 0.59 0.56 0.55 0.54 0.53 0.53 0.53 0.53]
cross_val, mean r2 score: 0.55
no of feat: 3
All you need to do is add those properties to the `TransformedTargetRegressor`:
class MyTransformedTargetRegressor(TransformedTargetRegressor):
    @property
    def feature_importances_(self):
        return self.regressor_.feature_importances_

    @property
    def coef_(self):
        return self.regressor_.coef_
Then in your code, use that:

log_estimator = MyTransformedTargetRegressor(regressor=linear_model.LinearRegression(),
                                             func=np.log,
                                             inverse_func=np.exp)
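Putting the pieces together, a minimal end-to-end sketch (reusing the question's data and names; the subclass name `MyTransformedTargetRegressor` is from this answer, not scikit-learn) might look like this. Because `TransformedTargetRegressor.predict` applies `inverse_func`, `RFECV` now fits on `log(y)` internally but scores predictions back on the original scale of `y`, which is exactly the comparison the question asks for:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn import linear_model
from sklearn.compose import TransformedTargetRegressor

# Re-expose the fitted inner regressor's attributes so RFECV can rank
# features (RFECV looks for coef_ or feature_importances_ on the estimator).
class MyTransformedTargetRegressor(TransformedTargetRegressor):
    @property
    def feature_importances_(self):
        return self.regressor_.feature_importances_

    @property
    def coef_(self):
        return self.regressor_.coef_

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

log_estimator = MyTransformedTargetRegressor(regressor=linear_model.LinearRegression(),
                                             func=np.log,
                                             inverse_func=np.exp)

# Fit happens on log(y), scoring on the original y scale via inverse_func.
log_selector = RFECV(log_estimator, step=1, cv=5, scoring='r2')
log_selector = log_selector.fit(X, y)
print("no of feat:", log_selector.n_features_)
```

Note that `fit` is now called with the untransformed `y`; the log/exp round trip lives entirely inside the estimator, so the r2 scores reported by `RFECV` are directly comparable with those of the simple model.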