I am using a random forest classifier for feature selection. I have 70 features in all and want to select the most important ones. The code below shows the classifier listing the features from most significant to least significant.
Code:
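import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature names: every column of `data` except the first
feat_labels = data.columns[1:]
clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Train the classifier
clf.fit(X_train, y_train)
# Rank the features from most to least important
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))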
Now I am trying to use SelectFromModel from sklearn.feature_selection, but how can I decide the threshold value for my dataset?
from sklearn.feature_selection import SelectFromModel

# Create a selector object that will use the random forest classifier to identify
# features that have an importance of at least 0.15
sfm = SelectFromModel(clf, threshold=0.15)
# Train the selector
sfm.fit(X_train, y_train)
When I use threshold=0.15 and then try to train my model, I get an error saying the data is too noisy or the selection is too strict. But if I use threshold=0.015, I am able to train my model on the newly selected features. So how can I decide this threshold value?
threshold: str or float, default=None. The threshold value to use for feature selection. Features whose absolute importance value is greater or equal are kept while the others are discarded. If "median" (resp. "mean"), then the threshold value is the median (resp. the mean) of the feature importances.
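To make the string thresholds concrete, here is a minimal sketch. It uses a synthetic 70-feature dataset from make_classification as a stand-in for the real data (an assumption, not the asker's dataset). Because a random forest's importances sum to 1, the "mean" cutoff with 70 features is 1/70, roughly 0.014, which suggests why a fixed cutoff of 0.15 may leave few or no features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the 70-feature dataset (assumption, not the original data)
X, y = make_classification(n_samples=300, n_features=70, n_informative=8, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

results = {}
for threshold in ("mean", "median", "1.25*mean"):
    # SelectFromModel clones and fits the estimator, then resolves the string
    # into a numeric cutoff that is exposed as `threshold_`
    sfm = SelectFromModel(clf, threshold=threshold).fit(X, y)
    n_kept = int(sfm.get_support().sum())
    results[threshold] = (float(sfm.threshold_), n_kept)
    print(f"{threshold!r:>12}: cutoff={sfm.threshold_:.4f}, features kept={n_kept}")
```

Reading `threshold_` after fitting is a quick way to see which numeric scale is reasonable for a given dataset before trying hand-picked values.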
An evaluation function is usually used in feature selection methods to score each feature word, and the feature words whose score is higher than the set threshold are kept as the final feature subset. The threshold is therefore an important factor in feature selection.
The SelectKBest method selects the features with the k highest scores. By changing the 'score_func' parameter, we can apply the method to both classification and regression data. Selecting the best features is an important step when preparing a large dataset for training.
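As a small illustration of SelectKBest, the sketch below keeps the 10 highest-scoring features of a synthetic classification dataset. The dataset, the choice of k=10, and the use of f_classif as score_func are assumptions made for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data (assumption): 300 samples, 70 features
X, y = make_classification(n_samples=300, n_features=70, n_informative=8, random_state=0)

# Score every feature with the ANOVA F-test and keep the 10 best
selector = SelectKBest(score_func=f_classif, k=10)
X_new = selector.fit_transform(X, y)

# Column indices of the features that were kept
selected_idx = selector.get_support(indices=True)
print(X_new.shape, selected_idx)
```

For a regression target, a regression scorer such as f_regression from the same module would take the place of f_classif.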
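I would try the following approach:
1. Start with a low threshold, for example 1e-4.
2. Reduce your features using SelectFromModel (fit & transform).
3. Compute metrics (accuracy, etc.) for your estimator (RandomForestClassifier in your case) on the selected features.
4. Increase the threshold and repeat from step 2.
Using this approach you can estimate the best threshold for your particular data and your estimator.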
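A threshold search of this kind could look roughly like the sketch below. It is one possible way to do it, not the only one: the synthetic data, the candidate thresholds, 3-fold cross-validated accuracy as the metric, and wrapping the selector in a Pipeline (so selection is refit inside each fold) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the 70-feature dataset (assumption)
X, y = make_classification(n_samples=300, n_features=70, n_informative=8, random_state=0)

# Candidate thresholds, from very permissive to stricter (illustrative values)
thresholds = [1e-4, 0.005, 0.01, 0.015, 0.02]

mean_scores = []
for t in thresholds:
    pipe = Pipeline([
        # Step 1: keep features whose importance is at least t
        ("select", SelectFromModel(
            RandomForestClassifier(n_estimators=50, random_state=0), threshold=t)),
        # Step 2: train the final classifier on the selected features only
        ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ])
    scores = cross_val_score(pipe, X, y, cv=3, scoring="accuracy")
    mean_scores.append(scores.mean())
    print(f"threshold={t:<7g} mean CV accuracy={scores.mean():.3f}")

# nanargmax skips thresholds whose fits failed (scored as NaN by cross_val_score)
best_threshold = thresholds[int(np.nanargmax(mean_scores))]
print("best threshold:", best_threshold)
```

On real data, X and y would be replaced by the training set, and the threshold grid can be widened or refined around whichever value scores best.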