
How to decide threshold value in SelectFromModel() for selecting features?

I am using a random forest classifier for feature selection. I have 70 features in total and I want to select the most important ones. The code below fits the classifier and prints the features from most to least important.

Code:

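import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature names (the first column of `data` is skipped)
feat_labels = data.columns[1:]
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Train the classifier
clf.fit(X_train, y_train)

# Rank the features from most to least important
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))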


Now I am trying to use SelectFromModel from sklearn.feature_selection, but how can I decide the threshold value for my dataset?

from sklearn.feature_selection import SelectFromModel

# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.15
sfm = SelectFromModel(clf, threshold=0.15)

# Train the selector
sfm.fit(X_train, y_train)

When I use threshold=0.15 and then try to train my model, I get an error saying that the data is too noisy or the selection is too strict.

But if I use threshold=0.015, I am able to train my model on the newly selected features. So how can I decide this threshold value?

asked Mar 18 '18 07:03 by stone rock




1 Answer

I would try the following approach:

  1. Start with a low threshold, for example 1e-4.
  2. Reduce your features using SelectFromModel (fit and transform).
  3. Compute metrics (accuracy, etc.) for your estimator (RandomForestClassifier in your case) on the selected features.
  4. Increase the threshold and repeat from step 2.
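
The steps above can be sketched roughly as follows. This is only an illustration under assumptions: the synthetic data from make_classification stands in for your 70 features, the helper name sweep_thresholds and the threshold grid are made up for the example, and cross-validated accuracy is just one possible metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Assumption: synthetic data standing in for the real 70-feature dataset.
X, y = make_classification(n_samples=300, n_features=70,
                           n_informative=10, random_state=0)


def make_forest():
    """Fresh, reproducible forest (hypothetical helper for this sketch)."""
    return RandomForestClassifier(n_estimators=50, random_state=0)


def sweep_thresholds(X, y, thresholds, cv=3):
    """Return (threshold, n_features_kept, mean CV accuracy) per threshold."""
    results = []
    for t in thresholds:
        # Step 2: select features whose importance is >= t.
        selector = SelectFromModel(make_forest(), threshold=t)
        n_kept = int(selector.fit(X, y).get_support().sum())
        if n_kept == 0:
            # The threshold is too strict: nothing left to train on.
            break
        # Step 3: score the estimator on the selected features. Putting the
        # selector in a pipeline refits it inside every CV fold.
        pipe = make_pipeline(selector, make_forest())
        score = float(cross_val_score(pipe, X, y, cv=cv).mean())
        results.append((t, n_kept, score))
    return results


# Steps 1 and 4: start low and increase the threshold.
results = sweep_thresholds(X, y, [1e-4, 0.005, 0.01, 0.015, 0.02, 0.5])
for t, n_kept, score in results:
    print("threshold=%-7g kept=%2d  accuracy=%.3f" % (t, n_kept, score))

best = max(results, key=lambda r: r[2])
print("best threshold in this grid:", best[0])
```

I would run the comparison with cross-validation on the training data rather than on the test set, and prefer the smallest feature set whose score is close to the best one.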

Using this approach, you can estimate the best threshold for your particular data and estimator.
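
It may also help to look at the scale of the importances before picking candidate values. For a fitted random forest, feature_importances_ is normalized to sum to 1, so with 70 features the mean importance is 1/70 (about 0.014). A threshold of 0.15 is roughly ten times that, which would explain why it can leave no features, while 0.015 sits close to the mean. Here is a small sketch, again assuming synthetic data as a stand-in for yours, that also shows the string thresholds ("mean", "1.25*mean") and the max_features option that SelectFromModel accepts:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Assumption: synthetic data standing in for the real 70-feature dataset.
X, y = make_classification(n_samples=300, n_features=70,
                           n_informative=10, random_state=0)


def make_forest():
    """Fresh, reproducible forest (hypothetical helper for this sketch)."""
    return RandomForestClassifier(n_estimators=50, random_state=0)


importances = make_forest().fit(X, y).feature_importances_
print("sum of importances:", importances.sum())   # normalized to 1
print("mean importance   :", importances.mean())  # 1 / n_features
print("max importance    :", importances.max())

# Thresholds expressed relative to the importances instead of a fixed number.
sfm_mean = SelectFromModel(make_forest(), threshold="mean").fit(X, y)
sfm_scaled = SelectFromModel(make_forest(), threshold="1.25*mean").fit(X, y)
print("kept with 'mean'     :", sfm_mean.get_support().sum())
print("kept with '1.25*mean':", sfm_scaled.get_support().sum())

# Keep a fixed number of top features: disable the threshold and cap the count.
sfm_top10 = SelectFromModel(make_forest(), threshold=-np.inf,
                            max_features=10).fit(X, y)
print("kept with max_features=10:", sfm_top10.get_support().sum())
```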

answered Oct 12 '22 03:10 by MaxU - stop WAR against UA