Removing features with low variance using scikit-learn

scikit-learn provides various methods to remove descriptors (features); a basic method for this purpose is described in the tutorial below:

http://scikit-learn.org/stable/modules/feature_selection.html

but the tutorial does not provide any way to tell which features were removed and which were kept.

The code below has been taken from the tutorial. The threshold .8 * (1 - .8) = 0.16 is the variance of a boolean feature that takes the same value in 80% of the samples, so such near-constant features get removed.

    from sklearn.feature_selection import VarianceThreshold
    X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
    sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
    sel.fit_transform(X)
    # array([[0, 1],
    #        [1, 0],
    #        [0, 0],
    #        [1, 1],
    #        [1, 0],
    #        [1, 1]])

The example code above involves only a handful of descriptors (the result has shape (6, 2)), but in my case I have a huge DataFrame with shape (51, 9000), i.e. 51 rows and 9000 columns. After finding a suitable model I want to keep track of the useful and useless features, because I can save computational time on the test data set by computing only the useful features.

For example, when you perform machine learning modelling with WEKA 6.0, it provides remarkable flexibility over feature selection, and after removing the useless features you get a list of the discarded features along with the useful ones.

thanks

asked Mar 27 '15 by jax

People also ask

Why do we remove features with low variance?

Features without much variance or variability in the data do not provide any information to an ML model for learning the patterns. For example, a feature with only 5 as a value for every record in a dataset is a constant and is an unimportant feature to be used. Removing this feature is essential.

What is variance threshold in Sklearn?

The variance threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.
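
For a quick illustration of the default behaviour, here is a minimal sketch (the toy data is made up):

    from sklearn.feature_selection import VarianceThreshold

    X = [[5, 0, 1], [5, 1, 0], [5, 1, 1]]   # first column is constant
    VarianceThreshold().fit_transform(X)     # the default threshold of 0 drops it
    # array([[0, 1],
    #        [1, 0],
    #        [1, 1]])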


2 Answers

If I'm not wrong, then what you can do is:

In the case of VarianceThreshold, you can call the method fit instead of fit_transform. This will fit the data, and the resulting variances will be stored in vt.variances_ (assuming vt is your object).
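
A minimal sketch of that, using the tutorial's X (the printed variances are for that data):

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold

    X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
                  [0, 1, 1], [0, 1, 0], [0, 1, 1]])

    vt = VarianceThreshold(threshold=.8 * (1 - .8))
    vt.fit(X)              # fit only, no transformation yet
    print(vt.variances_)   # approximately [0.139 0.222 0.25]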

Having a threshold, you can extract the features of the transformation as fit_transform would do (note that X has to be a NumPy array for this kind of indexing):

    X[:, vt.variances_ > threshold]

Or get the indexes as:

    idx = np.where(vt.variances_ > threshold)[0]

Or as a mask:

    mask = vt.variances_ > threshold

PS: the default threshold is 0.

EDIT:

A more straightforward way to do this is to use the get_support method of the VarianceThreshold class. From the documentation:

    get_support([indices])    Get a mask, or integer index, of the features selected

You should call this method after fit or fit_transform.
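
For example, a minimal sketch reusing the fitted vt from above (the variable names are illustrative):

    mask = vt.get_support()                    # boolean mask of the kept features
    kept_idx = vt.get_support(indices=True)    # integer indices of the kept features
    removed_idx = np.where(~mask)[0]           # indices of the discarded features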

answered Sep 18 '22 by Imanol Luengo

This worked for me. If you want to see exactly which columns remain after thresholding, you can use this method:

    from sklearn.feature_selection import VarianceThreshold

    # `data` is assumed to be a pandas DataFrame
    threshold_n = 0.95
    sel = VarianceThreshold(threshold=(threshold_n * (1 - threshold_n)))
    sel_var = sel.fit_transform(data)
    # keep only the columns that passed the threshold
    data[data.columns[sel.get_support(indices=True)]]
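
To address the original goal of saving computation on the test set: once the selector is fitted on the training data, the same columns can be selected by name. A sketch, assuming test_data is a hypothetical DataFrame with the same columns as data:

    kept_cols = data.columns[sel.get_support(indices=True)]   # useful features
    dropped_cols = data.columns[~sel.get_support()]           # discarded features
    test_selected = test_data[kept_cols]   # compute only the useful features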
answered Sep 17 '22 by Mehran Sahandi Far