Removing features with low variance using scikit-learn

scikit-learn provides various methods to remove descriptors (features); a basic method for this purpose is described in the tutorial below:

http://scikit-learn.org/stable/modules/feature_selection.html

but the tutorial does not provide any way to tell which features were removed and which were kept.

The code below has been taken from the tutorial. The threshold .8 * (1 - .8) = 0.16 is the variance of a boolean feature that takes the same value in 80% of the samples, so such near-constant features get removed.

    from sklearn.feature_selection import VarianceThreshold
    X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
    sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
    sel.fit_transform(X)
    # array([[0, 1],
    #        [1, 0],
    #        [0, 0],
    #        [1, 1],
    #        [1, 0],
    #        [1, 1]])

The example code above involves only a handful of descriptors (the result has shape (6, 2)), but in my case I have a huge DataFrame with shape (51, 9000), i.e. 51 rows and 9000 columns. After finding a suitable model I want to keep track of the useful and useless features, because I can save computational time on the test data set by computing only the useful features.

For example, when you perform machine learning modelling with WEKA 6.0, it provides remarkable flexibility over feature selection, and after removing the useless features you get a list of the discarded features along with the useful ones.

thanks

asked Mar 27 '15 by jax

People also ask

Why do we remove features with low variance?

Features without much variance or variability in the data do not provide any information to an ML model for learning the patterns. For example, a feature with only 5 as a value for every record in a dataset is a constant and is an unimportant feature to be used. Removing this feature is essential.

What is variance threshold in Sklearn?

The variance threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.
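
For a quick illustration of the default behaviour, here is a minimal sketch (the toy data is made up):

    from sklearn.feature_selection import VarianceThreshold

    X = [[5, 0, 1], [5, 1, 0], [5, 1, 1]]   # first column is constant
    VarianceThreshold().fit_transform(X)     # the default threshold of 0 drops it
    # array([[0, 1],
    #        [1, 0],
    #        [1, 1]])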


2 Answers

If I'm not wrong, then what you can do is:

In the case of VarianceThreshold, you can call the method fit instead of fit_transform. This will fit the data, and the resulting variances will be stored in vt.variances_ (assuming vt is your object).
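
A minimal sketch of that, using the tutorial's X (the printed variances are for that data):

    import numpy as np
    from sklearn.feature_selection import VarianceThreshold

    X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0],
                  [0, 1, 1], [0, 1, 0], [0, 1, 1]])

    vt = VarianceThreshold(threshold=.8 * (1 - .8))
    vt.fit(X)              # fit only, no transformation yet
    print(vt.variances_)   # approximately [0.139 0.222 0.25]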

Having a threshold, you can extract the features of the transformation as fit_transform would do (note that X has to be a NumPy array for this kind of indexing):

    X[:, vt.variances_ > threshold]

Or get the indexes as:

    idx = np.where(vt.variances_ > threshold)[0]

Or as a mask:

    mask = vt.variances_ > threshold

PS: the default threshold is 0.

EDIT:

A more straightforward way to do this is to use the get_support method of the VarianceThreshold class. From the documentation:

    get_support([indices])    Get a mask, or integer index, of the features selected

You should call this method after fit or fit_transform.
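
For example, a minimal sketch reusing the fitted vt from above (the variable names are illustrative):

    mask = vt.get_support()                    # boolean mask of the kept features
    kept_idx = vt.get_support(indices=True)    # integer indices of the kept features
    removed_idx = np.where(~mask)[0]           # indices of the discarded features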

answered Sep 18 '22 by Imanol Luengo

This worked for me. If you want to see exactly which columns remain after thresholding, you can use this method:

    from sklearn.feature_selection import VarianceThreshold

    # `data` is assumed to be a pandas DataFrame
    threshold_n = 0.95
    sel = VarianceThreshold(threshold=(threshold_n * (1 - threshold_n)))
    sel_var = sel.fit_transform(data)
    # keep only the columns that passed the threshold
    data[data.columns[sel.get_support(indices=True)]]
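
To address the original goal of saving computation on the test set: once the selector is fitted on the training data, the same columns can be selected by name. A sketch, assuming test_data is a hypothetical DataFrame with the same columns as data:

    kept_cols = data.columns[sel.get_support(indices=True)]   # useful features
    dropped_cols = data.columns[~sel.get_support()]           # discarded features
    test_selected = test_data[kept_cols]   # compute only the useful features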
answered Sep 17 '22 by Mehran Sahandi Far