scikit-learn provides various methods to remove descriptors, a basic method for this purpose has been provided by the given tutorial below,
http://scikit-learn.org/stable/modules/feature_selection.html
but the tutorial does not provide any method or a way that can tell you the way to keep the list of features that either removed or kept.
The code below has been taken from the tutorial.
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
array([[0, 1],
[1, 0],
[0, 0],
[1, 1],
[1, 0],
[1, 1]])
The given example code above depicts only two descriptors "shape(6, 2)", but in my case, I have a huge data frames with a shape of (rows 51, columns 9000). After finding a suitable model I want to keep the track of useful and useless features because I can save computational time during the computation of the features of test data set by calculating only useful features.
For example, when you perform machine learning modeling with WEKA 6.0, it provided with remarkable flexibility over feature selection and after removing the useless feature you can get a list of a discarded features along with the useful features.
thanks
Features without much variance or variability in the data do not provide any information to an ML model for learning the patterns. For example, a feature with only 5 as a value for every record in a dataset is a constant and is an unimportant feature to be used. Removing this feature is essential.
The variance threshold is a simple baseline approach to feature selection. It removes all features which variance doesn't meet some threshold. By default, it removes all zero-variance features, i.e., features that have the same value in all samples.
Then, what you can do, if I'm not wrong is:
In the case of the VarianceThreshold, you can call the method fit
instead of fit_transform
. This will fit data, and the resulting variances will be stored in vt.variances_
(assuming vt
is your object).
Having a threhold, you can extract the features of the transformation as fit_transform
would do:
X[:, vt.variances_ > threshold]
Or get the indexes as:
idx = np.where(vt.variances_ > threshold)[0]
Or as a mask
mask = vt.variances_ > threshold
PS: default threshold is 0
EDIT:
A more straight forward to do, is by using the method get_support
of the class VarianceThreshold
. From the documentation:
get_support([indices]) Get a mask, or integer index, of the features selected
You should call this method after fit
or fit_transform
.
this worked for me if you want to see exactly which columns are remained after thresholding you may use this method:
from sklearn.feature_selection import VarianceThreshold
threshold_n=0.95
sel = VarianceThreshold(threshold=(threshold_n* (1 - threshold_n) ))
sel_var=sel.fit_transform(data)
data[data.columns[sel.get_support(indices=True)]]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With