Retain feature names after Scikit Feature Selection

Tags: python, pandas, output, scikit-learn, feature-selection

Running scikit-learn's VarianceThreshold on my data removes a couple of features, which is expected, but I'd like to retain the names of the features that remain. I feel I'm doing something simple yet stupid. The following code:

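import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def VarianceThreshold_selector(data):
    selector = VarianceThreshold(.5)
    selector.fit(data)
    # transform() returns a NumPy array, so the column names are lost here
    return pd.DataFrame(selector.transform(data))

x = VarianceThreshold_selector(data)
print(x)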

changes the following data (this is just a small subset of the rows):

Survived  Pclass  Sex  Age  SibSp  Parch  Nonsense
       0       3    1   22      1      0         0
       1       1    2   38      1      0         0
       1       3    2   26      0      0         0

into this (again, just a small subset of the rows):

     0         1      2     3
0    3      22.0      1     0
1    1      38.0      1     0
2    3      26.0      0     0

Using the get_support method, I know that these are Pclass, Age, SibSp, and Parch, so I'd rather this return something more like:

   Pclass   Age  SibSp  Parch
0       3  22.0      1      0
1       1  38.0      1      0
2       3  26.0      0      0

Is there an easy way to do this? I'm very new to scikit-learn, so I'm probably just doing something silly.


asked Oct 02 '16 00:10

Zakery Alexander Fyke

1 Answer

Would something like this help? If you pass it a pandas DataFrame, it fits the selector, then calls get_support(indices=True), as you mentioned, to get the positions of the columns whose variance exceeds the threshold. Those positions index into data.columns, so the result keeps the original column headers.

>>> df
   Survived  Pclass  Sex  Age  SibSp  Parch  Nonsense
0         0       3    1   22      1      0         0
1         1       1    2   38      1      0         0
2         1       3    2   26      0      0         0

>>> from sklearn.feature_selection import VarianceThreshold
>>> def variance_threshold_selector(data, threshold=0.5):
...     selector = VarianceThreshold(threshold)
...     selector.fit(data)
...     return data[data.columns[selector.get_support(indices=True)]]
...

>>> variance_threshold_selector(df, 0.5)
   Pclass  Age
0       3   22
1       1   38
2       3   26
>>> variance_threshold_selector(df, 0.9)
   Age
0   22
1   38
2   26
>>> variance_threshold_selector(df, 0.1)
   Survived  Pclass  Sex  Age  SibSp
0         0       3    1   22      1
1         1       1    2   38      1
2         1       3    2   26      0
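
If you would rather build a new DataFrame from the transformed array than slice the original one, the same idea can be sketched with the boolean mask that get_support() returns by default. This is only a minimal sketch: the function name variance_threshold_frame and the three-row sample frame are made up for illustration (the sample mirrors the rows shown above).

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def variance_threshold_frame(data, threshold=0.5):
    """Return a new DataFrame with only the columns whose variance exceeds threshold."""
    selector = VarianceThreshold(threshold=threshold)
    # fit_transform returns a plain NumPy array, so the labels are lost at this point.
    reduced = selector.fit_transform(data)
    # With no arguments, get_support() returns a boolean mask, one entry per input column.
    kept = data.columns[selector.get_support()]
    # Reattach the surviving column names and the original row index.
    return pd.DataFrame(reduced, columns=kept, index=data.index)


# Hypothetical sample data, copied from the three rows shown in the answer.
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": [1, 2, 2],
    "Age": [22, 38, 26],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Nonsense": [0, 0, 0],
})

result = variance_threshold_frame(df, 0.5)
print(list(result.columns))  # ['Pclass', 'Age']
```

Passing index=data.index keeps the row labels aligned with the original frame, which matters if the input does not use the default 0..n-1 index.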


answered Sep 30 '22 10:09

Jarad
