I am trying to apply a univariate feature selection method using the Python module scikit-learn to a regression dataset (i.e. one with continuous-valued responses) in svmlight format.
I am working with scikit-learn version 0.11.
I have tried two approaches - the first of which failed and the second of which worked for my toy dataset but I believe would give meaningless results for a real dataset.
I would like advice regarding an appropriate univariate feature selection approach I could apply to select the top N features for a regression dataset. I would either like (a) to work out how to make the f_regression function work or (b) to hear alternative suggestions.
The two approaches mentioned above:

1. Using sklearn.feature_selection.f_regression as the score function. This failed with the following error message: "TypeError: copy() takes exactly 1 argument (2 given)"

2. Using sklearn.feature_selection.chi2 as the score function. This ran without error on my toy dataset, but chi2 is a test for classification data, so I believe it would give meaningless results for a real regression dataset.
Please find my toy dataset pasted at the end of this message.
The following code snippet should give the results I describe above.
from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest, f_regression, chi2

# Change this to the path of my toy dataset file (pasted at the end of this message).
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file)

# score_func is one of the two functions referred to above (f_regression or chi2).
featureSelector = SelectKBest(score_func=f_regression, k=2)
featureSelector.fit(X_train_data, Y_train_data)

# Print the (1-based) indices of the top 2 features.
print([1 + zero_based_index for zero_based_index in featureSelector.get_support(indices=True)])
Thanks in advance.
Richard
Contents of my contrived svmlight file - with additional blank lines inserted for clarity:

1.8 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA

1.8 1:1.000000 2:1.000000#mB

0.1 5:1.000000#mC

1.8 1:1.000000 2:1.000000#mD

0.1 3:1.000000 4:1.000000#mE

0.1 3:1.000000#mF

1.8 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG

1.8 2:1.000000#mH
In the stepwise regression technique (forward selection), we start by fitting a model with each individual predictor on its own and see which one has the lowest p-value. We pick that variable, then fit two-variable models pairing the already-selected variable with each remaining predictor in turn, again keep the one with the lowest p-value, and continue in this way, as sketched below.
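Here is a minimal sketch of that forward-selection loop, assuming statsmodels for per-coefficient p-values (scikit-learn's linear models do not expose them); the synthetic data, the forward_stepwise name, and the max_features stopping rule are assumptions for illustration.

import numpy as np
import statsmodels.api as sm

def forward_stepwise(X, y, max_features=2):
    """Greedily add the predictor whose coefficient has the lowest p-value."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        best_pval, best_j = None, None
        for j in remaining:
            design = sm.add_constant(X[:, selected + [j]])  # intercept + candidate columns
            pval = sm.OLS(y, design).fit().pvalues[-1]      # p-value of the newly added predictor
            if best_pval is None or pval < best_pval:
                best_pval, best_j = pval, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Toy usage: y depends only on columns 0 and 2, so those should be selected.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.randn(100)
print(forward_stepwise(X, y, max_features=2))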
Linear regression is a good model for testing feature selection methods, since its performance can improve noticeably when irrelevant features are removed, as the quick check below illustrates.
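A quick check of that claim on synthetic data (an assumption for illustration), using the modern sklearn.model_selection API: cross-validated R^2 for linear regression is poor when many irrelevant columns are present and recovers once only the relevant ones are kept.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_relevant = rng.randn(60, 2)    # the two informative features
X_noise = rng.randn(60, 40)      # forty irrelevant features
X_all = np.hstack([X_relevant, X_noise])
y = 4 * X_relevant[:, 0] - 3 * X_relevant[:, 1] + rng.randn(60)

# Default regression scoring is R^2; higher is better.
print(cross_val_score(LinearRegression(), X_all, y, cv=5).mean())       # degraded by the noise features
print(cross_val_score(LinearRegression(), X_relevant, y, cv=5).mean())  # much higher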
Coefficients can be used as feature importance ONLY IF the dataset was standardized before training. For example, if we apply a standard scaler to the raw dataset and then fit the model, we can say that the feature importance of age_of_a_house is 20 (its coefficient).
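A minimal sketch of that idea; the feature names (including age_of_a_house) and the synthetic data are assumptions for illustration. After standardization, the magnitude of each LinearRegression coefficient can be read as that feature's importance.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

feature_names = ["age_of_a_house", "num_rooms", "lot_size"]  # hypothetical names
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = 20 * X[:, 0] + 5 * X[:, 1] + rng.randn(200)  # age_of_a_house dominates

X_scaled = StandardScaler().fit_transform(X)  # standardize so coefficients are comparable
model = LinearRegression().fit(X_scaled, y)

# On standardized inputs, |coefficient| serves as a feature importance score.
for name, coef in zip(feature_names, model.coef_):
    print(name, round(coef, 2))  # age_of_a_house comes out near 20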
As larsmans noted, chi2 cannot be used for feature selection with regression data.
Upon updating to scikit-learn version 0.13, the following code selected the top two features (according to the f_regression test) for the toy dataset described above.
import sklearn.feature_selection

def f_regression(X, Y):
    # center=True (the default) fails on sparse data with
    # "ValueError: center=True only allowed for dense data",
    # but should presumably work in general.
    return sklearn.feature_selection.f_regression(X, Y, center=False)
from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest

# Change this to the path of my toy dataset file.
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file)

# Uses the f_regression wrapper defined above (center=False for sparse input).
featureSelector = SelectKBest(score_func=f_regression, k=2)
featureSelector.fit(X_train_data, Y_train_data)

# Print the (1-based) indices of the top 2 features.
print([1 + zero_based_index for zero_based_index in featureSelector.get_support(indices=True)])