I am trying to apply a univariate feature selection method using the Python module scikit-learn to a regression dataset (i.e. one with continuous-valued responses) in svmlight format.
I am working with scikit-learn version 0.11.
I have tried two approaches - the first of which failed and the second of which worked for my toy dataset but I believe would give meaningless results for a real dataset.
I would like advice regarding an appropriate univariate feature selection approach I could apply to select the top N features for a regression dataset. I would either like (a) to work out how to make the f_regression function work or (b) to hear alternative suggestions.
The two approaches mentioned above:

1. Using sklearn.feature_selection.f_regression as the score function. This failed with the following error message: "TypeError: copy() takes exactly 1 argument (2 given)"

2. Using sklearn.feature_selection.chi2 as the score function. This ran without error on my toy dataset, but chi2 is a test for classification data, so I believe it would give meaningless results for a real regression dataset.
Please find my toy dataset pasted at the end of this message.
The following code snippet should give the results I describe above.
from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest, f_regression, chi2

# Change this to the path of my toy dataset file (pasted at the end of this message).
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file)

# score_func is one of the two functions referred to above (f_regression or chi2).
featureSelector = SelectKBest(score_func=f_regression, k=2)
featureSelector.fit(X_train_data, Y_train_data)

# Print the (1-based) indices of the top 2 features.
print([1 + zero_based_index for zero_based_index in featureSelector.get_support(indices=True)])
Thanks in advance.
Richard
Contents of my contrived svmlight file - with additional blank lines inserted for clarity:

1.8 1:1.000000 2:1.000000 4:1.000000 6:1.000000#mA

1.8 1:1.000000 2:1.000000#mB

0.1 5:1.000000#mC

1.8 1:1.000000 2:1.000000#mD

0.1 3:1.000000 4:1.000000#mE

0.1 3:1.000000#mF

1.8 2:1.000000 4:1.000000 5:1.000000 6:1.000000#mG

1.8 2:1.000000#mH
In the stepwise regression technique (forward selection), we start by fitting a model with each individual predictor on its own and see which one has the lowest p-value. We pick that variable, then fit two-variable models pairing the already-selected variable with each remaining predictor in turn, again keep the one with the lowest p-value, and continue in this way, as sketched below.
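Here is a minimal sketch of that forward-selection loop, assuming statsmodels for per-coefficient p-values (scikit-learn's linear models do not expose them); the synthetic data, the forward_stepwise name, and the max_features stopping rule are assumptions for illustration.

import numpy as np
import statsmodels.api as sm

def forward_stepwise(X, y, max_features=2):
    """Greedily add the predictor whose coefficient has the lowest p-value."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        best_pval, best_j = None, None
        for j in remaining:
            design = sm.add_constant(X[:, selected + [j]])  # intercept + candidate columns
            pval = sm.OLS(y, design).fit().pvalues[-1]      # p-value of the newly added predictor
            if best_pval is None or pval < best_pval:
                best_pval, best_j = pval, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Toy usage: y depends only on columns 0 and 2, so those should be selected.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.randn(100)
print(forward_stepwise(X, y, max_features=2))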
Linear regression is a good model for testing feature selection methods, since its performance can improve noticeably when irrelevant features are removed, as the quick check below illustrates.
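A quick check of that claim on synthetic data (an assumption for illustration), using the modern sklearn.model_selection API: cross-validated R^2 for linear regression is poor when many irrelevant columns are present and recovers once only the relevant ones are kept.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_relevant = rng.randn(60, 2)    # the two informative features
X_noise = rng.randn(60, 40)      # forty irrelevant features
X_all = np.hstack([X_relevant, X_noise])
y = 4 * X_relevant[:, 0] - 3 * X_relevant[:, 1] + rng.randn(60)

# Default regression scoring is R^2; higher is better.
print(cross_val_score(LinearRegression(), X_all, y, cv=5).mean())       # degraded by the noise features
print(cross_val_score(LinearRegression(), X_relevant, y, cv=5).mean())  # much higher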
Coefficients can be used as feature importance ONLY IF the dataset was standardized before training. For example, if we apply a standard scaler to the raw dataset and then fit the model, we can say that the feature importance of age_of_a_house is 20 (its coefficient).
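A minimal sketch of that idea; the feature names (including age_of_a_house) and the synthetic data are assumptions for illustration. After standardization, the magnitude of each LinearRegression coefficient can be read as that feature's importance.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

feature_names = ["age_of_a_house", "num_rooms", "lot_size"]  # hypothetical names
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = 20 * X[:, 0] + 5 * X[:, 1] + rng.randn(200)  # age_of_a_house dominates

X_scaled = StandardScaler().fit_transform(X)  # standardize so coefficients are comparable
model = LinearRegression().fit(X_scaled, y)

# On standardized inputs, |coefficient| serves as a feature importance score.
for name, coef in zip(feature_names, model.coef_):
    print(name, round(coef, 2))  # age_of_a_house comes out near 20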
As larsmans noted, chi2 cannot be used for feature selection with regression data.
Upon updating to scikit-learn version 0.13, the following code selected the top two features (according to the f_regression test) for the toy dataset described above.
import sklearn.feature_selection

def f_regression(X, Y):
    # center=True (the default) fails on sparse data with
    # "ValueError: center=True only allowed for dense data",
    # but should presumably work in general.
    return sklearn.feature_selection.f_regression(X, Y, center=False)
from sklearn.datasets import load_svmlight_file
from sklearn.feature_selection import SelectKBest

# Change this to the path of my toy dataset file.
X_train_data, Y_train_data = load_svmlight_file(svmlight_format_train_file)

# Uses the f_regression wrapper defined above (center=False for sparse input).
featureSelector = SelectKBest(score_func=f_regression, k=2)
featureSelector.fit(X_train_data, Y_train_data)

# Print the (1-based) indices of the top 2 features.
print([1 + zero_based_index for zero_based_index in featureSelector.get_support(indices=True)])