sklearn SelectKBest: which variables were chosen?

I'm trying to get sklearn to select the best k variables (for example k=1) for a linear regression. This works and I can get the R-squared, but it doesn't tell me which variables were the best. How can I find that out?

I have code of the following form (real variable list is much longer):

X = []
for i in range(len(df)):
    X.append([averageindegree[i], indeg3_sum[i], indeg5_sum[i], indeg10_sum[i]])


training=[]
actual=[]
counter=0
for fold in range(500):
    X_train, X_test, y_train, y_test = crossval.train_test_split(X, y, test_size=0.3)
    clf = LinearRegression()
    #clf = RidgeCV()
    #clf = LogisticRegression()
    #clf=ElasticNetCV()

    b = fs.SelectKBest(fs.f_regression, k=1) #k is number of features.
    b.fit(X_train, y_train)
    #print b.get_params

    X_train = X_train[:, b.get_support()]
    X_test = X_test[:, b.get_support()]


    clf.fit(X_train,y_train)
    sc = clf.score(X_train, y_train)
    training.append(sc)
    #print "The training R-Squared for fold " + str(1) + " is " + str(round(sc*100,1))+"%"
    sc = clf.score(X_test, y_test)
    actual.append(sc)
    #print "The actual R-Squared for fold " + str(1) + " is " + str(round(sc*100,1))+"%"
Alexis Eggermont asked Jan 31 '14 02:01



1 Answer

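You need to use get_support. It returns a boolean mask with one entry per input column (True if the feature was selected), and it is only available once the selector has been fitted:

from sklearn.feature_selection import SelectKBest, f_regression

features_columns = [.......]  # column names, in the same order as the columns of X
fs = SelectKBest(score_func=f_regression, k=5)
fs.fit(X, y)  # the selector must be fitted before calling get_support()
print(list(zip(fs.get_support(), features_columns)))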
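
As a self-contained illustration of the same idea, here is a minimal sketch on synthetic data. The feature names are borrowed from the question, but the data, the random seed, and the choice to make y depend on the third column are all assumptions made for the example, not the asker's real setup.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical feature names, in the same order as the columns of X.
feature_names = ["averageindegree", "indeg3_sum", "indeg5_sum", "indeg10_sum"]

# Synthetic data (an assumption for illustration): y is driven almost
# entirely by the third column, so that column should be the one selected.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 2] + 0.1 * rng.normal(size=200)

selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X, y)

# Boolean mask: one entry per input column, True where the feature was kept.
mask = selector.get_support()
selected = [name for name, keep in zip(feature_names, mask) if keep]

# The same information as integer column indices.
indices = selector.get_support(indices=True)

print(selected)
print(indices)
print(selector.scores_)  # F-statistic per feature; the top k are kept
```

In the question's loop, the same lookup could be done on `b` right after `b.fit(X_train, y_train)`, which would show which variable was picked in each of the 500 splits.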
Hamid K answered Oct 03 '22 20:10