
Sklearn Chi2 For Feature Selection

I'm learning about chi2 for feature selection and came across code like this

My understanding of chi2 was that a higher score means the feature is more independent of the target (and therefore less useful to the model), so we would be interested in the features with the lowest scores. However, scikit-learn's SelectKBest returns the features with the highest chi2 scores. Is my understanding of the chi2 test incorrect? Or does the chi2 score in sklearn produce something other than a chi2 statistic?

See the code below for what I mean (mostly copied from the link above, except for the end):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import pandas as pd
import numpy as np

# Load iris data
iris = load_iris()

# Create features and target
X = iris.data
y = iris.target

# Convert to categorical data by converting data to integers
X = X.astype(int)

# Select two features with highest chi-squared statistics
chi2_selector = SelectKBest(chi2, k=2)
chi2_selector.fit(X, y)

# Look at scores returned from the selector for each feature
chi2_scores = pd.DataFrame(list(zip(iris.feature_names, chi2_selector.scores_, chi2_selector.pvalues_)), columns=['ftr', 'score', 'pval'])
chi2_scores

# The features returned by SelectKBest are
# the two with the _highest_ scores
kbest = np.asarray(iris.feature_names)[chi2_selector.get_support()]
kbest
Asked by RSHAP on Aug 05 '18 at 15:08




1 Answer

Your understanding is reversed.

The null hypothesis of the chi2 test is that the two categorical variables are independent. A higher chi2 statistic is therefore stronger evidence against independence: the feature and the class are more likely dependent, which makes the feature MORE USEFUL for classification.
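To make the direction of the score concrete, here is a minimal sketch that recomputes the statistic by hand as the sum over classes of (O - E)^2 / E and compares it with sklearn's chi2. It assumes, based on my reading of how scikit-learn implements chi2, that the observed count O is the per-class sum of each feature and the expected count E is the class frequency times the feature's total. The variable names (observed, expected, manual_scores, manual_pvalues) are my own, and the sketch relies on iris having three classes.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelBinarizer

X, y = load_iris(return_X_y=True)
X = X.astype(int)  # same integer conversion as in the question

# One-hot encode the classes: Y has shape (n_samples, n_classes)
Y = LabelBinarizer().fit_transform(y)

# Observed: per-class sum of each feature, shape (n_classes, n_features)
observed = Y.T @ X

# Expected under independence: class frequency times total of each feature
class_prob = Y.mean(axis=0).reshape(-1, 1)    # (n_classes, 1)
feature_total = X.sum(axis=0).reshape(1, -1)  # (1, n_features)
expected = class_prob @ feature_total

# Chi-squared statistic per feature: large when observed deviates from expected
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)

# p-value: upper tail of the chi-squared distribution, df = n_classes - 1
manual_pvalues = chi2_dist.sf(manual_scores, df=Y.shape[1] - 1)

sklearn_scores, sklearn_pvalues = chi2(X, y)
print(np.allclose(manual_scores, sklearn_scores))
print(np.allclose(manual_pvalues, sklearn_pvalues))
```

A feature whose per-class sums sit close to what independence predicts gets a small score and a large p-value, so the score grows with dependence on the class, not with independence.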

SelectKBest gives you the best two (k=2) features, ranked by highest chi2 value. So you should use the features it returns, rather than the "other" features that the selector leaves out.

You are correct to get the chi2 statistic from chi2_selector.scores_ and the best features from chi2_selector.get_support(). It will give you 'petal length (cm)' and 'petal width (cm)' as the top 2 features based on the chi2 test of independence. Hope this clarifies the algorithm.
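As a quick check of the point above, this sketch confirms that the features kept by SelectKBest are the ones with the highest scores, and that those are also the ones with the lowest p-values. The helper names (selected, top_by_score, lowest_pvalue) are mine, not part of sklearn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X = iris.data.astype(int)  # same integer conversion as in the question
y = iris.target

selector = SelectKBest(chi2, k=2).fit(X, y)
names = np.asarray(iris.feature_names)

# Features kept by the selector
selected = set(names[selector.get_support()])

# Two features with the largest chi2 scores
top_by_score = set(names[np.argsort(selector.scores_)[-2:]])

# Two features with the smallest p-values
lowest_pvalue = set(names[np.argsort(selector.pvalues_)[:2]])

print(selected == top_by_score == lowest_pvalue)
print(sorted(selected))
```

Because every feature here is tested with the same degrees of freedom, ranking by highest score and ranking by lowest p-value give the same order.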

Answered by jose_bacoy on Oct 19 '22 at 05:10