Using chi2 test for feature selection with continuous features (Scikit Learn)

I am trying to predict a binary (categorical) target from many continuous features, and would like to narrow my feature space before heading into model fitting. I noticed that the SelectKBest class from SKLearn's Feature Selection package has the following example on the Iris dataset (which also predicts a categorical target from continuous features):

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)

The example uses the chi2 test to determine which features should be used in the model. However, it is my understanding that the chi2 test is strictly meant for situations where categorical features are used to predict a categorical response. I did not think the chi2 test could be used in scenarios like this. Is my understanding wrong? Can the chi2 test be used to test whether a categorical variable depends on a continuous variable?

vanchman asked Apr 15 '18

People also ask

Does chi-square test work on continuous data?

The Chi-Square Test of Independence can only compare categorical variables. It cannot make comparisons between continuous variables or between categorical and continuous variables.

Can we use chi-square for feature selection?

In feature selection, we aim to select the features that are highly dependent on the response. When a feature and the response are independent, the observed count is close to the expected count, so we get a smaller Chi-Square value. A high Chi-Square value therefore indicates that the hypothesis of independence is incorrect.

What is chi2 in Sklearn?

The null hypothesis of the chi2 test is that "two categorical variables are independent". So a higher value of the chi2 statistic means the two variables are dependent, making that feature MORE USEFUL for classification. SelectKBest gives you the best two (k=2) features based on the highest chi2 values.
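
To make the observed-versus-expected idea concrete, here is a small sketch (toy counts of my own, not from this page) that runs the test on a 2x2 contingency table of a binary feature against a binary class, using scipy.stats.chi2_contingency:

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = feature value (0/1), columns = class (0/1).
observed = np.array([[30, 10],   # feature = 0: 30 samples in class 0, 10 in class 1
                     [ 5, 55]])  # feature = 1:  5 samples in class 0, 55 in class 1

# correction=False gives the plain (uncorrected) chi2 statistic.
stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)   # counts we would see if feature and class were independent
print(stat, p)    # large statistic, tiny p-value -> feature and class look dependent

The farther the observed counts drift from the expected ones, the larger the statistic, which is exactly what SelectKBest ranks features by.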


2 Answers

The SelectKBest function with the chi2 test only works with categorical data. In fact, the result of the test is only really meaningful if the features contain just 1's and 0's.

If you inspect the implementation of chi2 a little, you will see that the code just applies a sum across each feature, which means the function expects binary values. The parameters that the chi2 function receives also indicate this:

def chi2(X, y):
    """...

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = (n_samples, n_features_in)
        Sample vectors.
    y : array-like, shape = (n_samples,)
        Target vector (class labels).
    """

This means the function expects to receive each feature vector with all of its samples. But later, when the expected values are calculated, you will see:

feature_count = X.sum(axis=0).reshape(1, -1)    # total count of each feature across all samples
class_prob = Y.mean(axis=0).reshape(1, -1)      # empirical probability of each class
expected = np.dot(class_prob.T, feature_count)  # counts expected under independence

These lines of code only make sense if the X and Y vectors contain just 1's and 0's.
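
To sanity-check that reading of the implementation, here is a minimal sketch (toy data of my own, not from the answer) that mirrors the computation above on a small 0/1 feature matrix and compares the result against sklearn.feature_selection.chi2:

import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelBinarizer

# Toy data: 6 samples, 2 binary features, binary target -- every value is 0 or 1.
X = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1],
              [0, 1],
              [0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])

Y = LabelBinarizer().fit_transform(y)
if Y.shape[1] == 1:                  # binary target -> one indicator column per class
    Y = np.hstack([1 - Y, Y])

observed = Y.T @ X                              # per-class sums of each feature
feature_count = X.sum(axis=0).reshape(1, -1)    # total count of each feature
class_prob = Y.mean(axis=0).reshape(1, -1)      # empirical class probabilities
expected = class_prob.T @ feature_count         # counts expected under independence

manual = ((observed - expected) ** 2 / expected).sum(axis=0)
print(manual)          # chi2 statistic computed by hand, one value per feature
print(chi2(X, y)[0])   # sklearn's scores -- should match the manual values

The sums only behave like contingency-table counts because each entry of X is 0 or 1, which is the point made above.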

lalfab answered Oct 17 '22


I agree with @lalfab; however, it's not clear to me why sklearn provides an example of using chi2 on the iris dataset, which has all continuous variables. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X, y = load_digits(return_X_y=True)
>>> X.shape
(1797, 64)
>>> X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
>>> X_new.shape
(1797, 20)
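
If chi2 really expects categorical (count-like) data, one common workaround for continuous features, offered here as a sketch rather than something either answer endorses, is to discretize them first, for example with KBinsDiscretizer, so chi2 sees non-negative indicator counts:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)

# Bin each continuous feature into 5 one-hot indicator columns.
# The bin count (5) and uniform strategy are arbitrary choices for illustration.
binner = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='uniform')
X_binned = binner.fit_transform(X)
print(X_binned.shape)                                  # (150, 20): 4 features x 5 bins

X_new = SelectKBest(chi2, k=10).fit_transform(X_binned, y)
print(X_new.shape)                                     # (150, 10)

Note that each chi2 score then applies to a single bin indicator rather than a whole original feature, so the selected columns have to be mapped back to features by hand.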
xyzzy answered Oct 17 '22