
Scikit-learn χ² (chi-squared) statistic and corresponding contingency table

The docs for scikit-learn's chi-squared univariate feature selection function (http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) state:

This score can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from X, which must contain booleans or frequencies (e.g., term counts in document classification), relative to the classes.

I am struggling to understand what the corresponding contingency table would look like, especially in the case of frequency features.

For example, consider the below dataset with boolean features and targets:

import numpy as np

>>> X = np.random.randint(2, size=50).reshape(10, 5)
array([[1, 0, 0, 0, 1],
       [1, 1, 0, 1, 1],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 1, 1, 1],
       [0, 1, 1, 0, 0],
       [1, 0, 1, 1, 1],
       [1, 1, 1, 1, 0]])

>>> y = np.random.randint(2, size=10)
array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])

To construct the contingency table with respect to the first feature, we can do this (excuse my PEP8 violation)

import scipy as sp
import scipy.sparse
import scipy.stats

>>> contingency_table = sp.sparse.coo_matrix(
...    (np.ones_like(y), (X[:, 0], y)),
...    shape=(np.unique(X[:, 0]).shape[0], np.unique(y).shape[0])).toarray()
array([[1, 2],
       [3, 4]])

So now I can calculate the chi-squared statistic and its p-value:

>>> sp.stats.chi2_contingency(contingency_table)
(0.17857142857142855,
 0.67260381744151676,
 1,
 array([[ 1.2,  1.8],
       [ 2.8,  4.2]]))

And this ought to be consistent with scikit-learn's chi2

from sklearn.feature_selection import chi2

>>> chi2_, pval = chi2(X, y)
>>> chi2_[0], pval[0]
(0.023809523809523787, 0.87737055606414338)

...Nope. Have I misinterpreted something?

Also, what does the contingency table look like in the case of frequencies? I assumed it would be something like

contingency_table = sp.sparse.coo_matrix(
    (np.ones_like(y), (X[:, 0], y)),
    shape=(X[:, 0].max()+1, np.unique(y).shape[0])).toarray()

But the corresponding table of expected frequencies will most likely have several zero elements.

Edit:

To clarify further, consider the first feature X[:, 0] that is, say, gender and the targets y, say, handedness.

From this we get the cross tabulation

                Right-handed    Left-handed (!right-handed)
Male            1               2
Female (!male)  3               4

And we can assess the significance of the difference between the two proportions using the chi-squared test, setting the expected frequency of each cell to

E_ij = (row i total * column j total) / n

sklearn.feature_selection.chi2 does this directly without resorting to explicitly computing the table and obtains the scores using a more efficient procedure that is equivalent to scipy.stats.chisquare.

After explicitly enumerating the table shown above, I wanted to verify it is consistent with chi2 when applying scipy.stats.chi2_contingency and to my dismay, it isn't. I'd like to ask why it isn't.

tiao asked Jan 22 '14 11:01



1 Answer

Consider a column x of X. sklearn.feature_selection.chi2 tests whether the frequencies of the y values where x is 1 agree with the frequencies of y in the full population. (@larsman's answer shows how you can reproduce the calculation with numpy and scipy.) This is not the same as the standard 2x2 contingency table analysis of x and y. In a 2x2 contingency table analysis, the frequencies of y where x is 0 also contribute to the test.

Suppose we form the contingency table for x and y:

    | y=0  y=1
----+---------
x=0 |  a    b
x=1 |  c    d

Let n = a + b + c + d. This is the number of samples (i.e. the same as len(x) and len(y)).

Let nx = c + d. This is the number of occurrences of 1 in x.

Let py1 = (b + d)/n. This is the fraction of the full population where y is 1.

sklearn.feature_selection.chi2 performs a chi2 test on [c, d] using the expected values [(1-py1)*nx, py1*nx]. This is not the same as the standard contingency table analysis of a 2x2 table.
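For the first feature in the question, the calculation described above can be sketched with NumPy and scipy.stats.chisquare. This is an illustration of the arithmetic, not scikit-learn's actual implementation; the variable names follow the definitions above, and per-class sums are used on the assumption that the same reasoning carries over to count features.

```python
import numpy as np
from scipy.stats import chisquare

# First feature column X[:, 0] and targets y, copied from the question.
x = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1, 1])
y = np.array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])

nx = x.sum()    # c + d: total count of the feature over all samples
py1 = y.mean()  # (b + d) / n: fraction of samples where y is 1

# Observed [c, d]: the feature total within each class of y.
observed = np.array([x[y == 0].sum(), x[y == 1].sum()])

# Expected [c, d] if the feature were spread across classes
# in proportion to the class frequencies.
expected = np.array([(1 - py1) * nx, py1 * nx])

# One-way chi-squared test on the x == 1 row only (1 degree of freedom).
stat, pval = chisquare(observed, f_exp=expected)
print(observed, expected)
print(stat, pval)
```

Here observed is [3, 4] and expected is [2.8, 4.2], and the statistic should agree with the chi2_[0] value of about 0.0238 reported in the question. The x = 0 row [a, b] never enters the calculation.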

Here's an extreme example. Suppose the 2x2 contingency table for x and y is

    |  y=0  y=1
----+----------
x=0 |   8    8
x=1 |  20  188

The sklearn calculation produces a chi2 score of 1.58, with a p-value of 0.208.

The contingency table analysis of scipy.stats.chi2_contingency gives a chi2 score of 18.6, with a p-value of 1.60e-5.
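The two numbers above can be checked with a short sketch along the following lines. It assumes only numpy and scipy; the names sk_stat, sk_p, ct_stat and ct_p are made up for this illustration, and the first calculation mimics the sklearn approach as described above rather than calling sklearn itself.

```python
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# Rows are x = 0, x = 1; columns are y = 0, y = 1.
table = np.array([[8, 8],
                  [20, 188]])

n = table.sum()               # total number of samples
nx = table[1].sum()           # c + d: occurrences of x == 1
py1 = table[:, 1].sum() / n   # fraction of samples where y is 1

# sklearn-style test: only the x == 1 row [c, d] is compared with
# the class proportions of the full population.
sk_stat, sk_p = chisquare(table[1], f_exp=[(1 - py1) * nx, py1 * nx])

# Standard 2x2 contingency analysis: all four cells contribute.
# The result is indexed so that this works whether scipy returns
# a plain tuple or a result object.
result = chi2_contingency(table)
ct_stat, ct_p = result[0], result[1]

print(sk_stat, sk_p)
print(ct_stat, ct_p)
```

Note that chi2_contingency applies Yates' continuity correction to 2x2 tables by default. Passing correction=False changes its statistic, but the two approaches still test different hypotheses, so they are not expected to agree either way.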

Warren Weckesser answered Sep 18 '22 12:09
