In the docs for the chi-squared univariate feature selection function of scikit-learn http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html, it states
This score can be used to select the n_features features with the highest values for the χ² (chi-square) statistic from X, which must contain booleans or frequencies (e.g., term counts in document classification), relative to the classes.
I am struggling to understand what the corresponding contingency table would look like, especially in the case of frequency features.
For example, consider the below dataset with boolean features and targets:
import numpy as np
>>> X = np.random.randint(2, size=50).reshape(10, 5)
array([[1, 0, 0, 0, 1],
[1, 1, 0, 1, 1],
[1, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1],
[1, 0, 0, 0, 1],
[1, 0, 1, 1, 1],
[0, 1, 1, 0, 0],
[1, 0, 1, 1, 1],
[1, 1, 1, 1, 0]])
>>> y = np.random.randint(2, size=10)
array([1, 0, 0, 0, 1, 1, 1, 1, 0, 1])
To construct the contingency table with respect to the first feature, we can do this (excuse my PEP8 violation)
import scipy as sp
>>> contingency_table = sp.sparse.coo_matrix(
... (np.ones_like(y), (X[:, 0], y)),
... shape=(np.unique(X[:, 0]).shape[0], np.unique(y).shape[0])).A
array([[1, 2],
[3, 4]])
So now I can calculate the chi-squared statistic and its p-values
>>> sp.stats.chi2_contingency(contingency_table)
(0.17857142857142855,
0.67260381744151676,
1,
array([[ 1.2, 1.8],
[ 2.8, 4.2]]))
And this ought to be consistent with scikit-learn's chi2
from sklearn.feature_selection import chi2
>>> chi2_, pval = chi2(X, y)
>>> chi2_[0], pval[0]
(0.023809523809523787, 0.87737055606414338)
...Nope. Have I misinterpreted something?
Also, what does the contingency table look like in the case of frequencies? I assumed it would be something like
contingency_table = sp.sparse.coo_matrix(
(np.ones_like(y), (X[:, 0], y)),
shape=(X[:, 0].max()+1, np.unique(y).shape[0])).A
But the corresponding table of expected frequencies will most likely have several zero elements.
Edit:
To clarify further, consider the first feature X[:, 0]
that is, say, gender and the targets y
, say, handedness.
From this we get the cross tabulation
Right-handed Left-handed (!right-handed)
Male 1 2
Female (!male) 3 4
And we can assess the significance of the difference between the two proportions using the Chi-squared test by setting the expected frequency
sklearn.feature_selection.chi2
does this directly without resorting to explicitly computing the table and obtains the scores using a more efficient procedure that is equivalent to scipy.stats.chisquare
.
After explicitly enumerating the table shown above, I wanted to verify it is consistent with chi2
when applying scipy.stats.chi2_contingency
and to my dismay, it isn't. I'd like to ask why it isn't.
Consider a column x
of X
. sklearn.feature_selection.chi2
tests whether
the frequencies of the y
values where x
is 1 agree with the frequencies of y
in
the full population. (@larsman's answer shows how you can reproduce the calculation with numpy and scipy.) This is not the same as the standard 2x2 contingency table
analysis of x
and y
. In a 2x2 contingency table analysis, the frequencies of y
where x
is 0 also contribute to the test.
Suppose we form the contingency table for x
and y
:
| y=0 y=1
----+---------
x=0 | a b
x=1 | c d
Let n = a + b + c + d. This is the number of samples (i.e. same as len(x) and len(y)).
Let nx = c + d. This is the number of occurrences of 1
in x
.
Let py1 = (b + d)/n. This is the fraction of the full population where y is 1.
sklearn.feature_selection.chi2
performs a chi2 test on [c, d] using the expected
values [(1-py1)*nx, py1*nx]. This is not the same as the standard contingency table
analysis of a 2x2 table.
Here's an extreme example. Suppose the 2x2 contingency table for x
and y
is
| y=0 y=1
----+----------
x=0 | 8 8
x=1 | 20 188
The sklearn calculation produces a chi2 score of 1.58, with a p-value of 0.208.
The contingency table analysis of scipy.stats.chi2_contingency
gives a chi2 score of 18.6, with a p-value of 1.60e-5.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With