I have a table (X, Y) where X is a matrix and Y is a vector of classes. Here is an example:

X = 0 0 1 0 1    Y = 1
    0 1 0 0 0        1
    1 1 1 0 1        0
I want to use Mann-Whitney U test to compute the feature importance(feature selection)
import numpy as np
from scipy.stats import mannwhitneyu

results = np.zeros((X.shape[1], 2))
for i in range(X.shape[1]):
    u, prob = mannwhitneyu(X[:, i], Y)
    results[i, :] = u, prob
I'm not sure whether this is correct. I obtained large values for a large table, e.g. u = 990 for some columns.
I don't think that using the Mann-Whitney U test is a good way to do feature selection. Mann-Whitney tests whether the distributions of the two variables are the same; it tells you nothing about how correlated the variables are. For example:
>>> from scipy.stats import mannwhitneyu
>>> a = np.arange(100)
>>> b = np.arange(100)
>>> np.random.shuffle(b)
>>> np.corrcoef(a,b)
array([[ 1.        , -0.07155116],
       [-0.07155116,  1.        ]])
>>> mannwhitneyu(a, b)
(5000.0, 0.49951259627554112) # result for almost not correlated
>>> mannwhitneyu(a, a)
(5000.0, 0.49951259627554112) # result for perfectly correlated
Because a and b have the same distribution, we fail to reject the null hypothesis that the distributions are identical. And since in feature selection you are trying to find the features that best explain Y, the Mann-Whitney U test does not help you with that.
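To make the contrast concrete, here is a sketch using a measure of association instead (Pearson's r via scipy.stats.pearsonr, chosen just for illustration): unlike mannwhitneyu, it does tell the perfectly correlated case apart from the shuffled one.

```python
import numpy as np
from scipy.stats import pearsonr

np.random.seed(0)  # fixed seed so the shuffle is reproducible
a = np.arange(100)
b = np.arange(100)
np.random.shuffle(b)

r_aa, _ = pearsonr(a, a)  # perfectly correlated -> r = 1.0
r_ab, _ = pearsonr(a, b)  # shuffled -> r close to 0

print(r_aa)  # 1.0
print(r_ab)  # small in magnitude
```

For class labels Y, a measure built for the task (e.g. mutual information or a chi-squared score) would be the more usual choice, but the point is the same: score association with Y, not equality of distributions.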