
How to use Mann-Whitney U test in learning

I have a table (X, Y) where X is a matrix and Y is a vector of classes. Here is an example:

X = 0 0 1 0 1   and Y = 1
    0 1 0 0 0           1
    1 1 1 0 1           0

I want to use the Mann-Whitney U test to compute feature importance (for feature selection):

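import numpy as np
from scipy.stats import mannwhitneyu

# One row per feature: column 0 holds U, column 1 holds the p-value.
results = np.zeros((X.shape[1], 2))
for i in range(X.shape[1]):
    # Compare the values of feature i against the label vector Y.
    u, prob = mannwhitneyu(X[:, i], Y)
    results[i, :] = u, prob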

I'm not sure whether this is correct. For a large table I obtained large values, for example u = 990 for some columns.

Hocine Ben asked Dec 12 '22 14:12



1 Answer

I don't think the Mann-Whitney U test is a good way to do feature selection. Mann-Whitney tests whether the distributions of two variables are the same; it tells you nothing about how correlated the variables are. For example:

>>> import numpy as np
>>> from scipy.stats import mannwhitneyu
>>> a = np.arange(100)
>>> b = np.arange(100)
>>> np.random.shuffle(b)
>>> np.corrcoef(a,b)
   array([[ 1.        , -0.07155116],
          [-0.07155116,  1.        ]])
>>> mannwhitneyu(a, b)
(5000.0, 0.49951259627554112) # result for almost uncorrelated a and b
>>> mannwhitneyu(a, a)
(5000.0, 0.49951259627554112) # result for perfectly correlated

Because a and b have the same distribution, we fail to reject the null hypothesis that the distributions are identical, even though a and b are almost uncorrelated.
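
The same point can be reproduced as a standalone script. This is only a sketch: it assumes a recent SciPy release in which mannwhitneyu accepts an `alternative` keyword and returns a result with `statistic` and `pvalue` fields, and the seed is arbitrary. The exact p-value depends on the SciPy version and its defaults, so it may differ from the session above; what matters is that both calls give the same answer.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
a = np.arange(100)
b = rng.permutation(a)          # same values as a, in a random order

# The test only sees the two sets of values, not how they are paired up,
# so shuffling b cannot change the result.
res_ab = mannwhitneyu(a, b, alternative="two-sided")
res_aa = mannwhitneyu(a, a, alternative="two-sided")

print("correlation of a and b:", np.corrcoef(a, b)[0, 1])
print("a vs b:", res_ab.statistic, res_ab.pvalue)
print("a vs a:", res_aa.statistic, res_aa.pvalue)
```

Both calls report the same U and the same p-value, whatever the correlation between a and b turns out to be.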

And since in feature selection you are trying to find the features that best explain Y, the Mann-Whitney U test does not help you with that.
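
To make "explains Y" a bit more concrete, here is a minimal sketch that scores each column of X by the absolute value of its correlation with Y, using the same np.corrcoef as above. This is just one possible illustration, not a recommended selection method: the helper name abs_correlation_scores is made up for this example, and the data is the small table from the question.

```python
import numpy as np

# Toy data from the question: rows are samples, columns are features.
X = np.array([[0, 0, 1, 0, 1],
              [0, 1, 0, 0, 0],
              [1, 1, 1, 0, 1]])
Y = np.array([1, 1, 0])

def abs_correlation_scores(X, Y):
    """Hypothetical helper: |Pearson r| between each column of X and Y."""
    scores = np.zeros(X.shape[1])
    for i in range(X.shape[1]):
        column = X[:, i]
        # A constant column (or constant Y) has no defined correlation;
        # leave its score at 0 instead of producing NaN.
        if column.std() == 0 or Y.std() == 0:
            continue
        scores[i] = abs(np.corrcoef(column, Y)[0, 1])
    return scores

scores = abs_correlation_scores(X, Y)
print(scores)
```

Unlike the U statistic, this score changes when the pairing between a feature and Y changes, which is the property a feature-selection score needs.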

Akavall answered Dec 31 '22 12:12
