Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Vectorizing feature hashing in python

I'm wondering if anyone knows how to vectorize feature hashing in Python. For example, this is my code:

    import numpy as np
    hashlen = 5
    x = np.array([4, 7, 4, 2, 6, 8, 0, 6, 3, 1])
    h = np.array([0, 3, 1, 2, 4, 2, 1, 0, 3, 1])

In feature hashing, h represents the indices of the new vector I am hashing x to, i.e the index 0 of the hashed vector should have 4 and 6 summed up, index 1 should have 4, 0 and 1 summed up, etc. The resulting hashed vector should be:

    w = np.array([ 10, 5, 10, 10, 6])

One way of doing this is of course by looping through the hash indices, i.e:

    for itr in range(hashlen):
        w[itr] = np.sum(x[np.where(h==itr)])

For large vectors, the complexity is a function of hashlen (the length of the hashed vector). It could take too long, especially with a np.where() in it.

I want to do something like:

    w = np.zeros(hashlen)
    w[h]+= x

However, the result of this is the same as doing

    w = np.zeros(hashlen)
    w[h] = x

Can anyone let me know if I'm missing something here? Or if there's an 'easy' way of doing the feature hashing that doesn't involve too many computations?

like image 334
Ganesh Sundar Avatar asked Jul 31 '13 16:07

Ganesh Sundar


1 Answers

You can use bincount with weights to do what you are asking:

>>> np.bincount(h,weights=x)
array([ 10.,   5.,  10.,  10.,   6.])

For matrices:

>>> import numpy as np
>>> a=np.random.randint(0,5,(50,50))
>>> rand=np.random.rand(5)
>>> rand
array([ 0.10899745,  0.35296303,  0.21127571,  0.56433924,  0.27895281])
>>> b=np.take(rand,a)

#Unfortunately you cannot do it like this:
>>> np.bincount(a,weights=b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: object too deep for desired array

#There we go:
>>> np.bincount(a.flat,weights=b.flat)
array([  55.04371257,  172.59892108,   96.34172236,  297.40677707,
        145.89232039])

This used fancy indexing to see what was happening:

>>> np.bincount(a.flat)
array([505, 489, 456, 527, 523])
>>> np.bincount(a.flat)*rand
array([  55.04371257,  172.59892108,   96.34172236,  297.40677707,
        145.89232039])
like image 55
Daniel Avatar answered Sep 19 '22 14:09

Daniel