Can I trick numpy.histogram into behaving like numpy.bincount?

So, I have lists of words and I need to know how often each word appears on each list. Using ".count(word)" works, but it's too slow (each list has thousands of words and I have thousands of lists).

I've been trying to speed things up with numpy. I generated a unique numerical code for each word, so I could use numpy.bincount (since it only works with integers, not strings). But I get "ValueError: array is too big".

So now I'm trying to tweak the "bins" argument of numpy.histogram to make it return the frequency counts I need (numpy.histogram seems to have no trouble with big arrays), but so far no luck. Has anyone out there done this before? Is it even possible? Is there some simpler solution that I'm failing to see?

Parzival asked Jun 04 '13 21:06


3 Answers

Don't use numpy for this. Use collections.Counter instead. It's designed for this use case.
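As a rough sketch of what that could look like (the `word_lists` data and variable names below are made up for illustration, not taken from the question):

```python
from collections import Counter

# Hypothetical sample data: each inner list stands in for one of the word lists.
word_lists = [
    ['a', 'b', 'c', 'a', 'b', 'd', 'c'],
    ['do', 're', 'do', 're', 'do', 'mi'],
]

# One Counter per list; each is built in a single pass over its list,
# unlike calling .count(word) once per word.
counts_per_list = [Counter(words) for words in word_lists]

print(counts_per_list[0]['a'])            # frequency of 'a' in the first list
print(counts_per_list[1].most_common(1))  # most frequent word in the second list
print(counts_per_list[0]['zzz'])          # words that never appear count as 0
```

A Counter behaves like a dict keyed by word, so no integer encoding of the words is needed.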

Robert Kern answered Oct 05 '22 18:10


Why not reduce your integers to the minimum set using numpy.unique:

original_keys, lookup_vals = numpy.unique(big_int_string_array, return_inverse=True)

You can then just use numpy.bincount on lookup_vals. If you need to get back to the original integer codes, use the values of lookup_vals as indices into original_keys; the count at position i of the bincount result belongs to original_keys[i].

So, something like:

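import binascii
import numpy

string_list = ['a', 'b', 'c', 'a', 'b', 'd', 'c']
# crc32 needs bytes in Python 3; these codes stand in for your large per-word integers
int_list = [binascii.crc32(string.encode('utf-8')) for string in string_list]

# original_keys: sorted unique codes; lookup_vals: index of each item's code in original_keys
original_keys, lookup_vals = numpy.unique(int_list, return_inverse=True)

# bins[i] is the number of occurrences of original_keys[i]
bins = numpy.bincount(lookup_vals)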

Also, it avoids the need to square your integers.
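As a possible shortcut, numpy.unique can also be applied to the words themselves, which skips the integer-coding step entirely. The sketch below assumes the same sample list as above; `unique_words`, `counts` and `word_counts` are illustrative names:

```python
import numpy

string_list = ['a', 'b', 'c', 'a', 'b', 'd', 'c']

# unique_words: sorted distinct words
# lookup_vals: for each item of string_list, its index into unique_words
unique_words, lookup_vals = numpy.unique(string_list, return_inverse=True)

# lookup_vals only contains values in 0..len(unique_words)-1,
# so bincount needs just one bin per distinct word.
counts = numpy.bincount(lookup_vals)

# Pair each word with its count, converting NumPy scalars to plain Python types.
word_counts = {str(word): int(count) for word, count in zip(unique_words, counts)}
print(word_counts)
```

Because the indices are dense, the size of the bincount result depends on the number of distinct words rather than on the magnitude of any integer code.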

Henry Gomersall answered Oct 05 '22 18:10


Thiago, you can also work directly from the categorical variables with scipy's itemfreq function. Here's an example:

>>> import scipy as sp
>>> import scipy.stats
>>> rv = ['do', 're', 'do', 're', 'do', 'mi']
>>> note_frequency = sp.stats.itemfreq(rv)
>>> note_frequency
array([['do', '3'],
       ['mi', '1'],
       ['re', '2']],
      dtype='|S2')
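Note that itemfreq was deprecated in later SciPy releases in favor of numpy.unique with return_counts=True. A minimal sketch of that equivalent, assuming the same `rv` list (the `notes` and `counts` names are illustrative):

```python
import numpy

rv = ['do', 're', 'do', 're', 'do', 'mi']

# notes: sorted distinct items; counts: how many times each one occurs
notes, counts = numpy.unique(rv, return_counts=True)

# Unlike itemfreq, the counts stay integers instead of being cast to strings.
note_frequency = dict(zip(notes.tolist(), counts.tolist()))
print(note_frequency)
```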
rafaelvalle answered Oct 05 '22 18:10