So, I have lists of words and I need to know how often each word appears in each list. Using ".count(word)" works, but it's too slow (each list has thousands of words and I have thousands of lists).
I've been trying to speed things up with numpy. I generated a unique numerical code for each word, so I could use numpy.bincount (since it only works with integers, not strings). But I get "ValueError: array is too big".
So now I'm trying to tweak the "bins" argument of the numpy.histogram function to make it return the frequency counts I need (somehow numpy.histogram seems to have no trouble with big arrays). But so far, no luck. Has anyone out there done this before? Is it even possible? Is there some simpler solution that I'm failing to see?
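For reference, a minimal sketch of the slow per-word counting described above (the data and names are just for illustration):

word_lists = [['a', 'b', 'a'], ['b', 'c']]   # thousands of lists in practice
counts_per_list = []
for words in word_lists:
    # list.count() rescans the whole list once per distinct word, which is what gets slow
    counts_per_list.append({w: words.count(w) for w in set(words)})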
Don't use numpy for this. Use collections.Counter instead; it's designed for exactly this use case.
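A minimal sketch of what that looks like (the word list below is just for illustration):

from collections import Counter

word_list = ['a', 'b', 'c', 'a', 'b', 'd', 'c']
counts = Counter(word_list)
print(counts)             # Counter({'a': 2, 'b': 2, 'c': 2, 'd': 1})
print(counts['a'])        # 2
print(counts['missing'])  # 0 -- missing words simply count as zero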
Why not reduce your integers to the minimum set using numpy.unique:

original_keys, lookup_vals = numpy.unique(big_int_string_array, return_inverse=True)

You can then just use numpy.bincount on lookup_vals, and if you need to get back the original unique integers, you can use the values of lookup_vals as indices into original_keys.
So, something like:
import binascii
import numpy

string_list = ['a', 'b', 'c', 'a', 'b', 'd', 'c']
# encode() is needed on Python 3, where crc32 expects bytes;
# the squaring just mirrors the big integers from the original setup
int_list = [binascii.crc32(s.encode())**2 for s in string_list]
original_keys, lookup_vals = numpy.unique(int_list, return_inverse=True)
bins = numpy.bincount(lookup_vals)
Also, this approach avoids the need to square your integers in the first place.
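As a quick follow-up sketch (reusing the names from the snippet above): bincount's output is ordered the same way as original_keys, so the two can be zipped together, and lookup_vals can rebuild the original sequence:

# pair each count with the unique value it belongs to
for key, count in zip(original_keys, bins):
    print(key, count)

# or recover the original (non-unique) integer sequence
recovered = original_keys[lookup_vals]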
Thiago, you can also get the counts directly from the categorical variables with scipy's itemfreq function. Here's an example:
>>> import scipy as sp
>>> import scipy.stats
>>> rv = ['do', 're', 'do', 're', 'do', 'mi']
>>> note_frequency = sp.stats.itemfreq(rv)
>>> note_frequency
array([['do', '3'],
       ['mi', '1'],
       ['re', '2']],
      dtype='|S2')
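Note that itemfreq has since been deprecated and removed in newer SciPy releases; on a reasonably recent NumPy (1.9 or later), the same counts can be obtained with numpy.unique and return_counts=True, for example:

>>> import numpy as np
>>> rv = ['do', 're', 'do', 're', 'do', 'mi']
>>> values, counts = np.unique(rv, return_counts=True)
>>> values
array(['do', 'mi', 're'], dtype='<U2')
>>> counts
array([3, 1, 2])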