So, I have lists of words and I need to know how often each word appears in each list. Using ".count(word)" works, but it's too slow (each list has thousands of words and I have thousands of lists).
I've been trying to speed things up with numpy. I generated a unique numerical code for each word, so I could use numpy.bincount (since it only works with integers, not strings). But I get "ValueError: array is too big".
So now I'm trying to tweak the "bins" argument of the numpy.histogram function to make it return the frequency counts I need (somehow numpy.histogram seems to have no trouble with big arrays). But so far, no luck. Has anyone out there done this before? Is it even possible? Is there some simpler solution that I'm failing to see?
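For reference, a minimal sketch of the slow per-word counting described above (the data and names are just for illustration):

word_lists = [['a', 'b', 'a'], ['b', 'c']]   # thousands of lists in practice
counts_per_list = []
for words in word_lists:
    # list.count() rescans the whole list once per distinct word, which is what gets slow
    counts_per_list.append({w: words.count(w) for w in set(words)})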
Don't use numpy for this. Use collections.Counter instead; it's designed for exactly this use case.
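A minimal sketch of what that looks like (the word list below is just for illustration):

from collections import Counter

word_list = ['a', 'b', 'c', 'a', 'b', 'd', 'c']
counts = Counter(word_list)
print(counts)             # Counter({'a': 2, 'b': 2, 'c': 2, 'd': 1})
print(counts['a'])        # 2
print(counts['missing'])  # 0 -- missing words simply count as zero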
Why not reduce your integers to the minimum set using numpy.unique:

original_keys, lookup_vals = numpy.unique(big_int_string_array, return_inverse=True)

You can then just use numpy.bincount on lookup_vals, and if you need to get back the original unique integers, you can use the values of lookup_vals as indices into original_keys.
So, something like:
import binascii
import numpy

string_list = ['a', 'b', 'c', 'a', 'b', 'd', 'c']
# encode() is needed on Python 3, where crc32 expects bytes;
# the squaring just mirrors the big integers from the original setup
int_list = [binascii.crc32(s.encode())**2 for s in string_list]
original_keys, lookup_vals = numpy.unique(int_list, return_inverse=True)
bins = numpy.bincount(lookup_vals)
Also, this approach avoids the need to square your integers in the first place.
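As a quick follow-up sketch (reusing the names from the snippet above): bincount's output is ordered the same way as original_keys, so the two can be zipped together, and lookup_vals can rebuild the original sequence:

# pair each count with the unique value it belongs to
for key, count in zip(original_keys, bins):
    print(key, count)

# or recover the original (non-unique) integer sequence
recovered = original_keys[lookup_vals]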
Thiago, you can also get the counts directly from the categorical variables with scipy's itemfreq function. Here's an example:
>>> import scipy as sp
>>> import scipy.stats
>>> rv = ['do', 're', 'do', 're', 'do', 'mi']
>>> note_frequency = sp.stats.itemfreq(rv)
>>> note_frequency
array([['do', '3'],
       ['mi', '1'],
       ['re', '2']],
      dtype='|S2')
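Note that itemfreq has since been deprecated and removed in newer SciPy releases; on a reasonably recent NumPy (1.9 or later), the same counts can be obtained with numpy.unique and return_counts=True, for example:

>>> import numpy as np
>>> rv = ['do', 're', 'do', 're', 'do', 'mi']
>>> values, counts = np.unique(rv, return_counts=True)
>>> values
array(['do', 'mi', 're'], dtype='<U2')
>>> counts
array([3, 1, 2])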