
Numpy array filtering by two criteria

I'm trying to run a custom k-means clustering algorithm and am having trouble getting the document frequency for each column (term) of a 2-D numpy array, grouped by cluster. My current algorithm has two numpy arrays: a raw dataset that lists documents by terms, shape (2000, 9500), and the cluster assignment, shape (2000,). There are 5 clusters. What I need to do is create an array that lists the document frequency for each cluster - basically a count in each column where the column number matches a row number in a different array. The output will be a (5, 9500) array (clusters x terms). I'm having trouble finding a way to do the equivalent of a COUNTIF and GROUP BY. Here is some sample data and the output I would like if I ran it with only 2 clusters:

import numpy as np

dataset = np.array([[1,2,0,3,0],[0,2,0,0,3],[4,5,2,3,0],[0,0,2,3,0]])
clusters = np.array([0,1,1,0])
# run code here to get documentFrequency
print(documentFrequency)
>> [[1,1,1,2,0],[1,2,1,1,1]]

My thought would be to select out the specific rows that match each cluster, because then counting should be easy. For example, if I could split the data into the following arrays:

cluster0 = np.array([[1,2,0,3,0],[0,0,2,3,0]])
cluster1 = np.array([[0,2,0,0,3],[4,5,2,3,0]])

Any direction or pointers would be much appreciated!

asked Nov 19 '25 by flyingmeatball

2 Answers

I don't think there is any easy way to vectorize your code, but if you have only a few clusters you could do the obvious:

>>> cluster_count = np.max(clusters)+1
>>> doc_freq = np.zeros((cluster_count, dataset.shape[1]), dtype=dataset.dtype)
>>> for j in range(cluster_count):
...     doc_freq[j] = np.sum(dataset[clusters == j], axis=0)
... 
>>> doc_freq
array([[1, 2, 2, 6, 0],
       [4, 7, 2, 3, 3]])
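
Note that the loop above sums the term counts within each cluster. If by "document frequency" you mean the number of documents in the cluster with a nonzero entry for each term (which is what the expected output in the question shows), the same loop works with a comparison against zero first - a minimal sketch using the sample data from the question:

```python
import numpy as np

dataset = np.array([[1, 2, 0, 3, 0],
                    [0, 2, 0, 0, 3],
                    [4, 5, 2, 3, 0],
                    [0, 0, 2, 3, 0]])
clusters = np.array([0, 1, 1, 0])

cluster_count = np.max(clusters) + 1
doc_freq = np.zeros((cluster_count, dataset.shape[1]), dtype=int)
for j in range(cluster_count):
    # count documents in cluster j that contain each term at least once
    doc_freq[j] = np.sum(dataset[clusters == j] != 0, axis=0)

print(doc_freq)
# [[1 1 1 2 0]
#  [1 2 1 1 1]]
```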
answered Nov 21 '25 by Jaime

As @Jaime says, if you have only a few clusters it makes sense to use the usual trick of manually looping over the smallest axis length. Often that gets you most of the benefits of full vectorization with a lot less of the headache that comes with being clever.
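
For completeness, one way to skip the Python loop entirely is NumPy's `np.add.at`, which scatter-adds each row into its cluster's slot and handles repeated indices correctly. A sketch using the sample data from the question:

```python
import numpy as np

dataset = np.array([[1, 2, 0, 3, 0],
                    [0, 2, 0, 0, 3],
                    [4, 5, 2, 3, 0],
                    [0, 0, 2, 3, 0]])
clusters = np.array([0, 1, 1, 0])

cluster_count = clusters.max() + 1
sums = np.zeros((cluster_count, dataset.shape[1]), dtype=dataset.dtype)
# add each document's row into the slot for its cluster, accumulating duplicates
np.add.at(sums, clusters, dataset)
print(sums)
# [[1 2 2 6 0]
#  [4 7 2 3 3]]
```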

That said, when you find yourself wanting groupby, you're often in a domain in which a higher-level tool like pandas comes in very handy:

>>> import pandas as pd
>>> pd.DataFrame(dataset).groupby(clusters).sum()
   0  1  2  3  4
0  1  2  2  6  0
1  4  7  2  3  3

And you can easily fall back to an ndarray if needed:

>>> pd.DataFrame(dataset).groupby(clusters).sum().values
array([[1, 2, 2, 6, 0],
       [4, 7, 2, 3, 3]])
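
The pandas route also adapts cleanly to the countif-style document frequency asked for in the question (number of documents per cluster with a nonzero entry): compare against zero first, then group and sum the booleans. A sketch with the sample data:

```python
import numpy as np
import pandas as pd

dataset = np.array([[1, 2, 0, 3, 0],
                    [0, 2, 0, 0, 3],
                    [4, 5, 2, 3, 0],
                    [0, 0, 2, 3, 0]])
clusters = np.array([0, 1, 1, 0])

# boolean frame marks which terms appear in each document;
# summing booleans per cluster yields the per-term document count
doc_freq = (pd.DataFrame(dataset) != 0).groupby(clusters).sum().values
print(doc_freq)
# [[1 1 1 2 0]
#  [1 2 1 1 1]]
```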
answered Nov 21 '25 by DSM


