
Group by sparse matrix in scipy and return a matrix

There are a few questions on SO dealing with using groupby with sparse matrices. However, the outputs seem to be lists, dictionaries, dataframes, and other objects.

I'm working on an NLP problem and would like to keep all the data in sparse scipy matrices during processing to prevent memory errors.

Here's the context:

I have vectorized some documents (sample data here):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('groupbysparsematrix.csv')
docs = df['Text'].tolist()

vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(docs)

print("Dimensions of training set: {0}".format(train_X.shape))
print(type(train_X))

Dimensions of training set: (8, 180)
<class 'scipy.sparse.csr.csr_matrix'>

From the original dataframe I use the date, in a day of the year format, to create the groups I would like to sum over:

from scipy import sparse

df['Date'] = pd.to_datetime(df['Date'])
groups = df['Date'].apply(lambda x: x.strftime('%j'))
groups_X = sparse.csr_matrix(groups.astype(float)).T
train_X_all = sparse.hstack((train_X, groups_X))

print("Dimensions of concatenated set: {0}".format(train_X_all.shape))

Dimensions of concatenated set: (8, 181)

Now I'd like to apply groupby (or a similar function) to find the sum of each token per day. I would like the output to be another sparse scipy matrix.

The output matrix would be 3 x 181 and look something like this:

 1, 1, 1, ..., 2, 1, 3
 2, 1, 3, ..., 1, 1, 4
 0, 0, 0, ..., 1, 2, 5

Where the columns 1 to 180 represent the tokens and column 181 represents the day of the year.

Andrew Brown asked Jan 06 '23 07:01


2 Answers

The best way of calculating the sum of selected columns (or rows) of a csr sparse matrix is a matrix product with another sparse matrix that has 1's where you want to sum. In fact, csr sum (for a whole row or column) works by matrix product, and indexing rows (or columns) is also done with a product (https://stackoverflow.com/a/39500986/901925).

So I'd group the dates array, and use that information to construct the summing 'mask'.
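
As a minimal sketch of the "sum by matrix product" idea (the small matrix and the names ones_row and pick are made up for illustration, not taken from the question's data):

```python
import numpy as np
from scipy import sparse

# Small illustrative matrix (made up for this sketch).
M = sparse.csr_matrix(np.array([[1, 0, 2],
                                [0, 3, 0],
                                [4, 0, 5]]))

# A 1 x 3 row of ones selects every row of M,
# so the product is the column-wise sum over all rows.
ones_row = sparse.csr_matrix(np.ones((1, 3), dtype=int))
col_sums = (ones_row @ M).toarray()

# A row with ones only at positions 0 and 2 sums just those two rows.
pick = sparse.csr_matrix(np.array([[1, 0, 1]]))
partial = (pick @ M).toarray()

print(col_sums)   # [[5 3 7]]
print(partial)    # [[5 0 7]]
```

Stacking one such selector row per group gives the summing 'mask' discussed below.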

For the sake of discussion, consider this dense array:

In [117]: A
Out[117]: 
array([[0, 2, 7, 5, 0, 7, 0, 8, 0, 7],
       [0, 0, 3, 0, 0, 1, 2, 6, 0, 0],
       [0, 0, 0, 0, 2, 0, 5, 0, 0, 0],
       [4, 0, 6, 0, 0, 5, 0, 0, 1, 4],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 7, 0, 8, 1, 0, 9, 0, 2, 4],
       [9, 0, 8, 4, 0, 0, 0, 0, 9, 7],
       [0, 0, 0, 1, 2, 0, 2, 0, 4, 7],
       [3, 0, 1, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 1, 8, 5, 0, 0, 0, 8, 0]])

Make a sparse copy:

In [118]: M=sparse.csr_matrix(A)

Generate some groups based on the last column; collections.defaultdict (from collections import defaultdict) is a convenient tool for this:

In [119]: grps=defaultdict(list)
In [120]: for i,v in enumerate(A[:,-1]):
     ...:     grps[v].append(i)

In [121]: grps
Out[121]: defaultdict(list, {0: [1, 2, 4, 9], 2: [8], 4: [3, 5], 7: [0, 6, 7]})

I can iterate on those groups, collect rows of M, sum across those rows and produce:

In [122]: {k:M[v,:].sum(axis=0) for k, v in grps.items()}
Out[122]: 
{0: matrix([[0, 0, 4, 8, 7, 2, 7, 6, 8, 0]], dtype=int32),
 2: matrix([[3, 0, 1, 0, 0, 0, 0, 0, 0, 2]], dtype=int32),
 4: matrix([[4, 7, 6, 8, 1, 5, 9, 0, 3, 8]], dtype=int32),
 7: matrix([[ 9,  2, 15, 10,  2,  7,  2,  8, 13, 21]], dtype=int32)}

In the last column, the values include 2*4 and 3*7, because the group label itself is summed once for each row in the group.

So there are 2 tasks. The first is collecting the groups, whether with this defaultdict, itertools.groupby (which in this case would require sorting), or pandas groupby. The second is collecting the rows of each group and summing them. This dictionary iteration is conceptually simple.
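
To illustrate why itertools.groupby needs the sort, here is a small sketch of the grouping step done both ways. The labels list is the last column of A above; the names order and grps_sorted are just for this example.

```python
from collections import defaultdict
from itertools import groupby

# Last column of the example array A above.
labels = [7, 0, 0, 4, 0, 4, 7, 7, 2, 0]

# defaultdict: one pass over the rows, no sorting needed.
grps = defaultdict(list)
for i, v in enumerate(labels):
    grps[v].append(i)

# itertools.groupby only merges *adjacent* equal keys,
# so sort the row indices by label first (sorted() is stable).
order = sorted(range(len(labels)), key=lambda i: labels[i])
grps_sorted = {k: list(g) for k, g in groupby(order, key=lambda i: labels[i])}

print(dict(grps))
print(grps_sorted)  # {0: [1, 2, 4, 9], 2: [8], 4: [3, 5], 7: [0, 6, 7]}
```

Both give the same label-to-rows mapping; only the key order differs.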

A masking matrix might work like this:

In [141]: mask=np.zeros((10,10),int)
In [142]: for i,v in enumerate(A[:,-1]): # same sort of iteration
     ...:     mask[v,i]=1
     ...:     
In [143]: Mask=sparse.csr_matrix(mask)
...
In [145]: Mask.A
Out[145]: 
array([[0, 1, 1, 0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       ....
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)
In [146]: (Mask*M).A
Out[146]: 
array([[ 0,  0,  4,  8,  7,  2,  7,  6,  8,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 3,  0,  1,  0,  0,  0,  0,  0,  0,  2],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 4,  7,  6,  8,  1,  5,  9,  0,  3,  8],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 9,  2, 15, 10,  2,  7,  2,  8, 13, 21],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0]], dtype=int32)

This Mask*M has the same values as the dictionary rows, but with extra all-zero rows for the indices that are not group labels. I can isolate the nonzero values with the lil format:

In [147]: (Mask*M).tolil().data
Out[147]: 
array([[4, 8, 7, 2, 7, 6, 8], [], [3, 1, 2], [],
       [4, 7, 6, 8, 1, 5, 9, 3, 8], [], [],
       [9, 2, 15, 10, 2, 7, 2, 8, 13, 21], [], []], dtype=object)

I can construct the Mask matrix directly using the coo sparse style of input:

Mask = sparse.csr_matrix((np.ones(A.shape[0],int),
    (A[:,-1], np.arange(A.shape[0]))), shape=(A.shape))

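That should be faster and avoid the memory error (no loop or large dense array). Note that shape=A.shape only works here because A is square; in general the mask needs at least max(label)+1 rows and exactly A.shape[0] columns.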
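
A possible variation, sketched here as an assumption rather than as part of the approach above: np.unique with return_inverse=True maps each label to a compact index 0..n_groups-1, so the mask gets one row per group and the product has no all-zero filler rows. The names labels, inverse and grouped are illustrative.

```python
import numpy as np
from scipy import sparse

A = np.array([[0, 2, 7, 5, 0, 7, 0, 8, 0, 7],
              [0, 0, 3, 0, 0, 1, 2, 6, 0, 0],
              [0, 0, 0, 0, 2, 0, 5, 0, 0, 0],
              [4, 0, 6, 0, 0, 5, 0, 0, 1, 4],
              [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
              [0, 7, 0, 8, 1, 0, 9, 0, 2, 4],
              [9, 0, 8, 4, 0, 0, 0, 0, 9, 7],
              [0, 0, 0, 1, 2, 0, 2, 0, 4, 7],
              [3, 0, 1, 0, 0, 0, 0, 0, 0, 2],
              [0, 0, 1, 8, 5, 0, 0, 0, 8, 0]])
M = sparse.csr_matrix(A)

# labels: sorted distinct group values; inverse: compact group index per row.
labels, inverse = np.unique(A[:, -1], return_inverse=True)
n_rows, n_groups = A.shape[0], len(labels)

# One row per group, one column per row of M, a single 1 in each column.
Mask = sparse.csr_matrix(
    (np.ones(n_rows, dtype=int), (inverse, np.arange(n_rows))),
    shape=(n_groups, n_rows))

grouped = Mask @ M          # still sparse, shape (4, 10)
print(labels)               # [0 2 4 7]
print(grouped.toarray())
```

Row k of grouped corresponds to labels[k]. As before, the last column holds the summed labels rather than the label itself, so for the question's layout it may be simpler to leave the day column out of the product and attach labels afterwards.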

hpaulj answered Jan 09 '23 20:01


Here is a trick using LabelBinarizer and matrix multiplication.

from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer(sparse_output=True)
grouped = lb.fit_transform(groups).T.dot(train_X)

grouped is the output sparse matrix of size 3 x 180. You can find the list of its groups, in row order, in lb.classes_.
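
Here is a self-contained sketch of the same trick on toy data (X, groups and the day strings are made up for illustration), which also re-attaches the day as a final column to get the layout the question asks for:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import LabelBinarizer

# Toy stand-ins for train_X (5 documents x 3 tokens) and the day-of-year labels.
X = sparse.csr_matrix(np.array([[1, 0, 2],
                                [0, 1, 0],
                                [3, 0, 0],
                                [0, 0, 4],
                                [1, 1, 1]]))
groups = np.array(['005', '006', '005', '007', '006'])

lb = LabelBinarizer(sparse_output=True)
indicator = lb.fit_transform(groups)   # sparse, shape (5, 3), one 1 per row
grouped = indicator.T.dot(X)           # sparse, shape (3, 3): token sums per day

# Re-attach the day as a last numeric column (3 x 1), keeping everything sparse.
days = sparse.csr_matrix(lb.classes_.astype(float)).T
result = sparse.hstack((grouped, days)).tocsr()

print(lb.classes_)
print(result.toarray())
```

One caveat: with exactly two distinct groups, LabelBinarizer returns a single indicator column instead of two, so the product would have only one row; the trick as written assumes three or more groups.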

Sergey Zakharov answered Jan 09 '23 20:01