There are a few questions on SO dealing with using groupby
with sparse matrices. However, the outputs seem to be lists, dictionaries, dataframes and other objects.
I'm working on an NLP problem and would like to keep all the data in sparse scipy matrices during processing to prevent memory errors.
Here's the context:
I have vectorized some documents (sample data here):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv('groupbysparsematrix.csv')
docs = df['Text'].tolist()
vectorizer = CountVectorizer()
train_X = vectorizer.fit_transform(docs)
print("Dimensions of training set: {0}".format(train_X.shape))
print(type(train_X))
Dimensions of training set: (8, 180)
<class 'scipy.sparse.csr.csr_matrix'>
From the original dataframe I use the date, in a day of the year format, to create the groups I would like to sum over:
from scipy import sparse
df['Date'] = pd.to_datetime(df['Date'])
groups = df['Date'].apply(lambda x: x.strftime('%j'))
groups_X = sparse.csr_matrix(groups.astype(float)).T
train_X_all = sparse.hstack((train_X, groups_X))
print("Dimensions of concatenated set: {0}".format(train_X_all.shape))
Dimensions of concatenated set: (8, 181)
Now I'd like to apply groupby
(or a similar function) to find the sum of each token per day. I would like the output to be another sparse scipy matrix.
The output matrix would be 3 x 181 and look something like this:
1, 1, 1, ..., 2, 1, 3
2, 1, 3, ..., 1, 1, 4
0, 0, 0, ..., 1, 2, 5
Where the columns 1 to 180 represent the tokens and column 181 represents the day of the year.
The best way of calculating the sum of selected columns (or rows) of a csr sparse matrix is a matrix product with another sparse matrix that has 1's where you want to sum. In fact csr sum (for a whole row or column) works by matrix product, and indexing rows (or columns) is also done with a product (https://stackoverflow.com/a/39500986/901925).
So I'd group the dates array, and use that information to construct the summing 'mask'.
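As a minimal illustration of that product idea (toy data, not the asker's matrix): summing all rows of a csr matrix is the same as left-multiplying by a row of ones.

```python
import numpy as np
from scipy import sparse

# Summing the rows of a csr matrix is equivalent to a product
# with a 1 x n sparse matrix of ones.
M = sparse.csr_matrix(np.array([[1, 0, 2],
                                [0, 3, 0],
                                [4, 0, 5]]))
ones = sparse.csr_matrix(np.ones((1, 3), dtype=int))

print((ones * M).toarray())        # [[5 3 7]] -- the column sums
print(np.asarray(M.sum(axis=0)))   # same result via the built-in sum
```

Replacing the row of ones with a 0/1 "mask" matrix selects which rows feed into each sum, which is exactly what the grouping below builds.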
For sake of discussion, consider this dense array:
In [117]: A
Out[117]:
array([[0, 2, 7, 5, 0, 7, 0, 8, 0, 7],
[0, 0, 3, 0, 0, 1, 2, 6, 0, 0],
[0, 0, 0, 0, 2, 0, 5, 0, 0, 0],
[4, 0, 6, 0, 0, 5, 0, 0, 1, 4],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 7, 0, 8, 1, 0, 9, 0, 2, 4],
[9, 0, 8, 4, 0, 0, 0, 0, 9, 7],
[0, 0, 0, 1, 2, 0, 2, 0, 4, 7],
[3, 0, 1, 0, 0, 0, 0, 0, 0, 2],
[0, 0, 1, 8, 5, 0, 0, 0, 8, 0]])
Make a sparse copy:
In [118]: M=sparse.csr_matrix(A)
generate some groups, based on the last column; collections.defaultdict
is a convenient tool to do this:
In [119]: from collections import defaultdict; grps = defaultdict(list)
In [120]: for i,v in enumerate(A[:,-1]):
...: grps[v].append(i)
In [121]: grps
Out[121]: defaultdict(list, {0: [1, 2, 4, 9], 2: [8], 4: [3, 5], 7: [0, 6, 7]})
I can iterate on those groups, collect rows of M, sum across those rows and produce:
In [122]: {k:M[v,:].sum(axis=0) for k, v in grps.items()}
Out[122]:
{0: matrix([[0, 0, 4, 8, 7, 2, 7, 6, 8, 0]], dtype=int32),
2: matrix([[3, 0, 1, 0, 0, 0, 0, 0, 0, 2]], dtype=int32),
4: matrix([[4, 7, 6, 8, 1, 5, 9, 0, 3, 8]], dtype=int32),
7: matrix([[ 9, 2, 15, 10, 2, 7, 2, 8, 13, 21]], dtype=int32)}
In the last column, the summed values include 2*4 = 8 and 3*7 = 21.
So there are two tasks: collecting the groups, whether with this defaultdict, itertools.groupby (which in this case would require sorting), or pandas groupby; and then collecting those rows and summing them. The dictionary iteration is conceptually simple.
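For the first task, a hedged sketch of the pandas route (small made-up array, not the 10x10 one): GroupBy.indices collects the positional row indices per key in one call, replacing the defaultdict loop.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Small illustrative array; the last column holds the group keys.
A = np.array([[0, 2, 7],
              [1, 0, 7],
              [3, 4, 0],
              [5, 0, 0]])
M = sparse.csr_matrix(A)

# GroupBy.indices maps each key to an array of row positions,
# e.g. {0: array([2, 3]), 7: array([0, 1])}
grps = pd.Series(np.arange(A.shape[0])).groupby(A[:, -1]).indices
sums = {k: np.asarray(M[v, :].sum(axis=0)).ravel() for k, v in grps.items()}
```

This gives the same key-to-rows mapping as the defaultdict, with the sorting handled by pandas.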
A masking matrix might work like this:
In [141]: mask=np.zeros((10,10),int)
In [142]: for i,v in enumerate(A[:,-1]): # same sort of iteration
...: mask[v,i]=1
...:
In [143]: Mask=sparse.csr_matrix(mask)
...
In [145]: Mask.A
Out[145]:
array([[0, 1, 1, 0, 1, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
....
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)
In [146]: (Mask*M).A
Out[146]:
array([[ 0, 0, 4, 8, 7, 2, 7, 6, 8, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 3, 0, 1, 0, 0, 0, 0, 0, 0, 2],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 4, 7, 6, 8, 1, 5, 9, 0, 3, 8],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 9, 2, 15, 10, 2, 7, 2, 8, 13, 21],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int32)
This Mask*M has the same values as the dictionary rows, but with extra 0s. I can isolate the nonzero values with the lil format:
In [147]: (Mask*M).tolil().data
Out[147]:
array([[4, 8, 7, 2, 7, 6, 8], [], [3, 1, 2], [],
[4, 7, 6, 8, 1, 5, 9, 3, 8], [], [],
[9, 2, 15, 10, 2, 7, 2, 8, 13, 21], [], []], dtype=object)
I can construct the Mask matrix directly using the coo sparse style of input:
Mask = sparse.csr_matrix((np.ones(A.shape[0],int),
(A[:,-1], np.arange(A.shape[0]))), shape=(A.shape))
That should be faster and avoid the memory error (no loop or large dense array).
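Putting the pieces together, a self-contained sketch (same 10x10 array as above) checking that the coo-built mask reproduces the Mask*M result:

```python
import numpy as np
from scipy import sparse

# The same A as in the session above; last column holds the group keys.
A = np.array([[0, 2, 7, 5, 0, 7, 0, 8, 0, 7],
              [0, 0, 3, 0, 0, 1, 2, 6, 0, 0],
              [0, 0, 0, 0, 2, 0, 5, 0, 0, 0],
              [4, 0, 6, 0, 0, 5, 0, 0, 1, 4],
              [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
              [0, 7, 0, 8, 1, 0, 9, 0, 2, 4],
              [9, 0, 8, 4, 0, 0, 0, 0, 9, 7],
              [0, 0, 0, 1, 2, 0, 2, 0, 4, 7],
              [3, 0, 1, 0, 0, 0, 0, 0, 0, 2],
              [0, 0, 1, 8, 5, 0, 0, 0, 8, 0]])
M = sparse.csr_matrix(A)

# One coo-style call: a 1 at (group value, row index) for each row.
Mask = sparse.csr_matrix((np.ones(A.shape[0], int),
                          (A[:, -1], np.arange(A.shape[0]))),
                         shape=A.shape)
summed = (Mask * M).toarray()
# Row k of `summed` is the column-wise sum of the rows whose last
# column equals k; e.g. row 0 sums rows 1, 2, 4 and 9.
```

Note the coo constructor also sums any duplicate (row, col) entries, which is harmless here since each column index appears once.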
Here is a trick using LabelBinarizer
and matrix multiplication.
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer(sparse_output=True)
grouped = lb.fit_transform(groups).T.dot(train_X)
grouped is the output sparse matrix of size 3 x 180, and you can find the list of its groups in lb.classes_.
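A self-contained sketch of the same trick on stand-in data (invented token counts and day-of-year keys, not the asker's CSV; three distinct keys, since with only two classes LabelBinarizer would emit a single binary column instead of a one-hot matrix):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import LabelBinarizer

# 4 "documents" with 3 token counts each, plus day-of-year keys.
train_X = sparse.csr_matrix(np.array([[1, 0, 2],
                                      [0, 3, 0],
                                      [4, 0, 5],
                                      [1, 1, 1]]))
groups = ['005', '012', '005', '033']

# One-hot encode the keys as a sparse matrix; its transpose times
# train_X sums the rows sharing a key, one output row per class.
lb = LabelBinarizer(sparse_output=True)
grouped = lb.fit_transform(groups).T.dot(train_X)
# grouped row i is the sum over documents whose key is lb.classes_[i]
```

This is the same mask-times-matrix product as in the other answer, with LabelBinarizer building the mask.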