Populate a Pandas SparseDataFrame from a SciPy Sparse Matrix

I noticed Pandas now has support for Sparse Matrices and Arrays. Currently, I create DataFrame()s like this:

return DataFrame(matrix.toarray(), columns=features, index=observations)

Is there a way to create a SparseDataFrame() with a scipy.sparse.csc_matrix() or csr_matrix()? Converting to dense format kills RAM badly. Thanks!

Will asked Jul 23 '13 18:07



2 Answers

A direct conversion is not supported at the moment. Contributions are welcome!

Try this. It should be OK on memory, since a SparseSeries is much like a csc_matrix (for 1 column) and is pretty space efficient:

In [37]: col = np.array([0,0,1,2,2,2])

In [38]: data = np.array([1,2,3,4,5,6],dtype='float64')

In [39]: m = csc_matrix( (data,(row,col)), shape=(3,3) )

In [40]: m
Out[40]: 
<3x3 sparse matrix of type '<type 'numpy.float64'>'
        with 6 stored elements in Compressed Sparse Column format>

In [46]: pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel()) 
                              for i in np.arange(m.shape[0]) ])
Out[46]: 
   0  1  2
0  1  0  4
1  0  0  5
2  2  3  6

In [47]: df = pd.SparseDataFrame([ pd.SparseSeries(m[i].toarray().ravel()) 
                                   for i in np.arange(m.shape[0]) ])

In [48]: type(df)
Out[48]: pandas.sparse.frame.SparseDataFrame
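For readers on newer pandas: SparseSeries and SparseDataFrame were deprecated in 0.25 and removed in 1.0, so the session above will not run there. The same idea, one sparse container per column, can be sketched with pd.arrays.SparseArray. This is only a rough equivalent, not the code from the answer; it assumes a zero fill value and builds the frame from columns instead of rows.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csc_matrix

# Same 3x3 matrix as in the session above.
row = np.array([0, 2, 2, 0, 1, 2])
col = np.array([0, 0, 1, 2, 2, 2])
data = np.array([1, 2, 3, 4, 5, 6], dtype="float64")
m = csc_matrix((data, (row, col)), shape=(3, 3))

# Densify one column at a time and wrap it in a SparseArray,
# so the full dense matrix is never materialised.
columns = {
    j: pd.arrays.SparseArray(m[:, j].toarray().ravel(), fill_value=0.0)
    for j in range(m.shape[1])
}
df = pd.DataFrame(columns)

print(df.dtypes)          # every column has a Sparse dtype
print(df.sparse.density)  # fraction of stored (non-fill) values
```

Each column is still converted to a dense 1-D array for a moment, so peak memory grows with the number of rows but not with rows times columns.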
Jeff answered Oct 17 '22 23:10


As of pandas v0.20.0, you can pass a SciPy sparse matrix directly to the SparseDataFrame constructor.

An example from the pandas docs:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
sdf = pd.SparseDataFrame(sp_arr)
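Note that SparseDataFrame was deprecated in pandas 0.25 and removed in 1.0. On those versions the closest replacement is the DataFrame.sparse.from_spmatrix constructor, which also accepts the row and column labels the question asks about. A minimal sketch follows; the `features` and `observations` labels are made-up placeholders standing in for the question's variables.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Mostly-zero test data, as in the docs example above.
rng = np.random.default_rng(0)
arr = rng.random(size=(1000, 5))
arr[arr < 0.9] = 0
sp_arr = csr_matrix(arr)

# Hypothetical labels, standing in for the question's variables.
features = [f"f{j}" for j in range(sp_arr.shape[1])]
observations = np.arange(sp_arr.shape[0])

# Build a DataFrame with sparse columns without densifying the matrix.
df = pd.DataFrame.sparse.from_spmatrix(
    sp_arr, index=observations, columns=features
)

# Round-trip back to a SciPy COO matrix when needed.
back = df.sparse.to_coo()
```

The result is a regular DataFrame whose columns have a Sparse dtype, so the usual indexing works and the `.sparse` accessor exposes the sparse-specific helpers.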
Alex answered Oct 17 '22 22:10