Is there a way to convert from a pandas.SparseDataFrame
to scipy.sparse.csr_matrix
, without generating a dense matrix in memory?
scipy.sparse.csr_matrix(df.values)
doesn't work as it generates a dense matrix which is cast to the csr_matrix
.
Thanks in advance!
Converting to CSR Matrix To convert a DataFrame to a CSR matrix, you first need to create indices for users and movies. Then, you can perform conversion with the sparse. csr_matrix function. It is a bit faster to convert via a coordinate (COO) matrix.
Description. S = sparse( A ) converts a full matrix into sparse form by squeezing out any zero elements. If a matrix contains many zeros, converting the matrix to sparse storage saves memory. S = sparse( m,n ) generates an m -by- n all zero sparse matrix.
Use DataFrame. sparse. from_spmatrix() to create a DataFrame with sparse values from a sparse matrix.
As of pandas version 0.20.0, released May 5, 2017, there is a one-liner for this:
from scipy import sparse
def sparse_df_to_csr(df):
return sparse.csr_matrix(df.to_coo())
This uses the new to_coo()
method.
Building on Victor May's answer, here's a slightly faster implementation, but it only works if the entire SparseDataFrame
is sparse with all BlockIndex
(note: if it was created with get_dummies
, this will be the case).
Edit: I modified this so it will work with a non-zero fill value. CSR has no native non-zero fill value, so you will have to record it externally.
import numpy as np
import pandas as pd
from scipy import sparse
def sparse_BlockIndex_df_to_csr(df):
columns = df.columns
zipped_data = zip(*[(df[col].sp_values - df[col].fill_value,
df[col].sp_index.to_int_index().indices)
for col in columns])
data, rows = map(list, zipped_data)
cols = [np.ones_like(a)*i for (i,a) in enumerate(data)]
data_f = np.concatenate(data)
rows_f = np.concatenate(rows)
cols_f = np.concatenate(cols)
arr = sparse.coo_matrix((data_f, (rows_f, cols_f)),
df.shape, dtype=np.float64)
return arr.tocsr()
The answer by @Marigold does the trick, but it is slow due to accessing all elements in each column, including the zeros. Building on it, I wrote the following quick n' dirty code, which runs about 50x faster on a 1000x1000 matrix with a density of about 1%. My code also handles dense columns appropriately.
def sparse_df_to_array(df):
num_rows = df.shape[0]
data = []
row = []
col = []
for i, col_name in enumerate(df.columns):
if isinstance(df[col_name], pd.SparseSeries):
column_index = df[col_name].sp_index
if isinstance(column_index, BlockIndex):
column_index = column_index.to_int_index()
ix = column_index.indices
data.append(df[col_name].sp_values)
row.append(ix)
col.append(len(df[col_name].sp_values) * [i])
else:
data.append(df[col_name].values)
row.append(np.array(range(0, num_rows)))
col.append(np.array(num_rows * [i]))
data_f = np.concatenate(data)
row_f = np.concatenate(row)
col_f = np.concatenate(col)
arr = coo_matrix((data_f, (row_f, col_f)), df.shape, dtype=np.float64)
return arr.tocsr()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With