Pandas sparse DataFrame to sparse matrix, without generating a dense matrix in memory

Is there a way to convert from a pandas.SparseDataFrame to scipy.sparse.csr_matrix, without generating a dense matrix in memory?

scipy.sparse.csr_matrix(df.values)

doesn't work, because df.values first builds a dense array in memory, which is only then converted to a csr_matrix.

Thanks in advance!

asked Jun 27 '15 03:06

Jake0x32

2 Answers

Pandas 0.20.0+:

As of pandas version 0.20.0, released May 5, 2017, there is a one-liner for this:

from scipy import sparse


def sparse_df_to_csr(df):
    return sparse.csr_matrix(df.to_coo())

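This uses the to_coo() method that SparseDataFrame gained in 0.20.0. It returns a scipy.sparse.coo_matrix, so no dense array is built along the way.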
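For readers on pandas 1.0 or later: SparseDataFrame was removed there, and the closest equivalent I know of is the .sparse accessor on a regular DataFrame whose columns all have a sparse dtype. The following is a minimal sketch under that assumption, not part of the original answer; the function name sparse_accessor_df_to_csr and the sample data are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def sparse_accessor_df_to_csr(df):
    # Assumes every column of df has a SparseDtype with fill value 0;
    # the .sparse accessor is not available otherwise.
    return sparse.csr_matrix(df.sparse.to_coo())


# Illustrative data only: a small frame converted to sparse columns.
dense_df = pd.DataFrame({"a": [0.0, 1.0, 0.0], "b": [2.0, 0.0, 0.0]})
sparse_df = dense_df.astype(pd.SparseDtype("float64", 0.0))

csr = sparse_accessor_df_to_csr(sparse_df)
```

Here csr should hold only the two non-zero entries, and csr.toarray() can be compared against dense_df.to_numpy() as a sanity check on small inputs.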

Earlier Versions:

Building on Victor May's answer, here's a slightly faster implementation. It only works if every column of the SparseDataFrame is sparse and uses a BlockIndex (note: if the frame was created with get_dummies, this will be the case).

Edit: I modified this so it will work with a non-zero fill value. CSR has no native non-zero fill value, so you will have to record it externally.

import numpy as np
import pandas as pd
from scipy import sparse

def sparse_BlockIndex_df_to_csr(df):
    columns = df.columns
    # For each column, collect the stored values (shifted so that the fill
    # value maps to 0) and the row positions at which they are stored.
    zipped_data = zip(*[(df[col].sp_values - df[col].fill_value,
                         df[col].sp_index.to_int_index().indices)
                        for col in columns])
    data, rows = map(list, zipped_data)
    # Column number i, repeated once per stored value in column i.
    cols = [np.full(len(a), i, dtype=np.int64) for (i, a) in enumerate(data)]
    data_f = np.concatenate(data)
    rows_f = np.concatenate(rows)
    cols_f = np.concatenate(cols)
    arr = sparse.coo_matrix((data_f, (rows_f, cols_f)),
                            df.shape, dtype=np.float64)
    return arr.tocsr()
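To illustrate the remark about recording the fill value externally, here is a small sketch that uses only NumPy and SciPy, so it does not depend on any particular pandas version. The fill value, the sample array and all variable names are hypothetical; the point is only that the matrix stores offsets from the fill value and that the fill value has to be added back by hand.

```python
import numpy as np
from scipy import sparse

# Hypothetical data: almost every entry equals the fill value.
fill_value = 1.0
dense = np.full((4, 3), fill_value)
dense[0, 1] = 5.0
dense[3, 2] = -2.0

# Store only the entries that differ from the fill value, as offsets from it.
rows, cols = np.nonzero(dense != fill_value)
offsets = dense[rows, cols] - fill_value
csr = sparse.coo_matrix((offsets, (rows, cols)), shape=dense.shape).tocsr()

# The CSR matrix knows nothing about fill_value, so it is added back manually.
restored = csr.toarray() + fill_value
```

Any code that consumes the CSR matrix has to be handed fill_value separately, since the matrix itself treats missing entries as 0.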

answered Oct 06 '22 13:10

T.C. Proctor

The answer by @Marigold does the trick, but it is slow due to accessing all elements in each column, including the zeros. Building on it, I wrote the following quick n' dirty code, which runs about 50x faster on a 1000x1000 matrix with a density of about 1%. My code also handles dense columns appropriately.

import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

def sparse_df_to_array(df):
    num_rows = df.shape[0]

    data = []
    row = []
    col = []

    for i, col_name in enumerate(df.columns):
        if isinstance(df[col_name], pd.SparseSeries):
            # Sparse column: keep only the stored values and their row positions.
            # to_int_index() works for both BlockIndex and IntIndex.
            ix = df[col_name].sp_index.to_int_index().indices
            data.append(df[col_name].sp_values)
            row.append(ix)
            col.append(np.full(len(ix), i))
        else:
            # Dense column: every row is stored explicitly.
            data.append(df[col_name].values)
            row.append(np.arange(num_rows))
            col.append(np.full(num_rows, i))

    data_f = np.concatenate(data)
    row_f = np.concatenate(row)
    col_f = np.concatenate(col)

    arr = coo_matrix((data_f, (row_f, col_f)), df.shape, dtype=np.float64)
    return arr.tocsr()
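The same column-by-column idea can be sketched for pandas 1.0 or later, where SparseSeries no longer exists and sparse columns are ordinary Series with a SparseDtype. This is an untuned port under my own assumptions, not the answer's original code: the name mixed_df_to_csr is invented, column labels are assumed to be unique, and sparse columns are assumed to use a fill value of 0.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def mixed_df_to_csr(df):
    data, rows, cols = [], [], []
    for j, name in enumerate(df.columns):
        column = df[name]
        if isinstance(column.dtype, pd.SparseDtype):
            arr = column.array  # the underlying SparseArray
            if arr.fill_value != 0:
                raise ValueError("this sketch assumes a fill value of 0")
            values = arr.sp_values
            idx = arr.sp_index.to_int_index().indices
        else:
            # Dense column: every row is stored explicitly.
            values = column.to_numpy()
            idx = np.arange(len(column))
        data.append(values)
        rows.append(idx)
        cols.append(np.full(len(values), j))
    coo = sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
        shape=df.shape, dtype=np.float64)
    return coo.tocsr()


# Illustrative frame with one sparse and one dense column.
example = pd.DataFrame({
    "s": pd.arrays.SparseArray([0.0, 3.0, 0.0], fill_value=0.0),
    "d": [1.0, 0.0, 2.0],
})
result = mixed_df_to_csr(example)
```

Dense columns contribute every row, including explicit zeros, so the result is only as compact as the sparse columns allow.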

answered Oct 06 '22 11:10

nojka_kruva

Tags: python, pandas, scipy, sparse-matrix