Efficiently create sparse pivot tables in pandas?

I'm working on turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then convert the result into some kind of sparse representation, but that isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.

Alternatively, is there some kind of sparse pivot capability outside of pandas?

edit: here is an example of a non-sparse pivot

    import pandas as pd

    frame = pd.DataFrame()
    frame['person'] = ['me', 'you', 'him', 'you', 'him', 'me']
    frame['thing'] = ['a', 'a', 'b', 'c', 'd', 'd']
    frame['count'] = [1, 1, 1, 1, 1, 1]

    frame

      person thing  count
    0     me     a      1
    1    you     a      1
    2    him     b      1
    3    you     c      1
    4    him     d      1
    5     me     d      1

    # keyword arguments work on both old and new pandas versions
    frame.pivot(index='person', columns='thing')

            count
    thing       a   b   c   d
    person
    him       NaN   1 NaN   1
    me          1 NaN NaN   1
    you         1 NaN   1 NaN

This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.

http://docs.scipy.org/doc/scipy/reference/sparse.html

Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.
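To make the memory argument concrete, here is a minimal sketch (not part of the original post) that compares the size a dense int64 matrix would need with the size of the arrays backing a scipy CSR matrix. The dimensions, the number of records, and the helper name `csr_nbytes` are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
n_rows, n_cols, n_records = 10_000, 1_000, 50_000  # illustrative sizes only

# Random (row, col) pairs standing in for (person, thing) records.
rows = rng.integers(0, n_rows, size=n_records)
cols = rng.integers(0, n_cols, size=n_records)
counts = np.ones(n_records, dtype=np.int64)

sparse = csr_matrix((counts, (rows, cols)), shape=(n_rows, n_cols))


def csr_nbytes(m):
    """Bytes used by the three arrays that back a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# Size of the equivalent dense matrix, computed without allocating it.
dense_nbytes = n_rows * n_cols * np.dtype(np.int64).itemsize

print(f"dense:  {dense_nbytes / 1e6:.1f} MB")
print(f"sparse: {csr_nbytes(sparse) / 1e6:.2f} MB")
```

The gap grows with the share of empty cells, which is why building the sparse structure directly, without the dense intermediate, is attractive.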

asked Jul 27 '15 19:07

neelshiv

2 Answers

Here is a method that creates a sparse scipy matrix from the data and from the row and column indices of person and thing. person_u and thing_u are lists of the unique entries for the rows and columns of the pivot you want to create. Note: this assumes that your count column already has the value you want in it.

    from scipy.sparse import csr_matrix

    person_u = list(sort(frame.person.unique()))
    thing_u = list(sort(frame.thing.unique()))

    data = frame['count'].tolist()
    row = frame.person.astype('category', categories=person_u).cat.codes
    col = frame.thing.astype('category', categories=thing_u).cat.codes
    sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))

    >>> sparse_matrix
    <3x4 sparse matrix of type '<type 'numpy.int64'>'
        with 6 stored elements in Compressed Sparse Row format>

    >>> sparse_matrix.todense()
    matrix([[0, 1, 0, 1],
            [1, 0, 0, 1],
            [1, 0, 1, 0]])
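The note about the count column matters when the same (person, thing) pair appears more than once. As a hedged illustration (the duplicated record below is made up, not taken from the question), a csr_matrix built from (data, (row, col)) triplets sums entries that share the same coordinates, so repeated pairs end up added together instead of being kept as separate records:

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical records: the pair ('me', 'a') appears twice.
frame = pd.DataFrame({
    'person': ['me', 'you', 'me'],
    'thing': ['a', 'a', 'a'],
    'count': [1, 1, 1],
})

# Categorical codes map each label to an integer position
# (categories are the sorted unique labels).
person = frame['person'].astype('category')
thing = frame['thing'].astype('category')

sparse_matrix = csr_matrix(
    (
        frame['count'].to_numpy(),
        (person.cat.codes.to_numpy(), thing.cat.codes.to_numpy()),
    ),
    shape=(len(person.cat.categories), len(thing.cat.categories)),
)

print(list(person.cat.categories))  # row labels
print(sparse_matrix.toarray())      # the two ('me', 'a') records are summed
```

If summing is not the aggregation you want, aggregate the records first (for example with a groupby) before building the matrix.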

Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:

    dfs = pd.SparseDataFrame(
        [pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0)
         for i in np.arange(sparse_matrix.shape[0])],
        index=person_u, columns=thing_u, default_fill_value=0)

    >>> dfs
         a  b  c  d
    him  0  1  0  1
    me   1  0  0  1
    you  1  0  1  0

    >>> type(dfs)
    pandas.sparse.frame.SparseDataFrame

answered Sep 24 '22 03:09

khammel

The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:

    from scipy.sparse import csr_matrix
    from pandas.api.types import CategoricalDtype

    person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
    thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)

    row = frame.person.astype(person_c).cat.codes
    col = frame.thing.astype(thing_c).cat.codes
    sparse_matrix = csr_matrix((frame["count"], (row, col)),
                               shape=(person_c.categories.size, thing_c.categories.size))

    >>> sparse_matrix
    <3x4 sparse matrix of type '<class 'numpy.int64'>'
        with 6 stored elements in Compressed Sparse Row format>

    >>> sparse_matrix.todense()
    matrix([[0, 1, 0, 1],
            [1, 0, 0, 1],
            [1, 0, 1, 0]], dtype=int64)

    dfs = pd.SparseDataFrame(sparse_matrix,
                             index=person_c.categories,
                             columns=thing_c.categories,
                             default_fill_value=0)

    >>> dfs
         a  b  c  d
    him  0  1  0  1
    me   1  0  0  1
    you  1  0  1  0

The main changes were:

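.astype("category", categories=...) is no longer accepted. You have to create a CategoricalDtype object and pass that to .astype() instead.
the bare sort() call doesn't work anymore, so the built-in sorted() is used instead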

Other changes were more superficial:

using the category sizes instead of a length of the uniqued Series objects, just because I didn't want to make another object unnecessarily
the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
pandas SparseDataFrame accepts a scipy.sparse object directly now
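One further caveat: pd.SparseDataFrame was removed in pandas 1.0, so the final step fails on current releases. Below is a minimal sketch of the same pipeline, assuming a recent pandas where the DataFrame.sparse.from_spmatrix constructor is available. It rebuilds the question's example frame so that it runs on its own; the explicit .to_numpy() calls are a defensive choice of mine, not something the original code required.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

# The example frame from the question.
frame = pd.DataFrame({
    'person': ['me', 'you', 'him', 'you', 'him', 'me'],
    'thing': ['a', 'a', 'b', 'c', 'd', 'd'],
    'count': [1, 1, 1, 1, 1, 1],
})

person_c = CategoricalDtype(sorted(frame['person'].unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame['thing'].unique()), ordered=True)

# Integer positions of each record's row and column label.
row = frame['person'].astype(person_c).cat.codes.to_numpy()
col = frame['thing'].astype(thing_c).cat.codes.to_numpy()

sparse_matrix = csr_matrix(
    (frame['count'].to_numpy(), (row, col)),
    shape=(person_c.categories.size, thing_c.categories.size),
)

# Wrap the scipy matrix in a DataFrame whose columns are sparse arrays.
dfs = pd.DataFrame.sparse.from_spmatrix(
    sparse_matrix, index=person_c.categories, columns=thing_c.categories
)

print(dfs)
print(dfs.sparse.density)  # fraction of cells that are explicitly stored
```

For feeding a predictive model, the scipy sparse_matrix itself is often enough; the sparse DataFrame is mainly useful when you want to keep the row and column labels attached.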

answered Sep 22 '22 03:09

Alnilam

Tags: python, pandas, scipy, scikit-learn, sparse-matrix
