
Efficiently create sparse pivot tables in pandas?

I'm working on turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but that isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.

Alternatively, is there some kind of sparse pivot capability outside of pandas?

edit: here is an example of a non-sparse pivot

    import pandas as pd

    frame = pd.DataFrame()
    frame['person'] = ['me', 'you', 'him', 'you', 'him', 'me']
    frame['thing'] = ['a', 'a', 'b', 'c', 'd', 'd']
    frame['count'] = [1, 1, 1, 1, 1, 1]

    frame

      person thing  count
    0     me     a      1
    1    you     a      1
    2    him     b      1
    3    you     c      1
    4    him     d      1
    5     me     d      1

    # keyword arguments work on both old and current pandas versions
    frame.pivot(index='person', columns='thing')

             count
    thing       a   b   c   d
    person
    him       NaN   1 NaN   1
    me          1 NaN NaN   1
    you         1 NaN   1 NaN

This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.

http://docs.scipy.org/doc/scipy/reference/sparse.html

Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.
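
To make the size difference concrete, here is a minimal sketch that compares the memory held by a dense array with the memory held by the equivalent CSR matrix. The sizes (1,000 people, 1,000 things, 5,000 records) and the random data are assumptions chosen only for illustration; they are not from the original data set.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical sizes: 1,000 people x 1,000 things with 5,000 observed pairs.
n_people, n_things, n_records = 1_000, 1_000, 5_000
rng = np.random.default_rng(0)
row = rng.integers(0, n_people, size=n_records)
col = rng.integers(0, n_things, size=n_records)
data = np.ones(n_records, dtype=np.int64)

sparse = csr_matrix((data, (row, col)), shape=(n_people, n_things))
dense = sparse.toarray()

# Bytes held by the dense array versus the three arrays backing the CSR matrix.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(dense_bytes, sparse_bytes)
```

The dense array stores every cell, while the CSR matrix stores only the non-zero values plus their index arrays, so the gap widens as the matrix gets emptier.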

neelshiv asked Jul 27 '15 19:07



2 Answers

Here is a method that creates a sparse scipy matrix based on the data and the indices of person and thing. person_u and thing_u are lists of the unique entries for the rows and columns of the pivot you want to create. Note: this assumes that your count column already has the value you want in it.

    from scipy.sparse import csr_matrix

    person_u = list(sort(frame.person.unique()))
    thing_u = list(sort(frame.thing.unique()))

    data = frame['count'].tolist()
    row = frame.person.astype('category', categories=person_u).cat.codes
    col = frame.thing.astype('category', categories=thing_u).cat.codes
    sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))

    >>> sparse_matrix
    <3x4 sparse matrix of type '<type 'numpy.int64'>'
        with 6 stored elements in Compressed Sparse Row format>

    >>> sparse_matrix.todense()
    matrix([[0, 1, 0, 1],
            [1, 0, 0, 1],
            [1, 0, 1, 0]])
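
On the note above about the count column: if the same (person, thing) pair appears in more than one record, csr_matrix sums the duplicate entries when it builds the matrix. A minimal sketch of that behavior, using hypothetical records and pd.factorize (a swapped-in way to get the integer codes, not the method used in this answer):

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical records in which ("me", "a") appears twice.
records = pd.DataFrame({
    "person": ["me", "me", "you"],
    "thing": ["a", "a", "b"],
    "count": [1, 1, 1],
})

# pd.factorize with sort=True returns integer codes plus the sorted unique labels.
row, person_u = pd.factorize(records["person"], sort=True)
col, thing_u = pd.factorize(records["thing"], sort=True)

# Duplicate (row, col) pairs are summed when the CSR matrix is built.
summed = csr_matrix(
    (records["count"].to_numpy(), (row, col)),
    shape=(len(person_u), len(thing_u)),
)
dense = summed.toarray()
print(dense)
```

If summing is not the aggregation you want, aggregate the records first (for example with a groupby) so that each pair occurs once.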

Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:

    dfs = pd.SparseDataFrame([pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0)
                              for i in np.arange(sparse_matrix.shape[0])],
                             index=person_u, columns=thing_u, default_fill_value=0)

    >>> dfs
         a  b  c  d
    him  0  1  0  1
    me   1  0  0  1
    you  1  0  1  0

    >>> type(dfs)
    pandas.sparse.frame.SparseDataFrame

khammel answered Sep 24 '22 03:09


The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:

    from scipy.sparse import csr_matrix
    from pandas.api.types import CategoricalDtype

    person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
    thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)

    row = frame.person.astype(person_c).cat.codes
    col = frame.thing.astype(thing_c).cat.codes
    sparse_matrix = csr_matrix((frame["count"], (row, col)),
                               shape=(person_c.categories.size, thing_c.categories.size))

    >>> sparse_matrix
    <3x4 sparse matrix of type '<class 'numpy.int64'>'
         with 6 stored elements in Compressed Sparse Row format>

    >>> sparse_matrix.todense()
    matrix([[0, 1, 0, 1],
            [1, 0, 0, 1],
            [1, 0, 1, 0]], dtype=int64)

    dfs = pd.SparseDataFrame(sparse_matrix,
                             index=person_c.categories,
                             columns=thing_c.categories,
                             default_fill_value=0)

    >>> dfs
         a  b  c  d
    him  0  1  0  1
    me   1  0  0  1
    you  1  0  1  0

The main changes were:

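  • .astype() no longer accepts a categories argument alongside 'category'. You have to create a CategoricalDtype object and pass that instead.
  • the bare sort() call doesn't work anymore; the built-in sorted() is used instead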

Other changes were more superficial:

  • using the category sizes instead of the lengths of the lists of unique values, just because I didn't want to create another object unnecessarily
  • the data input for the csr_matrix (frame["count"]) doesn't need to be a list object
  • pandas SparseDataFrame accepts a scipy.sparse object directly now
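
pd.SparseDataFrame and pd.SparseSeries were removed in pandas 1.0, so the final step above fails on current releases. A minimal sketch of the same pivot using DataFrame.sparse.from_spmatrix instead, assuming pandas 1.0 or later and the example frame from the question (the .to_numpy() conversions are a cautious addition, not part of the answer's code):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

# Same example records as in the question.
frame = pd.DataFrame({
    "person": ["me", "you", "him", "you", "him", "me"],
    "thing": ["a", "a", "b", "c", "d", "d"],
    "count": [1, 1, 1, 1, 1, 1],
})

person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)

# Integer codes give the row and column position of each record.
row = frame.person.astype(person_c).cat.codes.to_numpy()
col = frame.thing.astype(thing_c).cat.codes.to_numpy()
sparse_matrix = csr_matrix(
    (frame["count"].to_numpy(), (row, col)),
    shape=(person_c.categories.size, thing_c.categories.size),
)

# Wrap the scipy matrix in a DataFrame whose columns use sparse dtypes.
dfs = pd.DataFrame.sparse.from_spmatrix(
    sparse_matrix,
    index=person_c.categories,
    columns=thing_c.categories,
)
print(dfs)
```

The result is an ordinary DataFrame with sparse columns, so dfs.sparse.to_dense() recovers the dense table when needed.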
Alnilam answered Sep 22 '22 03:09