I'm working on turning a list of records with two columns (A and B) into a matrix representation. I have been using the pivot function within pandas, but the result ends up being fairly large. Does pandas support pivoting into a sparse format? I know I can pivot it and then turn it into some kind of sparse representation, but that isn't as elegant as I would like. My end goal is to use it as the input for a predictive model.
Alternatively, is there some kind of sparse pivot capability outside of pandas?
edit: here is an example of a non-sparse pivot
    import pandas as pd

    frame = pd.DataFrame()
    frame['person'] = ['me', 'you', 'him', 'you', 'him', 'me']
    frame['thing'] = ['a', 'a', 'b', 'c', 'd', 'd']
    frame['count'] = [1, 1, 1, 1, 1, 1]

    frame

      person thing  count
    0     me     a      1
    1    you     a      1
    2    him     b      1
    3    you     c      1
    4    him     d      1
    5     me     d      1

    # Keyword arguments work across pandas versions; newer releases reject positional ones.
    frame.pivot(index='person', columns='thing')

            count
    thing       a    b    c    d
    person
    him       NaN    1  NaN    1
    me          1  NaN  NaN    1
    you         1  NaN    1  NaN
This creates a matrix that could contain all possible combinations of persons and things, but it is not sparse.
http://docs.scipy.org/doc/scipy/reference/sparse.html
Sparse matrices take up less space because they can imply things like NaN or 0. If I have a very large data set, this pivoting function can generate a matrix that should be sparse due to the large number of NaNs or 0s. I was hoping that I could save a lot of space/memory by generating something that was sparse right off the bat rather than creating a dense matrix and then converting it to sparse.
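To make the space argument concrete, here is a minimal sketch that builds a matrix directly from (row, column, value) triples and compares its footprint with the dense equivalent. The sizes and the random records are hypothetical, chosen only for illustration; they are not from the data set in the question.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical sizes, picked only to illustrate the difference.
n_rows, n_cols, n_records = 1000, 500, 2000

rng = np.random.default_rng(0)
rows = rng.integers(0, n_rows, size=n_records)
cols = rng.integers(0, n_cols, size=n_records)
vals = np.ones(n_records, dtype=np.int64)

# Built straight from the triples; duplicate (row, col) pairs are summed.
sparse = csr_matrix((vals, (rows, cols)), shape=(n_rows, n_cols))

# The dense version is created here only for comparison.
dense = sparse.toarray()

dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(dense_bytes, sparse_bytes)
```

The dense array stores every cell, while the CSR matrix stores only the non-zero values plus their index arrays, so the gap grows as the matrix gets emptier.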
Basically, the pivot_table() function is a generalization of the pivot() function that allows aggregation of values, for example by passing len as the aggregation function to count entries. pivot() only works, or makes sense, if you need to reshape a table and show values without any aggregation.
The Pandas pivot_table() function provides a familiar interface to create Excel-style pivot tables. The function requires at a minimum either the index= or columns= parameters to specify how to split data. The function can calculate one or multiple aggregation methods, including using custom functions.
pandas has a pivot_table function that applies a pivot to a DataFrame. Its aggfunc parameter defines the statistic to calculate when pivoting; the default is the mean, which averages the values that fall into each cell.
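As a rough illustration of the difference, the sketch below reuses the person/thing frame from the question and adds one hypothetical duplicate record (a second "me"/"d" row) so that aggregation actually matters. The choice of aggfunc="sum" and fill_value=0 is an assumption made for this example.

```python
import pandas as pd

# The question's records plus one extra ("me", "d") row, added for illustration.
frame = pd.DataFrame({
    "person": ["me", "you", "him", "you", "him", "me", "me"],
    "thing": ["a", "a", "b", "c", "d", "d", "d"],
    "count": [1, 1, 1, 1, 1, 1, 1],
})

# pivot() would raise on the duplicated ("me", "d") pair;
# pivot_table() aggregates the duplicates instead.
table = frame.pivot_table(
    index="person",
    columns="thing",
    values="count",
    aggfunc="sum",
    fill_value=0,  # show 0 instead of NaN for missing combinations
)
print(table)
```

Note that the result is still a dense DataFrame: pivot_table() changes how cells are combined, not how they are stored.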
Use DataFrame.sparse.from_spmatrix() to create a DataFrame with sparse values from a sparse matrix.
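A minimal sketch of that call, assuming a pandas version that provides the DataFrame.sparse accessor. The matrix is typed in by hand here and corresponds to the question's person/thing pivot with the missing cells stored as 0.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Hand-written 3x4 matrix (rows: him, me, you; columns: a, b, c, d).
sparse_matrix = csr_matrix([[0, 1, 0, 1],
                            [1, 0, 0, 1],
                            [1, 0, 1, 0]])

dfs = pd.DataFrame.sparse.from_spmatrix(
    sparse_matrix,
    index=["him", "me", "you"],
    columns=["a", "b", "c", "d"],
)

print(dfs)
print(dfs.dtypes)          # each column has a sparse dtype
print(dfs.sparse.density)  # fraction of cells that are actually stored
```

Each column of the result is backed by a sparse array, so only the non-zero entries are stored.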
Here is a method that creates a sparse scipy matrix based on the data and the indices of person and thing. person_u and thing_u are lists representing the unique entries for the rows and columns of the pivot you want to create. Note: this assumes that your count column already has the value you want in it.
    from scipy.sparse import csr_matrix

    person_u = list(sort(frame.person.unique()))
    thing_u = list(sort(frame.thing.unique()))

    data = frame['count'].tolist()
    row = frame.person.astype('category', categories=person_u).cat.codes
    col = frame.thing.astype('category', categories=thing_u).cat.codes
    sparse_matrix = csr_matrix((data, (row, col)), shape=(len(person_u), len(thing_u)))

    >>> sparse_matrix
    <3x4 sparse matrix of type '<type 'numpy.int64'>'
        with 6 stored elements in Compressed Sparse Row format>

    >>> sparse_matrix.todense()
    matrix([[0, 1, 0, 1],
            [1, 0, 0, 1],
            [1, 0, 1, 0]])
Based on your original question, the scipy sparse matrix should be sufficient for your needs, but should you wish to have a sparse dataframe you can do the following:
    dfs = pd.SparseDataFrame(
        [pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0)
         for i in np.arange(sparse_matrix.shape[0])],
        index=person_u, columns=thing_u, default_fill_value=0)

    >>> dfs
         a  b  c  d
    him  0  1  0  1
    me   1  0  0  1
    you  1  0  1  0

    >>> type(dfs)
    pandas.sparse.frame.SparseDataFrame
The answer posted previously by @khammel was useful, but unfortunately no longer works due to changes in pandas and Python. The following should produce the same output:
    from scipy.sparse import csr_matrix
    from pandas.api.types import CategoricalDtype

    person_c = CategoricalDtype(sorted(frame.person.unique()), ordered=True)
    thing_c = CategoricalDtype(sorted(frame.thing.unique()), ordered=True)

    row = frame.person.astype(person_c).cat.codes
    col = frame.thing.astype(thing_c).cat.codes
    sparse_matrix = csr_matrix((frame["count"], (row, col)),
                               shape=(person_c.categories.size, thing_c.categories.size))

    >>> sparse_matrix
    <3x4 sparse matrix of type '<class 'numpy.int64'>'
        with 6 stored elements in Compressed Sparse Row format>

    >>> sparse_matrix.todense()
    matrix([[0, 1, 0, 1],
            [1, 0, 0, 1],
            [1, 0, 1, 0]], dtype=int64)

    dfs = pd.SparseDataFrame(sparse_matrix,
                             index=person_c.categories,
                             columns=thing_c.categories,
                             default_fill_value=0)

    >>> dfs
         a  b  c  d
    him  0  1  0  1
    me   1  0  0  1
    you  1  0  1  0
The main changes were:

- .astype() no longer accepts a categories argument alongside "category". You have to create a CategoricalDtype object.
- sort() doesn't work anymore; use the built-in sorted() instead.

Other changes were more superficial:

- The data passed to csr_matrix (frame["count"]) doesn't need to be a list object.
- SparseDataFrame accepts a scipy.sparse object directly now.
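On pandas 1.0 and later, pd.SparseDataFrame itself has been removed, so the last step needs another adjustment. The following is a sketch of the same pipeline with DataFrame.sparse.from_spmatrix() swapped in for SparseDataFrame. The explicit .to_numpy() calls are a cautious choice on my part, not something the answers above require.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy.sparse import csr_matrix

frame = pd.DataFrame({
    "person": ["me", "you", "him", "you", "him", "me"],
    "thing": ["a", "a", "b", "c", "d", "d"],
    "count": [1, 1, 1, 1, 1, 1],
})

# Sorted categories define the row and column order of the matrix.
person_c = CategoricalDtype(sorted(frame["person"].unique()), ordered=True)
thing_c = CategoricalDtype(sorted(frame["thing"].unique()), ordered=True)

# Category codes are the integer row/column positions of each record.
row = frame["person"].astype(person_c).cat.codes.to_numpy()
col = frame["thing"].astype(thing_c).cat.codes.to_numpy()

sparse_matrix = csr_matrix(
    (frame["count"].to_numpy(), (row, col)),
    shape=(person_c.categories.size, thing_c.categories.size),
)

# Replacement for pd.SparseDataFrame(sparse_matrix, ...).
dfs = pd.DataFrame.sparse.from_spmatrix(
    sparse_matrix,
    index=person_c.categories,
    columns=thing_c.categories,
)
print(dfs)
```

If the records contain repeated (person, thing) pairs, csr_matrix sums their values, which may or may not be the aggregation you want.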