 

How to efficiently create a SparseDataFrame from a long table?

I have a SQL table that I can read in as a Pandas DataFrame, with the following structure:

user_id    value
1          100
1          200
2          100
4          200

It's a representation of a matrix, for which all the values are 1 or 0. The dense representation of this matrix would look like this:

    100  200
1   1    1
2   1    0
4   0    1

Normally you would do this conversion with pivot, but with tens or hundreds of millions of rows in the long table the result is a huge dense matrix full of zeros, which is expensive to drag around. You can convert it to sparse afterwards, but even getting that far takes a lot of resources.
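For concreteness, here's a rough sketch of the pivot approach I'm trying to avoid (the 'flag' helper column is my own addition, just to give pivot_table something to aggregate):

import pandas

df = pandas.DataFrame({'user_id': [1, 1, 2, 4],
                       'value': [100, 200, 100, 200]})

# Dense pivot: every absent (user_id, value) pair is materialized as a 0,
# which is exactly what blows up at tens of millions of rows.
dense = (df.assign(flag=1)
           .pivot_table(index='user_id', columns='value',
                        values='flag', fill_value=0))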

Right now I'm working on a solution that assigns a row number to each user_id, sorts, and then splits the 'value' column into SparseSeries before recombining them into a SparseDataFrame. Is there a better way?

Patrick McCarthy asked Feb 23 '16

1 Answer

I arrived at a solution, albeit a slightly imperfect one.

What one can do is manually create a Pandas SparseSeries for each column of the target matrix, combine them into a dict, and then cast that dict to a DataFrame (not a SparseDataFrame). Casting to SparseDataFrame currently hits an immature constructor, which deconstructs the whole object into dense form and back into sparse regardless of the input. Building the SparseSeries into a conventional DataFrame, however, maintains sparsity while still producing a viable and otherwise complete DataFrame object.

Here's a demonstration of how to do it, written more for clarity than for performance. One difference from my own implementation is that I built the dict of sparse vectors with a dict comprehension instead of a loop (see the sketch after the listing below).

import pandas
import numpy

df = pandas.DataFrame({'user_id':[1,2,1,4],'value':[100,100,200,200]})

# Get the sorted unique users (matrix rows) and unique features (matrix columns)
unique_users = numpy.sort(df['user_id'].unique())
unique_features = numpy.sort(df['value'].unique())
num_rows = len(unique_users)
num_features = len(unique_features)


# assign each user_id to a row_number
user_lookup = pandas.DataFrame({'uid':range(num_rows), 'user_id':unique_users})


vec_dict = {}

# Create a sparse vector for each feature
for i in range(num_features):
    # user_ids that have this feature set to 1
    users_with_feature = df[df['value']==unique_features[i]]['user_id']

    # translate those user_ids into row numbers
    uid_rows = user_lookup[user_lookup['user_id'].isin(users_with_feature)]['uid']

    # build a dense 0/1 column, then compress away the zeros
    vec = numpy.zeros(num_rows)
    vec[uid_rows] = 1

    sparse_vec = pandas.Series(vec).to_sparse(fill_value=0)

    vec_dict[unique_features[i]] = sparse_vec


my_pandas_frame = pandas.DataFrame(vec_dict)    
my_pandas_frame = my_pandas_frame.set_index(user_lookup['user_id']) 
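For reference, the dict comprehension I alluded to above looks roughly like this sketch; numpy.in1d stands in for the explicit row-number lookup, relying on the fact that unique_users is already sorted in row order:

vec_dict = {
    feature: pandas.Series(
        numpy.in1d(unique_users, df[df['value'] == feature]['user_id']).astype(float)
    ).to_sparse(fill_value=0)
    for feature in unique_features
}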

The results:

>>> my_pandas_frame
         100  200
user_id          
1          1    1
2          1    0
4          0    1

>>> type(my_pandas_frame)
<class 'pandas.core.frame.DataFrame'>

>>> type(my_pandas_frame[100])
<class 'pandas.sparse.series.SparseSeries'>

Complete, but still sparse. There are a few caveats: if you do a simple copy, or a subset that isn't in-place, it will forget itself and try to recast to dense. But for my purposes I'm pretty happy with it.
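As a quick sanity check (assuming the same pre-1.0 pandas that provides .to_sparse), each column can report how much of it is actually stored:

>>> my_pandas_frame[100].density
0.6666666666666666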

Patrick McCarthy answered Oct 19 '22