Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to deal with pandas column that has a list of dicts in every cell

I have a DataFrame that includes a column where every cell is made up of a list of dicts, and each list of dicts is of varying length (including 0).

An example:

df = pd.DataFrame({'ID' : [13423,294847,322844,429847], 'RANKS': [[{u'name': u'A', u'price': u'$1.00', u'rank': u'1'},
{u'name': u'B', u'price': u'$4.00', u'rank': u'2'},
{u'name': u'C', u'price': u'$3.99', u'rank': u'3'},
{u'name': u'D', u'price': u'$2.00', u'rank': u'4'},
{u'name': u'E', u'price': u'$2.50', u'rank': u'5'}],

[{u'name': u'AA', u'price': u'$1.99', u'rank': u'1'},
{u'name': u'BB', u'price': u'$6.99', u'rank': u'2'}],

[{u'name': u'Z', u'price': u'$0.99', u'rank': u'1'},
{u'name': u'Y', u'price': u'$10.00', u'rank': u'2'},
{u'name': u'X', u'price': u'$1.99', u'rank': u'3'}],[]], 'count' : [5,2,3,0]})

Note that 'count' is the number of dicts in 'RANKS.' The goal I had in mind was to creating a series of additional dataframes/tables (one for each 'rank') and link these to the main table in a HDFStore. Something like:

Rank_2
ID       Price   Name
13423    $4.00    B  
294847   $6.99    BB 
322844   $10.99   Y 
429847   NaN      NaN   


Rank_3
ID       Price   Name
13423    $3.99    C  
294847   NaN      NaN 
322844   $1.99    X 
429847   NaN      NaN   

This way I could easy query on ID and rank if needed, but the main table wouldn't get cluttered with the unwinding of this hierarchical data.

The problem, however, is I cannot figure out how to create the DataFrames from this column. I've tried a number of things, the first (to be nested in a for-loop if it worked, but of course it did not):

Rank_1 = pd.DataFrame(df.loc[df['count'] > 0]['RANKS'].map(lambda x: pd.DataFrame(x[0])))

And, the second, since price is the most important piece to me:

for i in range(0,5):
    df['rank_%s' % str(i+1)] = df[df['count'] > i]['RANKS'].map(lambda x: x[i]['price'].strip('$'))

Then convert to float. This works but is a pretty big compromise. Is there an efficient way (that won't get hung up on NaNs) to accomplish my goal of separate DataFrames for each rank?

like image 490
James Avatar asked Feb 22 '15 07:02

James


1 Answers

My gut reaction is that you probably should not break apart your DataFrame into many smaller DataFrames. Handling lots of small DataFrames requires a Python loop which is usually a step along the path to slowness. Instead I think you might be better served with one DataFrame, which flattens the list of dicts so that each inner dict is given its own row in the DataFrame. The keys of the inner dict will become new columns. I suspect this single flat DataFrame format would be able to do anything the multiple DataFrame alternative could do but faster, and it would make saving to HDFStore simple.

Suppose you have a DataFrame with a list of dicts in the RANKS column:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID' : [13423,294847,322844,429847], 'RANKS': [[{u'name': u'A', u'price': u'$1.00', u'rank': u'1'},
{u'name': u'B', u'price': u'$4.00', u'rank': u'2'},
{u'name': u'C', u'price': u'$3.99', u'rank': u'3'},
{u'name': u'D', u'price': u'$2.00', u'rank': u'4'},
{u'name': u'E', u'price': u'$2.50', u'rank': u'5'}],

[{u'name': u'AA', u'price': u'$1.99', u'rank': u'1'},
{u'name': u'BB', u'price': u'$6.99', u'rank': u'2'}],

[{u'name': u'Z', u'price': u'$0.99', u'rank': u'1'},
{u'name': u'Y', u'price': u'$10.00', u'rank': u'2'},
{u'name': u'X', u'price': u'$1.99', u'rank': u'3'}],[]], 'count' : [5,2,3,0]})

then you can build a flat DataFrame with one dict per row like this:

result = []
for idx, row in df.iterrows():
    for dct in row['RANKS']:
        dct['ID'] = row['ID']
        dct['count'] = row['count']
        result.append(dct)
del df
result = pd.DataFrame(result)
result['rank'] = result['rank'].astype(np.int32)
result['price'] = result['price'].str.replace('$', '')
result['price'] = result['price'].astype('float')
print(result)

which yields

       ID  count name  price  rank
0   13423      5    A   1.00     1
1   13423      5    B   4.00     2
2   13423      5    C   3.99     3
3   13423      5    D   2.00     4
4   13423      5    E   2.50     5
5  294847      2   AA   1.99     1
6  294847      2   BB   6.99     2
7  322844      3    Z   0.99     1
8  322844      3    Y  10.00     2
9  322844      3    X   1.99     3

Note that building result directly from the original data source (thus avoiding df altogether) would be a cleaner, less memory-demanding solution.

like image 110
unutbu Avatar answered Sep 19 '22 23:09

unutbu