Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ensuring the Cartesian product of keys appears in a Pandas table

Tags:

python

pandas

I have a Pandas dataframe that has two key columns, and I want to ensure that the Cartesian product of those keys exist in the table (because I'll have to make a 2D plot containing all combinations). I'm having trouble coming up with a reasonably brief and idiomatic way to do this.

For example, I start with this table giving combinations of fruits and vegetables, and how they taste together:

   combo   fruit      veg
0  tasty   apple   carrot
1  yucky  banana   carrot
2  tasty  banana  lettuce
3  yucky   lemon  lettuce

I want to end up with this table in which all possible combinations occur:

    fruit      veg    combo
0   apple   carrot    tasty
1   apple  lettuce  UNKNOWN
2  banana   carrot    yucky
3  banana  lettuce    tasty
4   lemon   carrot  UNKNOWN
5   lemon  lettuce    yucky

Here's the best way I've found to do it:

import pandas as pd

# Initial data
df=pd.DataFrame(dict(fruit=['apple','banana','banana','lemon'],
                     veg=['carrot','carrot','lettuce','lettuce'],
                     combo=['tasty','yucky','tasty','yucky']))

# Solution starts here
veg=df.groupby('veg').size().reset_index()
fruit=df.groupby('fruit').size().reset_index()
fruit[0] = veg[0] = 0    #use this dummy column for the join to work!
cartesian = pd.merge(fruit, veg, how='outer', on=0)
del cartesian[0]
all_combos = pd.merge(cartesian, df, how='left')
all_combos[ pd.isnull(all_combos.combo) ] = 'UNKNOWN'

I imagine that there's got to be a simpler and less error-prone way to do this... any advice?

I'd especially appreciate it if someone could show me how to do this both with and without a multi-index containing the fruit and veg columns, because I am really stumped about how to do this with indexes. Based on my SQL experience, I'd think these are exactly the situations that indexes are intended for.

like image 597
Dan Lenski Avatar asked Jun 07 '13 23:06

Dan Lenski


1 Answers

Sometime after this answer, I added cartesian_product to pandas, and soon after MultiIndex.from_product was added (following its suggestion in another question). This enables the following simplification which is more efficient:

In [21]: p = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)

In [22]: df1.reindex(p, fill_value='UNKNOWN')
Out[22]:
                  combo
fruit  veg
apple  carrot     tasty
       lettuce  UNKNOWN
banana carrot     yucky
       lettuce    tasty
lemon  carrot   UNKNOWN
       lettuce    yucky

The older answer follows:


If you use fruit and veg as the index, then you could use itertools.product* to create the MultiIndex to reindex by:

In [10]: from itertools import product

In [11]: df
Out[11]:
   combo   fruit      veg
0  tasty   apple   carrot
1  yucky  banana   carrot
2  tasty  banana  lettuce
3  yucky   lemon  lettuce

The tricky part is to grab the right MultiIndex of all the possible fruit/veg:

In [12]: fruit_x_veg = list(product(np.unique(df['fruit']), np.unique(df['veg'])))

In [13]: fruit_x_veg = pd.MultiIndex.from_tuples(fruit_x_veg,
                                                 names=['fruit', 'veg'])

Then you can just reindex by these:

In [14]: df1 = df.set_index(['fruit', 'veg'])

In [15]: df1
Out[15]:
                combo
fruit  veg
apple  carrot   tasty
banana carrot   yucky
       lettuce  tasty
lemon  lettuce  yucky

In [16]: df1.reindex(fruit_x_veg, fill_value='UNKNOWN')
Out[16]:
                  combo
fruit  veg
apple  carrot     tasty
       lettuce  UNKNOWN
banana carrot     yucky
       lettuce    tasty
lemon  carrot   UNKNOWN
       lettuce    yucky

* If itertools.product is not fast enough consider using this numpy implemention

Note: this implementation was extended in the pandas.tools.util.cartesian_product, which now supports more dtypes (and is used under the hood in MultiIndex.from_product).

like image 181
Andy Hayden Avatar answered Oct 09 '22 18:10

Andy Hayden