pivot irregular dictionary of lists into pandas dataframe

(Or a list of lists... I just edited)

Is there an existing Python/pandas method for converting a structure like this

food2 = {}
food2["apple"]   = ["fruit", "round"]
food2["bananna"] = ["fruit", "yellow", "long"]
food2["carrot"]  = ["veg", "orange", "long"]
food2["raddish"] = ["veg", "red"]

into a pivot table like this?

+---------+-------+-----+-------+------+--------+--------+-----+
|         | fruit | veg | round | long | yellow | orange | red |
+---------+-------+-----+-------+------+--------+--------+-----+
| apple   | 1     |     | 1     |      |        |        |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| bananna | 1     |     |       | 1    | 1      |        |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| carrot  |       | 1   |       | 1    |        | 1      |     |
+---------+-------+-----+-------+------+--------+--------+-----+
| raddish |       | 1   |       |      |        |        | 1   |
+---------+-------+-----+-------+------+--------+--------+-----+

Naively, I would just loop through the dictionary. I can see how to use map on each inner list, but I don't know how to join/stack the results across the whole dictionary. Once they were joined, I could just use pandas.pivot_table.

import pandas as pd

for key in food2:
    attrlist = food2[key]
    # pair the key with each of its attributes
    onefruit_pairs = map(lambda x: [key, x], attrlist)
    one_fruit_frame = pd.DataFrame(onefruit_pairs, columns=['fruit', 'attr'])
    print(one_fruit_frame)

     fruit    attr
0  bananna   fruit
1  bananna  yellow
2  bananna    long
    fruit    attr
0  carrot     veg
1  carrot  orange
2  carrot    long
   fruit   attr
0  apple  fruit
1  apple  round
     fruit attr
0  raddish  veg
1  raddish  red

asked Jan 11 '16 17:01

Mark Miller

2 Answers

An answer using pandas.

import pandas as pd

# Test data
food2 = {}
food2["apple"]   = ["fruit", "round"]
food2["bananna"] = ["fruit", "yellow", "long"]
food2["carrot"]  = ["veg", "orange", "long"]
food2["raddish"] = ["veg", "red"]

# wrapping each list in a Series pads the shorter lists with NaN
df = pd.DataFrame({k: pd.Series(v) for k, v in food2.items()})
# reshaping to long format (one row per item/category pair)
df = pd.melt(df, var_name='item', value_name='categ')
# cross-tabulation
df = pd.crosstab(df['item'], df['categ'])
# sorting index, maybe not necessary
df.sort_index(inplace=True)
df

# categ    fruit  long  orange  red  round  veg  yellow
# item                                                 
# apple        1     0       0    0      1    0       0
# bananna      1     1       0    0      0    0       1
# carrot       0     1       1    0      0    1       0
# raddish      0     0       0    1      0    1       0
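
The long format can also be built directly from the dictionary, which skips the NaN padding and the melt step. This is only a minimal sketch, assuming the same food2 data; the names pairs, long_df and table are illustrative and not part of the answer above.

```python
import pandas as pd

food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

# One (item, category) pair per list element, across the whole dictionary
pairs = [(item, categ) for item, categs in food2.items() for categ in categs]
long_df = pd.DataFrame(pairs, columns=["item", "categ"])

# Count how often each item/category pair occurs (0 or 1 for this data)
table = pd.crosstab(long_df["item"], long_df["categ"])
print(table)
```

Since crosstab sorts both axes by label, this should give the same table as the one shown above.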

answered Nov 14 '22 22:11

Romain

Pure Python:

from itertools import chain

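def count(d):
    # every distinct attribute becomes a column (set order is arbitrary)
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    # one row per key: True where the key has that attribute
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]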

Testing:

>>> food2 = {
...     "apple": ["fruit", "round"],
...     "bananna": ["fruit", "yellow", "long"],
...     "carrot": ["veg", "orange", "long"],
...     "raddish": ["veg", "red"]
... }

>>> list(count(food2))
[['name', 'long', 'veg', 'fruit', 'yellow', 'orange', 'round', 'red'],
 ['bananna', True, False, True, True, False, False, False],
 ['carrot', True, True, False, False, True, False, False],
 ['apple', False, False, True, False, False, True, False],
 ['raddish', False, True, False, False, False, False, True]]
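
The row and column order in the output above comes from iterating a set and a dictionary, so it is not guaranteed to be the same between runs or Python versions. A small variation, sketched here under the hypothetical name count_sorted, sorts both axes and yields 0/1 instead of booleans so the result matches the table in the question more closely:

```python
from itertools import chain

def count_sorted(d):
    """Yield a header row, then one 0/1 row per key, both in sorted order."""
    cols = sorted(set(chain.from_iterable(d.values())))
    yield ["name"] + cols
    for row in sorted(d):
        present = set(d[row])  # set lookup instead of scanning the list
        yield [row] + [int(col in present) for col in cols]

food2 = {
    "apple": ["fruit", "round"],
    "bananna": ["fruit", "yellow", "long"],
    "carrot": ["veg", "orange", "long"],
    "raddish": ["veg", "red"],
}

rows = list(count_sorted(food2))
for r in rows:
    print(r)
```

The first row is the header and each following row starts with the key, so the result can be written out with csv.writer or passed on to other code as is.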

[update]

Performance test:

>>> from itertools import product
>>> labels = list("".join(_) for _ in product(*(["ABCDEF"] * 7)))
>>> attrs = labels[:1000]
>>> import random
>>> sample = {}
>>> for k in labels:
...     sample[k] = random.sample(attrs, 5)
>>> import time
>>> n = time.time(); rows = list(count(sample)); print(time.time() - n)
62.0367980003

It took a little over a minute (about 62 seconds) for 279936 rows by 1000 columns on my busy machine (lots of Chrome tabs open). Let me know if the performance is unacceptable.

[update]

Testing the performance from the other answer:

>>> n = time.time(); \
...     df = pd.DataFrame(dict([(k, pd.Series(v)) for k,v in sample.items()])); \
...     print(time.time() - n)
72.0512290001

The next line (df = pd.melt(...)) was taking too long, so I canceled the test. Take this result with a grain of salt, because it was run on a busy machine.
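
For anyone who wants to repeat the measurement, a smaller, seeded version of the benchmark is sketched below. It assumes Python 3 and uses time.perf_counter; the sample size (6^4 = 1296 rows, up to 100 attributes) and the seed are arbitrary choices, not the figures used above, so the timings will not be comparable to the ones quoted.

```python
import random
import time
from itertools import chain, product

def count(d):
    # same generator as in the answer above
    cols = set(chain(*d.values()))
    yield ['name'] + list(cols)
    for row, values in d.items():
        yield [row] + [(col in values) for col in cols]

random.seed(0)  # arbitrary seed, only to make the sample repeatable
labels = ["".join(p) for p in product("ABCDEF", repeat=4)]
attrs = labels[:100]
sample = {k: random.sample(attrs, 5) for k in labels}

start = time.perf_counter()
rows = list(count(sample))
elapsed = time.perf_counter() - start
print(f"{len(rows) - 1} rows x {len(rows[0]) - 1} columns in {elapsed:.3f} s")
```

Raising the repeat argument and the size of attrs scales the test back up toward the original dimensions.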

answered Nov 14 '22 22:11

Paulo Scardine

Tags:

python

pandas

pivot-table