(Or a list of lists... I just edited)
Is there an existing python/pandas method for converting a structure like this
food2 = {}
food2["apple"] = ["fruit", "round"]
food2["bananna"] = ["fruit", "yellow", "long"]
food2["carrot"] = ["veg", "orange", "long"]
food2["raddish"] = ["veg", "red"]
into a pivot table like this?
+---------+-------+-----+-------+------+--------+--------+-----+
| | fruit | veg | round | long | yellow | orange | red |
+---------+-------+-----+-------+------+--------+--------+-----+
| apple | 1 | | 1 | | | | |
+---------+-------+-----+-------+------+--------+--------+-----+
| bananna | 1 | | | 1 | 1 | | |
+---------+-------+-----+-------+------+--------+--------+-----+
| carrot | | 1 | | 1 | | 1 | |
+---------+-------+-----+-------+------+--------+--------+-----+
| raddish | | 1 | | | | | 1 |
+---------+-------+-----+-------+------+--------+--------+-----+
Naively, I would probably just loop through the dictionary. I see how I can use a map on each inner list, but I don't know how to join/stack them over the dictionary. Once I did join them, I could just use pandas.pivot_table
for key in food2:
attrlist = food2[key]
onefruit_pairs = map(lambda x: [key, x], attrlist)
one_fruit_frame = pd.DataFrame(onefruit_pairs, columns=['fruit', 'attr'])
print(one_fruit_frame)
fruit attr
0 bananna fruit
1 bananna yellow
2 bananna long
fruit attr
0 carrot veg
1 carrot orange
2 carrot long
fruit attr
0 apple fruit
1 apple round
fruit attr
0 raddish veg
1 raddish red
You can convert a dictionary to Pandas Dataframe using df = pd. DataFrame. from_dict(my_dict) statement.
We can create a pandas DataFrame object by using the python list of dictionaries. If we use a dictionary as data to the DataFrame function then we no need to specify the column names explicitly.
Example 1: As we know while creating a data frame from the dictionary, the keys will be the columns in the resulted Dataframe. When we create Dataframe from a list of dictionaries, matching keys will be the columns and corresponding values will be the rows of the Dataframe.
Overview: A pandas DataFrame can be converted into a Python dictionary using the DataFrame instance method to_dict(). The output can be specified of various orientations using the parameter orient.
An answer using pandas.
# Test data
food2 = {}
food2["apple"] = ["fruit", "round"]
food2["bananna"] = ["fruit", "yellow", "long"]
food2["carrot"] = ["veg", "orange", "long"]
food2["raddish"] = ["veg", "red"]
df = DataFrame(dict([ (k,Series(v)) for k,v in food2.items() ]))
# pivoting to long format
df = pd.melt(df, var_name='item', value_name='categ')
# cross-tabulation
df = pd.crosstab(df['item'], df['categ'])
# sorting index, maybe not necessary
df.sort_index(inplace=True)
df
# categ fruit long orange red round veg yellow
# item
# apple 1 0 0 0 1 0 0
# bananna 1 1 0 0 0 0 1
# carrot 0 1 1 0 0 1 0
# raddish 0 0 0 1 0 1 0
Pure python:
from itertools import chain
def count(d):
cols = set(chain(*d.values()))
yield ['name'] + list(cols)
for row, values in d.items():
yield [row] + [(col in values) for col in cols]
Testing:
>>> food2 = {
"apple": ["fruit", "round"],
"bananna": ["fruit", "yellow", "long"],
"carrot": ["veg", "orange", "long"],
"raddish": ["veg", "red"]
}
>>> list(count(food2))
[['name', 'long', 'veg', 'fruit', 'yellow', 'orange', 'round', 'red'],
['bananna', True, False, True, True, False, False, False],
['carrot', True, True, False, False, True, False, False],
['apple', False, False, True, False, False, True, False],
['raddish', False, True, False, False, False, False, True]]
[update]
Performance test:
>>> from itertools import product
>>> labels = list("".join(_) for _ in product(*(["ABCDEF"] * 7)))
>>> attrs = labels[:1000]
>>> import random
>>> sample = {}
>>> for k in labels:
... sample[k] = random.sample(attrs, 5)
>>> import time
>>> n = time.time(); list(count(sample)); print time.time() - n
62.0367980003
It took less than 2 minutes, for 279936 rows by 1000 columns on my busy machine (lots of chrome tabs open). Let me know if the performance is unacceptable.
[update]
Testing the performance from the other answer:
>>> n = time.time(); \
... df = pd.DataFrame(dict([(k, pd.Series(v)) for k,v in sample.items()])); \
... print time.time() - n
72.0512290001
The next line (df = pd.melt(...)
) was taking too long so I canceled the test. Take this result with a grain of salt because it was running on a busy machine.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With