I have a dict
like this:
data = {'1':{'a':10, 'b':30}, '2':{'a':20, 'b':60}}
and I would like to convert in into a dataframe like this:
x y z
1 a 10
1 b 30
2 a 20
2 b 60
Does anybody know how?
Use dictionary comprehension
with concat
:
df = pd.concat({k: pd.Series(v) for k, v in data.items()}).reset_index()
df.columns = list('xyz')
print (df)
x y z
0 1 a 10
1 1 b 30
2 2 a 20
3 2 b 60
For better performance use list compehension
with sorting:
L = sorted([(k,k1,v1) for k,v in data.items() for k1,v1 in v.items()],
key=lambda x: (x[0], x[1]))
print (L)
[('1', 'a', 10), ('1', 'b', 30), ('2', 'a', 20), ('2', 'b', 60)]
df = pd.DataFrame(L, columns=list('xyz'))
print (df)
x y z
0 1 a 10
1 1 b 30
2 2 a 20
3 2 b 60
Timings:
In [34]: %timeit jez1(data)
16.8 ms ± 403 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [35]: %timeit jez(data)
1.96 s ± 90.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [37]: %timeit jp(data)
43 ms ± 353 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Same code as @jp:
data = {str(k): {'a': 10, 'b': 30} for k in range(10000)}
def jp(data):
return pd.melt(pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'x'}),
id_vars=['x'], value_vars=['a', 'b'], var_name='y', value_name='z')\
.sort_values(['x', 'y']).reset_index(drop=True)
def jez(data):
df = pd.concat({k: pd.Series(v) for k, v in data.items()}).reset_index()
df.columns = list('xyz')
return df
def jez1(data):
L = sorted([(k,k1,v1) for k,v in data.items() for k1,v1 in v.items()], key=lambda x: (x[0], x[1]))
df = pd.DataFrame(L, columns=list('xyz'))
return df
assert (jez1(data).values == jez(data).values).all()
Here is one way using pandas.melt
.
d = {'1':{'a':10, 'b':30}, '2':{'a':20, 'b':60}}
res = pd.melt(pd.DataFrame.from_dict(d, orient='index'),
value_vars=['a', 'b'], var_name='y', value_name='z')
print(res)
# y z
# 0 a 10
# 1 a 20
# 2 b 30
# 3 b 60
Performance benchmarking
I expected pandas.melt
to be inefficient, but applying pandas.concat
on a large number of dictionaries can be even more expensive.
data = {str(k): {'a': 10, 'b': 30} for k in range(10000)}
def jp(data):
return pd.melt(pd.DataFrame.from_dict(data, orient='index').reset_index().rename(columns={'index': 'x'}),
id_vars=['x'], value_vars=['a', 'b'], var_name='y', value_name='z')\
.sort_values(['x', 'y']).reset_index(drop=True)
def jez(data):
df = pd.concat({k: pd.Series(v) for k, v in data.items()}).reset_index()
df.columns = list('xyz')
return df
assert (jp(data).values == jez(data).values).all()
%timeit jp(data) # 51.8 ms per loop
%timeit jez(data) # 2.62 s per loop
Using Series
pd.Series(d).apply(pd.Series).stack().reset_index()
Out[359]:
level_0 level_1 0
0 1 a 10
1 1 b 30
2 2 a 20
3 2 b 60
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With