Let's say I have data in the following format:
group_id | entity_id | value
A a1 5
A a2 3
A a3 2
B b1 10
B b2 8
B b3 11
C c1 2
C c2 6
C c3 NaN
Table 1.
So each group (A/B/C) will have 3 entities, guaranteed. And each entity has a corresponding value (sometimes NaN if non-existent).
I want to make reshape this data from the existing format to...:
group_id | entity_1 | entity_2 | entity_3
A 5 3 2
B 10 8 11
C 2 6 NaN
Table 2.
Where entity_1/entity_2/entity_3 correspond to a1/a2/a3 (or b1/b2/b3, c1/c2/c3) respectively.
How do I do this?
One solution I found was to use the pivot function so...
df.pivot(index='group_id', columns='entity_id', values='value')
But as I understand it, the problem with this is that the columns for the entities in the resulting reshaped pivot table will not be in the format I wanted above in Table 2 -- this is important for some downstream stuff I'm doing with the data.
I might be asking a stupid question but I had trouble finding ways to use the existing pivot/melt functions to go from long to wide in the way that I described above. Can anyone help me out?
I'm happy to provide more details if necessary, just let me know!
You can use pivot
and new columns are last value of column entity_id
extracted by indexing with str:
df = pd.pivot(index=df.group_id, columns=df.entity_id.str[-1], values=df.value)
.add_prefix('entity_')
.rename_axis(None, axis=1)
.reset_index()
print (df)
group_id entity_1 entity_2 entity_3
0 A 5.0 3.0 2.0
1 B 10.0 8.0 11.0
2 C 2.0 6.0 NaN
Solution with cumcount
:
df = pd.pivot(index=df.group_id,
columns=df.groupby('group_id').cumcount() + 1,
values=df.value)
.add_prefix('entity_')
.reset_index()
print (df)
group_id entity_1 entity_2 entity_3
0 A 5.0 3.0 2.0
1 B 10.0 8.0 11.0
2 C 2.0 6.0 NaN
Another solution with groupby
and apply
, last reshape by unstack
:
df = df.groupby("group_id")["value"]
.apply(lambda x: pd.Series(x.values))
.unstack()
.add_prefix('entity_')
.reset_index()
print (df)
group_id entity_0 entity_1 entity_2
0 A 5.0 3.0 2.0
1 B 10.0 8.0 11.0
2 C 2.0 6.0 NaN
If need count from 1
:
df = df.groupby("group_id")["value"].apply(lambda x: pd.Series(x.values))
.unstack()
.rename(columns = lambda x: x+1)
.add_prefix('entity_')
.reset_index()
print (df)
group_id entity_1 entity_2 entity_3
0 A 5.0 3.0 2.0
1 B 10.0 8.0 11.0
2 C 2.0 6.0 NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With