Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Reshaping Pandas Dataframe with Grouped Data (Long to Wide)

Let's say I have data in the following format:

group_id | entity_id | value
    A          a1        5
    A          a2        3
    A          a3        2
    B          b1        10
    B          b2        8
    B          b3        11
    C          c1        2
    C          c2        6
    C          c3        NaN

Table 1.

So each group (A/B/C) will have 3 entities, guaranteed. And each entity has a corresponding value (sometimes NaN if non-existent).

I want to make reshape this data from the existing format to...:

group_id | entity_1 | entity_2 | entity_3
   A          5          3          2
   B          10         8          11
   C          2          6          NaN

Table 2.

Where entity_1/entity_2/entity_3 correspond to a1/a2/a3 (or b1/b2/b3, c1/c2/c3) respectively.

How do I do this?

One solution I found was to use the pivot function so...

df.pivot(index='group_id', columns='entity_id', values='value')

But as I understand it, the problem with this is that the columns for the entities in the resulting reshaped pivot table will not be in the format I wanted above in Table 2 -- this is important for some downstream stuff I'm doing with the data.

I might be asking a stupid question but I had trouble finding ways to use the existing pivot/melt functions to go from long to wide in the way that I described above. Can anyone help me out?

I'm happy to provide more details if necessary, just let me know!

like image 995
shishy Avatar asked Dec 20 '16 06:12

shishy


1 Answers

You can use pivot and new columns are last value of column entity_id extracted by indexing with str:

df = pd.pivot(index=df.group_id, columns=df.entity_id.str[-1], values=df.value)
       .add_prefix('entity_')
       .rename_axis(None, axis=1)
       .reset_index()
print (df)
  group_id  entity_1  entity_2  entity_3
0        A       5.0       3.0       2.0
1        B      10.0       8.0      11.0
2        C       2.0       6.0       NaN

Solution with cumcount:

df = pd.pivot(index=df.group_id,
              columns=df.groupby('group_id').cumcount() + 1, 
              values=df.value)
       .add_prefix('entity_')
       .reset_index()
print (df)
  group_id  entity_1  entity_2  entity_3
0        A       5.0       3.0       2.0
1        B      10.0       8.0      11.0
2        C       2.0       6.0       NaN

Another solution with groupby and apply, last reshape by unstack:

df = df.groupby("group_id")["value"]
       .apply(lambda x: pd.Series(x.values))
       .unstack()
       .add_prefix('entity_')
       .reset_index()
print (df)
  group_id  entity_0  entity_1  entity_2
0        A       5.0       3.0       2.0
1        B      10.0       8.0      11.0
2        C       2.0       6.0       NaN

If need count from 1:

df = df.groupby("group_id")["value"].apply(lambda x: pd.Series(x.values))
       .unstack()
       .rename(columns = lambda x: x+1)
       .add_prefix('entity_')
       .reset_index()
print (df)
  group_id  entity_1  entity_2  entity_3
0        A       5.0       3.0       2.0
1        B      10.0       8.0      11.0
2        C       2.0       6.0       NaN
like image 149
jezrael Avatar answered Sep 23 '22 06:09

jezrael