The problem I'm having doesn't seem too complex, but I'm struggling to find a good way to go about it. All I need to do is create a new dataframe from the data below that sums up the 'Minutes' column for each ID appearing in 'P1'-'P5'. The only method I've come up with would be to create 5 different df's (1 for each 'P' column), concatenate them, and then use groupby to get the minute totals, but that seems wildly inefficient.
Any ideas on a better way to approach this would be greatly appreciated.
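For reference, the approach I describe would look something like this (a sketch with a couple of made-up rows, since the full data is below):

```python
import pandas as pd

# Two illustrative rows; column names follow the real data below.
df = pd.DataFrame({
    'P1': [1627759, 1626179],
    'P2': [1628369, 1628369],
    'P3': [201143, 202330],
    'P4': [202330, 203382],
    'P5': [202681, 203935],
    'Minutes': [7, 5],
})

# One small frame per 'P' column, each renamed to a common 'P' label ...
parts = [df[[col, 'Minutes']].rename(columns={col: 'P'})
         for col in ['P1', 'P2', 'P3', 'P4', 'P5']]

# ... then concatenated and grouped to get the per-ID minute totals.
totals = pd.concat(parts).groupby('P')['Minutes'].sum()
print(totals)
```

It works, but it builds five intermediate frames just to reshape the data, which is why I'm hoping there's a cleaner way.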
Original:
P1 P2 P3 P4 P5 Minutes
1627759 1628369 201143 202330 202681 7
1626179 1628369 202330 203382 203935 5
1627759 1628369 201143 202681 203382 5
1627759 201143 202681 202694 203935 5
1626179 1628369 201143 202330 203935 4
1626179 1627759 202681 202694 203382 4
1626179 1628369 202694 203382 203935 3
1626179 1627759 201143 202681 202694 2
1626179 1628464 202330 202694 203935 2
1627759 201143 202330 202681 202694 2
1628369 201143 202330 202681 203935 2
1626179 1627759 1628369 201143 202681 1
1626179 202330 202694 203382 203935 1
1626179 1627759 201143 202694 203935 1
1626179 1627824 1628400 1628464 202954 1
1626179 1628369 1628464 202330 203935 1
1626179 1627824 1628400 1628464 203935 1
1627759 202330 202681 202694 203382 0
Expected outcome:
P Minutes
1627759 27
1626179 26
1628369 28
201143 29
1628464 5
202330 24
1627824 2
202681 28
202694 20
1628400 2
203382 18
203935 25
202954 1
I am not sure whether melting the dataframe is the most efficient option, but at least it avoids explicit loops and intermediate frames:
pd.melt(df, id_vars='Minutes', value_vars=df.columns[:-1])\
.groupby('value')['Minutes'].sum()
This "stretches" the first five columns into a single tall 'value' column and groups on it. Note that `df.columns[:-1]` assumes 'Minutes' is the last column; otherwise pass the 'P' column names explicitly. According to %timeit, this solution is twice as fast as yours (for the given dataframe).
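Here is the same idea as a self-contained snippet you can run (using the first three rows of your data as sample input):

```python
import pandas as pd

# First three rows of the dataframe from the question.
df = pd.DataFrame({
    'P1': [1627759, 1626179, 1627759],
    'P2': [1628369, 1628369, 1628369],
    'P3': [201143, 202330, 201143],
    'P4': [202330, 203382, 202681],
    'P5': [202681, 203935, 203382],
    'Minutes': [7, 5, 5],
})

# Melt the five ID columns into one tall 'value' column,
# carrying 'Minutes' along, then sum the minutes per ID.
result = (
    pd.melt(df, id_vars='Minutes', value_vars=df.columns[:-1])
      .groupby('value')['Minutes'].sum()
)
print(result)
```

The melted frame has one row per (ID, Minutes) pair, so the groupby sum gives exactly the per-ID totals you expect.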