I'm trying to select rows out of groups by max value using df.loc[df.groupby(keys)['column'].idxmax()].
I'm finding, however, that df.groupby(keys)['column'].idxmax() takes a really long time on my dataset of about 27M rows. Interestingly, running df.groupby(keys)['column'].max() on my dataset takes only 13 seconds, while running df.groupby(keys)['column'].idxmax() takes 55 minutes. I don't understand why returning the indexes of the rows takes about 250 times longer than returning a value from the rows. Maybe there is something I can do to speed up idxmax? If not, is there an alternative way of selecting rows out of groups by max value that might be faster than using idxmax?
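To make the pattern concrete, here it is on a tiny made-up frame (toy column names, not my real data), together with a transform-based mask I've seen suggested as an alternative; both are sketches, not anything from my actual notebook:

import pandas as pd

# Tiny made-up frame to illustrate the pattern (not the real data)
toy = pd.DataFrame({
    'key':   ['a', 'a', 'b', 'b', 'b'],
    'value': [6.0, 18.0, 940.0, 1000.0, 50.0],
})

# The idxmax pattern: pick one row per group, chosen by max value
by_idxmax = toy.loc[toy.groupby('key')['value'].idxmax()]

# A transform-based mask as a possible alternative; note it keeps
# *all* rows that tie for the group max, not exactly one per group
by_mask = toy[toy['value'] == toy.groupby('key')['value'].transform('max')]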
For additional info, I'm using two keys and sorted the dataframe on those keys prior to the groupby and idxmax operations. Here's what it looks like in Jupyter Notebook:
import pandas as pd
df = pd.read_csv('/data/Broadband Data/fbd_us_without_satellite_jun2019_v1.csv',
                 encoding='ANSI',
                 usecols=['BlockCode', 'HocoNum', 'HocoFinal', 'TechCode',
                          'Consumer', 'MaxAdDown', 'MaxAdUp'])
%%time
df = df[df.Consumer == 1]
df.sort_values(['BlockCode', 'HocoNum'], inplace=True)
print(df)
HocoNum HocoFinal BlockCode TechCode
4631064 130077 AT&T Inc. 10010201001000 10
4679561 130077 AT&T Inc. 10010201001000 11
28163032 130235 Charter Communications 10010201001000 43
11134756 131480 WideOpenWest Finance, LLC 10010201001000 42
11174634 131480 WideOpenWest Finance, LLC 10010201001000 50
... ... ... ... ...
15389917 190062 Broadband VI, LLC 780309900000014 70
10930322 130081 ATN International, Inc. 780309900000015 70
15389918 190062 Broadband VI, LLC 780309900000015 70
10930323 130081 ATN International, Inc. 780309900000016 70
15389919 190062 Broadband VI, LLC 780309900000016 70
Consumer MaxAdDown MaxAdUp
4631064 1 6.0 0.512
4679561 1 18.0 0.768
28163032 1 940.0 35.000
11134756 1 1000.0 50.000
11174634 1 1000.0 50.000
... ... ... ...
15389917 1 25.0 5.000
10930322 1 25.0 5.000
15389918 1 25.0 5.000
10930323 1 25.0 5.000
15389919 1 25.0 5.000
[26991941 rows x 7 columns]
Wall time: 21.6 s
%time df.groupby(['BlockCode', 'HocoNum'])['MaxAdDown'].max()
Wall time: 13 s
BlockCode HocoNum
10010201001000 130077 18.0
130235 940.0
131480 1000.0
10010201001001 130235 940.0
10010201001002 130077 6.0
...
780309900000014 190062 25.0
780309900000015 130081 25.0
190062 25.0
780309900000016 130081 25.0
190062 25.0
Name: MaxAdDown, Length: 20613795, dtype: float64
%time df.groupby(['BlockCode', 'HocoNum'])['MaxAdDown'].idxmax()
Wall time: 55min 24s
BlockCode HocoNum
10010201001000 130077 4679561
130235 28163032
131480 11134756
10010201001001 130235 28163033
10010201001002 130077 4637222
...
780309900000014 190062 15389917
780309900000015 130081 10930322
190062 15389918
780309900000016 130081 10930323
190062 15389919
Name: MaxAdDown, Length: 20613795, dtype: int64
You'll see in the very first rows of data there are two entries for AT&T in the same BlockCode, one with a MaxAdDown of 6Mbps and one with 18Mbps. I want to keep the 18Mbps row and drop the 6Mbps row, so that there is one row per company per BlockCode with the maximum MaxAdDown value. I need the entire row, not just the MaxAdDown value.
Sort and drop duplicates:
df.sort_values('MaxAdDown').drop_duplicates(['BlockCode', 'HocoNum'], keep='last')
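A quick sanity check on a made-up frame mirroring the question's columns (the values are illustrative only):

import pandas as pd

# Two AT&T-style rows in one block: the 6.0 row should be dropped
df = pd.DataFrame({
    'BlockCode': [10010201001000, 10010201001000, 10010201001000],
    'HocoNum':   [130077, 130077, 130235],
    'MaxAdDown': [6.0, 18.0, 940.0],
})

# Sorting ascending then keeping the last duplicate retains the
# maximum MaxAdDown row for each (BlockCode, HocoNum) pair
result = df.sort_values('MaxAdDown').drop_duplicates(
    ['BlockCode', 'HocoNum'], keep='last')
print(result)

Because this sorts once and deduplicates instead of going through the groupby-idxmax path, it should be far faster than the 55-minute idxmax run, and it returns the whole row rather than just an index.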