I'm trying to find the month (column 'Month') that has the largest number (in DepDelay column)
Data
flightID Month ArrTime ActualElapsedTime DepDelay ArrDelay
BBYYEUVY67527 1 1514.0 58.0 NA 64.0
MUPXAQFN40227 1 37.0 120.0 13 52.0
LQLYUIMN79169 1 916.0 166.0 NA -25.0
KTAMHIFO10843 1 NaN NaN 5 NaN
BOOXJTEY23623 1 NaN NaN 4 NaN
BBYYEUVY67527 2 1514.0 58.0 NA 64.0
MUPXAQFN40227 2 37.0 120.0 NA 52.0
LQLYUIMN79169 2 916.0 166.0 NA -25.0
KTAMHIFO10843 2 NaN NaN 15 NaN
BOOXJTEY23623 2 NaN NaN 4 NaN
I tried:
data = pd.read_csv('data.csv', sep='\t')
dep_delay = all_data.groupby(["Month"].DepDelay.count().max())
print(dep_delay)
Error:
AttributeError Traceback (most recent call last)
<ipython-input-14-2ea6213009d6> in <module>()
----> 1 dep_delay = all_data.groupby(["Month"].DepDelay.count().max())
2
3 print(dep_delay)
AttributeError: 'list' object has no attribute 'DepDelay'
Good output:
Month DepDelay
1 22
You need sum instead of count to sum values by group. Here's one way using GroupBy + sum, and then idxmax:
res = df.groupby('Month')['DepDelay'].sum().reset_index()
res = res.loc[[res['DepDelay'].idxmax()]]
print(res)
Month DepDelay
0 1 22.0
Alternatively, you can group and sort, then extract the first row:
res = df.groupby('Month')['DepDelay'].sum()\
.sort_values(ascending=False).head(1)\
.reset_index()
print(res)
Month DepDelay
0 1 22.0
Another approach:
pd.DataFrame(df.loc[df['DepDelay'].idxmax(), ['Month', 'DepDelay']]).T
# Month DepDelay
#8 2 15
And you can reset index to change 8 to 0:
pd.DataFrame(df.loc[df['DepDelay'].idxmax(), ['Month', 'DepDelay']]).T.reset_index(drop=True)
# Month DepDelay
#0 2 15
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With