I've got a fun one! And I've tried to find a duplicate question but was unsuccessful...
My dataframe covers all U.S. states and territories for the years 2013-2016, with several attributes.
>>> df.head(2)
     state  enrollees  utilizing  enrol_age65  util_age65  year
1  Alabama     637247     635431       473376      474334  2013
2   Alaska      30486      28514        21721       20457  2013
>>> df.tail(2)
                state  enrollees  utilizing  enrol_age65  util_age65  year
214       Puerto Rico     581861     579514       453181      450150  2016
215  U.S. Territories      24329      16979        22608       15921  2016
I want to groupby year and state, and show the top 3 states (by 'enrollees' or 'utilizing' - does not matter) for each year.
Desired Output:
                 enrollees  utilizing
year state
2013 California    3933310    3823455
     New York      3133980    3002948
     Florida       2984799    2847574
...
2016 California    4516216    4365896
     Florida       4186823    3984756
     New York      4009829    3874682
So far I've tried the following:
df.groupby(['year','state'])[['enrollees','utilizing']].sum().head(3)
Which yields just the first 3 rows in the GroupBy object:
              enrollees  utilizing
year state
2013 Alabama     637247     635431
     Alaska       30486      28514
     Arizona     707683     683273
I've also tried a lambda function:
df.groupby(['year','state'])[['enrollees','utilizing']]\
    .apply(lambda x: np.sum(x)).nlargest(3, 'enrollees')
Which yields the absolute largest 3 in the GroupBy object:
                 enrollees  utilizing
year state
2016 California    4516216    4365896
2015 California    4324304    4191704
2014 California    4133532    4011208
I think it may have to do with the indexing of the GroupBy object, but I am not sure...Any guidance would be appreciated!
Well, you could do something that's not very pretty.
First, get a list of the unique years using set():
years_list = sorted(set(df.year))  # sorted, because set order is arbitrary
Then define a helper function for concatenating DataFrames (one I've made in the past):
def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    Avoid retyping the same line of code for every df.
    The parameters are the temporary df created at each loop iteration and the
    concatenated df that will hold all values, which must first be initialized
    (outside the loop) as df_name = pd.DataFrame().
    """
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)
    return df_full
Create the empty final DataFrame:
df_final = pd.DataFrame()
Now loop over each year, concatenating the results into the new DataFrame:
for year in years_list:
    # query() filters rows with a boolean expression;
    # @year refers to the external Python variable, here the loop variable.
    # This gives a temporary df holding only that year, which is then summed, sorted, and cut to the top 3.
    df2 = df.query("year == @year")
    df_temp = (df2.groupby(['year','state'])[['enrollees','utilizing']].sum()
                  .sort_values(by="enrollees", ascending=False).head(3))
    # finally, call our function, which keeps concatenating the temporary dfs
    df_final = concatenate_loop_dfs(df_temp, df_final)
and done.
print(df_final)
You then need to sort the summed GroupBy result: .sort_values('enrollees', ascending=False)
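To make the sorting idea concrete, here is a minimal sketch of one way to do it without a loop: sum per year and state, sort by year (ascending) and 'enrollees' (descending), then take the first 3 rows of each year. The small df below is only an assumption for illustration, built from rows quoted in the question. The names totals and top3 are made up here, and this is one possible approach rather than the only one.

```python
import pandas as pd

# Illustrative subset only: these rows are copied from the question's output.
df = pd.DataFrame({
    "state": ["Alabama", "Alaska", "California", "New York", "Florida",
              "Puerto Rico", "U.S. Territories", "California", "Florida", "New York"],
    "enrollees": [637247, 30486, 3933310, 3133980, 2984799,
                  581861, 24329, 4516216, 4186823, 4009829],
    "utilizing": [635431, 28514, 3823455, 3002948, 2847574,
                  579514, 16979, 4365896, 3984756, 3874682],
    "year": [2013] * 5 + [2016] * 5,
})

# Sum per (year, state); note the double brackets for selecting several columns.
totals = df.groupby(["year", "state"])[["enrollees", "utilizing"]].sum()

# Sort by the 'year' index level, then by enrollees descending within each year,
# and keep the first 3 rows of every year group.
top3 = (
    totals.sort_values(["year", "enrollees"], ascending=[True, False])
          .groupby(level="year")
          .head(3)
)
print(top3)
```

The key difference from calling .head(3) or .nlargest(3, ...) directly on the summed result is the second groupby(level="year"), which makes head(3) apply per year instead of to the whole table.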