Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fillna with most frequent if most frequent occurs else fillna with most frequent value of the entire column

I have a panda dataframe

        City    State
0    Cambridge    MA
1          NaN    DC
2       Boston    MA
3   Washignton    DC
4          NaN    MA
5        Tampa    FL
6      Danvers    MA
7        Miami    FL
8    Cambridge    MA
9        Miami    FL
10         NaN    FL
11  Washington    DC

I want to fill NaNs based on most frequent state if the state appears before so I group by state and apply the following code:

df['City'] = df.groupby('State').transform(lambda x:x.fillna(x.value_counts().idxmax()))

The above code works for if all states have occurred before the output will be

          City State
0    Cambridge    MA
1   Washignton    DC
2       Boston    MA
3   Washignton    DC
4    Cambridge    MA
5        Tampa    FL
6      Danvers    MA
7        Miami    FL
8    Cambridge    MA
9        Miami    FL
10       Miami    FL
11  Washington    DC

However I want to add a condtion so that if a state never occur its city will be the most frequent in the entire City column ie if the dataframe is

          City State
0    Cambridge    MA
1          NaN    DC
2       Boston    MA
3   Washignton    DC
4          NaN    MA
5        Tampa    FL
6      Danvers    MA
7        Miami    FL
8    Cambridge    MA
9        Miami    FL
10         NaN    FL
11  Washington    DC
12         NaN    NY 

NY has never occurred before I want output to be

          City State
0    Cambridge    MA
1   Washignton    DC
2       Boston    MA
3   Washignton    DC
4    Cambridge    MA
5        Tampa    FL
6      Danvers    MA
7        Miami    FL
8    Cambridge    MA
9        Miami    FL
10       Miami    FL
11  Washington    DC
12   Cambridge    NY 

The code above gives a ValueError: ('attempt to get argmax of an empty sequence') because "NY" has never occurred before.

like image 578
Aya Abdelsalam Avatar asked Oct 17 '22 07:10

Aya Abdelsalam


1 Answers

IIUC:

def f(x):
    if x.count()<=0:
        return np.nan
    return x.value_counts().index[0]

df['City'] = df.groupby('State')['City'].transform(f)

df['City'] = df['City'].fillna(df['City'].value_counts().idxmax())

Output:

          City State
0    Cambridge    MA
1   Washignton    DC
2    Cambridge    MA
3   Washignton    DC
4    Cambridge    MA
5        Miami    FL
6    Cambridge    MA
7        Miami    FL
8    Cambridge    MA
9        Miami    FL
10       Miami    FL
11  Washignton    DC
12   Cambridge    NY
like image 185
Scott Boston Avatar answered Oct 20 '22 11:10

Scott Boston