Fill missing values based on condition in duplicated column

I have a pandas DataFrame with two columns, such as:

df = ID state
      255 NJ
      255 NaN
      266 CT
      266 CT
      277 NaN
      277 NY
      277 NaN

I want to fill the missing values in state using the state that the same ID has in another row.

Desired output is the following:

df = ID state
      255 NJ
      255 NJ
      266 CT
      266 CT
      277 NY
      277 NY
      277 NY

How can I do this? I have tried several approaches without success. For example, I tried building masks with numpy.where, but I get errors such as "operands could not be broadcast together with shapes (26229,) (2053,) ()", among many others. Any help is appreciated.

Okroshiashvili asked Feb 04 '26 15:02


2 Answers

Use DataFrame.sort_values with GroupBy.ffill:

df['state'] = df.sort_values('state').groupby('ID')['state'].ffill()
print(df)
    ID state
0  255    NJ
1  255    NJ
2  266    CT
3  266    CT
4  277    NY
5  277    NY
6  277    NY
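
To see why the sort matters, here is a minimal, self-contained sketch that rebuilds the sample frame from the question. It relies on sort_values placing NaN last by default (na_position='last'), so within each ID the known state comes before the gaps and ffill has a value to carry forward. The assignment then aligns on the index, so the original row order is kept.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "ID": [255, 255, 266, 266, 277, 277, 277],
    "state": ["NJ", np.nan, "CT", "CT", np.nan, "NY", np.nan],
})

# NaN sorts last, so each group sees its known state before its gaps
filled = df.sort_values("state").groupby("ID")["state"].ffill()

# Assignment aligns on the index, restoring the original row order
df["state"] = filled
print(df)
```

Without the sort, a plain groupby('ID')['state'].ffill() would leave row 4 empty in this sample, because no earlier row for ID 277 has a state.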

If you need to fill multiple columns, use:

cols = ['state', ...]
df.loc[:, cols] = df.sort_values('state').groupby('ID')[cols].ffill()
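
One caveat worth checking with several columns: sorting by state only moves the missing values of state to the end, so another column may still have a gap before its known value within a group. A possible alternative, sketched below, is to forward-fill and then back-fill within each ID, which needs no sorting. The region column and its values are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# 'region' is a made-up second column, used only for this illustration
df = pd.DataFrame({
    "ID": [255, 255, 277, 277, 277],
    "state": ["NJ", np.nan, np.nan, "NY", np.nan],
    "region": [np.nan, "Northeast", "Northeast", np.nan, np.nan],
})
cols = ["state", "region"]

# Forward-fill within each ID, then back-fill whatever is still missing
filled = df.groupby("ID")[cols].ffill()
filled = filled.groupby(df["ID"]).bfill()

df[cols] = filled
print(df)
```

This assumes each ID has at most one distinct value per column; if an ID has conflicting values, the result depends on row order.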
jezrael answered Feb 07 '26 05:02


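If I understand correctly, each ID has a single state, so you can take the first non-null state in each group and broadcast it to every row of that group: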

df['state'] = df.groupby('ID')['state'].transform('first')

Output:

    ID state
0  255    NJ
1  255    NJ
2  266    CT
3  266    CT
4  277    NY
5  277    NY
6  277    NY
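
A runnable sketch of the same idea, with a check of the one-state-per-ID assumption before filling. It relies on GroupBy 'first' returning the first non-null entry of each group, and on nunique ignoring NaN by default; the variable name states_per_id is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "ID": [255, 255, 266, 266, 277, 277, 277],
    "state": ["NJ", np.nan, "CT", "CT", np.nan, "NY", np.nan],
})

# Check the assumption: at most one distinct non-null state per ID
states_per_id = df.groupby("ID")["state"].nunique()
if (states_per_id > 1).any():
    raise ValueError("some IDs have more than one state")

# 'first' skips NaN, so ID 277 gets NY even though its first row is empty
df["state"] = df.groupby("ID")["state"].transform("first")
print(df)
```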
Quang Hoang answered Feb 07 '26 04:02