Fill missing values based on condition in duplicated column

Question

I have a pandas DataFrame with two columns, such as:

df = ID state
      255 NJ
      255 NaN
      266 CT
      266 CT
      277 NaN
      277 NY
      277 NaN

I want to fill the missing values in state using the known state for the same ID.

Desired output is the following:

df = ID state
      255 NJ
      255 NJ
      266 CT
      266 CT
      277 NY
      277 NY
      277 NY

How can I do this? I tried building masks with numpy.where, but I keep getting errors such as "operands could not be broadcast together with shapes (26229,) (2053,) ()", among others. Any help is appreciated.
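The attempt that produced the error is not shown, so the cause can only be guessed. One way an error of this form can arise is when the condition passed to numpy.where covers every row while the replacement values come from a filtered subset. A minimal sketch on the sample data, under that assumption:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [255, 255, 266, 266, 277, 277, 277],
    "state": ["NJ", np.nan, "CT", "CT", np.nan, "NY", np.nan],
})

mask = df["state"].isna().to_numpy()      # one entry per row: shape (7,)
known = df["state"].dropna().to_numpy()   # only non-missing rows: shape (4,)

error_message = None
try:
    # Assumed mistake: condition and values have different lengths
    np.where(mask, known, np.nan)
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The three shapes in the message correspond to the condition, the first value array, and the scalar fallback, which matches the pattern (26229,) (2053,) () in the question.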

jezrael · Accepted Answer

Use DataFrame.sort_values with GroupBy.ffill:

df['state'] = df.sort_values('state').groupby('ID')['state'].ffill()
print(df)
    ID state
0  255    NJ
1  255    NJ
2  266    CT
3  266    CT
4  277    NY
5  277    NY
6  277    NY
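For readers who want to run this end to end, here is a self-contained sketch that rebuilds the sample frame from the question. It relies on two pandas behaviours: sort_values places NaN last by default, so each ID's known state comes before its gaps, and the assignment aligns on the index, so the original row order is kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [255, 255, 266, 266, 277, 277, 277],
    "state": ["NJ", np.nan, "CT", "CT", np.nan, "NY", np.nan],
})

# NaN sorts last, so within each ID the known value precedes the missing ones
filled = df.sort_values("state").groupby("ID")["state"].ffill()

# 'filled' is in sorted order; assignment realigns it on the original index
df["state"] = filled
print(df)
```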

To fill several columns at once, use:

cols = ['state', ...]
df.loc[:, cols] = df.sort_values('state').groupby('ID')[cols].ffill()
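Note that sorting by 'state' only moves the missing values of that one column to the end of each group. If the other columns have gaps in different rows, a forward fill followed by a backward fill within each ID does not depend on the sort order. This is a different technique from the one in the answer, sketched here with a hypothetical second column named zone:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [255, 255, 266, 266, 277, 277, 277],
    "state": ["NJ", np.nan, "CT", "CT", np.nan, "NY", np.nan],
    # 'zone' is a made-up column used only for illustration
    "zone": [np.nan, "A", "B", np.nan, "C", np.nan, np.nan],
})

cols = ["state", "zone"]

# Forward fill within each ID, then backward fill the leading gaps
filled = df.groupby("ID")[cols].ffill()
filled = filled.groupby(df["ID"]).bfill()

df[cols] = filled
print(df)
```

The second groupby is keyed on df["ID"] because the result of the first fill no longer contains the ID column.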

Quang Hoang · Answer

If I understand correctly, each ID has a single state, so:

df['state'] = df.groupby('ID')['state'].transform('first')

Output:

    ID state
0  255    NJ
1  255    NJ
2  266    CT
3  266    CT
4  277    NY
5  277    NY
6  277    NY
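This works because the groupby 'first' aggregation returns the first non-null value in each group. Since the approach assumes one state per ID, it may be worth checking that assumption before overwriting the column. A minimal sketch on the sample data; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [255, 255, 266, 266, 277, 277, 277],
    "state": ["NJ", np.nan, "CT", "CT", np.nan, "NY", np.nan],
})

# Count distinct non-missing states per ID (nunique ignores NaN by default)
states_per_id = df.groupby("ID")["state"].nunique()
conflicting = states_per_id[states_per_id > 1]

if conflicting.empty:
    # Safe to broadcast each ID's first non-null state to all of its rows
    df["state"] = df.groupby("ID")["state"].transform("first")
else:
    print("IDs with more than one state:", conflicting.index.tolist())

print(df)
```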

Tags: python, replace, pandas, missing-data

Okroshiashvili
