I have a pandas DataFrame with two columns, such as:
df = ID state
255 NJ
255 NaN
266 CT
266 CT
277 NaN
277 NY
277 NaN
I want to fill the missing values in state so that every row of an ID gets that ID's known state.
The desired output is the following:
df = ID state
255 NJ
255 NJ
266 CT
266 CT
277 NY
277 NY
277 NY
How can I do this? I tried numpy.where with masks, but I get errors such as "operands could not be broadcast together with shapes (26229,) (2053,) ()", among many others. Any help is appreciated.
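Use DataFrame.sort_values with GroupBy.ffill. Sorting by state moves the NaN rows to the end, so within each ID group a known value comes before the gaps and the forward fill can reach them; the assignment then aligns back on the original index: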
df['state'] = df.sort_values('state').groupby('ID')['state'].ffill()
print(df)
ID state
0 255 NJ
1 255 NJ
2 266 CT
3 266 CT
4 277 NY
5 277 NY
6 277 NY
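For anyone who wants to run this end to end, here is a minimal self-contained sketch. The way df is constructed is my assumption about what the question's data looks like; the fill itself is the same line as above, split in two so the intermediate result is visible.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (construction is an assumption)
df = pd.DataFrame({
    "ID": [255, 255, 266, 266, 277, 277, 277],
    "state": ["NJ", np.nan, "CT", "CT", np.nan, "NY", np.nan],
})

# sort_values puts NaN last by default (na_position="last"), so inside each
# ID group the known state comes before the gaps and ffill can propagate it.
filled = df.sort_values("state").groupby("ID")["state"].ffill()

# Assignment aligns on the index, so the original row order is preserved.
df["state"] = filled
print(df)
```

Without the sort, a plain groupby('ID')['state'].ffill() would leave row 4 empty, because no known value precedes it inside group 277.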
If you need to fill multiple columns, use:
cols = ['state', ...]
df.loc[:, cols] = df.sort_values('state').groupby('ID')[cols].ffill()
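One caveat with the multi-column version: sorting by 'state' only moves the NaN values of that one column to the end, so gaps in the other columns can still sit before their known value and be missed by the forward fill. A possible alternative, sketched here with a hypothetical second column named city, is to fill forward and then backward inside each group, which does not depend on row order. This is a swapped-in technique, not the line above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [255, 255, 277, 277, 277],
    "state": ["NJ", np.nan, np.nan, "NY", np.nan],
    "city": [np.nan, "Newark", "Albany", np.nan, np.nan],  # hypothetical column
})
cols = ["state", "city"]

# Within each ID group, fill forward and then backward; ffill/bfill work
# column by column, so each column is completed from its own known values.
df[cols] = df.groupby("ID")[cols].transform(lambda g: g.ffill().bfill())
print(df)
```

This assumes each ID has at most one distinct known value per column; if a group holds several, every gap takes the nearest earlier value, or the first later one when nothing precedes it.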
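If I understand correctly, each ID has a single state, so you can broadcast the first non-missing value of each group to all of its rows ('first' skips NaN):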
df['state'] = df.groupby('ID')['state'].transform('first')
Output:
ID state
0 255 NJ
1 255 NJ
2 266 CT
3 266 CT
4 277 NY
5 277 NY
6 277 NY
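Since this relies on every ID having only one state, it may be worth checking that before overwriting the column. A minimal sketch, assuming the sample data from the question; the names states_per_id and conflicting are mine:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [255, 255, 266, 266, 277, 277, 277],
    "state": ["NJ", np.nan, "CT", "CT", np.nan, "NY", np.nan],
})

# nunique ignores NaN by default, so this counts distinct known states per ID
states_per_id = df.groupby("ID")["state"].nunique()
conflicting = states_per_id[states_per_id > 1]

if conflicting.empty:
    # Safe to broadcast: 'first' returns the first non-null value per group
    df["state"] = df.groupby("ID")["state"].transform("first")
else:
    raise ValueError(f"IDs with more than one state: {list(conflicting.index)}")
print(df)
```

An ID whose rows are all NaN has nothing to broadcast, so its rows stay missing.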