
Replace value with the value of nearest neighbor in Pandas dataframe

I'm having trouble finding the nearest values for some rows in a pandas DataFrame and filling another column with values from those rows.

Sample data:

id   su_id  r_value  match_v

A      A1      0        1
A      A2      0        1
A      A3      70       2
A      A4      120      100
A      A5      250      3
A      A6      250      100
B      B1      0        1
B      B2      30       2
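
To make the sample reproducible, it could be built as a DataFrame roughly like this. The variable name df matches what the answers use; the integer dtypes are an assumption based on the printed values.

```python
import pandas as pd

# Sample data from the question; 100 in match_v marks a value to be replaced.
df = pd.DataFrame({
    "id":      ["A", "A", "A", "A", "A", "A", "B", "B"],
    "su_id":   ["A1", "A2", "A3", "A4", "A5", "A6", "B1", "B2"],
    "r_value": [0, 0, 70, 120, 250, 250, 0, 30],
    "match_v": [1, 1, 2, 100, 3, 100, 1, 2],
})

# Rows whose match_v needs a replacement
to_fix = df[df["match_v"] == 100]
print(to_fix)
```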

Wherever match_v equals 100, I need to replace that 100 with the match_v from the row whose r_value is closest to the r_value of the original row (the one where match_v equals 100), but only within the same group (grouped by id).

Expected output

id   su_id  r_value  match_v

A      A1      0        1
A      A2      0        1
A      A3      70       2
A      A4      120      2
A      A5      250      3
A      A6      250      3
B      B1      0        1
B      B2      30       2

I tried creating lead and lag columns with shift and then computing the differences, but that doesn't work well, and it somehow messed up values that were already correct. I haven't tried anything else because I don't have any other ideas.

Any help or hint is welcome, and if you need any additional info, I'm here.

Thanks in advance.

asked May 30 '19 12:05 by jovicbg


2 Answers

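This looks like a job for pd.merge_asof:

import pandas as pd
s=df.loc[df.match_v!=100]  # candidate rows that already have a real match_v
# merge_asof needs both frames sorted by the `on` key; by='id' keeps matches within the same group
s=pd.merge_asof(df.sort_values('r_value'),s.sort_values('r_value'),on='r_value',by='id',direction='nearest')
# look up each su_id's nearest-candidate match_v
df['match_v']=df['su_id'].map(s.set_index('su_id_x')['match_v_y'])
df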
Out[231]: 
  id su_id  r_value  match_v
0  A    A1        0        1
1  A    A2        0        1
2  A    A3       70        2
3  A    A4      120        2
4  A    A5      250        3
5  A    A6      250        3
6  B    B1        0        1
7  B    B2       30        2
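
The merge above remaps match_v for every row, not only the rows holding 100. A minimal sketch that touches only the placeholder rows might look like the following. The function name fill_from_nearest, the placeholder parameter, and the helper columns _pos and nearest_v are hypothetical, and the sketch assumes every id group has at least one row with a real match_v.

```python
import numpy as np
import pandas as pd

def fill_from_nearest(df, placeholder=100):
    """Replace placeholder match_v values using the nearest r_value within the same id."""
    is_placeholder = (df["match_v"] == placeholder).to_numpy()
    if not is_placeholder.any():
        return df.copy()
    # Candidates: rows that already carry a real match_v
    candidates = (
        df.loc[~is_placeholder, ["id", "r_value", "match_v"]]
        .rename(columns={"match_v": "nearest_v"})
        .sort_values("r_value")
    )
    # Targets: placeholder rows, tagged with their position in df
    targets = (
        df.loc[is_placeholder, ["id", "r_value"]]
        .assign(_pos=np.flatnonzero(is_placeholder))
        .sort_values("r_value")
    )
    # Both frames must be sorted by the `on` key; `by` restricts matches to the same id
    merged = pd.merge_asof(targets, candidates, on="r_value", by="id", direction="nearest")
    out = df.copy()
    col = out.columns.get_loc("match_v")
    out.iloc[merged["_pos"].to_numpy(), col] = merged["nearest_v"].to_numpy()
    return out

df = pd.DataFrame({
    "id":      ["A", "A", "A", "A", "A", "A", "B", "B"],
    "su_id":   ["A1", "A2", "A3", "A4", "A5", "A6", "B1", "B2"],
    "r_value": [0, 0, 70, 120, 250, 250, 0, 30],
    "match_v": [1, 1, 2, 100, 3, 100, 1, 2],
})
result = fill_from_nearest(df)
print(result)
```

Because only the placeholder rows are written back, values that were already correct stay untouched, which was the concern raised in the question.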

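Here is another way using NumPy broadcasting, built for speed. Note that it compares each row only with earlier rows in the same group:

import numpy as np

l=[]
for x , y in df.groupby('id'):
    s1=y.r_value.values
    # pairwise absolute differences of r_value within the group
    s=abs((s1-s1[:,None])).astype(float)
    # mask the diagonal and lower triangle so each row only sees earlier rows
    s[np.tril_indices(s.shape[0], 0)] = 999999
    s=s.argmin(0)  # position of the nearest earlier row, per row
    s2=y.match_v.values
    l.append(s2[s][s2==100])
df.loc[df.match_v==100,'match_v']=np.concatenate(l)
df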
Out[264]: 
  id su_id  r_value  match_v
0  A    A1        0        1
1  A    A2        0        1
2  A    A3       70        2
3  A    A4      120        2
4  A    A5      250        3
5  A    A6      250        3
6  B    B1        0        1
7  B    B2       30        2
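
Since the loop above only looks at earlier rows, a variant that compares each placeholder row with every non-placeholder row in its group, in both directions, could be sketched as follows. The function name fill_nearest_broadcast and the placeholder parameter are hypothetical; groups with no placeholder or no candidate rows are left unchanged.

```python
import numpy as np
import pandas as pd

def fill_nearest_broadcast(df, placeholder=100):
    """Fill placeholder match_v values from the nearest r_value in the same id group."""
    r = df["r_value"].to_numpy(dtype=float)
    values = df["match_v"].to_numpy()
    filled = values.copy()
    # GroupBy.indices maps each group key to the integer positions of its rows
    for pos in df.groupby("id").indices.values():
        targets = pos[values[pos] == placeholder]
        candidates = pos[values[pos] != placeholder]
        if len(targets) == 0 or len(candidates) == 0:
            continue
        # dist[i, j] = |r_value of target i - r_value of candidate j|
        dist = np.abs(r[targets][:, None] - r[candidates][None, :])
        filled[targets] = values[candidates[dist.argmin(axis=1)]]
    out = df.copy()
    out["match_v"] = filled
    return out

df = pd.DataFrame({
    "id":      ["A", "A", "A", "A", "A", "A", "B", "B"],
    "su_id":   ["A1", "A2", "A3", "A4", "A5", "A6", "B1", "B2"],
    "r_value": [0, 0, 70, 120, 250, 250, 0, 30],
    "match_v": [1, 1, 2, 100, 3, 100, 1, 2],
})
result = fill_nearest_broadcast(df)
print(result)
```

The distance matrix has one row per placeholder and one column per candidate, so it stays small when placeholders are rare.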
answered Sep 24 '22 03:09 by BENY


You could define a custom function which does the calculation and substitution, and then use it with groupby and apply.

def mysubstitution(x):
    x = x.copy()  # work on a copy of the group
    for i in x.index[x['match_v'] == 100]:
        # i is an index label, so use label-based .loc rather than positional .iloc
        diff = (x['r_value'] - x.loc[i, 'r_value']).abs()
        closer_idx = diff.drop(i).idxmin()  # nearest row other than i itself
        x.loc[i, 'match_v'] = x.loc[closer_idx, 'match_v']
    return x

ddf = df.groupby('id', group_keys=False).apply(mysubstitution)

ddf is:

  id su_id  r_value  match_v
0  A    A1        0        1
1  A    A2        0        1
2  A    A3       70        2
3  A    A4      120        2
4  A    A5      250        3
5  A    A6      250        3
6  B    B1        0        1
7  B    B2       30        2
answered Sep 24 '22 03:09 by Valentino