Below are some test data and the function I use to identify values that are within the variable thresh of each other.
Can anyone please help me modify it to produce the desired output shown further down?
Test data:
import pandas as pd
import numpy as np
from itertools import combinations

df2 = pd.DataFrame(
    {'AAA': [4, 5, 6, 7, 9, 10],
     'BBB': [10, 20, 30, 40, 11, 10],
     'CCC': [100, 50, 25, 10, 10, 11],
     'DDD': [98, 50, 25, 10, 10, 11],
     'EEE': [103, 50, 25, 10, 10, 11]})
Function:
thresh = 5

def closeCols2(df):
    max_value = None
    for k1, k2 in combinations(df.keys(), 2):
        if abs(df[k1] - df[k2]) < thresh:
            if max_value is None:
                max_value = max(df[k1], df[k2])
            else:
                max_value = max(max_value, max(df[k1], df[k2]))
    return max_value
Data before the function is applied:
   AAA  BBB  CCC  DDD  EEE
0    4   10  100   98  103
1    5   20   50   50   50
2    6   30   25   25   25
3    7   40   10   10   10
4    9   11   10   10   10
5   10   10   11   11   11
Current output (a Series) after the function is applied:
df2.apply(closeCols2, axis=1)
0    103
1     50
2     25
3     10
4     11
5     11
dtype: int64
The desired output is a dataframe that shows all values within thresh and a nan for any value not within thresh:
   AAA  BBB  CCC  DDD  EEE
0  nan  nan  100   98  103
1  nan  nan   50   50   50
2  nan   30   25   25   25
3    7  nan   10   10   10
4    9   11   10   10   10
5   10   10   11   11   11
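Use mask and sub. closeCols2 is applied row-wise (axis=1), and the resulting per-row value is then subtracted down the index (axis=0):
df2.mask(df2.sub(df2.apply(closeCols2, axis=1), axis=0).abs() > thresh)
    AAA   BBB  CCC  DDD  EEE
0   NaN   NaN  100   98  103
1   NaN   NaN   50   50   50
2   NaN  30.0   25   25   25
3   7.0   NaN   10   10   10
4   9.0  11.0   10   10   10
5  10.0  10.0   11   11   11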
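To make it easier to see what the one-liner above does, here is a minimal sketch that unpacks it into named steps. The intermediate names (row_max, dist, too_far, result) are just illustrative and not part of the original answer; the function body is the same logic as in the question, only slightly condensed.

```python
import pandas as pd
from itertools import combinations

df2 = pd.DataFrame({
    'AAA': [4, 5, 6, 7, 9, 10],
    'BBB': [10, 20, 30, 40, 11, 10],
    'CCC': [100, 50, 25, 10, 10, 11],
    'DDD': [98, 50, 25, 10, 10, 11],
    'EEE': [103, 50, 25, 10, 10, 11],
})
thresh = 5

def closeCols2(row):
    # Largest value among all pairs in the row that differ by less than thresh.
    max_value = None
    for k1, k2 in combinations(row.keys(), 2):
        if abs(row[k1] - row[k2]) < thresh:
            pair_max = max(row[k1], row[k2])
            max_value = pair_max if max_value is None else max(max_value, pair_max)
    return max_value

# Step 1: one reference value per row (axis=1 hands each row to the function).
row_max = df2.apply(closeCols2, axis=1)

# Step 2: distance of every cell from its row's reference value.
# axis=0 aligns row_max with the index, so each row gets its own value subtracted.
dist = df2.sub(row_max, axis=0).abs()

# Step 3: boolean frame marking cells that are too far away.
too_far = dist > thresh

# Step 4: mask replaces every True cell with NaN.
result = df2.mask(too_far)
print(result)
```

Note that the comparison in step 3 is a strict "greater than", which is why 30 in row 2 survives: it is exactly 5 away from 25.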
Note:
I'd redefine closeCols2 to take thresh as a parameter. Then you can pass it in the apply call.
def closeCols2(df, thresh):
    max_value = None
    for k1, k2 in combinations(df.keys(), 2):
        if abs(df[k1] - df[k2]) < thresh:
            if max_value is None:
                max_value = max(df[k1], df[k2])
            else:
                max_value = max(max_value, max(df[k1], df[k2]))
    return max_value

df2.apply(closeCols2, axis=1, thresh=5)
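As a quick illustration of why the parameter helps, the sketch below calls the same function with two different thresholds. It relies on DataFrame.apply forwarding extra keyword arguments to the function; the names loose and tight are only illustrative.

```python
import pandas as pd
from itertools import combinations

df2 = pd.DataFrame({
    'AAA': [4, 5, 6, 7, 9, 10],
    'BBB': [10, 20, 30, 40, 11, 10],
    'CCC': [100, 50, 25, 10, 10, 11],
    'DDD': [98, 50, 25, 10, 10, 11],
    'EEE': [103, 50, 25, 10, 10, 11],
})

def closeCols2(row, thresh):
    # Largest value among all pairs in the row that differ by less than thresh.
    max_value = None
    for k1, k2 in combinations(row.keys(), 2):
        if abs(row[k1] - row[k2]) < thresh:
            pair_max = max(row[k1], row[k2])
            max_value = pair_max if max_value is None else max(max_value, pair_max)
    return max_value

# thresh is forwarded by apply as a keyword argument to closeCols2.
loose = df2.apply(closeCols2, axis=1, thresh=5)
tight = df2.apply(closeCols2, axis=1, thresh=3)

print(pd.DataFrame({'thresh=5': loose, 'thresh=3': tight}))
```

With thresh=3 only row 0 changes (from 103 to 100), because 103 is no longer within the threshold of 100 while 98 and 100 still are.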
Extra credit
I vectorized and embedded your closeCols2 for some mind-numbing fun.
Notice there is no apply.
- numpy broadcasting subtracts every column of a row from every other column of that row.
- np.abs takes the absolute value of those differences.
- <= 5 flags the differences that are within the threshold.
- sum(-1): I arranged the broadcasting so that the differences of, say, row 0, column AAA with all of row 0 are laid out across the last dimension. The -1 in sum(-1) says to sum across that last dimension.
- <= 1: every value is within 5 of itself, so each count already includes one self-match. A value is close to some other value only if its count is greater than 1. Thus, we mask everything with a count less than or equal to one.

v = df2.values
df2.mask((np.abs(v[:, :, None] - v[:, None]) <= 5).sum(-1) <= 1)
    AAA   BBB  CCC  DDD  EEE
0   NaN   NaN  100   98  103
1   NaN   NaN   50   50   50
2   NaN  30.0   25   25   25
3   7.0   NaN   10   10   10
4   9.0  11.0   10   10   10
5  10.0  10.0   11   11   11
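For anyone who finds the broadcasting hard to follow, here is a rough sketch of the same idea split into steps, with the shapes spelled out and the threshold turned into a parameter. The helper name mask_isolated and the intermediate names are my own invention for illustration, not something from pandas or numpy.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'AAA': [4, 5, 6, 7, 9, 10],
    'BBB': [10, 20, 30, 40, 11, 10],
    'CCC': [100, 50, 25, 10, 10, 11],
    'DDD': [98, 50, 25, 10, 10, 11],
    'EEE': [103, 50, 25, 10, 10, 11],
})

def mask_isolated(df, thresh):
    """Hypothetical helper: NaN out values with no other value in the same row within thresh."""
    v = df.to_numpy()                      # shape (rows, cols)
    left = v[:, :, None]                   # shape (rows, cols, 1)
    right = v[:, None, :]                  # shape (rows, 1, cols)
    diffs = np.abs(left - right)           # shape (rows, cols, cols): all pairwise differences per row
    close_counts = (diffs <= thresh).sum(-1)  # shape (rows, cols); each count includes the self-match
    return df.mask(close_counts <= 1)      # a count of 1 means "only close to itself"

v = df2.to_numpy()
print(v[:, :, None].shape, v[:, None, :].shape)  # the two shapes that broadcast against each other

result = mask_isolated(df2, 5)
print(result)
```

The intermediate array has rows x cols x cols elements, so this trades memory for speed; that is fine for a handful of columns but worth keeping in mind for wide frames.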