I have a pandas column like so:
index colA
1 10.2
2 10.8
3 11.6
4 10.7
5 9.5
6 6.2
7 12.9
8 10.6
9 6.4
10 20.5
I want to search the current row value and find matches from previous rows that are close. For example index4 (10.7) would return a match of 1 because it is close to index2 (10.8). Similarly index8 (10.6) would return a match of 2 because it is close to both index2 and 4.
Using a threshold of +/- 5% for this example would output the below:
index colA matches
1 10.2 0
2 10.8 0
3 11.6 0
4 10.7 2
5 9.5 0
6 6.2 0
7 12.9 0
8 10.6 3
9 6.4 1
10 20.5 0
With a large dataframe I would like to limit this to the previous X (300?) number of rows to search over rather than an entire dataframe.
Using triangle indices to ensure we only look backwards. Then use np.bincount
to accumulate the matches.
a = df.colA.values
i, j = np.tril_indices(len(a), -1)
mask = np.abs(a[i] - a[j]) / a[i] <= .05
df.assign(matches=np.bincount(i[mask], minlength=len(a)))
colA matches
index
1 10.2 0
2 10.8 0
3 11.6 0
4 10.7 2
5 9.5 0
6 6.2 0
7 12.9 0
8 10.6 3
9 6.4 1
10 20.5 0
If you are having resource issues, consider using good 'ol fashion loops. However, if you have access to numba
you make this considerably faster.
from numba import njit
@njit
def counter(a):
c = np.arange(len(a)) * 0
for i, x in enumerate(a):
for j, y in enumerate(a):
if j < i:
if abs(x - y) / x <= .05:
c[i] += 1
return c
df.assign(matches=counter(a))
colA matches
index
1 10.2 0
2 10.8 0
3 11.6 0
4 10.7 2
5 9.5 0
6 6.2 0
7 12.9 0
8 10.6 3
9 6.4 1
10 20.5 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With