Matching two lists containing slightly differing float values by allowing a tolerance

Tags:

I have two sorted lists containing float values. The first contains the values I am interested in (l1) and the second list contains values I want to search (l2). However, I am not looking for exact matches and I am tolerating differences based on a function. Since I have do this search very often (>>100000) and the lists can be quite large (~5000 and ~200000 elements), I am really interested in runtime. At first, I thought I could somehow use numpy.isclose(), but my tolerance is not fixed, but depending on the value of interest. Several nested for loops work, but are really slow. I am sure that there is some efficient way to do this.

#check if two floats are close enough to match
def matching(mz1, mz2):
    if abs( (1-mz1/mz2) * 1000000) <= 2:
        return True
    return False

#imagine another huge for loop around everything
l1  = [132.0317, 132.8677, 132.8862, 133.5852, 133.7507]
l2  = [132.0317, 132.0318, 132.8678, 132.8861, 132.8862, 133.5851999, 133.7500]

d = {i:[] for i in l1}
for i in l1:
    for j in l2:
        if matching(i, j):
            d[i].append(j)

fyi: As an alternative to the matching function, I could also create a dictionary first, mapping the values of interest from l1 to the window (min ,max) I would allow. e.g. {132.0317:(132.0314359366, 132.0319640634), ...}, but I think checking for each value from l2 if it lies within one of the windows from this dictionary would be even slower...

This would be how to generate the dictionary containing min/max values for each value from l1:

def calcMinMaxMZ(mz, delta_ppm=2):
    minmz = mz- (mz* +delta_ppm)/1000000
    maxmz = mz- (mz* -delta_ppm)/1000000
    return minmz, maxmz

minmax_d = {mz:calcMinMaxMZ(mz, delta_ppm=2) for mz in l1}

The result may be a dictionary like this: d = {132.0317: [132.0317, 132.0318], 132.8677: [132.8678], 132.8862: [132.8862, 132.8861], 133.5852: [133.5851999], 133.7507: []} But I actually do much more, when there is a match.

Any help is appreciated!

256

asked Aug 05 '19 15:08

Godrebh

1 Answers

I re-implemented the for loop using itertools. For it working, the inputs must be sorted. For benchmark I generated 1000 items from <130.0, 135.0> for l1 and 100_000 items from <130.0, 135.0> for l2:

from timeit import timeit
from itertools import tee
from random import uniform

#check if two floats are close enough to match
def matching(mz1, mz2):
    if abs( (1-mz1/mz2) * 1000000) <= 2:
        return True
    return False

#imagine another huge for loop around everything
l1 = sorted([uniform(130.00, 135.00) for _ in range(1000)])
l2 = sorted([uniform(130.00, 135.00) for _ in range(100_000)])

def method1():
    d = {i:[] for i in l1}
    for i in l1:
        for j in l2:
            if matching(i, j):
                d[i].append(j)
    return d

def method2():
    iter_2, last_match = tee(iter(l2))
    d = {}
    for i in l1:
        d.setdefault(i, [])
        found = False
        while True:
            j = next(iter_2, None)
            if j is None:
                break
            if matching(i, j):
                d[i].append(j)
                if not found:
                    iter_2, last_match = tee(iter_2)
                    found = True
            else:
                if found:
                    break
        iter_2, last_match = tee(last_match)
    return d

print(timeit(lambda: method1(), number=1))
print(timeit(lambda: method2(), number=1))

Prints:

16.900722101010615
0.030588202003855258

195

answered Sep 28 '22 09:09

Andrej Kesely

Related questions
                            
                                Unable to hide “Chrome is being controlled by automated software” infobar within Chrome v76
                            
                                Not getting expected delegate calls when trying to restore in-app purchases with StoreKit
                            
                                What's implicit contracts of the filter streams in java?
                            
                                jsonDoclet task: Javadoc generation failed
                            
                                Multi label encoding for classes with duplicates
                            
                                Why I'm getting Swift_TransportException: Unable to connect with TLS encryption in Server?
                            
                                How to vectorize increments in Python
                            
                                VSCode - open remote from cli
                            
                                Can I control the formatting of multiline strings?
                            
                                Handling event streams in haskell
                            
                                Is it possible to create a Bluetooth Mesh Network with iOS and Android devices
                            
                                use plugin International Telephone Input in react

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With