Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding the most similar numbers across multiple lists in Python

Tags:

python

list

In Python, I have 3 lists of floating-point numbers (angles), in the range 0-360, and the lists are not the same length. I need to find the triplet (with 1 number from each list) in which the numbers are the closest. (It's highly unlikely that any of the numbers will be identical, since this is real-world data.) I was thinking of using a simple lowest-standard-deviation method to measure agreement, but I'm not sure of a good way to implement this. I could loop through each list, comparing the standard deviation of every possible combination using nested for loops, and have a temporary variable save the indices of the triplet that agrees the best, but I was wondering if anyone had a better or more elegant way to do something like this. Thanks!

like image 659
new_sysadmin Avatar asked Sep 27 '12 23:09

new_sysadmin


1 Answers

I wouldn't be surprised if there is an established algorithm for doing this, and if so, you should use it. But I don't know of one, so I'm going to speculate a little.

If I had to do it, the first thing I would try would be just to loop through all possible combinations of all the numbers and see how long it takes. If your data set is small enough, it's not worth the time to invent a clever algorithm. To demonstrate the setup, I'll include the sample code:

# setup
def distance(nplet):
    '''Takes a pair or triplet (an "n-plet") as a list, and returns its distance.
    A smaller return value means better agreement.'''
    # your choice of implementation here. Example:
    return variance(nplet)

# algorithm
def brute_force(*lists):
    return min(itertools.product(*lists), key = distance)

For a large data set, I would try something like this: first create one triplet for each number in the first list, with its first entry set to that number. Then go through this list of partially-filled triplets and for each one, pick the number from the second list that is closest to the number from the first list and set that as the second member of the triplet. Then go through the list of triplets and for each one, pick the number from the third list that is closest to the first two numbers (as measured by your agreement metric). Finally, take the best of the bunch. This sample code demonstrates how you could try to keep the runtime linear in the length of the lists.

def item_selection(listA, listB, listC):
    # make the list of partially-filled triplets
    triplets = [[a] for a in listA]
    iT = 0
    iB = 0
    while iT < len(triplets):
        # make iB the index of a value in listB closes to triplets[iT][0]
        while iB < len(listB) and listB[iB] < triplets[iT][0]:
            iB += 1
        if iB == 0:
            triplets[iT].append(listB[0])
        elif iB == len(listB)
            triplets[iT].append(listB[-1])
        else:
            # look at the values in listB just below and just above triplets[iT][0]
            # and add the closer one as the second member of the triplet
            dist_lower = distance([triplets[iT][0], listB[iB]])
            dist_upper = distance([triplets[iT][0], listB[iB + 1]])
            if dist_lower < dist_upper:
                triplets[iT].append(listB[iB])
            elif dist_lower > dist_upper:
                triplets[iT].append(listB[iB + 1])
            else:
                # if they are equidistant, add both
                triplets[iT].append(listB[iB])
                iT += 1
                triplets[iT:iT] = [triplets[iT-1][0], listB[iB + 1]]
        iT += 1
    # then another loop while iT < len(triplets) to add in the numbers from listC
    return min(triplets, key = distance)

The thing is, I can imagine situations where this wouldn't actually find the best triplet, for instance if a number from the first list is close to one from the second list but not at all close to anything in the third list. So something you could try is to run this algorithm for all 6 possible orderings of the lists. I can't think of a specific situation where that would fail to find the best triplet, but one might still exist. In any case the algorithm will still be O(N) if you use a clever implementation, assuming the lists are sorted.

def symmetrized_item_selection(listA, listB, listC):
    best_results = []
    for ordering in itertools.permutations([listA, listB, listC]):
        best_results.extend(item_selection(*ordering))
    return min(best_results, key = distance)

Another option might be to compute all possible pairs of numbers between list 1 and list 2, between list 1 and list 3, and between list 2 and list 3. Then sort all three lists of pairs together, from best to worst agreement between the two numbers. Starting with the closest pair, go through the list pair by pair and any time you encounter a pair which shares a number with one you've already seen, merge them into a triplet. For a suitable measure of agreement, once you find your first triplet, that will give you a maximum pair distance that you need to iterate up to, and once you get up to it, you just choose the closest triplet of the ones you've found. I think that should consistently find the best possible triplet, but it will be O(N^2 log N) because of the requirement for sorting the lists of pairs.

def pair_sorting(listA, listB, listC):
    # make all possible pairs of values from two lists
    # each pair has the structure ((number, origin_list),(number, origin_list))
    # so we know which lists the numbers came from
    all_pairs = []
    all_pairs += [((nA,0), (nB,1)) for (nA,nB) in itertools.product(listA,listB)]
    all_pairs += [((nA,0), (nC,2)) for (nA,nC) in itertools.product(listA,listC)]
    all_pairs += [((nB,1), (nC,2)) for (nB,nC) in itertools.product(listB,listC)]
    all_pairs.sort(key = lambda p: distance(p[0][0], p[1][0]))
    # make a dict to track which (number, origin_list)s we've already seen
    pairs_by_number_and_list = collections.defaultdict(list)
    min_distance = INFINITY
    min_triplet = None
    # start with the closest pair
    for pair in all_pairs:
        # for the first value of the current pair, see if we've seen that particular
        # (number, origin_list) combination before
        for pair2 in pairs_by_number_and_list[pair[0]]:
            # if so, that means the current pair shares its first value with
            # another pair, so put the 3 unique values together to make a triplet
            this_triplet = (pair[1][0], pair2[0][0], pair2[1][0])
            # check if the triplet agrees more than the previous best triplet
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # do the same thing but checking the second element of the current pair
        for pair2 in pairs_by_number_and_list[pair[1]]:
            this_triplet = (pair[0][0], pair2[0][0], pair2[1][0])
            this_distance = distance(this_triplet)
            if this_distance < min_distance:
                min_triplet = this_triplet
                min_distance = this_distance
        # finally, add the current pair to the list of pairs we've seen
        pairs_by_number_and_list[pair[0]].append(pair)
        pairs_by_number_and_list[pair[1]].append(pair)
    return min_triplet

N.B. I've written all the code samples in this answer out a little more explicitly than you'd do it in practice to help you to understand how they work. But when doing it for real, you'd use more list comprehensions and such things.

N.B.2. No guarantees that the code works :-P but it should get the rough idea across.

like image 152
David Z Avatar answered Nov 26 '22 04:11

David Z