Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to align two lists of numbers

I have two sorted lists of numbers A and B with B being at least as long as A. Say:

A = [1.1, 2.3, 5.6, 5.7, 10.1]
B = [0, 1.9, 2.4, 2.7, 8.4, 9.1, 10.7, 11.8]

I want to associate each number in A with a different number in B but preserving order. For any such mapping we define the total distance to be the sum of the squared distances between mapped numbers.

For example:

If we map 1.1 to 0 0 then 2.3 can be mapped to any number from 1.9 onwards. But if we had mapped 1.1 to 2.7, then 2.3 could only be mapped to a number in B from 8.4 onwards.

Say we map 1.1->0, 2.3->1.9, 5.6->8.4, 5.7->9.1, 10.1->10.7. This is a valid mapping and has distance (1.1^2+0.4^2+2.8^2+3.4^2+0.6^2).

Another example to show a greedy approach will not work:

 A = [1, 2]
 B = [0, 1, 10000]

If we map 1->1 then we have to map 2->10000 which is bad.

The task is to find the valid mapping with minimal total distance.

Is hard to do? I am interested in a method that is fast when the lists are of length a few thousand.

like image 468
graffe Avatar asked Jul 23 '19 20:07

graffe


2 Answers

And here is a O(n) solution! (This is the original attempt, see below for a fixed version.)

The idea is as follows. We first solve the problem for every other element, turn that into a very close solution, then use dynamic programming to find the real solution. This is solving a problem that is half the size first, followed by O(n) work. Using the fact that x + x/2 + x/4 + ... = 2x this turns out to be O(n) work.

This very, very much requires sorted lists. And doing a band that is 5 across is overkill, it very much looks like a band that is 3 across always gives the right answer, but I wasn't confident enough to go with that.

def improve_matching (list1, list2, matching):
    # We do DP forward, trying a band that is 5 across, building up our
    # answer as a linked list.  If our answer changed by no more than 1
    # anywhere, we are done.  Else we recursively improve again.
    best_j_last = -1
    last = {-1: (0.0, None)}
    for i in range(len(list1)):
        best_j = None
        best_cost = None
        this = {}
        for delta in (-2, 2, -1, 1, 0):
            j = matching[i] + delta
            # Bounds sanity checks.
            if j < 0:
                continue
            elif len(list2) <= j:
                continue

            j_prev = best_j_last
            if j <= j_prev:
                if j-1 in last:
                    j_prev = j-1
                else:
                    # Can't push back this far.
                    continue

            cost = last[j_prev][0] + (list1[i] - list2[j])**2
            this[j] = (cost, [j, last[j_prev][1]])
            if (best_j is None) or cost <= best_cost:
                best_j = j
                best_cost = cost

        best_j_last = best_j
        last = this

    (final_cost, linked_list) = last[best_j_last]
    matching_rev = []
    while linked_list is not None:
        matching_rev.append( linked_list[0])
        linked_list = linked_list[1]
    matching_new = [x for x in reversed(matching_rev)]
    for i in range(len(matching_new)):
        if 1 < abs(matching[i] - matching_new[i]):
            print "Improving further" # Does this ever happen?
            return improve_matching(list1, list2, matching_new)

    return matching_new

def match_lists (list1, list2):
    if 0 == len(list1):
        return []
    elif 1 == len(list1):
        best_j = 0
        best_cost = (list1[0] - list2[0])**2
        for j in range(1, len(list2)):
            cost = (list1[0] - list2[j])**2
            if cost < best_cost:
                best_cost = cost
                best_j = j
        return [best_j]
    elif 1 < len(list1):
        # Solve a smaller problem first.
        list1_smaller = [list1[2*i] for i in range((len(list1)+1)//2)]
        list2_smaller = [list2[2*i] for i in range((len(list2)+1)//2)]
        matching_smaller = match_lists(list1_smaller, list2_smaller)

        # Start with that matching.
        matching = [None] * len(list1)
        for i in range(len(matching_smaller)):
            matching[2*i] = 2*matching_smaller[i]

        # Fill in the holes between
        for i in range(len(matching) - 1):
            if matching[i] is None:
                best_j = matching[i-1] + 1
                best_cost = (list1[i] - list2[best_j])**2
                for j in range(best_j+1, matching[i+1]):
                    cost = (list1[i] - list2[j])**2
                    if cost < best_cost:
                        best_cost = cost
                        best_j = j
                matching[i] = best_j

        # And fill in the last one if needed
        if matching[-1] is None:
            if matching[-2] + 1 == len(list2):
                # This will be an invalid matching, but improve will fix that.
                matching[-1] = matching[-2]
            else:
                best_j = matching[-2] + 1
                best_cost = (list1[-2] - list2[best_j])**2
                for j in range(best_j+1, len(list2)):
                    cost = (list1[-1] - list2[j])**2
                    if cost < best_cost:
                        best_cost = cost
                        best_j = j
                matching[-1] = best_j

        # And now improve.
        return improve_matching(list1, list2, matching)

def best_matching (list1, list2):
    matching = match_lists(list1, list2)
    cost = 0.0
    result = []
    for i in range(len(matching)):
        pair = (list1[i], list2[matching[i]])
        result.append(pair)
        cost = cost + (pair[0] - pair[1])**2
    return (cost, result)

UPDATE

There is a bug in the above. It can be demonstrated with match_lists([1, 3], [0, 0, 0, 0, 0, 1, 3]). However the solution below is also O(n) and I believe has no bugs. The difference is that instead of looking for a band of fixed width, I look for a band of width dynamically determined by the previous matching. Since no more than 5 entries can look to match at any given spot, it again winds up O(n) for this array and a geometrically decreasing recursive call. But long stretches of the same value cannot cause a problem.

def match_lists (list1, list2):
    prev_matching = []

    if 0 == len(list1):
        # Trivial match
        return prev_matching
    elif 1 < len(list1):
        # Solve a smaller problem first.
        list1_smaller = [list1[2*i] for i in range((len(list1)+1)//2)]
        list2_smaller = [list2[2*i] for i in range((len(list2)+1)//2)]
        prev_matching = match_lists(list1_smaller, list2_smaller)

    best_j_last = -1
    last = {-1: (0.0, None)}
    for i in range(len(list1)):
        lowest_j = 0
        highest_j = len(list2) - 1
        if 3 < i:
            lowest_j = 2 * prev_matching[i//2 - 2]
        if i + 4 < len(list1):
            highest_j = 2 * prev_matching[i//2 + 2]

        if best_j_last == highest_j:
            # Have to push it back.
            best_j_last = best_j_last - 1

        best_cost = last[best_j_last][0] + (list1[i] - list2[highest_j])**2
        best_j = highest_j
        this = {best_j: (best_cost, [best_j, last[best_j_last][1]])}

        # Now try the others.
        for j in range(lowest_j, highest_j):
            prev_j = best_j_last
            if j <= prev_j:
                prev_j = j - 1

            if prev_j not in last:
                continue
            else:
                cost = last[prev_j][0] + (list1[i] - list2[j])**2
                this[j] = (cost, [j, last[prev_j][1]])
                if cost < best_cost:
                    best_cost = cost
                    best_j = j

        last = this
        best_j_last = best_j

    (final_cost, linked_list) = last[best_j_last]
    matching_rev = []
    while linked_list is not None:
        matching_rev.append( linked_list[0])
        linked_list = linked_list[1]
    matching_new = [x for x in reversed(matching_rev)]

    return matching_new

def best_matching (list1, list2):
    matching = match_lists(list1, list2)
    cost = 0.0
    result = []
    for i in range(len(matching)):
        pair = (list1[i], list2[matching[i]])
        result.append(pair)
        cost = cost + (pair[0] - pair[1])**2
    return (cost, result)

Note

I was asked to explain why this works.

Here is my heuristic understanding. In the algorithm we solve the half-problem. Then we have to solve the full problem.

The question is how far can an optimal solution for the full problem be forced to be from the optimal solution to the half problem? We push it to the right by having every element in list2 that wasn't in the half problem be large as possible, and every element in list1 that wasn't in the half problem be small as possible. But if we shove the ones from the half problem to the right, and put the duplicate elements where they were then modulo boundary effects, we've got 2 optimal solutions to the half problem and nothing moved by more than to where the next element right was in the half problem. Similar reasoning applies to trying to force the solution left.

Now let's discuss those boundary effects. Those boundary effects are at the end by 1 element. So when we try to shove an element off the end, we can't always. By looking 2 elements instead of 1 over, we add enough wiggle room to account for that as well.

Hence there has to be an optimal solution that is fairly close to the half problem doubled in an obvious way. There may be others, but there is at least one. And the DP step will find it.

I would need to do some work to capture this intuition into a formal proof, but I'm confident that it could be done.

like image 184
btilly Avatar answered Oct 06 '22 03:10

btilly


Here's a recursive solution. Pick the middle element of a; map that to each possible element of b (leave enough on each end to accommodate the left and right sections of a). For each such mapping, compute the single-element cost; then recur on each of the left and right fragments of a and b.

Here's the code; I'll leave memoization as an exercise for the student.

test_case = [
    [ [1, 2], [0, 1, 10] ],
    [ [1.1, 2.3, 5.6, 5.7, 10.1], [0, 1.9, 2.4, 2.7, 8.4, 9.1, 10.7, 11.8] ],
]

import math
indent = ""


def best_match(a, b):
    """
    Find the best match for elements in a mapping to b, preserving order
    """

    global indent
    indent += "  "
    # print(indent, "ENTER", a, b)

    best_cost = math.inf
    best_map = []

    if len(a) == 0:
        best_cost = 0
        best_map = []

    else:

        # Match the middle element of `a` to each eligible element of `b`
        a_midpt = len(a) // 2
        a_elem = a[a_midpt]
        l_margin = a_midpt
        r_margin = a_midpt + len(b) - len(a) 

        for b_pos in range(l_margin, r_margin+1):
            # For each match ...
            b_elem = b[b_pos]
            # print(indent, "TRACE", a_elem, b_elem)

            # ... compute the element cost ...
            mid_cost = (a_elem - b_elem)**2

            # ... and recur for similar alignments on left & right list fragments
            l_cost, l_map = best_match(a[:l_margin], b[:b_pos])
            r_cost, r_map = best_match(a[l_margin+1:], b[b_pos+1:])

            # Check total cost against best found; keep the best
            cand_cost = l_cost + mid_cost + r_cost
            # print(indent, " COST", mid_cost, l_cost, r_cost)
            if cand_cost < best_cost:
                best_cost = cand_cost
                best_map = l_map[:] + [(a_elem, b_elem)]
                best_map.extend(r_map[:])

    # print(indent, "LEAVE", best_cost, best_map)
    return best_cost, best_map


for a, b in test_case:
    print('\n', a, b)
    print(best_match(a, b))

Output:

 a = [1, 2] 
 b = [0, 1, 10]
2 [(1, 0), (2, 1)]

 a = [1.1, 2.3, 5.6, 5.7, 10.1] 
 b = [0, 1.9, 2.4, 2.7, 8.4, 9.1, 10.7, 11.8]
16.709999999999997 [(1.1, 1.9), (2.3, 2.4), (5.6, 2.7), (5.7, 8.4), (10.1, 10.7)]
like image 41
Prune Avatar answered Oct 06 '22 05:10

Prune