Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find max overlap in list of lists

I have two lists of lists:

a = [[0, 1, 5], [2], [3], [4], [6, 7], [8, 9, 10, 11], [12], [13], [14], [15]]
b = [[0, 1], [2, 3], [4], [5], [6, 7], [8, 9, 10, 11], [12], [13, 14], [15]]

How can I find the maximum overlap between the values of the lists and build a new list of lists with this maximum overlap. In other words, I'm looking for a function f which maximizes the list sizes by merging lists with overlap.

The desired result of function f for this example would be:

f(a,b) = [[0, 1, 5], [2, 3], [4], [6, 7], [8, 9, 10, 11], [12], [13, 14], [15]] 
like image 889
elcombato Avatar asked Mar 09 '23 17:03

elcombato


1 Answers

You can use a variant of the disjoint-set structure to solve this problem: for each list [a,b,c] you unify a with b and a with c. You do this for both lists and then derive the resulting roots.

Here there is a simply disjunct-set algorithm we can modify:

from collections import defaultdict

def parent(u,mapping):
    if mapping[u] == u:
        return u
    mapping[u] = parent(mapping[u],mapping)
    return mapping[u]

def relation(array,mapping=None):
    if mapping is None:
        mapping = {}

    for e in array:
        if len(e) > 0:
            u = e[0]
            if u not in mapping:
                mapping[u] = u
            for v in e[1:]:
                if v not in mapping:
                    mapping[v] = v
                mapping[parent(u,mapping)] = parent(v,mapping)
    return mapping

def f(a,b):
    mapping = {}
    relation(a,mapping)
    relation(b,mapping)

    results = defaultdict(set)
    for u in mapping.keys():
        results[parent(u,mapping)].add(u)
    return [list(x) for x in results.values()]

(boldface added for the semantical differences with the original union-set algorithm).

This produces:

>>> f(a,b)
[[2, 3], [4], [0, 1, 5], [6, 7], [8, 9, 10, 11], [12], [13, 14], [15]]

The result is not sorted, since we work with a set. Nevertheless, you can easily sort it on the first element of each tuple if you want by altering f to:

def f(a,b):
    mapping = {}
    relation(a,mapping)
    relation(b,mapping)

    results = defaultdict(set)
    for u in mapping.keys():
        results[parent(u,mapping)].add(u)
    return sorted([list(x) for x in results.values()],key=lambda t:t[0])

which produces:

>>> f(a,b)
[[0, 1, 5], [2, 3], [4], [6, 7], [8, 9, 10, 11], [12], [13, 14], [15]]

The nice thing with this solution is that it also works if there is overlap in a or b itself, and you can easily generalize the solution to work with an arbitrary amount of lists (for instance a, b and c).

like image 178
Willem Van Onsem Avatar answered Mar 15 '23 20:03

Willem Van Onsem