Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest union for multiple sorted lists removing duplicates and get an ordered result

Tags:

When having a list of lists in which all the sublists are ordered e.g.: [[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]] what is the fastest way to get the union of these lists without having duplicates in it and also get an ordered result? So the resulting list should be: [1,2,3,4,5,6,7,20,25,26,31]. I know I could just union them all without duplicates and then sort them but are there faster ways (like: do the sorting while doing the union) built into python?

EDIT:

Is the proposed answer faster than executing the following algorithm pairwise with all the sublists? enter image description here

like image 695
yzrzrbnz zrnrynzr Avatar asked Dec 16 '19 17:12

yzrzrbnz zrnrynzr


2 Answers

You can use heapq.merge for this:

def mymerge(v):
    from heapq import merge
    last = None
    for a in merge(*v):
        if a != last:  # remove duplicates
            last = a
            yield a

print(list(mymerge([[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]])))
# [1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
like image 152
arshajii Avatar answered Oct 11 '22 18:10

arshajii


(EDITED)

The asymptotic theoretical best approach to the problem is to use the priority queue, like, for example, the one implemented in heapq.merge() (thanks to @kaya3 for pointing this out).

However, in practice, a number of things can go wrong. For example the constant factors in the complexity analysis are large enough that a theoretically-optimal approach is, in real-life scenarios, slower. This is fundamentally dependent on the implementation. For example, Python suffer some speed penalty for explicit looping.

So, let's consider a couple of approaches and how the do perform for some concrete inputs.

Approaches

Just to give you some idea of the numbers we are discussing, here are a few approaches:

  • merge_sorted() which uses the most naive approach of flatten the sequence, reduce it to a set() (removing duplicates) and sort it as required
import itertools


def merge_sorted(seqs):
    return sorted(set(itertools.chain.from_iterable(seqs)))

  • merge_heapq() which essentially @arshajii's answer. Note that the itertools.groupby() variation is slightly (less than ~1%) faster.
import heapq


def i_merge_heapq(seqs):
    last_item = None
    for item in heapq.merge(*seqs):
        if item != last_item:
            yield item
            last_item = item

def merge_heapq(seqs):
    return list(i_merge_heapq(seqs))

  • merge_bisect_set() is substantially the same algorithm as merge_sorted() except that the result is now constructed explicitly using the efficient bisect module for sorted insertions. Since sorted() is doing fundamentally the same thing but looping in Python, this is not going to be faster.
import itertools
import bisect


def merge_bisect_set(seqs):
    result = []
    for item in set(itertools.chain.from_iterable(seqs)):
        bisect.insort(result, item)
    return result

  • merge_bisect_cond() is similar to merge_bisect_set() but now the non-repeating constraint is explicitly done using the final list. However, this is much more expensive than just using set() (in fact it is so slow that it was excluded from the plots).
def merge_bisect_cond(seqs):
    result = []
    for item in itertools.chain.from_iterable(seqs):
        if item not in result:
            bisect.insort(result, item)
    return result

  • merge_pairwise() explicitly implements the the theoretically efficient algorithm, similar to what you outlined in your question.
def join_sorted(seq1, seq2):
    result = []
    i = j = 0
    len1, len2 = len(seq1), len(seq2)
    while i < len1 and j < len2:
        if seq1[i] < seq2[j]:
            result.append(seq1[i])
            i += 1
        elif seq1[i] > seq2[j]:
            result.append(seq2[j])
            j += 1
        else:  # seq1[i] == seq2[j]
            result.append(seq1[i])
            i += 1
            j += 1
    if i < len1:
        result.extend(seq1[i:])
    elif j < len2:
        result.extend(seq2[j:])
    return result


def merge_pairwise(seqs):
    result = []
    for seq in seqs:
        result = join_sorted(result, seq)
    return result

  • merge_loop() implements a generalization of the above, where now pass is done only once for all sequences, instead of doing this pairwise.
def merge_loop(seqs):
    result = []
    lengths = list(map(len, seqs))
    idxs = [0] * len(seqs)
    while any(idx < length for idx, length in zip(idxs, lengths)):
        item = min(
            seq[idx]
            for idx, seq, length in zip(idxs, seqs, lengths) if idx < length)
        result.append(item)
        for i, (idx, seq, length) in enumerate(zip(idxs, seqs, lengths)):
            if idx < length and seq[idx] == item:
                idxs[i] += 1
    return result

Benchmarks

By generating the input using:

def gen_input(n, m=100, a=None, b=None):
    if a is None and b is None:
        b = 2 * n * m
        a = -b
    return tuple(tuple(sorted(set(random.randint(int(a), int(b)) for _ in range(n)))) for __ in range(m))

one can plot the timings for varying n:

bm_full bm_zoom

Note that, in general, the performances will vary for different values of n (the size of each sequence) and m (the number of sequences), but also of a and b (the minimum and the maximum number generated). For brevity, this was not explored in this answer, but feel free to play around with it here, which also includes some other implementations, notably some tentative speed-ups with Cython that were only partially successful.

like image 40
norok2 Avatar answered Oct 11 '22 16:10

norok2