Given a list of lists in which all the sublists are sorted, e.g.:
[[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]]
what is the fastest way to get the union of these lists, without duplicates, with the result sorted? So the resulting list should be:
[1,2,3,4,5,6,7,20,25,26,31]
I know I could just union them all, remove duplicates, and then sort, but are there faster ways built into Python (like doing the sorting while doing the union)?
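For reference, a minimal sketch of the "union then sort" baseline I mean (plain built-ins only):

lists = [[1, 3, 7, 20, 31], [1, 2, 5, 6, 7], [2, 4, 25, 26]]
merged = sorted(set().union(*lists))  # union removes duplicates, then sort
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]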
EDIT:

Is the proposed answer faster than executing a two-way merge algorithm pairwise over all the sublists?
You can use heapq.merge for this:

def mymerge(v):
    from heapq import merge
    last = None
    for a in merge(*v):
        if a != last:  # remove duplicates
            last = a
            yield a

print(list(mymerge([[1,3,7,20,31], [1,2,5,6,7], [2,4,25,26]])))
# [1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]
(EDITED)

The asymptotically best approach to this problem is to use a priority queue, like the one implemented in heapq.merge() (thanks to @kaya3 for pointing this out).

However, in practice, a number of things can go wrong. For example, the constant factors hidden in the complexity analysis can be large enough that a theoretically optimal approach is slower in real-life scenarios. This depends fundamentally on the implementation; for example, Python suffers a speed penalty for explicit looping.

So, let's consider a couple of approaches and how they perform for some concrete inputs. Just to give you some idea of the numbers we are discussing, here are a few approaches:
merge_sorted() uses the most naive approach: flatten the sequences, reduce them to a set() (removing duplicates), and sort the result as required:

import itertools

def merge_sorted(seqs):
    return sorted(set(itertools.chain.from_iterable(seqs)))
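For instance, on the input from the question this gives:

print(merge_sorted([[1, 3, 7, 20, 31], [1, 2, 5, 6, 7], [2, 4, 25, 26]]))
# [1, 2, 3, 4, 5, 6, 7, 20, 25, 26, 31]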
merge_heapq() is essentially @arshajii's answer. Note that the itertools.groupby() variation is slightly (less than ~1%) faster:

import heapq

def i_merge_heapq(seqs):
    last_item = None
    for item in heapq.merge(*seqs):
        if item != last_item:
            yield item
            last_item = item

def merge_heapq(seqs):
    return list(i_merge_heapq(seqs))
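A sketch of what such a groupby() variation might look like (the exact version benchmarked here is not shown, so this is just one plausible formulation):

import heapq
import itertools

def merge_heapq_groupby(seqs):
    # groupby() collapses runs of equal items in the already-sorted
    # stream produced by heapq.merge(), thereby removing duplicates
    return [key for key, _group in itertools.groupby(heapq.merge(*seqs))]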
merge_bisect_set() is substantially the same algorithm as merge_sorted(), except that the result is now constructed explicitly using the bisect module for sorted insertions. Since this does fundamentally the same work as sorted() but with explicit looping in Python (and with O(n) element shifting on each insertion), it is not going to be faster:

import itertools
import bisect

def merge_bisect_set(seqs):
    result = []
    for item in set(itertools.chain.from_iterable(seqs)):
        bisect.insort(result, item)
    return result
merge_bisect_cond() is similar to merge_bisect_set(), but now the non-repeating constraint is enforced explicitly by checking membership in the final list. However, this is much more expensive than just using set() (in fact, it is so slow that it was excluded from the plots):

def merge_bisect_cond(seqs):
    result = []
    for item in itertools.chain.from_iterable(seqs):
        if item not in result:
            bisect.insort(result, item)
    return result
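Note that the `item not in result` test is a linear scan. A hypothetical variant (not part of the benchmarks above) could use bisect.bisect_left() for an O(log n) membership test instead, although the list insertion itself still shifts elements, so the overall behavior remains quadratic:

import itertools
import bisect

def merge_bisect_cond_fast(seqs):  # hypothetical variant, for illustration
    result = []
    for item in itertools.chain.from_iterable(seqs):
        pos = bisect.bisect_left(result, item)
        # insert only if item is not already present at pos
        if pos == len(result) or result[pos] != item:
            result.insert(pos, item)
    return result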
merge_pairwise() explicitly implements the theoretically efficient algorithm, similar to what you outlined in your question:

def join_sorted(seq1, seq2):
    result = []
    i = j = 0
    len1, len2 = len(seq1), len(seq2)
    while i < len1 and j < len2:
        if seq1[i] < seq2[j]:
            result.append(seq1[i])
            i += 1
        elif seq1[i] > seq2[j]:
            result.append(seq2[j])
            j += 1
        else:  # seq1[i] == seq2[j]
            result.append(seq1[i])
            i += 1
            j += 1
    if i < len1:
        result.extend(seq1[i:])
    elif j < len2:
        result.extend(seq2[j:])
    return result

def merge_pairwise(seqs):
    result = []
    for seq in seqs:
        result = join_sorted(result, seq)
    return result
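As a side note, merge_pairwise() can be written more compactly with functools.reduce(); this equivalent formulation is just illustrative, not part of the benchmarks:

import functools

def merge_pairwise_reduce(seqs):
    # fold join_sorted() over the sequences, starting from an empty list
    return functools.reduce(join_sorted, seqs, [])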
merge_loop() implements a generalization of the above, where a single pass is now made over all the sequences at once, instead of merging them pairwise. (Note that the linear min() scan over the current heads costs O(m) per output item, where m is the number of sequences, whereas a priority queue would cost O(log m).)

def merge_loop(seqs):
    result = []
    lengths = list(map(len, seqs))
    idxs = [0] * len(seqs)
    while any(idx < length for idx, length in zip(idxs, lengths)):
        # smallest of the current head elements of all non-exhausted sequences
        item = min(
            seq[idx]
            for idx, seq, length in zip(idxs, seqs, lengths) if idx < length)
        result.append(item)
        # advance every sequence whose head equals the chosen item
        for i, (idx, seq, length) in enumerate(zip(idxs, seqs, lengths)):
            if idx < length and seq[idx] == item:
                idxs[i] += 1
    return result
By generating the input using:

import random

def gen_input(n, m=100, a=None, b=None):
    if a is None and b is None:
        b = 2 * n * m
        a = -b
    return tuple(
        tuple(sorted(set(random.randint(int(a), int(b)) for _ in range(n))))
        for __ in range(m))

one can plot the timings for varying n:
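The plots themselves are not reproduced here, but a minimal timing harness along these lines could be used to generate the numbers (using the names defined above; the exact benchmarking setup of the original measurements may differ):

import timeit

funcs = (merge_sorted, merge_heapq, merge_bisect_set, merge_pairwise, merge_loop)
for n in (10, 100, 1000):
    seqs = gen_input(n)
    for func in funcs:
        elapsed = timeit.timeit(lambda: func(seqs), number=10)
        print(f'{func.__name__}  n={n}: {elapsed:.4f}s')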
Note that, in general, performance will vary for different values of n (the size of each sequence) and m (the number of sequences), but also for different values of a and b (the minimum and maximum numbers generated).
For brevity, this was not explored in this answer, but feel free to play around with it here, which also includes some other implementations, notably some tentative speed-ups with Cython that were only partially successful.