Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently remove partial duplicates in a list of tuples

I have a list of tuples, the list can vary in length between ~8 - 1000 depending on the length of the tuples. Each tuple in the list is unique. A tuple is of length N where each entry is a generic word.

An example tuple can be of length N (Word 1, Word 2, Word 3, ..., Word N)

For any tuple in the list, element j in said tuple will either be '' or Word j

A very simplified example with alphabetic letters would be

l = [('A', 'B', '', ''), ('A', 'B', 'C', ''), 
     ('', '', '', 'D'), ('A', '', '', 'D'), 
     ('', 'B', '', '')]

Every position at each tuple will either have the same value or be empty. I want to remove all the tuples which have all their non '' values in another tuple at the same position. As an example, (A,B,'','') has all its non '' values in (A,B,C,'') and should therefore be removed.

filtered_l = [(A,B,C,''),(A,'','',D)]

The length of the tuples is always of the same length (not necessarily 4). The length of the tuples would be between 2-10.

What is the fastest way to do this?

like image 957
kspr Avatar asked Sep 29 '20 15:09

kspr


2 Answers

Let's conceptualize each tuple as a binary array, where 1 is "contains something" and 2 is "contains an empty string". Since the item at each position will be the same, we don't need to care what is at each position, only that something is.

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
l_bin = [sum(2**i if k else 0 for i,k in enumerate(tup)) for tup in l]
# [3, 7, 8, 9, 2]
# [0b0011, 0b0111, 0b1000, 0b1001, 0b0010]
# that it's backwards doesn't really matter, since it's consistent

Now, we can walk through that list and build a new datastructure without 'duplicates'. Since we have our tuples encoded as binary, we can determine a duplicate, 'encompassed' by another, by doing bitwise operations - given a and b, if a | b == a, then a must contain b.

codes = {}
for tup, b in zip(l, l_bin):
    # check if any existing code contains the potential new one
    # in this case, skip adding the new one
    if any(a | b == a for a in codes):
        continue
    # check if the new code contains a potential existing one or more
    # in which case, replace the existing code(s) with the new code
    for a in list(codes):
        if b | a == b:
            codes.pop(a)
    # and finally, add this code to our datastructure
    codes[b] = tup

Now we can withdraw our 'filtered' list of tuples:

output = list(codes.values())
# [('A', 'B', 'C', ''), ('A', '', '', 'D')]

Note that (A, B, C, '') contains both (A, B, '', '') and ('', B, '', ''), and that (A, '', '', D') contains ('', '', '', D), so this should be correct.

As of python 3.8, dict preserves insertion order, so the output should be in the same order that the tuples originally appeared in the list.

This solution wouldn't be perfectly efficient, since the number of codes might stack up, but it should be between O(n) and O(n^2), depending on the number of unique codes left at the end (and since the length of each tuple is significantly less than the length of l, it should be closer to O(n) than to O(n^2).

like image 163
Green Cloak Guy Avatar answered Nov 09 '22 02:11

Green Cloak Guy


For that limit in particular, the obvious solution would be to convert each tuple to bit mask, accumulate them in a counter array, perform subset sum transformation, then filter the array l.

See detailed code explanation in the comment.

Time complexity is obviously n + m * 2^m, where n is the number of tuples and m is the length of each tuple. For n == 1000 and m == 10, this is obviously faster than n^2.

l = [('A','B','',''),('A','B','C',''),('','','','D'),('A','','','D'),('','B','','')]
# assumes that l is not empty. (to access l[0])
# The case where l is empty is trivial to handle.

def tuple_to_mask(tuple_):
    # convert the information whether each value in (tuple_) is empty to a bit mask
    # (1 is empty, 0 is not empty)
    return sum((value == '') << index for index, value in enumerate(tuple_))


count = [0] * (1 << len(l[0]))
for tuple_ in l:
    # tuple_ is a tuple.
    count[tuple_to_mask(tuple_)] += 1

# now count[mask] is the number of tuples in l with that mask

# transform the count array.
for dimension in range(len(l[0])):
    for mask in range(len(count)):
        if mask >> dimension & 1:
            count[mask] += count[mask - (1 << dimension)]

# now count[mask] is the number of tuples in l with a mask (mask_) such that (mask) contains (mask_)
# (i.e. all the bits that are set in mask_ are also set in mask)


filtered_l = [tuple_ for tuple_ in l if count[tuple_to_mask(tuple_)] == 1]
print(filtered_l)
like image 5
user202729 Avatar answered Nov 09 '22 03:11

user202729