Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Removing duplicates from a list of lists based on a comparison of an element of the inner lists

Tags:

python

I have a large list of lists and need to remove duplicate elements based on specific criteria:

  1. Uniqueness is determined by the first element of the lists.
  2. Removal of duplicates is determined by comparing the value of the second element of the duplicate lists, namely keep the list with the lowest second element.

[[1, 4, 5], [1, 3, 4], [1, 2, 3]]

All the above lists are considered duplicates since their first elements are equal. The third list needs to be kept since it's second element is the smallest. Note the actual list of lists has over 4 million elements, is double sorted and ordering needs to be preserved.

The list is first sorted based on the second element of the inner lists and in reverse (descending) order, followed by normal (ascending) order based on the first element:

sorted(sorted(the_list, key=itemgetter(1), reverse=True), key=itemgetter(0))

An example of three duplicate lists in their actual ordering:

[...
[33554432, 50331647, 1695008306],
[33554432, 34603007, 1904606324],
[33554432, 33554687, 2208089473],
...]

The goal is to prepare the list for bisect searching. Can someone provide me with insight on how this might be achieved using Python?

like image 694
Aaron Avatar asked Dec 17 '15 12:12

Aaron


1 Answers

You can group the elements using a dict, always keeping the sublist with the smaller second element:

l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = {}
for sub in l:
    k = sub[0]
    if k not in d or sub[1] < d[k][1]:
        d[k] = sub

Also you can pass two keys to sorted, you don't need to call sorted twice:

In [3]:  l = [[1,4,6,2],[2,2,4,6],[1,2,4,5]]
In [4]: sorted(l,key=lambda x: (-x[1],x[0]))
Out[4]: [[1, 4, 6, 2], [1, 2, 4, 5], [2, 2, 4, 6]]

If you wanted to maintain order in the dict as per ordering needs to be preserved.:

from collections import OrderedDict

l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = OrderedDict()
for sub in l:
    k = sub[0]
    if k not in d or sub[1] < d[k][1]:
        d[sub[0]] = sub

But not sure how that fits as you are sorting the data after so you will lose any order.

What you may find very useful is a sortedcontainers.sorteddict:

A SortedDict provides the same methods as a dict. Additionally, a SortedDict efficiently maintains its keys in sorted order. Consequently, the keys method will return the keys in sorted order, the popitem method will remove the item with the highest key, etc.

An optional key argument defines a callable that, like the key argument to Python’s sorted function, extracts a comparison key from each dict key. If no function is specified, the default compares the dict keys directly. The key argument must be provided as a positional argument and must come before all other arguments.

from sortedcontainers import SortedDict

l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = SortedDict()
for sub in l:
    k = sub[0]
    if k not in d or sub[1] < d[k][1]:
        d[k] = sub


print(list(d.values()))

It has all the methods you want bisect, bisect_left etc..

like image 146
Padraic Cunningham Avatar answered Sep 22 '22 03:09

Padraic Cunningham