I have a large list of lists and need to remove duplicate elements based on specific criteria:
[[1, 4, 5], [1, 3, 4], [1, 2, 3]]
All the above lists are considered duplicates since their first elements are equal. The third list needs to be kept since it's second element is the smallest. Note the actual list of lists has over 4 million elements, is double sorted and ordering needs to be preserved.
The list is first sorted based on the second element of the inner lists and in reverse (descending) order, followed by normal (ascending) order based on the first element:
sorted(sorted(the_list, key=itemgetter(1), reverse=True), key=itemgetter(0))
An example of three duplicate lists in their actual ordering:
[...
[33554432, 50331647, 1695008306],
[33554432, 34603007, 1904606324],
[33554432, 33554687, 2208089473],
...]
The goal is to prepare the list for bisect searching. Can someone provide me with insight on how this might be achieved using Python?
You can group the elements using a dict, always keeping the sublist with the smaller second element:
l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = {}
for sub in l:
k = sub[0]
if k not in d or sub[1] < d[k][1]:
d[k] = sub
Also you can pass two keys to sorted, you don't need to call sorted twice:
In [3]: l = [[1,4,6,2],[2,2,4,6],[1,2,4,5]]
In [4]: sorted(l,key=lambda x: (-x[1],x[0]))
Out[4]: [[1, 4, 6, 2], [1, 2, 4, 5], [2, 2, 4, 6]]
If you wanted to maintain order in the dict as per ordering needs to be preserved.:
from collections import OrderedDict
l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = OrderedDict()
for sub in l:
k = sub[0]
if k not in d or sub[1] < d[k][1]:
d[sub[0]] = sub
But not sure how that fits as you are sorting the data after so you will lose any order.
What you may find very useful is a sortedcontainers.sorteddict:
A SortedDict provides the same methods as a dict. Additionally, a SortedDict efficiently maintains its keys in sorted order. Consequently, the keys method will return the keys in sorted order, the popitem method will remove the item with the highest key, etc.
An optional key argument defines a callable that, like the key argument to Python’s sorted function, extracts a comparison key from each dict key. If no function is specified, the default compares the dict keys directly. The key argument must be provided as a positional argument and must come before all other arguments.
from sortedcontainers import SortedDict
l = [[1, 2, 3], [1, 3, 4], [1, 4, 5], [2, 4, 3], [2, 5, 6], [2, 1, 3]]
d = SortedDict()
for sub in l:
k = sub[0]
if k not in d or sub[1] < d[k][1]:
d[k] = sub
print(list(d.values()))
It has all the methods you want bisect, bisect_left etc..
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With