Filtering list of tuples based on condition

Tags:

python-3.x

Given a list of tuples, if several tuples share the same first element, keep only the one among them with the largest last element.

For example:

sample_list = [(5,16,2),(5,10,3),(5,8,1),(21,24,1)]

In sample_list above, the first three tuples share the same first element, 5, so only the second of them should be retained, since it has the largest last element (3).

Expected output:

op = [(5,10,3),(21,24,1)]

Code:

op = []
for m in range(len(sample_list)):
    li = [sample_list[m]]
    for n in range(len(sample_list)):
        if(sample_list[m][0] == sample_list[n][0]
           and sample_list[m][2] != sample_list[n][2]):
            li.append(sample_list[n])
    op.append(sorted(li,key=lambda dd:dd[2],reverse=True)[0])

print (list(set(op)))

This works, but it is very slow for long lists. Is there a more Pythonic or more efficient way to do this?

asked Sep 02 '21 06:09

4 Answers

TL;DR

Using collections.defaultdict is the fastest alternative and arguably the most Pythonic:

from collections import defaultdict

sample_list = [(5, 16, 2), (5, 10, 3), (5, 8, 1), (21, 24, 1)]

d = defaultdict(lambda: (0, 0, float("-inf")))
for e in sample_list:
    first, _, last = e
    if d[first][2] < last:
        d[first] = e

res = [*d.values()]
print(res)

Output

[(5, 10, 3), (21, 24, 1)]

This is a single O(n) pass over the list, which is not only asymptotically optimal but also fast in practice.
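One way to gain confidence that the single-pass version agrees with the nested-loop code from the question is to compare the two on random input. The following is only a sketch: the function names `nested_loop` and `single_pass` are hypothetical, and the random data is built with unique last elements so that ties (which the two versions treat differently) cannot occur.

```python
import random
from collections import defaultdict


def nested_loop(lst):
    # The question's O(n^2) approach, wrapped in a function.
    op = []
    for m in range(len(lst)):
        li = [lst[m]]
        for n in range(len(lst)):
            if lst[m][0] == lst[n][0] and lst[m][2] != lst[n][2]:
                li.append(lst[n])
        op.append(sorted(li, key=lambda dd: dd[2], reverse=True)[0])
    return set(op)


def single_pass(lst):
    # The defaultdict approach: one pass, one dict entry per first element.
    d = defaultdict(lambda: (0, 0, float("-inf")))
    for e in lst:
        first, _, last = e
        if d[first][2] < last:
            d[first] = e
    return set(d.values())


random.seed(0)
# Assumption: last elements are unique, so no ties can occur.
lasts = random.sample(range(10_000), 300)
data = [(random.randrange(20), random.randrange(100), last) for last in lasts]

print(nested_loop(data) == single_pass(data))
```

Results are compared as sets because the two versions do not promise the same output order.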

Detailed Explanation

Performance

To show that it is performant, one could design an experiment around the two main variables of the problem, the number of unique keys (values in the first position of the tuple) and the length of the input list, and compare the following alternative approaches:

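from collections import defaultdict
from itertools import groupby
from operator import itemgetter as ig


def defaultdict_max_approach(lst):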
    d = defaultdict(lambda: (0, 0, float("-inf")))
    for e in lst:
        first, _, last = e
        if d[first][2] < last:
            d[first] = e
    return [*d.values()]


def dict_max_approach(lst):
    # https://stackoverflow.com/a/69025193/4001592
    d = {}
    for tpl in lst:
        first, *_, last = tpl
        if first not in d or last > d[first][-1]:
            d[first] = tpl

    return [*d.values()]


def groupby_max_approach(lst):
    # https://stackoverflow.com/a/69025193/4001592
    return [max(g, key=ig(-1)) for _, g in groupby(sorted(lst), key=ig(0))]

As shown in the plots below, the approach using defaultdict is the fastest method for a varying number of unique keys (500, 1000, 5000, 10000) and also for collections of up to 1,000,000 elements (note that the x-axis is in thousands).

[Figure: Experiments]

The above experiments are in concordance with experiments done by others (1, 2). The code for reproducing the experiments can be found here.
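For readers who want to run a rough comparison themselves, a timing harness along these lines is one possibility. It is a sketch, not the benchmark code referred to above: `make_data` and `benchmark` are hypothetical helpers, and the default sizes are arbitrary assumptions chosen to finish quickly.

```python
import random
import timeit
from collections import defaultdict
from itertools import groupby
from operator import itemgetter as ig


def defaultdict_max_approach(lst):
    d = defaultdict(lambda: (0, 0, float("-inf")))
    for e in lst:
        first, _, last = e
        if d[first][2] < last:
            d[first] = e
    return [*d.values()]


def dict_max_approach(lst):
    d = {}
    for tpl in lst:
        first, *_, last = tpl
        if first not in d or last > d[first][-1]:
            d[first] = tpl
    return [*d.values()]


def groupby_max_approach(lst):
    return [max(g, key=ig(-1)) for _, g in groupby(sorted(lst), key=ig(0))]


def make_data(n_items, n_keys, seed=0):
    # Hypothetical generator: random key, random middle value, random float last.
    rng = random.Random(seed)
    return [(rng.randrange(n_keys), rng.randrange(1000), rng.random())
            for _ in range(n_items)]


def benchmark(n_items=20_000, n_keys=500, repeat=3):
    # Best-of-`repeat` wall-clock time in seconds for each approach.
    data = make_data(n_items, n_keys)
    results = {}
    for fn in (defaultdict_max_approach, dict_max_approach, groupby_max_approach):
        results[fn.__name__] = min(
            timeit.repeat(lambda: fn(data), number=1, repeat=repeat)
        )
    return results


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name}: {seconds:.4f} s")
```

Varying `n_items` and `n_keys` covers the two variables discussed above. Absolute timings depend on the machine and Python version.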

Pythonic

Stating that it is the most Pythonic is subjective, but here are the main arguments in favor:

It is a well-known Python idiom

Using a defaultdict for grouping a sequence of key-value pairs, and aggregating afterwards, is a well-known Python idiom. Read the defaultdict examples in the Python documentation.

In the PyCon 2013 talk Transforming Code into Beautiful, Idiomatic Python, Raymond Hettinger also says that using defaultdict for such operations is the better way.

It is compliant with the Zen of Python

In the Zen of Python it can be read that

Flat is better than nested.
Sparse is better than dense.

Using a defaultdict is as flat as using a plain dict: only a for-loop and a simple if statement. In the case of defaultdict the if condition is even simpler.

Both solutions are sparser than using itertools.groupby. Note that the groupby approach also involves calling sorted, itemgetter and max, all inside a list comprehension.

Original Answer

You could use a collections.defaultdict to group tuples that have the same first element and then take the maximum of each group based on the third:

from collections import defaultdict

sample_list = [(5,16,2),(5,10,3),(5,8,1),(21,24,1)]

d = defaultdict(list)
for e in sample_list:
    d[e[0]].append(e)

res = [max(val, key=lambda x: x[2]) for val in d.values()]
print(res)

Output

[(5, 10, 3), (21, 24, 1)]

This approach is O(n).

answered Oct 19 '22 01:10

Dani Mesejo

Try itertools.groupby:

from itertools import groupby
sample_list.sort()
print([max(l, key=lambda x: x[-1]) for _, l in groupby(sample_list, key=lambda x: x[0])])

Or also with operator.itemgetter:

from itertools import groupby
from operator import itemgetter
sample_list.sort()
print([max(l, key=itemgetter(-1)) for _, l in groupby(sample_list, key=itemgetter(0))])

For performance try:

from operator import itemgetter
dct = {}
for i in sample_list:
    if i[0] in dct:
        dct[i[0]].append(i)
    else:
        dct[i[0]] = [i]
print([max(v, key=itemgetter(-1)) for v in dct.values()])

All of these print:

[(5, 10, 3), (21, 24, 1)]
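The sort before groupby matters because itertools.groupby only groups consecutive items with equal keys. A small illustration, where `unsorted_list` is a made-up input in which the tuples starting with 5 are not adjacent:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input: the key 5 appears before and after the key 21.
unsorted_list = [(5, 16, 2), (21, 24, 1), (5, 10, 3), (5, 8, 1)]

# Without sorting, the key 5 is split into two separate groups.
without_sort = [max(g, key=itemgetter(-1))
                for _, g in groupby(unsorted_list, key=itemgetter(0))]

# With sorting, all tuples sharing a first element end up adjacent.
with_sort = [max(g, key=itemgetter(-1))
             for _, g in groupby(sorted(unsorted_list), key=itemgetter(0))]

print(without_sort)  # [(5, 16, 2), (21, 24, 1), (5, 10, 3)]
print(with_sort)     # [(5, 10, 3), (21, 24, 1)]
```

The sort also makes the groupby variants O(n log n) rather than O(n), which is why the dict-based version is suggested for performance.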

answered Oct 19 '22 00:10

U12-Forward

Use itertools.groupby and operator.itemgetter for readability. Within the groups, apply max with an appropriate key function, again using itemgetter for brevity:

from itertools import groupby
from operator import itemgetter as ig

lst = [(5, 10, 3), (21, 24, 1), (5, 8, 1), (5, 16, 2)]

[max(g, key=ig(-1)) for _, g in groupby(sorted(lst), key=ig(0))]
# [(5, 10, 3), (21, 24, 1)]

For a linear-time solution, with extra space bounded only by the number of unique first elements, you may use a dict:

d = {}
for tpl in lst:
    first, *_, last = tpl
    if first not in d or last > d[first][-1]:
        d[first] = tpl

[*d.values()]
# [(5, 10, 3), (21, 24, 1)]
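Because the loop unpacks with `first, *_, last`, it does not depend on the tuples having exactly three elements. The sketch below wraps the loop in a function (the name `max_last_per_first` and the `records` data are made up for illustration) and feeds it tuples of different lengths; each tuple needs at least two elements for the unpacking to succeed.

```python
def max_last_per_first(lst):
    # Keep, for each first element, the tuple with the largest last element.
    d = {}
    for tpl in lst:
        first, *_, last = tpl  # works for any tuple of length >= 2
        if first not in d or last > d[first][-1]:
            d[first] = tpl
    return [*d.values()]


# Hypothetical records of varying length.
records = [("a", 1), ("a", "x", "y", 7), ("b", 0, 3), ("a", 2, 5)]

print(max_last_per_first(records))
# [('a', 'x', 'y', 7), ('b', 0, 3)]
```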

answered Oct 19 '22 00:10

user2390182

Here is a linear-time method which I think qualifies as more Pythonic:

highest = dict()
for a, b, c in sample_list:
    if a not in highest or c >= highest[a][2]:
        highest[a] = (a, b, c)
op = list(highest.values())

The choice between >= and > only matters when several triples share the same first and last elements but differ in the middle element: with >= the last such triple in the list is kept, and with > the first one is kept.
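The effect of the comparison operator on ties can be seen with a small experiment. This is a sketch: the function `highest_by_last` and its `keep_last_on_tie` flag are hypothetical, introduced only to switch between the `>=` and `>` behaviors.

```python
def highest_by_last(triples, keep_last_on_tie=True):
    # keep_last_on_tie=True mimics `>=`; False mimics `>`.
    highest = {}
    for a, b, c in triples:
        if a not in highest:
            highest[a] = (a, b, c)
        elif c > highest[a][2] or (keep_last_on_tie and c == highest[a][2]):
            highest[a] = (a, b, c)
    return list(highest.values())


# Two triples tie on first element 5 and last element 3.
tied = [(5, 10, 3), (5, 99, 3), (21, 24, 1)]

print(highest_by_last(tied, keep_last_on_tie=True))   # [(5, 99, 3), (21, 24, 1)]
print(highest_by_last(tied, keep_last_on_tie=False))  # [(5, 10, 3), (21, 24, 1)]
```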

As pointed out by @AlexWaygood, dicts have been guaranteed to yield their elements in insertion order since Python 3.7. The code above therefore produces the elements of op in the order in which their first elements first appear in sample_list.

In Python 3.6 or older, on the other hand, that order is not guaranteed by the language (CPython 3.6 preserves it only as an implementation detail). If you want a solution that also works in those versions, you will need to use an OrderedDict, as in:

from collections import OrderedDict

highest = OrderedDict()
for a, b, c in sample_list:
    if a not in highest or c >= highest[a][2]:
        highest[a] = (a, b, c)
op = list(highest.values())
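To make the ordering behavior concrete: assigning a new value to an existing dict key does not move that key, so the result is ordered by where each first element first appears, not by where the winning triple sits in the input. A small sketch on Python 3.7+, using a made-up list named `sample`:

```python
# Hypothetical input: the winning triples are the 3rd and 4th items.
sample = [(21, 24, 1), (5, 16, 2), (21, 1, 9), (5, 10, 3)]

highest = {}
for a, b, c in sample:
    if a not in highest or c >= highest[a][2]:
        highest[a] = (a, b, c)  # replacing a value keeps the key's position

op = list(highest.values())
print(op)  # [(21, 1, 9), (5, 10, 3)]
```

The key 21 stays in first position because it was inserted first, even though its value was replaced later.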

answered Oct 19 '22 00:10

Mathias
