Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filtering list of tuples based on condition

For a given list of tuples, if multiple tuples in the list have the first element of tuple the same - among them select only the tuple with the maximum last element.

For example:

sample_list = [(5,16,2),(5,10,3),(5,8,1),(21,24,1)]

In the sample_list above since the first 3 tuples has the similar first element 5 in this case among them only the 2nd tuple should be retained since it has the max last element => 3.

Expected op:

op = [(5,10,3),(21,24,1)]

Code:

op = []
for m in range(len(sample_list)):
    li = [sample_list[m]]
    for n in range(len(sample_list)):
        if(sample_list[m][0] == sample_list[n][0]
           and sample_list[m][2] != sample_list[n][2]):
            li.append(sample_list[n])
    op.append(sorted(li,key=lambda dd:dd[2],reverse=True)[0])

print (list(set(op)))

This works. But it is very slow for long list. Is there a more pythonic or efficient way to do this?

like image 509
data_person Avatar asked Sep 02 '21 06:09

data_person


People also ask

How do you filter a tuple element in Python?

To filter items of a Tuple in Python, call filter() builtin function and pass the filtering function and tuple as arguments. The filtering function should take an argument and return True if the item has to be in the result, or False if the item has to be not in the result.

How do you filter items in a list in Python?

Python has a built-in function called filter() that allows you to filter a list (or a tuple) in a more beautiful way. The filter() function iterates over the elements of the list and applies the fn() function to each element. It returns an iterator for the elements where the fn() returns True .

How to filter a list of tuples in Python?

The third technique we are going to use is by using the set () function. Set () functions help us in filtering the list of tuples. We pass a list to this function and get the desired results. view source print?

How to select only the tuple with the maximum last element?

For a given list of tuples, if multiple tuples in the list have the first element of tuple the same - among them select only the tuple with the maximum last element. In the sample_list above since the first 3 tuples has the similar first element 5 in this case among them only the 2nd tuple should be retained since it has the max last element => 3.

How do I filter a list based on a condition?

You want to filter based on some condition, then display a representation of those items. There are a few ways to do this. List comprehension with filtering. This is usually considered idiomatic or “pythonic” Filter and map. This is lazy and useful if your list is very large and you don’t want to hold it all in memory at once.

What is the difference between list and tuple in Python?

A list can be used to store heterogeneous values (i.e data of any data type like integer, floating point, strings, and so on). A list of tuple basically contains tuples enclosed in a list.


Video Answer


4 Answers

TL;DR

Use collections.defaultdict is the fastest alternative and arguably the most pythonic:

from collections import defaultdict

sample_list = [(5, 16, 2), (5, 10, 3), (5, 8, 1), (21, 24, 1)]

d = defaultdict(lambda: (0, 0, float("-inf")))
for e in sample_list:
    first, _, last = e
    if d[first][2] < last:
        d[first] = e

res = [*d.values()]
print(res)

Output

[(5, 10, 3), (21, 24, 1)]

This is a single pass O(n) which is not only asymptotically optimal but also performant in practice.

Detailed Explanation

Performance

To show that is performant one could design an experiment considering the two main variables of the problem, the number of unique keys (values in the firs position of the tuple) and the length of the input list and the following alternatives approaches:

def defaultdict_max_approach(lst):
    d = defaultdict(lambda: (0, 0, float("-inf")))
    for e in lst:
        first, _, last = e
        if d[first][2] < last:
            d[first] = e
    return [*d.values()]


def dict_max_approach(lst):
    # https://stackoverflow.com/a/69025193/4001592
    d = {}
    for tpl in lst:
        first, *_, last = tpl
        if first not in d or last > d[first][-1]:
            d[first] = tpl

    return [*d.values()]


def groupby_max_approach(lst):
    # https://stackoverflow.com/a/69025193/4001592
    return [max(g, key=ig(-1)) for _, g in groupby(sorted(lst), key=ig(0))]  

As shown in the plots below the approach using defaultdict is the most performant method for a varying number of unique keys (500, 1000, 5000, 10000) and also for collections up to 1000000 elements (note that the x axis in is in thousands).

Experiments

The above experiments are in concordance with experiments done by others (1, 2). The code for reproducing the experiments can be found here.

Pythonic

Stating that is the most pythonic is subjective, but here are the main arguments in favor:

Is a well known Python idiom

Using a defaultdict for grouping a sequence key-value pairs, and aggregating afterwards, is a well known Python idiom. Read the defaultdict examples in the Python documentation.

In the PyCon 2013 talk Transforming Code into Beautiful, Idiomatic Python by Raymond Hettinger also says that using defaultdict for such operations is the better way.

Is compliant with the Zen of Python

In the Zen of Python it can be read that

Flat is better than nested.
Sparse is better than dense.

Using a defaultdict is as flat as using a plain dict only a for-loop and a simple if statement. In the case of defaultdict the if condition is even simpler.

Both solutions are sparser than using itertools.groupby, notice this approach also involves calling sorted, itemgetter and max all inside a list comprehension.

Original Answer

You could use a collections.defaultdict to group tuples that have the same first element and then take the maximum of each group based on the third:

from collections import defaultdict

sample_list = [(5,16,2),(5,10,3),(5,8,1),(21,24,1)]

d = defaultdict(list)
for e in sample_list:
    d[e[0]].append(e)

res = [max(val, key=lambda x: x[2]) for val in d.values()]
print(res)

Output

[(5, 10, 3), (21, 24, 1)]

This approach is O(n).

like image 113
Dani Mesejo Avatar answered Oct 19 '22 01:10

Dani Mesejo


Try itertools.groupby:

from itertools import groupby
sample_list.sort()
print([max(l, key=lambda x: x[-1]) for _, l in groupby(sample_list, key=lambda x: x[0])])

Or also with operator.itemgetter:

from itertools import groupby
from operator import itemgetter
sample_list.sort()
print([max(l, key=itemgetter(-1)) for _, l in groupby(sample_list, key=itemgetter(0))])

For performance try:

from operator import itemgetter
dct = {}
for i in sample_list:
    if i[0] in dct:
        dct[i[0]].append(i)
    else:
        dct[i[0]] = [i]
print([max(v, key=itemgetter(-1)) for v in dct.values()])

All output:

[(5, 10, 3), (21, 24, 1)]
like image 33
U12-Forward Avatar answered Oct 19 '22 00:10

U12-Forward


Use itertools.groupby and operator.itemgetter for readability. Within the groups, apply max with an appropriate key function, again using itemgetter for brevity:

from itertools import groupby
from operator import itemgetter as ig

lst = [(5, 10, 3), (21, 24, 1), (5, 8, 1), (5, 16, 2)]

[max(g, key=ig(-1)) for _, g in groupby(sorted(lst), key=ig(0))]
# [(5, 10, 3), (21, 24, 1)]

For a linear-time solution, with extra-space only bound the number of unique first elements, you may use a dict:

d = {}
for tpl in lst:
    first, *_, last = tpl
    if first not in d or last > d[first][-1]:
        d[first] = tpl

[*d.values()]
# [(5, 10, 3), (21, 24, 1)]
like image 23
user2390182 Avatar answered Oct 19 '22 00:10

user2390182


Here is a linear-time method which I think qualifies as more Pythonic:

highest = dict()
for a, b, c in sample_list:
     if a not in highest or c >= highest[a][2]:
         highest[a] = (a, b, c)
op = list(highest.values())

You can change the >= to > if you care about how to choose between triples with the same first and last elements but different middle elements.

As pointed out by @AlexWaygood, dicts have yielded their elements according to insertion order since Python 3.7. The code above therefore causes the elements of op to be in the same order the elements of sample_list.

In Python 3.6 or older, on the other hand, the order may change. If you want a solution that works in Python 3.6 too, you will need to use an OrderedDict, as in:

from collections import OrderedDict

highest = OrderedDict()
for a, b, c in sample_list:
     if a not in highest or c >= highest[a][2]:
         highest[a] = (a, b, c)
op = list(highest.values())
like image 1
Mathias Avatar answered Oct 19 '22 00:10

Mathias