I am working in Python (2.7.9) and am trying to filter a list of tuples by a list of elements of those tuples. In particular, my objects have the following form:
tuples = [('a', ['a1', 'a2']), ('b',['b1', 'b2']), ('c',['c1', 'c2'])]
filter = ['a', 'c']
I am new to Python and the easiest way to filter the tuples that I could discover was with the following list comprehension:
tuples_filtered = [(x,y) for (x,y) in tuples if x in filter]
The resulting filtered list looks like:
tuples_filtered = [('a', ['a1', 'a2']), ('c',['c1', 'c2'])]
Unfortunately, this list comprehension seems to be very inefficient. I suspect this is because my list of tuples is much larger than my filter, the list of strings. In particular, the filter list contains 30,000 words and the list of tuples contains about 134,000 2-tuples.
The first elements of the 2-tuples are largely distinct, but there are a few instances of duplicate first elements (not sure how many, actually, but by comparison to the cardinality of the list it's not many).
My question: Is there a more efficient way to filter a list of tuples by a list of elements of those tuples?
(Apologies if this is off-topic or a dupe.)
Related question (which does not mention efficiency):
Filter a list of lists of tuples
To filter a list of tuples in Python: Use the filter() function to filter the list. The filter function returns an iterator containing the results. Pass the filter object to the list() class to convert it to a list.
In the majority of programming languages when you need to access a nested data type (such as arrays, lists, or tuples), you append the brackets to get to the innermost item. The first bracket gives you the location of the tuple in your list. The second bracket gives you the location of the item in the tuple.
Python has a built-in function called filter() that allows you to filter a list (or a tuple) in a more beautiful way. The filter() function iterates over the elements of the list and applies the fn() function to each element. It returns an iterator for the elements where the fn() returns True .
Use the key argument of the sorted() function to sort a list of tuples by the second element, e.g. sorted_list = sorted(list_of_tuples, key=lambda t: t[1]) . The function will return a new list, sorted by the second tuple element.
In a comment you write:
The filter list contains 30,000 words and the list of tuples contains about 134,000 2-tuples.
in
containment tests against a list takes O(N) linear time, which is slow when you do this 134k times. Each time you have to iterate over all those elements to find a match. Given that you are filtering, not all those first elements are going to be present in the 30k list, so you are executing up to 30k * 134k == 4 billion comparisons.
Use a set instead:
filter_set = set(filter)
Set containment tests are O(1) constant time; now you reduced your problem to 134k tests.
A much smaller component of time you can avoid spending is the tuple assignment; use indexing to extract just the one element you are testing with:
tuples_filtered = [tup for tup in tuples if tup[0] in filter_set]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With