The input is an unsorted list of tuples:
x = [('herr', 1),
('dapao', 1),
('cino', 1),
('o', 38),
('tiao', 2),
('tut', 1),
('poh', 6),
('micheal', 1),
('orh', 1),
('horlick', 3),
('si', 1),
('tai', 1),
('titlo', 1),
('siew', 17),
('da', 1),
('halia', 2)]
The goal is to find the last n keys with the least counts, i.e. the desired output:
['orh', 'si', 'tai', 'titlo', 'da']
I've tried doing this by slicing the last n items, i.e. [-n:], off the list of tuples returned by Counter.most_common() and converting that back to a dict, i.e.:
n = 5
list(dict(Counter(dict(x)).most_common()[-n:]).keys())
Is there a less convoluted way to get the same output?
I could also do this:
from operator import itemgetter
output, *_ = zip(*sorted(x, key=itemgetter(1))[n:])
list(output)
But now I've merely swapped out Counter.most_common for sorted and itemgetter. Then I would still need zip(*list) to extract the keys by unpacking the first value from each tuple after the zip. There must be a simpler way.
Note that the question is not asking to sort; it's to extract the first elements from the original list of tuples, where the criterion for extraction is the last n items with the lowest value in the second element.
The answers from the linked possible duplicate still require unpacking the list of sorted tuples and then extracting the top n first elements.
The goal is to find the last n keys with the least counts

Given this definition of the goal, neither of your two solutions fits. In the one with Counter you use dict, which (at least before Python 3.7, where insertion order became guaranteed) makes the order of keys undefined, so you will not get the last keys but some n keys with the least value. The second solution has incorrect slicing, and once that is fixed it returns the first n keys with the least value.
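To see the slicing issue concretely, here is the asker's second attempt with [:n] in place of [n:] (a corrected sketch, not the asker's original code). It runs, but because sorted is stable it yields the first n keys with the least counts, not the last:

```python
from operator import itemgetter

x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38),
     ('tiao', 2), ('tut', 1), ('poh', 6), ('micheal', 1),
     ('orh', 1), ('horlick', 3), ('si', 1), ('tai', 1),
     ('titlo', 1), ('siew', 17), ('da', 1), ('halia', 2)]
n = 5

# [:n] (not [n:]) takes the n tuples with the smallest counts;
# the stable sort keeps ties in original order, so these are the
# *first* five keys with count 1, not the last.
output, *_ = zip(*sorted(x, key=itemgetter(1))[:n])
result = list(output)
print(result)  # ['herr', 'dapao', 'cino', 'tut', 'micheal']
```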
Taking into account that the sorted implementation is stable, it can be rewritten like this to fit the goal:
def author_2():
    output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
    return list(reversed(output))
But it's a better idea to use heapq, the stdlib tool for questions like "n smallest/greatest values from an iterable" (as Martijn Pieters pointed out, nlargest and nsmallest are also stable, and the docs do say that, though implicitly). It is especially worthwhile if the real list you have to deal with is big (for small n it should be faster than sorted, as the docs describe).
def prop_1():
    rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
    return [item[0] for item in rev_result][::-1]
You can improve performance even further, but at the cost of order (sorting stability), i.e. you get some n keys with the least value instead of the last n keys with the least value. To do that you need to keep a "heapified" list and use it as your internal data structure instead of a plain list (if you don't change the list and only need the bottom n once, this gives no performance benefit). You can push to and pop from the list, for example:
_p2_heap = None

def prop_2():
    global _p2_heap
    if not _p2_heap:
        _p2_heap = []
        for item in l:
            heapq.heappush(_p2_heap, item[::-1])
    return [item[1] for item in heapq.nsmallest(n, _p2_heap)][::-1]
Here's the complete module you can use to benchmark the solutions.
import heapq
from collections import Counter

l = [
    ('herr', 1), ('dapao', 1),
    ('cino', 1), ('o', 38),
    ('tiao', 2), ('tut', 1),
    ('poh', 6), ('micheal', 1),
    ('orh', 1), ('horlick', 3),
    ('si', 1), ('tai', 1),
    ('titlo', 1), ('siew', 17),
    ('da', 1), ('halia', 2)
]
n = 5

def author_1():
    return list(dict(Counter(dict(l)).most_common()[-n:]).keys())

def author_2():
    output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
    return list(reversed(output))

def prop_1():
    rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
    return [item[0] for item in rev_result][::-1]

_p2_heap = None

def prop_2():
    global _p2_heap
    if not _p2_heap:
        _p2_heap = []
        for item in l:
            heapq.heappush(_p2_heap, item[::-1])
    return [item[1] for item in heapq.nsmallest(n, _p2_heap)][::-1]
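As a quick sanity check (not part of the original benchmark module), the two order-preserving variants can be run standalone to confirm they produce the asker's desired output:

```python
import heapq

l = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38),
     ('tiao', 2), ('tut', 1), ('poh', 6), ('micheal', 1),
     ('orh', 1), ('horlick', 3), ('si', 1), ('tai', 1),
     ('titlo', 1), ('siew', 17), ('da', 1), ('halia', 2)]
n = 5
expected = ['orh', 'si', 'tai', 'titlo', 'da']

# author_2: stable sort over the reversed list, then reverse back.
output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
via_sorted = list(reversed(output))

# prop_1: heapq.nsmallest is stable in the same way.
rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
via_heapq = [item[0] for item in rev_result][::-1]

print(via_sorted == expected, via_heapq == expected)  # True True
```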
And here are the timeit results:
$ python -m timeit -s "import tst" "tst.author_1()"
100000 loops, best of 3: 7.72 usec per loop
$ python -m timeit -s "import tst" "tst.author_2()"
100000 loops, best of 3: 3.7 usec per loop
$ python -m timeit -s "import tst" "tst.prop_1()"
100000 loops, best of 3: 5.51 usec per loop
$ python -m timeit -s "import tst" "tst.prop_2()"
100000 loops, best of 3: 3.96 usec per loop
But if we make l = l * 1000, the difference becomes noticeable:
$ python -m timeit -s "import tst" "tst.author_1()"
1000 loops, best of 3: 263 usec per loop
$ python -m timeit -s "import tst" "tst.author_2()"
100 loops, best of 3: 2.72 msec per loop
$ python -m timeit -s "import tst" "tst.prop_1()"
1000 loops, best of 3: 1.65 msec per loop
$ python -m timeit -s "import tst" "tst.prop_2()"
1000 loops, best of 3: 767 usec per loop
Just use a heap, it will give you the desired output.
import heapq
x = [('herr', 1),
('dapao', 1),
('cino', 1),
('o', 38),
('tiao', 2),
('tut', 1),
('poh', 6),
('micheal', 1),
('orh', 1),
('horlick', 3),
('si', 1),
('tai', 1),
('titlo', 1),
('siew', 17),
('da', 1),
('halia', 2)]
heap = [(item[1], -index, item[0]) for index, item in enumerate(x)]
heapq.heapify(heap)
print(list(map(lambda item: item[2], heapq.nsmallest(5, heap)))[::-1])
heapq.nsmallest(n, iterable, key=None) also takes a key argument, which you could use instead of encoding -index into the tuples as I did.
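For example, a sketch of the same idea using the key argument on the enumerated list (the final [::-1] restores the original left-to-right order of the desired output):

```python
import heapq

x = [('herr', 1), ('dapao', 1), ('cino', 1), ('o', 38),
     ('tiao', 2), ('tut', 1), ('poh', 6), ('micheal', 1),
     ('orh', 1), ('horlick', 3), ('si', 1), ('tai', 1),
     ('titlo', 1), ('siew', 17), ('da', 1), ('halia', 2)]

# Key on (count, -index): smallest counts first, and later list
# positions win ties, so we get the *last* n keys with the least
# counts; reversing at the end restores original order.
pairs = heapq.nsmallest(5, enumerate(x),
                        key=lambda p: (p[1][1], -p[0]))
result = [item[0] for _, item in pairs][::-1]
print(result)  # ['orh', 'si', 'tai', 'titlo', 'da']
```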