Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting the keys of items with the least counts from a list of tuples of key-value pairs - Python

The input is an unsorted list of tuples:

x = [('herr', 1),
     ('dapao', 1),
     ('cino', 1),
     ('o', 38),
     ('tiao', 2),
     ('tut', 1),
     ('poh', 6),
     ('micheal', 1),
     ('orh', 1),
     ('horlick', 3),
     ('si', 1),
     ('tai', 1),
     ('titlo', 1),
     ('siew', 17),
     ('da', 1),
     ('halia', 2)]

The goal is to find the last n keys with the least counts, i.e. desired output:

['orh', 'si', 'tai', 'titlo', 'da']

I've tried doing this by:

  • first convert the list of tuples to a dict
  • cast the dict into a Counter
  • then find the [-n:] list of tuples from the Counter.most_common()
  • cast the list of tuples from the [-n:] to a dict
  • get the keys and then convert it into a list

i.e.

n = 5
list(dict(Counter(dict(x)).most_common()[-n:]).keys())

Is there a less convoluted way to get the same output?


I could also do this:

from operator import itemgetter
output, *_ = zip(*sorted(x, key=itemgetter(1))[n:])
list(output)

But now I've merely swapped out the Counter.most_common with sorted and itemgetter. Then I would still need to zip(*list) to extract the keys through unpacking the first value from each list of tuples after the zip.

There must be a simpler way.


NOTE

Note that the question is not asking to sort, it's to extract the list first element in the original list of tuples given. And the criterion to extract is based on the last nth items with the lowest value in the 2nd element.

The answers from the possible duplicate linked still requires the step to unpack the list of sorted tuples and and the extract the top nth of the list of first elements.

like image 291
alvas Avatar asked Feb 12 '18 11:02

alvas


2 Answers

The goal is to find the last n keys with the least counts

Given this goal's definition neither of your two solutions fit. In one with Counter you use dict which will make the order of keys undefined and you will not get the last keys, but some n keys with least value. The second solution has incorrect slicing, and if it's fixed it returns first n keys with least value.

Taking into account that sorted implementation is stable it can be rewritten like this to fit the goal:

def author_2():
    output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
    return list(reversed(output))

But it's a better idea to use heapq, which is the stdlib tool for questions like "n smallest/greatest values from an iterable" (as Martijn Pieters pointed out, nlargest and nsmallest are also stable and the docs really say that, but in implicit way). Especially if the real list you have to deal with is big (for small n it should be faster that sorted as docs describe).

def prop_1():
    rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
    return [item[0] for item in rev_result][::-1]

You can improve performance even further, but at the cost of order (sorting stability), i.e. some n keys with least value instead of last n keys with least value. To do that you need to keep a "heapified" list and use it as your internal data structure instead of plain list (if you don't change the list and need the bottom-n only once, it will not give a performance benefit). You can push and pop from the list, for example:

_p2_heap = None

def prop_2():
    global _p2_heap
    if not _p2_heap:
        _p2_heap = []
        for item in l:
            heapq.heappush(_p2_heap, item[::-1])

    return [item[1] for item in heapq.nsmallest(n, _p2_heap)]

Here's the complete module you can use to benchmark the solutions.

import heapq
from collections import Counter  

l = [
    ('herr', 1), ('dapao', 1),
    ('cino', 1), ('o', 38),
    ('tiao', 2), ('tut', 1),
    ('poh', 6), ('micheal', 1),
    ('orh', 1), ('horlick', 3),
    ('si', 1), ('tai', 1),
    ('titlo', 1), ('siew', 17),
    ('da', 1), ('halia', 2)
]
n = 5    

def author_1():
    return list(dict(Counter(dict(l)).most_common()[-n:]).keys())

def author_2():
    output, *_ = zip(*sorted(reversed(l), key=lambda v: v[1])[:n])
    return list(reversed(output))

def prop_1():
    rev_result = heapq.nsmallest(n, reversed(l), key=lambda v: v[1])
    return [item[0] for item in rev_result][::-1]

_p2_heap = None    
def prop_2():
    global _p2_heap
    if not _p2_heap:
        _p2_heap = []
        for item in l:
            heapq.heappush(_p2_heap, item[::-1])

    return [item[1] for item in heapq.nsmallest(n, _p2_heap)][::-1]

And here are the timeit results:

$ python -m timeit -s "import tst" "tst.author_1()"
100000 loops, best of 3: 7.72 usec per loop
$ python -m timeit -s "import tst" "tst.author_2()"
100000 loops, best of 3: 3.7 usec per loop
$ python -m timeit -s "import tst" "tst.prop_1()"
100000 loops, best of 3: 5.51 usec per loop
$ python -m timeit -s "import tst" "tst.prop_2()"
100000 loops, best of 3: 3.96 usec per loop

But if we make l = l * 1000 the difference will become noticeable:

$ python -m timeit -s "import tst" "tst.author_1()"
1000 loops, best of 3: 263 usec per loop
$ python -m timeit -s "import tst" "tst.author_2()"
100 loops, best of 3: 2.72 msec per loop
$ python -m timeit -s "import tst" "tst.prop_1()"
1000 loops, best of 3: 1.65 msec per loop
$ python -m timeit -s "import tst" "tst.prop_2()"
1000 loops, best of 3: 767 usec per loop
like image 99
saaj Avatar answered Nov 14 '22 14:11

saaj


Just use a heap, it will give you the desired output.

import heapq

x = [('herr', 1),
('dapao', 1),
('cino', 1),
('o', 38),
('tiao', 2),
('tut', 1),
('poh', 6),
('micheal', 1),
('orh', 1),
('horlick', 3),
('si', 1),
('tai', 1),
('titlo', 1),
('siew', 17),
('da', 1),
('halia', 2)]

heap = [(item[1],-index,item[0]) for index, item in enumerate(x)]
heapq.heapify(heap)

print(list(map(lambda item : item[2], heapq.nsmallest(5, heap))))

heapq.nsmallest(n, iterable, key=None)has a key argument,you can use it inside of using -index like me.

like image 20
obgnaw Avatar answered Nov 14 '22 15:11

obgnaw