Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using list.count to sort a list in-place using .sort() does not work. Why?

I am trying to sort a list by frequency of its elements.

>>> a = [5, 5, 4, 4, 4, 1, 2, 2]
>>> a.sort(key = a.count)
>>> a
[5, 5, 4, 4, 4, 1, 2, 2]

a is unchanged. However:

>>> sorted(a, key = a.count)
[1, 5, 5, 2, 2, 4, 4, 4]

Why does this method not work for .sort()?

like image 299
g999 Avatar asked Jun 20 '19 23:06

g999


2 Answers

What you see is the result of a certain CPython implementation detail of list.sort. Try this again, but create a copy of a first:

a.sort(key=a.copy().count)
a
# [1, 5, 5, 2, 2, 4, 4, 4]

.sort modifies a internally, so a.count is going to produce un-predictable results. This is documented as an implementation detail.

What copy call does is it creates a copy of a and uses that list's count method as the key. You can see what happens with some debug statements:

def count(x):
    print(a)
    return a.count(x)

a.sort(key=count)
[]
[]
[]
...

a turns up as an empty list when accessed inside .sort, and [].count(anything) will be 0. This explains why the output is the same as the input - the predicates are all the same (0).

OTOH, sorted creates a new list, so it doesn't have this problem.


If you really want to sort by frequency counts, the idiomatic method is to use a Counter:

from collections import Counter

a.sort(key=Counter(a).get)
a
# [1, 5, 5, 2, 2, 4, 4, 4]
like image 187
cs95 Avatar answered Oct 19 '22 11:10

cs95


It doesn't work with the list.sort method because CPython decides to "empty the list" temporarily (the other answer already presents this). This is mentioned in the documentation as implementation detail:

CPython implementation detail: While a list is being sorted, the effect of attempting to mutate, or even inspect, the list is undefined. The C implementation of Python makes the list appear empty for the duration, and raises ValueError if it can detect that the list has been mutated during a sort.

The source code contains a similar comment with a bit more explanation:

    /* The list is temporarily made empty, so that mutations performed
     * by comparison functions can't affect the slice of memory we're
     * sorting (allowing mutations during sorting is a core-dump
     * factory, since ob_item may change).
     */

The explanation isn't straight-forward but the problem is that the key-function and the comparisons could change the list instance during sorting which is very likely to result in undefined behavior of the C-code (which may crash the interpreter). To prevent that the list is emptied during the sorting, so that even if someone changes the instance it won't result in an interpreter crash.

This doesn't happen with sorted because sorted copies the list and simply sorts the copy. The copy is still emptied during the sorting but there's no way to access it, so it isn't visible.


However you really shouldn't sort like this to get a frequency sort. That's because for each item you call the key function once. And list.count iterates over each item, so you effectively iterate the whole list for each element (what is called O(n**2) complexity). A better way would be to calculate the frequency once for each element (can be done in O(n)) and then just access that in the key.

However since CPython has a Counter class that also supports most_common you could really just use that:

>>> from collections import Counter
>>> [item for item, count in reversed(Counter(a).most_common()) for _ in range(count)]
[1, 2, 2, 5, 5, 4, 4, 4]

This may change the order of the elements with equal counts but since you're doing a frequency count that shouldn't matter to much.

like image 9
MSeifert Avatar answered Oct 19 '22 13:10

MSeifert