Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python sort and sorted -- How are lists of lists sorted precisely?

What is the precise rule used in Python in order to sort lists where the elements are lists? Can this be expressed as a 'key' or 'cmp' function? The problems comes from the fact that there are two things to consider: length and values at their position.

sorted([
   [ 0, 1, 2, 3 ],  # 1st line: longer list
   [ 0, 1 ],        # 2nd line: shorter list
   [ 0, 2 ]         # 3rd line: suspected last
])

Is it safe to assume that the second line will sort before the first? Is it safe to assume that the third line will sort always last?

Note, this is not about stability! The specific case above behaves that way as described. But, can the rules there be considered to be general? What are the precise rules which python applies here?

Relying on the following definition Lexicographical Order (Thanks to Ashniwi):

To compare sequences of different lengths, the shorter sequence is usually padded at the end with enough "blanks" (a special symbol that is treated as smaller than every element of A). This way of comparing sequences of different lengths is always used in dictionaries. However, in combinatorics, another convention is frequently used, whereby a shorter sequence is always smaller than a longer sequence. This variant of the lexicographical order is sometimes called shortlex order.

Is Python using 'shortlex order'. Where is proof for that assumption, beyond practical examples?

like image 912
Frank-Rene Schäfer Avatar asked Oct 15 '25 16:10

Frank-Rene Schäfer


2 Answers

By default, sorted uses the __lt__ method of the items that are compared. And lists with comparable elements are compared lexicographically, according to the Python documentation. So yes, the language guarantees that in the shorter string will be sorted before the longer string.

like image 80
Ruud de Jong Avatar answered Oct 18 '25 06:10

Ruud de Jong


Quoting from docs:

In particular, tuples and lists are compared lexicographically by comparing corresponding elements. This means that to compare equal, every element must compare equal and the two sequences must be of the same type and have the same length.

Lexicographical comparison between built-in collections works as follows:

  • For two collections to compare equal, they must be of the same type, have the same length, and each pair of corresponding elements must compare equal (for example, [1,2] == (1,2) is false because the type is not the same).
  • Collections that support order comparison are ordered the same as their first unequal elements (for example, [1,2,x] <= [1,2,y] has the same value as x <= y). If a corresponding element does not exist, the shorter collection is ordered first (for example, [1,2] < [1,2,3] is true).

The basic comparison done for lists can be expressed using this function:

def cmp(list_1, list_2):
    length_1 = len(list_1)
    length_2 = len(list_2)
    min_length = min(length_1, length_2)

    # Compare individual items till there's a different item found
    for i in xrange(min_length):
        if list_1[i] > list_2[i]:
            return 1
        elif list_1[i] < list_2[i]:
            return -1

    # All items were same so far, let's compare sizes.
    if length_1 > length_2:
        return 1
    elif length_1 < length_2:
        return -1
    elif length_1 == length_2:
        return 0

Demo:

>>> lst = [[ 0, 1, 2, 3 ], [ 0, 1 ], [ 0, 2 ]]
>>> print sorted(lst) == sorted(lst, cmp=cmp)
True

Related CPython code for reference:

/* Search for the first index where items are different */
for (i = 0; i < Py_SIZE(vl) && i < Py_SIZE(wl); i++) {
    int k = PyObject_RichCompareBool(vl->ob_item[i],
                                     wl->ob_item[i], Py_EQ);
    if (k < 0)
        return NULL;
    if (!k)
        break;
}

if (i >= Py_SIZE(vl) || i >= Py_SIZE(wl)) {
    /* No more items to compare -- compare sizes */
    Py_ssize_t vs = Py_SIZE(vl);
    Py_ssize_t ws = Py_SIZE(wl);
    int cmp;
    PyObject *res;
    switch (op) {
    case Py_LT: cmp = vs <  ws; break;
    case Py_LE: cmp = vs <= ws; break;
    case Py_EQ: cmp = vs == ws; break;
    case Py_NE: cmp = vs != ws; break;
    case Py_GT: cmp = vs >  ws; break;
    case Py_GE: cmp = vs >= ws; break;
    default: return NULL; /* cannot happen */
    }
    if (cmp)
        res = Py_True;
    else
        res = Py_False;
    Py_INCREF(res);
    return res;
}
like image 37
Ashwini Chaudhary Avatar answered Oct 18 '25 08:10

Ashwini Chaudhary