Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to keep a list of lists sorted as it is created

I'm reading in a file and pulling in data that includes some strings and some numbers, in Python. I'm storing this information as lists of lists, like this:

dataList = [

['blah', 2, 3, 4],

['blahs', 6, 7, 8],

['blaher', 10, 11, 12],

]

I want to keep dataList sorted by the second element of the sub list: dataList[][1]

I thought I could use insort or bisect right when I want to add them in, but I cannot figure out how to make it look at the second element of the sub list.

Any thoughts here? I was just appending data to the end and then doing a linear sort to find things back later on. But, throw a few 10's of thousands of sub-lists in here and then search for 100k items and it takes a while.

like image 724
ErikS Avatar asked Sep 07 '12 19:09

ErikS


People also ask

Do lists in Python preserve order?

You will use these extensively in your Python programming. One of the chief characteristics of a list is that it is ordered. The order of the elements in a list is an intrinsic property of that list and does not change, unless the list itself is modified.

Does sort change original list?

The original list is not changed. It's most common to pass a list into the sorted() function, but in fact it can take as input any sort of iterable collection.


1 Answers

Depends on a few things, but the first thing that comes to mind is using the heapq module:

import heapq
heap = []
for row in rows:
    heapq.heappush(heap, (row[1], row))

This would create a heap full of tuples, where the first element is the element you want to sort by, and the second element is the row.

The simplest method to read them back from the heap would be to copy it then pop items:

new_heap = list(heap)
while new_heap:
    _, row = heapq.heappop(new_heap)
    print row

The runtime of inserting each item into the heap is O(lg N), so creating the heap will require O(N lg N) time, and popping items from the heap also requires O(lg N) time, so O(N lg N) time will be required to traverse it.

If these tradeoffs are not ideal, you could use a binary search tree (none exist in the standard library, but they are easy to find), or as other commenters have suggested, sort the rows after reading them: rows.sort(key=lambda row: row[1]).

Now, in practice, unless you're dealing with a very large number of rows, it will almost certainly be faster to sort the list in-place after loading it (ie, using the .sort() method)… So try a few things out and see what works best.

Finally, bisect is a poor idea, because inserting into Python lists requires O(N) time, so inserting items with bisect would require O(N lg N) time per item, so a total time of O((N lg N) * N) = O(N**2) time.

like image 105
David Wolever Avatar answered Sep 30 '22 07:09

David Wolever