Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I partially sort a Python list?

Tags:

python

sorting

I wrote a compiler cache for MSVC (much like ccache for gcc). One of the things I have to do is to remove the oldest object files in my cache directory to trim the cache to a user-defined size.

Right now, I basically have a list of tuples, each of which is the last access time and the file size:

# First tuple element is the access time, second tuple element is file size
items = [ (1, 42341),
          (3, 22),
          (0, 3234),
          (2, 42342),
          (4, 123) ]

Now I'd like to do a partial sort on this list so that the first N elements are sorted (where N is the number of elements so that the sum of their sizes exceeds 45000). The result should be basically this:

# Partially sorted list; only first two elements are sorted because the sum of
# their second field is larger than 45000.
items = [ (0, 3234),
          (1, 42341),
          (3, 22),
          (2, 42342),
          (4, 123) ]

I don't really care about the order of the unsorted entries, I just need the N oldest items in the list whose cumulative size exceeds a certain value.

like image 561
Frerich Raabe Avatar asked Dec 29 '10 16:12

Frerich Raabe


2 Answers

You could use the heapq module. Call heapify() on the list, followed by heappop() until your condition is met. heapify() is linear and heappop() logarithmic, so it's likely as fast as you can get.

heapq.heapify(items)
size = 0
while items and size < 45000:
  item = heapq.heappop(items)
  size += item[1]
  print item

Output:

(0, 3234)
(1, 42341)
like image 140
moinudin Avatar answered Oct 13 '22 04:10

moinudin


I don't know of anything canned, but you could do this with a variant of any sort which incrementally builds the sorted list from one end to the other, but which simply stops when enough elements have been sorted. Quicksort would be the obvious choice. Selection sort would do, but it's a terrible sort. Heapsort, as Marco suggests, would also do it, taking the heapify of the whole array as a sunk cost. Mergesort couldn't be used this way.

To look at quicksort specifically, you would simply need to track a high water mark of how far into the array has been sorted so far, and the total file size of those elements. At the end of each sub-sort, you update those numbers by adding in the newly-sorted elements. Abandon the sort when it passes the target.

You might also find performance was improved by changing the partition-selection step. You might prefer lopsided partitioning elements if you only expect to sort a small fraction of the array.

like image 44
Tom Anderson Avatar answered Oct 13 '22 05:10

Tom Anderson