Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to sort 1M records in Python

Tags:

python

I have a service that runs that takes a list of about 1,000,000 dictionaries and does the following

myHashTable = {}
myLists = { 'hits':{}, 'misses':{}, 'total':{} }
sorted = { 'hits':[], 'misses':[], 'total':[] }
for item in myList:
  id = item.pop('id')
  myHashTable[id] = item
  for k, v in item.iteritems():
    myLists[k][id] = v

So, if I had the following list of dictionaries:

[ {'id':'id1', 'hits':200, 'misses':300, 'total':400},
  {'id':'id2', 'hits':300, 'misses':100, 'total':500},
  {'id':'id3', 'hits':100, 'misses':400, 'total':600}
]

I end up with

myHashTable =
{ 
  'id1': {'hits':200, 'misses':300, 'total':400},
  'id2': {'hits':300, 'misses':100, 'total':500},
  'id3': {'hits':100, 'misses':400, 'total':600}
}

and

myLists = 

    {
      'hits': {'id1':200, 'id2':300, 'id3':100},
      'misses': {'id1':300, 'id2':100, 'id3':400},
      'total': {'id1':400, 'id2':500, 'id3':600}
    }

I then need to sort all of the data in each of the myLists dictionaries.

What I doing currently is something like the following:

def doSort(key):
  sorted[key] = sorted(myLists[key].items(), key=operator.itemgetter(1), reverse=True)

which would yield, in the case of misses:
[('id3', 400), ('id1', 300), ('id2', 200)] 

This works great when I have up to 100,000 records or so, but with 1,000,000 it is taking at least 5 - 10 minutes to sort each with a total of 16 (my original list of dictionaries actually has 17 fields including id which is popped)

* EDIT * This service is a ThreadingTCPServer which has a method allowing a client to connect and add new data. The new data may include new records (meaning dictionaries with unique 'id's to what is already in memory) or modified records (meaning the same 'id' with different data for the other key value pairs

So, once this is running I would pass in

[
  {'id':'id1', 'hits':205, 'misses':305, 'total':480},
  {'id':'id4', 'hits':30, 'misses':40, 'total':60},
  {'id':'id5', 'hits':50, 'misses':90, 'total':20
]

I have been using dictionaries to store the data so that I don't end up with duplicates. After the dictionaries are updated with the new/modified data I resort each of them.

* END EDIT *

So, what is the best way for me to sort these? Is there a better method?

like image 730
sberry Avatar asked Jul 24 '09 21:07

sberry


People also ask

How do you sort an array with 1 million records efficiently?

Integer Sorting with limited range is achieved efficiently with Bucket Sorting. Bucket sort, counting sort, radix sort, and van Emde Boas tree sorting all work best when the key size is small; for large enough keys, they become slower than comparison sorting algorithm.

Which sorting technique is best in Python?

The Merge Sort Algorithm in Python. Merge sort is a very efficient sorting algorithm. It's based on the divide-and-conquer approach, a powerful algorithmic technique used to solve complex problems.

How do you sort by smallest to largest in Python?

To sort list items in descending order, you need to use the optional reverse parameter with the sort() method, and set its value to True . Now all the numbers are arranged in reverse, with the largest value being on the left hand side and the smallest on the right.


2 Answers

You may find this related answer from Guido: Sorting a million 32-bit integers in 2MB of RAM using Python

like image 105
OscarRyz Avatar answered Sep 24 '22 08:09

OscarRyz


What you really want is an ordered container, instead of an unordered one. That would implicitly sort the results as they're inserted. The standard data structure for this is a tree.

However, there doesn't seem to be one of these in Python. I can't explain that; this is a core, fundamental data type in any language. Python's dict and set are both unordered containers, which map to the basic data structure of a hash table. It should definitely have an optimized tree data structure; there are many things you can do with them that are impossible with a hash table, and they're quite tricky to implement well, so people generally don't want to be doing it themselves.

(There's also nothing mapping to a linked list, which also should be a core data type. No, a deque is not equivalent.)

I don't have an existing ordered container implementation to point you to (and it should probably be implemented natively, not in Python), but hopefully this will point you in the right direction.

A good tree implementation should support iterating across a range by value ("iterate all values from [2,100] in order"), find next/prev value from any other node in O(1), efficient range extraction ("delete all values in [2,100] and return them in a new tree"), etc. If anyone has a well-optimized data structure like this for Python, I'd love to know about it. (Not all operations fit nicely in Python's data model; for example, to get next/prev value from another value, you need a reference to a node, not the value itself.)

like image 31
Glenn Maynard Avatar answered Sep 24 '22 08:09

Glenn Maynard