Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to sorting a corpus dictionary into an OrderedDict - python

Given a corpus/texts as such:

Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .
Although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .
You have requested a debate on this subject in the course of the next few days , during this part @-@ session .
In the meantime , I should like to observe a minute ' s silence , as a number of Members have requested , on behalf of all the victims concerned , particularly those of the terrible storms , in the various countries of the European Union .

I could simply do this to get a dictionary with word frequencies:

>>> word_freq = Counter()
>>> for line in text.split('\n'):
...     for word in line.split():
...             word_freq[word]+=1
... 

But if the aim is to achieve an ordered dictionary from highest to lowest frequency, I will have to do this:

>>> from collections import OrderedDict
>>> sorted_word_freq = OrderedDict()
>>> for word, freq in word_freq.most_common():
...     sorted_word_freq[word] = freq
... 

Imagine that I have 1 billion keys in the Counter object, iterating through the most_common() would have a complexity of going through a corpus (non-unique instances) once and the vocabulary (unique key).

Note: The Counter.most_common() would call an ad-hoc sorted(), see https://hg.python.org/cpython/file/e38470b49d3c/Lib/collections.py#l472

Given this, I have seen the following code that uses numpy.argsort():

>>> import numpy as np
>>> words = word_freq.keys()
>>> freqs = word_freq.values()
>>> sorted_word_index = np.argsort(freqs) # lowest to highest
>>> sorted_word_freq_with_numpy = OrderedDict()
>>> for idx in reversed(sorted_word_index):
...     sorted_word_freq_with_numpy[words[idx]] = freqs[idx]
... 

Which is faster?

Is there any other faster way to get such an OrderedDict from a Counter?

Other than OrderedDict, is there other python objects that achieves the same sorted key-value pair?

Assume that memory is not an issue. Given 120 GB of RAM, there shouldn't be much issue to keep 1 billion key-value pairs right? Assume an average of 20 chars per key for 1 billion keys and a single integer for each value.

like image 312
alvas Avatar asked Aug 02 '15 10:08

alvas


2 Answers

The Series object in Pandas is an array of key-value pairs (which can have non-unique keys) which may be of interest. It has a sort method which sorts by the values and is implemented in Cython. Here's an example sorting an array of length one million:

In [39]:
import pandas as pd
import numpy as np

arr = np.arange(1e6)
np.random.shuffle(arr)
s = pd.Series(arr, index=np.arange(1e6))
%timeit s.sort()
%timeit sorted(arr)

1 loops, best of 3: 85.8 ms per loop
1 loops, best of 3: 1.15 s per loop

Given a normal Python dict you can construct a Series by calling:

my_series = pd.Series(my_dict)

Then sort by value by

my_series.sort()
like image 103
JoeCondron Avatar answered Sep 23 '22 10:09

JoeCondron


One step toward improving the speed is to fill the counter in the optimal way.

For example, with your txt (802 char).

mycounter=Counter(txt.split())

produces the same thing as your word_counter, but in 1/3 the time.

Or if you have to read the text line by line from a file, then use:

word_freq=Counter()
for line in txt.splitlines():
    word_freq.update(line.split())

Similarly the ordered dictionary can be created without the loop:

mydict = OrderedDict(sorted(mycounter.items(), key=operator.itemgetter(1), reverse=True))

Here I am calling sorted in the same way that most_common does (as per your link). And I'm passing the list of sorted items directly to the OrderedDict creator.

When I look at mycounter in ipython, I get the values in sorted order:

In [160]: mycounter
Out[160]: Counter({'the': 13, ',': 10, 'of': 9, 'a': 7, '.': 4, 'in': 4, 'to': 3, 'have': 3, 'session': 3, ''': 3, 'on': 3, 'you': 3, 'I': 3, 'that': 2, 'requested': 2, 'like': 2, 'European': 2, 'this': 2, 'countries': 2, 'as': 2, 'number': 2, 's': 1, 'various': 1, 'wish': 1, 'will': 1, 'Parliament': 1, 'meantime': 1, 'Resumption': 1, 'natural': 1, 'days': 1, 'debate': 1, 'You': 1, 'Members': 1, 'next': 1, '@-@': 1, 'hope': 1, 'enjoyed': 1, 'December': 1, 'victims': 1, 'particularly': 1, 'millennium': 1, .... 'behalf': 1, 'were': 1, 'failed': 1})

That's because its __repr__ method calls most_common. Again this is from your link.

items = ', '.join(map('%r: %r'.__mod__, self.most_common()))

On further testing I see that calling sorted directly doesn't save time:

In [166]: timeit mycounter.most_common()
10000 loops, best of 3: 31.1 µs per loop

In [167]: timeit sorted(mycounter.items(),key=operator.itemgetter(1),reverse=True)
10000 loops, best of 3: 30.5 µs per loop

In [168]: timeit OrderedDict(mycounter.most_common())
1000 loops, best of 3: 225 µs per loop

In this case, loading the dictionary directly doesn't save time either. Your iteration does just as well:

In [174]: %%timeit 
   .....: sorteddict=OrderedDict()
   .....: for word,freq in word_freq.most_common():
    sorteddict[word]=freq
   .....: 
1000 loops, best of 3: 224 µs per loop

For this sample, using np.argsort does not help (timewise). Just calling argsort is slower than most_common.

In [178]: timeit np.argsort(list(mycounter.values()))
10000 loops, best of 3: 34.2 µs per loop

Most of that time is in converting the list to an array, x=np.array(list(mycounter.values())). np.argsort(x) is much faster. That's true of a lot of numpy functionality. When operating on arrays numpy is fast. But there's a lot of overhead when converting lists to arrays.

I can create the OrderedDict via numpy in one line with:

OrderedDict(np.sort(np.array(list(mycounter.items()), dtype='a12,i'), order='f1')[::-1])

or in pieces:

lla = np.array(list(mycounter.items()),dtype='a12,i')
lla.sort(order='f1')
OrderedDict(lla[::-1])

I'm making a structured array from the items(), sorting that by the 2nd field, and then making the dictionary. No time savings though. See https://stackoverflow.com/a/31837513/901925 for another recent example of using order to sort a structured array.

like image 31
hpaulj Avatar answered Sep 26 '22 10:09

hpaulj