def removeDuplicatesFromList(seq):
    # Not order preserving
    keys = {}
    for e in seq:
        keys[e] = 1
    return keys.keys()
def countWordDistances(li):
    '''
    If li = ['that','sank','into','the','ocean']
    This function would return: { that:1, sank:2, into:3, the:4, ocean:5 }
    However, if there is a duplicate term, take the average of their positions
    '''
    wordmap = {}
    unique_words = removeDuplicatesFromList(li)
    for w in unique_words:
        # 1-based positions of every occurrence of w (one full pass per word)
        distances = [i+1 for i, x in enumerate(li) if x == w]
        wordmap[w] = float(sum(distances)) / float(len(distances))  # take average
    return wordmap
How do I make this function faster?
A list comprehension is basically just syntactic sugar for a regular for loop. The reason it performs better here is that it doesn't need to look up the list's append attribute and call it as a function on every iteration.
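For instance, a minimal micro-benchmark (the function names here are hypothetical) contrasting the two forms with timeit:

import timeit

def with_loop(n):
    # Looks up and calls the append method on every iteration.
    result = []
    for i in range(n):
        result.append(i * 2)
    return result

def with_comprehension(n):
    # Same result, but the append step happens in C inside the comprehension.
    return [i * 2 for i in range(n)]

print(timeit.timeit(lambda: with_loop(10000), number=100))
print(timeit.timeit(lambda: with_comprehension(10000), number=100))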
Generally, lists are faster than sets. But when searching for an element in a collection, sets are faster, because sets are implemented using hash tables.
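A quick sketch of that difference: a membership test against a list scans it element by element, while a set does a hash lookup.

import timeit

words_list = ['w%d' % i for i in range(10000)]
words_set = set(words_list)

# Worst case for the list: the target is at the end, so every test scans it all.
print(timeit.timeit(lambda: 'w9999' in words_list, number=1000))
print(timeit.timeit(lambda: 'w9999' in words_set, number=1000))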
Tuples are immutable, so they don't require extra space to store new objects. Lists are allocated in two blocks: a fixed one with all the Python object information, and a variable-sized block for the data. This is why creating a tuple is faster than creating a list.
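As a rough illustration (exact numbers vary by interpreter), timing the two literals shows the gap; CPython can even fold a constant tuple literal into a single precomputed constant:

import timeit

print(timeit.timeit('(1, 2, 3, 4, 5)', number=1000000))  # tuple literal
print(timeit.timeit('[1, 2, 3, 4, 5]', number=1000000))  # list literal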
import collections

def countWordDistances(li):
    # Map each word to the list of its 1-based positions.
    wordmap = collections.defaultdict(list)
    for i, w in enumerate(li, 1):
        wordmap[w].append(i)
    # Replace each position list with its average.
    for k, v in wordmap.iteritems():  # use .items() on Python 3
        wordmap[k] = sum(v) / float(len(v))
    return wordmap
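For example, on a small input with one duplicate ('that' appears at positions 1 and 6, averaging to 3.5); key order may vary by Python version:

li = ['that', 'sank', 'into', 'the', 'ocean', 'that']
print(countWordDistances(li))
# {'that': 3.5, 'sank': 2.0, 'into': 3.0, 'the': 4.0, 'ocean': 5.0}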
This makes only one pass through the list, and keeps operations to a minimum. I timed this on a word list with 1.1M entries, 29k unique words, and it was almost twice as fast as Patrick's answer. On a list of 10k words, 2k unique, it was more than 300x faster than the OP's code.
To make Python code go faster, there are two rules to keep in mind: use the best algorithm, and avoid Python.
On the algorithm front, iterating the list once instead of N+1 times (N= number of unique words) is the main thing that will speed this up.
On the "avoid Python" front, I mean: you want your code to be executing in C as much as possible. So using defaultdict
is better than a dict where you explicitly check if the key is present. defaultdict
does that check for you, but does it in C, in the Python implementation. enumerate
is better than for i in range(len(li))
, again because it's fewer Python steps. And enumerate(li, 1)
makes the counting start at 1 instead of having to have a Python +1 somewhere in the loop.
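For contrast, here is a sketch of the same logic written with the slower "pure Python" patterns this paragraph advises against; each commented line is an extra Python-level step that defaultdict and enumerate avoid:

def countWordDistances_slow(li):
    wordmap = {}
    for i in range(len(li)):       # extra Python-level indexing vs enumerate
        w = li[i]
        if w not in wordmap:       # explicit key check done in Python, not C
            wordmap[w] = []
        wordmap[w].append(i + 1)   # manual +1 vs enumerate(li, 1)
    for k in wordmap:
        wordmap[k] = sum(wordmap[k]) / float(len(wordmap[k]))
    return wordmap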
Edit: third rule: use PyPy. My code goes twice as fast on PyPy as on CPython 2.7.