
How do I make this list function faster?

def removeDuplicatesFromList(seq): 
    # Not order preserving 
    keys = {}
    for e in seq:
        keys[e] = 1
    return keys.keys()

def countWordDistances(li):
    '''
    If li = ['that','sank','into','the','ocean']    
    This function would return: { that:1, sank:2, into:3, the:4, ocean:5 }
    However, if there is a duplicate term, take the average of their positions
    '''
    wordmap = {}
    unique_words = removeDuplicatesFromList(li)
    for w in unique_words:
        distances = [i+1 for i,x in enumerate(li) if x == w]
        wordmap[w] = float(sum(distances)) / float(len(distances)) #take average
    return wordmap

How do I make this function faster?

asked Jul 18 '11 by user849364

People also ask

How can list comprehension be faster?

A list comprehension is essentially syntactic sugar for a regular for loop. It performs better because it doesn't need to look up the list's append attribute and call it as a function on every iteration.
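A quick way to see this is to time the two forms side by side (the function names here are illustrative, not from the question):

```python
import timeit

def with_loop():
    result = []
    for i in range(1000):
        result.append(i * 2)   # attribute lookup + call on every iteration
    return result

def with_comprehension():
    return [i * 2 for i in range(1000)]   # no append lookup per iteration

# Both produce identical lists; the comprehension just does less
# Python-level work per element.
assert with_loop() == with_comprehension()

loop_time = timeit.timeit(with_loop, number=1000)
comp_time = timeit.timeit(with_comprehension, number=1000)
```

On CPython the comprehension version typically comes out measurably faster, though the exact ratio depends on the interpreter and workload.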

What is faster list or set?

For plain iteration, lists are generally faster than sets. But for searching for an element in a collection, sets are faster, because sets are implemented using hash tables.
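A minimal sketch of the difference for membership tests (the sizes are arbitrary):

```python
# Membership tests: O(n) scan for a list, O(1) average hash lookup for a set.
haystack_list = list(range(100_000))
haystack_set = set(haystack_list)

needle = 99_999  # worst case for the list: it must scan every element

assert needle in haystack_list   # linear scan through 100k elements
assert needle in haystack_set    # single hash lookup
```

Building the set costs one pass up front, so it only pays off when you test membership more than a handful of times.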

What is faster than a list in Python?

Tuples are immutable, so they don't require extra space to store new objects. Lists are allocated in two blocks: a fixed one with all the Python object information, and a variable-sized block for the data. That is why creating a tuple is faster than creating a list.
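In CPython the overhead difference is visible directly with sys.getsizeof (a rough illustration, not a benchmark):

```python
import sys

items = (1, 2, 3, 4, 5)
as_list = [1, 2, 3, 4, 5]

# A tuple stores its elements in one fixed-size block; a list carries an
# extra pointer to a separately allocated, resizable data block.
assert sys.getsizeof(items) < sys.getsizeof(as_list)
```

The exact byte counts vary between CPython versions, but the tuple is consistently the smaller of the two.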


1 Answer

import collections
def countWordDistances(li):
    wordmap = collections.defaultdict(list)
    for i, w in enumerate(li, 1):
        wordmap[w].append(i)
    for k, v in wordmap.iteritems():  # Python 2; use .items() on Python 3
        wordmap[k] = sum(v)/float(len(v))

    return wordmap

This makes only one pass through the list, and keeps operations to a minimum. I timed this on a word list with 1.1M entries, 29k unique words, and it was almost twice as fast as Patrick's answer. On a list of 10k words, 2k unique, it was more than 300x faster than the OP's code.
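For readers on Python 3, here is a port of the answer's function (`.items()` in place of `.iteritems()`; `/` is already true division there), checked against the docstring example from the question. The port and the snake_case name are my additions, not part of the original answer:

```python
import collections

def count_word_distances(words):
    """Python 3 version of the answer's countWordDistances."""
    wordmap = collections.defaultdict(list)
    for i, w in enumerate(words, 1):   # 1-based positions, counted in C
        wordmap[w].append(i)
    # Average each word's positions in a single comprehension.
    return {k: sum(v) / len(v) for k, v in wordmap.items()}

result = count_word_distances(['that', 'sank', 'into', 'the', 'ocean'])
assert result == {'that': 1, 'sank': 2, 'into': 3, 'the': 4, 'ocean': 5}

# Duplicate words average their 1-based positions:
assert count_word_distances(['the', 'cat', 'the'])['the'] == 2.0
```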

To make Python code go faster, there are two rules to keep in mind: use the best algorithm, and avoid Python.

On the algorithm front, iterating the list once instead of N+1 times (N= number of unique words) is the main thing that will speed this up.

On the "avoid Python" front, I mean: you want your code to be executing in C as much as possible. So using defaultdict is better than a dict where you explicitly check if the key is present. defaultdict does that check for you, but does it in C, in the Python implementation. enumerate is better than for i in range(len(li)), again because it's fewer Python steps. And enumerate(li, 1) makes the counting start at 1 instead of having to have a Python +1 somewhere in the loop.

Edited: Third rule: use PyPy. My code goes twice as fast on PyPy as on 2.7.

answered by Ned Batchelder