Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: remove lots of items from a list

Tags:

python

I am in the final stretch of a project I have been working on. Everything is running smoothly but I have a bottleneck that I am having trouble working around.

I have a list of tuples. The list ranges in length from say 40,000 - 1,000,000 records. Now I have a dictionary where each and every (value, key) is a tuple in the list.

So, I might have

myList = [(20000, 11), (16000, 4), (14000, 9)...]
myDict = {11:20000, 9:14000, ...}

I want to remove each (v, k) tuple from the list.

Currently I am doing:

for k, v in myDict.iteritems():
    myList.remove((v, k))

Removing 838 tuples from the list containing 20,000 tuples takes anywhere from 3 - 4 seconds. I will most likely be removing more like 10,000 tuples from a list of 1,000,000 so I need this to be faster.

Is there a better way to do this?

I can provide code used to test, plus pickled data from the actual application if needed.

like image 887
sberry Avatar asked Aug 12 '09 16:08

sberry


4 Answers

You'll have to measure, but I can imagine this to be more performant:

myList = filter(lambda x: myDict.get(x[1], None) != x[0], myList)

because the lookup happens in the dict, which is more suited for this kind of thing. Note, though, that this will create a new list before removing the old one; so there's a memory tradeoff. If that's an issue, rethinking your container type as jkp suggest might be in order.

Edit: Be careful, though, if None is actually in your list -- you'd have to use a different "placeholder."

like image 158
balpha Avatar answered Oct 14 '22 09:10

balpha


To remove about 10,000 tuples from a list of about 1,000,000, if the values are hashable, the fastest approach should be:

totoss = set((v,k) for (k,v) in myDict.iteritems())
myList[:] = [x for x in myList if x not in totoss]

The preparation of the set is a small one-time cost, wich saves doing tuple unpacking and repacking, or tuple indexing, a lot of times. Assignign to myList[:] instead of assigning to myList is also semantically important (in case there are any other references to myList around, it's not enough to rebind just the name -- you really want to rebind the contents!-).

I don't have your test-data around to do the time measurement myself, alas!, but, let me know how it plays our on your test data!

If the values are not hashable (e.g. they're sub-lists, for example), fastest is probably:

sentinel = object()
myList[:] = [x for x in myList if myDict.get(x[0], sentinel) != x[1]]

or maybe (shouldn't make a big difference either way, but I suspect the previous one is better -- indexing is cheaper than unpacking and repacking):

sentinel = object()
myList[:] = [(a,b) for (a,b) in myList if myDict.get(a, sentinel) != b]

In these two variants the sentinel idiom is used to ward against values of None (which is not a problem for the preferred set-based approach -- if values are hashable!) as it's going to be way cheaper than if a not in myDict or myDict[a] != b (which requires two indexings into myDict).

like image 23
Alex Martelli Avatar answered Oct 14 '22 10:10

Alex Martelli


Every time you call myList.remove, Python has to scan over the entire list to search for that item and remove it. In the worst case scenario, every item you look for would be at the end of the list each time.

Have you tried doing the "inverse" operation of:

newMyList = [(v,k) for (v,k) in myList if not k in myDict]

But I'm really not sure how well that would scale, either, since you would be making a copy of the original list -- could potentially be a lot of memory usage there.

Probably the best alternative here is to wait for Alex Martelli to post some mind-blowingly intuitive, simple, and efficient approach.

like image 23
Mark Rushakoff Avatar answered Oct 14 '22 08:10

Mark Rushakoff


[(i, j) for i, j in myList if myDict.get(j) != i]
like image 32
SilentGhost Avatar answered Oct 14 '22 08:10

SilentGhost