Python: remove lots of items from a list

Question

I am in the final stretch of a project I have been working on. Everything is running smoothly but I have a bottleneck that I am having trouble working around.

I have a list of tuples. The list ranges in length from say 40,000 - 1,000,000 records. Now I have a dictionary where each and every (value, key) is a tuple in the list.

So, I might have

myList = [(20000, 11), (16000, 4), (14000, 9)...]
myDict = {11:20000, 9:14000, ...}

I want to remove each (v, k) tuple from the list.

Currently I am doing:

for k, v in myDict.iteritems():
    myList.remove((v, k))

Removing 838 tuples from the list containing 20,000 tuples takes anywhere from 3 - 4 seconds. I will most likely be removing more like 10,000 tuples from a list of 1,000,000 so I need this to be faster.

Is there a better way to do this?

I can provide code used to test, plus pickled data from the actual application if needed.

balpha · Accepted Answer

You'll have to measure, but I can imagine this to be more performant:

myList = filter(lambda x: myDict.get(x[1], None) != x[0], myList)

because the lookup happens in the dict, which is more suited for this kind of thing. Note, though, that this will create a new list before removing the old one; so there's a memory tradeoff. If that's an issue, rethinking your container type as jkp suggest might be in order.

Edit: Be careful, though, if None is actually in your list -- you'd have to use a different "placeholder."

Alex Martelli · Answer

To remove about 10,000 tuples from a list of about 1,000,000, if the values are hashable, the fastest approach should be:

totoss = set((v,k) for (k,v) in myDict.iteritems())
myList[:] = [x for x in myList if x not in totoss]

The preparation of the set is a small one-time cost, wich saves doing tuple unpacking and repacking, or tuple indexing, a lot of times. Assignign to myList[:] instead of assigning to myList is also semantically important (in case there are any other references to myList around, it's not enough to rebind just the name -- you really want to rebind the contents!-).

I don't have your test-data around to do the time measurement myself, alas!, but, let me know how it plays our on your test data!

If the values are not hashable (e.g. they're sub-lists, for example), fastest is probably:

sentinel = object()
myList[:] = [x for x in myList if myDict.get(x[0], sentinel) != x[1]]

or maybe (shouldn't make a big difference either way, but I suspect the previous one is better -- indexing is cheaper than unpacking and repacking):

sentinel = object()
myList[:] = [(a,b) for (a,b) in myList if myDict.get(a, sentinel) != b]

In these two variants the sentinel idiom is used to ward against values of None (which is not a problem for the preferred set-based approach -- if values are hashable!) as it's going to be way cheaper than if a not in myDict or myDict[a] != b (which requires two indexings into myDict).

Mark Rushakoff · Answer

Every time you call myList.remove, Python has to scan over the entire list to search for that item and remove it. In the worst case scenario, every item you look for would be at the end of the list each time.

Have you tried doing the "inverse" operation of:

newMyList = [(v,k) for (v,k) in myList if not k in myDict]

But I'm really not sure how well that would scale, either, since you would be making a copy of the original list -- could potentially be a lot of memory usage there.

Probably the best alternative here is to wait for Alex Martelli to post some mind-blowingly intuitive, simple, and efficient approach.

SilentGhost · Answer

[(i, j) for i, j in myList if myDict.get(j) != i]

Python: remove lots of items from a list

Tags:

python

sberry

4 Answers

balpha

Alex Martelli

Mark Rushakoff

SilentGhost

Recent Activity

Donate For Us

Python: remove lots of items from a list

Tags:

python

sberry

4 Answers

balpha

Alex Martelli

Mark Rushakoff

SilentGhost

Related questions

Recent Activity

Donate For Us