Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter items that only occurs once in a very large list

I have a large list(over 1,000,000 items), which contains english words:

tokens = ["today", "good", "computer", "people", "good", ... ]

I'd like to get all the items that occurs only once in the list

now I'm using:

tokens_once = set(word for word in set(tokens) if tokens.count(word) == 1)

but it's really slow. how could I make this faster?

like image 858
wong2 Avatar asked May 06 '12 07:05

wong2


1 Answers

You iterate over a list and then for each element you do it again, which makes it O(N²). If you replace your count by a Counter, you iterate once over the list and then once again over the list of unique elements, which makes it, in the worst case, O(2N), i.e. O(N).

from collections import Counter

tokens = ["today", "good", "computer", "people", "good"]
single_tokens = [k for k, v in Counter(tokens).iteritems() if v == 1 ]
# single_tokens == ['today', 'computer', 'people']
like image 146
eumiro Avatar answered Nov 09 '22 23:11

eumiro