Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is using "not in" faster than using "in" in Python3?

Let's say we're solving a simple word count problem. There's a list and we're trying to find the word count of each word occurring in the list. Which pattern is faster here?

book_title =  ['great', 'expectations','the', 'adventures', 'of', 'sherlock','holmes','the','great','gasby','hamlet','adventures','of','huckleberry','fin']
word_count_dict = {}

pattern 1

for word in book_title:
    if word in word_count_dict:
        word_count_dict[word] += 1
    else:
        word_count_dict[word] = 1

pattern 2

for word in book_title:
    if word not in word_counter:
        word_counter[word] = 1
    else:
        word_counter[word] += 1
like image 947
IronMan Avatar asked Dec 31 '22 09:12

IronMan


2 Answers

Six of one, half a dozen of the other. They should be roughly equivalent to each other - in computational terms, a not operation is nearly negligible (literally the cheapest possible operation), and the in operation in a hashtable like a dictionary runs in constant time (either the hash is there, or it's not). If we were dealing with a list, it would run in linear time, but still the between in and not in. See also computational complexity of python data structures.

So basically, use whichever makes your code more understandable.


That said, have you considered using a collections.Counter, a data structure specifically designed for this purpose?

import collections
book_title = ['great', 'expectations','the', 'adventures', 'of', 'sherlock','holmes','the','great','gasby','hamlet','adventures','of','huckleberry','fin']
word_counts = collections.Counter(book_title)
print(word_counts)
# Counter({'great': 2, 'the': 2, 'adventures': 2, 'of': 2, 'expectations': 1, 'sherlock': 1, 'holmes': 1, 'gasby': 1, 'hamlet': 1, 'huckleberry': 1, 'fin': 1})

You can typecast a collections.Counter to a dict if you need to, and in fact collections.Counter is a subclass of dict. It even has an .update() method specifically designed to work with other counters - if you add another book title, just feed it into a Counter and then .update() the original with that.

like image 118
Green Cloak Guy Avatar answered Jan 18 '23 13:01

Green Cloak Guy


Using dis, you can look at the bytecode generated for each of your methods.

The only difference I could see running your code through the disassembler was right at in and not in, where the difference in bytecode was:

COMPARE_OP 7 (not in) or COMPARE_OP 6 (in)

And afterwards POP_JUMP_IF_FALSE (i.e. continue onto the next instruction for this condition)

All in all, both of the methods seems to have the same amount of instructions, regardless of the comparison returning true or false, and therefore most probably executes equally fast.


There might be, however, some underlying optimizations closer to CPU instructions which would cause one or the other methods to be faster, but I would consider that territory out of scope for this question. If this would be the case, then I believe a simple measure of execution time over a larger list would prove which one is faster.

Execution speed of both instructions, beneath Python bytecode, might differ between Python versions, build, OS or architecture. You'd probably be able to make a small change in the Python source code to make one or the other instruction execute faster.

like image 34
Christopher Janzon Avatar answered Jan 18 '23 14:01

Christopher Janzon