Let's say we're solving a simple word count problem. There's a list and we're trying to find the word count of each word occurring in the list. Which pattern is faster here?
```python
book_title = ['great', 'expectations', 'the', 'adventures', 'of', 'sherlock',
              'holmes', 'the', 'great', 'gasby', 'hamlet', 'adventures', 'of',
              'huckleberry', 'fin']

word_count_dict = {}
for word in book_title:
    if word in word_count_dict:
        word_count_dict[word] += 1
    else:
        word_count_dict[word] = 1
```
```python
word_counter = {}
for word in book_title:
    if word not in word_counter:
        word_counter[word] = 1
    else:
        word_counter[word] += 1
```
Six of one, half a dozen of the other. They should be roughly equivalent: in computational terms, a `not` operation is nearly negligible (one of the cheapest possible operations), and the `in` operation on a hash table like a dictionary runs in average constant time (either the hash is there, or it's not). If we were dealing with a list, membership testing would run in linear time, but the difference between `in` and `not in` would still be negligible. See also the time complexity of Python data structures.
So basically, use whichever makes your code more understandable.
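If you want to check this empirically, a quick `timeit` comparison works; this is a small sketch (function names are my own, and exact timings will vary by machine and Python version), using a repeated copy of the list so the timings are measurable:

```python
import timeit

book_title = ['great', 'expectations', 'the', 'adventures', 'of', 'sherlock',
              'holmes', 'the', 'great', 'gasby', 'hamlet', 'adventures', 'of',
              'huckleberry', 'fin'] * 1000  # enlarge the input for stable timings

def count_with_in(words):
    # the first pattern: positive membership test
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

def count_with_not_in(words):
    # the second pattern: negative membership test
    counts = {}
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts

print(timeit.timeit(lambda: count_with_in(book_title), number=100))
print(timeit.timeit(lambda: count_with_not_in(book_title), number=100))
```

On any machine I have tried, the two times are within noise of each other, which supports the "use whichever reads better" advice.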
That said, have you considered using a `collections.Counter`, a data structure specifically designed for this purpose?
```python
import collections

book_title = ['great', 'expectations', 'the', 'adventures', 'of', 'sherlock',
              'holmes', 'the', 'great', 'gasby', 'hamlet', 'adventures', 'of',
              'huckleberry', 'fin']
word_counts = collections.Counter(book_title)
print(word_counts)
# Counter({'great': 2, 'the': 2, 'adventures': 2, 'of': 2, 'expectations': 1, 'sherlock': 1, 'holmes': 1, 'gasby': 1, 'hamlet': 1, 'huckleberry': 1, 'fin': 1})
```
You can cast a `collections.Counter` to a `dict` if you need to; in fact, `collections.Counter` is a subclass of `dict`. It even has an `.update()` method designed to work with other counters: if you add another book title, just feed it into a `Counter` and then `.update()` the original with it.
Using `dis`, you can look at the bytecode generated for each of your methods.
The only difference I could see when running your code through the disassembler was at `in` versus `not in`, where the bytecode differed only in:
```
COMPARE_OP 7 (not in)
```
or
```
COMPARE_OP 6 (in)
```
followed in both cases by `POP_JUMP_IF_FALSE` (i.e. continue on to the next instruction for this condition).
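A minimal way to inspect this yourself (the helper functions are just for illustration; note that the exact opcodes depend on the interpreter version):

```python
import dis

def check_in(word, counts):
    return word in counts

def check_not_in(word, counts):
    return word not in counts

# On Python 3.8 and earlier, these compile to COMPARE_OP 6 (in) and
# COMPARE_OP 7 (not in); on Python 3.9+, membership tests got their own
# opcode and show up as CONTAINS_OP 0 and CONTAINS_OP 1 instead.
dis.dis(check_in)
dis.dis(check_not_in)
```

Either way, the two variants differ only in that single opcode argument, not in the number of instructions.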
All in all, both methods seem to execute the same number of instructions regardless of whether the comparison returns true or false, and therefore most probably execute equally fast.
There might, however, be underlying optimizations closer to the CPU instructions that would make one or the other faster, but I would consider that territory out of scope for this question. If that were the case, a simple measurement of execution time over a larger list would show which one is faster.
The execution speed of the two instructions beneath the Python bytecode might differ between Python versions, builds, operating systems, or architectures. You could probably even make a small change to the CPython source to make one or the other instruction execute faster.