I had a quick question about efficiency of searching through large dictionaries in Python. I am reading a large comma-separated file and getting a key and value from each line. If my key is already in the dictionary, I'm adding the value to the value listed in the dictionary, if the key doesn't exist in the dictionary, I simply add the value. Previously I was using this:
if key in data_dict.keys(): add values else: data_dict[key] = value
This starts pretty fast, but as the dictionary grows it becomes slower and slower, to the point where I can't use it at all. I changed the way I search for the key in the dictionary to this:
try: # This will fail if key not present data_dict[keyStr] = input_data[keyStr] + load_val except: data_dict[keyStr] = load_val
This is infinitely faster, and can read / write over 350,000 lines of code in 3 seconds.
My question was why does the if key in data_dict.keys():
command take sooo much longer than the call to try: data_dict[keyStr]
? And why wouldn't Python utilize the try
statement when searching for a key in a dictionary?
The two times above for 100 and 10000000 are almost the same for a dictionary, which is because a dictionary can almost instantly jump to the key it is asked for thanks to the lookups. Lookups are faster in dictionaries because Python implements them using hash tables.
The list is an ordered collection of data, whereas the dictionaries store the data in the form of key-value pairs using the hashtable structure. Due to this, fetching the elements from the list data structure is quite complex compared to dictionaries in Python. Therefore, the dictionary is faster than a list in Python.
The reason is because a dictionary is a lookup, while a list is an iteration. Dictionary uses a hash lookup, while your list requires walking through the list until it finds the result from beginning to the result each time.
Speed. Lookups in lists are O(n), lookups in dictionaries are amortized O(1), with regard to the number of items in the data structure. If you don't need to associate values, use sets.
The problem is that for every test you're generating a new list of keys with .keys()
. As the list of keys gets longer, the time required goes up. Also as noted by dckrooney, the search for the key becomes linear instead of taking advantage of the hash-table structure of the dictionary.
Replace with:
if key in data_dict:
data_dict.keys()
returns an unsorted list of keys in the dictionary. Thus each time you check to see if a given key is in the dictionary, you're doing a linear search across the list of keys (an O(n) operation). The longer your list, the longer it takes to search for a given key.
Contrast this to data_dict[keyStr]
. This performs a hash lookup, which is an O(1) operation. It doesn't (directly) depend on the number of keys in the dictionary; even as you add more keys, the time to check if a given key is in the dictionary stays constant.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With