I have a few Python scripts where I store 5-10 million string key-value pairs in a dictionary, and I query this dictionary around 5-10 million times. I noticed that the Python dict is not performing very well. Is there another implementation better suited to string keys?
Edit:
I have two large lists of person names and I want to match them, so I take one of them as the reference list and apply different heuristics to each name in the second list to figure out whether it exists in the first list. This means I have to query the first list 2-3 times for every name in the second list. I hope this makes sense.
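Roughly, the loop looks like this sketch (the heuristics shown are placeholders, not my real ones):

```python
# Minimal sketch of the matching loop; variants() stands in for the
# real heuristics (placeholders for illustration only).
first_list = ["John Doe", "Jane Roe"]          # reference names
second_list = ["doe john", "Jane Roe", "Bob"]  # names to match

def variants(name):
    """Yield heuristic spellings of a name to try against the reference."""
    yield name
    yield name.title()
    yield " ".join(reversed(name.title().split()))  # "Doe John" -> "John Doe"

reference = set(first_list)            # one-time O(n) build
matches = {}
for name in second_list:
    for candidate in variants(name):   # 2-3 lookups per name
        if candidate in reference:
            matches[name] = candidate
            break

print(matches)  # {'doe john': 'John Doe', 'Jane Roe': 'Jane Roe'}
```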
Wow. A hashmap (dictionary) might not be the structure you're looking for.
Instead of using strings, try a representation that gives you good, fast hashing. Or are you really storing strings? If so, scratch the "might" in the previous sentence.
Could you give us details about the problem you're tackling?
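If the keys really are strings, one concrete trick along these lines is interning them with sys.intern, so CPython can short-circuit key comparisons on object identity instead of comparing character by character. A minimal sketch (your data and Python version may behave differently):

```python
import sys

# Intern the keys so repeated lookups hit the same canonical string
# objects; equality checks can then succeed on identity alone.
names = ["john smith", "jane doe"] * 5  # stand-in data

table = {sys.intern(name): idx for idx, name in enumerate(names)}

# Queries should intern too, so they map to the same canonical objects.
key = sys.intern("jane doe")
print(table[key])
```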
Questions: Is this a scaling issue? Are you finding that the code runs more than twice as slow when you have twice as much data? Is it possible that you are running out of physical memory and using swap?
10 million strings of 100 characters each is a gigabyte. If you have two sets of those, that would be 2 gigabytes, which is close to the limit of a 32-bit WinXP process.
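A rough back-of-the-envelope check (a sketch only: it ignores dict overhead and assumes ASCII one-byte characters; per-string overhead varies by Python version):

```python
import sys

# Estimate memory for 10 million 100-character strings.
n_strings = 10_000_000
sample = "x" * 100
per_string = sys.getsizeof(sample)  # ~149 bytes in CPython 3: ~49 overhead + 100 chars
print(f"{n_strings * per_string / 2**30:.1f} GiB")  # ~1.4 GiB before dict overhead
```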
If you don't already know the answers to those questions, I would recommend running a test with the data at various sizes (powers of 10 or 2) and seeing whether the performance curve has a discontinuity.
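Something like this sketch, assuming random 20-character keys (absolute timings will vary by machine):

```python
import random
import string
import time

def make_key():
    return "".join(random.choices(string.ascii_lowercase, k=20))

# Build and query dicts at increasing sizes; a jump in per-lookup time
# at some size would suggest swapping or cache effects.
for n in (10**4, 10**5, 10**6):
    d = {make_key(): i for i in range(n)}
    keys = random.sample(list(d), k=10**4)   # existing keys to query
    start = time.perf_counter()
    for k in keys:
        d[k]
    elapsed = time.perf_counter() - start
    print(f"n={n:>8}: {elapsed / len(keys) * 1e9:.0f} ns/lookup")
```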