Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Best Dictionary implementation [closed]

I have few python scripts where I am storing 5-10 million string key value pairs in a dictionary and I query this dictionary around 5-10 million times. I noticed that python dict is not performing very well. Is there any other implementation best suited for string keys.

Edit:

I am having two large lists of person names and I want to match them, so I take one of them as the reference list and try applying different heuristics on each name in second list to figure out if that exists in the first list. So, I have to query first list 2-3 times for every name in second list. Hope, this makes sense.

like image 623
Boolean Avatar asked Nov 06 '22 00:11

Boolean


2 Answers

Wow. A hashmap (dictionary) might not be the structure you're looking for.

Instead of using strings, try a representation that can give you good and fast hashing. Or are you really storing strings? If so, scracth the "might" in the previous sentence.

Could you give us details about the problem you're tackling?

like image 67
slezica Avatar answered Nov 09 '22 16:11

slezica


Questions: Is this a scaling issue ? Are you finding that the code runs more than twice as slow when you have twice as much data? Is it possible that you are running out of physical memory and using swap memory ?

10 million strings of 100 characters each is a gigabyte. If you have 2 sets of those, that would be 2 gigabytes, which is close to the limit of a 32 bit WinXP process.

If you don't already know the answer to this question, I would recommend running a test with the database at various sizes ( powers of 10 or 2 ) and see if the performance curve has a discontinuity.

like image 40
GeePokey Avatar answered Nov 09 '22 14:11

GeePokey