Trade-off in Python dictionary key types

Tags: python, dictionary

Say I'm going to construct a potentially large dictionary in Python 3 for in-memory operations. The dictionary keys are integers, but I'll initially read them from a file as strings.

As far as storage and retrieval are concerned, I wonder if it matters whether I store the dictionary keys as integers themselves, or as strings.
In other words, would leaving them as integers help with hashing?

asked Dec 22 '15 10:12

Jeenu

2 Answers

Dicts are fast but can be heavy on memory. Normally that shouldn't be a problem, but you will only know once you test. I would advise testing with 1,000 lines first, then 10,000 lines and so on, and looking at the memory footprint at each step.
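One way to run that kind of test is with the standard-library tracemalloc module. The following is only a sketch: the helper name measure_dict_memory and the key counts are made up for illustration, and the numbers you get depend on your Python version and platform.

```python
import tracemalloc


def measure_dict_memory(make_key, n):
    """Return approximate bytes allocated while building a dict of n entries.

    make_key is a hypothetical callback that turns an index into a key,
    for example int or str.
    """
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    # Values are all None so that only the keys and the dict table are measured.
    d = {make_key(i): None for i in range(n)}
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    assert len(d) == n  # keeps d alive until after the measurement
    return after - before


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        as_int = measure_dict_memory(int, n)
        as_str = measure_dict_memory(str, n)
        print(f"{n:>7} keys: int={as_int:,} B  str={as_str:,} B")
```

On CPython I would typically expect the str-keyed dict to come out larger, since a short string object is bigger than a small int object, but measure it on your own data rather than taking that for granted.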

If you run out of memory and your data structure allows it, consider using named tuples.

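import csv
from collections import namedtuple

EmployeeRecord = namedtuple('EmployeeRecord', 'name, age, title, department, paygrade')
# In Python 3, open CSV files in text mode with newline="" instead of "rb"
with open("employees.csv", newline="") as f:
    for emp in map(EmployeeRecord._make, csv.reader(f)):
        print(emp.name, emp.title)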

(Example adapted from the Python documentation for collections.namedtuple)

If your keys are ascending integers, you could also get fancier and use the array module.
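To illustrate the array idea: if the keys happen to be the consecutive integers 0, 1, 2, ... (an assumption, the question doesn't say so), the key can simply be the position in an array.array, which stores raw machine values instead of Python objects. The data and the type code below are invented for the example.

```python
import sys
from array import array

# Assumed data: keys are 0..n-1 and the values fit in a signed 64-bit integer.
values_by_key = {i: i * 2 for i in range(10_000)}

# 'q' = signed 64-bit integers; the key becomes the index into the array.
packed = array("q", (values_by_key[i] for i in range(len(values_by_key))))

# Lookup by "key" is now plain indexing.
assert packed[1234] == values_by_key[1234]

# sys.getsizeof on a dict counts only the hash table, not the key and value
# objects it points to, so the dict figure understates its real footprint.
print("dict table:", sys.getsizeof(values_by_key), "bytes")
print("array:     ", sys.getsizeof(packed), "bytes")
```

This only works when the keys are dense; with large gaps between keys the array would waste space on unused slots and a dict is probably the better fit.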

answered Sep 30 '22 18:09

CausticHarmony

Actually, string hashing is rather efficient in Python 3. I expected this to have the opposite outcome:

>>> from timeit import timeit
>>> timeit('d["1"];d["4"]', setup='d = {"1": 1, "4": 4}')
0.05167865302064456
>>> timeit('d[1];d[4]', setup='d = {1: 1, 4: 4}')
0.06110116100171581
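Two keys in a tiny dict is a very small sample, so a somewhat larger benchmark may be more convincing. This is a sketch rather than a rigorous measurement: the dict size, the lookup pattern, and the helper name time_lookups are arbitrary choices for illustration, and results vary by machine and Python version.

```python
from timeit import timeit


def time_lookups(keys, number=200):
    """Time `number` passes of looking up every key in a dict built from keys."""
    d = {k: None for k in keys}

    def lookup_all():
        for k in keys:
            d[k]

    return timeit(lookup_all, number=number)


if __name__ == "__main__":
    n = 100_000
    int_keys = list(range(n))
    str_keys = [str(i) for i in int_keys]
    print("int keys:", time_lookups(int_keys))
    print("str keys:", time_lookups(str_keys))
    # If the keys arrive as strings, converting them once with int() at load
    # time has its own cost, which this benchmark does not include.
```

Keep in mind that CPython caches a string's hash on the string object after computing it once, so repeated lookups with the same string objects, as here and in the literals above, do not rehash each time. Lookups with freshly built strings could behave differently.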

answered Sep 30 '22 18:09

Klaus D.
