When does Python allocate new memory for identical strings?

Two Python strings with the same characters (a == b) may share memory (id(a) == id(b)) or may be stored in memory twice (id(a) != id(b)). Try:

    ab = "ab"
    print id( ab ), id( "a"+"b" )

Here Python recognizes that the newly created "a"+"b" is the same as the "ab" already in memory -- not bad.

Now consider an N-long list of state names [ "Arizona", "Alaska", "Alaska", "California" ... ] (N ~ 500000 in my case).
I see 50 different id()s ⇒ each string "Arizona" ... is stored only once, fine.
BUT write the list to disk and read it back in again: the "same" list now has N different id()s and uses far more memory; see below.

How come -- can anyone explain Python string memory allocation?

    """ when does Python allocate new memory for identical strings ?
        ab = "ab"
        print id( ab ), id( "a"+"b" )  # same !
        list of N names from 50 states: 50 ids, mem ~ 4N + 50S, each string once
        but list > file > mem again: N ids, mem ~ N * (4 + S)
    """

    from __future__ import division
    from collections import defaultdict
    from copy import copy
    import cPickle
    import random
    import sys

    states = dict(
        AL = "Alabama",
        AK = "Alaska",
        AZ = "Arizona",
        AR = "Arkansas",
        CA = "California",
        CO = "Colorado",
        CT = "Connecticut",
        DE = "Delaware",
        FL = "Florida",
        GA = "Georgia",
    )

    def nid(alist):
        """ nr distinct ids """
        return "%d ids  %d pickle len" % (
            len( set( map( id, alist ))),
            len( cPickle.dumps( alist, 0 )))  # rough est ?
    # cf http://stackoverflow.com/questions/2117255/python-deep-getsizeof-list-with-contents

    N = 10000
    exec( "\n".join( sys.argv[1:] ))  # var=val ...
    random.seed(1)

    # big list of random names of states --
    names = []
    for j in xrange(N):
        name = copy( random.choice( states.values() ))
        names.append(name)
    print "%d strings in mem:  %s" % (N, nid(names) )  # 10 ids, even with copy()

    # list to a file, back again -- each string is allocated anew
    joinsplit = "\n".join(names).split()  # same as > file > mem again
    assert joinsplit == names
    print "%d strings from a file:  %s" % (N, nid(joinsplit) )

    # 10000 strings in mem:  10 ids  42149 pickle len
    # 10000 strings from a file:  10000 ids  188080 pickle len
    # Python 2.6.4 mac ppc
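For readers on Python 3, the same experiment can be approximated with the sketch below. It is not the original script: `distinct_ids` and `approx_bytes` are names made up for this illustration, `sys.getsizeof` stands in for the pickle-length estimate, and the counts it prints are CPython behaviour rather than anything the language guarantees.

```python
import random
import sys

STATES = ["Alabama", "Alaska", "Arizona", "Arkansas", "California",
          "Colorado", "Connecticut", "Delaware", "Florida", "Georgia"]


def distinct_ids(strings):
    """Number of distinct string objects referenced by the list."""
    return len({id(s) for s in strings})


def approx_bytes(strings):
    """Rough footprint: the list itself plus each distinct string object once."""
    unique = {id(s): s for s in strings}
    return sys.getsizeof(strings) + sum(sys.getsizeof(s) for s in unique.values())


random.seed(1)
N = 10000

# Every element is a reference to one of the 10 objects in STATES.
names = [random.choice(STATES) for _ in range(N)]

# join + split stands in for "write to a file, read back":
# each piece returned by split() is a freshly built string.
roundtrip = "\n".join(names).split()

assert roundtrip == names
print("in memory:  %d ids, ~%d bytes" % (distinct_ids(names), approx_bytes(names)))
print("round trip: %d ids, ~%d bytes" % (distinct_ids(roundtrip), approx_bytes(roundtrip)))
```

The two lists compare equal, yet the second one should report far more distinct ids and a much larger estimate, because nothing in the join/split path looks for an existing equal string.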

Added 25jan:
There are two kinds of strings in Python memory (or any program's):

Ustrings, in a Ucache of unique strings: these save memory, and make a == b fast if both are in Ucache
Ostrings, the others, which may be stored any number of times.

intern(astring) puts astring in the Ucache (Alex +1); other than that we know nothing at all about how Python moves Ostrings to the Ucache -- how did "a"+"b" get in, after "ab" ? ("Strings from files" is meaningless -- there's no way of knowing.)
In short, Ucaches (there may be several) remain murky.

A historical footnote: SPITBOL uniquified all strings ca. 1970.

asked Jan 23 '10 17:01

denis

2 Answers

Each implementation of the Python language is free to make its own tradeoffs in allocating immutable objects (such as strings): making a new one, or finding an existing equal one and adding one more reference to it, are both just fine from the language's point of view. In practice, of course, real-world implementations strike a reasonable compromise: they add one more reference to a suitable existing object when locating such an object is cheap and easy, and just make a new object if locating a suitable existing one (which may or may not exist) looks like it could take a long time.

So, for example, multiple occurrences of the same string literal within a single function will (in all implementations I know of) use the "new reference to same object" strategy, because when building that function's constants-pool it's pretty fast and easy to avoid duplicates; but doing so across separate functions could potentially be a very time-consuming task, so real-world implementations either don't do it at all, or only do it in some heuristically identified subset of cases where one can hope for a reasonable tradeoff of compilation time (slowed down by searching for identical existing constants) vs memory consumption (increased if new copies of constants keep being made).
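A small sketch of the per-function constants pool, assuming CPython and Python 3 syntax (the function names are invented for the illustration). The literal is deliberately not identifier-like, so any sharing seen inside `same_function` comes from the constants pool rather than from automatic interning of name-like strings.

```python
def same_function():
    first = "not an identifier!"
    second = "not an identifier!"
    return first, second


def other_function():
    return "not an identifier!"


first, second = same_function()

# Within one function the compiler keeps a single entry for the literal,
# so both names end up referring to the same object.
print(first is second)
print(same_function.__code__.co_consts.count("not an identifier!"))

# Across functions the answer depends on the implementation and version,
# so it is printed here without any claim about the result.
print(first is other_function())
```

The last line is the interesting one: whether two functions share the object is exactly the kind of compile-time tradeoff described above, and it should not be relied on.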

I don't know of any implementation of Python (or for that matter other languages with constant strings, such as Java) that takes the trouble of identifying possible duplicates (to reuse a single object via multiple references) when reading data from a file -- it just doesn't seem to be a promising tradeoff (and here you'd be paying runtime, not compile time, so the tradeoff is even less attractive). Of course, if you know (thanks to application-level considerations) that such immutable objects are large and quite prone to many duplications, you can implement your own "constants-pool" strategy quite easily (intern, which is sys.intern in Python 3, can help you do it for strings, but it's not hard to roll your own for, e.g., tuples with immutable items, huge long integers, and so forth).
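A minimal sketch of both options, in Python 3 syntax. `sys.intern` is the standard-library call (the builtin `intern` in Python 2); `intern_all` and `Pool` are hypothetical helpers written for this example, not part of any library.

```python
import sys


def intern_all(strings):
    """Return a list in which equal strings share one object."""
    return [sys.intern(s) for s in strings]


class Pool:
    """Hypothetical roll-your-own pool for hashable immutable values."""

    def __init__(self):
        self._canonical = {}

    def get(self, value):
        # Return the first equal value seen so far; remember it if it is new.
        return self._canonical.setdefault(value, value)


# Strings as they would arrive from a file: one new object per line.
raw = "Alaska\nArizona\nAlaska\nAlaska".split()
pooled = intern_all(raw)
print(len({id(s) for s in raw}), len({id(s) for s in pooled}))

# The same idea for tuples, which sys.intern does not accept.
pool = Pool()
t1 = pool.get(tuple([1, 2, 3]))
t2 = pool.get(tuple([1, 2, 3]))
print(t1 is t2)
```

One design caveat for the dict-based pool: values that compare equal across types, such as 1 and 1.0, collapse to a single entry, so a pool like this is best kept to one type of value.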

answered Sep 30 '22 17:09

Alex Martelli

I strongly suspect that Python is behaving like many other languages here - recognising string constants within your source code and using a common table for those, but not applying the same rules when creating strings dynamically. This makes sense as there will only be a finite set of strings within your source code (although Python lets you evaluate code dynamically, of course) whereas it's much more likely that you'll be creating huge numbers of strings in the course of your program.

This process is generally called interning - and indeed by the looks of this page it's called interning in Python, too.
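The difference between source-code constants and dynamically created strings can be seen in a short sketch, assuming CPython and Python 3 syntax (the function names are made up here). In CPython, `"a" + "b"` with two literal operands is folded into the constant `"ab"` at compile time, which is how it ends up as the same object as an existing `"ab"` literal; a concatenation of run-time values gets no such treatment.

```python
import sys


def folded():
    return "a" + "b"        # both operands are literals


def at_runtime(left, right):
    return left + right     # operands are only known at run time


ab = "ab"
dynamic = at_runtime("a", "b")

# The folded result is stored as a constant of the function.
print("ab" in folded.__code__.co_consts)

# Identity checks: implementation details, printed without claiming a result.
print(folded() is ab)
print(dynamic is ab)

# After explicit interning, equal strings are guaranteed to be one object.
print(sys.intern(dynamic) is sys.intern(ab))
```

Only the last line is backed by documented behaviour; the identity checks before it merely show what a given interpreter happens to do.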

answered Sep 30 '22 19:09

Jon Skeet

Tags: python, memory-management, memory
