Why does a newly created variable in Python have a ref-count of four?

Q: Why does Python use reference counting?

Reference counting deallocates objects sooner than garbage collection. But as reference counting can't handle reference cycles between unreachable objects, Python uses a garbage collector (really just a cycle collector) to collect those cycles when they exist.
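A minimal sketch of that split, assuming CPython (the `Node` class and `demo` function are made up for this illustration). It uses weak references to watch when objects are actually freed: a lone object disappears the moment its last reference is deleted, while a two-object cycle lingers until the cycle collector runs.

```python
import gc
import weakref


class Node:
    """Hypothetical helper class; 'other' lets two nodes point at each other."""

    def __init__(self):
        self.other = None


def demo():
    gc.disable()  # keep automatic cycle collection out of the picture
    try:
        # No cycle: reference counting frees the object right away.
        a = Node()
        ref_a = weakref.ref(a)
        del a
        freed_immediately = ref_a() is None

        # Cycle: each node keeps the other's count above zero.
        b, c = Node(), Node()
        b.other, c.other = c, b
        ref_b = weakref.ref(b)
        del b, c
        survives_del = ref_b() is not None

        # Only the cycle collector can reclaim the pair.
        gc.collect()
        collected = ref_b() is None
    finally:
        gc.enable()
    return freed_immediately, survives_del, collected


print(demo())
```

On CPython all three flags should come back `True`; other implementations manage memory differently and may not behave this way.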

Q: What is the importance of reference counting?

Reference counting lets clients of a library hold on to heap objects that the library created, while the library keeps track of how many references are still active. When the reference count drops to zero, the memory used by the object can safely be freed.

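Q: What is sys.getrefcount?

sys.getrefcount(obj) returns the current reference count of obj. The count is generally one higher than you might expect, because it includes the temporary reference held by the argument passed to getrefcount() itself. A small result for a value such as the integer 24601 indicates that the object is referenced by your own code and is not held anywhere else in the interpreter by default.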
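Because the absolute number returned by `sys.getrefcount()` depends on the interpreter version and on how the code is run, it is often easier to reason about differences. The sketch below (the function name `refcount_deltas` is made up here, and CPython is assumed) measures how the count moves relative to a baseline as references are added and removed.

```python
import sys


def refcount_deltas():
    obj = object()
    baseline = sys.getrefcount(obj)  # includes getrefcount's own overhead

    holders = [obj, obj]  # the list holds two more references
    with_list = sys.getrefcount(obj) - baseline

    alias = obj  # a second local name adds one more
    with_alias = sys.getrefcount(obj) - baseline

    del holders, alias  # dropping both returns to the baseline
    after_cleanup = sys.getrefcount(obj) - baseline

    return with_list, with_alias, after_cleanup


print(refcount_deltas())
```

On CPython the deltas should be 2, 3 and 0, whatever the baseline happens to be.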

Tags:

python

I've been working on a presentation for colleagues to explain the basic behavior of and reasoning behind the GIL, and found something I couldn't explain while putting together a quick explanation of reference counting. It appears that newly declared variables have four references, instead of the one I would expect. For example, the following code:

import sys

the_var = 'Hello World!'
print('Var created: {} references'.format(sys.getrefcount(the_var)))

Results in this output:

Var created: 4 references

I validated that the output was the same if I used an integer > 100 (< 100 are pre-created and have a larger ref-count) or a float and if I declared the variable within a function scope or in a loop. The outcome was the same. The behavior also seems to be the same in 2.7.11 and 3.5.1.

I attempted to debug sys.getrefcount to see whether it was creating additional references, but was unable to step into the function (I'm assuming it is a direct thunk down to the C layer).

I know I'm gonna get at least one question on this when I present, and I'm actually pretty puzzled by the output anyway. Can anyone explain this behavior to me?


asked Jul 10 '17 21:07

J.E.Merrill

1 Answer

There are several scenarios that will yield a different reference count. The most straightforward is the REPL console:

>>> import sys
>>> the_var = 'Hello World!'
>>> print(sys.getrefcount(the_var))
2

Understanding this result is pretty straightforward - there is one reference in the local stack and another temporary one local to the sys.getrefcount() function (even the documentation warns about it: "The count returned is generally one higher than you might expect"). But when you run it as a standalone script:

import sys

the_var = 'Hello World!'
print(sys.getrefcount(the_var))
# 4

as you've noticed, you get a 4. So what gives? Well, let's investigate... There is a very helpful interface to the garbage collector - the gc module - so if we run it in the REPL console:

>>> import gc
>>> the_var = 'Hello World!'
>>> gc.get_referrers(the_var)
[{'__builtins__': <module '__builtin__' (built-in)>, '__package__': None, 'the_var': 'Hello World!', 'gc': <module 'gc' (built-in)>, '__name__': '__main__', '__doc__': None}]

No surprises there - that's essentially just the current namespace (locals()), as the variable doesn't exist anywhere else. But what happens when we run that as a standalone script:

import gc
import pprint

the_var = 'Hello World!'
pprint.pprint(gc.get_referrers(the_var))

this prints out (YMMV, based on your Python version):

[['gc',
  'pprint',
  'the_var',
  'Hello World!',
  'pprint',
  'pprint',
  'gc',
  'get_referrers',
  'the_var'],
 (-1, None, 'Hello World!'),
 {'__builtins__': <module '__builtin__' (built-in)>,
  '__doc__': None,
  '__file__': 'test.py',
  '__name__': '__main__',
  '__package__': None,
  'gc': <module 'gc' (built-in)>,
  'pprint': <module 'pprint' from 'D:\Dev\Python\Py27-64\lib\pprint.pyc'>,
  'the_var': 'Hello World!'}]

Sure enough, we have two more references in the list, just as sys.getrefcount() told us, but what the hell are those? Well, when the Python interpreter runs your script, it first needs to compile it to bytecode - and while it does, it stores all the strings in a list which, since it contains your variable's value as well, counts as a reference to it.

The second, more cryptic entry ((-1, None, 'Hello World!')) comes from the peep-hole optimizer and is there just to optimize access (to the string reference in this case).
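One place where compilation keeps a reference to a literal, which you can inspect directly, is the `co_consts` tuple of the resulting code object. The sketch below assumes CPython, and `constant_is_shared` is a made-up helper name. It illustrates the general mechanism only and makes no claim about which internal structure produced each entry in the output above.

```python
# Compile the assignment separately so we can look at the code object.
code = compile("the_var = 'Hello World!'", "<string>", "exec")


def constant_is_shared(code_obj):
    """Return True if the executed variable is the very object stored in co_consts."""
    namespace = {}
    exec(code_obj, namespace)
    matches = [c for c in code_obj.co_consts if c == 'Hello World!']
    return bool(matches) and namespace['the_var'] is matches[0]


print('Hello World!' in code.co_consts)
print(constant_is_shared(code))
```

As long as the code object is alive, its constants tuple keeps the string alive too, which is one reason a literal in a script can have more references than the single name you assigned it to.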

Both of those are purely temporary and optional - the REPL console does context separation, so you don't see these references. If you were to 'outsource' your compiling from your current context:

import gc
import pprint

exec(compile("the_var = 'Hello World!'", "<string>", "exec"))
pprint.pprint(gc.get_referrers(the_var))

you'd get:

[{'__builtins__': <module '__builtin__' (built-in)>,
  '__doc__': None,
  '__file__': 'test.py',
  '__name__': '__main__',
  '__package__': None,
  'gc': <module 'gc' (built-in)>,
  'pprint': <module 'pprint' from 'D:\Dev\Python\Py27-64\lib\pprint.pyc'>,
  'the_var': 'Hello World!'}]

and if you were to go back to the original attempt at getting the reference count via sys.getrefcount():

import sys

exec(compile("the_var = 'Hello World!'", "<string>", "exec"))
print(sys.getrefcount(the_var))
# 2

just like in the REPL console, and just as expected. The extra reference due to the peep-hole optimizing, since it happens in-place, can be immediately discarded by forcing garbage collection (gc.collect()) before counting your references.

However, the string list that is created during compilation cannot be released until the whole file has been parsed and compiled, which is why, if you were to import your script in another script and then count the references to the_var from it, you'd get 3 instead of 4, just when you thought it couldn't confuse you any more ;)
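If you want to repeat this kind of investigation on your own objects, a small helper that summarizes `gc.get_referrers()` by type can be easier to read than the raw output. This is only a sketch: `Payload`, `referrer_types` and `demo` are names invented for the example, and `gc.get_referrers()` only reports containers that the garbage collector tracks, so the result may vary between Python versions.

```python
import gc
from collections import Counter


class Payload:
    """Hypothetical GC-tracked object used as the thing being referenced."""


def referrer_types(obj):
    """Count the GC-visible referrers of obj, grouped by type name."""
    return Counter(type(r).__name__ for r in gc.get_referrers(obj))


def demo():
    target = Payload()
    in_list = [target]          # one list referrer
    in_dict = {"key": target}   # one dict referrer
    summary = referrer_types(target)
    return summary["list"], summary["dict"]


print(demo())
```

Running the same helper on a variable at module level in a script and in the REPL is a quick way to see the differences described above for your own interpreter.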

answered Oct 12 '22 11:10

zwer
