I've been working on a presentation for colleagues to explain the basic behavior of and reasoning behind the GIL, and found something I couldn't explain while putting together a quick explanation of reference counting. It appears that newly declared variables have four references, instead of the one I would expect. For example, the following code:
import sys
the_var = 'Hello World!'
print('Var created: {} references'.format(sys.getrefcount(the_var)))
Results in this output:
Var created: 4 references
I validated that the output was the same if I used a large integer (small integers, -5 through 256 in CPython, are pre-created and have a larger ref-count) or a float, and if I declared the variable within a function scope or in a loop. The outcome was the same. The behavior also seems to be the same in 2.7.11 and 3.5.1.
I attempted to debug sys.getrefcount to see whether it was creating additional references, but was unable to step into the function (I'm assuming it is a direct thunk down to the C layer).
I know I'm gonna get at least one question on this when I present, and I'm actually pretty puzzled by the output anyway. Can anyone explain this behavior to me?
Reference counting deallocates objects sooner than garbage collection. But as reference counting can't handle reference cycles between unreachable objects, Python uses a garbage collector (really just a cycle collector) to collect those cycles when they exist.
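As a rough illustration of the two mechanisms, here is a minimal sketch, assuming a default CPython build. The Node class and the make_cycle helper are made up for this example, and weakref is only used to observe whether an object is still alive.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can point at another Node."""
    def __init__(self):
        self.other = None


def make_cycle():
    # Two objects that reference each other; nothing else refers to them
    # once this function returns.
    a, b = Node(), Node()
    a.other, b.other = b, a
    return weakref.ref(a)


# Case 1: no cycle. Dropping the last reference frees the object at once.
single = Node()
single_ref = weakref.ref(single)
del single
freed_immediately = single_ref() is None

# Case 2: a reference cycle. Reference counting alone never reaches zero,
# so the objects stay alive until the cycle collector runs.
gc.disable()  # keep the automatic collector from running in between
cycle_ref = make_cycle()
alive_before = cycle_ref() is not None
gc.collect()
alive_after = cycle_ref() is not None
gc.enable()

print(freed_immediately, alive_before, alive_after)
```

On CPython this should show the plain object freed immediately, while the cycle survives until gc.collect() is called.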
Reference counting allows clients of your library to keep reference objects created by your library on the heap and allows you to keep track of how many references are still active. When the reference count goes to zero you can safely free the memory used by the object.
getrefcount(number) basically means that number is used in your current code but isn't used anywhere else in Python. So based on our experiments above, it looks like the integer 24601 isn't used anywhere by default in Python. What happens if we run sys.
There are several scenarios that will yield a different reference count. The most straightforward is the REPL console:
>>> import sys
>>> the_var = 'Hello World!'
>>> print(sys.getrefcount(the_var))
2
Understanding this result is pretty straightforward: there is one reference in the local stack and another temporary one local to the sys.getrefcount() function (even the documentation warns about it: "The count returned is generally one higher than you might expect"). But when you run it as a standalone script:
import sys
the_var = 'Hello World!'
print(sys.getrefcount(the_var))
# 4
As you've noticed, you get a 4. So what gives? Well, let's investigate... There is a very helpful interface to the garbage collector, the gc module, so if we run it in the REPL console:
>>> import gc
>>> the_var = 'Hello World!'
>>> gc.get_referrers(the_var)
[{'__builtins__': <module '__builtin__' (built-in)>, '__package__': None, 'the_var': 'Hello World!', 'gc': <module 'gc' (built-in)>, '__name__': '__main__', '__doc__': None}]
No wonders there: that's essentially just the current namespace (locals()), as the variable doesn't exist anywhere else. But what happens when we run that as a standalone script:
import gc
import pprint
the_var = 'Hello World!'
pprint.pprint(gc.get_referrers(the_var))
this prints out (YMMV, based on your Python version):
[['gc',
  'pprint',
  'the_var',
  'Hello World!',
  'pprint',
  'pprint',
  'gc',
  'get_referrers',
  'the_var'],
 (-1, None, 'Hello World!'),
 {'__builtins__': <module '__builtin__' (built-in)>,
  '__doc__': None,
  '__file__': 'test.py',
  '__name__': '__main__',
  '__package__': None,
  'gc': <module 'gc' (built-in)>,
  'pprint': <module 'pprint' from 'D:\Dev\Python\Py27-64\lib\pprint.pyc'>,
  'the_var': 'Hello World!'}]
Sure enough, we have two more references in the list, just as sys.getrefcount() told us, but what the hell are those? Well, when the Python interpreter is parsing your script it first needs to compile it to bytecode, and while it does, it stores all the strings in a list which, since it contains your variable's value as well, counts as a reference to it.
The second, more cryptic entry ((-1, None, 'Hello World!')) comes from the peep-hole optimizer and is there just to optimize access (to the string reference in this case).
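A tuple of this shape closely resembles the constants tuple (co_consts) that a compiled code object carries around. The following is a minimal sketch, assuming CPython, of how to check that a code object holds its own reference to the string; the names code, namespace and held_by_consts are made up for this example, and the exact contents of co_consts vary between Python versions.

```python
# Compile the assignment without running it yet.
code = compile("the_var = 'Hello World!'", "<string>", "exec")

# The constants used by the compiled code are stored on the code object.
print(code.co_consts)

# Run the code in a separate namespace and check whether the object bound
# to the_var is the very same object that sits in co_consts.
namespace = {}
exec(code, namespace)
held_by_consts = any(const is namespace['the_var'] for const in code.co_consts)
print(held_by_consts)
```

As long as the code object is alive, its constants tuple keeps one extra reference to the string.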
Both of those are purely temporary and optional. The REPL console does context separation, so you don't see these references. If you were to 'outsource' your compiling from your current context:
import gc
import pprint
exec(compile("the_var = 'Hello World!'", "<string>", "exec"))
pprint.pprint(gc.get_referrers(the_var))
you'd get:
[{'__builtins__': <module '__builtin__' (built-in)>,
  '__doc__': None,
  '__file__': 'test.py',
  '__name__': '__main__',
  '__package__': None,
  'gc': <module 'gc' (built-in)>,
  'pprint': <module 'pprint' from 'D:\Dev\Python\Py27-64\lib\pprint.pyc'>,
  'the_var': 'Hello World!'}]
and if you were to go back to the original attempt at getting the reference count via sys.getrefcount():
import sys
exec(compile("the_var = 'Hello World!'", "<string>", "exec"))
print(sys.getrefcount(the_var))
# 2
just like in the REPL console, and just as expected. The extra reference due to the peep-hole optimization, since it happens in-place, can be immediately discarded by forcing garbage collection (gc.collect()) before counting your references.
However, the string list that is created during compilation cannot be released until the whole file has been parsed and compiled, which is why, if you were to import your script in another script and then count the references to the_var from it, you'd get 3 instead of 4, just when you thought it could not confuse you any more ;)
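Since the absolute number depends on how the code was compiled and run, a demo may be easier to follow if it compares counts before and after each step instead. This is only a sketch, assuming CPython; the_obj, alias and holder are illustrative names, and a fresh object() is used rather than a string literal so that compile-time references don't get in the way.

```python
import sys

the_obj = object()  # created at runtime, so no compile-time constants refer to it
baseline = sys.getrefcount(the_obj)

alias = the_obj  # a second name adds one reference
after_alias = sys.getrefcount(the_obj)

holder = [the_obj]  # a container adds another one
after_list = sys.getrefcount(the_obj)

del alias, holder  # dropping both brings the count back down
after_del = sys.getrefcount(the_obj)

# Differences relative to the baseline, which cancel out the extra
# references that come from how the script was compiled and run.
print(after_alias - baseline, after_list - baseline, after_del - baseline)
```

On CPython the printed differences should be 1, 2 and 0, regardless of whether the baseline itself is 2, 3 or 4.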