
Pympler summary doesn't seem to make sense

I'm doing some sanity checks with Pympler to make sure that I understand results when I try to profile an actual script, but I'm a bit puzzled at the results. Here are the sanity checks I've tried:

SANITY CHECK 1: I fire up a Python (3) console and do the following:

from pympler import summary, muppy
sum = summary.summarize(muppy.get_objects())
summary.print_(sum)

This results in the following summary:

                               types |   # objects |   total size
==================================== | =========== | ============
                         <class 'str |       16047 |      1.71 MB
                        <class 'dict |        2074 |      1.59 MB
                        <class 'type |         678 |    678.27 KB
                        <class 'code |        4789 |    673.68 KB
                         <class 'set |         464 |    211.00 KB
                        <class 'list |        1319 |    147.16 KB
                       <class 'tuple |        1810 |    120.03 KB
                     <class 'weakref |        1269 |     99.14 KB
          <class 'wrapper_descriptor |        1124 |     87.81 KB
  <class 'builtin_function_or_method |         918 |     64.55 KB
                 <class 'abc.ABCMeta |          64 |     62.25 KB
           <class 'method_descriptor |         877 |     61.66 KB
                         <class 'int |        1958 |     58.88 KB
           <class 'getset_descriptor |         696 |     48.94 KB
                 function (__init__) |         306 |     40.64 KB

If I've just fired up a new Python session, how are there all these strings, dictionaries, lists etc. in memory already? I don't think that Pympler is summarizing the results across all sessions (that would make no sense, but it's the only possibility I could think of).

SANITY CHECK 2: Since I don't quite understand the summary results of a tabula rasa Python session, let's look at the difference in summary after I've defined a few variables/data structures. I fire up another console and do the following:

from pympler import summary, muppy
sum = summary.summarize(muppy.get_objects())
a = {}
b = {}
c = {}
d = {'a': [0, 0, 1, 2], 't': [3, 3, 3, 1]}
sum1 = summary.summarize(muppy.get_objects())
summary.print_(summary.get_diff(sum, sum1))

This results in the following summary:

                         types |   # objects |   total size
============================== | =========== | ============
                  <class 'list |        3247 |    305.05 KB
                   <class 'str |        3234 |    226.04 KB
                   <class 'int |         552 |     15.09 KB
                  <class 'dict |           1 |    480     B
              function (_keys) |           0 |      0     B
           function (get_path) |           0 |      0     B
          function (http_open) |           0 |      0     B
            function (memoize) |           0 |      0     B
                function (see) |           0 |      0     B
           function (recvfrom) |           0 |      0     B
              function (rfind) |           0 |      0     B
      function (wm_focusmodel) |           0 |      0     B
    function (_parse_makefile) |           0 |      0     B
  function (_decode_pax_field) |           0 |      0     B
             function (__gt__) |           0 |      0     B

I thought I'd just initialized four new dictionaries (albeit 3 are empty), so why does Muppy show a difference of only 1 new dictionary object? Furthermore, why are there thousands of new strings and lists, not to mention the ints?

SANITY CHECK 3: Yet again, I start a new Python session but this time want to see how Pympler handles more complex data types like a list of dictionaries.

from pympler import muppy, summary
sum = summary.summarize(muppy.get_objects())
a = [{}, {}, {}, {'a': [0, 0, 1, 2], 't': [3, 3, 3, 1]}, {'a': [1, 2, 3, 4]}]
sum1 = summary.summarize(muppy.get_objects())
summary.print_(summary.get_diff(sum, sum1))

Which results in the following summary:

                                                types |   # objects |   total size
===================================================== | =========== | ============
                                         <class 'list |        3233 |    303.88 KB
                                          <class 'str |        3270 |    230.71 KB
                                          <class 'int |         554 |     15.16 KB
                                         <class 'dict |          10 |      5.53 KB
                                         <class 'code |          16 |      2.25 KB
                                         <class 'type |           2 |      1.98 KB
                                        <class 'tuple |           6 |    512     B
                            <class 'getset_descriptor |           4 |    288     B
                                  function (__init__) |           2 |    272     B
  <class '_frozen_importlib_external.SourceFileLoader |           3 |    168     B
                 <class '_frozen_importlib.ModuleSpec |           3 |    168     B
                                      <class 'weakref |           2 |    160     B
                                  function (__call__) |           1 |    136     B
                                      function (Find) |           1 |    136     B
                                  function (<lambda>) |           1 |    136     B

Even though the lists and dictionaries are nested in a somewhat convoluted way, by my count I added five new dictionaries and four new lists.

Can someone explain how Muppy is counting objects?

itf asked Apr 13 '16 21:04




1 Answer

1. get_objects in a new Python session

summary.summarize(muppy.get_objects()) summarizes every object alive at that moment, including those instantiated during interpreter startup and while from pympler import summary, muppy ran, which explains the large counts.
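As a rough illustration that needs only the standard library, the sketch below counts the GC-tracked objects that already exist in a session. The helper name count_tracked_types is made up for this example, and the numbers will differ from Pympler's because gc.get_objects() only reports GC-tracked objects (so strings and ints are not listed here).

```python
import gc
from collections import Counter

def count_tracked_types(top=5):
    """Return the most common type names among GC-tracked objects."""
    counts = Counter(type(o).__name__ for o in gc.get_objects())
    return counts.most_common(top)

# Even a "fresh" interpreter already holds many functions, dicts, tuples, ...
for name, n in count_tracked_types():
    print(f"{name:>30} | {n}")
```

Running this right after starting the interpreter, before importing anything else, gives a feel for how much state startup alone creates.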

2. The difference between two get_objects invocations

2.1. Lots of new objects we didn't create

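Remember that the sum object generated by summary.summarize() was created after the first snapshot, so it is itself counted in the second one. A summary is a list of rows holding type names and numbers, which explains the "thousands of new strings and lists" (and the ints). You can fix this by rewriting your test as: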

from pympler import summary, muppy
o1 = muppy.get_objects()
a = {}
b = {}
c = {}
d = {'a': [0, 0, 1, 2], 't': [3, 3, 3, 1]}
o2 = muppy.get_objects()
summary.print_(summary.get_diff(summary.summarize(o1), summary.summarize(o2)))

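This reduces the extraneous diffs to the large list o1 itself and a couple of other objects. Listing the added objects, with diff taken here as muppy.get_diff(o1, o2), gives:

>>> diff = muppy.get_diff(o1, o2)
>>> for o in diff['+']:
...     print("%s - %s" % (type(o), o if len(o) < 10 else "long list"))
...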
<class 'str'> - o2
<class 'list'> - long list
<class 'dict'> - {'a': [0, 0, 1, 2], 't': [3, 3, 3, 1]}
<class 'list'> - ['o2', 'muppy', 'get_objects']
<class 'list'> - [0, 0, 1, 2]
<class 'list'> - [3, 3, 3, 1]
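The same effect can be reproduced without Pympler. The following is a minimal sketch, not Pympler's implementation: it diffs two gc.get_objects() snapshots by object identity. The names snapshot, before_ids and added are chosen for this example only.

```python
import gc

# First snapshot. Keeping the list alive guarantees that the ids
# recorded below cannot be reused by objects created later.
snapshot = gc.get_objects()
before_ids = {id(o) for o in snapshot}

d = {'a': [0, 0, 1, 2], 't': [3, 3, 3, 1]}

# Second snapshot: keep only GC-tracked objects that were not seen before.
added = [o for o in gc.get_objects() if id(o) not in before_ids]

for o in added:
    # The snapshot list is huge, so print only a short description of it.
    desc = "long list" if o is snapshot else type(o).__name__
    print(desc)
```

Besides d and its two inner lists, the result also contains the bookkeeping created for the first snapshot (the snapshot list itself, for instance), which is the same kind of noise as the "long list" entry above.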

2.2. The mismatch in the number of dicts created and reported

To understand this, we need to know what exactly Pympler is inspecting.

The implementation of muppy.get_objects relies on two steps:

  • Python's gc.get_objects(), which returns "a list of all objects tracked by the collector" (see gc.is_tracked), with stack frames excluded. The gc.is_tracked documentation explains what "tracked" means:

    "As a general rule, instances of atomic types aren’t tracked and instances of non-atomic types (containers, user-defined objects…) are. However, some type-specific optimizations can be present in order to suppress the garbage collector footprint of simple instances (e.g. dicts containing only atomic keys and values)."

  • It then adds the objects referred to (gc.get_referents) by the objects obtained in the first step, but excludes "container objects": those that have Py_TPFLAGS_HAVE_GC set in their type's __flags__. (This seems to be a bug, since excluding all container objects misses the "simple instances" of container types that are not GC-tracked. Update: this should be fixed in v0.8, released 2019-11-12.)
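The two steps above can be approximated as follows. This is a sketch of the described behaviour under CPython, not Pympler's actual code; the function name collect_like_old_muppy is hypothetical, and the flag value 1 << 14 is CPython's Py_TPFLAGS_HAVE_GC bit.

```python
import gc

# CPython's Py_TPFLAGS_HAVE_GC type flag (bit 14).
PY_TPFLAGS_HAVE_GC = 1 << 14

def collect_like_old_muppy():
    """Approximate the two collection steps described in the text."""
    # Step 1: everything the garbage collector tracks.
    tracked = gc.get_objects()
    seen = {id(o) for o in tracked}

    # Step 2: add referents of those objects, skipping every object
    # whose type supports GC (the "container objects").
    extra = []
    for obj in tracked:
        for ref in gc.get_referents(obj):
            if type(ref).__flags__ & PY_TPFLAGS_HAVE_GC:
                continue  # skipped even if this instance is untracked
            if id(ref) not in seen:
                seen.add(id(ref))
                extra.append(ref)
    return tracked, extra

tracked, extra = collect_like_old_muppy()
print(len(tracked), "tracked objects,", len(extra), "extra referents")
```

The skip in step 2 is where an untracked instance of a container type, such as a dict holding only atomic values, falls through: it is not in the step 1 list and is excluded from step 2 because of its type.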

Store the object list o2 as suggested above, then check which objects are accounted for with this helper:

def tracked(obj_list, obj):
    import gc
    return {"tracked_by_muppy": any(id(item) == id(obj) for item in obj_list),
            "gc_tracked": gc.is_tracked(obj)}

You'll see that:

  • Empty dicts are not GC-tracked, and because they are referred to only from local variables, muppy does not account for them:

      tracked(o2, a)  # => {'tracked_by_muppy': False, 'gc_tracked': False}
    
  • The non-trivial dict d is GC-tracked and thus appears in the muppy report:

      tracked(o2, d)  # => {'tracked_by_muppy': True, 'gc_tracked': True}
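To check the GC side of this on your own interpreter, a small sketch like the one below may help. The helper tracking_report and the sample labels are made up for this example. No expected output is shown for the empty dict because whether simple dicts are tracked is a CPython optimization and may vary between versions.

```python
import gc

def tracking_report(samples):
    """Map each label to whether the GC currently tracks the sample object."""
    return {label: gc.is_tracked(obj) for label, obj in samples.items()}

samples = {
    "int": 42,
    "list": [],
    "empty dict": {},
    "dict of lists": {'a': [0, 0, 1, 2], 't': [3, 3, 3, 1]},
}

for label, is_tracked in tracking_report(samples).items():
    print(f"{label:>15}: {is_tracked}")
```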
    
Nickolay answered Nov 02 '22 04:11