Python OrderedDict sputtering as compared to dict()

This one has me entirely baffled.

import collections

asset_hist = []
for key_host, val_hist_list in am_output.asset_history.items():
    for index, hist_item in enumerate(val_hist_list):
        #row = collections.OrderedDict([("computer_name", key_host), ("id", index), ("hist_item", hist_item)])
        row = {"computer_name": key_host, "id": index, "hist_item": hist_item}
        asset_hist.append(row)

This code works perfectly with the OrderedDict line commented out. However, when I swap the comments, commenting out the plain dict line and uncommenting the OrderedDict line, things get very strange. There are about 4 million of these rows being generated and appended to asset_hist.

So, when I use the plain dict, the entire loop finishes in about 10 milliseconds, lightning fast. When I use the ordered dictionary, I've waited over 10 minutes and it still didn't finish. Now, I know OrderedDict is supposed to be a little slower than dict, but it's supposed to be about 10x slower at worst, and by my math it's actually about 100,000 times slower in this function.
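
For reference, here's a quick way to time the two constructors in isolation with timeit (a rough sketch; absolute numbers will vary by machine and Python version):

import timeit

setup = "import collections"
plain = '{"computer_name": "host1", "id": 0, "hist_item": "x"}'
ordered = 'collections.OrderedDict([("computer_name", "host1"), ("id", 0), ("hist_item", "x")])'

print("dict:       ", timeit.timeit(plain, number=1000000))
print("OrderedDict:", timeit.timeit(ordered, setup, number=1000000))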

I decided to print the index in the lowest loop to see what was happening. Interestingly enough, I noticed a sputtering in console output. The index would print very fast on the screen and then stop for about 3-5 seconds before continuing on.

am_output.asset_history is a dictionary keyed by host, where each value is a list of strings. E.g.

am_output.asset_history = {"host1": ["string1", "string2", ...], "host2": ["string1", "string2", ...], ...}

EDIT: Sputter Analysis with OrderedDict

Total Memory on this VM Server: Only 8 GB... need to get more provisioned.

LOOP NUM

184796 (~5 second wait, ~60% memory usage)

634481 (~5 second wait, ~65% memory usage)

1197564 (~5 second wait, ~70% memory usage)

1899247 (~5 second wait, ~75% memory usage)

2777296 (~5 second wait, ~80% memory usage)

3873730 (LONG WAIT... waited 20 minutes and gave up!, 88.3% memory usage, process is still running)

Where the wait happens changes with each run.

EDIT: Ran it again; this time it stopped on 3873333, close to the spot where it stopped before. It stopped after forming the row, while trying to append it. I didn't notice this on the last attempt, but it was happening then too: the problem is with the append line, not the row line. I'm still baffled. Here's the row it produced right before the long stop (I added the row to the print statement); hostname changed to protect the innocent:

3873333: OrderedDict([('computer_name', 'bg-fd5612ea'), ('id', 1), ('hist_item', "sys1 Normalizer (sys1-4): Domain Name cannot be determined from sys1 Name 'bg-fd5612ea'.")])

asked Oct 30 '22 by gunslingor

1 Answer

As your own tests prove, you're running out of memory. Even on CPython 3.6 (where plain dict is actually ordered, though not as a language guarantee yet), OrderedDict has significant memory overhead compared to dict; it's still implemented with a side-band linked list to preserve the order and support easy iteration, reordering with move_to_end, etc. You can tell just by checking with sys.getsizeof (exact results will differ by Python version and build bitwidth, 32 vs. 64 bit):

>>> import sys
>>> from collections import OrderedDict
>>> od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
>>> d = {**od}
>>> sys.getsizeof(od)
464   # On 3.5 x64 it's 512
>>> sys.getsizeof(d)
240   # On 3.5 x64 it's 288

Ignoring the data stored, the overhead for the OrderedDict here is nearly twice that of the plain dict. If you're making 4 million of these items, on my machine that would add overhead of a titch over 850 MB (on both 3.5 and 3.6).
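
(That figure is just the 224-byte per-item difference times the row count; a quick back-of-the-envelope check:)

>>> (464 - 240) * 4000000 / 2**20   # per-item overhead delta x 4M rows, in MiB
854.4921875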

It's likely the combination of all the other programs on your system, plus your Python program, is exceeding the RAM allocated to your machine, and you're stuck swap thrashing. In particular, whenever asset_hist has to expand for new entries, it's likely needing to page in large parts of it (that got paged out for lack of use), and whenever a cyclic garbage collection run triggers (a full GC occurs roughly every 70,000 allocations and deallocations by default), all the OrderedDicts get paged back in to check if they're still referenced outside of cycles (you could check if the GC runs are the main problem by disabling cyclic GC via gc.disable()).
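
If you want to test whether the GC pauses are the culprit, here's a minimal sketch of that experiment, wrapping the original loop (gc.disable() only suspends cyclic collection; reference counting still reclaims non-cyclic garbage):

import collections
import gc

gc.disable()  # suspend cyclic garbage collection for the hot loop
try:
    asset_hist = []
    for key_host, val_hist_list in am_output.asset_history.items():
        for index, hist_item in enumerate(val_hist_list):
            row = collections.OrderedDict(
                [("computer_name", key_host), ("id", index), ("hist_item", hist_item)])
            asset_hist.append(row)
finally:
    gc.enable()  # restore normal collection afterwards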

Given your particular use case, I'd strongly recommend avoiding both dict and OrderedDict. The overhead of even dict, even the cheaper form on Python 3.6, is kind of extreme when you have a set of exactly three fixed keys over and over. Instead, use collections.namedtuple, which is designed for lightweight objects referenceable by either name or index: instances act like regular tuples, but also allow accessing each value as a named attribute. That would dramatically reduce the memory cost of your program (and likely speed it up even when memory is not an issue).

For example:

from collections import namedtuple

ComputerInfo = namedtuple('ComputerInfo', ['computer_name', 'id', 'hist_item'])

asset_hist = []
for key_host, val_hist_list in am_output.asset_history.items():
    for index, hist_item in enumerate(val_hist_list):
        asset_hist.append(ComputerInfo(key_host, index, hist_item))

The only difference in use is that you replace row['computer_name'] with row.computer_name, or, if you need all the values, you can unpack it like a regular tuple, e.g. comphost, idx, hist = row. If you temporarily need a true OrderedDict (don't store them for everything), you can call row._asdict() to get an OrderedDict with the same mapping as the namedtuple, but that's not normally needed.

The memory savings are meaningful; on my system, the three-element namedtuple drops the per-item overhead to 72 bytes, less than a third that of even the 3.6 dict and less than a sixth of a 3.6 OrderedDict (a three-element namedtuple remains 72 bytes on 3.5, where dict/OrderedDict are larger pre-3.6). It may save even more than that: tuples (and namedtuple by extension) are allocated as a single contiguous C struct, while dict and company are at least two allocations (one for the object structure, one or more for the dynamically resizable parts of the structure), each of which may pay allocator overhead and alignment costs.
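
A quick illustration of those access patterns and _asdict() (using the ComputerInfo type defined above):

row = ComputerInfo("host1", 0, "some history string")

row.computer_name          # attribute access: 'host1'
row[2]                     # index access, like any tuple: 'some history string'
comphost, idx, hist = row  # tuple unpacking
row._asdict()              # OrderedDict([('computer_name', 'host1'), ...])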

Either way, for your four-million-row scenario, using namedtuple would mean paying (beyond the cost of the values) overhead of around 275 MB total, vs. 915 MB (3.6) to 1100 MB (3.5) for dict and 1770 MB (3.6) to 1950 MB (3.5) for OrderedDict. When you're talking about an 8 GB system, shaving 1.5 GB off your overhead is a major improvement.
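
(Those totals are just the per-item getsizeof results multiplied by four million; e.g. with the 3.6 x64 sizes from above:)

>>> for name, per_item in [("namedtuple", 72), ("dict", 240), ("OrderedDict", 464)]:
...     print(name, round(per_item * 4000000 / 2**20, 1), "MiB")
...
namedtuple 274.7 MiB
dict 915.5 MiB
OrderedDict 1770.0 MiB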

answered Nov 15 '22 by ShadowRanger