
when to commit data in ZODB

Tags:

python

zodb

I am trying to handle the data generated by the following piece of code:

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = SomeOperation(Gvalue, Hvalue)   # placeholder scoring function
        dic_score.setdefault(Gnodes, []).append([Hnodes, score, -1])

Since the dictionary is large (10,000 keys, each mapping to 10,000 three-element lists, i.e. about 10^8 small lists), it is difficult to keep it in memory. I was looking for a solution that stores each key:value (in the form of a list) pair as soon as it is generated. It was advised here, Writing and reading a dictionary in specific format (Python), to use ZODB in combination with a BTree.

Bear with me if this is too naive. My question is: when should one call transaction.commit() to commit the data? If I call it at the end of the inner loop, the resulting file is extremely large (I am not sure why). Here is a snippet:

from ZODB.FileStorage import FileStorage
from ZODB import DB
from BTrees.IOBTree import IOBTree
from persistent.list import PersistentList
import transaction

storage = FileStorage('Data.fs')
db = DB(storage)              # was DB(store): a NameError in the original
connection = db.open()
root = connection.root()
btree_container = IOBTree()   # instantiate the BTree, not just name the class
root[0] = btree_container
for nodes in G.nodes():
    btree_container[nodes] = PersistentList()  # I was losing data prior to doing this

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = SomeOperation(Gvalue, Hvalue)   # placeholder scoring function
        btree_container.setdefault(Gnodes, []).append([Hnodes, score, -1])
        transaction.commit()

What if I call it outside both the loops? Something like:

    ......
       ......
          score = SomeOperation(Gvalue, Hvalue)
          btree_container.setdefault(Gnodes, []).append([Hnodes, score, -1])
    transaction.commit()

Will all the data be held in memory till I call transaction.commit()? Again, I am not sure why, but this results in a smaller file size on disk.

I want to minimize the data being held in memory. Any guidance would be appreciated!

asked Jun 28 '12 by R.Bahl


2 Answers

Your goal is to make your process manageable within memory constraints. To do this with the ZODB as a tool, you need to understand how ZODB transactions work and how to use them.

Why your ZODB grows so large

First of all you need to understand what a transaction commit does here, which also explains why your Data.fs is getting so large.

The ZODB writes data out per transaction, where any persistent object that has changed gets written to disk. The important detail here is persistent object that has changed; the ZODB works in units of persistent objects.

Not every Python value is a persistent object. If I define a plain Python class, it will not be persistent, nor are any of the built-in Python types such as int or list. On the other hand, any class you define that inherits from persistent.Persistent is a persistent object. The BTrees set of classes, as well as the PersistentList class you use in your code, do inherit from Persistent.
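
To make the distinction concrete, here is a minimal sketch (the Node class is hypothetical, invented purely for illustration): mutating a plain built-in list inside a persistent object goes unnoticed by the ZODB unless you flag the change by hand or use a persistent container type instead.

import persistent
from persistent.list import PersistentList

class Node(persistent.Persistent):
    def __init__(self):
        self.plain = []                   # built-in list: mutations not tracked
        self.tracked = PersistentList()   # persistent list: mutations tracked

node = Node()
node.plain.append(1)     # the ZODB does not see this change...
node._p_changed = True   # ...unless you set the changed flag yourself
node.tracked.append(1)   # tracked automatically, no flag needed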

Now, on a transaction commit, any persistent object that has changed is written to disk as part of that transaction. So any PersistentList object that has been appended to will be written in its entirety to disk. BTrees handle this a little more efficiently; they store Buckets, themselves persistent, which in turn hold the actually stored objects. So for every few new nodes you create, a Bucket is written to the transaction, not the whole BTree structure. Note that because the items held in the tree are themselves persistent objects, only references to them are stored in the Bucket records.

Now, the ZODB writes transaction data by appending it to the Data.fs file, and it does not remove old data automatically. It can construct the current state of the database by finding the most recent version of a given object in the store. This is why your Data.fs is growing so much: you are writing out new versions of larger and larger PersistentList instances as transactions are committed.

Removing the old data is called packing, which is similar to the VACUUM command in PostgreSQL and other relational databases. Simply call the .pack() method on the db variable to remove all old revisions, or use the t and days parameters of that method to set limits on how much history to retain: t is a time.time() timestamp (seconds since the epoch) before which you can pack, and days is the number of days of history to retain before the current time, or before t if specified. Packing should reduce your data file considerably, as the partial lists in older transactions are removed. Do note that packing is an expensive operation and can therefore take a while, depending on the size of your dataset.
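
As a quick illustration, a hedged sketch of what packing looks like with the db object from the question's setup code:

import time

db.pack()                      # remove every superseded object revision
db.pack(days=1)                # keep one day of history
db.pack(t=time.time() - 3600)  # keep only the last hour of history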

Using transaction to manage memory

You are trying to build a very large dataset by using persistence to work around memory constraints, and are using transactions to try and flush things to disk. Normally, however, a transaction commit signals that you have finished constructing your dataset, something you can use as one atomic whole.

What you need to use here is a savepoint. Savepoints are essentially subtransactions, a point during the whole transaction where you can ask for data to be temporarily stored on disk. They'll be made permanent when you commit the transaction. To create a savepoint, call the .savepoint method on the transaction:

for Gnodes in G.nodes():      # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():  # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = SomeOperation(Gvalue, Hvalue)   # placeholder scoring function
        btree_container.setdefault(Gnodes, PersistentList()).append(
            [Hnodes, score, -1])
    transaction.savepoint(True)
transaction.commit()

In the above example I set the optimistic flag to True, meaning: I do not intend to roll back to this savepoint. Some storages do not support rolling back, and signalling that you do not need this makes your code work in such situations.

Also note that the transaction.commit() happens when the whole data set has been processed, which is what a commit is supposed to achieve.

One thing a savepoint does is trigger a garbage collection of the ZODB caches, which means that any data not currently in use is removed from memory.

Note the 'not currently in use' part there; if any of your code holds on to large values in a variable, that data cannot be cleared from memory. As far as I can determine from the code you've shown us, this looks fine. But I do not know how your operations work or how you generate the nodes; be careful to avoid building complete lists in memory where an iterator would do, or building large dictionaries in which all your lists of lists are referenced, for example.
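
If you want more direct control, the ZODB connection object also exposes explicit cache-management calls; a small hedged sketch, using the connection variable from the question's setup code:

transaction.savepoint(True)   # flush pending changes; triggers a cache GC
connection.cacheGC()          # incremental collection of unused cached objects
connection.cacheMinimize()    # more aggressive: evict everything not in active use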

You can experiment a little with where you create your savepoints; you could create one every time you've processed one Hnodes, or only when done with a Gnodes loop, as I've done above. You are constructing a list per Gnodes, so it would be kept in memory while looping over all of H.nodes() anyway, and flushing to disk probably only makes sense once you've completed constructing it in full.

If, however, you find that you need to clear memory more often, you should consider using either a BTrees.OOBTree.TreeSet class or a BTrees.IOBTree.BTree class instead of a PersistentList, to break up your data into more persistent objects. A TreeSet is ordered but not (easily) indexable, while a BTree can be used as a list with simple incrementing integer keys:

for i, Hnodes in enumerate(H.nodes()):
    ...
    btree_container.setdefault(Gnodes, IOBTree())[i] = [Hnodes, score, -1]
    if i % 100 == 0:
        transaction.savepoint(True)

The above code uses a BTree instead of a PersistentList and creates a savepoint every 100 Hnodes processed. Because the BTree uses buckets, which are persistent objects in themselves, the structure can be flushed to a savepoint piecemeal, without the whole thing having to stay in memory until all of H.nodes() has been processed.
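
Putting the pieces together, a minimal end-to-end sketch; the someoperation/SomeOperation calls are the question's placeholders, and the savepoint interval of 100 is an arbitrary choice to tune:

import transaction
from BTrees.IOBTree import IOBTree

for Gnodes in G.nodes():
    Gvalue = someoperation(Gnodes)
    per_g = btree_container.setdefault(Gnodes, IOBTree())
    for i, Hnodes in enumerate(H.nodes()):
        Hvalue = someoperation(Hnodes)
        per_g[i] = [Hnodes, SomeOperation(Gvalue, Hvalue), -1]
        if i % 100 == 0:
            transaction.savepoint(True)  # flush partial work to disk
transaction.commit()   # the dataset is complete: make it permanent
db.pack()              # optional: drop superseded revisions from Data.fs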

answered by Martijn Pieters


What constitutes a transaction depends on what needs to be 'atomic' in your application. If the transaction fails, it will be rolled back to its previous state (just after the last commit). It appears from your application code that you want to calculate a value for each Gnodes, so your commit can go at the end of the Gnodes loop, like this:

for Gnodes in G.nodes():       # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes in H.nodes():   # Hnodes iterates over 10000 values
        Hvalue = someoperation(Hnodes)
        score = SomeOperation(Gvalue, Hvalue)   # placeholder scoring function
        btree_container.setdefault(Gnodes, []).append([Hnodes, score, -1])
    # once we calculate the value for a Gnodes, commit
    transaction.commit()

It appears from your code that the "Hvalue" computation does not depend on Gvalue or Gnodes. If it is an expensive operation, you are calculating it 10,000 times (once per Gnodes) even though Gnodes does not affect its result. So I would move it out of the loop; the resulting mapping of 10,000 precomputed values is small compared to the full score matrix.

# Hnodes iterates over 10000 values
hvals = dict((Hnodes, someoperation(Hnodes)) for Hnodes in H.nodes())
# now you have a mapping of Hnodes to Hvalues

for Gnodes in G.nodes():       # Gnodes iterates over 10000 values
    Gvalue = someoperation(Gnodes)
    for Hnodes, Hvalue in hvals.iteritems():
        score = SomeOperation(Gvalue, Hvalue)   # placeholder scoring function
        btree_container.setdefault(Gnodes, []).append([Hnodes, score, -1])
    # once we calculate the value for a given Gnodes, commit
    transaction.commit()

answered by Salil