Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How optimize adding new nodes in `django-mptt`?

I am creating a script which will synchronize two databases. There is a data in the database which should be stored as a tree so I use django-mptt for the new DB. When I syncing DB's I select new data from the old DB and should save it in the new one.

I want to know if there is a better way to add new nodes into a tree? Now it looks next way:

...
# Add new data to DB
for new_record in new_records:
    # Find appropriate parent using data in 'new_record'
    parent = get_parent(new_record)

    # Create object which should be added using data in 'new_record'
    new_node = MyMPTTModel(...)
    new_node.insert_at(parent, save = True)
    # Similar to:
    # new_node.insert_at(parent, save = False)
    # new_node.save()

But it works very slow. I think it works in a such way because after each call of the insert_at(..., save = True) method django-mptt should write new node to the DB and modify left and right keys for records which are already in the DB.

Is there any way to make django-mptt modify a query each time I call insert_at and then apply all changes together when I call save? Or do you know any other ways how to reduce execution time?

Thanks in advance.

like image 571
idea.list Avatar asked Sep 30 '12 13:09

idea.list


2 Answers

Firstly, don't use insert_at. It's not the reason for slow performance, but it's unnecessary and looks ugly. Just set node.parent:

for new_record in new_records:
    new_node = MyMPTTModel(..., parent=get_parent(new_record))
    new_node.save()

Now for the performance question. If you're using the latest mptt (git master, not 0.5.4), there's a context manager called delay_mptt_updates to prevent mptt from doing a lot of these updates until you've added all the nodes:

with transaction.atomic():
    with MyMPTTModel.objects.delay_mptt_updates():
        for new_record in new_records:
            new_node = MyMPTTModel(..., parent=get_parent(new_record))
            new_node.save()

Alternatively if you're touching almost the entire tree, you can speed things up even more by using disable_mptt_updates and rebuild the whole tree at the end:

with transaction.atomic():
    with MyMPTTModel.objects.disable_mptt_updates():
        for new_record in new_records:
            new_node = MyMPTTModel(..., parent=get_parent(new_record))
            new_node.save()
    MyMPTTModel.objects.rebuild()
like image 147
craigds Avatar answered Sep 22 '22 20:09

craigds


Django-MPTT maintains a tree structure for you. So at each insert_at, it will modify all the nodes to the right of the inserted one - which is why you are experiencing performance issues.

One way is to manually construct the tree structure without django-mptt.

So you will have to take the new records, and according to them, figure out how old nodes in the tree have to be modified. Since you are only inserting data, only left and right attributes will change, but not level, so that should make it a little easier. Once you will know what nodes to modify, you can then modify them using one update transaction (edit).

Then, you can start to insert new data. Again, the fastest way is to calculate the left, right, and level values for each new entry and then do one bulk_insert (Django>=1.4). Doing this will result in only two db operations which obviously should be much faster in terms of db transactions.

This method however will require some smart way to figure out how to change old nodes in the tree. The easiest way is to dump all of the tree into python structure, and then figure out the changes on that structure. That however will not be feasible if your tree is very large due to memory limitation.

Right now not sure if there is any more efficient way of doing this. Maybe somebody else on StackOverflow has some cool ideas...

EDIT

Sorry about the update confusion. I meant one transaction. In cases like this I usually do raw sql query where I do update tbname set ... where id=1; update tbname set ... where id=2; So I do multiple sql statements in one sql query. From my experience, the expensive part of the db is not execution of a statement, but the transaction itself since there is network latency, db locks, etc. So having one transaction allows the db to be fast as possible. Not sure however how to do that in django using querysets. I usually do raw sql query.

like image 20
miki725 Avatar answered Sep 26 '22 20:09

miki725