 

How to handle Python multiprocessing database concurrency, specifically with Django?

So, I'm trying to write an application that uses Django as its ORM, since it'll need both some behind-the-scenes processing and an easy-to-use front-end. Its core functionality is processing data that's in the database in a high-CPU process (basically Monte Carlo simulations), and I want to implement multiprocessing, specifically using Pool (I get 4 processes). Basically my code runs like this, with about 20 children of the parent:

# assorted import statements to get the Django environment in the script
from multiprocessing import Pool
from random import random
from time import sleep

def test(child):
    x = []
    print(child.id)
    for i in range(100):
        print(child.id, i)
        x.append(child.parent.id)  # just to hit the DB
    return x

if __name__ == '__main__':
    parent = Parent.objects.get(id=1)
    pool = Pool()
    results = pool.map(test, parent.children.all())
    pool.close()
    pool.join()
    print(results)

With the code as such, I get intermittent DatabaseErrors or PicklingErrors. The former are usually of the form "malformed database" or "lost connection to MySQL server"; the latter are usually "cannot pickle model.DoesNotExist". They are random, occur in any process, and of course there is nothing wrong with the DB itself. If I set pool = Pool(processes=1), it runs in a single process just fine. I also threw in various print statements to make sure that most of the tasks are actually running.

I also tried changing test to:

def test(child):
    x = []
    sleep(random())  # pause for up to a second before hitting the DB
    for i in range(100):
        x.append(child.parent.id)
    return x

This just makes each task pause for less than a second before running, and it makes everything fine. If I get the random interval down to about 500 ms, it starts acting up again. So, probably a concurrency problem, right? But with only 4 processes hitting the database. My question is: how do I solve this without making large dumps of the data ahead of time? I have tested it with both SQLite and MySQL, and both have trouble with this.

asked Aug 15 '13 by wdahab



1 Answer

Okay, so I determined (with the help of a friend) that the issue is that Django is using the same database connection for all the processes. Normally, concurrent DB requests are either in the same thread (in which case the GIL kicks in) or in separate threads, in which case Django makes a separate database connection per thread. But with multiprocessing, the subprocesses start as copies of the parent, so each one inherits the same open database connection, and then they step on each other until it breaks.
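To make the sharing visible, here's a minimal sketch of my own (not from the original post). It assumes an already-configured Django project and a fork start method (the Linux default); show_connection is just a hypothetical helper name, and db.connection.connection is the raw DB-API connection Django holds open. Every worker reports the same inherited connection object as the parent:

# Assumes DJANGO_SETTINGS_MODULE is set and django.setup() has been called,
# and that multiprocessing uses fork (the Linux default start method).
from multiprocessing import Pool
from django import db

def show_connection(_):
    # A forked worker starts with the parent's already-open raw DB-API
    # connection, so every worker reports the same underlying object.
    return repr(db.connection.connection)

if __name__ == '__main__':
    db.connection.ensure_connection()       # parent opens its connection
    print(repr(db.connection.connection))   # e.g. <sqlite3.Connection object at 0x...>
    pool = Pool(4)
    print(pool.map(show_connection, range(4)))  # same repr from all four workers
    pool.close()
    pool.join()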

The solution is to force each subprocess to open a new DB connection, which is relatively quick: close the inherited connection at the start of the subprocess, and Django transparently opens a fresh one on the next query.

from django import db
...
def sub_process():
    # Drop the connection inherited from the parent; Django will open a
    # fresh one for this process on the next query.
    # Note: db.close_connection() was removed in Django 1.8; there the
    # equivalent is db.connections.close_all().
    db.close_connection()
    # the rest of the sub_process's routines

# code that calls sub_process with the pool

I went back and forth between having that db.close_connection() line and not having it, and it definitely fixes everything.
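For a fuller picture, here's a hedged sketch of wiring this into the original Pool example via an initializer, so every worker drops the inherited connection before its first query. The myproject.settings module and the myapp.models.Parent import are placeholders for whatever the project actually defines, and as noted above, on Django 1.8+ db.close_connection() is gone and db.connections.close_all() is the equivalent call:

# Sketch only: myproject.settings and myapp.models.Parent are placeholders.
import os
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')

import django
django.setup()  # required on Django 1.7+ before touching models

from multiprocessing import Pool
from django import db
from myapp.models import Parent

def init_worker():
    # Runs once in each worker: drop the connection inherited from the
    # parent so the worker's first query opens its own fresh one.
    db.connections.close_all()  # on Django < 1.8: db.close_connection()

def test(child):
    return [child.parent.id for _ in range(100)]  # just to hit the DB

if __name__ == '__main__':
    parent = Parent.objects.get(id=1)
    children = list(parent.children.all())  # evaluate the queryset before forking
    db.connections.close_all()  # also close the parent's connection pre-fork
    pool = Pool(initializer=init_worker)
    results = pool.map(test, children)
    pool.close()
    pool.join()
    print(results)

Doing the close in an initializer rather than at the top of test just means it happens once per worker instead of once per task; either placement works.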

answered Oct 22 '22 by wdahab