
Django multiprocessing and database connections

Background:

I'm working on a project which uses Django with a Postgres database. We're also using mod_wsgi, in case that matters, since some of my web searches have made mention of it. On web form submit, the Django view kicks off a job that will take a substantial amount of time (more than the user would want to wait), so we kick off the job via a system call in the background. The job that is now running needs to be able to read and write to the database. Because this job takes so long, we use multiprocessing to run parts of it in parallel.
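Roughly, the view does something like this (an illustrative sketch only; the view name, management command, and URL name are made up, not our real code):

# views.py -- illustrative sketch of the setup described above
import subprocess

from django.shortcuts import redirect

def submit_job(request):
    if request.method == 'POST':
        # fire and forget: run the long job in a separate background process
        subprocess.Popen(['python', 'manage.py', 'run_long_job'])
        return redirect('job_submitted')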

Problem:

The top level script has a database connection, and when it spawns off child processes, it seems that the parent's connection is available to the children. Then there's an exception about how SET TRANSACTION ISOLATION LEVEL must be called before a query. Research has indicated that this is due to trying to use the same database connection in multiple processes. One thread I found suggested calling connection.close() at the start of the child processes so that Django will automatically create a new connection when it needs one, and therefore each child process will have a unique connection - i.e. not shared. This didn't work for me, as calling connection.close() in the child process caused the parent process to complain that the connection was lost.
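Stripped down, the failing pattern looks something like this (a sketch; SomeModel stands in for one of our real models):

from multiprocessing import Process

from myapp.models import SomeModel  # placeholder model

def child():
    # the child inherits the parent's open Postgres connection via fork,
    # so this query runs over a socket shared with the parent and
    # eventually fails with the SET TRANSACTION ISOLATION LEVEL error
    print(SomeModel.objects.count())

SomeModel.objects.count()      # parent opens its connection
Process(target=child).start()  # child forks with that connection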

Other Findings:

Some things I read seemed to indicate that you can't really do this, and that multiprocessing, mod_wsgi, and Django don't play well together. I find that hard to believe.

Some suggested using Celery, which might be a long-term solution, but I am unable to get Celery installed at this time, pending some approval processes, so it's not an option right now.

Found several references on SO and elsewhere about persistent database connections, which I believe to be a different problem.

Also found references to psycopg2.pool, pgpool, and PgBouncer. Admittedly, I didn't understand most of what I was reading on those, but it certainly didn't jump out at me as being what I was looking for.

Current "Work-Around":

For now, I've reverted to just running things serially, and it works, but is slower than I'd like.

Any suggestions as to how I can use multiprocessing to run in parallel? Seems like if I could have the parent and two children all have independent connections to the database, things would be ok, but I can't seem to get that behavior.

Thanks, and sorry for the length!

asked Nov 23 '11 by daroo



1 Answer

Multiprocessing copies connection objects between processes because it forks processes, and therefore copies all of the parent process's file descriptors. A connection to the SQL server is just a file descriptor; on Linux you can see it under /proc/<PID>/fd/. Any open file descriptor is shared between forked processes.
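You can see this sharing without Django at all. For example, with a plain TCP socket (5432 here stands in for a local Postgres; any reachable host/port would do), the forked child holds the same underlying socket as the parent:

import os
import socket

s = socket.create_connection(('localhost', 5432))

pid = os.fork()
if pid == 0:
    # child: same descriptor number, same underlying socket as the parent
    print('child  fd:', s.fileno())
    os._exit(0)
os.waitpid(pid, 0)
print('parent fd:', s.fileno())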

My solution was to simply close the database connections just before launching the processes; each process will then recreate a connection for itself when it needs one (tested in Django 1.4):

from multiprocessing import Process
from django import db

def db_worker():
    some_parallel_code()  # the work that reads/writes the database

# close inherited connections before forking;
# Django will open a fresh one in each process on first use
db.connections.close_all()

Process(target=db_worker, args=()).start()
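One caveat worth noting: connections.close_all() was only added in Django 1.8; on older versions the equivalent call is, to my knowledge, django.db.close_connection(). If you need to support both, something like this should work:

from django import db

try:
    db.connections.close_all()  # Django 1.8+
except AttributeError:
    db.close_connection()       # older Django versions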

PgBouncer/pgpool is not related to threads in the multiprocessing sense. It is rather a solution for not closing the connection on each request, i.e. for speeding up connections to Postgres under high load.

Update:

To completely remove problems with the database connection, simply move all database-related logic into db_worker. I originally wanted to pass a QuerySet as an argument; a better idea is to simply pass a list of ids. See values_list('id', flat=True), and do not forget to cast it to a list (list(queryset)) before passing it to db_worker. That way we do not copy the model's database connection.

from multiprocessing import Process
from django import db

def db_worker(model_ids):
    # here you do Model.objects.filter(id__in=model_ids)
    obj = PartModelWorkerClass(model_ids)
    obj.run()

model_ids = Model.objects.all().values_list('id', flat=True)
model_ids = list(model_ids)  # cast to a plain list
process_count = 5
delta = (len(model_ids) // process_count) + 1  # chunk size per worker

# do all the db stuff here ...

# then close the db connections before forking
db.connections.close_all()

for it in range(process_count):
    Process(target=db_worker, args=(model_ids[it * delta:(it + 1) * delta],)).start()
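If the parent needs to wait for the workers to finish, keep references to the Process objects and join them, e.g.:

processes = []
for it in range(process_count):
    p = Process(target=db_worker, args=(model_ids[it * delta:(it + 1) * delta],))
    p.start()
    processes.append(p)

for p in processes:
    p.join()  # block until every worker is done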
answered Sep 24 '22 by lechup