Django long running asynchronous tasks with threads/processing

Tags:

asynchronous

django

mod-wsgi

Disclaimer: I know there are several similar questions on SO. I think I've read most if not all of them, but I did not find an answer to my real question (see below). I also know that Celery or another asynchronous queue system is the best way to run long-running tasks, or at least a cron-managed script. There is also the mod_wsgi documentation about processes and threads, but I'm not sure I understood all of it.

The question is:

What are the exact risks/issues involved in using the solutions listed below? Is any of them viable for long-running tasks (even though Celery is better suited)? My question is really more about understanding the internals of WSGI and Python/Django than about finding the best overall solution: issues with blocking threads, unsafe access to variables, zombie processes, etc.

Let's say:

1. My "long_process" is doing something really safe. Even if it fails, I don't care.
2. Python >= 2.6
3. I'm using mod_wsgi with Apache in daemon mode (will anything change with uWSGI or gunicorn?)

mod_wsgi conf:

    WSGIDaemonProcess NAME user=www-data group=www-data threads=25
    WSGIScriptAlias / /path/to/wsgi.py
    WSGIProcessGroup %{ENV:VHOST}

I figured that these are the options available to launch separate processes (meant in a broad sense) to carry out a long-running task while quickly returning a response to the user:

os.fork

    import os

    if os.fork() == 0:
        long_process()
    else:
        return HttpResponse()

subprocess

    import subprocess
    import sys

    p = subprocess.Popen([sys.executable, '/path/to/script.py'],
                         stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)

(where the script is likely to be a manage.py command)

threads

    import threading

    t = threading.Thread(target=long_process,
                         args=args,
                         kwargs=kwargs)
    t.setDaemon(True)
    t.start()
    return HttpResponse()

NB.

Due to the Global Interpreter Lock, in CPython only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

The main thread will quickly return (the HttpResponse). Will the spawned long-running thread block WSGI from handling other requests?

multiprocessing

    from multiprocessing import Process

    p = Process(target=_bulk_action, args=(action, objs))
    p.start()
    return HttpResponse()

This should solve the thread concurrency issue, shouldn't it?

So those are the options I could think of. What would work and what not, and why?


asked Nov 09 '11 17:11

Stefano

2 Answers

os.fork

A fork clones the parent process, which in this case is your whole Django stack. Since you merely want to run a separate Python script, this seems like an unnecessary amount of bloat.
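
As a rough sketch of what a fork-based approach has to take care of, assuming a POSIX system: the child is a copy of the whole parent, so it should leave through os._exit() instead of returning into the request-handling code, and the parent has to reap it. The helper name run_in_forked_child and the marker-file task are made up for illustration and stand in for long_process().

```python
import os
import tempfile


def run_in_forked_child(task, *args):
    """Run task(*args) in a forked child; return the child's PID to the parent."""
    pid = os.fork()
    if pid == 0:
        # Child: a copy of the whole parent process (here, the Django stack).
        status = 0
        try:
            task(*args)
        except Exception:
            status = 1
        finally:
            # Exit immediately so the child never returns into the
            # parent's request-handling code.
            os._exit(status)
    return pid


def write_marker(path):
    # Hypothetical stand-in for long_process().
    with open(path, "w") as handle:
        handle.write("done")


marker_path = os.path.join(tempfile.mkdtemp(), "marker.txt")
child_pid = run_in_forked_child(write_marker, marker_path)

# The parent must eventually reap the child, otherwise it lingers as a zombie.
_, wait_status = os.waitpid(child_pid, 0)
exit_code = os.WEXITSTATUS(wait_status)
```

The blocking os.waitpid() call is only there to make the sketch checkable. A view that wants to return immediately would need some other way of reaping the child, which is part of why a bare fork is awkward inside a web process.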

subprocess

subprocess is geared towards interactive use. In other words, while you can use it to spawn off a process, you are expected to wait for it or terminate it at some point. It's possible Python will clean up for you if you leave one running, but my guess is that this will leak resources: a child that has exited but was never waited on stays around as a zombie process.
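
A minimal sketch of the bookkeeping this implies, assuming Python 3 (subprocess.DEVNULL does not exist in 2.x). The inline child_code is a stand-in for '/path/to/script.py' so that the example runs on its own.

```python
import subprocess
import sys

# Inline stand-in for '/path/to/script.py' so the sketch is self-contained.
child_code = "print('working')"

proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    # Discard output: an unread PIPE can fill up and stall the child.
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

# The view could return here. Something still has to call poll() or wait()
# later, otherwise the finished child stays around as a zombie entry.
return_code = proc.wait()
```

Calling wait() right away defeats the purpose in a real view, of course. It is shown only to mark where the reaping has to happen eventually.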

threading

Threads are defined units of logic. They start when their start() method is called, which runs run() in the new thread, and they terminate when run() returns. This makes them well suited to creating a branch of logic that will run outside the current scope. However, as you mentioned, they are subject to the Global Interpreter Lock.
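
A small sketch of that lifecycle, with a made-up view() function and a trivial long_process() standing in for the real ones. The caller gets its return value without waiting for the worker.

```python
import threading

finished = threading.Event()
results = []


def long_process(value):
    # Hypothetical stand-in for the real long-running work.
    results.append(value * 2)
    finished.set()


def view():
    worker = threading.Thread(target=long_process, args=(21,))
    worker.daemon = True  # do not keep the process alive just for this thread
    worker.start()        # start() arranges for run() to execute in the new thread
    return "response"     # returned without waiting for the worker


response = view()
# Waiting on the event is only here so the sketch can check the result.
completed = finished.wait(timeout=5)
```

Note that a daemon thread is stopped abruptly when its process exits, so work in progress can be lost if the server recycles the process.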

multiprocessing

This module allows you to spawn processes, and it has an API similar to that of threading. You could say it is like threads on steroids. These processes are not subject to the Global Interpreter Lock, and they can take advantage of multi-core architectures. However, they are more complicated to work with as a result.
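
One source of that extra complexity is that a child process works on copies of its arguments, so results have to be sent back explicitly. A minimal sketch, assuming a POSIX system and using a toy _bulk_action (the "upper" action and the queue parameter are invented for illustration):

```python
import multiprocessing

# Assumes a POSIX system; the "fork" start method is chosen explicitly so the
# sketch does not depend on the platform default.
ctx = multiprocessing.get_context("fork")


def _bulk_action(action, objs, out_queue):
    # Hypothetical stand-in: apply the named action to each object.
    if action == "upper":
        out_queue.put([obj.upper() for obj in objs])


out_queue = ctx.Queue()
objs = ["a", "b"]
proc = ctx.Process(target=_bulk_action, args=("upper", objs, out_queue))
proc.start()

processed = out_queue.get(timeout=10)  # result computed in the child process
proc.join()                            # reap the child once it has finished
exit_code = proc.exitcode
```

The parent's objs list is untouched by the child. With threads, the worker could have modified it in place.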

So, your choices really come down to threads or processes. If you can get by with a thread and it makes sense for your application, go with a thread. Otherwise, use processes.

answered Sep 21 '22 01:09

Chris Pratt

I have found that using uWSGI Decorators is much simpler than using Celery if you just need to run some long task in the background. I think Celery is the best solution for a serious, heavy project, but it is overhead for doing something simple.

To start using uWSGI Decorators, you just need to update your uWSGI config with:

    <spooler-processes>1</spooler-processes>
    <spooler>/here/the/path/to/dir</spooler>

Then write code like:

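    import uwsgi
    from uwsgidecorators import spoolraw

    @spoolraw
    def long_task(arguments):
        try:
            pass  # do something with arguments['myarg']
        except Exception as e:
            pass  # ...something...
        return uwsgi.SPOOL_OK

    def myView(request):
        # someVar is whatever value the view wants to hand to the task
        long_task.spool({'myarg': str(someVar)})
        return render_to_response('done.html')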

Then, when the view is called, the following appears in the uWSGI log:

[spooler] written 208 bytes to file /here/the/path/to/dir/uwsgi_spoolfile_on_hostname_31139_2_0_1359694428_441414

and when the task has finished:

[spooler /here/the/path/to/dir pid: 31138] done with task uwsgi_spoolfile_on_hostname_31139_2_0_1359694428_441414 after 78 seconds

There are some restrictions that seem strange to me:

    - spool can only receive a dictionary of strings as its argument, apparently because it is serialized to a file as strings.
    - the spooled function has to be set up at startup, so the "spooled" code should live in a separate file that is declared in the uWSGI config as <import>pyFileWithSpooledCode</import>
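
One possible way around the strings-only restriction is to JSON-encode each value before spooling and decode it again inside the task. This is only a sketch: encode_spool_args and decode_spool_args are made-up helper names, not part of uWSGI, and depending on the uWSGI and Python versions the spooled function may receive bytes that need decoding first.

```python
import json


def encode_spool_args(**kwargs):
    """Pack arbitrary JSON-serializable values into a dict of strings."""
    return {key: json.dumps(value) for key, value in kwargs.items()}


def decode_spool_args(arguments):
    """Reverse encode_spool_args inside the spooled function."""
    return {key: json.loads(value) for key, value in arguments.items()}


payload = encode_spool_args(myarg=[1, 2, 3], user_id=42)
restored = decode_spool_args(payload)
```

The view would then pass the encoded payload to long_task.spool(), and the task would call decode_spool_args(arguments) before doing its work.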

answered Sep 19 '22 01:09

Oleg Neumyvakin
