I am just upgrading an older project to Python 3.6, and found out that there are these cool new async / await keywords.
My project contains a web crawler that is not very performant at the moment and takes about 7 minutes to complete. Since I already have Django REST framework in place to access data of my Django application, I thought it would be nice to have a REST endpoint from which I could start the crawler remotely with a simple POST request.
However, I don't want the client to wait synchronously for the crawler to complete. I just want to send back a message right away that the crawler has been started, and run the crawler in the background.
from rest_framework import status
from rest_framework.decorators import api_view
from rest_framework.response import Response
from django.conf import settings

from mycrawler import tasks


async def update_all_async(deep_crawl=True, season=settings.CURRENT_SEASON, log_to_db=True):
    await tasks.update_all(deep_crawl, season, log_to_db)


@api_view(['POST', 'GET'])
def start(request):
    """
    Start crawling.
    """
    if request.method == 'POST':
        print("Crawler: start {}".format(request))
        deep = request.data.get('deep', False)
        season = request.data.get('season', settings.CURRENT_SEASON)
        # this should be called async
        update_all_async(season=season, deep_crawl=deep)
        return Response({"success": "crawler started"}, status=status.HTTP_200_OK)
    else:
        return Response({"description": "Start the crawler by calling this endpoint via POST.",
                         "allowed_parameters": {
                             "deep": "boolean",
                             "season": "number"
                         }}, status.HTTP_200_OK)
I have read some tutorials, including how event loops work, but I don't really get it... Where should I start the loop in this case?
[EDIT] 20/10/2017:
I solved it using threading for now, since it really is a "fire and forget" task. However, I still would like to know how to achieve the same thing using async / await.
Here's my current solution:
import threading


@api_view(['POST', 'GET'])
def start(request):
    ...
    t = threading.Thread(target=tasks.update_all, args=(deep, season))
    t.start()
    ...
Django has support for writing asynchronous (“async”) views, along with an entirely async-enabled request stack if you are running under ASGI. Async views will still work under WSGI, but with performance penalties, and without the ability to have efficient long-running requests.
An async function defines a coroutine. Inside it, the await keyword releases the flow of control back to the event loop. To run a coroutine, you need to schedule it on the event loop; once scheduled, the coroutine is wrapped in a Task, which is a kind of Future.
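As a toy illustration of that schedule-then-await cycle (`crawl()` here is just a placeholder coroutine, not the crawler from the question):

```python
import asyncio


async def crawl():
    # Stand-in for a long-running crawler coroutine.
    await asyncio.sleep(0.1)
    return "done"


async def main():
    # create_task() wraps the coroutine in a Task and schedules it;
    # it starts running as soon as the event loop regains control.
    task = asyncio.create_task(crawl())
    # The caller is free to do other work here...
    result = await task  # ...and can await the Task (a Future) later.
    print(result)


asyncio.run(main())
```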
This is possible in Django 3.1+, after introducing asynchronous support.
Regarding the asynchronous running loop, you can make use of it by running Django with uvicorn or any other ASGI server, instead of gunicorn or other WSGI servers. The difference is that when using an ASGI server there's already a running loop, while with WSGI you would need to create one yourself. With ASGI, you can simply define async functions directly in views.py or in the inherited methods of its View classes.
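For example, a command-line sketch of running under ASGI (the module path `myproject.asgi` is a placeholder; substitute your own project's ASGI module):

```shell
# Install and run an ASGI server (uvicorn here; daphne or hypercorn also work)
pip install uvicorn
uvicorn myproject.asgi:application --host 0.0.0.0 --port 8000
```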
Assuming you go with ASGI, you have multiple ways of achieving this. I'll describe a couple (other options could make use of asyncio.Queue, for example):
By making start() async, you can make direct use of the existing running loop; and by using asyncio.Task, you can fire and forget into that loop. If you want to fire but remember, you can create another Task to follow up on the first one, i.e.:
from rest_framework import status
from rest_framework.decorators import api_view
from rest_framework.response import Response
from django.conf import settings

from mycrawler import tasks

import asyncio


async def update_all_async(deep_crawl=True, season=settings.CURRENT_SEASON, log_to_db=True):
    await tasks.update_all(deep_crawl, season, log_to_db)


async def follow_up_task(task: asyncio.Task):
    await asyncio.sleep(5)  # Or any other reasonable number, or a finite loop...
    if task.done():
        print('update_all task completed: {}'.format(task.result()))
    else:
        print('task not completed after 5 seconds, aborting')
        task.cancel()


@api_view(['POST', 'GET'])
async def start(request):
    """
    Start crawling.
    """
    if request.method == 'POST':
        print("Crawler: start {}".format(request))
        deep = request.data.get('deep', False)
        season = request.data.get('season', settings.CURRENT_SEASON)
        # Once the task is created, it begins running concurrently
        loop = asyncio.get_running_loop()
        task = loop.create_task(update_all_async(season=season, deep_crawl=deep))
        # Fire up a second task to track the previous one
        loop.create_task(follow_up_task(task))
        return Response({"success": "crawler started"}, status=status.HTTP_200_OK)
    else:
        return Response({"description": "Start the crawler by calling this endpoint via POST.",
                         "allowed_parameters": {
                             "deep": "boolean",
                             "season": "number"
                         }}, status.HTTP_200_OK)
Sometimes you can't have an async function to route the request to in the first place, as happens with DRF (as of today). For this, Django provides some useful async adapter functions, but be aware that switching from a sync to an async context, or vice versa, comes with a small performance penalty of approximately 1 ms. Note that this time, the running loop is gathered in the update_all_async function instead:
from rest_framework import status
from rest_framework.decorators import api_view
from rest_framework.response import Response
from django.conf import settings

from mycrawler import tasks

import asyncio
from asgiref.sync import async_to_sync


@async_to_sync
async def update_all_async(deep_crawl=True, season=settings.CURRENT_SEASON, log_to_db=True):
    # We can use the running loop here in this use case
    loop = asyncio.get_running_loop()
    task = loop.create_task(tasks.update_all(deep_crawl, season, log_to_db))
    loop.create_task(follow_up_task(task))


async def follow_up_task(task: asyncio.Task):
    await asyncio.sleep(5)  # Or any other reasonable number, or a finite loop...
    if task.done():
        print('update_all task completed: {}'.format(task.result()))
    else:
        print('task not completed after 5 seconds, aborting')
        task.cancel()


@api_view(['POST', 'GET'])
def start(request):
    """
    Start crawling.
    """
    if request.method == 'POST':
        print("Crawler: start {}".format(request))
        deep = request.data.get('deep', False)
        season = request.data.get('season', settings.CURRENT_SEASON)
        # update_all_async is already wrapped with async_to_sync, so this is a plain sync call
        update_all_async(season=season, deep_crawl=deep)
        return Response({"success": "crawler started"}, status=status.HTTP_200_OK)
    else:
        return Response({"description": "Start the crawler by calling this endpoint via POST.",
                         "allowed_parameters": {
                             "deep": "boolean",
                             "season": "number"
                         }}, status.HTTP_200_OK)
In both cases, the function will quickly return the 200, but technically the 2nd option is slower.
IMPORTANT: When using Django, it is common to have DB operations involved in these async operations. DB operations in Django can only be synchronous, at least for now, so you will have to consider this in asynchronous contexts. sync_to_async() becomes very handy for these cases.