Our users are erratically getting CancelledError for any page in our system. The only pattern we’ve observed is that it happens more often on pages that take longer to load during normal operation. But it is absolutely not limited to such pages; it can happen anywhere in our system, e.g. the login page. None of the affected pages use any async code or channels; they’re standard Django views working in the request/response model (we migrated to ASGI only recently, we have only a single page that uses channels, and it works just fine). We cannot reproduce it consistently.
What we see in sentry.io:
CancelledError: null
File "channels/http.py", line 198, in __call__
await self.handle(scope, async_to_sync(send), body_stream)
File "asgiref/sync.py", line 435, in __call__
ret = await asyncio.wait_for(future, timeout=None)
File "asyncio/tasks.py", line 414, in wait_for
return await fut
Locally and in the Daphne logs it looks like this:
2022-10-12 20:00:00,000 WARNING Application instance <Task pending coro=<ProtocolTypeRouter.__call__() running at /home/deploy/.virtualenvs/…/lib/python3.7/site-packages/channels/routing.py:71> wait_for=<Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/lib/python3.7/asyncio/futures.py:348, <TaskWakeupMethWrapper object at 0x7f1adcbf9610>()]>> for connection <WebRequest at 0x7f1adcc6bb50 method=POST uri=/dajaxice/operations.views.calculate_cost_view/ clientproto=HTTP/1.0> took too long to shut down and was killed.
2022-10-12 20:00:00,000 WARNING Application timed out while sending response
From the user’s POV, the page simply fails to load and they have to re-click a button or refresh the page.
Libraries we use:
python = 3.7
Django = 2.2.12
channels = 3.0.5
channels-redis = 3.4.1
On the server we use Nginx, supervisor, and Daphne.
For all requests (HTTP and websockets) we use ASGI.
Our command for running daphne:
daphne -t 300 project.asgi:application
What we have already tried:
- Raising the timeout for Daphne (as you can see above).
- Upgrading the channels library from 3.0.4 to 3.0.5 (because we found info that asgiref 3.3.1, which is used by channels 3.0.4, could be the culprit for this issue: https://lightrun.com/answers/django-channels-warning---server---application-instance-took-too-long-to-shut-down-and-was-killed).
Any idea what causes this or how to troubleshoot it?
I had a similar issue before with almost the same tech stack, and it took us several days to fix.
At that time the cause was that the database server was out of resources. We used AWS RDS (MySQL), and the CPU usage was over 99% whenever we got the error.
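To make the link between a slow database and a CancelledError concrete, here is a minimal sketch using only the standard library. It is not Daphne's or asgiref's actual code: `slow_view`, `application`, `serve` and the deadline values are hypothetical names for illustration, and the sketch assumes the server simply cancels the application task once a deadline passes.

```python
import asyncio
import time


def slow_view(seconds):
    # Stand-in for a synchronous Django view blocked on a slow database query.
    time.sleep(seconds)
    return "response"


async def application(seconds):
    # The sync view runs in a worker thread while this task awaits its future,
    # which is the same shape as the `await fut` frame in the traceback.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, slow_view, seconds)


async def serve(view_seconds, deadline):
    # Hypothetical server loop: give the application `deadline` seconds,
    # then cancel its task if it has not finished.
    task = asyncio.ensure_future(application(view_seconds))
    done, _pending = await asyncio.wait({task}, timeout=deadline)
    if task in done:
        return task.result()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        # The cancellation surfaces inside the task at the point where it was
        # awaiting, even though the view itself contains no async code.
        return "CancelledError"


def run(view_seconds, deadline):
    return asyncio.run(serve(view_seconds, deadline))
```

In this sketch a fast view returns normally, while a view that outlives the deadline ends in CancelledError. That would be consistent with the error appearing on any page once the database is saturated, and more often on pages that are already slow.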
Using AWS CloudWatch, you can check the CPU utilization history. (There are many other metrics to watch, but CPU utilization was the only problematic one for us.)
After upgrading the DB instance type, the problems were gone right away.
The AWS documentation on CloudWatch metrics for RDS covers this in more detail.
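If you would rather check this from a script than from the console, something like the following might work. It is only a sketch: it assumes you pass in a boto3 CloudWatch client, and the function names, the 90% threshold and the 24-hour window are my own choices for illustration, not anything from the setup above.

```python
from datetime import datetime, timedelta, timezone


def fetch_cpu_datapoints(cloudwatch, db_instance_id, hours=24):
    # `cloudwatch` is expected to be a boto3 CloudWatch client,
    # e.g. boto3.client("cloudwatch"); it is passed in so it can be swapped out.
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,  # one datapoint per 5 minutes
        Statistics=["Maximum"],
    )
    return response["Datapoints"]


def saturated_periods(datapoints, threshold=90.0):
    # Keep only the periods whose peak CPU reached the (assumed) threshold,
    # ordered by time so they can be compared with the error timestamps.
    hot = [p for p in datapoints if p["Maximum"] >= threshold]
    return sorted(hot, key=lambda p: p["Timestamp"])
```

With boto3 installed and credentials configured, you would call `saturated_periods(fetch_cpu_datapoints(boto3.client("cloudwatch"), "<your DB instance identifier>"))` and compare the returned timestamps with the times of the CancelledError events in Sentry.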