Celery: WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL)

I use Celery with RabbitMQ in my Django app (on Elastic Beanstalk) to manage background tasks, and I daemonized it using Supervisor. The problem is that one of the periodic tasks I defined is failing (after a week in which it worked properly); the error I get is:

[01/Apr/2014 23:04:03] [ERROR] [celery.worker.job:272] Task clean-dead-sessions[1bfb5a0a-7914-4623-8b5b-35fc68443d2e] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
  File "/opt/python/run/venv/lib/python2.7/site-packages/billiard/pool.py", line 1168, in mark_as_worker_lost
    human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).

All the processes managed by supervisor are up and running properly (supervisorctl status says RUNNING).

I tried reading several logs on my EC2 instance, but none of them helped me find out what is causing the SIGKILL. What should I do? How can I investigate?

These are my celery settings:

CELERY_TIMEZONE = 'UTC'
CELERY_TASK_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']
BROKER_URL = os.environ['RABBITMQ_URL']
CELERY_IGNORE_RESULT = True
CELERY_DISABLE_RATE_LIMITS = False
CELERYD_HIJACK_ROOT_LOGGER = False

And this is my supervisord.conf:

[program:celery_worker]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery worker -A com.cygora -l info --pidfile=/opt/python/run/celery_worker.pid
startsecs=10
stopwaitsecs=60
stopasgroup=true
killasgroup=true
autostart=true
autorestart=true
stdout_logfile=/opt/python/log/celery_worker.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_worker.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1

[program:celery_beat]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery beat -A com.cygora -l info --pidfile=/opt/python/run/celery_beat.pid --schedule=/opt/python/run/celery_beat_schedule
startsecs=10
stopwaitsecs=300
stopasgroup=true
killasgroup=true
autostart=false
autorestart=true
stdout_logfile=/opt/python/log/celery_beat.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_beat.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1

Edit 1

After restarting celery beat the problem remains.

Edit 2

Changed killasgroup=true to killasgroup=false and the problem remains.

asked Apr 02 '14 by daveoncode


2 Answers

The SIGKILL your worker received was initiated by another process. Your supervisord config looks fine, and killasgroup would only affect a supervisor-initiated kill (e.g. via supervisorctl or a plugin); without that setting, supervisor would have sent the signal to the dispatcher anyway, not the child.

Most likely you have a memory leak and the OS's OOM killer is assassinating your process for bad behavior.

Run grep oom /var/log/messages. If you see messages, that's your problem.

If you don't find anything, try running the periodic task manually in a shell:

MyPeriodicTask().run()

And see what happens. I'd monitor system and process metrics with top in another terminal, if you don't have good instrumentation (Cacti, Ganglia, etc.) for this host.
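If the task is the culprit, you should see its memory use climb while it runs. Here is a minimal sketch (not part of the original answer) of doing that check from a Django/Python shell; the import path myapp.tasks.clean_dead_sessions is a hypothetical stand-in for your own task:

# Run the suspect task synchronously in this process and log memory growth.
import resource

from myapp.tasks import clean_dead_sessions  # hypothetical task module


def rss_mb():
    # ru_maxrss is reported in kilobytes on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


print("RSS before: %.1f MB" % rss_mb())
clean_dead_sessions.apply()  # executes the task eagerly, not via the broker
print("RSS after:  %.1f MB" % rss_mb())

A large jump between the two measurements points at the task holding on to too much data, which is what would attract the OOM killer on the worker host.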

answered Sep 20 '22 by Nino Walker


This kind of error is raised when your asynchronous task (run through Celery), or the script you are using, stores a lot of data in memory. That causes a memory leak.

In my case, I was fetching data from another system and saving it in a variable, so that I could export all of it (into a Django model / Excel file) after finishing the whole process.

Here is the catch: my script was gathering 10 million records, and holding them all in a Python variable drained the memory, which raised the error.

To overcome the issue, I divided the 10 million records into 20 parts (half a million each). Whenever the collected data reached half a million records, I stored it in my preferred local file / Django model, cleared the variable, and then did the same for the next half million, and so on.

You don't need to use that exact number of partitions. The idea is to solve a complex problem by splitting it into multiple subproblems and solving them one by one. :D
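As an illustration of that idea, here is a minimal sketch, assuming a hypothetical Django model Session and a hypothetical generator fetch_records that yields rows from the external system; it flushes the buffer to the database every half million records instead of keeping everything in memory:

from myapp.models import Session  # hypothetical model

CHUNK_SIZE = 500000  # flush every half million records


def import_all(fetch_records):
    buffer = []
    for record in fetch_records():
        buffer.append(Session(data=record))
        if len(buffer) >= CHUNK_SIZE:
            Session.objects.bulk_create(buffer)  # persist this chunk
            buffer = []                          # release the memory
    if buffer:
        Session.objects.bulk_create(buffer)      # persist the remainder

Because the buffer never grows past one chunk, the worker's memory stays bounded and the OOM killer has no reason to send SIGKILL.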

answered Sep 18 '22 by Farid Chowdhury