MWAA Airflow Scaling: what do I do when I have to run frequent & time-consuming scripts? (Negsignal.SIGKILL)

I have an MWAA Airflow environment in my AWS account. The DAG I am setting up is supposed to read a large amount of data from S3 bucket A, filter out what I want, and dump the filtered results to S3 bucket B. It needs to run every minute since the data arrives every minute, and each run processes about 200 MB of JSON data.
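For context, the DAG has roughly this shape (below is a minimal sketch; the bucket names, key layout, and filter predicate are placeholders, not the real logic):

    import json
    from datetime import datetime, timedelta

    import boto3
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    s3 = boto3.client("s3")

    def filter_and_copy(**context):
        # Placeholder logic: read the latest JSON objects from bucket A,
        # keep the matching records, and write the result to bucket B.
        resp = s3.list_objects_v2(Bucket="bucket-a", Prefix="incoming/")
        for obj in resp.get("Contents", []):
            body = s3.get_object(Bucket="bucket-a", Key=obj["Key"])["Body"].read()
            kept = [r for r in json.loads(body) if r.get("keep")]  # placeholder filter
            s3.put_object(
                Bucket="bucket-b",
                Key="filtered/" + obj["Key"].split("/")[-1],
                Body=json.dumps(kept).encode("utf-8"),
            )

    with DAG(
        dag_id="filter_s3_data",
        start_date=datetime(2021, 9, 1),
        schedule_interval=timedelta(minutes=1),  # data arrives every minute
        catchup=False,
    ) as dag:
        PythonOperator(task_id="filter_and_copy", python_callable=filter_and_copy)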

My initial setup used environment class mw1.small with 10 worker machines. If I run the task only once in this setup, it takes about 8 minutes to finish. But when I start the schedule to run every minute, most runs cannot finish: they start to take much longer (around 18 minutes) and fail with this error message:

[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL

I tried scaling the environment class up to mw1.large with 15 workers. More jobs were able to complete before the error showed up, but it still could not keep up with the one-minute ingestion rate. The Negsignal.SIGKILL error would still appear before even reaching the worker machine maximum.

At this point, what should I do to scale this? I could imagine opening another Airflow environment, but that does not really make sense. There must be a way to do it within one environment.

1 Answer

I've found the solution to this. For MWAA, edit the environment and, under Airflow configuration options, set these configs:

  1. celery.sync_parallelism = 1
  2. celery.worker_autoscale = 1,1

This makes sure each worker machine runs one job at a time, preventing multiple jobs from sharing a worker, which saves memory and reduces runtime.
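If you prefer to apply this programmatically rather than through the console, the same overrides can be set with boto3 (the environment name below is a placeholder):

    import boto3

    mwaa = boto3.client("mwaa")

    # Set the two Celery overrides on an existing MWAA environment.
    mwaa.update_environment(
        Name="my-mwaa-env",  # placeholder environment name
        AirflowConfigurationOptions={
            "celery.sync_parallelism": "1",
            "celery.worker_autoscale": "1,1",
        },
    )

Either way, this triggers an environment update, which takes some time to roll out before the new settings take effect.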
