I have a task in Airflow which downloads a file from GitHub to the local file system, passes it to spark-submit, and then deletes it. I wanted to know if this will create any issues.
Is it possible that two workers running the same task concurrently, on two different DAG runs, end up referencing the same file?
Sample code -->
def python_task_callback():
    download_file(file_name='script.py')
    spark_submit(path='/temp/script.py')
    delete_file(path='/temp/script.py')
For your use case, if you do all of the actions you mentioned (download, run spark-submit, delete) in a single task, then you will have no problems regardless of which executor you are running.
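A minimal sketch of that single-task approach, assuming a hypothetical GitHub raw URL and a spark-submit binary on the worker's PATH (neither is from the original post). Writing to a per-run temporary directory also keeps two concurrent runs on the same worker from colliding on a shared path:

import subprocess
import tempfile
from pathlib import Path

import requests

# Illustrative raw-file URL; substitute your own repository and path.
SCRIPT_URL = "https://raw.githubusercontent.com/your-org/your-repo/main/script.py"

def python_task_callback():
    # Each run gets its own temporary directory, so two concurrent DAG runs
    # on the same worker never reference the same file.
    with tempfile.TemporaryDirectory() as tmp_dir:
        script_path = Path(tmp_dir) / "script.py"

        # Download the script from GitHub to the local file system.
        response = requests.get(SCRIPT_URL, timeout=60)
        response.raise_for_status()
        script_path.write_bytes(response.content)

        # Hand the local copy to spark-submit.
        subprocess.run(["spark-submit", str(script_path)], check=True)

    # The temporary directory (and the script with it) is removed on exit.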
If you are splitting the actions across several tasks, then you should use shared storage like S3, Google Cloud Storage, etc. In that case it will also work regardless of which executor you are using. A possible workflow (sketched in code after the list) could be:
1st task: copy the file from GitHub to S3
2nd task: submit the file for processing
3rd task: delete the file from S3
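A rough sketch of that three-task layout, assuming boto3 with configured AWS credentials and an illustrative bucket name and script URL (none of these come from the original answer); in practice you might instead reach for the Amazon provider's S3 operators or, for the second step, the Spark provider's SparkSubmitOperator:

from datetime import datetime

import boto3
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

# Illustrative names; replace with your own bucket, key and script URL.
BUCKET = "my-spark-scripts"
KEY = "scripts/script.py"
SCRIPT_URL = "https://raw.githubusercontent.com/your-org/your-repo/main/script.py"

def copy_github_to_s3():
    body = requests.get(SCRIPT_URL, timeout=60).content
    boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=body)

def submit_for_processing():
    # Point the processing job (e.g. a Spark cluster that can read from S3)
    # at s3://BUCKET/KEY instead of a local worker path.
    ...

def delete_from_s3():
    boto3.client("s3").delete_object(Bucket=BUCKET, Key=KEY)

with DAG(
    dag_id="github_to_s3_spark",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    copy_task = PythonOperator(task_id="copy_to_s3", python_callable=copy_github_to_s3)
    submit_task = PythonOperator(task_id="submit", python_callable=submit_for_processing)
    delete_task = PythonOperator(task_id="delete_from_s3", python_callable=delete_from_s3)

    # Because the file lives in S3, it does not matter which worker runs each task.
    copy_task >> submit_task >> delete_task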
As for your general question of whether tasks share a disk - that depends on the executor you are using.
With the LocalExecutor you have only one worker, so all tasks run on the same machine and share its disk.
With the CeleryExecutor, KubernetesExecutor and others, tasks may run on different workers.
However, as mentioned, don't assume that tasks share a disk: if you ever need to scale up from the LocalExecutor to the CeleryExecutor, you don't want to find yourself having to refactor your code.
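The executor itself is an Airflow configuration setting rather than anything in the DAG code, so scaling from Local to Celery is a deployment change; the fragment below is just the relevant airflow.cfg entry (the same value can also be set via the AIRFLOW__CORE__EXECUTOR environment variable), with the broker and worker setup out of scope here:

[core]
# LocalExecutor: all tasks run on one machine and share its disk.
# CeleryExecutor / KubernetesExecutor: tasks may land on different workers.
executor = CeleryExecutor

Writing tasks that never rely on a shared local disk means this switch does not force a code refactor.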