I am evaluating whether Airflow can take over my data ingestion. The original ingestion is done in two shell steps: change into /home/pchoix/bm3, then run ./bm3.py runjob -p client1 -j ingestion.
In Airflow, I have two tasks with BashOperator:
task1 = BashOperator(
    task_id='switch2BMhome',
    bash_command="cd /home/pchoix/bm3",
    dag=dag)

task2 = BashOperator(
    task_id='kickoff_bm3',
    bash_command="./bm3.py runjob -p client1 -j ingestion",
    dag=dag)

task1 >> task2
task1 completed as expected; its log is below:
[2019-03-01 16:50:17,638] {bash_operator.py:100} INFO - Temporary script location: /tmp/airflowtmpkla8w_xd/switch2ALhomeelbcfbxb
[2019-03-01 16:50:17,638] {bash_operator.py:110} INFO - Running command: cd /home/rxie/al2
task2 failed for the reason shown in its log:
[2019-03-01 16:51:19,896] {bash_operator.py:100} INFO - Temporary script location: /tmp/airflowtmp328cvywu/kickoff_al2710f17lm
[2019-03-01 16:51:19,896] {bash_operator.py:110} INFO - Running command: ./bm32.py runjob -p client1 -j ingestion
[2019-03-01 16:51:19,902] {bash_operator.py:119} INFO - Output:
[2019-03-01 16:51:19,903] {bash_operator.py:123} INFO - /tmp/airflowtmp328cvywu/kickoff_al2710f17lm: line 1: ./bm3.py: No such file or directory
So it seems every task is executed from a seemingly unique temporary folder, so the cd in task1 has no effect on task2, which then fails.
How can I run the bash command from a specific location?
Any thoughts are highly appreciated.
Thank you very much.
UPDATE: Thanks for the suggestion, which almost works.
bash_command="cd /home/pchoix/bm3 && ./bm3.py runjob -p client1 -j ingestion" gets past the original problem. However, runjob consists of multiple tasks. The first one works; the second invokes impala-shell.py, which specifies python2 as its interpreter, while everything else uses Python 3.
This is fine when I run the same bash_command in a shell. In Airflow, for an unknown reason, it is not, even though I set the correct PATH and confirmed it in the shell:
(base) (venv) [pchoix@hadoop02 ~]$ python
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
The task is still executed with Python 3, as the log shows:
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - File "/data/cloudera/parcels/CDH-5.12.0-1.cdh5.12.0.p0.29/bin/../lib/impala-shell/impala_shell.py", line 220
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - print '\tNo options available.'
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - ^
[2019-03-01 21:42:08,040] {bash_operator.py:123} INFO - SyntaxError: Missing parentheses in call to 'print'
Note that this issue does not occur when I run the job directly in a shell:
./bm3.py runjob -p client1 -j ingestion
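One possible explanation, which the logs above do not confirm, is that the Airflow worker's PATH lists a Python 3 directory ahead of the Python 2 one, so a bare `python` resolves differently there than in the login shell. The sketch below only illustrates that mechanism with two throwaway directories and fake `python` files; the names `make_fake_interpreter`, `resolve`, and `demo` are made up for this example.

```python
import os
import shutil
import tempfile


def make_fake_interpreter(directory, name="python"):
    """Create a small executable file so a PATH lookup has something to find."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(path, 0o755)
    return path


def resolve(command, path_value):
    """Return the file `command` resolves to under the given PATH string."""
    return shutil.which(command, path=path_value)


def demo():
    # Hypothetical stand-ins for a Python 2 bin directory and a Python 3 bin directory.
    with tempfile.TemporaryDirectory() as py2_bin, tempfile.TemporaryDirectory() as py3_bin:
        py2 = make_fake_interpreter(py2_bin)
        py3 = make_fake_interpreter(py3_bin)

        # Assumed login-shell ordering: the Python 2 directory comes first.
        shell_path = os.pathsep.join([py2_bin, py3_bin])
        # Assumed worker ordering: the Python 3 directory comes first.
        worker_path = os.pathsep.join([py3_bin, py2_bin])

        return (
            resolve("python", shell_path) == py2,
            resolve("python", worker_path) == py3,
        )


if __name__ == "__main__":
    print(demo())
```

To check the real worker, a throwaway task whose bash_command is `which python; echo $PATH` would show what the task actually sees. BashOperator also accepts an `env` mapping that replaces the inherited environment, so passing a PATH there is one way to test this hypothesis.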
How about:
task = BashOperator(
    task_id='switch2BMhome',
    bash_command="cd /home/pchoix/bm3 && ./bm3.py runjob -p client1 -j ingestion",
    dag=dag)
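The reason this works is that each BashOperator task runs its command in a separate bash process, so a `cd` in one task is gone by the time the next task starts. Here is a minimal sketch of that behavior using only the standard library, with no Airflow involved. `run_in_fresh_shell` and `compare_separate_and_chained` are hypothetical helper names, and the temporary directories stand in for /home/pchoix/bm3 and Airflow's temp folder.

```python
import os
import shlex
import subprocess
import tempfile


def run_in_fresh_shell(command, start_dir):
    """Run `command` in its own bash process started in `start_dir`."""
    completed = subprocess.run(
        ["bash", "-c", command],
        cwd=start_dir,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return completed.stdout.strip()


def compare_separate_and_chained():
    with tempfile.TemporaryDirectory() as job_home, tempfile.TemporaryDirectory() as scratch:
        target = shlex.quote(job_home)

        # Two separate "tasks": the cd happens in one shell, pwd runs in another.
        run_in_fresh_shell("cd " + target, scratch)
        separate = run_in_fresh_shell("pwd -P", scratch)

        # One "task": cd and the follow-up command share the same shell.
        chained = run_in_fresh_shell("cd " + target + " && pwd -P", scratch)

        return (
            separate == os.path.realpath(scratch),   # the cd did not carry over
            chained == os.path.realpath(job_home),   # the cd took effect
        )


if __name__ == "__main__":
    print(compare_separate_and_chained())
```

Chaining with `&&` also means the second command is skipped if the `cd` fails, which is usually what you want for a job that depends on its working directory.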