I'm pretty new to Airflow. I've read through the documentation several times, torn through numerous S/O questions and many random articles online, but have yet to fix this issue. I have a feeling it's something super simple I'm doing wrong.
I have Docker for Windows and I pulled the puckel/docker-airflow
image and ran a container with ports exposed so I can hit the UI from my host. I have another container running mcr.microsoft.com/mssql/server
on which I restored the WideWorldImporters sample db. From the Airflow UI, I have been able to successfully create the connection to this db and can even query it from the Data Profiling section. Check images below:
Connection Creation
Successful Query to Connection
So while this works, my DAG fails at the second task, sqlData. Here is the code:
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.mssql_operator import MsSqlOperator
from datetime import timedelta, datetime
copyData = DAG(
    dag_id='copyData',
    schedule_interval='@once',
    start_date=datetime(2019, 1, 1)
)

printHelloBash = BashOperator(
    task_id='print_hello_Bash',
    bash_command='echo "Lets copy some data"',
    dag=copyData
)

mssqlConnection = "WWI"

sqlData = MsSqlOperator(
    sql="select top 100 InvoiceDate, TotalDryItems from sales.invoices",
    task_id="select_some_data",
    mssql_conn_id=mssqlConnection,
    database="WideWorldImporters",
    dag=copyData,
    depends_on_past=True
)

queryDataSuccess = BashOperator(
    task_id='confirm_data_queried',
    bash_command='echo "We queried data!"',
    dag=copyData
)

printHelloBash >> sqlData >> queryDataSuccess
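A quick way to reproduce only the failing task, rather than the whole DAG, is Airflow's single-task test command run from inside the webserver container (the IDs and date below match the DAG above):

# Execute only the MsSqlOperator task for one execution date, bypassing the
# scheduler and upstream dependency checks; output prints to the console
airflow test copyData select_some_data 2019-01-01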
Initially the error was:
[2019-02-22 16:13:09,176] {{logging_mixin.py:95}} INFO - [2019-02-22 16:13:09,176] {{base_hook.py:83}} INFO - Using connection to: 172.17.0.3
[2019-02-22 16:13:09,186] {{models.py:1760}} ERROR - Could not create Fernet object: Incorrect padding
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 171, in get_fernet
    _fernet = Fernet(fernet_key.encode('utf-8'))
  File "/usr/local/lib/python3.6/site-packages/cryptography/fernet.py", line 34, in __init__
    key = base64.urlsafe_b64decode(key)
  File "/usr/local/lib/python3.6/base64.py", line 133, in urlsafe_b64decode
    return b64decode(s)
  File "/usr/local/lib/python3.6/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
I noticed that this has to do with cryptography, so I went ahead and ran pip install cryptography and pip install airflow[crypto], both of which reported that the requirement was already satisfied. Finally, I found something that said I just need to generate a fernet_key. The default key in my airflow.cfg file was fernet_key = $FERNET_KEY. So from the CLI in the container I ran:
python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
and got a key that I used to replace $FERNET_KEY. I restarted the container and re-ran the DAG, and now my error is:
[2019-02-22 16:22:13,641] {{models.py:1760}} ERROR -
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/cryptography/fernet.py", line 106, in _verify_signature
    h.verify(data[-32:])
  File "/usr/local/lib/python3.6/site-packages/cryptography/hazmat/primitives/hmac.py", line 69, in verify
    ctx.verify(signature)
  File "/usr/local/lib/python3.6/site-packages/cryptography/hazmat/backends/openssl/hmac.py", line 73, in verify
    raise InvalidSignature("Signature did not match digest.")
cryptography.exceptions.InvalidSignature: Signature did not match digest.
Which, from an initial scan of the cryptography docs, has something to do with compatibility? I'm at a loss now and decided I'd ask this question to see if I'm potentially going down the wrong path in resolving this. Any help would be greatly appreciated, as Airflow seems awesome.
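For reference, from what I can tell the key can also be injected from outside the container via Airflow's standard environment-variable override (AIRFLOW__CORE__FERNET_KEY) rather than edited into airflow.cfg inside a running container; a rough sketch of what I mean (the container name is just a placeholder):

# Generate a Fernet key on the host (same command as above)
FERNET_KEY=$(python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")

# Start the Airflow webserver with the key supplied as an environment
# variable, so the same key is used on every restart
docker run -d --name airflow_web -p 8080:8080 \
    -e AIRFLOW__CORE__FERNET_KEY="$FERNET_KEY" \
    puckel/docker-airflow webserver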
Thanks to some side communication from @Tomasz, I finally got my DAG to work. He recommended I try using docker-compose, which is also listed in the puckel/docker-airflow GitHub repo. I ended up using the docker-compose-LocalExecutor.yml file instead of the Celery Executor one, though. There was some small troubleshooting and more configuration I had to go through as well. To begin, I took my existing MSSQL container that had the sample db in it and turned it into an image using docker commit mssql_container_name (the full command is sketched after the compose file below). The only reason I did this was to save the time of restoring the backup sample dbs; you could always copy the backups into the container and restore them later if you want. Then I added my new image to the existing docker-compose-LocalExecutor.yml file like so:
version: '2.1'
services:
    postgres:
        image: postgres:9.6
        environment:
            - POSTGRES_USER=airflow
            - POSTGRES_PASSWORD=airflow
            - POSTGRES_DB=airflow

    mssql:
        image: dw:latest
        ports:
            - "1433:1433"

    webserver:
        image: puckel/docker-airflow:1.10.2
        restart: always
        depends_on:
            - postgres
            - mssql
        environment:
            - LOAD_EX=n
            - EXECUTOR=Local
        # volumes:
        #     - ./dags:/usr/local/airflow/dags
        # Uncomment to include custom plugins
        #     - ./plugins:/usr/local/airflow/plugins
        ports:
            - "8080:8080"
        command: webserver
        healthcheck:
            test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
            interval: 30s
            timeout: 30s
            retries: 3
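For completeness, producing the dw:latest image referenced above from the existing MSSQL container is a single commit; something like this (the source container name is whatever your MSSQL container is called):

# Snapshot the running MSSQL container (with the restored WideWorldImporters
# backup) into an image that docker-compose can reference as dw:latest
docker commit mssql_container_name dw:latest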
Mind you, dw is what I named the new image that was based off of the mssql container. Next, I renamed the file to just docker-compose.yml so that I could simply run docker-compose up (you can also point at a specific file with docker-compose -f docker-compose-LocalExecutor.yml up).

Once everything was up and running, I navigated to the Airflow UI and configured my connection. Note: since you are using docker-compose, you don't need to know the IP addresses of the other containers; they can reach each other by service name thanks to Docker's DNS service discovery, which I found out about here. Then, to test the connection, I went to Data Profiling to do an ad-hoc query, but the connection wasn't there. This is because the puckel/docker-airflow image doesn't have pymssql installed. So just bash into the container with docker exec -it airflow_webserver_container bash and install it with pip install pymssql --user. Exit the container and restart all services using docker-compose restart; the end-to-end sequence is sketched below.

After a minute everything was up and running: my connection showed up in Ad Hoc Query and I could successfully select data. Finally, I turned my DAG on, the scheduler picked it up, and everything succeeded! Super relieved after spending weeks of googling. Thanks to @y2k-shubham for helping out, and some super huge appreciation to @Tomasz, who I actually reached out to initially after his awesome and thorough post about Airflow on the r/datascience subreddit.
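For anyone following along, the whole sequence condenses to roughly the following (the container name airflow_webserver_container, the sa credentials, and the WWI connection ID are placeholders matching what's described above; the connection can of course also be created in the UI as I did):

# Bring up Postgres, MSSQL, and the Airflow webserver defined in docker-compose.yml
docker-compose up -d

# The stock puckel image has no MSSQL driver, so install pymssql inside the
# webserver container (this is lost if the container is recreated; baking it
# into a derived image is the more durable option)
docker exec -it airflow_webserver_container pip install --user pymssql

# Restart the services so Airflow picks up the new package
docker-compose restart

# Optionally, create the connection from the CLI instead of the UI.
# With docker-compose the host is simply the service name "mssql";
# the login and password here are placeholders.
docker exec -it airflow_webserver_container \
    airflow connections --add --conn_id WWI \
    --conn_uri 'mssql://sa:YourStrongPassword@mssql:1433/WideWorldImporters'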