Airflow Fernet_Key issue when trying to query a mssql db

I'm pretty new to Airflow. I've read through the documentation several times, torn through numerous S/O questions and many random articles online, but have yet to fix this issue. I have a feeling it's something super simple I'm doing wrong. I have Docker for Windows, and I pulled the puckel/docker-airflow image and ran a container with ports exposed so I can hit the UI from my host. I have another container running mcr.microsoft.com/mssql/server, on which I restored the WideWorldImporters sample db. From the Airflow UI, I have been able to successfully create the connection to this db and can even query it from the Data Profiling section. (Screenshots: connection creation; successful query against the connection.)
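
(For reference, the container setup described above typically looks something like the following; this is only a sketch, and the container names and SA password are placeholders rather than the exact commands used here.)

# Airflow webserver (puckel image) with the UI published to the host
docker run -d --name airflow -p 8080:8080 puckel/docker-airflow webserver

# SQL Server container; the WideWorldImporters backup gets restored into it afterwards
docker run -d --name mssql -p 1433:1433 \
  -e ACCEPT_EULA=Y -e SA_PASSWORD='YourStrong!Passw0rd' \
  mcr.microsoft.com/mssql/server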

So while this works, my DAG fails at the second task, sqlData. Here is the code:

from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.mssql_operator import MsSqlOperator
from datetime import timedelta, datetime

copyData = DAG(
    dag_id='copyData',
    schedule_interval='@once',
    start_date=datetime(2019, 1, 1)
)

printHelloBash = BashOperator(
    task_id='print_hello_Bash',
    bash_command='echo "Lets copy some data"',
    dag=copyData
)

# Connection ID created in the Airflow UI (Admin -> Connections)
mssqlConnection = "WWI"

sqlData = MsSqlOperator(
    sql="select top 100 InvoiceDate, TotalDryItems from sales.invoices",
    task_id="select_some_data",
    mssql_conn_id=mssqlConnection,
    database="WideWorldImporters",
    dag=copyData,
    depends_on_past=True
)

queryDataSuccess = BashOperator(
    task_id='confirm_data_queried',
    bash_command='echo "We queried data!"',
    dag=copyData
)

printHelloBash >> sqlData >> queryDataSuccess

Initially the error was:

[2019-02-22 16:13:09,176] {{logging_mixin.py:95}} INFO - [2019-02-22 16:13:09,176] {{base_hook.py:83}} INFO - Using connection to: 172.17.0.3
[2019-02-22 16:13:09,186] {{models.py:1760}} ERROR - Could not create Fernet object: Incorrect padding  
Traceback (most recent call last):  
  File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 171, in get_fernet
    _fernet = Fernet(fernet_key.encode('utf-8'))  
  File "/usr/local/lib/python3.6/site-packages/cryptography/fernet.py", line 34, in __init__
    key = base64.urlsafe_b64decode(key)  
  File "/usr/local/lib/python3.6/base64.py", line 133, in urlsafe_b64decode
    return b64decode(s)  
  File "/usr/local/lib/python3.6/base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding

I noticed that this has to do with cryptography, so I went ahead and ran pip install cryptography and pip install airflow[crypto], both of which returned the exact same result informing me that the requirement was already satisfied. Finally, I found something that said I just need to generate a fernet_key. The default key in my airflow.cfg file was fernet_key = $FERNET_KEY. So from the CLI in the container I ran:

python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"

That produced a key, which I used to replace $FERNET_KEY. I restarted the container and re-ran the DAG, and now my error is:
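
(Side note: with the puckel image, a more durable way to pin the key is to pass it to the container as an environment variable instead of editing airflow.cfg by hand, so it survives the container being recreated. A rough sketch, assuming a single all-in-one container:)

# Generate one key and hand it to every Airflow container via the standard
# environment override, so the webserver and scheduler share the same key.
FERNET_KEY=$(python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())")

docker run -d -p 8080:8080 \
  -e AIRFLOW__CORE__FERNET_KEY="$FERNET_KEY" \
  puckel/docker-airflow webserver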

[2019-02-22 16:22:13,641] {{models.py:1760}} ERROR -   
Traceback (most recent call last):  
  File "/usr/local/lib/python3.6/site-packages/cryptography/fernet.py", line 106, in _verify_signature
    h.verify(data[-32:])  
  File "/usr/local/lib/python3.6/site-packages/cryptography/hazmat/primitives/hmac.py", line 69, in verify
    ctx.verify(signature)  
  File "/usr/local/lib/python3.6/site-packages/cryptography/hazmat/backends/openssl/hmac.py", line 73, in verify
    raise InvalidSignature("Signature did not match digest.")  
cryptography.exceptions.InvalidSignature: Signature did not match digest.

From an initial scan of the cryptography docs, this looks like a key mismatch: presumably the connection details stored in Airflow's metadata database were encrypted with the old key and no longer verify against the new one?
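
(If that's the cause, the stored connection would need to be re-saved after the key change so it gets re-encrypted with the new key. A rough sketch of doing that from the Airflow 1.10 CLI inside the container; the URI below is a placeholder, not the real credentials:)

# Drop and re-create the connection so its password is encrypted with the current key
airflow connections --delete --conn_id WWI
airflow connections --add --conn_id WWI \
  --conn_uri 'mssql://sa:your_password@172.17.0.3:1433/WideWorldImporters'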

I'm at a loss now and decided to ask this question to see whether I'm going down the wrong path in resolving this. Any help would be greatly appreciated, as Airflow seems awesome.

asked by gclarkjr5

1 Answer

Thanks to some side communication from @Tomasz, I finally got my DAG to work. He recommended I try docker-compose, which is also covered in the puckel/docker-airflow GitHub repo. I ended up using the docker-compose-LocalExecutor.yml file rather than the Celery Executor one, though. There was some small troubleshooting and extra configuration I had to go through as well.

To begin, I took my existing MSSQL container that had the sample db in it and turned it into an image using docker commit mssql_container_name. The only reason I did this was to save the time of having to restore the backup sample dbs; you could always copy the backups into the container and restore them later if you want. Then I added my new image to the existing docker-compose-LocalExecutor.yml file like so:

version: '2.1'
services:
    postgres:
        image: postgres:9.6
        environment:
            - POSTGRES_USER=airflow
            - POSTGRES_PASSWORD=airflow
            - POSTGRES_DB=airflow

    mssql:
        image: dw:latest
        ports:
            - "1433:1433"

    webserver:
        image: puckel/docker-airflow:1.10.2
        restart: always
        depends_on:
            - postgres
            - mssql
        environment:
            - LOAD_EX=n
            - EXECUTOR=Local
        #volumes:
            #- ./dags:/usr/local/airflow/dags
            # Uncomment to include custom plugins
            # - ./plugins:/usr/local/airflow/plugins
        ports:
            - "8080:8080"
        command: webserver
        healthcheck:
            test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
            interval: 30s
            timeout: 30s
            retries: 3

Mind you, dw is what I named the new image that was based off of the mssql container. Next, I renamed the file to just docker-compose.yml so that I could simply run docker-compose up (docker-compose -f can also point at a different YAML file, but renaming was easier). Once everything was up and running, I navigated to the Airflow UI and configured my connection. Note: since you are using docker-compose, you don't need to know the IP addresses of the other containers; they reach each other by service name thanks to Docker's built-in DNS service discovery, which I found out about here.

Then, to test the connection, I went to Data Profiling to do an ad-hoc query, but the connection wasn't there. This is because the puckel/docker-airflow image doesn't have pymssql installed. So just bash into the container with docker exec -it airflow_webserver_container bash and install it with pip install pymssql --user. Exit the container and restart all services using docker-compose restart.

After a minute everything was up and running. My connection showed up in Ad Hoc Query and I could successfully select data. Finally, I turned my DAG on, the scheduler picked it up, and everything was successful. Super relieved after spending weeks of googling. Thanks to @y2k-shubham for helping out, and some super huge appreciation to @Tomasz, whom I actually reached out to initially after his awesome and thorough post about Airflow on the r/datascience subreddit.
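
Put together, the sequence of commands looks roughly like this (a sketch; the container and service names follow the compose file above and the placeholder names used earlier):

# Freeze the prepared SQL Server container into the image the compose file references
docker commit mssql_container_name dw:latest

# Bring up Postgres, SQL Server, and Airflow (file renamed to docker-compose.yml)
docker-compose up -d

# pymssql isn't baked into the puckel image, so install it in the webserver container
docker exec -it airflow_webserver_container pip install --user pymssql

# Restart everything so Airflow picks up the new package
docker-compose restart

# In the Airflow connection, the MSSQL host is simply the compose service name: mssql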

answered by gclarkjr5