I've been trying to use Airflow to schedule a DAG. One of the DAGs includes a task that loads data from an S3 bucket.
For that purpose I need to set up an S3 connection, but the UI provided by Airflow isn't that intuitive (http://pythonhosted.org/airflow/configuration.html?highlight=connection#connections). Has anyone succeeded in setting up the S3 connection? If so, are there any best practices you folks follow?
Thanks.
EDIT: This answer stores your secret key in plain text, which can be a security risk and is not recommended. The best way is to put the access key and secret key in the connection's login and password fields, as mentioned in other answers below. END EDIT
It's hard to find references, but after digging a bit I was able to make it work.
Create a new connection with the following attributes (if you'd rather skip the UI, see the programmatic sketch just after this list):
Conn Id: my_conn_S3
Conn Type: S3
Extra:
{"aws_access_key_id":"_your_aws_access_key_id_", "aws_secret_access_key": "_your_aws_secret_access_key_"}
To use this connection, below you can find a simple S3 sensor test. The idea of this test is to set up a sensor that watches for files in S3 (the check_s3_for_file_in_s3 task) and, once a matching file appears, triggers a bash command (the bash_test task).
Run the Airflow webserver and scheduler:
airflow webserver
airflow scheduler
The schedule_interval in the dag definition is set to '@once' to facilitate debugging.
To run it again, leave everything as it is, remove the files in the bucket, then select the first task in the graph view, choose 'Clear', and tick 'Past', 'Future', 'Upstream', and 'Downstream'. This should kick off the DAG again.
Let me know how it went.
"""
S3 Sensor Connection Test
"""
from airflow import DAG
from airflow.operators import SimpleHttpOperator, HttpSensor, BashOperator, EmailOperator, S3KeySensor
from datetime import datetime, timedelta
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2016, 11, 1),
'email': ['[email protected]'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 5,
'retry_delay': timedelta(minutes=5)
}
dag = DAG('s3_dag_test', default_args=default_args, schedule_interval= '@once')
t1 = BashOperator(
task_id='bash_test',
bash_command='echo "hello, it should work" > s3_conn_test.txt',
dag=dag)
sensor = S3KeySensor(
task_id='check_s3_for_file_in_s3',
bucket_key='file-to-watch-*',
wildcard_match=True,
bucket_name='S3-Bucket-To-Watch',
s3_conn_id='my_conn_S3',
timeout=18*60*60,
poke_interval=120,
dag=dag)
t1.set_upstream(sensor)
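To make the sensor fire, drop an object into the bucket whose key matches the wildcard. Here is a minimal sketch using boto3 (this assumes your local credentials can write to the bucket; the bucket and key names just mirror the placeholders in the DAG above):

import boto3

# Upload a small object whose key matches the 'file-to-watch-*' pattern
# that the S3KeySensor above is poking for.
s3 = boto3.client('s3')
s3.put_object(Bucket='S3-Bucket-To-Watch', Key='file-to-watch-1', Body=b'test')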
Assuming Airflow is hosted on an EC2 server:
Just create the connection as per the other answers, but leave everything blank in the configuration apart from the connection type, which should stay as S3.
The S3Hook will default to boto, and boto will default to the IAM role of the EC2 server you are running Airflow on. Assuming this role has rights to S3, your task will be able to access the bucket.
This is a much safer way than using and storing credentials.
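As a quick way to verify that the role-based setup works, you can call the hook directly from a Python shell on that EC2 box. A minimal sketch, assuming an Airflow 1.10-style layout where the hook lives in airflow.hooks.S3_hook and takes aws_conn_id (on 2.x it moved to the Amazon provider package); 'my_conn_S3' and the bucket name are placeholders:

from airflow.hooks.S3_hook import S3Hook

# 'my_conn_S3' is the otherwise-blank S3 connection described above;
# credentials come from the EC2 instance role via boto, not from the connection.
hook = S3Hook(aws_conn_id='my_conn_S3')
print(hook.list_keys(bucket_name='S3-Bucket-To-Watch'))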
If you are worried about exposing the credentials in the UI, another way is to pass the credential file location in the Extra param in the UI. Only the functional user has read privileges to the file. It looks something like the below:
Extra: {
    "profile": "<profile_name>",
    "s3_config_file": "/home/<functional_user>/creds/s3_credentials",
    "s3_config_format": "aws"
}
file "/home/<functional_user>/creds/s3_credentials
" has below entries
[<profile_name>]
aws_access_key_id = <access_key_id>
aws_secret_access_key = <secret_key>
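To sanity-check that file outside Airflow, you can point boto at it and open the profile. Just an illustrative sketch; AWS_SHARED_CREDENTIALS_FILE is the standard boto3 override for a non-default credentials file, and the path/profile name mirror the placeholders above:

import os
import boto3

# Point boto3 at the non-default credentials file, then load the profile from it.
os.environ['AWS_SHARED_CREDENTIALS_FILE'] = '/home/<functional_user>/creds/s3_credentials'
session = boto3.Session(profile_name='<profile_name>')
print(session.client('s3').list_buckets()['Buckets'])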