 

Why do we need airflow hooks?

Doc says:

Hooks are interfaces to external platforms and databases like Hive, S3, MySQL, Postgres, HDFS, and Pig. Hooks implement a common interface when possible, and act as a building block for operators. Ref

But why do we need them?

I want to select data from one Postgres DB and store it in another one. Can I simply use, for example, the psycopg2 driver inside a Python script that is run by a PythonOperator? Or does Airflow need to know, for some reason, what exactly I'm doing inside the script, so that I have to use PostgresHook instead of the plain psycopg2 driver?

user1443993 asked Jun 22 '20 11:06





1 Answer

You should just use PostgresHook. Instead of using psycopg2 like this:

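import psycopg2

# host, dbname, user and password have to be supplied by the script itself
conn = psycopg2.connect(host=host, dbname=dbname, user=user, password=password)
cur = conn.cursor()
cur.execute(query)
data = cur.fetchall()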

You can just type:

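# Airflow 2 import path; on Airflow 1.10 use airflow.hooks.postgres_hook
from airflow.providers.postgres.hooks.postgres import PostgresHook
postgres = PostgresHook(postgres_conn_id='connection_id')
data = postgres.get_pandas_df(query)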

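The hook reads its credentials from the Airflow connection with that ID, where the password can be stored encrypted, so nothing sensitive has to appear in the script.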

So using hooks is cleaner, safer and easier.
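For the use case in the question (selecting from one Postgres DB and storing into another), a minimal sketch could look like the following. The connection IDs `source_db` and `target_db`, the table names, and the helper `copy_rows` are hypothetical names chosen for illustration. The import path is the Airflow 2 one. `get_records` and `insert_rows` are methods that PostgresHook inherits from Airflow's generic DB-API hook.

```python
def copy_rows(source_hook, target_hook, select_sql, target_table, target_fields=None):
    """Copy the result of select_sql from one database into target_table.

    Both hooks only need get_records() and insert_rows(), which Airflow's
    DB-API based hooks (PostgresHook among them) provide.
    """
    rows = source_hook.get_records(select_sql)
    if rows:
        target_hook.insert_rows(table=target_table, rows=rows, target_fields=target_fields)
    return len(rows)


def copy_between_postgres():
    """Callable for a PythonOperator; connection IDs and tables are hypothetical."""
    # Imported inside the callable so copy_rows stays usable without Airflow.
    # Airflow 2 path; on 1.10 the class lives in airflow.hooks.postgres_hook.
    from airflow.providers.postgres.hooks.postgres import PostgresHook

    source = PostgresHook(postgres_conn_id="source_db")
    target = PostgresHook(postgres_conn_id="target_db")
    return copy_rows(
        source, target, "SELECT id, name FROM users", "users_copy", ["id", "name"]
    )
```

Passing the hooks in as arguments keeps the copy logic independent of Airflow, so it can be exercised with stand-in objects. This version loads all rows into memory, which is only reasonable for small result sets.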

Javier Lopez Tomas answered Oct 18 '22 01:10
