My program is unable to create an SSH tunnel from inside my Docker container running Apache Airflow; running the same function on my local machine works fine. I have a list of servers for which I create a tunnel, query the database, and close the connection. Normally, I'd do it the following way:
import sshtunnel

for server in servers:
    # Forward a local port to the database port on the remote server
    server_conn = sshtunnel.SSHTunnelForwarder(
        server,
        ssh_username=ssh_user,
        ssh_password=ssh_password,
        remote_bind_address=('localhost', db_port),
        local_bind_address=('localhost', localport),
    )
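For reference, the rest of the loop body looks roughly like this; the psycopg2 call and the credential names (db_user, db_password, db_name) are illustrative assumptions, not part of my actual code:

import psycopg2  # assumed client library, only for illustration

# still inside the loop from above
server_conn.start()
try:
    conn = psycopg2.connect(
        host='localhost',
        port=server_conn.local_bind_port,  # the locally forwarded port
        user=db_user,
        password=db_password,
        dbname=db_name,
    )
    with conn, conn.cursor() as cur:
        cur.execute('SELECT 1;')
        print(cur.fetchall())
    conn.close()
finally:
    server_conn.stop()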
This works as expected and I can do whatever I need from there. However, within Docker it does not work. I realize that the container runs in its own network namespace and is not actually a part of the host system, so I used network_mode: "host" to try to mitigate this issue. However, that does not work either, because my containers then lose the ability to communicate with one another. Here is my docker-compose file:
postgres:
  image: postgres:9.6
  environment:
    - POSTGRES_USER=airflow
    - POSTGRES_PASSWORD=airflow
    - POSTGRES_DB=airflow
    - PGDATA=/var/lib/postgresql/data/pgdata
  volumes:
    - ~/.whale/pgdata:/var/lib/postgresql/data/pgdata
    - ./dags/docker/sql/create.sql:/docker-entrypoint-initdb.d/init.sql
  ports:
    - "5432:5432"

webserver:
  image: hawk
  build:
    context: .
    dockerfile: ./dags/docker/Dockerfile-airflow
  restart: always
  depends_on:
    - postgres
    # - redis
  environment:
    - LOAD_EX=n
    - FERNET_KEY=46BKJoQYlPPOexq0OhDZnIlNepKFf87WFwLbfzqDDho=
    - EXECUTOR=Local
  volumes:
    - ./dags:/usr/local/airflow/dags
    # Uncomment to include custom plugins
    # - ./plugins:/usr/local/airflow/plugins
  ports:
    - "8080:8080"
    - "52023:22"
  command: webserver
  healthcheck:
    test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
    interval: 30s
    timeout: 30s
    retries: 3
I also followed the instructions here and got to the point where I can docker exec into my container, manually run the above Python snippet, and get a working connection.
Additionally, I have read the Airflow documentation here, which covers the SSH connection operator, but that only supports running bash commands, and I need my Python function to run. I am truly confused as to why the Python code works when exec-ed into the container but not when I run it via my Airflow DAG. I also cannot manually define all of the connections, because there will be more than 100 once this system is deployed. Any help would be greatly appreciated. If more detail is needed, please let me know.
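For reference, the SSH operator from that documentation only runs a shell command on the remote host, roughly like this (a sketch; the import path is the Airflow 1.x one and the connection ID and command are placeholders), which is why it doesn't cover my Python function:

from airflow.contrib.operators.ssh_operator import SSHOperator  # Airflow 1.x path

run_remote = SSHOperator(
    task_id='run_remote_command',
    ssh_conn_id='my-ssh-conn',  # placeholder Airflow SSH connection
    command='echo "runs a bash command on the remote host"',  # bash only, no Python callable
    dag=dag,
)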
I was having this same issue when opening the tunnel and trying to connect to the database in separate tasks, but got it working by doing both in the same task (Airflow doesn't persist state between task runs):
from airflow.contrib.hooks.ssh_hook import SSHHook        # Airflow 1.x import paths;
from airflow.hooks.postgres_hook import PostgresHook      # in 2.x these live in provider packages
from airflow.operators.python_operator import PythonOperator

def select_from_tunnel_db():
    # Open SSH tunnel through the bastion host
    ssh_hook = SSHHook(ssh_conn_id='bastion-ssh-conn', keepalive_interval=60)
    tunnel = ssh_hook.get_tunnel(5432, remote_host='<db_host>', local_port=5432)
    tunnel.start()

    # Connect to DB through the tunnel and run the query
    pg_hook = PostgresHook(
        postgres_conn_id='remote-db-conn',  # NOTE: host='localhost'
        schema='db_schema'
    )
    pg_cursor = pg_hook.get_conn().cursor()
    pg_cursor.execute('SELECT * FROM table;')
    select_val = pg_cursor.fetchall()
    return select_val
python_operator = PythonOperator(
    task_id='test_tunnel_conn',
    python_callable=select_from_tunnel_db,
    dag=dag
)
This forwards traffic on port 5432 from the local machine to the same port on the remote DB host. The SSHHook requires a working SSH connection ('bastion-ssh-conn') to the endpoint you are tunneling through, and the PostgresHook requires a Postgres connection ('remote-db-conn') whose host is 'localhost' and port is 5432.
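One caveat: the snippet above never stops the tunnel once the query finishes. A minimal variation under the same assumptions (same connection IDs and placeholder host) that cleans it up:

def select_from_tunnel_db():
    ssh_hook = SSHHook(ssh_conn_id='bastion-ssh-conn', keepalive_interval=60)
    tunnel = ssh_hook.get_tunnel(5432, remote_host='<db_host>', local_port=5432)
    tunnel.start()
    try:
        pg_hook = PostgresHook(postgres_conn_id='remote-db-conn', schema='db_schema')
        pg_cursor = pg_hook.get_conn().cursor()
        pg_cursor.execute('SELECT * FROM table;')
        return pg_cursor.fetchall()
    finally:
        tunnel.stop()  # close the forwarded port even if the query fails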