
Airflow psycopg2.OperationalError: FATAL: sorry, too many clients already

I have a four-node clustered Airflow environment that's been working fine for me for a few months now.

ec2-instances

  • Server 1: Webserver, Scheduler, Redis Queue, PostgreSQL Database
  • Server 2: Webserver
  • Server 3: Worker
  • Server 4: Worker

Recently I've been working on a more complex DAG with a few dozen tasks, compared to the relatively small ones I was working on before. I'm not sure if that's why I'm only now seeing this, but I'll sporadically get this error:

In the Airflow UI, under the logs for the task:

psycopg2.OperationalError: FATAL: sorry, too many clients already

And in the webserver output (from running airflow webserver) I get the same error:

[2018-07-23 17:43:46 -0400] [8116] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 2158, in _wrap_pool_connect
    return fn()
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 403, in connect
    return _ConnectionFairy._checkout(self)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 788, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 532, in checkout
    rec = pool._do_get()
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1193, in _do_get
    self._dec_overflow()
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 66, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/util/compat.py", line 187, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 1190, in _do_get
    return self._create_connection()
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 350, in _create_connection
    return _ConnectionRecord(self)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 477, in __init__
    self.__connect(first_connect_check=True)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/pool.py", line 671, in __connect
    connection = pool._invoke_creator(self)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/strategies.py", line 106, in connect
    return dialect.connect(*cargs, **cparams)
  File "/usr/local/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 410, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/usr/local/lib64/python3.6/site-packages/psycopg2/__init__.py", line 130, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL:  sorry, too many clients already

I can fix this by running sudo /etc/init.d/postgresql restart and restarting the DAG, but after about three runs the error comes back.

I can't find anything specific to Airflow on this issue, but from other posts I've found, such as this one, it appears the client (in this case Airflow) is trying to open more connections to PostgreSQL than PostgreSQL is configured to handle. I ran this command and found that my PostgreSQL instance accepts 100 connections:

[ec2-user@ip-1-2-3-4 ~]$ sudo su
root@ip-1-2-3-4
[/home/ec2-user]# psql -U postgres
psql (9.2.24)
Type "help" for help.

postgres=# show max_connections;
 max_connections
-----------------
 100
(1 row)
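
For reference, the number of connections actually in use can be checked the same way, through the built-in pg_stat_activity view; the second query below breaks the count down by client and connection state (these columns are standard in 9.2+):

postgres=# SELECT count(*) FROM pg_stat_activity;
postgres=# SELECT client_addr, state, count(*)
postgres-#   FROM pg_stat_activity
postgres-#  GROUP BY client_addr, state
postgres-#  ORDER BY count(*) DESC;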

That post says I can increase my PostgreSQL max connections, but I'm wondering if I should instead set a value in my airflow.cfg file so that the number of connections Airflow is allowed to open matches my PostgreSQL max_connections size. Does anyone know where I can set this value in Airflow? Here are the fields I think are relevant:

# The SqlAlchemy pool size is the maximum number of database connections
# in the pool.
sql_alchemy_pool_size = 5

# The SqlAlchemy pool recycle is the number of seconds a connection
# can be idle in the pool before it is invalidated. This config does
# not apply to sqlite.
sql_alchemy_pool_recycle = 3600

# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32

# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 32

# When not using pools, tasks are run in the "default pool",
# whose size is guided by this config element
non_pooled_task_slot_count = 128

# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 32
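
For context on how these interact: sql_alchemy_pool_size is per Airflow process, and SQLAlchemy's default max_overflow of 10 lets each pool grow above that, so with a scheduler, webserver gunicorn workers, and worker processes spread over four servers the combined total can exceed 100. A rough airflow.cfg sketch, assuming the installed Airflow version also exposes sql_alchemy_max_overflow (newer releases do; the numbers are illustrative, not recommendations):

# Keep (sql_alchemy_pool_size + sql_alchemy_max_overflow) multiplied by the
# number of Airflow processes talking to the metadata DB below PostgreSQL's
# max_connections.
sql_alchemy_pool_size = 5
sql_alchemy_max_overflow = 10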

I'm open to any suggestions for fixing this issue. Is it something in my Airflow configuration, or is it an issue with my PostgreSQL configuration?

Also, because I'm testing a new DAG, I'll sometimes terminate the running tasks and start them over. Perhaps that's causing some of the processes to not die correctly, leaving dead connections open to PostgreSQL?
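
One way to check for that (using the same postgres superuser session as above) is to look in pg_stat_activity for sessions that have been sitting idle since the tasks were killed; an individual straggler can then be closed with pg_terminate_backend() instead of restarting the whole server:

postgres=# SELECT pid, usename, state, now() - state_change AS idle_for
postgres-#   FROM pg_stat_activity
postgres-#  WHERE state = 'idle'
postgres-#  ORDER BY idle_for DESC;
postgres=# SELECT pg_terminate_backend(12345);  -- 12345 is a hypothetical pid from the query above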

Asked Jul 23 '18 by Kyle Bridenstine


1 Answer

I ran into a similar issue. I changed max_connections in PostgreSQL to 10000 and sql_alchemy_pool_size in the Airflow config to 1000. Now I am able to run hundreds of tasks in parallel.

PS: My machine has 32 cores and 60 GB of memory, so it can take the load.
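
For anyone who wants the concrete edits behind this: roughly, raise max_connections in postgresql.conf (PostgreSQL must be restarted for it to take effect) and raise the pool size in airflow.cfg. The values below are the ones from this answer, not general recommendations; a connection count that high assumes plenty of memory, and putting a pooler such as PgBouncer in front of PostgreSQL is a common alternative:

# postgresql.conf -- requires a PostgreSQL restart
max_connections = 10000

# airflow.cfg, [core] section
sql_alchemy_pool_size = 1000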

Answered Sep 30 '22 by Zanylytical Scientist