I have a Python 3 program that updates a large list of rows based on their ids (in a table in a Postgres 9.5 database).
I use multiprocessing to speed up the process. As Psycopg's connections can’t be shared across processes, I create a connection for each row, then close it.
Overall, multiprocessing is faster than single processing (5 times faster with 8 CPUs). However, creating a connection is slow: I'd like to create just a few connections and keep them open as long as required.
Since .map() chops ids_list into a number of chunks which it submits to the process pool, would it be possible to share a database connection for all ids in the same chunk/process?
Sample code:
from multiprocessing import Pool
import psycopg2
def create_db_connection():
conn = psycopg2.connect(database=database,
user=user,
password=password,
host=host)
return conn
def my_function(item_id):
conn = create_db_connection()
# Other CPU-intensive operations are done here
cur = conn.cursor()
cur.execute("""
UPDATE table
SET
my_column = 1
WHERE id = %s;
""",
(item_id, ))
cur.close()
conn.commit()
if __name__ == '__main__':
ids_list = [] # Long list of ids
pool = Pool() # os.cpu_count() processes
pool.map(my_function, ids_list)
Thanks for any help you can provide.
What is database connection pooling? Database connection pooling is a way to reduce the cost of opening and closing connections by maintaining a “pool” of open connections that can be passed from database operation to database operation as needed.
When not processing a transaction, the connection sits idle. Connection pooling enables the idle connection to be used by some other thread to do useful work. In practice, when a thread needs to do work against a MySQL or other database with JDBC, it requests a connection from the pool.
Default pool size is often 5 up to 10. First make sure you have a problem with the database. Try extremes like 2 or 30 under artificial load and see how it behaves.
You can use the initializer parameter of the Pool constructor. Setup the DB connection in the initializer function. Maybe pass the connection credentials as parameters.
Have a look at the docs: https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With