I am very puzzled as to the behavior of some multiprocessing code that is using psycopg2 to make queries in parallel to a postgres db.
Essentially, I am making the same query (with different params) to various partitions of a larger table. I am using multiprocessing.Pool to fork off a separate query.
My multiprocessing call looks like this:
pool = Pool(processes=num_procs)
results=pool.map(run_sql, params_list)
My run_sql code looks like this:
def run_sql(zip2):
conn = get_connection()
curs = conn.cursor()
print "conn: %s curs:%s pid=%s" % (id(conn), id(curs), os.getpid())
...
curs.execute(qry)
records = curs.fetchall()
def get_connection()
...
conn = psycopg2.connect(user=db_user, host=db_host,
dbname=db_name, password=db_pwd)
return conn
So my expectation is that each process would get a separate db connection via the call to get_connection()
and that print id(conn)
would display a distinct value. However, that doesn't seem to be the case and I am at a loss to explain it. Even print id(curs)
is the same. Only print os.getpid()
shows a difference. Does it somehow use the same connection for each forked process ?
conn: 4614554592 curs:4605160432 pid=46802
conn: 4614554592 curs:4605160432 pid=46808
conn: 4614554592 curs:4605160432 pid=46810
conn: 4614554592 curs:4605160432 pid=46784
conn: 4614554592 curs:4605160432 pid=46811
I think I've figured this out. The answer lies in the fact that multiprocessing in Python is shared-nothing so the entire memory space is copied, functions and all. Hence for each process, even though the pid is different, the memory spaces are copies of each other and the address of the connection within the memory space ends up being the same. The same reason is why declaring a global connection pool as I did initially was useless, each process ended up with its own connection pool with just 1 connection active at a time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With