
psycopg2 leaking memory after large query

I'm running a large query in a Python script against my Postgres database using psycopg2 (I upgraded to version 2.5). After the query is finished, I close the cursor and connection, and even run gc.collect(), but the process still consumes a ton of memory (7.3 GB to be exact). Am I missing a cleanup step?

import psycopg2
conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor()
cursor.execute("""large query""")
rows = cursor.fetchall()
del rows
cursor.close()
conn.close()
import gc
gc.collect()
asked Jun 19 '13 by Adam Berlinsky-Schine




2 Answers

I ran into a similar problem and, after a couple of hours of blood, sweat and tears, found that the answer simply requires the addition of one parameter.

Instead of

cursor = conn.cursor() 

write

cursor = conn.cursor(name="my_cursor_name") 

or simpler yet

cursor = conn.cursor("my_cursor_name") 

The details are found at http://initd.org/psycopg/docs/usage.html#server-side-cursors

I found the instructions a little confusing in that I thought I'd need to rewrite my SQL to include "DECLARE my_cursor_name ...." and then a "FETCH count 2000 FROM my_cursor_name", but it turns out psycopg does all of that for you under the hood if you simply override the "name=None" default parameter when creating a cursor.

The suggestion above of using fetchone or fetchmany doesn't resolve the problem since, if you leave the name parameter unset, psycopg will by default attempt to load the entire result set into RAM. The only other thing you may need to do (besides declaring a name parameter) is change the cursor.itersize attribute from the default 2000 to, say, 1000 if you still have too little memory.
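To make that concrete, here is a minimal sketch of the whole pattern; the connection string and query are the ones from the question, and process() is a hypothetical placeholder for your own row handling, not part of the original post.

import psycopg2

conn = psycopg2.connect("dbname='dbname' user='user' host='host'")
cursor = conn.cursor("my_cursor_name")  # naming the cursor makes it server-side
cursor.itersize = 1000                  # rows fetched per round trip to the server

cursor.execute("""large query""")
for row in cursor:       # psycopg issues the DECLARE/FETCH statements for you
    process(row)         # hypothetical placeholder for your own row handling

cursor.close()
conn.close()

Because rows are pulled in batches of itersize, peak memory stays roughly proportional to the batch size rather than to the full result set.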

answered Oct 05 '22 by joeblog


Joeblog has the correct answer. The way you deal with the fetching is important but far more obvious than the way you must define the cursor. Here is a simple example to illustrate this and give you something to copy-paste to start with.

import datetime as dt
import psycopg2
import sys
import time

conPG = psycopg2.connect("dbname='myDearDB'")
curPG = conPG.cursor('testCursor')
curPG.itersize = 100000  # Rows fetched at one time from the server

curPG.execute("SELECT * FROM myBigTable LIMIT 10000000")
# Warning: curPG.rowcount == -1 ALWAYS !!
cptLigne = 0
for rec in curPG:
    cptLigne += 1
    if cptLigne % 10000 == 0:
        print('.', end='')
        sys.stdout.flush()  # To see the progression
conPG.commit()  # Also closes the cursor
conPG.close()

As you will see, the dots come in rapid groups, then pause while a new buffer of rows (itersize) is fetched, so you don't need to use fetchmany for performance. When I run this with /usr/bin/time -v, I get the result in less than 3 minutes, using only 200 MB of RAM (instead of 60 GB with a client-side cursor) for 10 million rows. The server doesn't need more RAM either, as it uses a temporary table.
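If you prefer to check memory use from inside the script rather than wrapping it in /usr/bin/time -v, a sketch using the standard-library resource module is below; note this is an assumption on my part (not from the answer), and that on Linux ru_maxrss is reported in kilobytes.

import resource

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # peak RSS, in kB on Linux
print("peak memory: %.1f MB" % (peak_kb / 1024))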

answered Oct 05 '22 by Le Droid