
Psycopg2 uses up memory on large select query

I am using psycopg2 to query a PostgreSQL database and am trying to process all rows from a table with about 380M rows. The table has only three columns (id1, id2, count), all of type integer. However, when I run the straightforward select query below, the Python process consumes more and more memory until it gets killed by the OS.

Minimal working example (assuming that mydatabase exists and contains a table called mytable):

import psycopg2
conn = psycopg2.connect("dbname=mydatabase")
cur = conn.cursor()
cur.execute("SELECT * FROM mytable;")

At this point the program starts consuming memory.

I checked, and the PostgreSQL process is behaving well: it uses a fair bit of CPU, which is fine, and very little memory.

I was expecting psycopg2 to return an iterator without trying to buffer all of the results from the select. I could then use cur.fetchone() repeatedly to process all rows.

So, how do I select from a 380M row table without using up available memory?

Carl asked Feb 05 '15 11:02



2 Answers

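You can use a server-side cursor. By default psycopg2 creates a client-side cursor, which transfers the whole result set to the client when execute() runs. Giving the cursor a name makes it a server-side cursor, so rows stay on the server until they are fetched.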

cur = conn.cursor('cursor-name')  # a named cursor is a server-side cursor
cur.itersize = 10000  # number of rows fetched per round trip when iterating over cur
cur.execute("SELECT * FROM mytable;")
bav answered Oct 20 '22 07:10


Another way to use server side cursors:

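import psycopg2

# database_connection_string is a placeholder for your own connection string
with psycopg2.connect(database_connection_string) as conn:
    with conn.cursor(name='name_of_cursor') as cursor:
        cursor.itersize = 20000  # rows fetched per round trip

        query = "SELECT * FROM ..."
        cursor.execute(query)

        for row in cursor:
            pass  # process row here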

Psycopg2 will fetch itersize rows to the client at a time. Once the for loop exhausts that batch, it will fetch the next one.
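
The same batching can also be made explicit with fetchmany() instead of relying on itersize. The following is only a rough sketch, assuming conn is an already open psycopg2 connection; the helper name iter_batches and the default cursor name "batch_cursor" are hypothetical and not part of psycopg2.

```python
from typing import Any, Iterator, List, Sequence


def iter_batches(conn: Any, query: str, batch_size: int = 20000,
                 cursor_name: str = "batch_cursor") -> Iterator[List[Sequence[Any]]]:
    """Yield lists of at most batch_size rows from a server-side cursor.

    conn is assumed to be an open psycopg2 connection; cursor_name is an
    arbitrary, hypothetical label for the server-side cursor.
    """
    # Passing a name makes psycopg2 create a server-side (named) cursor.
    with conn.cursor(name=cursor_name) as cur:
        cur.execute(query)
        while True:
            # On a named cursor, each call pulls the next batch from the server.
            rows = cur.fetchmany(batch_size)
            if not rows:
                break  # an empty list means the result set is exhausted
            yield rows
```

With this helper, a loop such as `for batch in iter_batches(conn, "SELECT * FROM mytable;"):` would hold at most batch_size rows in the Python process at a time, which is the property the question is after. Handing out whole batches rather than single rows is a design choice that suits bulk processing; iterating over the cursor directly, as above, is simpler when rows are handled one by one.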

Demitri answered Oct 20 '22 08:10