Tuning PostgreSQL performance and memory use in a Python workflow

I use PostgreSQL 9.4 for a model database. My table looks somewhat like this:

CREATE TABLE table1 (
    sid INTEGER PRIMARY KEY NOT NULL DEFAULT nextval('table1_sid_seq'::regclass),
    col1 INT,
    col2 INT,
    col3 JSONB
);

My Python 2.7 workflow often looks like this:

curs.execute("SELECT sid, col1, col2 FROM table1")
data = curs.fetchall()
putback = []
for i in data: 
    result = do_something(i[1], i[2])
    putback.append((sid, result))
del data
curs.execute("UPDATE table1
              SET col3 = p.result
              FROM unnest(%s) p(sid INT, result JSONB)
              WHERE sid = p.sid", (putback,))

This typically works quite well and efficiently. However, for large queries PostgreSQL memory use sometimes goes through the roof (>50 GB) during the UPDATE command, and I believe the process is then killed by OS X, because I get: WARNING: terminating connection because of crash of another server process. My MacBook Pro has 16 GB of RAM, and the query in question has 11M rows with about 100 characters of data to write back per row.

My postgresql.conf:

default_statistics_target = 50
maintenance_work_mem = 512MB 
constraint_exclusion = on 
checkpoint_completion_target = 0.9
effective_cache_size = 4GB 
work_mem = 256MB 
wal_buffers = 16MB 
checkpoint_segments = 128 
shared_buffers = 1024MB 
max_connections = 80

So I wonder:

  1. Why does my query sometimes consume excessive amounts of RAM?
  2. How can I control memory use and still guarantee good performance?
  3. Is there a good guideline or tool for tuning PostgreSQL?

Update:
I am pretty sure that @wildplasser pinpointed my problem. In the comments he suggests dumping the data into the database first and unpacking it from there. Unfortunately I could not figure out how to implement his proposal. If anyone has an idea of how to do that, their answer will be gladly accepted.
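
The closest I can get is the untested sketch below, but I do not know whether it is what he had in mind or whether it actually solves the memory problem. The staging-table name, the COPY-via-copy_expert step, and the conn variable are my own assumptions, and I assume do_something() returns a JSON string:

import csv
from cStringIO import StringIO  # Python 2.7

# Untested sketch: bulk-load the computed results into a temporary staging
# table with COPY, then let the server do a single join-based UPDATE instead
# of unpacking one huge array parameter.
curs.execute("""CREATE TEMP TABLE table1_staging
                (sid INT PRIMARY KEY, result JSONB) ON COMMIT DROP""")

buf = StringIO()
writer = csv.writer(buf, lineterminator='\n')
for sid, result in putback:       # result assumed to already be a JSON string
    writer.writerow([sid, result])
buf.seek(0)

curs.copy_expert("COPY table1_staging (sid, result) FROM STDIN WITH CSV", buf)

curs.execute("""UPDATE table1
                SET col3 = s.result
                FROM table1_staging s
                WHERE table1.sid = s.sid""")
conn.commit()  # the TEMP table is dropped here (ON COMMIT DROP)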


1 Answer

My workaround is to slice putback with a simple function as proposed here:

def chunk(l, n):
    # split list l into pieces of at most n elements
    n = max(1, n)
    return [l[i:i + n] for i in range(0, len(l), n)]

and then

for piece in chunk(putback, 250000):
    curs.execute("""UPDATE table1
                    SET col3 = p.result
                    FROM unnest(%s) p(sid INT, result JSONB)
                    WHERE sid = p.sid""", (piece,))

This works, i.e. it keeps the memory footprint in check, but it is not very elegant and it is slower than dumping all the data at once, as I usually do.
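
If holding one long transaction open is also a concern, a small variation (my own addition, assuming conn is the psycopg2 connection behind curs) is to commit after every chunk, so each UPDATE runs in its own transaction; the trade-off is that a failure part-way through leaves some rows already updated:

for piece in chunk(putback, 250000):
    curs.execute("""UPDATE table1
                    SET col3 = p.result
                    FROM unnest(%s) p(sid INT, result JSONB)
                    WHERE sid = p.sid""", (piece,))
    conn.commit()  # one transaction per batch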
