
How can I speed up update/replace operations in PostgreSQL?

We have a rather specific application that uses PostgreSQL 8.3 as a storage backend (using Python and psycopg2). The operations we perform on the important tables are in the majority of cases inserts or updates (rarely deletes or selects).

For sanity reasons we have created our own Data Mapper-like layer that works reasonably well, but it has one big bottleneck: update performance. Of course, I'm not expecting the update/replace scenario to be as speedy as the 'insert to an empty table' one, but it would be nice to get a bit closer.

Note that this system is free from concurrent updates.

We always set all the fields of each row on an update, which is reflected in the terminology where I use the word 'replace' in my tests. So far I've tried two approaches to our update problem:

  1. Create a replace_item() procedure that takes an array of rows to update:

    CREATE OR REPLACE FUNCTION replace_item(data item[]) RETURNS VOID AS $$
    BEGIN
        FOR i IN COALESCE(array_lower(data,1),0) .. COALESCE(array_upper(data,1),-1) LOOP
           UPDATE item SET a0=data[i].a0,a1=data[i].a1,a2=data[i].a2 WHERE key=data[i].key;
        END LOOP;
    END;
    $$ LANGUAGE plpgsql;
    
  2. Create an insert_or_replace rule so that everything but the occasional delete becomes a multi-row insert:

    CREATE RULE "insert_or_replace" AS
        ON INSERT TO "item"
        WHERE EXISTS(SELECT 1 FROM item WHERE key=NEW.key)
        DO INSTEAD
            (UPDATE item SET a0=NEW.a0,a1=NEW.a1,a2=NEW.a2 WHERE key=NEW.key);
    

Both of these speed up the updates a fair bit, although the latter slows down inserts a bit:

Multi-row insert           : 50000 items inserted in  1.32 seconds averaging 37807.84 items/s
executemany() update       : 50000 items updated  in 26.67 seconds averaging  1874.57 items/s
update_andres              : 50000 items updated  in  3.84 seconds averaging 13028.51 items/s
update_merlin83 (i/d/i)    : 50000 items updated  in  1.29 seconds averaging 38780.46 items/s
update_merlin83 (i/u)      : 50000 items updated  in  1.24 seconds averaging 40313.28 items/s
replace_item() procedure   : 50000 items replaced in  3.10 seconds averaging 16151.42 items/s
Multi-row insert_or_replace: 50000 items inserted in  2.73 seconds averaging 18296.30 items/s
Multi-row insert_or_replace: 50000 items replaced in  2.02 seconds averaging 24729.94 items/s

Random notes about the test run:

  • All tests are run on the same machine the database resides on, connecting to localhost.
  • Inserts and updates are applied to the database in batches of 500 items, each sent in its own transaction (UPDATED); see the sketch after this list for how a batch might be sent.
  • All update/replace tests used the same values as were already in the database.
  • All data was escaped using the psycopg2 adapt() function.
  • All tables are truncated and vacuumed before use (ADDED, in previous runs only truncation happened).
  • The table looks like this:

    CREATE TABLE item (
        key MACADDR PRIMARY KEY,
        a0 VARCHAR,
        a1 VARCHAR,
        a2 VARCHAR
    )
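
For illustration, a single batch might be sent roughly like this (a hedged sketch, not the actual perftest.py code; the helper name and the insert-only case are my own):

    import psycopg2
    import psycopg2.extensions as ext

    def send_insert_batches(conn, items, batch_size=500):
        """Send multi-row INSERTs in batches, one transaction per batch."""
        cur = conn.cursor()
        for start in range(0, len(items), batch_size):
            batch = items[start:start + batch_size]
            # adapt() quotes each value; fine for the plain ASCII test data,
            # cursor.mogrify() would be the more general alternative.
            rows = ",".join(
                "(%s,%s,%s,%s)" % tuple(
                    ext.adapt(v).getquoted().decode() for v in row)
                for row in batch)
            cur.execute("INSERT INTO item (key, a0, a1, a2) VALUES " + rows)
            conn.commit()  # one transaction per batch, as noted above

    # conn = psycopg2.connect("dbname=perftest")  # adjust to your setup
    # send_insert_batches(conn, [('00:00:00:00:00:01', 'v0', 'v1', 'v2')])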
    

So, the real question is: How can I speed up update/replace operations a bit more? (I think these findings might be 'good enough', but I don't want to give up without tapping the SO crowd :)

Also, anyone's hints towards a more elegant replace_item(), or evidence that my tests are completely broken, would be most welcome.

The test script is available here if you'd like to try to reproduce the results. Remember to check it first, though... it WorksForMe, but...

You will need to edit the db.connect() line to suit your setup.

EDIT

Thanks to andres in #postgresql @ freenode, I have another test with a single-query update, much like a multi-row insert (listed as update_andres above).

UPDATE item
SET a0=i.a0, a1=i.a1, a2=i.a2 
FROM (VALUES ('00:00:00:00:00:01', 'v0', 'v1', 'v2'), 
             ('00:00:00:00:00:02', 'v3', 'v4', 'v5'),
             ...
      ) AS i(key, a0, a1, a2)
WHERE item.key=i.key::macaddr
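
If it helps with reproduction, such a statement might be assembled from Python roughly as follows (my own sketch; it uses cursor.mogrify() for the per-row escaping and is not necessarily what perftest.py does):

    import psycopg2

    def update_in_one_query(conn, rows):
        """rows: a list of (key, a0, a1, a2) tuples to apply as one UPDATE."""
        cur = conn.cursor()
        # mogrify() renders each tuple as a properly quoted row literal
        values = ",".join(
            cur.mogrify("(%s,%s,%s,%s)", r).decode("utf-8") for r in rows)
        cur.execute(
            "UPDATE item SET a0=i.a0, a1=i.a1, a2=i.a2 "
            "FROM (VALUES " + values + ") AS i(key, a0, a1, a2) "
            "WHERE item.key = i.key::macaddr")
        conn.commit()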

EDIT

Thanks to merlin83 in #postgresql @ freenode and jug/jwp below, I have another test with an insert-to-temp/delete/insert approach (listed as "update_merlin83 (i/d/i)" above).

INSERT INTO temp_item (key, a0, a1, a2)
    VALUES ('00:00:00:00:00:01', 'v0', 'v1', 'v2'),
           ('00:00:00:00:00:02', 'v3', 'v4', 'v5'),
           ...;

DELETE FROM item
USING temp_item
WHERE item.key=temp_item.key;

INSERT INTO item (key, a0, a1, a2)
    SELECT key, a0, a1, a2
    FROM temp_item;

My gut feeling is that these tests are not very representative of the performance in the real-world scenario, but I think the differences are great enough to give an indication of the most promising approaches for further investigation. The perftest.py script contains all the updates as well, for those of you who want to check it out. It's fairly ugly though, so don't forget your goggles :)

EDIT

andres in #postgresql @ freenode pointed out that I should test with an insert-to-temp/update variant (listed as "update_merlin83 (i/u)" above).

INSERT INTO temp_item (key, a0, a1, a2)
    VALUES ('00:00:00:00:00:01', 'v0', 'v1', 'v2'),
           ('00:00:00:00:00:02', 'v3', 'v4', 'v5'),
           ...;

UPDATE item
SET a0=temp_item.a0, a1=temp_item.a1, a2=temp_item.a2
FROM temp_item
WHERE item.key=temp_item.key

EDIT

Probably final edit: I changed my script to match our load scenario better, and it seems the numbers hold even when scaling things up a bit and adding some randomness. If anyone gets very different numbers from some other scenario, I'd be interested in knowing about it.

asked Jun 07 '09 by Henrik Gustafsson

1 Answer

The usual way I do these things in pg is: load the raw data matching the target table into a temp table (no constraints) using COPY, merge (the fun part), profit.
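
For what it's worth, a minimal psycopg2 sketch of that copy-into-temp-then-merge idea might look like the following; the temp table layout and the two merge statements are my own guess at the pattern, not the mbk implementation mentioned below, and the data is assumed to contain no tabs or newlines:

    import io
    import psycopg2

    def merge_items(conn, items):
        """items: iterable of (key, a0, a1, a2) tuples to upsert into item."""
        cur = conn.cursor()
        cur.execute("""
            CREATE TEMP TABLE temp_item
                (key MACADDR, a0 VARCHAR, a1 VARCHAR, a2 VARCHAR)
            ON COMMIT DROP""")
        # Bulk-load the raw rows with COPY, which is much faster than INSERTs.
        buf = io.StringIO("".join("\t".join(row) + "\n" for row in items))
        cur.copy_from(buf, "temp_item", columns=("key", "a0", "a1", "a2"))
        # The merge: update rows that already exist, then insert the new ones.
        cur.execute("""
            UPDATE item SET a0=t.a0, a1=t.a1, a2=t.a2
            FROM temp_item t WHERE item.key = t.key""")
        cur.execute("""
            INSERT INTO item (key, a0, a1, a2)
            SELECT t.key, t.a0, t.a1, t.a2
            FROM temp_item t LEFT JOIN item ON item.key = t.key
            WHERE item.key IS NULL""")
        conn.commit()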

I wrote a merge_by_key function specifically for these situations:

http://mbk.projects.postgresql.org/

The docs aren't terribly friendly, but I'd suggest giving it a good look.

answered Sep 28 '22 by jwp