I want to upload a huge number of entries (~600k) into a simple table in a PostgreSQL DB, with one foreign key, a timestamp and 3 floats per entry. However, it takes 60 ms per entry to execute the core bulk insert described here, so the whole run would take about 10 h. I have found out that this is a known performance issue of the executemany() method, which has been solved by the execute_values() method introduced in psycopg2 2.7.
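For reference, this is roughly how execute_values() is used directly with psycopg2; the connection string and column names below are placeholders for my actual schema, and values is the list of dicts mentioned below:

import psycopg2
from psycopg2.extras import execute_values

# Placeholder DSN and schema; `values` is the list of ~600k dicts.
conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")
with conn.cursor() as cur:
    # All rows are folded into a few multi-row INSERTs instead of
    # one round trip per row.
    execute_values(
        cur,
        "INSERT INTO simple_table (fk_id, ts, a, b, c) VALUES %s",
        [(v["fk_id"], v["ts"], v["a"], v["b"], v["c"]) for v in values],
        page_size=1000,
    )
conn.commit()
conn.close()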
The code I run is the following:
# build a huge list of dicts, one dict for each entry
engine.execute(SimpleTable.__table__.insert(), values)  # around 600k dicts in a list
I see that this is a common problem, but I have not managed to find a solution in SQLAlchemy itself. Is there any way to tell SQLAlchemy to call execute_values() in such cases? Is there any other way to implement huge inserts without constructing the SQL statements myself?
Thanks for the help!
psycopg2 is over 2x faster than SQLAlchemy on a small table. This behavior is expected, as psycopg2 is a database driver for PostgreSQL, while SQLAlchemy is a general ORM library.
The SQLAlchemy engine is a global object which can be created and configured once and then reused for different operations. The first step in establishing a connection with the PostgreSQL database is creating an engine object using the create_engine() function of SQLAlchemy.
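A minimal sketch of that one-time setup; the connection URL is a placeholder:

from sqlalchemy import create_engine

# Placeholder connection URL; build the engine once and reuse it everywhere.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")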
Meanwhile this became possible (from SQLAlchemy 1.2.0) with the use_batch_mode flag on the create_engine() function. See the docs. It uses the execute_batch() function from psycopg2.extras.
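A minimal sketch of how that would look for the insert from the question; the connection URL is a placeholder:

from sqlalchemy import create_engine

# Placeholder connection URL; use_batch_mode is available from SQLAlchemy 1.2.0.
engine = create_engine(
    "postgresql+psycopg2://user:password@localhost:5432/mydb",
    use_batch_mode=True,  # executemany() calls go through psycopg2's execute_batch()
)

# The insert from the question then runs in batches instead of
# one round trip per row.
engine.execute(SimpleTable.__table__.insert(), values)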
Not the answer you are looking for in the sense that this does not address attempting to instruct SQLAlchemy to use the psycopg2 extras, and requires a degree of manual SQL, but: you can access the underlying psycopg2 connections from an engine with raw_connection(), which allows using COPY FROM:
import io
import csv

from psycopg2 import sql


def bulk_copy(engine, table, values):
    csv_file = io.StringIO()
    headers = list(values[0].keys())
    writer = csv.DictWriter(csv_file, headers)
    writer.writerows(values)

    csv_file.seek(0)

    # NOTE: `format()` here is *not* `str.format()`, but
    # `SQL.format()`. Never use plain string formatting.
    copy_stmt = sql.SQL("COPY {} (" +
                        ",".join(["{}"] * len(headers)) +
                        ") FROM STDIN CSV").\
        format(sql.Identifier(str(table.name)),
               *(sql.Identifier(col) for col in headers))

    # Fetch a raw psycopg connection from the SQLAlchemy engine
    conn = engine.raw_connection()
    try:
        with conn.cursor() as cur:
            cur.copy_expert(copy_stmt, csv_file)

        conn.commit()
    except:
        conn.rollback()
        raise
    finally:
        conn.close()
and then
bulk_copy(engine, SimpleTable.__table__, values)
This should be plenty fast compared to executing INSERT statements; moving 600,000 records on this machine took around 8 seconds, ~13 µs per record. Alternatively, you could also use the raw connection and cursor together with the extras package.
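For completeness, a rough sketch of that extras-based route, again via raw_connection(); the column names are placeholders and values is the same list of dicts as in the question:

from psycopg2.extras import execute_values

# Placeholder column names; `values` is the list of dicts from the question.
conn = engine.raw_connection()
try:
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO simple_table (fk_id, ts, a, b, c) VALUES %s",
            [(v["fk_id"], v["ts"], v["a"], v["b"], v["c"]) for v in values],
            page_size=1000,
        )
    conn.commit()
finally:
    conn.close()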