Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently insert massive amount of rows in Psycopg2

I need to efficiently insert about 500k (give or take 100k) rows of data into my PostgreSQL database. After a generous amount of google-ing, I've gotten to this solution, averaging about 150 (wall-clock) seconds.

def db_insert_spectrum(curs, visual_data, recording_id):
    sql = """
        INSERT INTO spectrums (row, col, value, recording_id)
        VALUES %s
    """

    # Mass-insertion technique
    # visual_data is a 2D array (a nx63 matrix)
    values_list = []
    for rowIndex, rowData in enumerate(visual_data):
        for colIndex, colData in enumerate(rowData): # colData is the value
            value = [(rowIndex, colIndex, colData, recording_id)]
            values_list.append(value)

    psycopg2.extras.execute_batch(curs, sql, values_list, page_size=1000)

Is there a faster way?

like image 475
forteddyt Avatar asked Nov 05 '17 00:11

forteddyt


People also ask

How do I add multiple rows in PostgreSQL?

PostgreSQL INSERT Multiple Rows First, specify the name of the table that you want to insert data after the INSERT INTO keywords. Second, list the required columns or all columns of the table in parentheses that follow the table name. Third, supply a comma-separated list of rows after the VALUES keyword.

Is psycopg2 fast?

Due to its C implementation, Psycopg2 is very fast and efficient. You can use Psycopg2 to fetch one or more rows from the database based on a SQL query. If you want to insert some data into the database, that's also possible with this library — with multiple options for single or batch inserting.

Is psycopg2 asynchronous?

Psycopg allows asynchronous interaction with other database sessions using the facilities offered by PostgreSQL commands LISTEN and NOTIFY .

What is cur Mogrify?

mogrify() method returns a query string once the parameters have been bound. If you used the execute() method or anything similar, the string returned is the same as what would be sent to the database. The resulting string is always a bytes string, which is quicker than using the executemany() function. Syntax: cur.


1 Answers

Based on the answers given here, COPY is the fastest method. COPY reads from a file or file-like object.

Since memory I/O is many orders of magnitude faster than disk I/O, it is faster to write the data to a StringIO file-like object than to write to an actual file. The psycopg docs show an example of calling copy_from with a StringIO as input.

Therefore, you could use something like:

try:
    # Python2
    from cStringIO import StringIO 
except ImportError:
    # Python3
    from io import StringIO

def db_insert_spectrum(curs, visual_data, recording_id):
    f = StringIO()
    # visual_data is a 2D array (a nx63 matrix)
    values_list = []
    for rowIndex, rowData in enumerate(visual_data):
        items = []
        for colIndex, colData in enumerate(rowData): 
            value = (rowIndex, colIndex, colData, recording_id)
            items.append('\t'.join(map(str, value))+'\n')
        f.writelines(items)
    f.seek(0)
    cur.copy_from(f, 'spectrums', columns=('row', 'col', 'value', 'recording_id'))
like image 86
unutbu Avatar answered Oct 22 '22 14:10

unutbu