
Use binary COPY table FROM with psycopg2

I have tens of millions of rows to transfer from multidimensional array files into a PostgreSQL database. My tools are Python and psycopg2. The most efficient way to bulk insert data is with copy_from. However, my data are mostly 32-bit floating point numbers (real or float4), so I'd rather not convert from real → text → real. Here is an example database DDL:

CREATE TABLE num_data
(
  id serial PRIMARY KEY NOT NULL,
  node integer NOT NULL,
  ts smallint NOT NULL,
  val1 real,
  val2 double precision
);

Here is where I'm at with Python using strings (text):

# Just one row of data
num_row = [23253, 342, -15.336734, 2494627.949375]

import psycopg2
# Python3:
from io import StringIO
# Python2, use: from cStringIO import StringIO

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# Convert floating point numbers to text, write to COPY input
cpy = StringIO()
cpy.write('\t'.join([repr(x) for x in num_row]) + '\n')

# Insert data; database converts text back to floating point numbers
cpy.seek(0)
curs.copy_from(cpy, 'num_data', columns=('node', 'ts', 'val1', 'val2'))
conn.commit()

Is there an equivalent that works in binary mode, i.e., one that keeps the floating point numbers binary? Not only would this preserve the floating point precision, but it could be faster.

(Note: to see the same precision as the example, use SET extra_float_digits='2')
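For example, a quick check of what was stored (a hypothetical follow-up that reuses the curs cursor from the snippet above):

# Hypothetical check of the stored values, reusing the cursor from above
curs.execute("SET extra_float_digits = 2")  # maximum is 3; PostgreSQL 12+ round-trips by default
curs.execute("SELECT val1, val2 FROM num_data")
print(curs.fetchall())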

Asked Nov 15 '11 by Mike T


1 Answer

Here is the binary equivalent of COPY FROM for Python 3:

from io import BytesIO
from struct import pack
import psycopg2

# Two rows of data; "id" is not in the upstream data source
# Columns: node, ts, val1, val2
data = [(23253, 342, -15.336734, 2494627.949375),
        (23256, 348, 43.23524, 2494827.949375)]

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# Determine starting value for sequence
curs.execute("SELECT nextval('num_data_id_seq')")
id_seq = curs.fetchone()[0]

# Make a binary file object for COPY FROM
cpy = BytesIO()
# 11-byte signature, no flags, no header extension
cpy.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))

# Columns: id, node, ts, val1, val2
# Zip: (column position, format, size)
row_format = list(zip(range(-1, 4),
                      ('i', 'i', 'h', 'f', 'd'),
                      ( 4,   4,   2,   4,   8 )))
for row in data:
    # Number of columns/fields (always 5)
    cpy.write(pack('!h', 5))
    for col, fmt, size in row_format:
        value = (id_seq if col == -1 else row[col])
        cpy.write(pack('!i' + fmt, size, value))
    id_seq += 1  # manually increment sequence outside of database

# File trailer
cpy.write(pack('!h', -1))

# Copy data to database
cpy.seek(0)
curs.copy_expert("COPY num_data FROM STDIN WITH BINARY", cpy)

# Update sequence on database
curs.execute("SELECT setval('num_data_id_seq', %s, false)", (id_seq,))
conn.commit()
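One thing the loop above glosses over: it assumes none of the fields are NULL. In the PGCOPY binary format a NULL field is written as a field length of -1 with no data bytes following it, so a NULL-aware field packer (a hypothetical helper, not part of the code above) could look like this:

from struct import pack

def pack_field(fmt, size, value):
    # One field in PGCOPY binary format: 4-byte big-endian length, then the
    # value's bytes; SQL NULL is just a length of -1 with no data bytes.
    if value is None:
        return pack('!i', -1)
    return pack('!i' + fmt, size, value)

pack_field('f', 4, None)        # b'\xff\xff\xff\xff'
pack_field('f', 4, -15.336734)  # 4-byte length followed by the float4 bytes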

Update

I rewrote the above approach for writing the files used by COPY. My data in Python are in NumPy arrays, so it makes sense to use them. Here is some example data with 1M rows and 7 columns:

import psycopg2
import numpy as np
from struct import pack
from io import BytesIO
from datetime import datetime

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# NumPy record array
shape = (7, 2000, 500)
print('Generating data with %i rows, %i columns' % (shape[1]*shape[2], shape[0]))

dtype = ([('id', 'i4'), ('node', 'i4'), ('ts', 'i2')] +
         [('s' + str(x), 'f4') for x in range(shape[0])])
data = np.empty(shape[1]*shape[2], dtype)
data['id'] = np.arange(shape[1]*shape[2]) + 1
data['node'] = np.tile(np.arange(shape[1]) + 1, shape[2])
data['ts'] = np.repeat(np.arange(shape[2]) + 1, shape[1])
data['s0'] = np.random.rand(shape[1]*shape[2]) * 100
prv = 's0'
for nxt in data.dtype.names[4:]:
    data[nxt] = data[prv] + np.random.rand(shape[1]*shape[2]) * 10
    prv = nxt

On my database, I have two tables that look like:

CREATE TABLE num_data_binary
(
  id integer PRIMARY KEY,
  node integer NOT NULL,
  ts smallint NOT NULL,
  s0 real,
  s1 real,
  s2 real,
  s3 real,
  s4 real,
  s5 real,
  s6 real
) WITH (OIDS=FALSE);

and another similar table named num_data_text.

Here are some simple helper functions to prepare the data for COPY (both text and binary formats) by using the information in the NumPy record array:

def prepare_text(dat):
    cpy = BytesIO()
    for row in dat:
        # BytesIO needs bytes in Python 3, so encode the tab-separated line
        cpy.write(('\t'.join([repr(x) for x in row]) + '\n').encode())
    return cpy

def prepare_binary(dat):
    pgcopy_dtype = [('num_fields', '>i2')]
    for field, dtype in dat.dtype.descr:
        pgcopy_dtype += [(field + '_length', '>i4'),
                         (field, dtype.replace('<', '>'))]
    pgcopy = np.empty(dat.shape, pgcopy_dtype)
    pgcopy['num_fields'] = len(dat.dtype)
    for i in range(len(dat.dtype)):
        field = dat.dtype.names[i]
        # byte length of the field (alignment == itemsize for these types)
        pgcopy[field + '_length'] = dat.dtype[i].alignment
        pgcopy[field] = dat[field]
    cpy = BytesIO()
    # 11-byte signature, no flags, no header extension
    cpy.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))
    cpy.write(pgcopy.tobytes())  # all rows (tostring() in older NumPy)
    cpy.write(pack('!h', -1))  # file trailer
    return cpy
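As a quick sanity check (a sketch assuming the data record array and prepare_binary above), the size of the binary copy object is fully determined by the layout: a 19-byte header, then for each row a 2-byte field count plus a 4-byte length and the field's bytes for each of the 10 columns, and finally a 2-byte trailer. That works out to the 80000021 bytes reported in the benchmark output further down:

# Sketch: verify the size of the binary COPY object built by prepare_binary
pgcopy = prepare_binary(data)
row_bytes = 2 + sum(4 + data.dtype[i].itemsize for i in range(len(data.dtype)))  # 80
expected = 19 + row_bytes * data.shape[0] + 2   # header + 1M rows + trailer
assert pgcopy.tell() == expected                # 80000021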

This is how I use the helper functions to benchmark the two COPY format methods:

def time_pgcopy(dat, table, binary):
    print('Processing copy object for ' + table)
    tstart = datetime.now()
    if binary:
        cpy = prepare_binary(dat)
    else:  # text
        cpy = prepare_text(dat)
    tendw = datetime.now()
    print('Copy object prepared in ' + str(tendw - tstart) + '; ' +
          str(cpy.tell()) + ' bytes; transferring to database')
    cpy.seek(0)
    if binary:
        curs.copy_expert('COPY ' + table + ' FROM STDIN WITH BINARY', cpy)
    else:  # text
        curs.copy_from(cpy, table)
    conn.commit()
    tend = datetime.now()
    print('Database copy time: ' + str(tend - tendw))
    print('        Total time: ' + str(tend - tstart))
    return

time_pgcopy(data, 'num_data_text', binary=False)
time_pgcopy(data, 'num_data_binary', binary=True)

Here is the output from the last two time_pgcopy commands:

Processing copy object for num_data_text
Copy object prepared in 0:01:15.288695; 84355016 bytes; transferring to database
Database copy time: 0:00:37.929166
        Total time: 0:01:53.217861
Processing copy object for num_data_binary
Copy object prepared in 0:00:01.296143; 80000021 bytes; transferring to database
Database copy time: 0:00:23.325952
        Total time: 0:00:24.622095

So both the NumPy → file and file → database steps are much faster with the binary approach. The obvious difference is how Python prepares the COPY file, which is really slow for text. Generally speaking, the binary format loads into the database in about two-thirds of the time needed by the text format for this schema.

Lastly, I compared the values in both tables within the database to see if the numbers were different. About 1.46% of the rows have different values for column s0, and this fraction increases to 6.17% for s6 (probably related to the random method that I used). The non-zero absolute differences between all 70M 32-bit float values range between 9.3132257e-010 and 7.6293945e-006. These small differences between the text and binary loading methods are due to the loss of precision from the float → text → float conversions required for the text format method.
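A minimal sketch of where that precision goes (hypothetical values, not taken from the tables above): the binary path writes the float4 bytes directly, while the text path goes through a decimal string, and if that string carries too few significant digits the server's float4 parse can land on a neighbouring 32-bit value:

from struct import pack, unpack

x = unpack('!f', pack('!f', 2494627.949375))[0]  # a value already rounded to float4

# Binary COPY ships these exact 4 bytes, so the stored float4 equals x bit for bit
binary_roundtrip = unpack('!f', pack('!f', x))[0]

# Text COPY ships a decimal string; with only 6 significant digits (illustrative,
# not necessarily what repr() produced) the parsed float4 can differ from x
text_roundtrip = unpack('!f', pack('!f', float('%.6g' % x)))[0]

print(binary_roundtrip == x)  # True
print(text_roundtrip == x)    # False here: 6 digits are not enough for this value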

Answered by Mike T