Use binary COPY table FROM with psycopg2

Tags: python, postgresql, psycopg2, binary-data, bulkinsert

I have tens of millions of rows to transfer from multidimensional array files into a PostgreSQL database. My tools are Python and psycopg2. The most efficient way to bulk insert data is using copy_from. However, my data are mostly 32-bit floating point numbers (real or float4), so I'd rather not convert from real → text → real. Here is an example database DDL:

CREATE TABLE num_data
(
  id serial PRIMARY KEY NOT NULL,
  node integer NOT NULL,
  ts smallint NOT NULL,
  val1 real,
  val2 double precision
);

Here is where I'm at with Python using strings (text):

# Just one row of data
num_row = [23253, 342, -15.336734, 2494627.949375]

import psycopg2
# Python3:
from io import StringIO
# Python2, use: from cStringIO import StringIO

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# Convert floating point numbers to text, write to COPY input
cpy = StringIO()
cpy.write('\t'.join([repr(x) for x in num_row]) + '\n')

# Insert data; database converts text back to floating point numbers
cpy.seek(0)
curs.copy_from(cpy, 'num_data', columns=('node', 'ts', 'val1', 'val2'))
conn.commit()

Is there an equivalent that could work using a binary mode? I.e., keep the floating point numbers in binary? Not only would this preserve the floating point precision, but it could be faster.

(Note: to see the same precision as the example, use SET extra_float_digits='2')

asked Nov 15 '11 22:11

Mike T

1 Answer

Here is the binary equivalent of COPY FROM for Python 3:

from io import BytesIO
from struct import pack
import psycopg2

# Two rows of data; "id" is not in the upstream data source
# Columns: node, ts, val1, val2
data = [(23253, 342, -15.336734, 2494627.949375),
        (23256, 348, 43.23524, 2494827.949375)]

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# Determine starting value for sequence
curs.execute("SELECT nextval('num_data_id_seq')")
id_seq = curs.fetchone()[0]

# Make a binary file object for COPY FROM
cpy = BytesIO()
# 11-byte signature, no flags, no header extension
cpy.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))

# Columns: id, node, ts, val1, val2
# Zip: (column position, format, size)
row_format = list(zip(range(-1, 4),
                      ('i', 'i', 'h', 'f', 'd'),
                      ( 4,   4,   2,   4,   8 )))
for row in data:
    # Number of columns/fields (always 5)
    cpy.write(pack('!h', 5))
    for col, fmt, size in row_format:
        value = (id_seq if col == -1 else row[col])
        cpy.write(pack('!i' + fmt, size, value))
    id_seq += 1  # manually increment sequence outside of database

# File trailer
cpy.write(pack('!h', -1))

# Copy data to database
cpy.seek(0)
curs.copy_expert("COPY num_data FROM STDIN WITH BINARY", cpy)

# Update sequence on database
curs.execute("SELECT setval('num_data_id_seq', %s, false)", (id_seq,))
conn.commit()
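The byte layout written above can be checked without a database. The following is a minimal sketch, not part of the original answer: `encode_field` and `build_pgcopy` are hypothetical helper names, and the NULL handling (a field length of -1 with no value bytes) follows the PostgreSQL binary COPY format rather than anything the code above exercises.

```python
from io import BytesIO
from struct import pack, calcsize

# 11-byte signature, 32-bit flags field, 32-bit header extension length
PGCOPY_HEADER = pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0)
PGCOPY_TRAILER = pack('!h', -1)


def encode_field(fmt, value):
    """Encode one field as a 4-byte length followed by the value bytes.

    A NULL is written as length -1 with no value bytes.
    """
    if value is None:
        return pack('!i', -1)
    return pack('!i' + fmt, calcsize('!' + fmt), value)


def build_pgcopy(rows, formats):
    """Return the bytes of a binary COPY stream for fixed-width numeric rows.

    formats holds one struct code per column, e.g. 'i' (integer),
    'h' (smallint), 'f' (real), 'd' (double precision).
    """
    buf = BytesIO()
    buf.write(PGCOPY_HEADER)
    for row in rows:
        buf.write(pack('!h', len(formats)))  # field count for this tuple
        for fmt, value in zip(formats, row):
            buf.write(encode_field(fmt, value))
    buf.write(PGCOPY_TRAILER)
    return buf.getvalue()


# Same column layout as num_data: id, node, ts, val1, val2
formats = ('i', 'i', 'h', 'f', 'd')
rows = [(1, 23253, 342, -15.336734, 2494627.949375),
        (2, 23256, 348, None, 2494827.949375)]  # val1 is NULL in row 2
payload = build_pgcopy(rows, formats)
print(len(payload))
```

For these two rows the payload is 105 bytes: 19 for the header, 44 for the first row, 40 for the second (the NULL field has no value bytes), and 2 for the trailer. Wrapping it as `BytesIO(payload)` would give a file object that could be passed to `copy_expert` in the same way as `cpy` above.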

Update

I rewrote the above approach to writing the files for COPY. My data in Python is in NumPy arrays, so it makes sense to use these. Here is some example data with 1M rows, 7 columns:

import psycopg2
import numpy as np
from struct import pack
from io import BytesIO
from datetime import datetime

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# NumPy record array
shape = (7, 2000, 500)
print('Generating data with %i rows, %i columns' % (shape[1]*shape[2], shape[0]))

dtype = ([('id', 'i4'), ('node', 'i4'), ('ts', 'i2')] +
         [('s' + str(x), 'f4') for x in range(shape[0])])
data = np.empty(shape[1]*shape[2], dtype)
data['id'] = np.arange(shape[1]*shape[2]) + 1
data['node'] = np.tile(np.arange(shape[1]) + 1, shape[2])
data['ts'] = np.repeat(np.arange(shape[2]) + 1, shape[1])
data['s0'] = np.random.rand(shape[1]*shape[2]) * 100
prv = 's0'
for nxt in data.dtype.names[4:]:
    data[nxt] = data[prv] + np.random.rand(shape[1]*shape[2]) * 10
    prv = nxt

On my database, I have two tables that look like:

CREATE TABLE num_data_binary
(
  id integer PRIMARY KEY,
  node integer NOT NULL,
  ts smallint NOT NULL,
  s0 real,
  s1 real,
  s2 real,
  s3 real,
  s4 real,
  s5 real,
  s6 real
) WITH (OIDS=FALSE);

and another similar table named num_data_text.

Here are some simple helper functions to prepare the data for COPY (both text and binary formats) by using the information in the NumPy record array:

from io import StringIO  # text COPY needs a str buffer on Python 3

def prepare_text(dat):
    cpy = StringIO()
    for row in dat:
        # str() rather than repr(): NumPy 2 repr() gives e.g. 'np.float32(1.5)'
        cpy.write('\t'.join([str(x) for x in row]) + '\n')
    return cpy

def prepare_binary(dat):
    pgcopy_dtype = [('num_fields', '>i2')]
    for field, dtype in dat.dtype.descr:
        pgcopy_dtype += [(field + '_length', '>i4'),
                         (field, dtype.replace('<', '>'))]
    pgcopy = np.empty(dat.shape, pgcopy_dtype)
    pgcopy['num_fields'] = len(dat.dtype)
    for i in range(len(dat.dtype)):
        field = dat.dtype.names[i]
        # Field length in bytes (itemsize, not alignment)
        pgcopy[field + '_length'] = dat.dtype[i].itemsize
        pgcopy[field] = dat[field]
    cpy = BytesIO()
    cpy.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))
    cpy.write(pgcopy.tobytes())  # all rows; tobytes() replaces the deprecated tostring()
    cpy.write(pack('!h', -1))  # file trailer
    return cpy
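To see why the interleaved record dtype in prepare_binary yields valid COPY records, it can help to compare a tiny array against the same rows packed field by field with `struct`. This is a sketch under assumptions: the two-row array and the name `pack_row` are made up for illustration, and it relies on NumPy laying out a list-of-tuples dtype without padding (the default, unaligned layout).

```python
from struct import pack
import numpy as np

# Tiny stand-in for the record array (hypothetical values)
dat = np.array([(1, 10, 3, 1.5), (2, 11, 4, -2.25)],
               dtype=[('id', 'i4'), ('node', 'i4'), ('ts', 'i2'), ('s0', 'f4')])

# Interleave a big-endian length field before each big-endian value field
pgcopy_dtype = [('num_fields', '>i2')]
for name in dat.dtype.names:
    pgcopy_dtype += [(name + '_length', '>i4'),
                     (name, dat.dtype[name].newbyteorder('>'))]
pgcopy = np.empty(dat.shape, pgcopy_dtype)
pgcopy['num_fields'] = len(dat.dtype.names)
for name in dat.dtype.names:
    pgcopy[name + '_length'] = dat.dtype[name].itemsize
    pgcopy[name] = dat[name]


def pack_row(row):
    """Pack one row with struct: field count, then (length, value) pairs."""
    return pack('!h', 4) + b''.join(
        pack('!i' + fmt, size, value)
        for fmt, size, value in zip('iihf', (4, 4, 2, 4), row))


numpy_bytes = pgcopy.tobytes()
struct_bytes = b''.join(pack_row(row) for row in dat.tolist())
print(numpy_bytes == struct_bytes)
```

Both routes should produce identical bytes, 32 per row here. The NumPy route gets there with a handful of whole-column assignments instead of a Python-level loop over every value, which is consistent with the short preparation time reported for the binary path.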

This is how I'm using the helper functions to benchmark the two COPY format methods:

def time_pgcopy(dat, table, binary):
    print('Processing copy object for ' + table)
    tstart = datetime.now()
    if binary:
        cpy = prepare_binary(dat)
    else:  # text
        cpy = prepare_text(dat)
    tendw = datetime.now()
    print('Copy object prepared in ' + str(tendw - tstart) + '; ' +
          str(cpy.tell()) + ' bytes; transfering to database')
    cpy.seek(0)
    if binary:
        curs.copy_expert('COPY ' + table + ' FROM STDIN WITH BINARY', cpy)
    else:  # text
        curs.copy_from(cpy, table)
    conn.commit()
    tend = datetime.now()
    print('Database copy time: ' + str(tend - tendw))
    print('        Total time: ' + str(tend - tstart))
    return

time_pgcopy(data, 'num_data_text', binary=False)
time_pgcopy(data, 'num_data_binary', binary=True)

Here is the output from the last two time_pgcopy commands:

Processing copy object for num_data_text
Copy object prepared in 0:01:15.288695; 84355016 bytes; transfering to database
Database copy time: 0:00:37.929166
        Total time: 0:01:53.217861
Processing copy object for num_data_binary
Copy object prepared in 0:00:01.296143; 80000021 bytes; transfering to database
Database copy time: 0:00:23.325952
        Total time: 0:00:24.622095
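The size reported for the binary copy object can be reproduced by hand from the format: a 19-byte header, a 2-byte trailer, and for each row a 2-byte field count plus a 4-byte length in front of every value. A small worked check using the column widths of num_data_binary (the variable names are only for illustration):

```python
# Byte widths of the columns in num_data_binary:
# id integer, node integer, ts smallint, s0..s6 real
column_sizes = [4, 4, 2] + [4] * 7
n_rows = 2000 * 500

header = 11 + 4 + 4   # signature + flags + header extension length
trailer = 2           # 16-bit value -1
# Field count, then a (4-byte length, value) pair per column
row = 2 + sum(4 + size for size in column_sizes)

total = header + n_rows * row + trailer
print(row, total)
```

This gives 80 bytes per row and 80000021 bytes in total, matching the output above. Of those 80 bytes only 38 are data; the other 42 are the field count and length prefixes, which is why the binary object is not much smaller than the text one.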

So both the NumPy → file and file → database steps are way faster with the binary approach. The obvious difference is how Python prepares the COPY file, which is really slow for text (about 75 seconds versus 1.3 seconds here). Generally speaking, the binary format loads into the database in about two-thirds of the time taken by the text format for this schema.

Lastly, I compared the values in both tables within the database to see if the numbers were different. About 1.46% of the rows have different values for column s0, and this fraction increases to 6.17% for s6 (probably related to the random method that I used). The non-zero absolute differences between all 70M 32-bit float values range between 9.3132257e-010 and 7.6293945e-006. These small differences between the text and binary loading methods are due to the loss of precision from the float → text → float conversions required for the text format method.
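Whether a text round trip changes a 32-bit float depends on how many significant digits the text carries: 9 significant decimal digits are enough to recover any float32 exactly, while fewer may not be. The sketch below illustrates this with plain `struct`. It is not a reconstruction of the benchmark above; the helper names, the sample values, and the digit counts are assumptions chosen for the illustration.

```python
import random
from struct import pack, unpack


def to_float32(x):
    """Round a Python float to the nearest 32-bit float value."""
    return unpack('!f', pack('!f', x))[0]


def count_changed(values, digits):
    """Count values that differ after float32 -> text -> float32."""
    changed = 0
    for v in values:
        text = '%.*g' % (digits, v)  # text with the given significant digits
        if to_float32(float(text)) != v:
            changed += 1
    return changed


random.seed(0)
values = [to_float32(random.random() * 100) for _ in range(10000)]

changed_6 = count_changed(values, 6)  # too few digits for float32
changed_9 = count_changed(values, 9)  # enough digits for float32
print(changed_6, changed_9)
```

With 9 digits no value changes, while with 6 digits many do. The binary format avoids the question entirely because the 4 bytes of each value are sent as they are.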

answered Oct 08 '22 17:10

Mike T
