I have tens of millions of rows to transfer from multidimensional array files into a PostgreSQL database. My tools are Python and psycopg2. The most efficient way to bulk insert data is using copy_from. However, my data are mostly 32-bit floating point numbers (real or float4), so I'd rather not convert from real → text → real. Here is an example database DDL:
CREATE TABLE num_data
(
id serial PRIMARY KEY NOT NULL,
node integer NOT NULL,
ts smallint NOT NULL,
val1 real,
val2 double precision
);
Here is where I'm at with Python using strings (text):
# Just one row of data
num_row = [23253, 342, -15.336734, 2494627.949375]
import psycopg2
# Python3:
from io import StringIO
# Python2, use: from cStringIO import StringIO
conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()
# Convert floating point numbers to text, write to COPY input
cpy = StringIO()
cpy.write('\t'.join([repr(x) for x in num_row]) + '\n')
# Insert data; database converts text back to floating point numbers
cpy.seek(0)
curs.copy_from(cpy, 'num_data', columns=('node', 'ts', 'val1', 'val2'))
conn.commit()
Is there an equivalent that could work using a binary mode? I.e., keep the floating point numbers in binary? Not only would this preserve the floating point precision, but it could be faster.
(Note: to see the same precision as the example, use SET extra_float_digits='2')
Here is the binary equivalent of COPY FROM for Python 3:
from io import BytesIO
from struct import pack
import psycopg2

# Two rows of data; "id" is not in the upstream data source
# Columns: node, ts, val1, val2
data = [(23253, 342, -15.336734, 2494627.949375),
        (23256, 348, 43.23524, 2494827.949375)]

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# Determine starting value for sequence
curs.execute("SELECT nextval('num_data_id_seq')")
id_seq = curs.fetchone()[0]

# Make a binary file object for COPY FROM
cpy = BytesIO()
# 11-byte signature, no flags, no header extension
cpy.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))

# Columns: id, node, ts, val1, val2
# Zip: (column position, format, size)
row_format = list(zip(range(-1, 4),
                      ('i', 'i', 'h', 'f', 'd'),
                      (4, 4, 2, 4, 8)))
for row in data:
    # Number of columns/fields (always 5)
    cpy.write(pack('!h', 5))
    for col, fmt, size in row_format:
        value = (id_seq if col == -1 else row[col])
        cpy.write(pack('!i' + fmt, size, value))
    id_seq += 1  # manually increment sequence outside of database

# File trailer
cpy.write(pack('!h', -1))

# Copy data to database
cpy.seek(0)
curs.copy_expert("COPY num_data FROM STDIN WITH BINARY", cpy)

# Update sequence on database
curs.execute("SELECT setval('num_data_id_seq', %s, false)", (id_seq,))
conn.commit()
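The per-tuple layout used above (a 16-bit field count, then a 32-bit length followed by the value for each field) can be checked without a database. The sketch below is only an illustration: `pack_row` and `FIELDS` are hypothetical names, not part of psycopg2, and the handling of `None` assumes the binary COPY convention that a NULL field is written as length -1 with no data bytes.

```python
from struct import pack, unpack

# (struct format, size in bytes) for the num_data columns:
# id integer, node integer, ts smallint, val1 real, val2 double precision
FIELDS = [('i', 4), ('i', 4), ('h', 2), ('f', 4), ('d', 8)]

def pack_row(values, fields=FIELDS):
    """Pack one tuple in binary COPY layout; None is written as SQL NULL."""
    out = [pack('!h', len(fields))]           # 16-bit field count
    for value, (fmt, size) in zip(values, fields):
        if value is None:
            out.append(pack('!i', -1))        # length -1, no data bytes follow
        else:
            out.append(pack('!i' + fmt, size, value))
    return b''.join(out)

# One row with a NULL in the last (double precision) column
row = pack_row((1, 23253, 342, -15.336734, None))
print(len(row))                               # 2 + 8 + 8 + 6 + 8 + 4 = 36 bytes
print(unpack('!h', row[:2])[0])               # field count: 5
```

Only nullable columns (val1 and val2 in this table) should ever receive `None`; the server rejects a NULL in a NOT NULL column regardless of the COPY format.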
I rewrote the above approach to writing the files for COPY. My data in Python is in NumPy arrays, so it makes sense to use these. Here is some example data with 1M rows, 7 columns:
import psycopg2
import numpy as np
from struct import pack
from io import BytesIO
from datetime import datetime

conn = psycopg2.connect("dbname=mydb user=postgres")
curs = conn.cursor()

# NumPy record array
shape = (7, 2000, 500)
print('Generating data with %i rows, %i columns' % (shape[1]*shape[2], shape[0]))

dtype = ([('id', 'i4'), ('node', 'i4'), ('ts', 'i2')] +
         [('s' + str(x), 'f4') for x in range(shape[0])])
data = np.empty(shape[1]*shape[2], dtype)
data['id'] = np.arange(shape[1]*shape[2]) + 1
data['node'] = np.tile(np.arange(shape[1]) + 1, shape[2])
data['ts'] = np.repeat(np.arange(shape[2]) + 1, shape[1])
data['s0'] = np.random.rand(shape[1]*shape[2]) * 100
prv = 's0'
for nxt in data.dtype.names[4:]:
    data[nxt] = data[prv] + np.random.rand(shape[1]*shape[2]) * 10
    prv = nxt
On my database, I have two tables that look like:
CREATE TABLE num_data_binary
(
id integer PRIMARY KEY,
node integer NOT NULL,
ts smallint NOT NULL,
s0 real,
s1 real,
s2 real,
s3 real,
s4 real,
s5 real,
s6 real
)
WITH (OIDS=FALSE);
and another similar table named num_data_text.
Here are some simple helper functions to prepare the data for COPY (both text and binary formats) by using the information in the NumPy record array:
def prepare_text(dat):
    cpy = BytesIO()
    for row in dat:
        # The buffer holds bytes, so each text line is encoded before writing.
        # On NumPy 2.0+, repr() of a scalar includes the type name;
        # use str(x) instead of repr(x) there.
        line = '\t'.join([repr(x) for x in row]) + '\n'
        cpy.write(line.encode('ascii'))
    return cpy

def prepare_binary(dat):
    # Interleave each field with its 32-bit length, all big-endian
    pgcopy_dtype = [('num_fields', '>i2')]
    for field, dtype in dat.dtype.descr:
        pgcopy_dtype += [(field + '_length', '>i4'),
                         (field, dtype.replace('<', '>'))]
    pgcopy = np.empty(dat.shape, pgcopy_dtype)
    pgcopy['num_fields'] = len(dat.dtype)
    for i in range(len(dat.dtype)):
        field = dat.dtype.names[i]
        # Field length in bytes (itemsize, not alignment)
        pgcopy[field + '_length'] = dat.dtype[i].itemsize
        pgcopy[field] = dat[field]
    cpy = BytesIO()
    cpy.write(pack('!11sii', b'PGCOPY\n\377\r\n\0', 0, 0))
    cpy.write(pgcopy.tobytes())  # all rows
    cpy.write(pack('!h', -1))  # file trailer
    return cpy
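Before sending a buffer like the one built by prepare_binary to the server, it can be useful to sanity-check its framing, since a malformed binary stream makes the whole COPY fail. The following is a rough sketch using only the standard library; `count_tuples` and `SIGNATURE` are hypothetical names, and the tiny test buffer is built by hand rather than from the NumPy data above.

```python
from struct import pack, unpack_from

SIGNATURE = b'PGCOPY\n\377\r\n\0'

def count_tuples(buf):
    """Walk a binary COPY buffer and return the number of tuples in it."""
    if buf[:11] != SIGNATURE:
        raise ValueError('missing PGCOPY signature')
    flags, ext_len = unpack_from('!ii', buf, 11)
    pos = 19 + ext_len                        # skip header extension, if any
    n = 0
    while True:
        (num_fields,) = unpack_from('!h', buf, pos)
        pos += 2
        if num_fields == -1:                  # file trailer
            return n
        for _ in range(num_fields):
            (length,) = unpack_from('!i', buf, pos)
            pos += 4
            if length > 0:                    # -1 marks NULL: no data bytes
                pos += length
        n += 1

# Hand-built buffer: three tuples of (integer, NULL)
buf = pack('!11sii', SIGNATURE, 0, 0)
for node in (1, 2, 3):
    buf += pack('!h', 2) + pack('!ii', 4, node) + pack('!i', -1)
buf += pack('!h', -1)
print(count_tuples(buf))                      # 3
```

With a BytesIO object such as `cpy`, the bytes can be obtained with `cpy.getvalue()`. A truncated buffer raises `struct.error` because a read runs past the end.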
This is how I'm using the helper functions to benchmark the two COPY format methods:
def time_pgcopy(dat, table, binary):
    print('Processing copy object for ' + table)
    tstart = datetime.now()
    if binary:
        cpy = prepare_binary(dat)
    else:  # text
        cpy = prepare_text(dat)
    tendw = datetime.now()
    print('Copy object prepared in ' + str(tendw - tstart) + '; ' +
          str(cpy.tell()) + ' bytes; transfering to database')
    cpy.seek(0)
    if binary:
        curs.copy_expert('COPY ' + table + ' FROM STDIN WITH BINARY', cpy)
    else:  # text
        curs.copy_from(cpy, table)
    conn.commit()
    tend = datetime.now()
    print('Database copy time: ' + str(tend - tendw))
    print(' Total time: ' + str(tend - tstart))
    return

time_pgcopy(data, 'num_data_text', binary=False)
time_pgcopy(data, 'num_data_binary', binary=True)
Here is the output from the last two time_pgcopy commands:
Processing copy object for num_data_text
Copy object prepared in 0:01:15.288695; 84355016 bytes; transfering to database
Database copy time: 0:00:37.929166
 Total time: 0:01:53.217861
Processing copy object for num_data_binary
Copy object prepared in 0:00:01.296143; 80000021 bytes; transfering to database
Database copy time: 0:00:23.325952
 Total time: 0:00:24.622095
So both the NumPy → file and file → database steps are much faster with the binary approach. The obvious difference is how Python prepares the COPY file, which is really slow for text (about 75 s versus 1.3 s here). Generally speaking, the binary format loads into the database in about 2/3 of the time of the text format for this schema.
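The 80000021 bytes reported for the binary buffer follow directly from the framing: a 19-byte header, a 2-byte field count per row, a 4-byte length in front of every field, and a 2-byte trailer. Here is a small worked calculation; `binary_copy_size` is a hypothetical helper written for this illustration.

```python
def binary_copy_size(n_rows, field_sizes):
    """Expected size in bytes of a binary COPY stream with no NULLs."""
    header = 11 + 4 + 4                       # signature, flags, extension length
    per_row = 2 + sum(4 + s for s in field_sizes)
    trailer = 2
    return header + n_rows * per_row + trailer

# id (i4), node (i4), ts (i2), then s0..s6 (f4 each)
sizes = [4, 4, 2] + [4] * 7
print(binary_copy_size(2000 * 500, sizes))    # 80000021
```

Each row carries 38 bytes of values and 42 bytes of framing, so for narrow numeric columns the binary stream is not much smaller than the text one; the gain comes from skipping the number ↔ text conversions.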
Lastly, I compared the values in both tables within the database to see if the numbers were different. About 1.46% of the rows have different values for column s0, and this fraction increases to 6.17% for s6 (probably related to the random method that I used). The non-zero absolute differences between all 7M 32-bit float values range between 9.3132257e-010 and 7.6293945e-006. These small differences between the text and binary loading methods are due to the loss of precision from the float → text → float conversions required for the text format method.
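The effect of the float → text → float conversion can be reproduced in plain Python. This is only a sketch under stated assumptions: the digit counts (6 and 9) are chosen for illustration and are not what repr or the database used in the run above, and `to_float32` and `roundtrip` are hypothetical helpers. Nine significant decimal digits are enough to recover any 32-bit float exactly, while fewer digits are not.

```python
from struct import pack, unpack
import random

def to_float32(x):
    """Round a Python float (a double) to the nearest 32-bit float."""
    return unpack('!f', pack('!f', x))[0]

def roundtrip(x, digits):
    """float32 -> text with `digits` significant digits -> float32."""
    return to_float32(float('%.*g' % (digits, x)))

random.seed(0)
values = [to_float32(random.uniform(0, 100)) for _ in range(10000)]
for digits in (6, 9):
    changed = sum(roundtrip(v, digits) != v for v in values)
    print('%d digits: %d of %d values changed' % (digits, changed, len(values)))
```

With 9 digits no value changes; with 6 digits a large share of them do, which is the same kind of small discrepancy seen between the two tables.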