I am trying to insert a pandas DataFrame into a Postgresql DB (9.1) in the most efficient way (using Python 2.7).
Using "cursor.execute_many" is really slow, so is "DataFrame.to_csv(buffer,...)" together with "copy_from".
I found an already much! faster solution on the web (http://eatthedots.blogspot.de/2008/08/faking-read-support-for-psycopgs.html) which I adapted to work with pandas.
My code can be found below.
My question is whether the method of this related question (using "copy from stdin with binary") can be easily transferred to work with DataFrames and if this would be much faster.
Use binary COPY table FROM with psycopg2
Unfortunately my Python skills aren't sufficient to understand the implementation of this approach.
This is my approach:
import psycopg2
import connectDB # this is simply a module that returns a connection to the db
from datetime import datetime
class ReadFaker:
"""
This could be extended to include the index column optionally. Right now the index
is not inserted
"""
def __init__(self, data):
self.iter = data.itertuples()
def readline(self, size=None):
try:
line = self.iter.next()[1:] # element 0 is the index
row = '\t'.join(x.encode('utf8') if isinstance(x, unicode) else str(x) for x in line) + '\n'
# in my case all strings in line are unicode objects.
except StopIteration:
return ''
else:
return row
read = readline
def insert(df, table, con=None, columns = None):
time1 = datetime.now()
close_con = False
if not con:
try:
con = connectDB.getCon() ###dbLoader returns a connection with my settings
close_con = True
except psycopg2.Error, e:
print e.pgerror
print e.pgcode
return "failed"
inserted_rows = df.shape[0]
data = ReadFaker(df)
try:
curs = con.cursor()
print 'inserting %s entries into %s ...' % (inserted_rows, table)
if columns is not None:
curs.copy_from(data, table, null='nan', columns=[col for col in columns])
else:
curs.copy_from(data, table, null='nan')
con.commit()
curs.close()
if close_con:
con.close()
except psycopg2.Error, e:
print e.pgerror
print e.pgcode
con.rollback()
if close_con:
con.close()
return "failed"
time2 = datetime.now()
print time2 - time1
return inserted_rows
Pandas dataframes now have a .to_sql method. Postgresql is not supported yet, but there's a patch for it that looks like it works. See the issues here and here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With