psycopg2 equivalent to R dbWriteTable command and getting more performance from python code

I just did a timing test of loading data from CSV files, creating a database on Postgres, and writing the tables to it, in both Python and R.

I was surprised that the times were very similar:

The Python code first (as an example):

import timeit
tic = timeit.default_timer()
tic4 = timeit.default_timer()
import xlrd as xl
import psycopg2 as pq
import os
import pandas as pd 
import numpy as np
import csv
from pprint import pprint as pp

perf_dir = '/myhomedir'
toc4=timeit.default_timer()

# Create the database
tic1= timeit.default_timer()
os.system('dropdb ptest')
os.system('createdb ptest')
# connect to the database
cn = pq.connect('dbname=ptest user=me')
cr = cn.cursor()
toc1=timeit.default_timer()

# Create the tables:
### load csvs
tic2=timeit.default_timer()
id_files = ('di1','di2','di.c')
id_files = [i+r'.csv' for i in id_files]
id1 = csv.reader(open(os.path.join(perf_dir,id_files[0])),delimiter='\t')
id1 = [i for i in id1]

id2 = csv.reader(open(os.path.join(perf_dir,id_files[1])))
id2 = [i for i in id2]

id3 = csv.reader(open(os.path.join(perf_dir,id_files[2])),delimiter='\t')
id3 = [i for i in id3]
id3 = [i[1:3] for i in id3]
toc2=timeit.default_timer()

# create tables and fill
### id1 fund classifications
tic3=timeit.default_timer()
cr.execute('CREATE TABLE id1 (%s varchar, %s int PRIMARY KEY, %s int, %s int, %s varchar)' % tuple(id1[0]))
FLDS = 'INSERT INTO id1 (%s,%s,%s,%s,%s) VALUES' % tuple(id1[0]) 
SQL = FLDS + ' (%s,%s,%s,%s,%s)'
for i in range(1,len(id1)):
    data = tuple(id1[i])
    cr.execute(SQL,data)

### id2 portfolio group classifications - reference only
cr.execute('CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)' % tuple(id2[0]))
SQL = 'INSERT INTO id2 (%s,%s,%s) VALUES' % tuple(id2[0]) + ' (%s,%s,%s)'
for i in range(1,len(id2)):
    data = tuple(id2[i])
    cr.execute(SQL,data)

### id3 value variable classifications
cr.execute('CREATE TABLE id3 (%s varchar,%s varchar)' % tuple(id3[0]))
SQL = 'INSERT INTO id3 VALUES(%s,%s)'
for i in range(1,len(id3)):
    data = tuple(id3[i])
    cr.execute(SQL,data)

cn.commit()

# Timing block - will be commented out in final code
toc3=timeit.default_timer()
toc = timeit.default_timer()
time = (toc - tic)
time1 = toc1 - tic1
time2 = toc2 - tic2
time3 = toc3 - tic3
time4 = toc4 - tic4
print('Overall time: %s' % time)
print('dB create & connect time: %s' % time1)
print('Load id csvs time: %s' % time2)
print('Create tables and write to db time: %s' % time3)
print('Time to import libraries: %s' % time4)

and the R code (which is much more legible):

tic = proc.time()
library(RPostgreSQL)

tic1 = proc.time()
system('dropdb ptest1')
system('createdb ptest1')
drv = dbDriver("PostgreSQL")
con = dbConnect(drv, dbname='ptest1')
toc1 = proc.time()
time1 = toc1 - tic1

tic2 = proc.time()
id.1 = read.csv('/myhomedir/di1.csv',stringsAsFactors=F,sep='\t')
id.2 = read.csv('/myhomedir/di2.csv',stringsAsFactors=F)
id.3 = read.csv('/myhomedir/di.c.csv',stringsAsFactors=F,sep='\t')
id.3 = id.3[,-1]
toc2 = proc.time()
time2 = toc2 - tic2

tic3 = proc.time()
dbWriteTable(con,'id1',id.1)
dbWriteTable(con,'id2',id.2)
dbWriteTable(con,'id3',id.3)
toc3 = proc.time()
time3 = toc3 - tic3

toc = proc.time()
time = toc - tic
tyme = rbind(time1,time2,time3,time)
tyme = data.frame(Function=c('Create & Connect to DB',"Load CSV's for save","Write Table to DB",'Overall Time'),tyme)

I was very surprised at how close the two sets of times were. (I've read an awful lot about R being slow and Python being extremely fast.)

For python

>>> Overall time: 2.48381304741
dB create & connect time: 1.96832108498
Load id csvs time: 0.000378847122192
Create tables and write to db time: 0.35303401947
Time to import libraries: 0.162075042725

and for R

                    Function user.self sys.self elapsed user.child sys.child
time1 Create & Connect to DB     0.112    0.016   1.943       0.06     0.004
time2    Load CSV's for save     0.008    0.000   0.006       0.00     0.000
time3      Write Table to DB     0.096    0.004   0.349       0.00     0.000
time            Overall Time     0.376    0.028   2.463       0.06     0.004

I wondered if it had something to do with the fact that I'm INSERTing a row at a time into the Python version of the table.

Hence the main question: is there an equivalent in Python of the dbWriteTable block in the R code, and would it speed things up?

A second, ancillary question: is there anything obviously wrong with the code that might be slowing things down?

Happy to provide sample CSVs if that would help.

I'm not looking to start a flame war on R vs. Python; I'd just like to know how I can make my code faster.

Thanks

asked Oct 04 '22 by Tahnoon Pasha

1 Answer

Try this:

    # read the header row once; id2 stays an iterator over the remaining data rows
    id2 = csv.reader(open(os.path.join(perf_dir,id_files[1])))
    h = tuple(next(id2))

    # build both statements from the header; the doubled %%s survives the string
    # formatting and becomes the %s placeholder that psycopg2 expects
    create = '''CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)''' % h
    insert = '''INSERT INTO id2 (%s, %s, %s) VALUES (%%s, %%s, %%s)''' % h

    ...

    # send the INSERT once and let psycopg2 iterate over all the rows
    cr.executemany(insert, id2)

executemany passes the SQL command only once for all the rows. Besides the overhead of issuing a separate command per row, note that the size of the SQL string itself (in bytes) likely dwarfs the size of the actual data under test.

Using id2 as an iterator, rather than materializing it into a list first, should also avoid loading the entire data set into memory.
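
If you want something closer to R's dbWriteTable, the fastest route in psycopg2 is usually Postgres's COPY, exposed as cursor.copy_from. Here is a minimal sketch for the id1 file, reusing perf_dir, id_files, cr and cn from the question and assuming the id1 table has already been created:

    # Sketch: bulk-load the tab-delimited id1 file with COPY instead of
    # one INSERT per row. Assumes the id1 table already exists and that
    # the file's first line is a header.
    with open(os.path.join(perf_dir, id_files[0])) as f:
        f.readline()                      # skip the header row
        cr.copy_from(f, 'id1', sep='\t')  # runs COPY id1 FROM STDIN
    cn.commit()

(As far as I know, RPostgreSQL's dbWriteTable also goes through COPY under the hood, which is why per-row INSERTs have a hard time matching it.)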

answered Oct 07 '22 by jlubken