I just did a timing test of loading a data array from a CSV, creating a database on Postgres, and writing the table to it, in both Python and R.
I was surprised that the times were very similar:
The Python code first (as an example):
import timeit
tic = timeit.default_timer()
tic4 = timeit.default_timer()
import xlrd as xl
import psycopg2 as pq
import os
import pandas as pd
import numpy as np
import csv
from pprint import pprint as pp
perf_dir = '/myhomedir'
toc4=timeit.default_timer()
# Create the database
tic1= timeit.default_timer()
os.system('dropdb ptest')
os.system('createdb ptest')
# connect to the database
cn = pq.connect('dbname=ptest user=me')
cr = cn.cursor()
toc1=timeit.default_timer()
# Create the tables:
### load csvs
tic2=timeit.default_timer()
id_files = ('di1','di2','di.c')
id_files = [i+r'.csv' for i in id_files]
id1 = csv.reader(open(os.path.join(perf_dir,id_files[0])),delimiter='\t')
id1 = [i for i in id1]
id2 = csv.reader(open(os.path.join(perf_dir,id_files[1])))
id2 = [i for i in id2]
id3 = csv.reader(open(os.path.join(perf_dir,id_files[2])),delimiter='\t')
id3 = [i for i in id3]
id3 = [i[1:3] for i in id3]
toc2=timeit.default_timer()
# create tables and fill
### id1 fund classifications
tic3=timeit.default_timer()
cr.execute('CREATE TABLE id1 (%s varchar, %s int PRIMARY KEY, %s int, %s int, %s varchar)' % tuple(id1[0]))
FLDS = 'INSERT INTO id1 (%s,%s,%s,%s,%s) VALUES' % tuple(id1[0])
SQL = FLDS + ' (%s,%s,%s,%s,%s)'
for i in range(1,len(id1)):
    data = tuple(id1[i])
    cr.execute(SQL,data)
### id2 portfolio group classifications - reference only
cr.execute('CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)' % tuple(id2[0]))
SQL = 'INSERT INTO id2 (%s,%s,%s) VALUES' % tuple(id2[0]) + ' (%s,%s,%s)'
for i in range(1,len(id2)):
    data = tuple(id2[i])
    cr.execute(SQL,data)
### id3 value variable classifications
cr.execute('CREATE TABLE id3 (%s varchar,%s varchar)' % tuple(id3[0]))
SQL = 'INSERT INTO id3 VALUES(%s,%s)'
for i in range(1,len(id3)):
    data = tuple(id3[i])
    cr.execute(SQL,data)
cn.commit()
# Timing block - will be commented out in final code
toc3=timeit.default_timer()
toc = timeit.default_timer()
time = (toc - tic)
time1 = toc1 - tic1
time2 = toc2 - tic2
time3 = toc3 - tic3
time4 = toc4 - tic4
print('Overall time: %s' % time)
print('dB create & connect time: %s' % time1)
print('Load id csvs time: %s' % time2)
print('Create tables and write to db time: %s' % time3)
print('Time to import libraries: %s' % time4)
and the R code (which is much more legible):
tic = proc.time()
library(RPostgreSQL)
tic1 = proc.time()
system('dropdb ptest1')
system('createdb ptest1')
drv = dbDriver("PostgreSQL")
con = dbConnect(drv, dbname='ptest1')
toc1 = proc.time()
time1 = toc1 - tic1
tic2 = proc.time()
id.1 = read.csv('/myhomedir/di1.csv',stringsAsFactors=F,sep='\t')
id.2 = read.csv('/myhomedir/di2.csv',stringsAsFactors=F)
id.3 = read.csv('/myhomedir/di.c.csv',stringsAsFactors=F,sep='\t')
id.3 = id.3[,-1]
toc2 = proc.time()
time2 = toc2 - tic2
tic3 = proc.time()
dbWriteTable(con,'id1',id.1)
dbWriteTable(con,'id2',id.2)
dbWriteTable(con,'id3',id.3)
toc3 = proc.time()
time3 = toc3 - tic3
toc = proc.time()
time = toc - tic
tyme = rbind(time1,time2,time3,time)
tyme = data.frame(Function=c('Create & Connect to DB',"Load CSV's for save","Write Table to DB",'Overall Time'),tyme)
I was very surprised at how close the times for the two were (I've read an awful lot about R being slow and Python being extremely fast).
For Python:
Overall time: 2.48381304741
dB create & connect time: 1.96832108498
Load id csvs time: 0.000378847122192
Create tables and write to db time: 0.35303401947
Time to import libraries: 0.162075042725
and for R:
Function user.self sys.self elapsed user.child sys.child
time1 Create & Connect to DB 0.112 0.016 1.943 0.06 0.004
time2 Load CSV's for save 0.008 0.000 0.006 0.00 0.000
time3 Write Table to DB 0.096 0.004 0.349 0.00 0.000
time Overall Time 0.376 0.028 2.463 0.06 0.004
I wondered if it had something to do with the fact that I'm INSERTing a row at a time into the Python version of the table.
Hence the main question: is there an equivalent in Python for the dbWriteTable block in the R code, and would it speed things up?
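(For what it's worth, pandas is already imported above, and its read_csv plus DataFrame.to_sql via an SQLAlchemy engine looks like the nearest one-call analogue I can see. The sketch below is untested, and the connection string is only a guess for my local setup.)
# Untested sketch: a dbWriteTable-style bulk write with pandas + SQLAlchemy.
# The connection URL and the 'if_exists' choice are assumptions on my part.
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://me@localhost/ptest')
df = pd.read_csv(os.path.join(perf_dir, 'di2.csv'))
df.to_sql('id2', engine, if_exists='replace', index=False)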
A second, ancillary question: is there anything obviously wrong with the code that might be slowing things down?
Happy to provide sample CSVs if that would help.
Not looking to start a flame war on R v Python, would just like to know how I can make my code faster.
Thanks
Try this:
id2 = csv.reader(open(os.path.join(perf_dir,id_files[1])))
h = tuple(next(id2))
create = '''CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)''' % h
insert = '''INSERT INTO id2 (%s, %s, %s) VALUES (%%s, %%s, %%s)''' % h
...
cr.executemany(insert, id2)
Pass the SQL command only once for all the rows. In addition to the performance overhead of issuing additional SQL commands, notice that the size of the SQL string itself (in bytes) likely dwarfs the size of the real data under test.
Using id2 as a generator without the array should also avoid loading the entire data set into memory.
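If the table already exists, Postgres's COPY is probably the closest thing to R's dbWriteTable; psycopg2 exposes it as cursor.copy_from. A minimal sketch, assuming di1.csv is tab-delimited with a single header row and no quoted delimiters in the data:
# Minimal sketch: bulk-load a tab-delimited file into an existing table via COPY.
# Assumes the id1 table has already been created and di1.csv has one header row.
with open(os.path.join(perf_dir, 'di1.csv')) as f:
    next(f)                      # skip the header line
    cr.copy_from(f, 'id1', sep='\t')
cn.commit()
COPY streams the whole file in a single round trip, so it cuts out the per-row statement overhead entirely.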