
Speed up INSERT of 1 million+ rows into Postgres via R using COPY?

I would like to bulk-INSERT/UPSERT a moderately large number of rows into a PostgreSQL database from R. To do so, I am preparing a multi-row INSERT string in R.

query <- sprintf("BEGIN;
                  CREATE TEMPORARY TABLE
                  md_updates(ts_key varchar, meta_data hstore) ON COMMIT DROP;

                  INSERT INTO md_updates(ts_key, meta_data) VALUES %s;
                  LOCK TABLE %s.meta_data_unlocalized IN EXCLUSIVE MODE;

                  UPDATE %s.meta_data_unlocalized
                  SET meta_data = md_updates.meta_data
                  FROM md_updates
                  WHERE md_updates.ts_key = %s.meta_data_unlocalized.ts_key;
                  COMMIT;", md_values, schema, schema, schema)

DBI::dbGetQuery(con, query)
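For context, `md_values` above is assumed to be a single string of comma-separated value tuples. A minimal sketch of how it might be built is below; the example vectors are hypothetical and the quoting is deliberately naive (real code should escape values, e.g. with `DBI::dbQuoteString()`):

```r
# Hypothetical sketch: build the VALUES list for the multi-row INSERT.
# keys and hstore_strings are assumed character vectors of equal length.
keys <- c("ts1", "ts2")
hstore_strings <- c('"lang"=>"en"', '"lang"=>"de"')

# sprintf() vectorizes over the inputs; paste() collapses the tuples
# into one string suitable for substitution into VALUES %s above.
md_values <- paste(
  sprintf("('%s', '%s'::hstore)", keys, hstore_strings),
  collapse = ", "
)
```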

The entire function can be found here. Surprisingly (at least to me), I learned that the UPDATE part is not the problem: I left it out, ran the query again, and it wasn't much faster. INSERTing a million+ records seems to be the issue here.

I did some research and found quite a bit of information:

bulk inserts

bulk inserts II

what causes large inserts to slow down

Answers from @Erwin Brandstetter and @Craig Ringer were particularly helpful. I was able to speed things up quite a bit by dropping indices and following a few other suggestions.

However, I struggled to implement another suggestion which sounded promising: COPY. The problem is I can't get it done from within R.

The following works for me:

sql <- sprintf('CREATE TABLE
                md_updates(ts_key varchar, meta_data hstore);
                COPY md_updates FROM STDIN;')

dbGetQuery(sandbox, "COPY md_updates FROM 'test.csv' DELIMITER ';' CSV;")

But I can't get it done without reading from an extra .csv file. So my questions are:

  • Is COPY really a promising approach here (over the multi-row INSERT I have)?

  • Is there a way to use COPY from within R without writing the data to a file? The data fits in memory, and since it's already in memory, why write it to disk?

I am using PostgreSQL 9.5 on OS X and 9.5 on RHEL respectively.

Matt Bannert asked Apr 11 '17 15:04


1 Answer

RPostgreSQL has a "postgresqlCopyInDataframe" function that looks like it should do what you want:

install.packages("RPostgreSQL")
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), user="...", password="...", dbname="...", host="...")
dbSendQuery(con, "copy foo from stdin")
postgresqlCopyInDataframe(con, df)

where table foo has the same columns as data frame df.
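Combining this with the question's temp-table UPSERT, a rough sketch might look like the following. This is untested and makes several assumptions: that `df` has columns `ts_key` and `meta_data` matching the temp table, that `schema` is the target schema name, and that `postgresqlCopyInDataframe` can be called immediately after issuing `COPY ... FROM STDIN`, as in the answer above:

```r
library(RPostgreSQL)

# Hypothetical connection details.
con <- dbConnect(PostgreSQL(), user = "...", password = "...",
                 dbname = "...", host = "...")

dbSendQuery(con, "BEGIN;")
dbSendQuery(con, "CREATE TEMPORARY TABLE
                  md_updates(ts_key varchar, meta_data hstore)
                  ON COMMIT DROP;")

# Stream the in-memory data frame into the temp table via COPY,
# with no intermediate .csv file.
dbSendQuery(con, "COPY md_updates FROM STDIN;")
postgresqlCopyInDataframe(con, df)  # df: columns ts_key, meta_data

# Then apply the UPDATE from the question against the target table.
dbSendQuery(con, sprintf(
  "UPDATE %s.meta_data_unlocalized AS t
   SET meta_data = u.meta_data
   FROM md_updates u
   WHERE u.ts_key = t.ts_key;", schema))
dbSendQuery(con, "COMMIT;")
```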

Neil McGuigan answered Sep 28 '22 00:09