 

Efficient way to insert data frame from R to SQL

Tags: mysql, r, rodbc

I have a data frame with 10 million rows and 5 columns that I want to insert into an existing SQL table. Note that I do not have permission to create a table; I can only insert values into an existing table. I'm currently using RODBCext:

query_ch <- "insert into [blah].[dbo].[blahblah] 
               (col1, col2, col3, col4, col5)
               values (?,?,?,?,?)"

sqlExecute(channel, query_ch, my_data) 

This takes way too long (more than 10 hours). Is there a way to accomplish this faster?

asked May 10 '17 by user124543131234523

People also ask

Can you use R and SQL together?

Not only can you easily retrieve data from SQL Sources for analysis and visualisation in R, but you can also use SQL to create, clean, filter, query and otherwise manipulate datasets within R, using a wide choice of relational databases.

Which package provides a way to directly use SQL to query data frames in R as if these data frames are tables in relational databases?

sqldf is an open-source library used to run SQL Statements on R data frames. It works with multiple databases such as SQLite, H2, PostgreSQL, and MySQL databases.

How do I write pandas Dataframe in MySQL?

Create a DataFrame by passing a Python dict to the pandas DataFrame constructor. Then invoke the to_sql() method on the DataFrame instance, specifying the table name and database connection. This creates a table in the MySQL database server and populates it with the data from the DataFrame.
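As a minimal sketch of that workflow (using SQLite in place of a MySQL server so the example is self-contained; the table name `blahblah` is borrowed from the question, and `chunksize`/`method` are optional tuning knobs):

```python
import sqlite3

import pandas as pd

# Build a DataFrame from a plain dict, as described above.
df = pd.DataFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]})

# SQLite stands in for MySQL here so the example runs anywhere;
# against MySQL you would pass an SQLAlchemy engine instead.
con = sqlite3.connect(":memory:")

# chunksize batches rows per round trip; method="multi" packs many
# rows into each INSERT statement.
df.to_sql("blahblah", con, if_exists="append", index=False,
          chunksize=10000, method="multi")
```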


1 Answer

TL;DR: LOAD DATA INFILE is one order of magnitude faster than multiple INSERT statements, which are themselves one order of magnitude faster than single INSERT statements.

Below I benchmark the three main strategies for importing data from R into MySQL:

  1. single insert statements, as in the question:

    INSERT INTO test (col1,col2,col3) VALUES (1,2,3)

  2. multiple insert statements, formatted like so:

    INSERT INTO test (col1,col2,col3) VALUES (1,2,3),(4,5,6),(7,8,9)

  3. load data infile statement, i.e. loading a previously written CSV file into MySQL:

    LOAD DATA INFILE 'the_dump.csv' INTO TABLE test


I use RMySQL here, but any other MySQL driver should lead to similar results. The SQL table was instantiated with:

CREATE TABLE `test` (
  `col1` double, `col2` double, `col3` double, `col4` double, `col5` double
) ENGINE=MyISAM;

The connection and test data were created in R with:

library(RMySQL)
con = dbConnect(MySQL(),
                user = 'the_user',
                password = 'the_password',
                host = '127.0.0.1',
                dbname='test')

n_rows = 1000000 # number of tuples
n_cols = 5 # number of fields
dump = matrix(runif(n_rows*n_cols), ncol=n_cols, nrow=n_rows)
colnames(dump) = paste0('col',1:n_cols)

Benchmarking single insert statements:

before = Sys.time()
for (i in 1:nrow(dump)) {
  query = paste0('INSERT INTO test (',paste0(colnames(dump),collapse = ','),') VALUES (',paste0(dump[i,],collapse = ','),');')
  dbExecute(con, query)
}
time_naive = Sys.time() - before 

=> this takes about 4 minutes on my computer


Benchmarking multiple insert statements:

before = Sys.time()
chunksize = 10000 # arbitrary chunk size
for (i in 1:ceiling(nrow(dump)/chunksize)) {
  query = paste0('INSERT INTO test (',paste0(colnames(dump),collapse = ','),') VALUES ')
  vals = NULL
  for (j in 1:chunksize) {
    k = (i-1)*chunksize+j
    if (k <= nrow(dump)) {
      vals[j] = paste0('(', paste0(dump[k,],collapse = ','), ')')
    }
  }
  query = paste0(query, paste0(vals,collapse=','))
  dbExecute(con, query)
}
time_chunked = Sys.time() - before 

=> this takes about 40 seconds on my computer
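The same chunking idea can be sketched in Python against an in-memory SQLite database (table and column names are illustrative, mirroring the benchmark's `test` table):

```python
import sqlite3

def chunked_insert(con, table, cols, rows, chunksize=10000):
    """Send one multi-row INSERT per chunk instead of one INSERT per row."""
    cur = con.cursor()
    for start in range(0, len(rows), chunksize):
        chunk = rows[start:start + chunksize]
        # Build "(v1,v2,v3),(v4,v5,v6),..." for this chunk.
        values = ",".join(
            "(" + ",".join(str(v) for v in row) + ")" for row in chunk
        )
        cur.execute(f"INSERT INTO {table} ({','.join(cols)}) VALUES {values}")
    con.commit()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (col1 REAL, col2 REAL, col3 REAL)")
chunked_insert(con, "test", ["col1", "col2", "col3"],
               [(1, 2, 3), (4, 5, 6), (7, 8, 9)], chunksize=2)
```

In production code, a parameterized `executemany` (or the driver's prepared-statement batching) avoids string interpolation and its quoting pitfalls; the string-building form above simply mirrors the R benchmark.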


Benchmarking load data infile statement:

before = Sys.time()
write.table(dump, 'the_dump.csv',
            row.names = F, col.names = F, sep = '\t')
query = "LOAD DATA INFILE 'the_dump.csv' INTO TABLE test"
dbExecute(con, query)
time_infile = Sys.time() - before 

=> this takes about 4 seconds on my computer
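For completeness, the dump-then-load pattern can be sketched in Python as well. This only constructs the statement (no MySQL server is contacted); note that plain LOAD DATA INFILE reads the file on the server and is subject to MySQL's secure_file_priv setting, whereas LOAD DATA LOCAL INFILE streams it from the client. Tab separators match LOAD DATA's default field terminator:

```python
import csv
import os
import tempfile

def dump_and_build_load(rows, table):
    """Write rows to a tab-separated file and return the path plus the
    LOAD DATA statement that would bulk-load it."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)
    # LOCAL makes the client stream the file; without it the file must be
    # readable by the MySQL server process itself.
    return path, f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE {table}"

path, query = dump_and_build_load([(1.0, 2.0, 3.0)], "test")
```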


Crafting your SQL query to handle many insert values at once is the simplest way to improve performance. Transitioning to LOAD DATA INFILE will lead to optimal results; be aware that the file must be readable by the MySQL server process (the LOCAL variant streams it from the client instead). Further performance tips can be found in the MySQL documentation on optimizing INSERT statements.

answered Sep 22 '22 by Jealie