Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replacing SQL database table with transformed data in dplyr

Tags:

sql

sqlite

r

dplyr

I frequently replace data as I transform it in dplyr, especially when working with large datasets. I'm not sure how to do this elegantly when I am working with SQL-backed datasets, at least not with SQLite.

I could not find any discussion of this goal in the dplyr DB vignettes or on SO, which also makes me wonder if there's something wrong with what I'm doing in the first place; however, it seems to be a natural way to work with large datasets.

At any rate, the most intuitive approach does not work:

library(dplyr)
library(RSQLite)

db2 <- src_sqlite("trouble.sqlite", create = TRUE)
trouble <- data.frame(values = c(5, 1, 3))
trouble.db <- copy_to(db2, trouble, temporary = FALSE)

collect(trouble.db) # 5, 3, 1

trouble.db <- trouble.db %>% arrange(values)
collect(trouble.db) # 1, 3, 5

trouble.in <- tbl(db2, sql("SELECT * from trouble"))
collect(trouble.in) # 5, 3, 1

Another intuitive syntax for in-place copy gives a "table already exists" error:

trouble.db <- copy_to(db2, as.data.frame(trouble.db), name="trouble", temporary = FALSE)

One solution is to manually drop the table and rebuild it, which is what I've been doing:

db2$con %>% db_drop_table(table = "trouble")
trouble <- collect(trouble.db)
trouble.db <- copy_to(db2, trouble, temporary = FALSE)

Another one is to give up on replacement and create a series of temporary tables, which I find unaesthetic but which I suppose might be the recommended paradigm:

trouble_temp <- data.frame(values = c(5, 1, 3))
trouble_temp.db <- copy_to(db2, trouble_temp, temporary = TRUE)
trouble <- trouble.db %>% arrange(values)
trouble.db <- copy_to(db2, trouble, temporary = FALSE)

I suspect that "drop and re-copy" is going to wind up being the answer, but out of an abundance of love for beautiful solutions, I thought I would ask if there is a better way.

like image 897
David Bruce Borenstein Avatar asked Feb 11 '16 15:02

David Bruce Borenstein


1 Answers

For anyone finding this years after.

The statement

trouble.db %>% arrange(values)

Creates an SQL query that's sent to the database and executed when you collect the result.

We can see that SQL like this

trouble.db %>% arrange(values) %>% show_query()
SELECT *
FROM `trouble`
ORDER BY `values`

Clearly a query like this can't modify the actual data it's querying.

To modify the data, we can use dbWriteTable function from the DBI package

library(dplyr)
library(RSQLite)

# Will use this connection object for all our DB interactions
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

trouble <- data.frame(values = c(5, 1, 3))
trouble.db <- copy_to(con, trouble, temporary = FALSE)

collect(trouble.db) # 5, 3, 1

# This is just a query
trouble.db <- trouble.db %>% arrange(values)
collect(trouble.db) # 1, 3, 5

# The data shouldn't be modified yet
trouble.in <- tbl(con, sql("SELECT * from trouble"))
collect(trouble.in) # 5, 3, 1

# Now we are modifying the data
DBI::dbWriteTable(
  con, "trouble", collect(trouble.db),
  overwrite = TRUE
)

tbl(con, sql("SELECT * from trouble")) %>% collect() # 1, 3, 5
like image 86
khvorov45 Avatar answered Oct 02 '22 14:10

khvorov45