I frequently replace data as I transform it in dplyr, especially when working with large datasets. I'm not sure how to do this elegantly when I am working with SQL-backed datasets, at least not with SQLite.
I could not find any discussion of this goal in the dplyr DB vignettes or on SO, which also makes me wonder if there's something wrong with what I'm doing in the first place; however, it seems to be a natural way to work with large datasets.
At any rate, the most intuitive approach does not work:
library(dplyr)
library(RSQLite)
db2 <- src_sqlite("trouble.sqlite", create = TRUE)
trouble <- data.frame(values = c(5, 1, 3))
trouble.db <- copy_to(db2, trouble, temporary = FALSE)
collect(trouble.db) # 5, 3, 1
trouble.db <- trouble.db %>% arrange(values)
collect(trouble.db) # 1, 3, 5
trouble.in <- tbl(db2, sql("SELECT * from trouble"))
collect(trouble.in) # 5, 3, 1
Another intuitive syntax for in-place copy gives a "table already exists" error:
trouble.db <- copy_to(db2, as.data.frame(trouble.db), name="trouble", temporary = FALSE)
One solution is to manually drop the table and rebuild it, which is what I've been doing:
db2$con %>% db_drop_table(table = "trouble")
trouble <- collect(trouble.db)
trouble.db <- copy_to(db2, trouble, temporary = FALSE)
Another one is to give up on replacement and create a series of temporary tables, which I find unaesthetic but which I suppose might be the recommended paradigm:
trouble_temp <- data.frame(values = c(5, 1, 3))
trouble_temp.db <- copy_to(db2, trouble_temp, temporary = TRUE)
trouble <- trouble.db %>% arrange(values)
trouble.db <- copy_to(db2, trouble, temporary = FALSE)
I suspect that "drop and re-copy" is going to wind up being the answer, but out of an abundance of love for beautiful solutions, I thought I would ask if there is a better way.
For anyone finding this years after.
The statement
trouble.db %>% arrange(values)
Creates an SQL query that's sent to the database and executed when you collect
the result.
We can see that SQL like this
trouble.db %>% arrange(values) %>% show_query()
SELECT *
FROM `trouble`
ORDER BY `values`
Clearly a query like this can't modify the actual data it's querying.
To modify the data, we can use dbWriteTable
function from the DBI
package
library(dplyr)
library(RSQLite)
# Will use this connection object for all our DB interactions
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
trouble <- data.frame(values = c(5, 1, 3))
trouble.db <- copy_to(con, trouble, temporary = FALSE)
collect(trouble.db) # 5, 3, 1
# This is just a query
trouble.db <- trouble.db %>% arrange(values)
collect(trouble.db) # 1, 3, 5
# The data shouldn't be modified yet
trouble.in <- tbl(con, sql("SELECT * from trouble"))
collect(trouble.in) # 5, 3, 1
# Now we are modifying the data
DBI::dbWriteTable(
con, "trouble", collect(trouble.db),
overwrite = TRUE
)
tbl(con, sql("SELECT * from trouble")) %>% collect() # 1, 3, 5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With