
How to pipe SQL into R's dplyr?

I can use the following code in R to select distinct rows in any generic SQL database. I'd use dplyr::distinct() but it's not supported in SQL syntax. Anyways, this does indeed work:

dbGetQuery(database_name, 
           "SELECT t.* 
           FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM 
           FROM table_name t
           ) t 
           WHERE SEQNUM = 1;")

I've been using it with success, but wonder how I can pipe that same SQL query after other dplyr steps, as opposed to just using it as a first step as shown above. This is best illustrated with an example:

distinct.df <- 
  left_join(sql_table_1, sql_table_2, by = "col5") %>% 
  sql("SELECT t.* 
      FROM (SELECT t.*, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY column_name) AS SEQNUM 
      FROM table_name t
      ) t 
      WHERE SEQNUM = 1;")

So I dplyr::left_join() two SQL tables, then I want to look at distinct rows, and keep all columns. Do I pipe SQL code into R as shown above (simply utilizing the sql() function)? And if so what would I use for the table_name on the line FROM table_name t?

In my first example I use the actual table name that I'm pulling from. It's too obvious! But in this case I am piping and am used to using the magrittr pronoun . or sometimes the .data pronoun from rlang if I were in memory working in R without databases.

I'm in a SQL database though... so how do I handle this situation? How do I properly pipe my known working SQL into my R code (with a proper table name pronoun)? dbplyr's reference page is a good starting point but doesn't really answer this specific question.

asked Dec 30 '19 22:12



2 Answers

It looks like you are wanting to combine custom SQL code with auto-generated SQL code from dbplyr. For this it is important to distinguish between:

  • DBI::db* commands - that execute the provided SQL on the database and return the result.
  • dbplyr translation - where you work with a remote connection to a table

You can only combine these in certain ways. Below I have given several examples depending on your particular use case. All assume that DISTINCT is a command that is accepted in your specific SQL environment.
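To make the distinction between the two approaches concrete, here is a minimal sketch. It assumes the RSQLite package is installed and uses an in-memory SQLite database as a stand-in for your real database; the table name demo and its contents are made up for illustration.

```r
library(dplyr)
library(dbplyr)

# Assumption: an in-memory SQLite database stands in for the real database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "demo", data.frame(id = c(1, 1, 2)))

# DBI: the SQL string is executed immediately and a local data frame comes back
local_df <- DBI::dbGetQuery(con, "SELECT DISTINCT id FROM demo")

# dbplyr: this only builds a lazy remote table; no rows are fetched yet
remote_tbl <- tbl(con, "demo") %>% distinct(id)

# collect() runs the translated SQL and pulls the result into R
fetched <- collect(remote_tbl)

DBI::dbDisconnect(con)
```

The practical consequence is that a DBI result cannot be piped into further dbplyr verbs on the database, whereas a lazy table can keep accumulating verbs until it is collected.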

Reference examples that cover many of the different use cases

If you'll excuse some self-promotion, I recommend you take a look at my dbplyr_helpers GitHub repository. This includes:

  • union_all function that takes in two tables accessed via dbplyr and outputs a single table using some custom SQL code.
  • write_to_database function that takes a table accessed via dbplyr and converts it to code that can be executed via DBI::dbExecute.

Automatic piping

dbplyr automatically pipes your code into the next query for you when you are working with standard dplyr verbs for which SQL translations are defined. So long as those translations exist, you can chain together many pipes (I have used 10 or more at once). Almost the only disadvantage is that the translated SQL query becomes difficult for a human to read.

For example, consider the following:

library(dbplyr)
library(dplyr)

tmp_df = data.frame(col1 = c(1,2,3), col2 = c("a","b","c"))

df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())

df = left_join(df1, df2, by = "col1") %>%
  distinct()

When you then call show_query(df) R returns the following auto-generated SQL code:

SELECT DISTINCT *
FROM (

SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)

) `dbplyr_002`

The actual output is not as nicely formatted. Note that the initial command (the left join) appears as a nested query, with the DISTINCT in the outer query. Hence df is an R link to a remote database table defined by the above SQL query.
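For the "first row per group" case in the question, standard verbs may already be enough. The following is a sketch under assumptions, not a tested recipe for your setup: it assumes your dbplyr version translates a grouped filter(row_number() == 1) into a ROW_NUMBER() window function and that your backend supports window functions. The in-memory SQLite tables created by memdb_frame() (which needs RSQLite) and their contents are made up for illustration.

```r
library(dplyr)
library(dbplyr)

# Made-up stand-ins for the two remote tables in the question
sql_table_1 <- memdb_frame(col5 = c(1, 1, 2), x = c("a", "b", "c"))
sql_table_2 <- memdb_frame(col5 = c(1, 2), y = c("p", "q"))

first_rows_lazy <- left_join(sql_table_1, sql_table_2, by = "col5") %>%
  group_by(col5) %>%            # becomes PARTITION BY col5
  window_order(col5) %>%        # becomes ORDER BY col5 inside the window
  filter(row_number() == 1) %>% # keep the first row of each partition
  ungroup()

# Inspect the generated SQL, then fetch the result
show_query(first_rows_lazy)
first_rows <- collect(first_rows_lazy)
```

Because everything stays in dplyr verbs, no table name or pronoun is needed: dbplyr nests the join as a subquery on its own.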

Creating custom SQL functions

You can pipe dbplyr into custom SQL functions. Piping means that the thing being piped becomes the first argument of the receiving function.

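custom_distinct <- function(df){
  # Database connection behind the remote table
  db_connection <- df$src$con

  # Wrap the SQL that dbplyr generated for df in a custom outer query
  sql_query <- build_sql(con = db_connection,
                         "SELECT DISTINCT * FROM (\n",
                         sql_render(df),
                         ") AS nested_tbl"
  )
  # Return a new remote table defined by the combined query
  return(tbl(db_connection, sql(sql_query)))
}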

df = left_join(df1, df2, by = "col1") %>%
  custom_distinct()

When you then call show_query(df), R should return the following SQL code, though not as nicely formatted. (I say 'should' because I cannot get this working with simulated SQL connections.)

SELECT DISTINCT * FROM (

SELECT `LHS`.`col1` AS `col1`, `LHS`.`col2` AS `col2.x`, `RHS`.`col2` AS `col2.y`
FROM `df` AS `LHS`
LEFT JOIN `df` AS `RHS`
ON (`LHS`.`col1` = `RHS`.`col1`)

) AS nested_tbl

As with the previous example, df is an R link to a remote database table defined by the above SQL query.
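The same wrapping pattern can carry the ROW_NUMBER() query from the question. This is a minimal sketch, assuming a backend with window-function support; here that is the in-memory SQLite database behind memdb_frame(), which needs RSQLite. The helper name first_row_per_group and the table contents are hypothetical, not part of dbplyr.

```r
library(dplyr)
library(dbplyr)

# Hypothetical helper (not part of dbplyr): keep one row per value of `col`
first_row_per_group <- function(df, col) {
  con <- remote_con(df)
  quoted_col <- as.character(DBI::dbQuoteIdentifier(con, col))

  # The SQL rendered from the piped-in table takes the place of table_name
  query <- paste0(
    "SELECT t.* FROM (",
    "SELECT s.*, ROW_NUMBER() OVER (PARTITION BY ", quoted_col,
    " ORDER BY ", quoted_col, ") AS seqnum ",
    "FROM (", as.character(sql_render(df)), ") s",
    ") t WHERE seqnum = 1"
  )
  tbl(con, sql(query))
}

# Made-up stand-ins for the two remote tables in the question
sql_table_1 <- memdb_frame(col5 = c(1, 1, 2), x = c("a", "b", "c"))
sql_table_2 <- memdb_frame(col5 = c(1, 2), y = c("p", "q"))

distinct_df <- left_join(sql_table_1, sql_table_2, by = "col5") %>%
  first_row_per_group("col5") %>%
  collect()
```

Two details differ from the query in the question: the trailing semicolon is dropped so that the statement can itself be nested, and the result keeps the helper column seqnum, which you may want to remove with select(-seqnum).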

Converting dbplyr to DBI

You can take the code from an existing dbplyr remote table and convert it to a string that can be executed using DBI::db*.

As another way of writing a distinct query:

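library(DBI)

# Simulated tables only illustrate how the query string is built
df1 = tbl_lazy(tmp_df, con = simulate_postgres())
df2 = tbl_lazy(tmp_df, con = simulate_postgres())

df = left_join(df1, df2, by = "col1")

# Render the dbplyr query to text and wrap it in custom SQL
custom_distinct2 = paste0("SELECT DISTINCT * FROM (",
                          as.character(sql_render(df)),
                          ") AS nested_table")

# db_connection must be a live DBI connection to the database that holds
# the tables; a simulated connection cannot execute queries
local_table = dbGetQuery(db_connection, custom_distinct2)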

This returns a local R data frame containing the result of the same SQL query as in the previous examples.
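Since simulated connections cannot execute anything, here is a runnable variant of the same idea. It is only a sketch: it assumes RSQLite is installed and uses an in-memory SQLite database in place of your real one, and the table names t1 and t2 and their contents are made up.

```r
library(dplyr)
library(dbplyr)

# Assumption: an in-memory SQLite database stands in for the real database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "t1", data.frame(col1 = c(1, 2, 2), col2 = c("a", "b", "b")))
DBI::dbWriteTable(con, "t2", data.frame(col1 = c(1, 2), col3 = c("x", "y")))

# Build the join lazily with dbplyr
joined <- left_join(tbl(con, "t1"), tbl(con, "t2"), by = "col1")

# Render the dbplyr query to text and wrap it in custom SQL
query <- paste0("SELECT DISTINCT * FROM (",
                as.character(sql_render(joined)),
                ") AS nested_table")

# Execute through DBI; the result is an ordinary local data frame
local_table <- DBI::dbGetQuery(con, query)

DBI::dbDisconnect(con)
```

The duplicated row in t1 produces two identical rows after the join, which the outer DISTINCT then collapses.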

Simon.S.A. answered Sep 20 '22 13:09


If you want to do custom SQL processing on the result of a dbplyr operation, it may be useful to compute() first, which creates a new table (temporary or permanent) with the result set on the database. The reprex below shows how to access the name of the newly generated table if you rely on autogeneration. (Note that this relies on dbplyr internals and is subject to change without notice -- perhaps it's better to name the table explicitly.) Then, use dbGetQuery() as usual.

library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

lazy_query <-
  memdb_frame(a = 1:3) %>%
  mutate(b = a + 1) %>%
  summarize(c = sum(a * b, na.rm = TRUE))

lazy_query
#> # Source:   lazy query [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#>       c
#>   <dbl>
#> 1    20

lazy_query_computed <-
  lazy_query %>%
  compute()

lazy_query_computed
#> # Source:   table<dbplyr_002> [?? x 1]
#> # Database: sqlite 3.30.1 [:memory:]
#>       c
#>   <dbl>
#> 1    20
lazy_query_computed$ops$x
#> <IDENT> dbplyr_002

Created on 2020-01-01 by the reprex package (v0.3.0)
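Naming the table explicitly avoids touching dbplyr internals. A minimal sketch, assuming the in-memory SQLite database behind memdb_frame() (which needs RSQLite); the table name ab_tmp is made up, and the call will fail if a table of that name already exists on the connection.

```r
library(dplyr)
library(dbplyr)

lazy_query <-
  memdb_frame(a = 1:3) %>%
  mutate(b = a + 1)

# Materialise the result as a temporary table with a name we choose
computed <- lazy_query %>%
  compute(name = "ab_tmp")

# The chosen name can now be used directly in custom SQL
con <- remote_con(computed)
result <- DBI::dbGetQuery(con, "SELECT SUM(a * b) AS c FROM ab_tmp")
```

The trade-off is that compute() writes the intermediate result to the database, so it suits cases where the custom SQL is run several times or the intermediate result is expensive to recompute.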

If your SQL dialect supports CTEs (common table expressions), you could also extract the query string and use it as part of a custom SQL query, perhaps similarly to Simon's suggestion.

library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

lazy_query <-
  memdb_frame(a = 1:3) %>%
  mutate(b = a + 1) %>%
  summarize(c = sum(a * b, na.rm = TRUE))

sql <-
  lazy_query %>%
  sql_render()

cte_sql <-
  paste0(
    "WITH my_result AS (", sql, ") ",
    "SELECT c + 1 AS d FROM my_result"
  )

cte_sql
#> [1] "WITH my_result AS (SELECT SUM(`a` * `b`) AS `c`\nFROM (SELECT `a`, `a` + 1.0 AS `b`\nFROM `dbplyr_001`)) SELECT c + 1 AS d FROM my_result"

DBI::dbGetQuery(
  lazy_query$src$con,
  cte_sql
)
#>    d
#> 1 21

Created on 2020-01-01 by the reprex package (v0.3.0)

krlmlr answered Sep 20 '22 13:09